© Daniel van Flymen 2020
D. van FlymenLearn Blockchain by Building Onehttps://doi.org/10.1007/978-1-4842-5171-3_2

2. A Way to Identify Everything

Daniel van Flymen1 
(1)
New York, NY, USA
 

If you’re interested in blockchains and cryptocurrencies, then you’ve probably heard of hashing (or hashes). The idea of hashing is crypto-bedrock—a crucial ingredient in blockchain infrastructure. In this chapter you’ll gain a working knowledge of hashing and why it’s so important.

Project setup

In the spirit of learning by doing, we’re going to create a new project which we’ll build upon incrementally, chapter by chapter. We’ll call it funcoin:
$ poetry new funcoin
Hop into the funcoin folder:
$ cd funcoin
The funcoin folder should look like this:
.
├── funcoin
│  └── __init__.py
├── pyproject.toml
└── tests
  ├── __init__.py
  └── test_funcoin.py
Project Structure

This folder structure is loosely what all modern Python projects look like. It was defined by PEP (Python Enhancement Proposal) 518. Poetry enforced this folder structure when you create a new project

Let’s create a file for the purposes of learning this chapter; in the project folder funcoin, call it chapter_2.py. Your project folder should now look like this:
.
├── README.rst
├── funcoin
│  ├── __init__.py
│  └── chapter_2.py
├── pyproject.toml
└── tests
   ├── __init__.py
   └── test_funcoin.py

Hash functions

Theoretically, hashing is the act of identifying data. It’s a special way to assign a unique, random value to any data—a sentence, photograph, spreadsheet, or downloaded program. You can think of hash functions as “identification machines”—something which assigns a value to a specific input. The input is arbitrary—it could be images, documents, files, raw bytes, numbers, anything you like—but the output is always predictable for the same input.

Naming Conventions

People tend to use the term hash both as a verb and a noun: “Check out the hash of this photo” (noun), or “Let’s hash the following file” (verb). The correct term for the output of a hash function is a “digest,” but “hash” has become commonplace.

Uniqueness

The second sentence earlier claims that hashes are unique and random—this is slightly false. In this book, we’re dealing with a special class of hash functions called cryptographic hash functions where the computed hash is chosen from a sufficiently large enough pool of values such that two inputs sharing a value is highly improbable in our known universe. So, to rephrase the sentence: hashes are as-good-as-you-can-practically-get-to-unique.

Example 1: Hashing in Python

Python ships with the standard stash of popular hash functions; they are available in the hashlib library. Let’s fire up Python and start playing with them.
import hashlib
# Hash functions expect bytes as input: the encode() method turns strings to bytes
input_bytes = b"backpack"
output = hashlib.sha256(input_bytes)
# We use hexdigest() to convert bytes to hex because it's easier to read
print(output.hexdigest())
Listing 2-1

hashing_strings.py

What was the output that you got? It should’ve been:
5f00368a6ad231c3c439c4f6bc33c27014b4d35a904ff1656d74f9528636f496.

Now, try changing one of the letters in “backpack” to uppercase. Try adding a space to the end. Play around with the input each time. Have you noticed that the resulting hashes are completely different? Good. Then you’ve just discovered the avalanche property: minor changes in input yield large changes in output.

Typically, hash functions are considered cryptographic if they satisfy the following properties:
  • Deterministic: The same input always yields the same hash.

  • Intractability: It’s infeasible to find the input for a given hash except by exhaustion (trying a gargantuan amount of possible inputs).

  • Collision-safety: It’s infeasible to find two different inputs which output the same hash.

  • Avalanche effect: The smallest change in input should yield a hash so different that the new hash appears uncorrelated with the old hash.

  • Speed: It’s computationally fast to generate a hash.

Choice of Hash Function

Peer-to-peer blockchains make their choice of hash function known in their protocols: Bitcoin uses double sha256, while Ethereum uses keccak256. The important thing to know is that all of these hash functions do the same thing: they provide predictable output for a given input.

Let’s recap what we’ve learnt so far.

Term

Explanation

Example

Hash function

A function that identifies anything you feed into it by outputting a gargantuan, random hexadecimal number. This number is always the same for the same input.

A cryptographic hash function such as sha256, md5, blake2b, etc.

Hash/digest

The output of the hash function: a huge, random hexadecimal number.

The hash of the word “dan”: ec4f...f1cb

Example 2: Hashing images

Here’s an image of a $100 bill.
../images/475256_1_En_2_Chapter/475256_1_En_2_Fig1_HTML.jpg
Figure 2-1

$100 bill

The hash of this image is e25641dc52387baba19751783ae4e060.

Here’s the same image, with a small modification.
../images/475256_1_En_2_Chapter/475256_1_En_2_Fig2_HTML.jpg
Figure 2-2

$100 bill (modified)

Can you spot the modification? Maybe, maybe not. Look closer. (I made the word "STATES" singular.) But here’s the interesting part; if I run this image through a hash function, the result is f4c56f530133b8de6b3b0b39a610be32. This may not seem amazing, but if you compare the two hashes, you’ll notice that they are vastly different and seemingly random. I’ll line them up for you:
e25641dc52387baba19751783ae4e060
f4c56f530133b8de6b3b0b39a610be32
Note

The breakthrough idea behind Bitcoin is that hashes can be used to prove that work was done by a computer—we’ll be exploring this in the upcoming chapters, but for now, keep in mind that Bitcoin uses the apparent “randomness” of cryptographic hashes.

Let’s try hashing our own image (or file), in Python.
from hashlib import sha256
file = open("my_image.jpg", "rb")
hash = sha256(file.read()).hexdigest()
file.close()
print(f"The hash of my file is: {hash}")
Listing 2-2

hashing_files.py

Let’s go through the preceding code and explain it, line by line:
from hashlib import sha256

This line imports the sha256 hashing function from the hashlib library which is included with Python. sha256 is one of the most popular hashing functions (it’s used extensively in Bitcoin).

Here, we open a file called my_image.jpg residing in the same directory as your code:
file = open("my_image.jpg", "rb")
The "rb" argument denotes that the file should be in read-only mode, and read as bytes.
hash = sha256(file.read()).hexdigest()

Then we read the opened file into our hash function, like so: sha256(file.read()), and assign the hexdigest(), that is, the output as a hex string, to a variable called hash.

Finally, we close the opened file and print the hash to the terminal.

Verifying downloaded files from the Internet

You can use the above script to verify that a downloaded file has not been tampered with by a third party. Reputable websites will advertise what the hash (sometimes called a checksum) of the file should be, so that you can verify it locally.

Analogies

If you’re still in the dark, here’s a real-life analogy to drive the idea home.

Imagine you’re at the airport, passing through a security check, and you dump your backpack on the conveyor belt. But the conveyor belt is equipped with a special kind of scanner—a machine that X-rays your belongings and outputs a hash.

The backpack is scanned, and the scanner pops out an 8-digit hash identifying your backpack: 13371817. If your backpack and all the items in it are exactly the same, the scanner will always output the same number. But if you now decide to remove an item from your backpack, the scanner will return an entirely different number.

Irreversibility

As we’ve discussed, blockchains make use of cryptographic hash functions, which are hash functions that are extremely hard to reverse. Meaning, if we’re given a hash, it’s unfeasibly difficult (on a supercomputing level) to ever guess what the input was.

In the preceding example, if I gave you the hash e25641dc52387baba19751783ae4e060, you’d have an extremely difficult time guessing that it was a photograph of a $100 bill.

It’s crucial important to understand this concept, as it forms the basis of what secures Bitcoin and blockchains in general. And there are many different hash functions with different properties, as we’ll see later.

Example 3: Sending untamperable emails

The Problem

Alice wants to send Bob an email over an insecure channel like the open Internet. Bob isn’t interested if other people can read the email, but he wants to ensure that it hasn’t been tampered with. How can Bob verify that the email message hasn’t been tampered with?

Note

Before we give a solution, using the hashing knowledge you’ve just absorbed, can you think how such a scheme may be devised?

The solution

The overall idea is for Alice and Bob to agree on a shared secret beforehand. Alic then removes the secret phrase from the message, and sends it to Bob with the hash. Bob then performs the same procedure as Alice: he appends the pre-shared secret phrase and outputs a hash. If the hash isn’t the same as the one Alice sent, he’ll know that the message has been tampered with.

To summarize
  1. 1.

    Alice and Bob both share a secret phrase S.

     
  2. 2.

    Alice then creates a hash H of the message M with the secret appended to the end of the message: H = hash(M + S).

     
  3. 3.

    Alice sends H and M to Bob (the message and the computed hash).

     
  4. 4.

    Bob checks the message integrity by calculating H himself to see if it’s the same as the hash Alice advertised .

     
Note

This general scheme describes a range of protocols called HMAC (Hashed Message Authentication Code) and is described by IETF RFC 2104.

Let’s look at an example. Before Alice and Bob begin communicating, they both agree on a secret password, bolognese. Alice formulates her email using the given scheme. Here’s what it looks like when Bob receives it:
From: <Alice> [email protected]
Subject: Have you heard about Bitcoin?
Hey Bob,I think you should learn about Blockchains! I've been investing in Bitcoin and currently have exactly 12.03 BTC in my account.
hash: 71890dc61c21370874d2a7b74064396cb613a1924f09aa06925abc7842e6802c

Bob notices that Alice has included the hash in the body of the email. He strips the hash then appends the secret phrase (bolognese) to the body of the email and computes the hash of it. If the hash matches the included hash in the body of the email, then he can conclude that the message hasn’t been tampered with (and Alice does in fact have 12.03 BTC).

It’s important that Alice and Bob both know the protocol that they’ll be using for verification. This includes the hash function to be used (sha256) and where to put the secret phrase in the context of the message (they agree to append it to the end), and any other data that may further secure the scheme (such as a timestamp).

Let’s look at how Bob may verify Alice’s message in Python.
from hashlib import sha256
secret_phrase = "bolognese"
def get_hash_with_secret_phrase(input_data, secret_phrase):
    combined = input_data + secret_phrase
    return sha256(combined.encode()).hexdigest()
email_body = "Hey Bob, I think you should learn about Blockchains! "
             "I've been investing in Bitcoin and currently have exactly 12.03 BTC in my  account."
print(get_hash_with_secret_phrase(email_body,  secret_phrase))
Listing 2-3

hashing_emails.py

When Bob runs this code, he’ll be able to verify Alice’s hash.

How preventing spam led to proofs of work

Did you know that hash functions can be used to prevent email spam? The algorithm behind this idea is called a “Proof-of-Work” algorithm, and it forms the core of Bitcoin. It was invented by British cryptographer Adam Back in 1997, he called it Hashcash.

The general idea is to only accept emails whose hashes satisfy a constraint. This causes the sender of the email to perform some sort of computational work before sending an email. In other words, it becomes expensive for the sender to spam people.

But what is the constraint? Hashes are just numbers, but they are seemingly random, so we can apply any constraint we want. I could make the rule that anyone sending me an email must provide a hash of the email that is an odd number. But this wouldn’t be a good constraint because a sender would have a 50% of generating an odd hash.

The Hashcash algorithm applies the constraint that a hash value must be lower than a certain number. This sounds deceivingly simple, but think about this for a moment: since cryptographic hash functions output a random number for a given input, how could we ensure that an email body hashed to a certain number? We can’t. The only thing we can do is insert arbitrary data into the email body and keep trial-and-erroring until our hash satisfies the constraint.

Let’s look at an example, and let’s also make the constraint that the email body should hash to a number lower than 10,000.

Note

I’m using an eight-digit hash function that only outputs decimal numbers here for the sake of simplicity.

subject: Buy  Bitcoins!
Hey Ethan,I think you should buy some Bitcoins!
hash: 95119035
Listing 2-4

Should this email be accepted by our server?

No. The hash of the body of this email is 95119035, which isn’t less than 10,000. Thus, the sender has not done the work required: he must ensure that the hash of the email body satisfies the constraint.

One way the sender can do this is by inserting arbitrary data in the email body:
subject: Buy  Bitcoins!
Hey Ethan, I think you should buy some Bitcoins!
s8763ASdh727212213098
hash: 00000891

Now we see that the hash of the body is 891, satisfying our constraint! (I’ve padded the hash with 0s because hexadecimal numbers are usually shown this way). Our email server will hash the body of the email and see that it is indeed 891 and will accept the email.

This is what proof of work is: the work is proven by displaying a particular hash, and the server can easily verify it. Another way to think about this scheme is that it is something that is difficult to generate but easy to verify.

Note

This is the critical idea that allows any proof-of-work blockchain to generate (or mine) money by proving unequivocally that computational work was done to everyone on the network. The energy is consumed by exploiting the seemingly random nature of cryptographic hashes!

Summary

At this point, you should be practically familiar with hashing functions and how to hash arbitrary data. To summarize, you should be comfortable
  1. 1.

    Importing the hashlib library in Python

     
  2. 2.

    Hashing any data structure or binary file in Python

     
  3. 3.

    Understanding that cryptographic hashing functions are unfeasibly difficult to reverse

     

Still don’t get the gist of it?

That’s OK. This stuff is hard. If you’re still uncertain about hashing, it’s worthwhile to pause here and review this chapter again until it sinks in. It’s arguably the most important aspect of the blockchain data structure, and one we’ll use extensively going forward.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.119.133.228