If you’re interested in blockchains and cryptocurrencies, then you’ve probably heard of hashing (or hashes). The idea of hashing is crypto-bedrock—a crucial ingredient in blockchain infrastructure. In this chapter you’ll gain a working knowledge of hashing and why it’s so important.
Project setup
This folder structure is loosely what all modern Python projects look like. It was defined by PEP (Python Enhancement Proposal) 518. Poetry enforced this folder structure when you create a new project
Hash functions
Theoretically, hashing is the act of identifying data. It’s a special way to assign a unique, random value to any data—a sentence, photograph, spreadsheet, or downloaded program. You can think of hash functions as “identification machines”—something which assigns a value to a specific input. The input is arbitrary—it could be images, documents, files, raw bytes, numbers, anything you like—but the output is always predictable for the same input.
People tend to use the term hash both as a verb and a noun: “Check out the hash of this photo” (noun), or “Let’s hash the following file” (verb). The correct term for the output of a hash function is a “digest,” but “hash” has become commonplace.
The second sentence earlier claims that hashes are unique and random—this is slightly false. In this book, we’re dealing with a special class of hash functions called cryptographic hash functions where the computed hash is chosen from a sufficiently large enough pool of values such that two inputs sharing a value is highly improbable in our known universe. So, to rephrase the sentence: hashes are as-good-as-you-can-practically-get-to-unique.
Example 1: Hashing in Python
hashing_strings.py
Now, try changing one of the letters in “backpack” to uppercase. Try adding a space to the end. Play around with the input each time. Have you noticed that the resulting hashes are completely different? Good. Then you’ve just discovered the avalanche property: minor changes in input yield large changes in output.
Deterministic: The same input always yields the same hash.
Intractability: It’s infeasible to find the input for a given hash except by exhaustion (trying a gargantuan amount of possible inputs).
Collision-safety: It’s infeasible to find two different inputs which output the same hash.
Avalanche effect: The smallest change in input should yield a hash so different that the new hash appears uncorrelated with the old hash.
Speed: It’s computationally fast to generate a hash.
Peer-to-peer blockchains make their choice of hash function known in their protocols: Bitcoin uses double sha256, while Ethereum uses keccak256. The important thing to know is that all of these hash functions do the same thing: they provide predictable output for a given input.
Term | Explanation | Example |
---|---|---|
Hash function | A function that identifies anything you feed into it by outputting a gargantuan, random hexadecimal number. This number is always the same for the same input. | A cryptographic hash function such as sha256, md5, blake2b, etc. |
Hash/digest | The output of the hash function: a huge, random hexadecimal number. | The hash of the word “dan”: ec4f...f1cb |
Example 2: Hashing images
The hash of this image is e25641dc52387baba19751783ae4e060.
The breakthrough idea behind Bitcoin is that hashes can be used to prove that work was done by a computer—we’ll be exploring this in the upcoming chapters, but for now, keep in mind that Bitcoin uses the apparent “randomness” of cryptographic hashes.
hashing_files.py
This line imports the sha256 hashing function from the hashlib library which is included with Python. sha256 is one of the most popular hashing functions (it’s used extensively in Bitcoin).
Then we read the opened file into our hash function, like so: sha256(file.read()), and assign the hexdigest(), that is, the output as a hex string, to a variable called hash.
Finally, we close the opened file and print the hash to the terminal.
You can use the above script to verify that a downloaded file has not been tampered with by a third party. Reputable websites will advertise what the hash (sometimes called a checksum) of the file should be, so that you can verify it locally.
Analogies
If you’re still in the dark, here’s a real-life analogy to drive the idea home.
Imagine you’re at the airport, passing through a security check, and you dump your backpack on the conveyor belt. But the conveyor belt is equipped with a special kind of scanner—a machine that X-rays your belongings and outputs a hash.
The backpack is scanned, and the scanner pops out an 8-digit hash identifying your backpack: 13371817. If your backpack and all the items in it are exactly the same, the scanner will always output the same number. But if you now decide to remove an item from your backpack, the scanner will return an entirely different number.
Irreversibility
As we’ve discussed, blockchains make use of cryptographic hash functions, which are hash functions that are extremely hard to reverse. Meaning, if we’re given a hash, it’s unfeasibly difficult (on a supercomputing level) to ever guess what the input was.
In the preceding example, if I gave you the hash e25641dc52387baba19751783ae4e060, you’d have an extremely difficult time guessing that it was a photograph of a $100 bill.
It’s crucial important to understand this concept, as it forms the basis of what secures Bitcoin and blockchains in general. And there are many different hash functions with different properties, as we’ll see later.
Example 3: Sending untamperable emails
The Problem
Alice wants to send Bob an email over an insecure channel like the open Internet. Bob isn’t interested if other people can read the email, but he wants to ensure that it hasn’t been tampered with. How can Bob verify that the email message hasn’t been tampered with?
Before we give a solution, using the hashing knowledge you’ve just absorbed, can you think how such a scheme may be devised?
The solution
The overall idea is for Alice and Bob to agree on a shared secret beforehand. Alic then removes the secret phrase from the message, and sends it to Bob with the hash. Bob then performs the same procedure as Alice: he appends the pre-shared secret phrase and outputs a hash. If the hash isn’t the same as the one Alice sent, he’ll know that the message has been tampered with.
- 1.
Alice and Bob both share a secret phrase S.
- 2.
Alice then creates a hash H of the message M with the secret appended to the end of the message: H = hash(M + S).
- 3.
Alice sends H and M to Bob (the message and the computed hash).
- 4.
Bob checks the message integrity by calculating H himself to see if it’s the same as the hash Alice advertised .
This general scheme describes a range of protocols called HMAC (Hashed Message Authentication Code) and is described by IETF RFC 2104.
Bob notices that Alice has included the hash in the body of the email. He strips the hash then appends the secret phrase (bolognese) to the body of the email and computes the hash of it. If the hash matches the included hash in the body of the email, then he can conclude that the message hasn’t been tampered with (and Alice does in fact have 12.03 BTC).
It’s important that Alice and Bob both know the protocol that they’ll be using for verification. This includes the hash function to be used (sha256) and where to put the secret phrase in the context of the message (they agree to append it to the end), and any other data that may further secure the scheme (such as a timestamp).
hashing_emails.py
When Bob runs this code, he’ll be able to verify Alice’s hash.
How preventing spam led to proofs of work
Did you know that hash functions can be used to prevent email spam? The algorithm behind this idea is called a “Proof-of-Work” algorithm, and it forms the core of Bitcoin. It was invented by British cryptographer Adam Back in 1997, he called it Hashcash.
The general idea is to only accept emails whose hashes satisfy a constraint. This causes the sender of the email to perform some sort of computational work before sending an email. In other words, it becomes expensive for the sender to spam people.
But what is the constraint? Hashes are just numbers, but they are seemingly random, so we can apply any constraint we want. I could make the rule that anyone sending me an email must provide a hash of the email that is an odd number. But this wouldn’t be a good constraint because a sender would have a 50% of generating an odd hash.
The Hashcash algorithm applies the constraint that a hash value must be lower than a certain number. This sounds deceivingly simple, but think about this for a moment: since cryptographic hash functions output a random number for a given input, how could we ensure that an email body hashed to a certain number? We can’t. The only thing we can do is insert arbitrary data into the email body and keep trial-and-erroring until our hash satisfies the constraint.
Let’s look at an example, and let’s also make the constraint that the email body should hash to a number lower than 10,000.
I’m using an eight-digit hash function that only outputs decimal numbers here for the sake of simplicity.
Should this email be accepted by our server?
No. The hash of the body of this email is 95119035, which isn’t less than 10,000. Thus, the sender has not done the work required: he must ensure that the hash of the email body satisfies the constraint.
Now we see that the hash of the body is 891, satisfying our constraint! (I’ve padded the hash with 0s because hexadecimal numbers are usually shown this way). Our email server will hash the body of the email and see that it is indeed 891 and will accept the email.
This is what proof of work is: the work is proven by displaying a particular hash, and the server can easily verify it. Another way to think about this scheme is that it is something that is difficult to generate but easy to verify.
This is the critical idea that allows any proof-of-work blockchain to generate (or mine) money by proving unequivocally that computational work was done to everyone on the network. The energy is consumed by exploiting the seemingly random nature of cryptographic hashes!
Summary
- 1.
Importing the hashlib library in Python
- 2.
Hashing any data structure or binary file in Python
- 3.
Understanding that cryptographic hashing functions are unfeasibly difficult to reverse
Still don’t get the gist of it?
That’s OK. This stuff is hard. If you’re still uncertain about hashing, it’s worthwhile to pause here and review this chapter again until it sinks in. It’s arguably the most important aspect of the blockchain data structure, and one we’ll use extensively going forward.