InfoSec Basics - Hashing

RG
Aug 23, 2020

Every field has basic concepts that are essential to a real understanding of the field. In the context of Information Security, “hashing” is one of those concepts. Some technology people throw the term around on the assumption that “everyone” knows what it means, even when they don’t fully understand it themselves.

As is often the case, people spend whole careers on the details and nuances of topics like this, but the high-level concept of hashing is not really that hard.

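A hash function is a computer algorithm that converts a data block of arbitrary size to a value of fixed length. In other words, a “hashing function” is a program which converts a “chunk” of data, however large, to a “fingerprint” or “hash” value, which is limited to a specific length. Because there are far more possible inputs than outputs, the value cannot be strictly unique, but with a good hash function it is unique for all practical purposes.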

As an example, using an old/obsolete hash known as MD5, here is an illustration using the word “Chicken” (using https://passwordsgenerator.net/md5-hash-generator/):

NOTE: I am using MD5 for convenience, since most later hash functions generate longer output strings. (By the way, “MD” is short for “Message Digest”)
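For readers who would rather reproduce this locally than rely on a website, here is a minimal sketch using Python's standard hashlib module. The helper name md5_hex is my own, and the UTF-8 encoding is an assumption (a hash function works on bytes, so the text has to be encoded somehow).

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` as a hexadecimal string."""
    # Hash functions operate on bytes, so the string is encoded first.
    # UTF-8 is an assumption; a different encoding can give a different hash.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Three inputs of very different sizes.
for word in ("Chicken", "chicken", "Chicken " * 1000):
    digest = md5_hex(word)
    # The digest is always 32 hex characters (128 bits), whatever the input size.
    print(f"{len(word):>5} chars in -> {len(digest)} hex chars out: {digest}")
```

Whatever the length of the input, the output is 32 hexadecimal characters, which is the "fixed length" part of the definition.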

When Information Security people discuss hashing, we are usually referring to a “cryptographic hash function” (https://en.wikipedia.org/wiki/Cryptographic_hash_function). It’s still a hash function, but not all hash functions are cryptographically “secure”. The main properties of a cryptographic hash function include:

Deterministic – The hash function must always generate the same output for a given input
One-way – It should be “computationally infeasible” to find the input value that generates a given hash. (In practice, this means that the only way to find the original value is to hash all possible input values until you get the one you want, which should take a very, very long time.)
Collision-resistant – It should be “computationally infeasible” to find two different messages which generate the same hash value
Avalanche effect – Small changes to the input should generate large changes in the output. In the example above, see how the values “chicken” and “Chicken” produce radically different hash values.
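The first and last properties are easy to check for yourself. The sketch below is only an illustration: it uses SHA-256 from Python's hashlib rather than MD5, and the helper names sha256_bytes and differing_bits are my own. In ASCII, "c" and "C" differ by a single bit, so the two inputs are about as close as two inputs can be.

```python
import hashlib


def sha256_bytes(text: str) -> bytes:
    """Return the raw SHA-256 digest of `text` (UTF-8 encoding assumed)."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


lower = sha256_bytes("chicken")
upper = sha256_bytes("Chicken")

# Deterministic: hashing the same input again gives the same digest.
assert sha256_bytes("chicken") == lower

# Avalanche effect: a one-bit change in the input changes many output bits.
changed = differing_bits(lower, upper)
print(f"{changed} of {len(lower) * 8} output bits differ")
```

For a well-behaved hash, you would expect roughly half of the output bits to differ.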

What does all this mean, though? Ignoring all the math-speak, a hash is a way to generate a fixed-length “fingerprint” of a block of data (generally a file or string) that is unique for all practical purposes, where you can go from the data to the hash, but can’t go back from the hash to the data.

John Oliver (Last Week Tonight, Season 5, Episode 4), in his episode on Blockchain, includes a “really helpful, really dumb metaphor” to describe why hashing is secure (https://www.youtube.com/watch?v=g6iDZspbRMg – starting at about 7:50).

So, now that we have an idea of what a hash is (including a mental image which may never fade...), what’s it for?

One common use for hashing is to generate a “fingerprint” of a file, in order to ensure that the file has not been altered along the way. If I am provided a file, and the hash of that file, I can download the file, hash it myself, then compare “my” hash with the “original” hash. If the two hash values match, I can be confident that the file is “valid”, i.e., that it is the same as the source file.

That explains why we want the hashing algorithm to have the properties noted above, i.e., that it is deterministic, one-way, collision-resistant, and shows the “avalanche effect”. With those properties, a matching hash tells us, to a very high degree of confidence, that the two files are identical.
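As a rough sketch of that download check, the code below hashes a file in chunks (so a large file does not have to fit in memory) and compares the result with a published value. The function names, the file name download.bin, and the choice of SHA-256 are all assumptions made for illustration; in real life you would use whichever hash function the publisher used.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file chunk by chunk and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """Compare "my" hash of the file with the "original" published hash."""
    # Normalise case and whitespace, since published hashes vary in formatting.
    return file_sha256(path) == published_hex.strip().lower()


# Demo with a throwaway file standing in for a download.
with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "download.bin"
    download.write_bytes(b"cluck " * 100_000)
    published = file_sha256(download)  # stands in for the hash on the website

    ok_before = matches_published_hash(download, published)

    # Overwrite the first byte to simulate the file being altered along the way.
    with download.open("r+b") as f:
        f.write(b"C")
    ok_after = matches_published_hash(download, published)

print(ok_before, ok_after)  # True False
```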

Now that we have a general idea of what a hash is, and what it’s for, why are there so many different hash functions? This is where things can get very complex, very quickly, as different hash algorithms have been developed over the years, for different reasons.

The table below shows a few key bits of information on a few common cryptographic hashes. (https://en.wikipedia.org/wiki/Cryptographic_hash_function)
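One property you can check for yourself is the output length of each function. This sketch simply asks Python's hashlib for the digest size of a handful of algorithms; the selection of names is my own and is not meant to reproduce the table.

```python
import hashlib

# A few algorithm names that hashlib supports on standard Python 3 builds.
names = ("md5", "sha1", "sha256", "sha512", "sha3_256")

# digest_size is reported in bytes, so multiply by 8 to get bits.
output_bits = {name: hashlib.new(name).digest_size * 8 for name in names}

for name, bits in output_bits.items():
    print(f"{name:<9} {bits:>4} bits")
```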

In practice, the main reason a cryptographic hash goes “out of style” is that it is “broken”, which either means that computer power has increased to the point that so-called “brute force attacks” (i.e., hash every possible value until you get the hash you want) become viable, or that a mathematical flaw is found in the algorithm, which makes it easier to break than was thought.
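To make "brute force" concrete, here is a toy sketch that recovers a short lowercase word from its MD5 hash by trying every candidate in turn. The function name brute_force_md5, the lowercase-only alphabet, and the four-letter limit are all assumptions chosen to keep the demo fast; this is not how real cracking tools are built.

```python
import hashlib
import itertools
import string
from typing import Optional


def brute_force_md5(target_hex: str, max_len: int = 4) -> Optional[str]:
    """Try every lowercase string up to max_len letters until one matches."""
    for length in range(1, max_len + 1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if hashlib.md5(candidate.encode("utf-8")).hexdigest() == target_hex:
                return candidate
    return None  # nothing in the search space matched


# Pretend we only know the hash, not the word behind it.
target = hashlib.md5(b"hen").hexdigest()
recovered = brute_force_md5(target)
print(recovered)  # hen
```

The catch is how fast the search space grows: there are 26^4 = 456,976 four-letter lowercase strings, but 26^8 is over 200 billion, and real inputs are not limited to lowercase letters.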

As noted, this is a small subset of a very large number of cryptographic hashes, used by various groups for various different reasons. A full treatment would be more along the lines of a PhD thesis than a blog post, so I’ll leave it at this summary level.

One interesting note: While the name may suggest otherwise, SHA-3 is not based on SHA-1 or SHA-2, but is instead a different algorithm. The “SHA” designation merely means that the US government selected the algorithm as one of its standards – the “Secure Hash Algorithms” (https://en.wikipedia.org/wiki/Secure_Hash_Algorithms).

At this point, I’d like to introduce the world-famous (in the field of cryptography, at any rate) Alice and Bob (https://en.wikipedia.org/wiki/Alice_and_Bob). They are best known as the characters used in cryptographic thought-experiments to illustrate concepts. (It should be noted that they are not necessarily human – they could be computers, different programs on a single computer, or even artificial intelligences.)

To illustrate hashing, consider that Alice has a large file – let’s call it chicken.txt – that she wants to send to Bob. On the first try, Bob can’t read the file – somehow, it’s gotten corrupted. Alice can try to resend the file, but Bob wants to be sure that the new file hasn’t also been corrupted, so Alice generates a hash of the file using MD5 (no longer secure, but no need to worry about that here) and saves it in a second file – let’s call it chicken_hash.txt.

Alice sends Bob the file chicken.txt, then the hash file chicken_hash.txt (maybe by text, or phone, or snail-mail – I may discuss alternate transmission channels at some future date...).

When Bob gets the file, he uses the same hash function (that’s vital) to generate a hash of the file he has received, then compares it with the file chicken_hash.txt. If the two match, he can be confident that THIS time, the file came through correctly, since any (even single-bit) difference in chicken.txt would have resulted in a totally different hash value.
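The whole exchange can be sketched in a few lines of Python. To keep it self-contained, the "files" here are just byte strings in memory, the file contents are made up, and the helper names md5_of and bob_accepts are my own. MD5 is used only because the story above uses it.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hex string."""
    return hashlib.md5(data).hexdigest()


def bob_accepts(received: bytes, received_hash: str) -> bool:
    """Bob re-hashes what he received, using the same function Alice used."""
    return md5_of(received) == received_hash


# Alice's side: the file and its hash (stand-ins for chicken.txt and chicken_hash.txt).
chicken_txt = b"Why did the chicken cross the road?\n" * 1000
chicken_hash_txt = md5_of(chicken_txt)

# A clean transfer: Bob receives exactly what Alice sent.
good = bob_accepts(chicken_txt, chicken_hash_txt)

# A corrupted transfer: flip a single bit in the first byte.
corrupted = bytearray(chicken_txt)
corrupted[0] ^= 0b00000001
bad = bob_accepts(bytes(corrupted), chicken_hash_txt)

print(good, bad)  # True False
```

Note that this only catches accidental corruption. If someone could tamper with both the file and the hash, Bob would be none the wiser, which is why the hash is better sent by a separate channel.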

I think one of the biggest problems with hashing is that people throw the word around without explaining it, sometimes without even understanding what it means. While the details can get extremely complex, the basic concept is not too hard – it just takes a bit of time to get used to.

Cheers!

ADDENDUM

When I tried to find the John Oliver clip on cryptocurrency, I found a post criticizing the episode and describing all of the things John Oliver “gets wrong” (https://blog.erratasec.com/2018/03/what-john-oliver-gets-wrong-about.html). I was astonished by how truly, profoundly, hilariously the author missed the point of Mr. Oliver’s roughly 10-minute COMEDY bit, which focused mainly on some of the outrageous claims made by some of the more colourful members of the cryptocurrency community. If you can believe it, the criticism started with the statement that any discussion of Bitcoin “should always start with Satoshi Nakamoto’s original paper.”

If Mr. Oliver had started there, he’d most likely have lost his COMEDY audience in the first sentence. At no point did Mr. Oliver suggest he was going to give a comprehensive treatment of either the history or technology behind cryptocurrencies in general, or Bitcoin in particular. (Did I mention that John Oliver’s piece was COMEDY?)

All of that said, I would actually be very interested to hear Mr. Oliver do a follow-up piece on Satoshi Nakamoto (https://en.wikipedia.org/wiki/Satoshi_Nakamoto). Quite an interesting character, actually, as “Satoshi Nakamoto” is almost certainly a pseudonym, and may well be a composite identity for a number of people. I’m sure Mr. Oliver and his team could produce a very interesting bit on this person who may not be a (single) person. (Pinocchio references come most immediately to mind...)

Cheers!