Let me invent a simple "password hashing algorithm" to show you how it works. Unlike the other examples in this thread, this one is actually viable, if you can live with a few bizarre password restrictions. Your password is two large prime numbers, x and y. For example:
x = 48112959837082048697
y = 54673257461630679457
You can easily write a computer program to calculate xy in O(N^2) time, where N is the number of digits in x and y. (Basically that means that it takes four times as long if the numbers are twice as long. There are faster algorithms, but that's irrelevant.) Store xy in the password database.
x*y = 2630492240413883318777134293253671517529
A child in fifth grade, given enough scratch paper, could figure out that answer. But how do you reverse it? There are many algorithms people have devised for factoring large numbers, but even the best algorithms are slow compared to how quickly you can multiply x by y. And none of those algorithms could be performed by a fifth grader, unless the numbers were very small (e.g., x=3, y=5).
That is the key property: the computation is much simpler going forwards than backwards. For many problems, you must invent a completely new algorithm to reverse a computation.
This has nothing to do with injective or bijective functions. When you are cracking a password, it often doesn't matter if you get the same password or if you get a different password with the same hash. The hash function is designed so it is hard to reverse it and get any answer at all, even a different password with the same hash. In crypto-speak: a hash function vulnerable to a preimage attack is utterly worthless. (The password hashing algorithm above is injective if you have a rule that x < y.)
What do cryptography experts do? Sometimes, they try to figure out new algorithms to reverse a hash function (pre-image). They do exactly what you say: analyze the algorithm and try to reverse it. Some algorithms have been reversed before, others have not.
Exercise for the reader: Suppose the password database contains the following entry:
What is the password? (This one is actually not too difficult for a computer.)
Footnote: Due to the small number of passwords that people choose in practice, a good password hash is not merely difficult to compute backwards but also time-consuming to compute forwards, to slow down dictionary attacks. As another layer of protection, randomized salt prevents the use of precomputed attack tables (such as "rainbow tables").
Footnote 2: How do we know that it is hard to reverse a hash function? Unfortunately, we don't. We just don't know any easy ways to reverse hash functions. Making a hash function that is provably difficult to reverse is the holy grail of hash function design, and it has not been achieved yet (perhaps it will never happen).
Now THAT is a good question.
We must first give a precision: many one-way functions, in particular hash function as commonly used in cryptography, accept inputs from a space which is much larger than the space of output values. For instance, SHA-256 is defined for inputs which are strings of up to 18446744073709551615 bits; there are 218446744073709551616-1 possible inputs, but since the output is always a sequence of 256 bits, there are only 2256 possible outputs for SHA-256. Necessarily, some distinct inputs yield the same output. Therefore, for a given output of SHA-256, it is not possible to unambiguously recover the input which was used, but, possibly, it might be possible to compute an input which yields the given output value. Preimage resistance is about that: the difficulty of finding a matching input for an output (regardless of how that output was obtained in the first place).
So we talk about a function that everybody can compute over any input (using a publicly known program, no secret value involved -- we are not talking about encryption).
What academics say
It is unclear whether one-way functions can actually exist. Right now, we have many functions that no one knows how to invert; but this does not mean that they are impossible to invert, in a mathematical sense. Note, though, that it is not proven that one-way functions cannot exist, so hope remains. Some people suspect that whether one-way functions may exist or not could be one of these irksome mathematical assertions which can be neither proven nor disproved (Gödel's theorem proves that such things must exist). But there is no proof of that either.
Therefore, there is no proof that any given hash function is really resistant to preimages.
There are some functions which can be linked to well-known hard problems. For instance, if n is the product of two big primes, then the function x ⟼ x2 mod n is hard to invert: being able to compute square roots modulo a non-prime integer n (on a general basis) is equivalent to being able to factor n, and that problem is known to be hard. Not proven to be hard, mind you; only that mathematicians have tried to efficiently factor big integers for (at least) the last 2500 years, and although some progress has been made, none of these smart people found a really killer algorithm for that. World record for factorization of a "RSA modulus" (a product of two randomly chosen big primes of similar lengths) is a 768-bit integer.
Some hash functions based on such "hard problems" have been proposed; see for instance MASH-1 and MASH-2 (on the RSA problem) and ECOH (with elliptic curves). Only a few such functions exist, because:
- Turning a "hard problem" into a secure hash function is not easy; there are lots of tricky issues. For instance, while extracting square roots modulo a non-prime n is usually hard, there are values for which square root extraction is easy.
- The performance of such hash functions tends to be, let's say, suboptimal. Like being 100x slower than a more commonly used SHA-1.
The more "standard" way of building a hash function is to get cryptographers together and have them gnaw at some proposed designs; the functions which survive cryptanalytic attempts for a few years are then considered "probably robust". The SHA-3 competition is such an effort; the winner should be announced later this year. On the 51 candidates (the ones who succeeded the administrative step), 14 were retained for "round 2" and these 14 have been relatively closely looked at by many cryptographers, and none of them found anything really worth saying about the functions. The list has been reduced to 5 and will be further reduced to 1 "soon", but not for security reasons (most of the actual data was about performance, not resistance).
What makes MD5 hard to invert
Since we do not know how to prove that a function is hard to invert, the best we can do is to give it a try on a specific function, so as to get an "intuition" of how the function achieves its apparent resistance.
I choose MD5, which is well known. Yes, MD5 is "broken", but that's for collisions, not preimages. There is a known preimage attack which is, at least theoretically, faster than the generic way (the "generic way" is "luck", i.e. trying inputs until a match is found, for an average cost of 2128 evaluations since MD5 has a 128-bit output; the Sasaki-Aoki attack has cost 2123.4, which is lower, but still way too high to be actually tried, so the result is still theoretical). But MD5 is relatively simple and has withstood attacks for quite some time, so it is an interesting example.
MD5 consists in a number of evaluations of a "compression function" over data blocks. The input message is first padded, so that its length becomes a multiple of 512 bits. It is then split into 512-bit blocks. A 128-bit running state (held in four 32-bit variables called A, B, C and D) is initialized to a conventional value, then processed with the compression function. The compression function takes the running state and one 512-bit message block, and mixes them into a new value for the running state. When all message blocks have been thus processed, the final value of the running state is the hash output.
So let's concentrate on the compression function. It works like this:
- Inputs: the running state (A B C D) and a message block M. The message block is 512 bits; we split it into 16 32-bit words M0, M1, M2,... M15.
- Output: the new running state value.
- Save the current state in some variables: A → A', B → B', C → C' and D → D'
- Do 64 rounds which look like this:
- Compute T = B + ((A + fi(B, C, D) + Mk + Xi) <<< si). This reads like this: we compute a given function fi (a simple bitwise function, which depends on the round number i) over B, C, and D. Add to that the value of A, one message word Mk and a constant Xi (additions are done modulo 232). Rotate the result to the left by some bits (the shift amount also depends on the round). Finally, add B: the result is T.
- Rotate the state words: D → A, C → D, B → C, T → B.
- Add the saved state values to the current state variables: A + A' → A, B + B' → B, C + C' → C, D + D' → D.
The important point is that there are 64 rounds, but only 16 message words. This means that each message word enters the processing four times. I write that in bold because it is the central point; resistance to preimages comes from that characteristic. Which message word is used in each round is described in the MD5 specification (RFC 1321); the specification also describes the functions fi, the rotate counts si and the 32-bit constants Xi.
Now suppose that you are trying to "invert" MD5; you begin from the output and work slowly up the compression function. First, you must decide the output of round 64. Indeed, the output of the compression function is the sum of the output of round 64, and the saved state (the A' B' C' D' values). You have neither, so you must choose. Your hope is that you will be able to find values for the message words which will allow you to obtain for input of round 1 some values which are coherent with your arbitrary decision on A' and its brothers.
Let's see how things look when you walk the compression function backward. You have the output of a round (the variables A, B, C and D after the round) and you want to recompute the input of that round. You already know the previous values of B, C and D, but for A and Mk you have plenty of choice: each 32-bit value is possible for A, and each has a corresponding Mk. At first, you are glad of that; who would spurn such freedom ? Just choose a random Mk, and this yields the corresponding A with just a few operations (try it !).
But after you have inverted that way 16 rounds (the rounds 49 to 64, since you are working backwards), freedom disappears. You have "chosen" the values of all the message words. When trying to invert round 48, you want to recompute the value of A just before that round; as per the MD5 specification, message word M2 is used in round 48, and you have already chosen the value of M2 (when inverting round 63). So there is only one choice for A. So what, would you say. One choice is sufficient to continue the backward walk. So you continue.
Now, you are at the beginning of the compression function. Remember that, initially, you made an arbitrary choice of the values of A' B' C' D': this allowed you to compute the output of round 64, and begin the backward walk. Now you have obtained the input of round 1, which should be identical to A' B' C' D'... and it does not match. That's quite normal: you chose A' B' C' D' arbitrarily, and you also chose the message words Mk arbitrarily, so it can be expected that it won't work most of the time. So you try to repair the computation, by retrospectively altering either your initial choice of A' B' C' D', or one or several of the random choices for Mk. But each modification on any Mk implies modifications elsewhere, because each Mk is used four times. So you need other modifications to cancel out the other ones, and so on...
At that point you begin to understand the problem of inverting MD5: every time you touch a single bit, it triggers an awful lot of modifications throughout the algorithm, which you need to cancel out by touching other bits, and there are just too many interactions. Basically, you juggle with 2128 balls at the same time, and that's way too much to keep track of all of them.
If each message block was 2048-bit long, split into 64 words, and each message word was used only once in MD5, then you could invert it easily. You do as above: arbitrary selection of A' B' C' D', arbitrary selection of message words for rounds 64 to 5; and for the first four rounds, you just consider the value you wish to obtain for the round input (the value which matches your arbitrary choice of A', B', C' or D') and work out the corresponding message word. Easy as pie. But MD5 does not process data by 2048-bit blocks, but by 512-bit blocks, and each message word is used four times.
Some additional twists
The structure of the compression function of MD5 is actually a generalization of a Feistel cipher. In a Feistel cipher, the data is split into two halves, and, for each round, we alter one half by adding/xoring it to an intermediate value which is computed from the other half and from the key; and then we swap the two halves. Extend this scheme to a four-parts split, and you get the same structure than the MD5 rounds -- with a 90º rotate: MD5 looks like the encryption of the current state using the message block as key (and there is the extra addition of the output of round 64 with the saved state, which departs MD5 from a rotated cipher).
So maybe we can build hash functions out of block ciphers ? Indeed we can: that's what Whirlpool is about. A hash function built over a rotated block cipher (the message block is the key); the block cipher of Whirlpool is "W", a derivative of Rijndael, better known as the AES. But W has bigger blocks (512 bits instead of 128 bits) and a reforged key schedule.
When you make a hash function out of a rotated block cipher, then preimage attacks on the hash function are somewhat equivalent to key reconstruction attacks on the block cipher; so there is some hope that if the block cipher is secure, then so is the hash function. There again, there are snarky details. Also, for such a structure, collisions on the hash function are like related-key attacks on the block cipher; related-key attacks are usually considered non fatal, and often ignored (for instance, they were not part of the evaluation criteria for the AES competition, and Rijndael is reputed a bit flaky in that respect, which is why W has a brand new key schedule).
Some newer designs are built over a block cipher which is not rotated, so that security of the hash function can be derived more directly from security of the block cipher; see for instance the SHA-3 candidate Skein, defined over a block cipher called Threefish.
Conversely, one could try to make a block cipher out of a hash function. See for instance SHACAL, which is SHA-1 "set upright". And, on cue, SHACAL has some related-key weaknesses which are quite similar to the known weaknesses of SHA-1 with regards to collisions (no actual collision was computed, but we have a method which should be almost a million times faster than the generic collision-finding algorithm).
Therefore, contrary to what I said in the introduction of this post, we have been talking encryption all along. There is still much to be discovered and studied about the links between hash functions and symmetric encryption.
there is no TL;DR for this message. Read it whole, or begone.
The first step to to the answer here is seeing examples, like the nice one from @Dietrich, of functions that are much harder to run in one direction than the inverse, and have resisted many attempts to find a speed breakthrough. But the problem is complex, so I'll try to flesh it out some more.
Lots of people seem to be falling into the trap (heh) of thinking that hash functions are actually somehow magical - that they really are absolute "one way functions" that mathematically can't be run backwards at all, just because they are called hashes. This is not a healthy way to think about it in a security forum. It is often wrong in practice. And it is always wrong in theory, given the basic mathematical definition of a function as a mapping from a domain to an image.
All hashes can be reversed, in principle. It may be messy and brutish (as in brute-force), it may take a impractically long time with today's hardware, and it may even hold up over the long haul, but mathematically it is simply a matter of time. As @mucker noted, all the information is there to find the original password. If we forget that, we forget the danger of clever heuristics for cherry-picking likely passwords, which make the news regularly. Hashing is an engineering problem and the primary challenge is an efficiency one - how to make it expensive to find the password given the hash. One of the principle results of that kind of thinking is the importance of making password hashes slow
And the science and mathematics of hashing is only slowly getting better. There really aren't any proofs that any hashes really are hard. @Dietrich's answer is a nice way of illustrating how ideal hash functions might be possible. But just look at the real experts describing how we don't have proofs for any of the best crypto algorithms: What's the mathematical model behind the security claims of symmetric ciphers and digest algorithms?
The fact that LanMan was cited in the question is yet more evidence that we need to avoid idealizing hashes. LanMan is anything but an ideal hash function, easily defeated by a combination of a bit of analysis and a bit of brute forcing. For another popular example of a horrid hash function see MySQL OLD_PASSWORD cryptanalysis?.
So get yourself back out of the trap - falling into it needn't be a one-way trip. Recognize that hashes are reversible, and keep that trusty security-mindset active as you look for the best way to reverse them. That's often the best way to find ones that really are hard to reverse. I'm not trying to cast aspersions on the best practices out there, like bcrypt or PBKDF2 or scrypt. But the evidence is clear that even good programmers get this stuff wrong all too often. so be careful with how you use them and don't try to invent your own.
Because that's how Cryptographic Hash Functions work, they are one-way (from plain to hash) mathematical functions. Algorithms are made and tested specifically to avoid that, and also avoid collisions (2 different plain texts generate the same hash).
You can read more on wikipedia, but the main point of the article is:
The ideal cryptographic hash function has four main or significant properties:
- it is easy (but not necessarily quick) to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash
Most of the attacks on hash functions are based on finding collisions (so 2 different plain texts will match the same hash) or pre-generating millions of hashes and comparing them until you find the plain that generated it.
Long history short: if a hash algorithm is reverse-engineerable or can be attacked that way, it's not a good hash algorithm.
For passwords, investigating using BCrypt, this post has a great deal of info on it.
Imagine a hash function that uses a single bit for the hash. So your hash can either be 0 or 1.
And let's say the hash function adds up every byte of data and if the data was even, the hash value is 0. If the data was odd, the hash is 1.
Do you see why you couldn't recover your data by reverse engineering that hash function?
It's the same for actual hash algorithms, only the formulas are significantly better than the function I just described.
Your difficulty may be that you are considering hash as far as their use for passwords. It's not obvious why you cannot recover a 8 character password from a 128 bit hash. But that hash function you use for passwords can also be used to calculate the hash of an entire terabyte of data, and the hash will still take only 128 bits of data. Obviously, you cannot reverse engineer that 128 bit hash and recover your terabyte of data.
Also, assuming you had every possibly permutation of a single terabyte of data, there would be a huge amount of different datas that generate the same hash. After all, if you have more than 2^127 different permutations of data, you become likely to encounter two different datas that have the same hash.
There are algorithms that are inherently not reversible; they change an input A into an output B in such a way that even if you know the exact steps of the algorithm, you can't recover A from B.
A very simple example: convert each character in the password to its ASCII value and sum all the values. There is no way you can recover the original password from the result.
There is one aspect of the problem that people are missing in the previous answers. That is the many-to-one nature of hash functions. Since (most) hash functions are fixed length output (e.g., 256 bit), technically there are infinitely many strings that all hash to the same value.
For example, if you take all all 512 bit strings (of which there are 2^512). There are only 2^256 outputs of the hash function. Thus, for each output of the hash function, there are roughly 2^256 512 bit strings which hash to that value. I say roughly because we don't know if the hash function actually is a random function, it could have slight biases.
Thus, given a digest, there are many strings which hash to the same value. Therefore, if you define "reversing a hash function" as outputting the users password, how is your reversing function going to deal with the potentially infinite number of strings that result in the given digest?
You're asking "why is it important that hash functions be one-way?" It's a security property.
There are two kinds of "hash" (or "message digest" as they're called) in common use today. One is a plain message digest, which you may be familiar with as a checksum algorithm, such as CRC32. The algorithm is designed so that a single bit change in the input will yield a different digest value. The primary purpose of this is to ensure that a message is not been corrupted by accident. CRC32 checksums are present on every TCP/IP packet, and a mis-match results in retransmission to correct the error.
Message digests are often used in cryptography as part of "signing" a message. The message is encrypted by the sender with his private key, and anyone can use the public key to validate that it was encrypted by only the sender. But RSA public key cryptography can only encrypt messages smaller than the key size (256 bytes), which are much shorter than most useful messages. Message digest algorithms produce values smaller than RSA keys. So by encrypting the digest instead of the message, RSA signatures can be used on any size message.
But an ordinary message digest is not secure against an attacker. Consider a very simple checksum that just sums the values of the characters. If you signed such a checksum, I could swap out any other message that yields the same checksum, and the signatures would match, fooling the victim.
Another common use for message digests is password protection during storage. If you encrypt the passwords before storing them in the system, a system administrator who knows the key could decrypt them all. (You may have noticed this problem recently when some web sites were hacked.)
To avoid these problems, a different kind of hash is needed, one that is "cryptographically secure." A secure hash algorithm has two additional properties, collision resistance, and non-reversibility.
Collision resistance means that I should not be able to find a message that produces the same digest. That way I can't swap my evil message for your good message.
Non-reversibility property means that I can't turn a digest back into a plaintext so I can't decrypt the original message, like the user's password.
Creating a digest is a very similar problem to encryption, in that you have to scramble the data in such a way that it leaks no information about the original data. It's even harder, because the same math has to not give any clues about how to successfully create a collision.
Others have explained why good cryptographic hash functions are difficult to reverse - but according to this Wikipedia article, LanMan is poorly designed and can be reversed relatively easily:
Although it is based on DES, a well-studied block cipher, the LM hash is not a true one-way function as the password can be determined from the hash because of several weaknesses in its implementation... By mounting a brute force attack on each half separately, modern desktop machines can crack alphanumeric LM hashes in a few hours... In 2003, Ophcrack, an implementation of the rainbow table technique, was published. It specifically targets the weaknesses of LM encryption, and includes pre-computed data sufficient to crack virtually all alphanumeric LM hashes in a few seconds.
I think there are many reasons, but one is obvious: a digest produced by a hash function can never contain infinite information, since the digest has finite bits. But the hash function can be used to hash inputs of infinite information. The input can actually be anything.
The hardship to find out a collision is not the answer. The real hardship is proving your original data is actually the only possible input that matches a certain digest. I think you may never calculate one input and claim it is the only answer to the digest.
Reversing a mod hash is simple. Ex:- (bytes summatory)
mod (d) = hash. So if you want to generate all the inputs for a hash is
int bytes summatory = int n*int d + int hash what about that?
Ff is the XOR between two blocks its simple, say the bit is one, or
block 1 = 0 and
block 2 = 1, or
block 1 = 1 and
block 2 = 0. Say the bit is 0 or
(b1=0 ^ b2=0) or
(b1=1 ^ b2=1). These are correct answers for the same output.