Elliot Kember started a quest to find a string that is it's own MD5 hash: The Kember Identity Hash.
There are 1632 (340,282,366,920,938,463,463,374,607,431,768,211,456) potential results for a hex-encoded MD5 hash. For example:
md5('abc') = '900150983cd24fb0d6963f7d28e17f72'
md5('def') = '4ed9407630eb1000c0f6b63842defa7d'
Searching through all the potential hashes is a huge problem because of the gamut. Of course, the size of that gamut is exactly the point of a hash: the number of collisions (two inputs with the same hash) is virtually zero.
Unfortunately, he oversimplified and left one ambiguity regarding case. However, based on his examples, I've assumed he's referring to an all-lowercase string.
So far I've checked about 28,571,075,396,407 hashes to see if they match their source. That's around 8.39x10-24% coverage.
To get a handle on how tiny a percentage this is, visualize a progress bar for the search. It should be about 32,025,260 light years wide. The Milky Way, just for reference, is only around 100,000 light years wide, and our nearest neighbor galaxy (Andromeda) is a mere 2,500,000 light years away.
On this progress bar, the portion that I've completed would be one inch wide.
As you can see, the odds of success are fairly low. And according to analysis by ograll, the odds of a match existing at all is around 67%.
Part of his quest was soliciting searchers from others. I wrote one in Groovy, using the JRE-supplied MessageDigest implementation. It now runs on my office workstation when I'm not using it (running 32-bit Windows XP).
After having inexplicable performance issues with threading in my initial searcher, I decided to rewrite the searcher in Java to see if it was a Groovy-specific issue. The Java version is slightly less than four times faster at searching, but the threading issue remains. This now runs on my office workstation when I'm not using it (now running 64-bit Java on CentOS 5).
Both version use the same algorithm: make a 32-character string of randomized hex digits, take it's MD5 hash, hex-encode it, and see if it matches the original. This requires 10-15 lines of code, and the other few hundred lines in each searcher is for scheduling, monitoring, and threading.
An error was encountered retrieving the Java code
import java.security.MessageDigest
import java.text.NumberFormat
import static Math.floor
NumberFormat nFormat = NumberFormat.integerInstance
Object semaphore = new Object()
long totalHashesChecked = 0
long totalHashesSent = 0
long start = System.currentTimeMillis()
long lastSendToMaster = start
String digits = "0123456789abcdef"
int len = digits.length()
List searcherThreads = []
def sendUpdate = { sendToMaster = true ->
long elapsed = (System.currentTimeMillis() - start) / 1000
String timing
if (elapsed < 60) { // less than a minute
timing = "$elapsed sec"
} else if (elapsed < 60 * 60) { // less than a hour
timing = "${floor(elapsed / 60 * 100) / 100} min"
} else if (elapsed < 60 * 60 * 24) { // less than a day
timing = "${floor(elapsed / 60 / 60 * 100) / 100} hrs"
} else {
timing = "${floor(elapsed / 60 / 60 / 24 * 100) / 100} days"
}
def minSinceLastSend = (System.currentTimeMillis() - lastSendToMaster) / 1000 / 60
def rate = (totalHashesChecked - totalHashesSent) / minSinceLastSend
println(
"${nFormat.format(totalHashesChecked)} checks in "
+ "$timing [threads: ${searcherThreads.size()}] "
+ "[checks/min: ${nFormat.format(rate)}]"
)
if (sendToMaster) {
print("[sending to master...")
synchronized (semaphore) {
def diff = totalHashesChecked - totalHashesSent
if (diff != 0) {
try {
new URL(
"http://www.barneyb.com/r/hash_identity.cfm"
+ "?psk=OMITTED"
+ "&count=" + diff
+ "&r=" + Math.random()
).getText()
totalHashesSent += diff
lastSendToMaster = System.currentTimeMillis()
println("done]")
} catch (Exception e) {
println("error ($diff)]")
}
}
}
}
}
def foundIt = { s ->
println("$s hashes to itself!")
sendUpdate()
try {
new URL(
"http://www.barneyb.com/r/hash_identity.cfm"
+ "?psk=OMITTED"
+ "&identity=" + s
+ "&r=" + Math.random()
).getText()
} catch (Exception e) {
println("error sending identity ($s) to master")
}
System.exit(0)
}
def searcher = { params ->
Random random = new Random()
def digest = MessageDigest.getInstance("MD5")
char[] chars = new char[32]
String s, hash
int i, sub, subcount = 5000 + random.nextInt(5000)
println("started searcher thread (${params.id})...")
while (true) {
for (sub = subcount; sub > 0; sub--) {
// build a candidate string
for (i = 0; i < chars.length; i++) {
chars[i] = digits.charAt(random.nextInt(len))
}
s = new String(chars)
// hash it
digest.reset()
digest.update(s.bytes)
hash = new BigInteger(1, digest.digest()).toString(16)
while (hash.length() < 32) {
hash = "0" + hash
}
// check for equality
if (s.equals(hash)) {
println(s)
println()
synchronized (semaphore) {
// this will double-count, but it's well w/in my
// margin of error, so not gonna worry about it
totalHashesChecked += subcount - sub + 1
}
foundIt(s)
break
}
}
synchronized (semaphore) {
totalHashesChecked += subcount
}
if (params.stopFlag) {
println("searcher thread (${params.id}) exiting...")
break
}
}
}
def addSearcher = {
def item = [map: [stopFlag: false, id: "search-" + searcherThreads.size()]]
item.closure = searcher.curry(item.map)
searcherThreads.push(item)
Thread.startDaemon(item.closure)
}
def killSearcher = { force = false ->
if (!force && searcherThreads.size() == 1) {
println("cowardly refusing to kill last searcher. Use 'q' to quit")
return
}
searcherThreads.pop().map.stopFlag = true
}
def killAllSearchers = {
while (searcherThreads.size() > 0) {
killSearcher(true)
}
}
volatile def updateThread
updateThread = Thread.startDaemon {
println("started background monitor...")
for (int i = 1; true; i++) {
Thread.sleep(5000)
if (updateThread == null) {
break
}
sendUpdate(i % 12 == 0) // every minute
}
}
def initialThreads = 1
try {
if (args.length > 0) {
initialThreads = Integer.parseInt(args[0])
}
} catch (NumberFormatException e) {
// oh well
}
// spin up the threads
initialThreads.times {
addSearcher()
}
def isr = new InputStreamReader(System.in)
while (true) {
char c = isr.read();
if (c == 'q') {
killAllSearchers()
} else if (c == '+') {
addSearcher()
} else if (c == '-') {
killSearcher()
}
if (searcherThreads.size() == 0) {
updateThread = null
println("exiting...")
Thread.sleep(1000)
sendUpdate()
break
}
}
In a nutshell, the ambiguity is whether this would be an identity hash (same letters, different case) for a mythical bb3() hash function that is only three characters long:
bb3('5Ae') = '5ae'
The MD5 hash algorithm (like all other message digests) operates on bytes, not character strings. However, humans can deal with strings far better than raw bytes, so MD5s are usually hex encoded for human consumption. "Hex", by the way, referse to hexadecimal notation, or base-16 numbers. Since we only have 10 numeric digits, hex uses the letters a-f (or A-F) to represent the digits from 10-15.
The problem arises because the letters representing the hex digits don't have any case semantics. For example, both "4ed9..." and "4ED9..." are valid representations for md5('def'), because in this case the letters characters aren't letters, they're digits.
Most systems seem to hex-encode with lower case letters, but not all. I've not personally run across any that use a mixture of lower and upper case letters, but it would not be incorrect to do so (just sorta weird).
The input string (the string being hashed), however, is case sensitive because it's not a list of digits, it's a collection of bytes. At the byte level, upper and lower letters are stored differently. For example:
md5('def') = '4ed9407630eb1000c0f6b63842defa7d'
md5('DEF') = '822dd494b3e14a82aa76bd455e6b6f4b'
So the question is whether or not the hex-encoded has can be recased to create a match. For example, consider this result (using a made-up hash function called bb3()):
bb3('5Ae') = '5ae' (or '5Ae' or '5AE')
bb3('5ae') = 'ae7' (or 'aE7' or 'AE7')
What's interesting about the first example is that the hash value has the same letters as the input string, just in a different case. If we change the case of the input string, we'll get a different hash (the second example). However, if we change the case of the output string we won't change the meaning of the result, and we will have lexicographically equal strings (the Kember Identity Hash).
Succinctly stated, the ambiguity is whether mixed-case hashes can be used in the identity. I've made the assumption that all hashes must be lowercase, but Elliot hasn't made a statement one way or the other.