Strings with Zero Hash Code

Here is some light diversion for these difficult times: How to find strings with zero hash code in Java. In particular, how to find one that you can remember. Like "misjudge corona modulation".

In this article, Heinz Kabutz analyzes a performance improvement for avoiding repeated computation of string hashcodes in the standard Java library. In Java 1.2, the hash code method was the following:

public int hashCode() {
   int h = 0;
   int off = offset;
   char val[] = value;
   int len = count;

   for (int i = 0; i < len; i++)
      h = 31*h + val[off++];
   return h;
}

Each character of the string is examined, making hashing slow for long strings—i.e. O(n) where n is the length of the string.

Java 1.3 caches the hash code. If the cached value is zero, the hash is computed.

.jpeg

But what if a string happens to have hash code zero? Then the caching mechanism fails, and the hash code is recomputed each time, which is slow for long strings. How likely is that? Assuming hash codes are equidistributed, there is a chance of 1 in 232 or about 2.32×10-10. That's ten times less than the likelihood of you getting killed by a falling coconut this year.

Still, some evil person might manufacture such strings on purpose. Java 13 improves the caching, using a separate Boolean value to check if the hash code has already been computed.

Heinz' article demonstrates the efficacy by looking at repetitions of the string "ARcZguv". Huh, "ARcZguv"? Oh right, of course, that string has hash code zero. It's easy to check it in JShell. Do it right now! And therefore, looking at the hashing algorithm, so does "ARcZguvARcZguv".

How did Heinz ever come up with this string? Did he try all strings of length 7? And was this the nicest one? Well, if he could, so can I. I whipped up a recursive algorithm that produces all strings of length n and filters by zero hash code. My program ran out of memory instantly. Using upper- and lowercase ASCII letters, there are 527 such strings, which is about a terabyte. So, buying more memory isn't the solution.

Of course, it isn't necessary to compute all strings first. It isn't even necessary to produce any strings. One can just work with an array of index values into the string "ABCDE...xyz". That's an array of seven integers, each between 0 and 51.

And you don't want to produce a new array for each attempt. Instead, update the array in place. How to do that? Imagine there were only ten letters to choose from, not 52. Then you'd want to generate the sequences

0000000
0000001
.......
0000009
0000010
.......
0000099
0000100

To go from one sequence to the next, add 1 to the last element. If the last element was a 9, then set it to zero and add the carry of 1 to the preceding element. Keep going until you no longer have a carry or reached the front.

Look at the complete program for details.

The program runs for a while and produces a total or 232 solutions. That's almost the number I expected: 527/232 is about 239.

For what it's worth, two of these are all uppercase:

PDFYFCD
YYHCCTL

Now imagine a coding interview with a stern interviewer asking: “Well, young person, are there any Java strings with hash code zero?” Wouldn't it be so much cooler if you could brightly answer: “Oh sure, "YYHCCTL"—just check it out in JShell”, instead of mumbling that surely such strings must exist. Except, who is going to remember "YYHCCTL"? That's only marginally better than "ARcZguv".

I modified the program to find solutions with lowercase letters of length 8. Here are all solutions:

aoffckzd atafwjsh bhlijevx brbjscia cyxowxpb ddnqavbj diiqutzn dndrjssr
dwyssqez egjuqmpg elevflik euzwoizs ezuxdhsw foazvcwh hanbdvpq hfibxuiu
hkdcmtby htydvqtb hytekpmf iiegilwr irzhrjiz jbkjpftg jgfkeemk jlakydfo
juvmcaww kxrqrxft lhcsptqa lqxtyrci maivwnmu mfdwlmfy moyxujxb mttyjiqf
pssfqqjt pxngfpcx qcdhomua qlyixkgi qvokbhxq raelkfjy rounibuf srqrxydc
tbbtvuno unszpjnt wjvcgzyf wtldpxkn wygeewdr xmrgwrhc xrmhlqag xwhiaoyk
yzdmqghh zdynzdyp zitoocrt znopdbkx

Some of these seem like they might be pronounceable in some language. I think I could memorize "rounibuf".

What about using real words? None of the words in /usr/share/dict/words has hash code zero, but one could concatenate them. The computation isn't so different. I just treat each word as a “letter” and adjust the hash code formula. The code is here. Without spaces between words, I get this beauty:

corrugatestendentiousness

With a space separator, it gets even better:

pollinating sandboxes

There you have it: a string with zero hash code in Java that you can actually remember.

This was the only result with two words. With three words, there is a steady stream of responses:

A discrepancies ladybug's
A dubiety's finked
A pagodas Reichstag
A raspberry condenser
A vivisection's foil
A's formally brakeman
A's forward's aspire
AMD outcrop's Transvaal
AMD's cryogenics Ogilvy's
AMD's nuke's orchard's
...

But the program runs painfully slowly. We can easily do better by modifying the quadratic 3SUM algorithm. First go through all the words and make a map for quickly looking up words by their hash code:

var byHashCode = new HashMap<Integer, ArrayList<String>>();
for (String word : words) {
   int h = word.hashCode();
   ArrayList<String> hwords = byHashCode.get(h);
   if (hwords == null) {
      hwords = new ArrayList<>();
      byHashCode.put(h, hwords);
   }
   hwords.add(word);
}

Now let s and t iterate over all words. For a given s and t, compute their contribution to the hash code (with the spaces added):

int h1 = (s.hashCode() * 31 + ' ') * ipow(31, t.length() + 1) + t.hashCode() * 31 + ' ';

Then, for each string length k, shift the hash code up so that a hash code of some string of length k can be added:

int h2 = h1 * ipow(31, k);

Now suppose u were a string of length k, so that the combined hash code h2 + u.hashCode() is zero. Then we would find it in our hash table:

for (String u : byHashCode.getOrDefault(-h2, emptyList))
   if (u.length() == k)
      System.out.println(s + " " + t + " " + u);

You can find the complete program here.

Run it and see it for yourself! It rapidly spits out a huge number of solutions, 249441 in all. There are lots of great choices—anything from

Afghanistan shook Knoxville

to

zonked telecaster promise

Pick one that is meaningful for you and memorize it, just in case someone ever asks you for a zero hash code string.

Finally, why are there so many solutions with three words, when they were slim pickings with seven or eight letters? The file /usr/share/dict/words has about 100,000 words. One would expect about 1018/232 = 232831 three-word combinations with hash code zero. The power of large numbers—something that's on everyone's mind these days.

Comments powered by Talkyard.