Unicode—One of the Two Hard Problems on the Internet

As José Paumard said, there are only two hard problems on the Internet: time zones and Unicode. I got flustered by a blog post titled “Make sure you know which Unicode version is supported by your programming language version”. That turned out to be a red herring. The real culprit was a buggy regex implementation in Java.

The Scary Example

Last week, we had a great virtual unconference, a joint meeting of JAlba and JChateau. With more than a year of COVID, the participants had the audio/video issues under control. Time zones, not so much. José Paumard, the organizer of JChateau, commented that there are only two hard problems on the Internet: Time zones and Unicode. With an accented letter in his name, I am sure he feels the latter issue acutely.


I encountered a similarly grave problem after reading the article by Madalin Ilie, who points out that the following code acts differently on Java 8 and Java 11:

public class Test {
   public static void main(String... args) {
      String input = "this is a great \uD83E\uDD76 article";
      String output = input.replaceAll("[\\p{C}\\p{So}]+", "");

      System.out.println("input = " + input);
      System.out.println("output = " + output);
   }
}

On Java 8, the output is:

input = this is a great 🥶 article
output = this is a great ? article

But on Java 11, it is:

input = this is a great 🥶 article
output = this is a great  article

The author notes that the 🥶 character (U+1F976 "Freezing Face") is part of Unicode 11 and concludes: “Even though a Java version can receive, write/store and forward the latest Unicode characters, any attempt to manipulate them might result in weird ? symbols if the Unicode char is not from the version supported by your JRE version”.

I was disturbed by that and investigated. Read on for the gory details.

The C and So Categories

The intent of the replacement is to “sanitize” strings, removing weird control characters and emoji. To see why this works, we need to briefly review Unicode categories. Each code point has a single “category” value, such as Lu (uppercase letter), Nd (decimal number), Sm (math symbol) or Pf (final punctuation).

Emojis are assigned the So (other symbols) category, and they will be removed by the above regular expression.

Provided that the JRE knows that they are emoji.

That's where the Unicode version comes in. Java 8, stuck at Unicode 6.2, has no idea that 🥶 is an emoji in Unicode 11. So, it reports the category as Cn (unassigned).
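To see which category a given JRE assigns to a code point, one can ask Character.getType. The following is a minimal sketch; the class name CategoryDemo and the helper abbreviation are made up for illustration, and the helper only translates the handful of categories mentioned in this article.

```java
class CategoryDemo {
   // Hypothetical helper: maps some of the constants returned by
   // Character.getType to the two-letter Unicode abbreviations.
   static String abbreviation(int codePoint) {
      switch (Character.getType(codePoint)) {
         case Character.UPPERCASE_LETTER: return "Lu";
         case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
         case Character.MATH_SYMBOL: return "Sm";
         case Character.FINAL_QUOTE_PUNCTUATION: return "Pf";
         case Character.OTHER_SYMBOL: return "So";
         case Character.UNASSIGNED: return "Cn";
         default: return "other";
      }
   }

   public static void main(String... args) {
      System.out.println(abbreviation('A'));    // Lu
      System.out.println(abbreviation('7'));    // Nd
      System.out.println(abbreviation('+'));    // Sm
      System.out.println(abbreviation(0x00A9)); // So (copyright sign)
      // The answer for U+1F976 depends on the Unicode tables of the JRE:
      // Cn if they predate Unicode 11, So otherwise.
      System.out.println(abbreviation(0x1F976));
   }
}
```

Running this under different Java versions shows the last line changing while the others stay put.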

Still, that should be fine. It would then be matched by \p{C}, which includes the five categories shown in the following table:

Abbreviation  Name         Number of characters  Description
Cc            Control      65                    Control characters such as carriage return, line feed, and so on
Cf            Format       ≥ 161                 Format characters such as soft hyphen or “shorthand format continuing overlap”
Cs            Surrogate    2048                  The “low” and “high” surrogates used in the UTF-16 encoding
Co            Private use  137468                Code points that have no prescribed interpretation
Cn            Other        ≤ 830672              66 noncharacters and code points that have not yet been assigned

Some of these categories have a fixed size. But the size of Cn can go down as more code points are assigned. And it is theoretically possible (though perhaps unlikely) that new formatting characters are added. It definitely makes sense to filter all of these out.

So, why doesn't Java 8 do the right thing and filter out 🥶? It's a bug. Actually, there is more than one bug.

The Bugs

The 🥶 character is encoded in UTF-16 (the encoding used in the JVM) as a pair of surrogates U+D83E and U+DD76. Note that these should not match category Cs. That is intended for unpaired surrogates.
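The surrogate values are easy to verify. This small sketch (the class SurrogateDemo and its helper codeUnits are illustrative names, not part of any library) builds the strings from their code points and prints the UTF-16 code units:

```java
class SurrogateDemo {
   // Formats each UTF-16 code unit of s in hexadecimal, separated by spaces.
   static String codeUnits(String s) {
      StringBuilder result = new StringBuilder();
      for (int i = 0; i < s.length(); i++) {
         if (i > 0) result.append(' ');
         result.append(String.format("%x", (int) s.charAt(i)));
      }
      return result.toString();
   }

   public static void main(String... args) {
      String freezingFace = new String(Character.toChars(0x1F976));
      System.out.println(codeUnits(freezingFace)); // d83e dd76
      System.out.println(freezingFace.length()); // 2 code units
      System.out.println(freezingFace.codePointCount(0, freezingFace.length())); // 1 code point

      String tropicalDrink = new String(Character.toChars(0x1F379));
      System.out.println(codeUnits(tropicalDrink)); // d83c df79
   }
}
```

The conversion is pure arithmetic on the code point, so it works the same on every Java version, whether or not the JRE knows the character.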

But with Java 8, \p{C} falsely matches the second surrogate. After removing it, the first surrogate remains and is printed as a ?.

This bug is fixed in Java 11.

When investigating, I looked more closely at the matches for the categories Cc, Cf, Cs, Co, and Cn that make up category C. Both in Java 8 and Java 11, \p{Cs} falsely matches the second surrogate. This bug is fixed in Java 17.

It seems a bit of a miracle how Java 11 gets the right result for \p{C} when it gets it wrong for \p{Cs}. But that's where the blogger went astray. Java 11 does not know the 🥶 character. It was defined in Unicode 11, and Java 11 only supports Unicode 10. So, it gets matched as Cn, a character that has not been assigned.

Let's try with a character that Java 11 knows, such as 🍹 U+1F379 "Tropical Drink", which is encoded as U+D83C and U+DF79. Sadly,

"\uD83C\uDF79".replaceAll("\\p{C}", "")

yields "\uD83C", which is patently absurd. The string should not have been changed.

This test program shows the issues more clearly. Run it under Java 8, 11, and 17, to see how the behavior changed.

public class Test {
   public static void printString(String caption, String s) {
      System.out.print(caption + ": " + s + " ");
      System.out.print(" [ ");
      for (int i = 0; i < s.length(); i++) System.out.printf("%x ", (int) s.charAt(i));
      System.out.println("]");
   }

   public static void main(String... args) {
      String input = "\uD83E\uDD76\uD83C\uDF79";        
      printString("input", input);
      printString("Removing Cs", input.replaceAll("[\\p{Cs}]+", ""));
      printString("Removing Cn", input.replaceAll("[\\p{Cn}]+", ""));
      printString("Removing Cc, Cf, Cs, Co, Cn", input.replaceAll("[\\p{Cc}\\p{Cf}\\p{Cs}\\p{Co}\\p{Cn}]+", ""));
      printString("Removing C", input.replaceAll("[\\p{C}]+", ""));
   }
}

What Did We Learn Today?

As Unicode gains assigned code points, the result of matching against the Cn or C category changes. That is to be expected. Matching both Cn and So, however, will reliably filter out emojis. At least for this use case, one shouldn't have to worry too much about Unicode versions.
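For comparison, here is a sketch of the same sanitizing idea that sidesteps the regex engine and walks the string by code point. This is not the approach from the original article; the class CodePointFilter and its methods are hypothetical names, and the category list simply mirrors [\p{C}\p{So}].

```java
class CodePointFilter {
   // True for the categories covered by \p{C} (Cc, Cf, Cs, Co, Cn) and \p{So}.
   static boolean isUnwanted(int codePoint) {
      switch (Character.getType(codePoint)) {
         case Character.CONTROL:
         case Character.FORMAT:
         case Character.SURROGATE:
         case Character.PRIVATE_USE:
         case Character.UNASSIGNED:
         case Character.OTHER_SYMBOL:
            return true;
         default:
            return false;
      }
   }

   // Keeps only the code points that are not in one of the unwanted categories.
   // String.codePoints() combines valid surrogate pairs into a single code point,
   // so a pair is either kept whole or dropped whole.
   static String sanitize(String input) {
      StringBuilder result = new StringBuilder();
      input.codePoints()
         .filter(cp -> !isUnwanted(cp))
         .forEach(result::appendCodePoint);
      return result.toString();
   }

   public static void main(String... args) {
      String input = "this is a great \uD83E\uDD76 article";
      System.out.println("output = " + sanitize(input));
   }
}
```

Whether the JRE classifies 🥶 as Cn or as So, the whole code point is removed, which is the version-independent behavior described above. Like the regular expression, this also strips line feeds and tabs, since they are control characters.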

That is, if the implementation isn't buggy, as it was with Java. Obviously, there is no good strategy to defend against such bugs other than (a) independent testing and (b) moving to an updated version of Java.
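One cheap form of independent testing is to check that a sanitized string is still well-formed UTF-16. The following sketch assumes a hypothetical helper hasUnpairedSurrogate in a made-up class WellFormedCheck; it flags exactly the kind of damage that showed up as a ? in the Java 8 output.

```java
class WellFormedCheck {
   // Returns true if s contains a surrogate code unit that is not part
   // of a high surrogate followed directly by a low surrogate.
   static boolean hasUnpairedSurrogate(String s) {
      for (int i = 0; i < s.length(); i++) {
         char c = s.charAt(i);
         if (Character.isHighSurrogate(c)) {
            if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
               i++; // Skip the low half of a valid pair
            } else {
               return true; // High surrogate without a partner
            }
         } else if (Character.isLowSurrogate(c)) {
            return true; // Low surrogate without a preceding high surrogate
         }
      }
      return false;
   }

   public static void main(String... args) {
      String input = "this is a great \uD83E\uDD76 article";
      String output = input.replaceAll("[\\p{C}\\p{So}]+", "");
      // The result depends on the Java version that runs this program
      System.out.println(hasUnpairedSurrogate(output) ? "damaged" : "well-formed");
   }
}
```

A check like this could go into a unit test that runs on every Java version a project supports.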

That benefit of moving forward from ancient Java versions is sometimes forgotten. Right now, there is a fair amount of reluctance to move beyond Java 8 because of the pain that comes with modules. If the new features don't sound sufficiently exciting, perhaps the nagging fear of obscure bugs might provide the required motivation.
