Stop Using char in Java. And Code Points.

As I am editing the 13th edition of Core Java, I realize that I need to tone down the coverage of Unicode code points. I used to recommend that readers avoid char and use code points instead, but I came to realize that with modern Unicode, code points are problematic too. Just use String.

The Best of Times, The Worst of Times

Transport yourself back in time to October 1991. Unicode 1.0.0 saw the light of day, and a bright day it was. With 7,129 characters and ample space in its 2-byte reservoir for the characters of all languages, past, present, or future. No more incompatible code pages for European languages, no more multi-byte encodings of Asian languages!

Unicode was an instant success. Operating systems and programming languages eagerly embraced it. In 1996, the Java 1.0 language specification confidently stated:

The integral types are byte, short, int, and long, whose values are 8-bit, 16-bit, 32-bit and 64-bit signed two's-complement integers, respectively, and char, whose values are 16-bit unsigned integers representing Unicode characters.

This happy state of affairs lasted almost ten years, until March 2001. Only five years for Java, though. When Unicode 3.1 was released, it broke through the 16-bit barrier and clocked in at 94,140 code points, due to the addition of many Chinese/Japanese/Korean ideographs. You would have thought they could have counted them ten years earlier...

👋 to fixed-width encoding. Unicode characters, now in the “basic multilingual plane” and 16 “astral planes”, can be encoded with UTF-8, a multi-byte encoding that requires one byte for classic ASCII, four bytes for astral characters (such as the waving hand sign), and two or three bytes for those in between. Over 98% of web pages are encoded with UTF-8. JEP 400 says that UTF-8 should be the default file encoding for Java version 18 and above.
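
To see those byte counts for yourself, here is a minimal sketch. The class name Utf8Lengths and the sample characters are made up for illustration; only getBytes and StandardCharsets come from the standard library.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes that UTF-8 needs to encode the given string
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String wavingHand = new String(Character.toChars(0x1F44B));
        System.out.println(utf8Length("A"));        // 1 (classic ASCII)
        System.out.println(utf8Length("\u00E9"));   // 2 (e with acute accent, U+00E9)
        System.out.println(utf8Length("\u20AC"));   // 3 (euro sign, U+20AC)
        System.out.println(utf8Length(wavingHand)); // 4 (astral character, U+1F44B)
    }
}
```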

That's cold comfort for the Java VM (and Windows, which also embraced 16-bit characters). Having bought into the 16-bit char world, they have to resort to the icky UTF-16 encoding that uses one 16-bit value for the characters in the basic multilingual plane and two 16-bit values for the astral characters, taking advantage of a small window of “surrogate” characters that could be pressed into service for those astral planes.

var wavingHandSign = "👋";
wavingHandSign.length() // 2
wavingHandSign.substring(0, 1) // A malformed string with one surrogate character

The result: the worst of all worlds. An encoding that is both bulky and variable-width.
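
If you are curious how the two 16-bit values come about, here is a sketch of the surrogate arithmetic for the waving hand sign, U+1F44B. The class and method names are hypothetical. In real code you would simply call Character.toChars, which the sketch uses as a cross-check.

```java
import java.util.Arrays;

class Surrogates {
    // Splits an astral code point (U+10000 and above) into two UTF-16 code units by hand
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;               // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int wavingHand = 0x1F44B;
        char[] byHand = encode(wavingHand);
        char[] byLibrary = Character.toChars(wavingHand);
        System.out.println(Arrays.equals(byHand, byLibrary));      // true
        System.out.println(Character.isHighSurrogate(byHand[0]));  // true
        System.out.println(Character.isLowSurrogate(byHand[1]));   // true
    }
}
```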

As an aside, this is unrelated to the internal storage of strings in the JVM. JEP 254 specifies compact strings, where the JVM stores strings with no code point ≥ 256 in a byte array with one byte per character, and all others with two bytes per char value. In principle, the JVM is free to choose any internal implementation, such as UTF-8. But for backward compatibility, any method that takes a UTF-16 based index (such as charAt or substring) needs to find the index in constant time. That limits the choices.

And as an aside to the aside: Check out this race condition that Wouter Coekaerts found in the String(char[]) constructor as a result of this optimization.

The Java API for Code Points

In the Core Java book, starting with Java 5, I started telling readers not to use char but use code points instead. Some reviewers were unhappy and felt that one shouldn't make such a big deal out of the complexities that surely affected only a few exotic languages. Perhaps they were sensitive because it put the Java API in a bad light? Here are the highlights of the API support for code points.

The total number of code points is

int length = str.codePointCount(0, str.length());

Due to the variable length coding, you cannot get the nth code point in constant time, like you can with charAt. No method is supplied for that, probably so that programmers don't misuse it. You need to use index values in the UTF-16 encoding (which you might get from a call to indexOf).

To get the code point at index i, call

int cp = str.codePointAt(i);

Of course, i should be the starting index of a code point. (If i falls in the middle of a surrogate pair, you get the second half of the UTF-16 encoding.) You know there is a code point at 0. And you can find out how many char values a code point occupies:

int count = Character.charCount(cp); // 1 or 2

This loop extracts the code points sequentially:

int i = 0;
while (i < s.length()) {
   int cp = s.codePointAt(i);
   i += Character.charCount(cp);
   process(cp);
}

That's pretty tedious. Alternatively, you can use the codePoints method (which String has inherited from CharSequence since Java 8) that yields a stream of int values, one for each code point. If you like, you can turn the stream into an array like this:

int[] codePoints = s.codePoints().toArray();

Voilà, constant time random access for code points.
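
Here is a small sketch that puts the array next to the one library method that comes close to "give me the nth code point", namely offsetByCodePoints, which walks the string in linear time. The class name, the method name nthCodePoint, and the sample string are my own inventions.

```java
class CodePointAccess {
    // Returns the nth code point (counting from 0) without building an array.
    // offsetByCodePoints has to walk the string, so this is linear in n.
    static int nthCodePoint(String s, int n) {
        int index = s.offsetByCodePoints(0, n); // UTF-16 index where the nth code point starts
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F44B)) + "b"; // a, waving hand, b
        int[] codePoints = s.codePoints().toArray();
        System.out.println(s.length());                    // 4 char values
        System.out.println(codePoints.length);             // 3 code points
        System.out.println((char) codePoints[2]);          // b
        System.out.println((char) nthCodePoint(s, 2));     // b
        // An int[] of code points can be turned back into a string
        String roundTrip = new String(codePoints, 0, codePoints.length);
        System.out.println(roundTrip.equals(s));           // true
    }
}
```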

Grapheme Clusters

Now code points are no longer what they used to be. Manish Goregaokar argues here that we should stop ascribing meaning to a single code point.

Some languages, such as Hindi, have letters that are made from smaller building blocks in Unicode. There is an ever-increasing number of compositions such as gender and skin tone modifiers, flag emoji, and ad-hoc choices (such as bird + fire = phoenix). A sequence of code points that produces a single perceived shape is called a grapheme cluster.

Consider the 🇮🇹 flag. You perceive a single symbol: the flag of Italy. However, this symbol is composed of two Unicode code points: U+1F1EE (regional indicator symbol letter I) and U+1F1F9 (regional indicator symbol letter T). About 250 flags can be formed with these regional indicators. The pirate flag 🏴‍☠️, on the other hand, is composed of U+1F3F4 (waving black flag), U+200D (zero width joiner), U+2620 (skull and crossbones), and U+FE0F (variation selector-16). In Java, you need four char values to represent the first flag, five for the second.
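
You can check those counts with a few lines of code. In this sketch, the class name and the fromCodePoints helper are hypothetical; the code point values are the ones listed above.

```java
class FlagCounts {
    // Builds a string from a sequence of code points
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String italy = fromCodePoints(0x1F1EE, 0x1F1F9);
        String pirate = fromCodePoints(0x1F3F4, 0x200D, 0x2620, 0xFE0F);
        System.out.println(italy.codePointCount(0, italy.length()));   // 2
        System.out.println(italy.length());                            // 4
        System.out.println(pirate.codePointCount(0, pirate.length())); // 4
        System.out.println(pirate.length());                           // 5
    }
}
```

Only the code points above U+FFFF take two char values, which is why the pirate flag gets by with five.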

I don't know why they would do this. There is plenty of room in the Unicode space to accommodate all of these combinations as code points, but I am sure there must be good reasons.

At first, I thought, who cares, exotic languages, genders and skin tones, flags. But of course, it is just a question of time until their use becomes commonplace. When you perceive what looks to you like a character or symbol, it might be made up of one or more code points, each of which is encoded in Java as one or two char values.

Swift has a Character type to represent a single grapheme cluster. A Swift string is a sequence of Character values.

At first I was envious, but I am not sure that such a type adds a lot of value. Why have characters at all? We should just work with strings and substrings.

Just Use Strings

Since Java 20, there is a way of iterating over the grapheme clusters of a string, using the BreakIterator class from Java 1.1. It is easy to overlook because the main usage of that class has been to iterate over words, lines, and sentences.

Here is how to use it:

String s = "Ciao 🇮🇹!";
BreakIterator iter = BreakIterator.getCharacterInstance();
iter.setText(s);
int start = iter.first();
int end = iter.next();
while (end != BreakIterator.DONE) {
   String gc = s.substring(start, end);
   start = end;
   end = iter.next();
   process(gc);
}
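
If the loop bothers you, it can be tucked away in a helper. Here is one possible sketch. The class name GraphemeClusters and the method name split are hypothetical, and where exactly the boundaries fall for emoji depends on the Java version, as noted above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class GraphemeClusters {
    // Returns the pieces of s between consecutive character boundaries
    static List<String> split(String s) {
        List<String> result = new ArrayList<>();
        BreakIterator iter = BreakIterator.getCharacterInstance();
        iter.setText(s);
        int start = iter.first();
        for (int end = iter.next(); end != BreakIterator.DONE; start = end, end = iter.next()) {
            result.add(s.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // e followed by a combining acute accent is one perceived character
        System.out.println(split("e\u0301").size()); // 1
        System.out.println(split("Ciao!"));          // [C, i, a, o, !]
    }
}
```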

Ok, not very pretty.

Here is a much simpler way, clearly not as efficient. I was stunned to find out that this worked since Java 9!

s.split("\\b{g}"); // An array with elements "C", "i", "a", "o", " ", "🇮🇹", "!"

Or, to get a stream:

Pattern.compile("\\X").matcher(s).results().map(MatchResult::group)
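
For completeness, here is a sketch that collects that stream into a list, with the pattern compiled once. The class and method names are made up. I am assuming Java 9 or later, since Matcher.results does not exist before that.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexClusters {
    // \X matches one grapheme cluster; compile the pattern once and reuse it
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    static List<String> clusters(String s) {
        return CLUSTER.matcher(s).results()
            .map(MatchResult::group)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(clusters("Ciao!"));          // [C, i, a, o, !]
        // e followed by a combining acute accent counts as one cluster
        System.out.println(clusters("e\u0301").size()); // 1
    }
}
```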

This might look forbidding, but really, how often do you need to decompose a string into its grapheme clusters? In your existing code base, how often did you decompose a string into its char values or integer code points?

Most often, to peek inside a string, you get index values with indexOf, and then you extract substrings with substring. Or you use regular expression groups. You can still do that.

The key is to treat results of indexOf (and MatchResult.start/end) as opaque values. Don't ascribe any particular meaning to them. Who knows how many char values preceded the match that you were looking for? And who cares?

Ok, it's not quite an opaque value. There are a few things that you can/must use:

Or use one of the many methods that don't take an index: startsWith, contains, endsWith, replace, split, and so on.
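
Here is my own illustration of what treating an index as opaque might look like. The class name, the method names, and the sample data are hypothetical. The only arithmetic on the index is adding the length of the string that was searched for.

```java
class OpaqueIndex {
    // The part of s before the first occurrence of separator, or all of s if there is none
    static String before(String s, String separator) {
        int index = s.indexOf(separator);
        return index < 0 ? s : s.substring(0, index);
    }

    // The part of s after the first occurrence of separator, or "" if there is none
    static String after(String s, String separator) {
        int index = s.indexOf(separator);
        return index < 0 ? "" : s.substring(index + separator.length());
    }

    public static void main(String[] args) {
        String wavingHand = new String(Character.toChars(0x1F44B));
        String entry = wavingHand + " => hello";
        // It does not matter how many char values the emoji takes up
        System.out.println(before(entry, " => ").equals(wavingHand)); // true
        System.out.println(after(entry, " => "));                     // hello
    }
}
```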

What about making strings? Have you assembled them in the past from char values? Probably, the vast majority of your strings were made from string literals, strings returned from other methods, and concatenation. Keep doing that.

TL;DR

What About ...?

In the java.io package, input/output streams process byte values, and readers/writers process char values. How can one stay away from char when working with text files?

I guess it depends on what “working with text files” means. Do you read in a configuration file? Then you probably want the contents in a String or List<String> or Stream<String>, and for that, there are handy one-liners in java.nio.file. Or you have a JSON/TOML/XML/... parser.
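
In case you have not used them, here is a sketch of those one-liners. The class name, the countNonBlankLines helper, and the temporary file are there only to make the example self-contained. Files.readString and Files.writeString need Java 11 or later, and all of these methods assume UTF-8 unless you pass a charset.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class ReadText {
    // Files.lines reads lazily, so the stream must be closed
    static long countNonBlankLines(Path path) throws IOException {
        try (Stream<String> lines = Files.lines(path)) {
            return lines.filter(line -> !line.isBlank()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("config", ".txt");
        try {
            Files.writeString(path, "name=Ciao\n\nlang=it\n");
            String contents = Files.readString(path);     // the whole file as one String
            List<String> lines = Files.readAllLines(path); // one String per line
            System.out.println(contents.length());
            System.out.println(lines.size());              // 3
            System.out.println(countNonBlankLines(path));  // 2
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```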

If you write a parser by hand, then you may well want to parse the input one code point at a time. And you are probably unconcerned about grapheme clusters. (With java.io, you have to manually assemble code points from char values, using Character.isLowSurrogate/isHighSurrogate.)
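
Here is a rough sketch of that manual assembly. The class and method names are hypothetical, and passing unpaired surrogates through unchanged is just one possible policy. A real parser might prefer to report an error.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class CodePointReader {
    // Reads all remaining input and combines surrogate pairs into code points
    static List<Integer> readCodePoints(Reader in) throws IOException {
        List<Integer> result = new ArrayList<>();
        int ch = in.read();
        while (ch >= 0) {
            int next = in.read();
            if (Character.isHighSurrogate((char) ch)
                    && next >= 0 && Character.isLowSurrogate((char) next)) {
                result.add(Character.toCodePoint((char) ch, (char) next));
                ch = in.read();
            } else {
                result.add(ch); // a BMP character, or an unpaired surrogate passed through as is
                ch = next;
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String s = "a" + new String(Character.toChars(0x1F44B)) + "b";
        List<Integer> codePoints = readCodePoints(new StringReader(s));
        System.out.println(codePoints.size());                    // 3
        System.out.println(Integer.toHexString(codePoints.get(1))); // 1f44b
    }
}
```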

I am not saying that nobody should ever use code points. I just think it should be left for specialized tasks.

What about the classification methods in the Character class such as:

isLetter
isDigit
isSpaceChar
isUpperCase
isLowerCase
isEmoji
. . .

Frankly, if you traverse a string, looking for letters, digits, spaces, and so on, you are probably better off using regular expressions. Regular expressions get a lot of love from the Java developers, as you can see from the fact that grapheme cluster support has been there since Java 9.
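
As a sketch of the regular expression route, here are two made-up helpers that use the Unicode categories \p{L} (letters) and \p{Nd} (decimal digits) instead of calling isLetter or isDigit in a loop. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class Classify {
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");
    private static final Pattern DIGITS = Pattern.compile("\\p{Nd}+");

    // All maximal runs of letters in s
    static List<String> words(String s) {
        return LETTERS.matcher(s).results()
            .map(MatchResult::group)
            .collect(Collectors.toList());
    }

    // True if s is non-empty and consists only of decimal digits
    static boolean isAllDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(words("Ciao, mondo! 42")); // [Ciao, mondo]
        System.out.println(isAllDigits("2024"));      // true
        System.out.println(isAllDigits("20x4"));      // false
    }
}
```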

There is one legitimate reason for using code points or even char values: to analyze or assemble characters that follow a pattern.

For example, when parsing a number, it is convenient to find the value of a decimal digit as ch - '0'. Or conversely, when formatting, producing the character '0' + digit.
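
Here is a minimal sketch of both directions. The class and method names are hypothetical, only the ASCII digits 0 through 9 are accepted, and there is no overflow check, so treat it as an illustration rather than a replacement for Integer.parseInt.

```java
class Digits {
    // Parses a non-empty string of ASCII digits; no sign, no overflow check
    static int parseDecimal(String s) {
        if (s.isEmpty()) throw new NumberFormatException("empty string");
        int result = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < '0' || ch > '9') throw new NumberFormatException(s);
            result = 10 * result + (ch - '0'); // value of the decimal digit
        }
        return result;
    }

    // Formats a non-negative integer, producing the digits from right to left
    static String formatDecimal(int n) {
        StringBuilder digits = new StringBuilder();
        do {
            digits.append((char) ('0' + n % 10)); // character for the last digit
            n /= 10;
        } while (n > 0);
        return digits.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(parseDecimal("2024") + 1); // 2025
        System.out.println(formatDecimal(2024));      // 2024
    }
}
```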

Or you may want to turn a country code into one of those nifty flag emojis:

String country = locale.getCountry();
String flag = new StringBuilder()
   .appendCodePoint(0x1F1E6 + country.charAt(0) - 'A')
   .appendCodePoint(0x1F1E6 + country.charAt(1) - 'A')
   .toString();

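Here I can call charAt with confidence, provided that the locale has an ISO 3166 country code: Locale.getCountry() then returns a string of two uppercase ASCII characters. (It can also return an empty string or a three-digit UN M.49 area code, which this snippet does not handle.)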
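
Packaged as a helper, it might look like the following sketch. The class name Flags, the method name flagOf, and the guard are my additions. The guard is there because a Locale without a country yields an empty string.

```java
import java.util.Locale;

class Flags {
    // Turns a two-letter uppercase country code into a pair of regional indicator symbols
    static String flagOf(String countryCode) {
        if (!countryCode.matches("[A-Z]{2}")) {
            throw new IllegalArgumentException("Not a two-letter country code: " + countryCode);
        }
        return new StringBuilder()
            .appendCodePoint(0x1F1E6 + countryCode.charAt(0) - 'A')
            .appendCodePoint(0x1F1E6 + countryCode.charAt(1) - 'A')
            .toString();
    }

    public static void main(String[] args) {
        String flag = flagOf(Locale.ITALY.getCountry()); // country code "IT"
        System.out.println(flag);
        System.out.println(Integer.toHexString(flag.codePointAt(0))); // 1f1ee
    }
}
```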

These cases are rare and easy to isolate in helper methods. In general, work with strings, and treat index values as opaque.
