The Curious Case of the char Type

It's been almost twenty years since Gary Cornell contacted me to say “Cay, we're going to write a book on Java.” Those were simpler times. The Java 1.0 API had 211 classes/interfaces. And Unicode was a 16-bit code.

Now we have over 4,000 classes/interfaces in the API, and Unicode has grown to 21 bits.

The latter is an inconvenience for Java programmers. You need to understand some pesky details if you have (or would like to have) customers who use Chinese, or if you want to manipulate emoticons or symbols such as 'TROPICAL DRINK' (U+1F379). In particular, you need to know that a Java char is not the same as a Unicode “code point” (i.e., what one intuitively thinks of as a “Unicode character”).

A Java String uses the UTF-16 encoding, where most Unicode code points take up one char value, and some take up two. For example, the tropical drink character, erm, code point is encoded as the sequence '\uD83C' '\uDF79'.
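You can verify this in a couple of lines (a quick sketch of my own; the comments show the values you should get):

String drink = "\uD83C\uDF79"; // TROPICAL DRINK (U+1F379)
System.out.println(drink.length()); // 2 — two char values
System.out.println(drink.codePointCount(0, drink.length())); // 1 — a single code point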

So, what does that mean for a Java programmer? First off, you have to be careful with methods such as substring. If you pass inappropriate index values, you can end up with half a code point, which is guaranteed to cause grief later. As long as index values come from a call such as indexOf, you are safe, but don't use str.substring(0, 1) to get the first initial—you might just get half of it.
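Here are the failure mode and a safe alternative in code (again, a sketch of my own):

String str = "\uD83C\uDF79 on the beach"; // starts with TROPICAL DRINK (U+1F379)
String broken = str.substring(0, 1); // "\uD83C" — an unpaired surrogate, half a code point
int end = str.offsetByCodePoints(0, 1); // index just past the first code point
String fixed = str.substring(0, end); // the complete tropical drink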

The char type is now pretty useless for application programmers. If you call str.charAt(i), you might not get all of the code point, and even if you do, it might not be the ith one. Tip: If you need the code points of a string, call:

int[] codePoints = str.codePoints().toArray();
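From that array, you can safely pull out individual code points and turn them back into strings. For example (my quick sketch):

int[] codePoints = str.codePoints().toArray();
String first = new String(codePoints, 0, 1); // the complete first “character”
// and codePoints[i] really is the ith code point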

I recently finished the book “Core Java for the Impatient”, where I cover the “good parts” of Java for programmers who come from another language and want to get to work with Java without sorting through twenty years of historical baggage. In that book, I explain the bad news about char in somewhat mind-numbing detail and conclude by saying “You probably won’t use the char type very much.”

All modesty aside, I think that's a little better than what the Java tutorial has to offer on the subject:

char: The char data type is a single 16-bit Unicode character.

Uffff. What is a “single 16-bit Unicode character”???

A few days ago, I got an email from a reader who had spotted a somewhat unflattering review of the book in Java Magazine. Did the reviewer commend me for giving readers useful advice about avoiding char? No, sir. He kvetched that I say Java has four integer types (int, long, short, byte), when in fact, according to the Java Language Specification, it has five integral types (the last one being char).

That's of course correct, but the language specification has an entirely different purpose than a book for users of a programming language. The spec mentions the char type 113 times, and almost all of the coverage deals with arithmetic on char values and what happens when one converts between char and other types. Programming with strings isn't something that the spec cares much about.

So, it is technically true that char is “integral”, and for a spec writer that categorization is helpful. But is it helpful for an application programmer? It would be a pretty poor idea to use char for integer values, even if they happen to fall in the range from 0 to 65535.
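If you are tempted to do so anyway, here is a quick sketch of the kind of trouble you invite:

char a = 60000; // legal — a compile-time constant that fits in 0..65535
char b = 10000;
int ok = a + b; // 70000 — char arithmetic is promoted to int
// char sum = a + b; // does not compile without a cast
char sum = (char) (a + b); // 4464 — silently wrapped modulo 65536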

I like to write books for people who put a programming language to practical use, not those who obsess about technical minutiae. And, judging from Core Java, which has been a success for almost twenty years, that's working for the reading public. I'll raise a glass of 'TROPICAL DRINK' (U+1F379) to that!