It's been almost twenty years since Gary Cornell contacted me to tell me “Cay, we're going to write a book on Java.” Those were simpler times. The Java 1.0 API had 211 classes/interfaces. And Unicode was a 16-bit code.
Now we have over 4,000 classes/interfaces in the API, and Unicode has grown to 21 bits.
The latter is an inconvenience for Java programmers. You need to understand some pesky details if you have (or would like to have) customers who use Chinese, or if you want to manipulate emoticons or symbols such as 'TROPICAL DRINK' (U+1F379). In particular, you need to know that a Java char is not the same as a Unicode “code point” (i.e., what one intuitively thinks of as a “Unicode character”).
A String uses the UTF-16 encoding, where most Unicode code points take up one char value, and some take up two. For example, the tropical drink character, erm, code point is encoded as the sequence '\uD83C' '\uDF79'.
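Here is a minimal sketch (my own, not part of the original text) showing both views of that code point:

// U+1F379 lies outside the Basic Multilingual Plane, so UTF-16
// encodes it as a surrogate pair of two char values.
String drink = "\uD83C\uDF79";
System.out.println(drink.length()); // 2 -- the number of char values
System.out.println(drink.codePointCount(0, drink.length())); // 1 code point
System.out.printf("U+%X%n", drink.codePointAt(0)); // U+1F379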
So, what does that mean for a Java programmer? First off, you have to be careful with methods such as substring. If you pass inappropriate index values, you can end up with half a code point, which is guaranteed to cause grief later. As long as the index values come from a call such as indexOf, you are safe. But don't use str.substring(0, 1) to get the first initial, or you might just get half of it.
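To make the pitfall concrete, here is a small example of my own, using the standard offsetByCodePoints method to advance by code points instead of char values:

String s = "\uD83C\uDF79 to go"; // starts with U+1F379
String broken = s.substring(0, 1); // "\uD83C" -- an unpaired surrogate, half a code point
String safe = s.substring(0, s.offsetByCodePoints(0, 1)); // the complete drink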
The char type is now pretty useless for application programmers. If you call str.charAt(i), you might not get all of the code point, and even if you do, it might not be the ith one. Tip: If you need the code points of a string, call:
int[] codePoints = str.codePoints().toArray();
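Once you have the array, counting and indexing behave the way users expect. A quick sketch of my own:

String str = "\uD83C\uDF79 to go";
int[] codePoints = str.codePoints().toArray();
System.out.println(str.length()); // 8 -- char values; the drink takes two
System.out.println(codePoints.length); // 7 -- what a user would call the length
String first = new String(Character.toChars(codePoints[0])); // the drink, as a String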
I recently finished the book “Core Java for the Impatient”, where I cover the “good parts” of Java for programmers who come from another language and want to get to work with Java without sorting through twenty years of historical baggage. In that book, I explain the bad news about char in somewhat mind-numbing detail and conclude by saying “You probably won’t use the char type very much.”
All modesty aside, I think that's a little better than what the Java tutorial has to offer on the subject:
The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Uffff. What is a “single 16-bit Unicode character”???
A few days ago, I got an email from a reader who had spotted a somewhat unflattering review of the book in Java Magazine. Did the reviewer commend me on giving readers useful advice about avoiding char? No sir. He kvetches that I say that Java has four integer types (int, long, short, and byte), when in fact, according to the Java Language Specification, it has five integral types (the last one being char).
That's of course correct, but the language specification has an entirely different purpose than a book for users of a programming language. The spec mentions the char type 113 times, and almost all of the coverage deals with arithmetic on char values and what happens when one converts between char and other types. Programming with strings isn't something that the spec cares much about.
So, it is technically true that char is “integral”, and for a spec writer that categorization is helpful. But is it helpful for an application programmer? It would be a pretty poor idea to use char for integer values, even if they happen to fall in the range from 0 to 65535.
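To see why, consider this example of my own devising: arithmetic on char promotes to int, and compound assignments cast back silently, with wraparound.

char counter = 65535; // the maximum char value
// counter = counter + 1; // does not compile: counter + 1 is an int
counter += 1; // compiles, but silently wraps around
System.out.println((int) counter); // prints 0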
I like to write books for people who put a programming language to practical use, not those who obsess about technical minutiae. And, judging from Core Java, which has been a success for almost twenty years, that's working for the reading public. I'll raise a glass of 'TROPICAL DRINK' (U+1F379) to that!