What's in a name?

Did you know that Java identifiers can contain €, a “wavy underline” ﹏ and the bell character? If not, read on for the gory details.

Java Identifiers


“What's in a name?”, the bard asks. “That which we call a rose, by any other name would smell as sweet.”

In Java, what's in a name for a variable, method, or class? We all know that. Letters. Digits (but not at the front). Underscore. The dollar sign. For example: MyClass, myMethod, MY_CONSTANT1.

And characters other than ASCII are ok: maMéthode and π are perfectly valid names.

There are a couple of caveats. The $ sign is reserved, by convention, for generated code. And an _ all by itself is, since Java 9, a keyword, which may at some point in the future denote some kind of wildcard, like in Scala.

Wavy Low Line

I am updating the classic Core Java for the Java 17 LTS release. Whenever I saw a Java-related tidbit, I added it to my revision notes, which I am now examining. One of them said something about underscore-like characters. Actually, any character in the Punctuation, Connector category is valid in Java names. Like “wavy low line”:


var ﹏ = 0

I didn't know that. For 25 years, I wrote that identifiers contain letters and digits (in many languages), and $, and _. It never occurred to me that there were alternatives to $ and _.

I had to try it out, and sure enough, it worked:

$ jshell
|  Welcome to JShell -- Version 11.0.4
|  For an introduction type: /help intro

jshell> var ﹏ = 0
﹏ ==> 0

Also, the dollar symbol isn't that special. Any currency symbol works:

jshell> var € = 1.2
€ ==> 1.2
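To see why both names are accepted, you can ask the Character class how it classifies these characters. Here is a minimal sketch. The class name IdentifierCategories and the helper describe are made up for this illustration; only Character.getType, Character.getName, and Character.isJavaIdentifierStart are actual JDK API.

```java
public class IdentifierCategories {
    // Hypothetical helper: summarizes how the JDK classifies one code point.
    public static String describe(int codePoint) {
        String category;
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION: category = "Pc"; break;
            case Character.CURRENCY_SYMBOL: category = "Sc"; break;
            default: category = "other"; break;
        }
        return String.format("U+%04X %s [%s] start=%b",
            codePoint, Character.getName(codePoint), category,
            Character.isJavaIdentifierStart(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0xFE4F)); // the wavy low line
        System.out.println(describe(0x20AC)); // the euro sign
        System.out.println(describe('_'));    // the familiar underscore, also Pc
        System.out.println(describe('$'));    // the familiar dollar sign, also Sc
    }
}
```

The underscore and the wavy low line land in the same category, and so do the dollar and euro signs, which is all that the identifier rules look at.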

Naming Is Hard

We all know that there are two hard problems in computer science: cache invalidation and naming things.

But I didn't quite appreciate how hard “naming things” is in Java. I thought, ok, names are made up of letters, numbers, and, as I now know, currency symbols, and those weird underscore-like connectors. To confirm, I went to the Java Language Specification. Section 3.8 covers identifiers. It is surprisingly vague and references Character.isJavaIdentifierStart(int), Character.isJavaIdentifierPart(int), and Character.isIdentifierIgnorable(int). Time to study the API doc in earnest.
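Since the specification delegates to these Character methods, a name check can be sketched directly on top of them. The class IdentifierCheck and the method looksLikeIdentifier are hypothetical names for this example, and the sketch makes no attempt to reject keywords or literals such as class, true, or a lone underscore.

```java
public class IdentifierCheck {
    // Hypothetical helper: applies the two Character predicates that JLS 3.8 refers to.
    // It does not reject keywords, boolean literals, or null.
    public static boolean looksLikeIdentifier(String name) {
        if (name.isEmpty()) return false;
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        // Walk by code point so that characters outside the BMP are handled correctly.
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = { "myMethod", "MY_CONSTANT1", "\u03C0", "1abc", "a-b" };
        for (String name : names) {
            System.out.println(name + ": " + looksLikeIdentifier(name));
        }
    }
}
```

The first three names pass. The last two fail because a digit cannot start an identifier and a hyphen cannot appear in one.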

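Here is how isJavaIdentifierPart defines when a Unicode character is valid in an identifier:

A character may be part of a Java identifier if it is a letter, a currency symbol (such as '$'), a connecting punctuation character (such as '_'), a digit, a numeric letter (such as a Roman numeral character), a combining mark, a non-spacing mark, or a character for which isIdentifierIgnorable returns true.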

The API doc doesn't define what a currency symbol is, but a quick test program confirms that it's the same as the CURRENCY_SYMBOL type, which is Unicode category Sc.
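Such a test can be sketched as follows. This is not necessarily the program used for the check described above, just one way to do it: collect every code point that may start an identifier without being a letter, a letter number, or connector punctuation, and then see whether the leftovers all have type CURRENCY_SYMBOL. The names CurrencyScan and currencyLikeStarts are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CurrencyScan {
    // Hypothetical helper: code points that are valid identifier starts but are
    // not letters, letter numbers (Nl), or connector punctuation (Pc).
    // What remains should be what the API doc calls "currency symbols".
    public static List<Integer> currencyLikeStarts() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int type = Character.getType(cp);
            if (Character.isJavaIdentifierStart(cp)
                    && !Character.isLetter(cp)
                    && type != Character.LETTER_NUMBER
                    && type != Character.CONNECTOR_PUNCTUATION) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> candidates = currencyLikeStarts();
        int outsideSc = 0;
        for (int cp : candidates) {
            if (Character.getType(cp) != Character.CURRENCY_SYMBOL) {
                outsideSc++;
                System.out.printf("U+%04X is not in category Sc%n", cp);
            }
        }
        System.out.println(candidates.size() + " candidates, " + outsideSc + " outside Sc");
    }
}
```

The exact count depends on the Unicode version that your JDK supports, so expect it to differ between Java releases.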

And we find two additional sources of valid characters: combining marks and non-spacing marks.

I couldn't figure out how combining marks work, since I am unfortunately not familiar with any of the languages for which they are used. An example of a non-spacing mark is the combining acute accent (U+0301). When combined with another letter, it sits on top: é. Ok, that's handy if you have a variable name with an accent:

var méchant = true;

The JLS warns about a potential problem. There is a Unicode character é (U+00E9), and the name méchant (with e followed by U+0301) is different from the identical-looking méchant (with U+00E9 for é).
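The difference is easy to demonstrate with strings, using Unicode escapes so that nothing depends on how your editor stores the accent. The sketch below also uses java.text.Normalizer to show that the two spellings coincide after NFC normalization. The compiler does not do that for you. The class AccentForms and the method sameAfterNfc are hypothetical names for this example.

```java
import java.text.Normalizer;

public class AccentForms {
    // Hypothetical helper: compares two strings after NFC normalization.
    public static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
            .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "m\u00E9chant"; // the accented e as one code point, U+00E9
        String decomposed = "me\u0301chant"; // plain e followed by the combining acute, U+0301

        System.out.println(precomposed.equals(decomposed));      // false
        System.out.println(precomposed.length() + " vs " + decomposed.length()); // 7 vs 8
        System.out.println(sameAfterNfc(precomposed, decomposed)); // true
    }
}
```

A check like this could be used to flag look-alike names in a code base, assuming that you consider NFC equivalence the right notion of “looks the same”.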

The Deplorable Ignorables

Now on to the most bizarre characters that are allowed in Java names: the ignorables. Those are characters in the Control and Format categories, except for those that are white space.

You find such beauties as Bell (U+0007) in the Control category and Invisible Times (U+2062) in the Format category.

Any number of them can occur within names, and they are simply ignored.

Why is this a good idea??? The mind reels.
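What “ignored” means can be sketched by removing every character for which Character.isIdentifierIgnorable returns true. The class Ignorables and the method stripIgnorable are invented for this illustration; this only mimics the effect and is not how the compiler is implemented.

```java
public class Ignorables {
    // Hypothetical helper: drops every code point that Character.isIdentifierIgnorable accepts.
    public static String stripIgnorable(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints()
            .filter(cp -> !Character.isIdentifierIgnorable(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String withBell = "a\u0007b";           // Bell, Control category
        String withInvisibleTimes = "a\u2062b"; // Invisible Times, Format category

        System.out.println(stripIgnorable(withBell).equals("ab"));           // true
        System.out.println(stripIgnorable(withInvisibleTimes).equals("ab")); // true
    }
}
```

Both spellings reduce to plain ab, so in a program they name the same variable.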

Names of Public Classes and Packages

From the point of view of the Java language, there is nothing special about the names of public classes. They are identifiers. However, they are stored in the file system, as ClassName.java. And package names give rise to directories in the file system. And what about JAR files? If you use special characters, they must work in all these places.

Way back when, Windows used code page 437 for file systems. In early editions of Core Java, I had to warn readers:

If you make a class Bär and try to run it in Windows 95, you get an error message “cannot find class BΣr”.

These days, Unicode characters work fine in file names. Characters below U+0020 (such as the bell character) are disallowed in NTFS, but fortunately javac removes the ignorable characters before writing class files.

All Together

This file demonstrates all the phenomena that I just discussed. Here it is visually. Chrome, ever the clown, shows the bell character as a telephone.

public class Test {
   public static void main(String[] args) {
      var € = 0;
      var ﹏ = 1;
      var méchant = true;
      var méchant = false; // NOT an error
      var ab = 0; // Bell
      a⁢b = 1; // Invisible Times
   }
}

And in hex:

00000000  70 75 62 6c 69 63 20 63  6c 61 73 73 20 54 65 73  |public class Tes|
00000010  74 20 7b 0a 20 20 20 70  75 62 6c 69 63 20 73 74  |t {.   public st|
00000020  61 74 69 63 20 76 6f 69  64 20 6d 61 69 6e 28 53  |atic void main(S|
00000030  74 72 69 6e 67 5b 5d 20  61 72 67 73 29 20 7b 0a  |tring[] args) {.|
00000040  20 20 20 20 20 20 76 61  72 20 e2 82 ac 20 3d 20  |      var ... = |
00000050  30 3b 0a 20 20 20 20 20  20 76 61 72 20 ef b9 8f  |0;.      var ...|
00000060  20 3d 20 31 3b 0a 20 20  20 20 20 20 76 61 72 20  | = 1;.      var |
00000070  6d c3 a9 63 68 61 6e 74  20 3d 20 74 72 75 65 3b  |m..chant = true;|
00000080  0a 20 20 20 20 20 20 76  61 72 20 6d 65 cc 81 63  |.      var me..c|
00000090  68 61 6e 74 20 3d 20 66  61 6c 73 65 3b 20 2f 2f  |hant = false; //|
000000a0  20 4e 4f 54 20 61 6e 20  65 72 72 6f 72 0a 20 20  | NOT an error.  |
000000b0  20 20 20 20 76 61 72 20  61 07 62 20 3d 20 30 3b  |    var a.b = 0;|
000000c0  20 2f 2f 20 42 65 6c 6c  0a 20 20 20 20 20 20 61  | // Bell.      a|
000000d0  e2 81 a2 62 20 3d 20 31  3b 20 2f 2f 20 49 6e 76  |...b = 1; // Inv|
000000e0  69 73 69 62 6c 65 20 54  69 6d 65 73 0a 20 20 20  |isible Times.   |
000000f0  7d 0a 7d 0a                                       |}.}.|
000000f4

In the hex dump, you can distinguish m..chant and me..chant, where the .. are two bytes of UTF-8 encoding. Note the ignorable characters in a.b and a...b.
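If you want to reproduce those byte sequences without a hex editor, a few lines of Java will do. This is a small sketch; the class Utf8Hex and the method utf8Hex are hypothetical names, and the strings use Unicode escapes to pin down exactly which characters are meant.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    // Hypothetical helper: renders the UTF-8 bytes of a string the way a hex dump would.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("m\u00E9chant"));  // 6d c3 a9 63 68 61 6e 74
        System.out.println(utf8Hex("me\u0301chant")); // 6d 65 cc 81 63 68 61 6e 74
        System.out.println(utf8Hex("a\u0007b"));      // 61 07 62
        System.out.println(utf8Hex("a\u2062b"));      // 61 e2 81 a2 62
    }
}
```

The four outputs match the corresponding stretches of the dump at offsets 0x70, 0x8c, 0xb8, and 0xcf.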

See (and hear) the file in the terminal. If your terminal is configured to use an “audible bell”, you should hear a beep.

cat Test.java

When you compile it, there are no error messages:

javac Test.java

What Did We Learn?

