Did you know that Java identifiers can contain €, a “wavy underline” ﹏ and the bell character? If not, read on for the gory details.
“What's in a name?”, the bard asks. “That which we call a rose, by any other name would smell as sweet.”
In Java, what's in a name for a variable, method, or class? We all know that. Letters. Digits (but not at the front). Underscore. The dollar sign: MyClass, myMethod, MY_CONSTANT1.
And characters other than ASCII are ok: maMéthode and π are perfectly valid names.
There are a couple of caveats. The $ sign is reserved, by convention, for generated code. And an _ all by itself is, since Java 9, a keyword, which may at some point in the future denote some kind of wildcard, like in Scala.
I am updating the classic Core Java for the Java 17 LTS release. Whenever I saw a Java-related tidbit, I added it to my revision notes, which I am now examining. One of them said something about underscore-like characters. Actually, any character in the Punctuation, Connector category is valid in Java names. Like “wavy low line”:
var ﹏ = 0
I didn't know that. For 25 years, I wrote that identifiers contain letters and digits (in many languages), and $, and _. It never occurred to me that there were alternatives to $ and _.
I had to try it out, and sure enough, it worked:
$ jshell
|  Welcome to JShell -- Version 11.0.4
|  For an introduction type: /help intro

jshell> var ﹏ = 0
﹏ ==> 0
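To see which other characters belong to the Punctuation, Connector category, one can simply walk through all code points. The following is a minimal sketch; the class name ConnectorPunctuation and the method connectors are made up for illustration, and the exact list depends on the Unicode version that your JDK supports.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original article.
class ConnectorPunctuation {
    // Collects every code point whose general category is Pc.
    static List<Integer> connectors() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.CONNECTOR_PUNCTUATION) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : connectors()) {
            // Print the code point, the character, its Unicode name,
            // and whether Java accepts it at the start of an identifier.
            System.out.printf("U+%04X %s %s start=%b%n",
                cp, new String(Character.toChars(cp)), Character.getName(cp),
                Character.isJavaIdentifierStart(cp));
        }
    }
}
```

The list should include the plain underscore (U+005F) and the wavy low line (U+FE4F).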
Also, the dollar symbol isn't that special. Any currency symbol works:
jshell> var € = 1.2
€ ==> 1.2
We all know that there are two hard problems in computer science: Cache invalidation and naming things.
But I didn't quite appreciate how hard “naming things” is in Java. I thought, ok, names are made up of letters, numbers, and, as I now know, currency symbols, and those weird underscore-like connectors. To confirm, I went to the Java Language Specification. Section 3.8 covers identifiers. It is surprisingly vague and references Character.isJavaIdentifierStart(int), Character.isJavaIdentifierPart(int), and Character.isIdentifierIgnorable(int). Time to study the API doc in earnest.
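Since the specification defers to these methods, a name can be checked by applying them code point by code point. Here is a rough sketch under that assumption. The class IdentifierCheck and the method looksLikeIdentifier are hypothetical names, and the check deliberately ignores keywords and literals such as true or null, which the compiler also rejects.

```java
// Hypothetical helper, not part of the original article.
class IdentifierCheck {
    // True if the first code point may start a Java identifier and
    // every following code point may be part of one.
    // Keywords and literals (class, true, null, ...) are NOT rejected here.
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        int i = Character.charCount(first);
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) return false;
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return true;
    }

    public static void main(String[] args) {
        String[] candidates = { "myMethod", "MY_CONSTANT1", "\u20AC", "\uFE4F", "1abc", "a b" };
        for (String c : candidates) {
            System.out.println(c + " -> " + looksLikeIdentifier(c));
        }
    }
}
```

Iterating by code point rather than by char matters for characters outside the Basic Multilingual Plane, which occupy two char values in a Java string.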
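Here is how isJavaIdentifierPart defines when a Unicode character is valid in an identifier. It must be a letter, a currency symbol (such as $), a connecting punctuation character (such as _), a digit, a numeric letter (such as a Roman numeral character), a combining mark, a non-spacing mark, or a character for which isIdentifierIgnorable(codePoint) returns true.
The API doc doesn't define what a currency symbol is, but a quick test program confirms that it's the same as the CURRENCY_SYMBOL type, which is Unicode category Sc.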
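The test program itself is not shown here. A check along those lines might look like the following sketch, which counts the code points of type CURRENCY_SYMBOL that isJavaIdentifierStart does not accept. The class and method names are invented for illustration, and the result depends on the Unicode tables of the JDK in use.

```java
// Hypothetical reconstruction of a "quick test program"; not the author's code.
class CurrencySymbols {
    // True if the code point is in Unicode category Sc.
    static boolean isCurrencySymbol(int cp) {
        return Character.getType(cp) == Character.CURRENCY_SYMBOL;
    }

    // Number of Sc code points that may NOT start a Java identifier.
    static int rejectedCurrencySymbols() {
        int rejected = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isCurrencySymbol(cp) && !Character.isJavaIdentifierStart(cp)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("Euro sign is Sc: " + isCurrencySymbol(0x20AC));
        System.out.println("Sc code points rejected as identifier start: "
            + rejectedCurrencySymbols());
    }
}
```

If the two notions coincide, as the text above reports, the second line of output shows a count of zero.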
And we find two additional sources of valid characters: combining marks and non-spacing marks.
I couldn't figure out how combining marks work, since I am unfortunately not familiar with any of the languages for which they are used. An example of a non-spacing mark is the acute accent ́ (U+0301). When combined with another letter, it sits on top: é. Ok, that's handy if you have a variable name with an accent:
var méchant = true;
The JLS warns about a potential problem. There is a Unicode character é (U+00E9), and the name méchant (with e followed by U+0301) is different from the identical-looking méchant (with U+00E9 for é).
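The difference is easy to reproduce with plain strings. The sketch below compares the two spellings and then normalizes both to NFC with java.text.Normalizer. The class name AccentedNames and the helper nfc are made up for this example. Note that this only illustrates the string comparison; whether an editor or build tool normalizes source files is a separate question.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original article.
class AccentedNames {
    // Normalizes to NFC, which prefers precomposed characters such as U+00E9.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "m\u00E9chant";  // single code point for the accented e
        String decomposed = "me\u0301chant";  // e followed by combining acute accent

        System.out.println(precomposed.length());            // 7
        System.out.println(decomposed.length());             // 8
        System.out.println(precomposed.equals(decomposed));  // false
        System.out.println(nfc(precomposed).equals(nfc(decomposed))); // true
    }
}
```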
Now on to the most bizarre characters that are allowed in Java names: the ignorables. Those are characters in the Control and Format categories, except for those that are white space.
You find such beauties as Bell (U+0007) in the Control category and Invisible Times (U+2062) in the Format category.
Any number of them can occur within names, and they are simply ignored.
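To get a feeling for which characters count as ignorable, here is a small sketch that queries Character.isIdentifierIgnorable and strips ignorable code points from a name. The class Ignorables and the method stripIgnorable are hypothetical names; the stripping merely mimics, at the string level, the rule that ignorable characters do not contribute to an identifier.

```java
// Hypothetical demo class, not part of the original article.
class Ignorables {
    // Returns the name with all identifier-ignorable code points removed.
    static String stripIgnorable(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints()
            .filter(cp -> !Character.isIdentifierIgnorable(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String withBell = "a\u0007b";           // Bell, Control category
        String withInvisibleTimes = "a\u2062b"; // Invisible Times, Format category

        System.out.println(Character.isIdentifierIgnorable(0x0007)); // true
        System.out.println(Character.isIdentifierIgnorable(0x2062)); // true
        System.out.println(Character.isIdentifierIgnorable('\t'));   // false: white space
        System.out.println(stripIgnorable(withBell).equals("ab"));           // true
        System.out.println(stripIgnorable(withInvisibleTimes).equals("ab")); // true
    }
}
```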
Why is this a good idea? The mind reels.
From the point of view of the Java language, there is nothing special about the names of public classes. They are identifiers. However, they are stored in the file system, as ClassName.java. And package names give rise to directories in the file system. And what about JAR files? If you use special characters, they must work in all these places.
Way back when, Windows used code page 437 for file systems. In early editions of Core Java, I had to warn readers:
If you make a class Bär and try to run it in Windows 95, you get an error message “cannot find class BΣr”.
These days, Unicode characters work fine in file names. Characters below U+0020 (such as the bell character) are disallowed in NTFS, but fortunately javac removes the ignorable characters before writing class files.
This file demonstrates all the phenomena that I just discussed. Here it is visually. Chrome, ever the clown, shows the bell character as a telephone.
public class Test {
   public static void main(String[] args) {
      var € = 0;
      var ﹏ = 1;
      var méchant = true;
      var méchant = false; // NOT an error
      var ab = 0; // Bell
      a⁢b = 1; // Invisible Times
   }
}
And in hex:
00000000  70 75 62 6c 69 63 20 63  6c 61 73 73 20 54 65 73  |public class Tes|
00000010  74 20 7b 0a 20 20 20 70  75 62 6c 69 63 20 73 74  |t {.   public st|
00000020  61 74 69 63 20 76 6f 69  64 20 6d 61 69 6e 28 53  |atic void main(S|
00000030  74 72 69 6e 67 5b 5d 20  61 72 67 73 29 20 7b 0a  |tring[] args) {.|
00000040  20 20 20 20 20 20 76 61  72 20 e2 82 ac 20 3d 20  |      var ... = |
00000050  30 3b 0a 20 20 20 20 20  20 76 61 72 20 ef b9 8f  |0;.      var ...|
00000060  20 3d 20 31 3b 0a 20 20  20 20 20 20 76 61 72 20  | = 1;.      var |
00000070  6d c3 a9 63 68 61 6e 74  20 3d 20 74 72 75 65 3b  |m..chant = true;|
00000080  0a 20 20 20 20 20 20 76  61 72 20 6d 65 cc 81 63  |.      var me..c|
00000090  68 61 6e 74 20 3d 20 66  61 6c 73 65 3b 20 2f 2f  |hant = false; //|
000000a0  20 4e 4f 54 20 61 6e 20  65 72 72 6f 72 0a 20 20  | NOT an error.  |
000000b0  20 20 20 20 76 61 72 20  61 07 62 20 3d 20 30 3b  |    var a.b = 0;|
000000c0  20 2f 2f 20 42 65 6c 6c  0a 20 20 20 20 20 20 61  | // Bell.      a|
000000d0  e2 81 a2 62 20 3d 20 31  3b 20 2f 2f 20 49 6e 76  |...b = 1; // Inv|
000000e0  69 73 69 62 6c 65 20 54  69 6d 65 73 0a 20 20 20  |isible Times.   |
000000f0  7d 0a 7d 0a                                       |}.}.|
000000f4
In the hex dump, you can distinguish m..chant and me..chant, where the .. are two bytes of UTF-8 encoding. Note the ignorable characters in a.b and a...b.
See (and hear) the file in the terminal. If your terminal is configured to use an “audible bell”, you should hear a beep.
cat Test.java
When you compile it, there are no error messages:
javac Test.java
To sum up: Java names can contain letters, digits, currency symbols (such as $), and connecting punctuation. Stick with A-Za-z0-9_, and you don't have anything to worry about.
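If you want to enforce the A-Za-z0-9_ rule mechanically, a regular expression is enough. This is only a sketch: the class PlainNames and the method isPlain are invented names, and the pattern additionally requires that a name not start with a digit.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article.
class PlainNames {
    // ASCII letters, digits, and underscore only; no leading digit.
    // Note: a lone "_" matches the pattern but is a keyword since Java 9.
    private static final Pattern PLAIN = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isPlain(String name) {
        return PLAIN.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = { "MyClass", "MY_CONSTANT1", "maM\u00E9thode", "\u20AC", "a\u2062b" };
        for (String name : names) {
            System.out.println(name + " -> " + isPlain(name));
        }
    }
}
```

The pattern uses explicit ASCII ranges rather than \w or \p{L}, so it rejects accented letters, currency symbols other than none at all, and the ignorable characters discussed above.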