console: 1. to alleviate or lessen the grief, sorrow, or disappointment of; give solace or comfort. — That's what I need more of after trying to demystify the behavior of System.out
in the Windows console. Read on if you want to be consoled and enlightened.
I am updating volume 2 of Core Java and got stuck on the section on text files in the internationalization chapter. How hard can this be in 2016? Surely everyone uses UTF-8 these days. Or do they? Make a file Test.java
public class Test {
    public static void main(String[] args) {
        System.out.println(java.nio.charset.Charset.defaultCharset());
    }
}
Run
java Test
in Windows 10. I have a standard US English version, and I get
windows-1252
Your mileage may differ, of course, depending on the localized version of Windows that you have.
Windows-1252 is a superset of the 8-bit ISO 8859-1 encoding, with most of the non-printing characters in the range 0x80-0x9F replaced by goodies such as curly quotes and the Euro character € (U+20AC).
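To see the difference between the two encodings for yourself, here is a minimal sketch that decodes the single byte 0x80 both ways. The class name `Byte80Demo` and the helper `decode` are made up for illustration, and the sketch assumes that your JDK ships the windows-1252 charset (the common ones do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Byte80Demo {
    // Decodes one byte with the given charset and returns the resulting code point.
    static int decode(int b, Charset cs) {
        String s = new String(new byte[] { (byte) b }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        // Print code points, not characters, so the console encoding cannot interfere.
        System.out.printf("windows-1252: U+%04X%n", decode(0x80, cp1252));
        System.out.printf("ISO-8859-1:   U+%04X%n", decode(0x80, StandardCharsets.ISO_8859_1));
    }
}
```

The first line reports U+20AC, the Euro sign, and the second reports U+0080, one of the non-printing control characters that Windows-1252 replaced.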
One of the characters that is not encoded by Windows-1252 is the Greek letter uppercase sigma ∑ (U+03A3). So, what do you think will happen when you add this line?
System.out.println("\u20AC\u03A3");
Have a guess:
\u20AC\u03A3
€∑
€?
?∑
Of course, the first answer is wrong. \u20AC
and \u03A3
are Unicode escapes, representing € and ∑ in the UTF-16 encoding that Java uses in String
objects.
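If you want to convince yourself that the escapes are translated by the compiler and never reach the string, a small sketch like the following should do. The class name `EscapeDemo` and the method `sample` are hypothetical.

```java
class EscapeDemo {
    // The compiler turns each escape into one UTF-16 code unit.
    static String sample() {
        return "\u20AC\u03A3";
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length()); // two chars, not twelve
        for (char c : s.toCharArray()) {
            // Print the numeric value so the console encoding plays no role.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

This prints 2, then U+20AC and U+03A3.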
The second answer would be right if the default charset were UTF-8. But it can't be right here, since the ∑ character isn't in Windows-1252. So the third choice must be the answer.
Actually, it's the fourth.
The Windows console uses a different character set, the truly archaic IBM437 or “code page 437” from the original 1981 IBM Personal Computer. Interestingly, Java knows about that tidbit (see below).
Now try
java Test > out
type out
What do you think happens now?
€∑
€?
?∑
Ç?
If you picked the last choice, pat yourself on the back and do something better with your time than reading this blog.
For the rest of us, where does Ç?
come from???
To understand that, remember that System.out
is an instance of java.io.PrintStream
. That actually makes no sense since you send characters and strings, not bytes, to System.out
. But the Writer
class was added only in Java 1.1, and of course by then it was far too late to change System.out
to a PrintWriter
since it might have broken some of the dozens of Java programs that were out in the field already.
When you look at the source code for PrintStream
, you'll find a field
private OutputStreamWriter charOut;
That's the writer to which println
sends its output. It's easy enough to get it through reflection:
Field f = PrintStream.class.getDeclaredField("charOut");
f.setAccessible(true);
OutputStreamWriter charOut = (OutputStreamWriter) f.get(System.out);
Now we can ask it for its encoding:
System.out.println(charOut.getEncoding());
When you run java Test
without redirection, this line prints
Cp437
With redirection (java Test > out
), you get
Cp1252
It is interesting that the encoding for System.out
changes when you redirect the output. But that still doesn't explain the Ç
character. Actually, out
contains two bytes: 0x80, the Windows-1252 encoding of €, and 0x3F, the encoding of ?. The encoder for Windows-1252 produced a ? when it couldn't encode the ∑.
When you type that file on the Windows console, which uses code page 437, then the 0x80 shows up as Ç
, the character with code page 437 encoding 0x80. And the 0x3F shows up as ?
since the ASCII characters have the same encoding in both code pages.
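The two-step mix-up can be reproduced without a Windows console. The following sketch encodes the string with Windows-1252 and then decodes the resulting bytes with code page 437, which is roughly what the redirect followed by type does. The names `MojibakeDemo`, `encode1252`, and `decode437` are invented for this example, and it assumes that both charsets are installed. IBM437 lives in the extended charsets on some platforms, so a stripped-down runtime may not have it.

```java
import java.nio.charset.Charset;

class MojibakeDemo {
    // What Java does when output is redirected on a US English Windows.
    static byte[] encode1252(String s) {
        // getBytes substitutes the charset's replacement byte for unmappable characters.
        return s.getBytes(Charset.forName("windows-1252"));
    }

    // What the console does when it shows those bytes under code page 437.
    static String decode437(byte[] bytes) {
        // Assumption: the IBM437 charset is available in this runtime.
        return new String(bytes, Charset.forName("IBM437"));
    }

    public static void main(String[] args) {
        byte[] bytes = encode1252("\u20AC\u03A3");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        for (char c : decode437(bytes).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The first line of output is 80 3F, the second is U+00C7 U+003F, that is, Ç followed by a question mark.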
That's pretty crazy. You can run
chcp 1252
so that the console and Java writers have the same encoding. Then you get
windows-1252
€?
Cp1252
Or you can switch the Windows console to Unicode:
chcp 65001
Then you get
windows-1252
�?
Cp1252
In other words, the Java program knows that the console is no longer using code page 437, but it doesn't want to believe its good fortune that it's actually using UTF-8, so it falls back to Windows-1252, emitting € as 0x80 and ? as 0x3F (for the ∑ that Windows-1252 can't encode). The Windows console can't make sense of 0x80, which should never be the first byte of a UTF-8 coding sequence, so it displays a replacement character � (U+FFFD).
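Java's own UTF-8 decoder reacts to those two bytes in much the same way as the console, which makes for an easy experiment. This is only a sketch of the decoding step, not of what the Windows console does internally. The names `Utf8DecodeDemo`, `decodeLeniently`, and `isValidUtf8` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class Utf8DecodeDemo {
    // The String constructor replaces malformed input with U+FFFD.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // A fresh decoder reports malformed input by throwing instead of replacing.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0x80, 0x3F }; // the Windows-1252 output from above
        System.out.println(isValidUtf8(bytes));
        for (char c : decodeLeniently(bytes).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The program prints false, followed by U+FFFD U+003F: a replacement character for the stray 0x80, and an ordinary question mark.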
That is utter madness. To really get it to work, do this:
chcp 65001
java -Dfile.encoding=UTF-8 Test
Then you can finally see
UTF-8
€∑
UTF8
in the console.
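If you would rather not depend on a command-line flag, one possible alternative is to replace System.out with a PrintStream that has an explicit encoding. The following is a sketch under that assumption, not something tested across Windows versions, and the console still needs chcp 65001 to display the result. The class name `Utf8Out` and the helper `utf8Stream` are invented for illustration.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Out {
    // Wraps any byte stream in an autoflushing PrintStream that encodes as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Replace System.out with a stream on the same file descriptor.
        System.setOut(utf8Stream(new FileOutputStream(FileDescriptor.out)));
        System.out.println("\u20AC\u03A3");
    }
}
```

Unlike the file.encoding setting, this changes only System.out and leaves Charset.defaultCharset() alone.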
Disclaimer: The file.encoding
property is undocumented and not officially supported, and it has been reported to act inconsistently across Java versions and platforms. This simple use for changing the character encoding for System.out
seems to work. But don't use it as a mechanism for setting the Charset
for arbitrary Writer
instances. Always construct a Writer
with an explicit Charset
.
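For the record, constructing a Writer with an explicit Charset looks something like this. The sketch writes into a byte array so that you can inspect the exact bytes. The names `ExplicitCharsetDemo` and `writeUtf8` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encodes text through a Writer whose charset is named explicitly,
    // so the platform default never enters the picture.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : writeUtf8("\u20AC\u03A3")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

This prints E2 82 AC CE A3 on any platform, whatever the default charset happens to be.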
PS. Here is the complete program for you to copy/paste and experiment.
import java.io.*;
import java.lang.reflect.*;

public class Test {
    public static void main(String[] args) throws ReflectiveOperationException {
        System.out.println(java.nio.charset.Charset.defaultCharset());
        System.out.println("\u20AC\u03A3");
        Field f = PrintStream.class.getDeclaredField("charOut");
        f.setAccessible(true);
        OutputStreamWriter charOut = (OutputStreamWriter) f.get(System.out);
        System.out.println(charOut.getEncoding());
    }
}