Would You Like Your Strings Raw or Medium Rare?

A new feature is proposed for Java 11 (maybe): A simple syntax for strings that can span more than one line and that can contain arbitrary characters without the need for escaping any of them. No more double backslashes in regular expressions and Windows path names. But as with all things simple, the devil is in the details.


JEP 326 proposes a simple syntax for “raw string literals”—strings that can contain any characters.

Consider for example searching for a period with a regex. The regex is \. since you must escape a period. So in Java, it's Pattern.compile("\\."). To match a backslash, it's Pattern.compile("\\\\"). This can get really confusing.

In fact, it's so confusing that the author of JEP 326 gets it wrong—or maybe has a subtle sense of humor. His example is Pattern.compile("\\\"") to match a ". Of course, you don't need to escape that in a regex, so Pattern.compile("\"") would work fine. Which confirms the point that all that escaping is a mess.
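To see the escaping levels side by side, here is a small sketch using today's string literals. The class and constant names are made up for illustration; the regex behavior is that of java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

class RegexEscapes {
    // Regex \. (a literal period): the Java literal doubles the backslash.
    static final Pattern PERIOD = Pattern.compile("\\.");
    // Regex \\ (a literal backslash): four backslashes in the Java literal.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");
    // Regex \" : the JEP's version, with a redundant regex-level escape.
    static final Pattern QUOTE_ESCAPED = Pattern.compile("\\\"");
    // Regex " : all that is needed to match a quotation mark.
    static final Pattern QUOTE_PLAIN = Pattern.compile("\"");

    public static void main(String[] args) {
        System.out.println(PERIOD.matcher("a.b").find());                  // true
        System.out.println(PERIOD.matcher("ab").find());                   // false
        System.out.println(BACKSLASH.matcher("C:\\temp").find());          // true
        System.out.println(QUOTE_ESCAPED.matcher("say \"hi\"").find());    // true
        System.out.println(QUOTE_PLAIN.matcher("say \"hi\"").find());      // true
    }
}
```

Both quote patterns match the same inputs, so the extra backslashes in the JEP's example buy nothing.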

The remedy is simple. Enclose the string in backticks `...`. Nothing inside the backticks needs to be escaped: Pattern.compile(`\.`)

“What if the string contains backticks?” I hear you cry. Then you enclose the string in a sequence of backticks that is longer than the longest sequence inside:

String markdown = ````Writing about JavaScript
```
alert("JavaScript");
```
in Markdown````
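If you wanted a tool to pick the delimiter for you, the rule boils down to "longest run of backticks inside, plus one". Here is a minimal sketch in plain Java; the class and method names are hypothetical and not part of the JEP.

```java
class BacktickDelimiter {
    /** Length of the longest run of consecutive backticks in the content. */
    static int longestBacktickRun(String content) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < content.length(); i++) {
            current = content.charAt(i) == '`' ? current + 1 : 0;
            longest = Math.max(longest, current);
        }
        return longest;
    }

    /** A delimiter one backtick longer than any run inside the content. */
    static String delimiterFor(String content) {
        int n = longestBacktickRun(content) + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('`');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String markdown = "Writing about JavaScript\n```\nalert(\"JavaScript\");\n```\nin Markdown";
        System.out.println(delimiterFor(markdown).length()); // 4
    }
}
```

For the Markdown example, the longest run inside is three backticks, so the delimiter needs four.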

That's brilliant. Except there is a teensy catch.

The raw string cannot start or end with a ` character, because that backtick would merge with the delimiter. For example,

`alert("JavaScript")`

cannot be encoded in a raw string by itself. You'd have to use

"`" + `alert("JavaScript")` + "`"

Stephen Colebourne raises another issue. Consider:

loginWithCredentials(``, `12:"\X`);
doWork();
logout(``, `12:"\X`);

This doesn't actually do any work. The `` isn't an empty user name but the start of the string ", `12:\"\\X`);\ndoWork();\nlogout(". Presumably one would get used to looking out for such things.
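To convince yourself, here is a toy scanner that follows the delimiter rule as I read it: a run of n backticks opens the string, and only a later run of exactly n backticks closes it. This is a sketch under that assumption, not the compiler's lexer, and the names are made up.

```java
class RawStringScanner {
    /**
     * Returns the body of the raw string that opens at index start,
     * or null if no backtick run of matching length closes it.
     */
    static String scan(String source, int start) {
        int i = start;
        while (i < source.length() && source.charAt(i) == '`') {
            i++;
        }
        int delimiterLength = i - start;
        if (delimiterLength == 0) {
            throw new IllegalArgumentException("no backtick at index " + start);
        }
        int contentStart = i;
        while (i < source.length()) {
            if (source.charAt(i) != '`') {
                i++;
                continue;
            }
            int runStart = i;
            while (i < source.length() && source.charAt(i) == '`') {
                i++;
            }
            if (i - runStart == delimiterLength) {
                return source.substring(contentStart, runStart);
            }
            // A run of any other length is part of the body; keep scanning.
        }
        return null;
    }

    public static void main(String[] args) {
        String source = "loginWithCredentials(``, `12:\"\\X`);\n"
                + "doWork();\n"
                + "logout(``, `12:\"\\X`);";
        // Scan from the first backtick, the supposedly empty user name.
        System.out.println(scan(source, source.indexOf('`')));
    }
}
```

The printed body spans three lines and swallows the doWork() call, which is exactly the trap.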

The point of raw strings is that you can paste any string, no matter how many lines and no matter what's inside (provided, of course, it doesn't start with a backtick), and surround it with sufficient backticks. Everything inside is left untouched.

Except line endings. \u000D and \u000D\u000A sequences are always translated to \u000A. As the JEP says: “This translation provides the least-surprising behavior across platforms.”

Maybe that's why Windows finally fixed Notepad.
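For the record, the line-ending translation described above is a one-liner if you ever need it on an ordinary string. A sketch, with a made-up class and method name:

```java
class LineEndings {
    /** Replaces every CR LF pair and every lone CR with a single LF. */
    static String normalize(String s) {
        return s.replaceAll("\\r\\n?", "\n");
    }

    public static void main(String[] args) {
        String mixed = "one\r\ntwo\rthree\nfour";
        System.out.println(normalize(mixed).equals("one\ntwo\nthree\nfour")); // true
    }
}
```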

One of the odd parts of Java is that you can write any character of a program as a Unicode escape (\u followed by the four hex digits of a UTF-16 code unit), for example

public static void main\u0028String\u005b\u005d args\u0029

is perfectly legal, which can come in handy if you ever need to code with a keyboard without parentheses or brackets. I am sure this sounded like a good idea at the time. Go ahead and make a puzzler with a string literal containing the six characters \u0022 (the code for a quotation mark).
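If you'd rather not make your own, here is one such puzzler as a sketch (the class and field names are mine). The compiler translates Unicode escapes before it splits the source into tokens, so each \u0022 below is a real quotation mark by the time the string literal is read.

```java
class UnicodeEscapePuzzler {
    // What looks like one 26-character string literal is really the
    // expression "a".length() + "b" once the escapes are translated.
    static final String PUZZLE = "a\u0022.length() + \u0022b";

    public static void main(String[] args) {
        System.out.println(PUZZLE);          // 1b
        System.out.println(PUZZLE.length()); // 2
    }
}
```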

In raw strings, this is of course turned off. Otherwise you'd have to worry about strings containing the character sequence \u0060 (the code for a backtick). And in a cruel blow to puzzler makers, a raw string must start with an actual `; a \u0060 is not allowed.

And now we come to the deep end: “margin management”. Not in the financial engineering sense, but to allow the raw string to “blend in”.

For example,

   ...
   String myNameInABox = `+----+
| Cay |
+-----+
`;
   System.out.print(myNameInABox);

Oh my, that's terrible—it doesn't blend in! And to make the point more forcefully, there is a bug—the top line should have one more -.

Of course, I could have used

   ...
   String myNameInABox = `
+-----+
| Cay |
+-----+`;
   System.out.println(myNameInABox.substring(1)); // Remove initial \n

But then I get a newline at the start. And it's still not blending in! There was an agitated discussion on the spec mailing list about stripping common prefixes from each line, so that I could have written something like

   String myNameInABox = `
                          +-----+
                          | Cay |
                          +-----+
                         `;

That's more like it! Except, those raw strings are no longer raw—more like medium rare. “And what about tabs?”, I hear you cry.

I don't even agree with the “blending in” premise. I don't want my multiline strings to blend in. They will likely have long lines that will be awkward to indent. It's also good for non-Java code to stand out. In Scala, multiline strings don't “blend in”, and I never heard anyone complain.

If I had to design this feature, I would take a “tough love” approach to raw strings. Raw is raw. If you want \u000D conversion or common prefix stripping (with or without allowing tabs), call a method.
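Such a method is not hard to write. Here is a minimal sketch of common prefix stripping, assuming that only spaces count as margin (tabs are left alone) and that blank lines don't constrain the margin. The class and method names are hypothetical, not an API from the JEP.

```java
class Margins {
    /** Removes the leading spaces shared by all non-blank lines. */
    static String stripCommonIndent(String s) {
        String[] lines = s.split("\n", -1);
        int common = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not constrain the margin
            }
            int indent = 0;
            while (indent < line.length() && line.charAt(indent) == ' ') {
                indent++;
            }
            common = Math.min(common, indent);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                sb.append('\n');
            }
            String line = lines[i];
            // Blank lines become empty; others lose the shared margin.
            sb.append(line.trim().isEmpty() ? "" : line.substring(common));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String box = "    +-----+\n    | Cay |\n    +-----+";
        System.out.println(stripCommonIndent(box));
    }
}
```

Whether tabs should count, and how many columns each is worth, is a decision the caller can make by picking a different method, which is the point of keeping it out of the literal syntax.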

Also, with a bit more thought to the syntax, one can avoid the most vexing problems. I would provide two forms of raw strings:

  1. Delimited with a single ` on either side, and no ` inside: Pattern.compile(`\.`)
  2. A starting delimiter with at least three ` followed by a newline, and an ending delimiter consisting of the same number of `:
       String myNameInABox = ```
    +-----+
    | Cay |
    +-----+```;
    
    Why the newline after the ```? It makes all the lines start at the first column, and it allows for raw strings starting with a `.
