Core Java 13th Edition Finally At the Printer

The classic two-volume Core Java book is updated for each LTS release of Java. For many years, Dmitry Kirsanov Studio turned my XHTML files into print and ebooks, keeping all the formatting just the way I wrote it. This time, presumably in a frenzy of cost-cutting, the publisher tried to use an offshore contractor who was plainly not up to the task. I had to learn how to produce the book myself. Here is how I did it.

Most books—95%, in the estimate of my editor at Pearson—are written in Word and typeset with InDesign. Clearly, that workflow works for many authors.

Core Java is a big book, almost 2,000 pages in two volumes. It gets updated frequently. It is full of fiddly formatting. In the first four editions, my coauthor Gary Cornell and I used Word. It did not work well for us.

There are two fundamental problems: frequent updates and fiddly formatting.

Once the Word document has been turned into whatever is used for typesetting (these days, InDesign), last-minute changes are applied there, and they need to be manually backported to the Word document. That does not always happen correctly, and the next edition's Word files start out with errors and inconsistencies.

Also, conversion from Word to another format is never perfect. Someone invariably fixes up something by hand, and nasty errors creep in.

The next four editions were written in FrameMaker, a desktop publishing package. This worked pretty well. The authors, copyeditors, and typesetters edited the same documents. The FrameMaker files were the repository of truth. Unfortunately, by 2010, it was hard to find copyeditors with FrameMaker expertise.

For my next book (Scala for the Impatient), I looked into DocBook (too complex), PearsonML (too proprietary), and Markdown (too simplistic). Then I realized that I could write in XHTML, and the publisher could turn it into whatever XML they liked. Surely one could find a copyeditor who can edit HTML.

That process worked beautifully for the Scala book, two other books, and four editions of Core Java. Thanks go to the Kirsanovs, who had a magic process for creating PDF out of the XHTML input, with perfect fidelity for all the fiddly formatting.

When I tell people that I have authored almost exclusively in HTML for 15 years, they invariably say: But the brackets...

I realize that it is painful to edit HTML with a text editor. It is hard to focus on the text in a sea of brackets. And you have to escape brackets in code. But when I edit HTML, I don't see the brackets. I started out using the admirable Amaya editor that provides both a structural and a WYSIWYG view of the document. When Amaya became abandonware, I created my own Emacs mode. At this point, people invariably roll their eyes, so let's not go there. At any rate, there are obviously plenty of tools for editing HTML. Or if you prefer, edit in Markdown and generate HTML.

What is the upside? CSS! I focus on content, and CSS turns the content into an ebook or a printed book. I write a note as a <div class="note"> (or div.note with my Emacs mode). CSS adds the decoration: an icon, a box, background color, different font, whatever.

Obviously this works for the web, hence, ebooks (see the next section). For print, I use Prince. It turns HTML + CSS into PDF, with custom features to produce all the details that you need for a printed book. (Fun fact: The creator of CSS, Håkon Wium Lie, works for the company that sells Prince, and is active on the tech support forum.)

EPUB

Readers consume Core Java in one of these ways:

A printed book
A PDF, watermarked with their name and email, available at https://informit.com
An ebook, also available at https://informit.com
Amazon Kindle
A web site such as https://oreilly.com or https://vitalsource.com

I'll talk about print and PDF in the next section.

Fortunately for me, the EPUB format takes care of the other use cases. And EPUB is just zipped up XHTML + CSS.

Kindle used to have a proprietary authoring format, but it can now convert EPUB files into its internal format. Most other ebook readers and reading software can display EPUB directly.

The O'Reilly web site also consumes EPUB and shows the pages almost as intended by the author. However, they append a custom stylesheet to enforce some uniformity to the different books on the site. As you might expect, the result is not wonderful when the CSS rules in the EPUB and theirs interfere with each other. VitalSource, bless their heart, just shows the EPUB content.

Many programs can generate an EPUB, but I wanted to know exactly what goes into it. You simply zip up the following:

A couple of trivial files (mimetype, container.xml)
An “OPF” XML file that lists every asset of the EPUB: documents, images, fonts, style sheets. (Caution: Documents containing MathML must be specially marked.)
A table of contents
A cover image

The OPF file is a bit tedious, but it is easy enough to generate it automatically.
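
To give an idea of what "generate it automatically" can look like, here is a minimal Python sketch that builds the manifest part of an OPF file from a list of asset paths. It is not the generator used for the book. The function name build_manifest, the item IDs, and the small media type table are assumptions made for illustration, and the OPF namespace is assumed to be declared on the enclosing package element. The properties="mathml" attribute is how EPUB 3 marks documents that contain MathML.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Media types for the kinds of assets in this hypothetical book
MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".css": "text/css",
    ".png": "image/png",
    ".jpg": "image/jpeg",
}

def build_manifest(paths, mathml_docs=()):
    """Build a <manifest> element with one <item> per asset path.

    paths are hrefs relative to the OPF file; mathml_docs is the subset
    of documents that contain MathML and must be marked as such.
    """
    manifest = ET.Element("manifest")
    for n, path in enumerate(sorted(paths), start=1):
        suffix = PurePosixPath(path).suffix.lower()
        item = ET.SubElement(manifest, "item", {
            "id": f"item{n}",                    # hypothetical ID scheme
            "href": path,
            "media-type": MEDIA_TYPES[suffix],   # KeyError flags an unknown asset type
        })
        if path in mathml_docs:
            item.set("properties", "mathml")
    return manifest
```

Calling ET.tostring(build_manifest(paths), encoding="unicode") yields the XML text to paste into the package document. The spine and metadata sections can be generated the same way.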

There is some weirdness with the ZIP file. The mimetype file must be the first entry, stored uncompressed and without extra fields. My EPUB generator script does this:

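# mimetype goes in first: -0 stores it uncompressed, -X omits extra file attributes
zip -q -X -0 "$EPUB" mimetype
# then everything else, recursively, except mimetype
zip -q -X -r "$EPUB" * -x mimetype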
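
If you would rather not shell out to zip, the same packaging can be sketched with Python's zipfile module. This is an illustrative equivalent, not the script used for the book. The function name build_epub is made up, and the sketch assumes that the EPUB file is written outside the source directory.

```python
import zipfile
from pathlib import Path

def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with mimetype as the first, stored entry."""
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as epub:
        # The mimetype entry comes first and is stored without compression.
        # A bare ZipInfo carries no extra fields.
        epub.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else is added recursively and compressed.
        for path in sorted(source.rglob("*")):
            arcname = path.relative_to(source).as_posix()
            if path.is_file() and arcname != "mimetype":
                epub.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Writing the mimetype entry from a string constant, instead of copying the file, rules out a stray trailing newline in the entry.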

I wrote a Java program that extracts the table of contents, simply by grabbing the h1 and h2 elements from the HTML files.
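
The original program is in Java and is not shown here. As a rough illustration of the idea, this Python sketch collects the h1 and h2 headings of one document with the standard library HTML parser. The names HeadingCollector and extract_toc are hypothetical, and the sketch assumes that headings are not nested inside each other.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for every h1 and h2 element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []
        self._level = None   # 1 or 2 while inside a heading, else None
        self._parts = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Join the fragments and normalize whitespace and line breaks
            text = " ".join("".join(self._parts).split())
            self.headings.append((self._level, text))
            self._level = None

def extract_toc(xhtml_text):
    """Return the h1 and h2 headings of a document in document order."""
    collector = HeadingCollector()
    collector.feed(xhtml_text)
    collector.close()
    return collector.headings
```

Inline markup inside a heading, such as a code element, is flattened to plain text, which is usually what a table of contents needs.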

If you implement a similar workflow, I highly recommend checking the result with EPUBCheck. This super nitpicky program found a good number of embarrassing mistakes, some of which had been present for multiple editions.

I tested the result with a couple of EPUB readers and sent it to the publisher. Who sent it off to their contractor. And sent me back some inane comments from the contractor. (That's a large part of what a publisher seems to do these days: relay messages back and forth...) The file was not acceptable because it used non-ASCII characters. Silly me, thinking that UTF-8 was good enough in 2024. And all the images had to be in JPEG, not PNG, for that washed-out look that everyone prefers with screen captures. Those are not EPUB requirements, just something that someone working for the publisher came up with way back when, and now there is nobody around to rethink that. Whatever. I modified the script to rewrite non-ASCII characters into numeric character entity references and convert the images.
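
The character rewriting is the easy half. In Python, for instance, the built-in xmlcharrefreplace error handler does it in one line. The sketch below is an illustration, not the script used for the book, and the function name to_ascii_xhtml is made up.

```python
def to_ascii_xhtml(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

For example, to_ascii_xhtml("Håkon") returns "H&#229;kon". ASCII characters, including the markup itself, pass through unchanged.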

A couple of days later, I was delighted to see the book as a “rough cut” on the O'Reilly web site.

Or, almost delighted. I used CSS counters to number the sections, subsections, figures, tables, and code listings. Something in the O'Reilly style sheet messed with that. The Kindle had the same problem. I had to write a script that injected the numbering into the HTML file. That was pretty annoying.
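
A minimal Python sketch of such an injection step follows, covering section headings only. It assumes that sections are h2 elements and that the wanted format is chapter.section. Both the format and the function name number_sections are illustrative assumptions, and a regular expression is a shortcut that only works when the markup is as regular as machine-checked XHTML.

```python
import re

def number_sections(xhtml_text, chapter):
    """Prefix the content of each h2 with '<chapter>.<n> ', counting from 1."""
    counter = 0

    def add_number(match):
        nonlocal counter
        counter += 1
        # Keep the opening tag as is and put the number right after it
        return f"{match.group(0)}{chapter}.{counter} "

    return re.sub(r"<h2\b[^>]*>", add_number, xhtml_text)
```

Once the numbers are in the text, the CSS counter rules for the same elements have to be switched off for the EPUB build, or readers that do honor them will show the numbers twice.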

Overall, I think EPUB is great. With HTML and CSS, you can achieve whatever look you want, using technology that you may already know or, if not, is worth learning anyway. It would be even greater if you didn't have to fight publishers who muck with the formatting.

PDF

My publisher said that his contractor could easily turn any EPUB into a PDF. Their first effort was a sad joke. Here is a page from the 12th edition and the corresponding sample.

[Two page images: a page from the 12th edition and the contractor's corresponding sample]

What's wrong with the sample?

The code listing needs to be spaced much more tightly.
We explicitly told the contractor to use a narrow monospace font, and they used Courier.
Why set notes in two columns? It gets very messy with code blocks. And anyway, have you ever seen two-column notes when the body is in a single column? Do you want to?

That didn't worry me. Those are fixable issues. But this sentence put fear into my heart:

Here, sgn is the sign of a number: sgn(n) is –1 if n is negative, 0 if n equals 0, and 1 if n is positive. In plain English, if you flip the arguments of compareTo, the sign (but not necessarily the actual value) of the result must also flip.

Have a close look at the monospaced font. Some words that should be monospace (such as compareTo) are not, but others that should not be, are, such as if you flip. Why does negative end with two monospaced letters???

That is really bad. It shows that the XHTML files were not the repository of formatting truth. Did someone muck with the tags? That's a no-no. Who is going to proofread close to 2,000 pages for random font issues?

I asked how this could have happened. I never received an answer, nor another sample file. (My nasty suspicion is that they may not have worked with the XHTML files. Perhaps they first converted them to Word and then used their existing tool chain?)

Actually, it is not trivial to turn arbitrary HTML + CSS into PDF. CSS has become pretty complex. A browser could do it. But that's not enough for print.

One issue is cosmetic. Note how in the sample pages, the page number is on the left. That's because they happen to be even numbers. Odd page numbers are on the right. Also note that the page header or footer contains the chapter heading. But odd page headers have the section title. Also, the first page of a chapter has no page number, and the front matter has Roman numerals.

The harder issue is page breaking. Of course, when a page is full, a new page needs to be started. But we want to avoid awkward page breaks, such as before the last line of a paragraph or table.

There are a few print CSS rules that help. But they are not sufficient to produce a book. When a figure doesn't fit on the current page, you can't just start a new page. Otherwise the preceding page might have a large gap at the bottom. The remedy is to float unbreakable elements to a nearby location, such as the top of the current page or the bottom of the preceding page.

Standard CSS cannot do that, but the Prince formatter has an extension for floating an element to the nearest page. To my delight, it placed all figures in acceptable locations. (You do not want to do this by hand, because you have to redo it every time the text changes.)

After a couple of days of futzing around, I had a book design that was very close to the previous edition.

My editor asked about the index. The what? Oh, that word list at the end of the book that I never use anymore? In an EPUB or PDF, I just search.

OK, you can't do Control-F in a printed book, so we need an index. The indexer embedded the index terms into the HTML. I just needed to extract and sort them, and add the page numbers. But how was I going to get the page numbers?

I could have added unique IDs to those terms and used CSS to reference the page numbers of their locations, but that looked like work. Instead, I used the scripting feature inside Prince. You can run JavaScript code after the document is paginated. I wrote a tiny script that locates all index terms and writes their contents and page numbers.
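
Once the terms and page numbers are written out, the remaining work is sorting and merging. The Python sketch below shows that step under the assumption that the output is available as (term, page) pairs. The function name build_index and the output format are illustrative, and a real index also has subentries and cross-references that this sketch ignores.

```python
from collections import defaultdict

def build_index(entries):
    """Turn (term, page) pairs into sorted index lines with distinct pages."""
    pages_by_term = defaultdict(set)
    for term, page in entries:
        pages_by_term[term].add(page)   # a set drops repeats on the same page
    lines = []
    # Sort terms without regard to case, pages in ascending order
    for term in sorted(pages_by_term, key=str.casefold):
        pages = ", ".join(str(p) for p in sorted(pages_by_term[term]))
        lines.append(f"{term}, {pages}")
    return lines
```

Because the pairs come from the paginated document, rerunning this after every rebuild keeps the index in sync with the page breaks.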

The scripting feature was a life saver with another vexing issue. The notes have thin lines at the top and bottom. Occasionally, the page break would put the bottom line at the top of the next page, even though my CSS rule said not to break inside. I thought that's surely a bug, but howcome in the Prince forum said no, that's how CSS is. It is hard to argue with the person who may have been the creator of CSS.

I didn't want to restructure the HTML of hundreds of notes into a more robust construct. Instead, I wrote a script that alerts me if that problem arises, and then I tweak the offending note by hand.

The end result: a PDF that looks pretty much like the previous print edition. And that rebuilds in seconds if I make a change to the document. Finally, I was the owner of the means of production.

Not quite. The publisher sent the file to the printer. Two complaints:

The file contained a mystery font
They wanted grayscale

The mystery font was caused by an Italian flag emoji 🇮🇹. It is the result of two regional indicator Unicode characters, which is challenging for PDF. Prince uses “Type 3” fonts, which use PDF drawing operations. It was unclear whether this was going to work with the printer, so I used an image instead.
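
To see the two characters behind the flag, and to scan a manuscript for similar surprises, a few lines of Python are enough. This is a small illustration. The helper names regional_indicators and describe are made up.

```python
import unicodedata

ITALIAN_FLAG = "\U0001F1EE\U0001F1F9"  # renders as one glyph, stored as two code points

def regional_indicators(text):
    """Return the regional indicator characters (U+1F1E6 to U+1F1FF) in text."""
    return [ch for ch in text if 0x1F1E6 <= ord(ch) <= 0x1F1FF]

def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]
```

Here describe(ITALIAN_FLAG) returns REGIONAL INDICATOR SYMBOL LETTER I followed by REGIONAL INDICATOR SYMBOL LETTER T. Running regional_indicators over each source file points out every place where a flag emoji would end up in the PDF.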

The grayscale was more frustrating. I had no idea how complex color management can be. Or, in my case, grayscale management. Really, couldn't the printer do that? A good friend with lots of printing experience told me that some printers refuse to do anything to the PDF. That way they can't be blamed when something goes wrong.

Fortunately, one of the magic incantations on the Prince forum did the trick. Now the properly grayscaled file is with the printer, and I can't wait to see the physical printed book.

Alternatives?

Most books are written in Word, presumably with a style sheet that the author must follow. If that works for you, great.

Perhaps you want to automate certain things, such as including program listings without the peril of copy/paste? Or you need to manage frequent updates? Then Word becomes unwieldy.

LaTeX is commonly used for typesetting scientific papers and books. However, styling is quite inflexible.

I have read through many descriptions of homegrown Markdown / Pandoc processes. The results looked rather drab.

Read through Robert Nystrom’s epic tale. He writes in Markdown, converts to XHTML, and imports that into InDesign, where he scripts the placement of side notes. And is forced to manually float figures. The result is pretty, but the process seems laborious.

Asciidoctor uses the AsciiDoc markup language, which is similar to Markdown but can express more complex document structures. It can produce EPUB and PDF. You can define your own “themes” with a YAML file, but you don't have direct access to the power of CSS. I know quite a few people who swear by it.

Conclusion

CSS is amazing for styling the appearance of text. If it's good enough for the world wide web, it's surely more than good enough for your book content. (Except for pagination.)

If you need to inject, transform, or extract data in your processing pipeline, XHTML is hard to beat. Parsing or producing LaTeX, Markdown, or AsciiDoc is a lot more complex. Word or Google Docs? I have done both—you don't want to go there.

EPUB is pretty awesome. It would be even more awesome if there were an EPUB reader that is as easy to use as your favorite PDF reader. Right now, the programs out there (Calibre, Thorium, etc.) add too much ceremony by wanting to add the EPUB to some kind of library.

The real promise of EPUB, which I haven't mentioned so far, is interactive content. In 2024, I don't just want to write a book with static pages. I want to provide activities that my readers can do as they are reading: run a program and see the output, write a program that does something slightly different, trace what a program does, and so on. I can do that in an EPUB because, JavaScript.

And then there is PDF. You need it for redistributable files (at least, until EPUB becomes universally and trivially readable), and for printing (as long as customers want printed books). HTML and CSS are almost good enough to produce PDF, but you need a solution such as Prince to go all the way.

Ideally, authors wouldn't have to worry about any of this. It used to be that we turned over our content in a format that was convenient for us, and it was competently turned into printed books and digital versions. Those times are gone. You can be a slave to whatever cumbersome process your publisher lays out for you, or you can own the means of production.