The Sordid Tale of XML Catalogs

In this blog, I summarize what I found out about using XML catalogs with the Java SAX parser. I know, it's not the most riveting subject, but if your app waits for minutes until the parser delivers a perfectly ordinary XHTML file, you may find this useful. Or depressing.

I am finishing the code samples for my book “Scala for the Impatient”. (Yes, for those of you who are impatiently awaiting it—the end is near. Very near.)

In the XML chapter, I started an example with

val doc = XML.load("http://horstmann.com/index.html")
doc \ "body" \ "_" \ "li"

It took several minutes for the file to load. What gives? My network connection wasn't that slow. And neither is the Scala XML parser—it just calls the SAX parser that comes with the JDK.

The problem is DTD resolution. The file starts out with

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

So, the parser feels compelled to fetch http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, and rightly so, because it needs to be able to resolve entities such as &auml; in the file.

Except, the W3C hates it when people fetch that file, and rightly so—they shouldn't have to serve it up by the billions. It should be up to the platform to cache commonly used DTDs.
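
To make the mechanism concrete, here is a minimal sketch of the hook that any DTD cache has to plug into: before the SAX parser fetches an external DTD, it asks the handler's resolveEntity method for a substitute. The class name LocalDtdDemo, the sample document, and the one-line stand-in DTD are all made up for illustration. A real cache would hand back the complete xhtml1-strict.dtd from local storage.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LocalDtdDemo {
   static final String XHTML_STRICT = "-//W3C//DTD XHTML 1.0 Strict//EN";

   // Hypothetical stand-in for the real DTD: it declares only the one entity used below
   static final String TINY_DTD = "<!ENTITY auml \"&#228;\">";

   static final String DOC =
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<!DOCTYPE html PUBLIC \"" + XHTML_STRICT + "\"\n"
      + "   \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
      + "<html><body><p>K&auml;se</p></body></html>";

   // Parses DOC without touching the network and returns its text content
   public static String parse() throws Exception {
      final StringBuilder text = new StringBuilder();
      DefaultHandler handler = new DefaultHandler() {
         @Override
         public InputSource resolveEntity(String publicId, String systemId) {
            if (XHTML_STRICT.equals(publicId)) {
               // Serve the local copy instead of fetching the system ID
               return new InputSource(new StringReader(TINY_DTD));
            }
            return null; // anything else: let the parser do its default fetch
         }

         @Override
         public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
         }
      };
      SAXParserFactory.newInstance().newSAXParser().parse(
         new ByteArrayInputStream(DOC.getBytes(StandardCharsets.UTF_8)), handler);
      return text.toString();
   }

   public static void main(String[] args) throws Exception {
      System.out.println(parse());
   }
}
```

The lookup is keyed on the public ID, not the URL. That is exactly the job a catalog does, only with a table in a file instead of an if statement.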

My platform, Ubuntu Linux, happens to have a perfectly good infrastructure for caching DTDs. Schema files too. There is a file /etc/xml/catalog that maps public ID prefixes to other catalog files. For example, the prefix "-//W3C//DTD XHTML 1.0" is mapped to /etc/xml/w3c-dtd-xhtml.xml, which maps "-//W3C//DTD XHTML 1.0 Strict//EN" to /usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml, which maps to the final destination, xhtml1-strict.dtd. I am pretty sure this is the same on other Linux systems too.
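
If you want to poke at such a mapping from Java, here is a rough sketch. It assumes JDK 9 or later, where a public javax.xml.catalog package exists; that API is newer than the JDK 7 setup discussed in this post and is not what the rest of the post uses. The class name CatalogLookupDemo and the one-entry catalog are hypothetical. The real Ubuntu catalogs chain through several files, as described above, and the relative uri here stands in for wherever the DTD actually lives.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

public class CatalogLookupDemo {
   static final String XHTML_STRICT = "-//W3C//DTD XHTML 1.0 Strict//EN";

   // Hypothetical catalog with a single entry that maps a public ID to a file
   static final String CATALOG_XML =
      "<?xml version=\"1.0\"?>\n"
      + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
      + "  <public publicId=\"" + XHTML_STRICT + "\" uri=\"xhtml1-strict.dtd\"/>\n"
      + "</catalog>\n";

   // Returns the URI that the catalog maps the public ID to, or null if there is no entry
   public static String lookup(String publicId) throws IOException {
      Path file = Files.createTempFile("catalog", ".xml");
      try {
         Files.write(file, CATALOG_XML.getBytes(StandardCharsets.UTF_8));
         Catalog catalog = CatalogManager.catalog(CatalogFeatures.defaults(), file.toUri());
         return catalog.matchPublic(publicId);
      } finally {
         Files.deleteIfExists(file);
      }
   }

   public static void main(String[] args) throws IOException {
      // The relative uri is resolved against the location of the catalog file
      System.out.println(lookup(XHTML_STRICT));
   }
}
```

The lookup only consults the table. It does not check whether the target file exists.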

So, of course the JDK takes advantage of this infrastructure, right? No—or I wouldn't have had the problem that I described.  Here is what I had to do to make it work.

The JDK takes its SAX implementation from Apache, and Apache has a CatalogResolver class. The JDK has it too, well-hidden at com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver. Ok, let's use it and delegate to it in the regular SAX handler.

import java.net.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import com.sun.org.apache.xml.internal.resolver.tools.*;

public class SAXTest {
   public static void main(String[] args) throws Exception {
      final CatalogResolver catalogResolver = new CatalogResolver();
      DefaultHandler handler = new DefaultHandler() {
            public InputSource resolveEntity (String publicId, String systemId) {
               return catalogResolver.resolveEntity(publicId, systemId);
            }
            public void startElement(String namespaceURI, String lname, String qname,
               Attributes attrs) { // the stuff you'd normally do
               if (lname.equals("a") && attrs != null) {
                  for (int i = 0; i < attrs.getLength(); i++) {
                     String aname = attrs.getLocalName(i);
                     if (aname.equals("href")) System.out.println(attrs.getValue(i));
                  }
               }
            }
         };

      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      SAXParser saxParser = factory.newSAXParser();
      String url = args.length == 0 ? "http://horstmann.com/index.html" : args[0];
      saxParser.parse(new URL(url).openStream(), handler);
   }
}

Does it work? No. The compiler complains that there is no package com.sun.org.apache.xml.internal.resolver.tools. That's bull:

jar tvf /path/to/jdk1.7.0/jre/lib/rt.jar | grep /CatalogResolver
  6757 Mon Jun 27 00:45:14 PDT 2011 com/sun/org/apache/xml/internal/resolver/tools/CatalogResolver.class

Take this, Java:

javac -cp .:/path/to/jdk1.7.0/jre/lib/rt.jar SAXTest.java

It compiles. It runs. (As an aside, this is pretty weird. I didn't realize that the compiler excludes some classes from rt.jar.)

Does it work? No. But there is a useful warning: Cannot find CatalogManager.properties. That's the final missing step. Create a file CatalogManager.properties with the entry

catalogs=/etc/xml/catalog

and put it somewhere on the class path. (No, /path/to/jdk/jre/lib/ext doesn't work, which probably isn't a bad thing.) Or start your app with

java -Dxml.catalog.files=/etc/xml/catalog SAXTest

Did it work? No. It turns out that Linux isn't all that perfect in its XML catalog infrastructure. The catalog.xml file has itself a DTD, like this:

<!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
    "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">

globaltranscorp.org is no longer, so downloading the DTD is futile. But wait—don't we have a perfectly good mechanism for using the public ID and locating the cached copy? The Ubuntu folks put the blame on Apache, and I am inclined to agree with them.

Anyway, the fix is to replace the system ID with "/usr/share/xml/schema/xml-core/tr9401.dtd".

Now it works. But it's ugly. Why can't it work by default? Or at least by default when -Dxml.catalog.files is set?

BTW, I am aware that I can get a CatalogManager implementation from Apache, and that it will likely work fine when mixed with the Java XML implementation. I just feel that I shouldn't have to do that.

What about other platforms? On the Mac, I found a catalog file at /opt/local/etc/xml. It only had a few Docbook DTDs, not XHTML. I don't know how you add to it (except, of course, manually). In Ubuntu, it's sudo apt-get install w3c-dtd-xhtml. How about Windows? I hope that some of you can tell me.

In Scala, it's a little messier to use the catalog resolver since the parser installs its own SAX handler.  The following works:

import xml._
import java.net._

object Main extends App {
  System.setProperty("xml.catalog.files", "/etc/xml/catalog")

  val res = new com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver

  val loader = new factory.XMLLoader[Elem] {
    override def adapter = new parsing.NoBindingFactoryAdapter() {
      override def resolveEntity(publicId: String, systemId: String) = {
        res.resolveEntity(publicId, systemId) 
      }
    }
  }

  val doc = loader.load(new URL("http://horstmann.com/index.html"))
  println(doc)
}

Don't ask. This doesn't use the documented API, just what I gleaned from reading the source.

Scala users have an alternative parser, ConstructingParser. Does it resolve entities? Nope. It replaces them with useless comments <!-- unknown entity nbsp; -->. Don't ask.

Overall, this is enough to make grown men cry. In my Google searches, I ran across a good number of apps that maintained their own catalog infrastructure. Caching these DTDs isn't something that every app should have to reinvent. The blame falls squarely on the Java platform here. (In Linux, there are C++ based tools that have no trouble with any of this.) Java should support the catalog infrastructure where it exists, and allow users to manually manage the catalogs and communicate the location with a global setting, not something on the classpath or the command line.