In this blog, I summarize what I found out about using XML catalogs with the Java SAX parser. I know, it's not the most riveting subject, but if your app waits for minutes until the parser delivers a perfectly ordinary XHTML file, you may find this useful. Or depressing.
I am finishing the code samples for my book “Scala for the Impatient”. (Yes, for those of you who are impatiently awaiting it—the end is near. Very near.)
In the XML chapter, I started an example with
val doc = XML.load("http://horstmann.com/index.html")
doc \ "body" \ "_" \ "li"
It took several minutes for the file to load. What gives? My network connection wasn't that slow. And neither is the Scala XML parser—it just calls the SAX parser that comes with the JDK.
The problem is DTD resolution. The file starts out with
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
So, the parser feels compelled to fetch http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, and rightly so, because it needs to be able to resolve entities such as &auml; in the file.
Except, the W3C hates it when people fetch that file, and rightly so—they shouldn't have to serve it up by the billions. It should be up to the platform to cache commonly used DTDs.
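If you want to watch the parser ask for the DTD without actually hitting the W3C server, something like the following should do. It is only a minimal sketch: the class name DtdRequestLogger and the inline sample document are made up for illustration. The handler records every system ID the parser wants resolved and hands back an empty DTD, so nothing goes over the network.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the book example.
public class DtdRequestLogger {
    /** Parses the XML text and returns the system IDs the parser asked to resolve. */
    static List<String> requestedSystemIds(String xml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                seen.add(systemId);
                // An empty source stands in for the DTD and keeps the parser off the network.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Sample document (assumption): same DOCTYPE as the file above, no entity references.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html><body/></html>";
        for (String id : requestedSystemIds(xml)) {
            System.out.println("Parser asked for: " + id);
        }
    }
}
```

Note that the sample document deliberately contains no entity references. With an empty DTD, there would be nothing to resolve them against.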
My platform, Ubuntu Linux, happens to have a perfectly good infrastructure for caching DTDs. Schema files too. There is a file /etc/xml/catalog that maps public ID prefixes to other catalog files. For example, the prefix "-//W3C//DTD XHTML 1.0" is mapped to /etc/xml/w3c-dtd-xhtml.xml, which maps "-//W3C//DTD XHTML 1.0 Strict//EN" to /usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml, which maps to the final destination, xhtml1-strict.dtd. I am pretty sure this is the same on other Linux systems too.
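To make the chain of catalogs a bit more concrete, here is a small sketch of the same idea: a top-level catalog delegates a public ID prefix to a second catalog, which maps the full public ID to a DTD file. Two loud assumptions: it uses the javax.xml.catalog API, which only exists in JDK 9 and later (so not in the JDK 7 I am using here), and the class name CatalogChainDemo and the two throwaway catalog files in a temp directory are made up for illustration. They are not the actual Ubuntu files.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

// Hypothetical demo class; requires JDK 9 or later for javax.xml.catalog.
public class CatalogChainDemo {
    static final String NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog";

    /** Builds a two-level catalog in a temp directory and looks up a public ID in it. */
    static String lookup(String publicId) throws Exception {
        Path dir = Files.createTempDirectory("catalog-demo");

        // Second level: maps the full public ID to a DTD file name.
        String second = "<catalog xmlns=\"" + NS + "\">\n"
            + "  <public publicId=\"-//W3C//DTD XHTML 1.0 Strict//EN\""
            + " uri=\"xhtml1-strict.dtd\"/>\n"
            + "</catalog>\n";
        Files.write(dir.resolve("xhtml.xml"), second.getBytes(StandardCharsets.UTF_8));

        // Top level: delegates every public ID with this prefix to the second catalog.
        String top = "<catalog xmlns=\"" + NS + "\">\n"
            + "  <delegatePublic publicIdStartString=\"-//W3C//DTD XHTML 1.0\""
            + " catalog=\"xhtml.xml\"/>\n"
            + "</catalog>\n";
        Path root = dir.resolve("catalog.xml");
        Files.write(root, top.getBytes(StandardCharsets.UTF_8));

        Catalog catalog = CatalogManager.catalog(CatalogFeatures.defaults(), root.toUri());
        // Returns the URI of the matching entry, or null if nothing matches.
        return catalog.matchPublic(publicId);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lookup("-//W3C//DTD XHTML 1.0 Strict//EN"));
    }
}
```

The lookup only computes where the DTD should be. The sketch never creates or reads xhtml1-strict.dtd itself.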
So, of course the JDK takes advantage of this infrastructure, right? No—or I wouldn't have had the problem that I described. Here is what I had to do to make it work.
The JDK takes its SAX implementation from Apache, and Apache has a CatalogResolver class. The JDK has it too, well-hidden at com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver.
Ok, let's use it and delegate to it in the regular SAX handler.
import java.net.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import com.sun.org.apache.xml.internal.resolver.tools.*;

public class SAXTest {
    public static void main(String[] args) throws Exception {
        final CatalogResolver catalogResolver = new CatalogResolver();
        DefaultHandler handler = new DefaultHandler() {
            // Delegate entity resolution to the catalog resolver
            public InputSource resolveEntity(String publicId, String systemId) {
                return catalogResolver.resolveEntity(publicId, systemId);
            }

            public void startElement(String namespaceURI, String lname,
                    String qname, Attributes attrs) {
                // the stuff you'd normally do
                if (lname.equals("a") && attrs != null) {
                    for (int i = 0; i < attrs.getLength(); i++) {
                        String aname = attrs.getLocalName(i);
                        if (aname.equals("href"))
                            System.out.println(attrs.getValue(i));
                    }
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser saxParser = factory.newSAXParser();
        String url = args.length == 0 ? "http://horstmann.com/index.html" : args[0];
        saxParser.parse(new URL(url).openStream(), handler);
    }
}
Does it work? No. The compiler complains that there is no package com.sun.org.apache.xml.internal.resolver.tools. That's bull:
jar tvf /path/to/jdk1.7.0/jre/lib/rt.jar | grep /CatalogResolver
6757 Mon Jun 27 00:45:14 PDT 2011 com/sun/org/apache/xml/internal/resolver/tools/CatalogResolver.class
Take this, Java:
javac -cp .:/path/to/jdk1.7.0/jre/lib/rt.jar SAXTest.java
It compiles. It runs. (As an aside, this is pretty weird. I didn't realize that the compiler excludes some classes from rt.jar.)
Does it work? No. But there is a useful warning: Cannot find CatalogManager.properties. That's the final missing step. Create a file CatalogManager.properties with the entry
catalogs=/etc/xml/catalog
and put it somewhere on the class path. (No, /path/to/jdk/jre/lib/ext doesn't work, which probably isn't a bad thing.) Or start your app with
java -Dxml.catalog.files=/etc/xml/catalog SAXTest
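If the warning persists, it may help to check whether the properties file is actually visible on the class path. The following is only a sketch for that purpose. The class name CatalogConfigCheck is made up, and it does not use the resolver at all: it just loads CatalogManager.properties the way any class path resource is loaded and reports the catalogs entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical diagnostic helper; independent of the Apache resolver classes.
public class CatalogConfigCheck {
    /**
     * Returns the value of the "catalogs" entry in CatalogManager.properties,
     * or null if the file is not visible to the given class loader
     * (or the entry is missing).
     */
    static String catalogsSetting(ClassLoader loader) throws IOException {
        try (InputStream in = loader.getResourceAsStream("CatalogManager.properties")) {
            if (in == null) return null; // not on this class path
            Properties props = new Properties();
            props.load(in);
            return props.getProperty("catalogs");
        }
    }

    public static void main(String[] args) throws IOException {
        String catalogs = catalogsSetting(CatalogConfigCheck.class.getClassLoader());
        if (catalogs == null) {
            System.out.println("No usable CatalogManager.properties on the class path");
        } else {
            System.out.println("catalogs=" + catalogs);
        }
    }
}
```

Run it with the same class path as your app. If it reports nothing usable, the resolver most likely cannot see the file either.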
Did it work? No. It turns out that Linux isn't all that perfect in its XML catalog infrastructure. The catalog.xml file itself has a DTD, like this:
<!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
  "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">
globaltranscorp.org no longer exists, so downloading the DTD is futile. But wait, don't we have a perfectly good mechanism for using the public ID and locating the cached copy? The Ubuntu folks put the blame on Apache, and I am inclined to agree with them.
Anyway, the fix is to replace the system ID with "/usr/share/xml/schema/xml-core/tr9401.dtd".
Now it works. But it's ugly. Why can't it work by default? Or at least by default when -Dxml.catalog.files is set?
BTW, I am aware that I can get a CatalogManager implementation from Apache, and that it will likely work fine when mixed with the Java XML implementation. I just feel that I shouldn't have to do that.
What about other platforms? On the Mac, I found a catalog file at /opt/local/etc/xml. It only had a few DocBook DTDs, not XHTML. I don't know how you add to it (except, of course, manually). In Ubuntu, it's sudo apt-get install w3c-dtd-xhtml. How about Windows? I hope that some of you can tell me.
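Until the platform sorts this out, an app could probe the known locations and set xml.catalog.files itself. Here is a minimal sketch of that idea. The class name CatalogLocator is made up, and the Mac path assumes that the file in /opt/local/etc/xml is simply named catalog. I have no candidate for Windows.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; the candidate list is an assumption, not a standard.
public class CatalogLocator {
    // The Ubuntu location from this post, and "catalog" inside the Mac
    // directory mentioned above (the file name is assumed).
    static final List<String> CANDIDATES =
        Arrays.asList("/etc/xml/catalog", "/opt/local/etc/xml/catalog");

    /** Returns the first candidate that exists as a regular file, if any. */
    static Optional<Path> firstExisting(List<String> candidates) {
        for (String candidate : candidates) {
            Path path = Paths.get(candidate);
            if (Files.isRegularFile(path)) return Optional.of(path);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> found = firstExisting(CANDIDATES);
        if (found.isPresent()) {
            // To be safe, set this before any CatalogResolver is created.
            System.setProperty("xml.catalog.files", found.get().toString());
            System.out.println("Using catalog " + found.get());
        } else {
            System.out.println("No system catalog found");
        }
    }
}
```

This is exactly the kind of thing that every app shouldn't have to do, but it beats waiting minutes for a DTD.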
In Scala, it's a little messier to use the catalog resolver since the parser installs its own SAX handler. The following works:
import xml._
import java.net._

object Main extends App {
  System.setProperty("xml.catalog.files", "/etc/xml/catalog")
  val res = new com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver
  val loader = new factory.XMLLoader[Elem] {
    override def adapter = new parsing.NoBindingFactoryAdapter() {
      override def resolveEntity(publicId: String, systemId: String) = {
        res.resolveEntity(publicId, systemId)
      }
    }
  }
  val doc = loader.load(new URL("http://horstmann.com/index.html"))
  println(doc)
}
Don't ask. This doesn't use the documented API, just what I gleaned from reading the source.
Scala users have an alternative parser, ConstructingParser. Does it resolve entities? Nope. It replaces them with useless comments <!-- unknown entity nbsp; -->. Don't ask.
Overall, this is enough to make grown men cry. In my Google searches, I ran across a good number of apps that maintained their own catalog infrastructure. Caching these DTDs isn't something that every app should have to reinvent. The blame falls squarely on the Java platform here. (In Linux, there are C++ based tools that have no trouble with any of this.) Java should support the catalog infrastructure where it exists, and allow users to manually manage the catalogs and communicate the location with a global setting, not something on the classpath or the command line.