A spoonful of Scala

I write my lecture slides in XHTML, using the marvelous HTML Slidy package. I just dump the images into the same directory as the HTML files, which isn't so smart because it makes it hard to copy a presentation from one directory to another. I could change my habit, but hey, what is technology for? A couple of years ago I decided to write a script that simply generates a list of all images in an HTML file, so I can run

cp `images 01-intro.html` somewhere

Piece of cake, right? Just look for <img src="foo.jpg" .../>. Now I could just use a regular expression. But, as Jamie Zawinski said, “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”

Wise words indeed. I could spend a long time fussing with problems such as <img (newline) src=. Of course, the right thing to do is to use an XML parser, and the manly thing to do is to use XSLT. After more pain than seemed warranted, I came up with this XSLT script.

<xsl:stylesheet version = '1.0'
      xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
      xmlns:html="http://www.w3.org/1999/xhtml">
   <xsl:output method="text"/>
   <xsl:template match="html:img">
      <xsl:value-of select="@src"/>
      <xsl:text>&#10;</xsl:text>
   </xsl:template>
   <xsl:template match="@* | node()">
      <xsl:apply-templates select="@* | node()"/>
   </xsl:template>
</xsl:stylesheet>

Do not ask me about it. I do not want the pain to recur.

It worked fine for a couple of years, but this morning it broke.

java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

WTH? I pasted the URL into my browser, and it worked just fine. Next, I tried telnet.

$ telnet www.w3c.org 80
Trying 128.30.52.45...
Connected to dolph.w3.org.
Escape character is '^]'.
GET /TR/xhtml1/DTD/xhtml1-strict.dtd HTTP/1.0

HTTP/1.1 503 Service Unavailable due to Unknown abuse from requesting IP
Date: Wed, 12 Aug 2009 01:53:24 GMT
Server: Apache/2
Content-Location: unknown.asis
Vary: negotiate
TCN: choice
Retry-After: 86400
Content-Length: 730
Connection: close
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Forbidden due to abuse</title>
</head>

<body>

<h1>Forbidden due to abuse</h1>

<p>We are most interested in finding the source of this particular
abuse.  Please <a href="mailto:web-human+unknown-abuse@w3.org">contact
us</a> if you have any details as to the client software running
(browser, web crawler, other), what it was requesting, who your
provider is or are willing for us to follow up with you and try to get
details.</p>

<hr />
<address>
<a href="/Help/">W3C Webmaster</a><br />
<small>$Date: 2009-08-11$</small>
</address>
</body>
</html>
Connection closed by foreign host.

Apparently, the W3C decided to crack down on programs that just fetch a DTD from its server. If the user agent is Java, not Mozilla, you get an Error 503. Of course, I don't actually need the DTD. No problem, I just use one of these magic incantations for the parser, or maybe the parser factory, or the parser factory configuration, or the parser factory configuration assembly—as the designers of the SAX API demonstrate so vividly, any problem in computer science can be amplified with another level of indirection.
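
A minimal sketch of one such incantation, written in Scala against the SAX parser that ships with the JDK. The feature URI is the Xerces-specific switch for not loading the external DTD, so it assumes a Xerces-derived parser (the JDK default is one); other parsers may reject it. The names ImagesNoDtd and sources are made up for illustration.

```scala
import java.io.File
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

object ImagesNoDtd {
  // Xerces-specific feature: do not fetch the external DTD when not validating.
  // Assumption: the underlying parser is Xerces-derived.
  val LoadExternalDtd =
    "http://apache.org/xml/features/nonvalidating/load-external-dtd"

  // Returns the src attribute of every img element, in document order.
  def sources(file: File): List[String] = {
    val factory = SAXParserFactory.newInstance()
    factory.setFeature(LoadExternalDtd, false)

    val found = ListBuffer.empty[String]
    val handler = new DefaultHandler {
      // The factory is not namespace-aware by default, so the element
      // name arrives in qName rather than localName.
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit =
        if (qName == "img") Option(attrs.getValue("src")).foreach(found += _)
    }
    factory.newSAXParser().parse(file, handler)
    found.toList
  }

  def main(args: Array[String]): Unit =
    args.foreach(name => sources(new File(name)).foreach(println))
}
```

With the DTD skipped, entities that are declared only there (such as &nbsp;) are no longer defined, so a file that relies on them may still fail to parse.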

Except I didn't write a program that used the SAX API—I am no Evel Knievel. I just invoked Xalan on the command line. And I do not have the intestinal fortitude for figuring out its command line options.

Then I remembered that Scala can process XML natively.

My first attempt failed miserably:

val x = scala.xml.XML.loadFile("01-intro.html")
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd

Ugh, the Scala library uses Xerces, just like Xalan does.

But the Scala library has another parser of its own, ConstructingParser, and that one worked fine.

val doc = scala.xml.parsing.ConstructingParser.fromFile(new java.io.File("01-intro.html"), true).document
doc \\ "img" \\ "@src" map println

Two lines of Scala made the medicine go down...
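
To see what the two \\ steps select, here is a small self-contained sketch. It assumes the scala-xml module is on the classpath (it has been a separate library since Scala 2.11). It parses an inline sample with XML.loadString rather than a file, and the sample has no DOCTYPE, so nothing is fetched from the network. ImageSources and sources are hypothetical names, and the sample markup is invented.

```scala
import scala.xml.XML

object ImageSources {
  // Collect the src attribute of every img element, in document order.
  def sources(xhtml: String): Seq[String] = {
    val doc = XML.loadString(xhtml)
    // \\ "img" finds img elements at any depth; \ "@src" reads the attribute.
    (doc \\ "img").map(img => (img \ "@src").text)
  }

  def main(args: Array[String]): Unit = {
    // The second tag is split across lines, the case that trips up a
    // line-oriented regular expression.
    val sample =
      """<html xmlns="http://www.w3.org/1999/xhtml"><body>
        |<p><img src="a.png" alt="first"/></p>
        |<div><img
        |   src="b.jpg" alt="second"/></div>
        |</body></html>""".stripMargin
    sources(sample).foreach(println)
    // prints:
    // a.png
    // b.jpg
  }
}
```

Because the parser sees elements and attributes rather than lines of text, the line break inside the second tag makes no difference.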

So, what is the moral of all this?