I write my lecture slides in XHTML, using the marvelous HTML Slidy package. I just dump the images into the same directory as the HTML files, which isn't so smart because it makes it hard to copy a presentation from one directory to another. I could change my habit, but hey, what is technology for? A couple of years ago I decided to write a script that simply generates a list of all images in an HTML file, so I can run
cp `images 01-intro.html` somewhere
Piece of cake, right? Just look for <img src="foo.jpg" .../>. Now I could just use a regular expression. But, as Jamie Zawinski said, “Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”
Wise words indeed. I could spend a long time fussing with problems such as <img (newline) src=. Of course, the right thing to do is to use an XML parser, and the manly thing to do is to use XSLT. After more pain than seemed warranted, I came up with this XSLT script.
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <!-- Print the src attribute of each XHTML img element, one per line -->
  <xsl:template match="html:img">
    <xsl:value-of select="@src"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <!-- Walk everything else without copying any text to the output -->
  <xsl:template match="@* | node()">
    <xsl:apply-templates select="@* | node()"/>
  </xsl:template>
</xsl:stylesheet>
Do not ask me about it. I do not want the pain to recur.
It worked fine for a couple of years, but this morning it broke.
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
WTH? I pasted the URL into my browser, and it worked just fine. Next, I tried telnet.
$ telnet www.w3c.org 80
Trying 128.30.52.45...
Connected to dolph.w3.org.
Escape character is '^]'.
GET /TR/xhtml1/DTD/xhtml1-strict.dtd HTTP/1.0

HTTP/1.1 503 Service Unavailable due to Unknown abuse from requesting IP
Date: Wed, 12 Aug 2009 01:53:24 GMT
Server: Apache/2
Content-Location: unknown.asis
Vary: negotiate
TCN: choice
Retry-After: 86400
Content-Length: 730
Connection: close
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Forbidden due to abuse</title>
</head>
<body>
<h1>Forbidden due to abuse</h1>
<p>We are most interested in finding the source of this particular abuse. Please <a href="mailto:web-human+unknown-abuse@w3.org">contact us</a> if you have any details as to the client software running (browser, web crawler, other), what it was requesting, who your provider is or are willing for us to follow up with you and try to get details.</p>
<hr />
<address>
<a href="/Help/">W3C Webmaster</a><br />
<small>$Date: 2009-08-11$</small>
</address>
</body>
</html>
Connection closed by foreign host.
Apparently, the W3C decided to crack down on programs that just fetch a DTD from its server. If the user agent is Java, not Mozilla, you get an Error 503. Of course, I don't actually need the DTD. No problem, I just use one of these magic incantations for the parser, or maybe the parser factory, or the parser factory configuration, or the parser factory configuration assembly—as the designers of the SAX API demonstrate so vividly, any problem in computer science can be amplified with another level of indirection.
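For the record, one such incantation might look like the sketch below. It is written in Scala (to match the code later in this post) against the JDK's standard SAX API, and it simply hands the parser an empty stand-in whenever it asks for an external entity such as the DTD. That is an assumption that only holds when the DTD's contents do not matter, as is the case for listing src attributes. The names ImageLister and imageSources are made up for illustration.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

object ImageLister {
  // Hypothetical helper: returns the src of every img element in the given XHTML text.
  def imageSources(xhtml: String): List[String] = {
    val sources = ListBuffer[String]()
    val handler = new DefaultHandler {
      // Answer every external entity request (including the DTD) with an
      // empty document, so the parser never goes out to the network.
      override def resolveEntity(publicId: String, systemId: String): InputSource =
        new InputSource(new StringReader(""))

      // The default factory is not namespace-aware, so qName is the plain tag name.
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit =
        if (qName == "img") Option(attrs.getValue("src")).foreach(sources += _)
    }
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(xhtml)), handler)
    sources.toList
  }
}
```

Because the resolver returns an empty DTD, a page that relies on DTD-declared entities such as &nbsp; would need more care than this sketch takes.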
Except I didn't write a program that used the SAX API—I am no Evel Knievel. I just invoked Xalan on the command line. And I do not have the intestinal fortitude for figuring out its command line options.
Then I remembered that Scala can process XML natively.
My first attempt failed miserably:
val x = scala.xml.XML.loadFile("01-intro.html")
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
Ugh, the Scala library uses Xerces, just like Xalan does.
But they have another parser, and that one worked fine.
import java.io.File
import scala.xml.parsing.ConstructingParser
val doc = ConstructingParser.fromFile(new File("01-intro.html"), true).document
doc \\ "img" \\ "@src" map println
Two lines of Scala made the medicine go down...
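Fleshed out into something that could stand in for the images script, those two lines might look like the sketch below. It assumes the scala-xml library is on the classpath (recent Scala versions ship it as a separate module) and a version where document() is called with parentheses. The names Images and imageSources are invented for this example. Here \\ "img" finds the img elements at any depth, and \ "@src" reads the attribute of each one.

```scala
import scala.io.Source
import scala.xml.parsing.ConstructingParser

object Images {
  // Hypothetical helper: the src of every img element, in document order.
  def imageSources(input: Source): Seq[String] = {
    // preserveWS = false: whitespace-only text nodes are of no interest here
    val doc = ConstructingParser.fromSource(input, false).document()
    // \\ selects descendants at any depth; \ "@src" selects the attribute
    (doc.docElem \\ "img").map(img => (img \ "@src").text).filter(_.nonEmpty)
  }

  // Print the image names of each file named on the command line.
  def main(args: Array[String]): Unit =
    args.foreach { name =>
      val input = Source.fromFile(name)
      try imageSources(input).foreach(println) finally input.close()
    }
}
```

Taking a Source rather than a File keeps the helper usable on strings as well, which makes it easy to try out without a slide deck at hand.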
So, what is the moral of all this?
Take doc \\ "img" \\ "@src" map println. The \\ operator looks like XPath // (which, for obvious reasons, they couldn't have taken verbatim :-)), so the learning curve was minimal.