The Perils of Partially Powered Languages

August 21, 2011

By Michael Snoyman

Let's make an ePub

I was recently tasked with setting up a system for one of our clients that would take their internal XML document format and produce an ePub. This isn't a particularly difficult task, and is something I've done before. The one restriction I was given was that the solution had to be implemented in Java.

Anyone who reads this blog at all should realize I'm a Haskell programmer, and as such, Java is not my language of choice. Nonetheless, I'm used to clients requiring that work be done in it, on the off chance that they ever want to look at the code. (To my knowledge, this has never actually happened.) This time around, I'm very happy I did it in Java, as it gave me a perfect real-life situation for expressing a point I've wanted to make for a while.

What is Java?

Before getting into the meat of this, we need to analyze the client requirement a bit more: what is Java, and why do they want me to use it? There are really two distinct things that get the name Java: the programming language, and the virtual machine. A requirement to use Java then stems from one of two completely different goals:

  • Requiring Java-the-language has to do with the idea that "we know Java, we'll understand Java code."
  • Requiring the JVM is usually about ease of deployment, increased security, etc.

In this case, it was made clear that the goal was the first one: they wanted any one of their ten Java programmers to be able to understand the code I was writing. In a situation like this, you would assume that every line of code needs to be written in Java itself, for maximum comprehension. (They actually mentioned a possible interest in using Clojure, but that would have been a stretch for them.)

However, as you'll see in a bit, most of the code I wrote was not Java, and they were perfectly happy with this.

Meet the Partially Powered Languages

So if not Java, then what? Well, there were actually three additional languages used in the development of the project, with one of them being replaced halfway through:

  • XSLT, used for transforming XML.
  • XProc, an XML pipeline language. More on this later.
  • Ant, used to orchestrate all this.

This list may seem a bit arbitrary, but it's not. One of the major tools used in the document services industry is the DITA-OT, and it is a polyglot of Java, XSLT and Ant. There is a lot of interest in migrating to XProc for both simplification and performance improvements, and the client specifically mentioned an interest in XProc since they are using it elsewhere.

Why do I call these partially powered languages? Quite simply, each language is good for its specific task, but can't (easily) do anything else. Take XProc: it's a language designed to piece together a number of XML-transforming operations. The approach often used today by tools like the DITA-OT is to use Ant to call out to XSLT for each pipeline step, storing the intermediate results in a temp folder. As you can imagine, all of that file I/O, parsing and rendering carries a performance penalty.

So I thought that XProc would be perfect here... until I found out there was no direct way to create a ZIP file from it. (ePubs are just glorified ZIP files.) This isn't a bug in XProc, or something for which a library has not yet been written, but simply a fact of the language: its goals do not include producing ZIP files. And that's precisely the problem: it's not very common that someone's needs fit perfectly into the use case of the tool in question.
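
For contrast, creating a ZIP is a few lines in any general-purpose language. Here's a minimal sketch using Java's standard java.util.zip classes; the file names are hypothetical, and a real ePub has a couple of extra container requirements (such as a stored mimetype entry) beyond plain zipping:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class MakeZip {
    public static void main(String[] args) throws IOException {
        // Create book.epub and copy one file into it; at the container
        // level, an ePub is just a ZIP archive.
        ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("book.epub"));
        zos.putNextEntry(new ZipEntry("OEBPS/content.opf"));
        FileInputStream in = new FileInputStream("build/content.opf");
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            zos.write(buf, 0, n);
        }
        in.close();
        zos.closeEntry();
        zos.close();
    }
}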

Here's another example from this project. XSLT turned out to be fairly straightforward for most of the implementation. (There are a whole bunch of issues with relative file paths waiting to rear their ugly heads... but for now, the XML is simple, so we took the easy route.) And XSLT 2.0's ability to generate multiple output files from a single input file (via xsl:result-document) meant a single XSLT call could produce all the output, avoiding all those temporary XML files I mentioned earlier. Hurray!

And all was well... until I got to the end. One of the XML files in an ePub (content.opf) requires a list of all the files contained in the ZIP file. No problem: I can produce a list of all the HTML files I generated. But I don't have a list of all the static files, like CSS and images, that were used. In theory, I could glean all the images referenced from the XML directly, but I wouldn't know about any images referenced from the CSS. And XSLT can't parse CSS, so that's not an option.

There's actually a very simple solution: just get a list of all the files in the static folder. Heck, I can use that for the HTML too if I want to be lazy! All I need to do is use that XSLT directory listing function... that function that doesn't exist. You see, getting a list of files in a folder is not in the purview of XSLT. It knows how to read, manipulate, and generate XML. So how did I solve this? Well, the standard way we do things in the DITA-OT:

  1. Write some Java code that lists the files in a folder and outputs them as an XML file. (A sketch of such a helper follows below.)
  2. Call that code before calling XSLT.
  3. Pass a reference to that generated file when calling the XSLT, and let XSLT do its normal stuff at that point.
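
The helper itself is tiny. Here's a hypothetical sketch of the idea; the class name and XML element names are my own, and real code would also escape XML-special characters:

import java.io.File;
import java.io.PrintWriter;

public class FileLister {
    // Recursively list every file under the given folder and write the
    // result as a simple XML document for the XSLT to read.
    public static void main(String[] args) throws Exception {
        File dir = new File(args[0]);
        PrintWriter out = new PrintWriter(args[1], "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<files>");
        listFiles(dir, out);
        out.println("</files>");
        out.close();
    }

    private static void listFiles(File dir, PrintWriter out) {
        File[] children = dir.listFiles();
        if (children == null) return; // not a directory
        for (File f : children) {
            if (f.isDirectory()) {
                listFiles(f, out);
            } else {
                out.println("  <file>" + f.getPath() + "</file>");
            }
        }
    }
}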

As for Ant... well, anyone familiar with it will know its strengths. It can easily do things like copy files, compile Java, and call XSLT. And it can easily be extended with new features by defining new Ant tasks, written in Java. Which just proves the point: it can do some things itself, but not everything.
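
For the record, that extension mechanism is as simple as it sounds: a custom task is just a Java class extending org.apache.tools.ant.Task. A hypothetical sketch:

import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.Task;

// A minimal custom Ant task. After a <taskdef> pointing at the compiled
// class, a build file can invoke it as <hello name="world"/>.
public class HelloTask extends Task {
    private String name;

    // Ant maps the name="..." attribute onto this setter by convention.
    public void setName(String name) {
        this.name = name;
    }

    @Override
    public void execute() throws BuildException {
        if (name == null) {
            throw new BuildException("missing required attribute: name");
        }
        log("Hello, " + name);
    }
}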

So what's the problem?

Just to review, let's see how our little ePub generator works:

  1. We use Ant to orchestrate everything. It's what the user calls (perhaps ironically, via a Bash script); it accepts parameters and calls out to everything else.
  2. There's one piece of Java code to generate an XML file containing the list of files.
  3. XSLT takes the input XML files and produces the XHTML and XML files required by the ePub.
  4. Ant does some file copying as necessary, then ZIPs everything up into an ePub.

Doesn't seem too bad, right? Well, remember that issue with relative paths I mentioned? It's something I've solved before on different projects. It's nothing too complicated: you just need to (1) parse file paths properly and (2) make sure you have a consistent mapping from source files to output files. This was really easy to solve in Haskell, and even in Java: you just use a data structure like a map or hashtable plus some basic string parsing.

And here's the problem: both of those are difficult to impossible in XSLT. XSLT isn't designed for maps. It certainly doesn't have the ability to get the canonical filename for a relative path and use it as a key in a dictionary. While that's just 1 or 2 lines in Java, Haskell, and probably a dozen other languages, it's not the case in XSLT. (By the way, this is a long-standing bug in the DITA-OT. The solution there is "don't set up your source XML that way.")
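
To make the "1 or 2 lines" claim concrete, here's what that looks like in Java; the paths are purely illustrative:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class PathMap {
    public static void main(String[] args) throws IOException {
        // Map each source file, keyed by its canonical path, to an output path.
        Map<String, String> sourceToOutput = new HashMap<String, String>();
        File source = new File("chapters/../chapters/intro.xml");
        // getCanonicalPath resolves ".." segments and symlinks, so two
        // different relative spellings of the same file get the same key.
        sourceToOutput.put(source.getCanonicalPath(), "OEBPS/intro.xhtml");
        System.out.println(sourceToOutput);
    }
}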

There's also the fact that this whole call-out-to-Java approach for the file listing is ridiculous. It's certainly inefficient, but more importantly, it's counter-intuitive. And it's something that can easily be broken in maintenance.

But then there's the really big issue: the whole reason to use Java was so that all of their Java programmers would understand the code. And we haven't achieved that goal at all. The Java code was the smallest, most innocuous part of the project. Now, in order to understand what's going on, they need to know Java, Ant and XSLT.

There's one more issue that I've seen, one that hasn't affected us here yet, but does affect other big systems designed in this way: modularity. Ant and XSLT are not designed for modularity. They aren't general-purpose languages, and they were designed with specific goals. You can see this limitation clearly in the DITA-OT: it needed to create an entire plugin system which generates Ant files from templates just to accomplish something as simple as passing parameters. (It's a bit more complicated than that; I don't want to go into the gory details here.) And the worst part is that it doesn't even solve the problem well: you can't have two sets of modifications to HTML output in the DITA-OT without essentially duplicating all of the template files.

Why did you let this happen?

It seems like the fault in all this rests squarely on my shoulders: I knew that Ant and XSLT had shortcomings for this project, and I used them anyway. I should have just manned up and used Java for the whole thing. Everything would have been in a single language that the client could understand, and adding future features like better file reference support would have been an evolutionary change.

The problem is that there's a budget to consider. I've tried once before to replace our normal XSLT usage with pure Java, and the result was horrible: it took me significantly longer to write than just doing the same thing in XSLT. And there are two reasons for this: Java-the-language, and Java-the-libraries. I don't want to speak about this in the abstract, so let's have a real-world comparison to a Haskell library: xml-enumerator. I'm sure many other languages could make a similar comparison with their tools, but this is the one I'm most familiar with.

Let's start with a simple question: what's an XML document? In xml-enumerator, that's very simple: a datatype called Document that contains some stuff (processing instructions and comments) before and after the root element, a doctype statement, and the root element itself. What's an element? Why, it's the tag name, a list of attributes, and a list of child nodes. And a node is either an element, some text, a processing instruction, or a comment. It's all right there in the data type definitions, which I think are understandable even by someone without any Haskell experience.

Contrast this with the DOM model from Java. We have a Node interface, which has child interfaces for elements, processing instructions, text and comments. So far, pretty similar to Haskell. Oh, and it also has subinterfaces for documents, attributes, and notations. (Note: notations don't actually exist in the XML.) Now, this Node interface declares a whole bunch of methods, such as getNodeName. You can guess what that does for an element, but what does it do for a comment? It turns out it returns the literal string "#comment"... right.
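
Here's a small demonstration of how uninformative that is; nothing here is project code, just the bare DOM API:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NodeNames {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element foo = doc.createElement("foo");
        Comment c = doc.createComment("not a name at all");
        System.out.println(foo.getNodeName()); // prints: foo
        System.out.println(c.getNodeName());   // prints: #comment
        System.out.println(c.getNodeValue());  // prints: not a name at all
    }
}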

Here's some more fun: try understanding how that API deals with XML namespaces. In xml-enumerator, we have a special datatype called Name, which contains the local name, namespace and prefix for an identifier. This models the XML namespaces specification precisely. And for convenience, you can even write them as normal strings in your Haskell source code, so most of the time you don't even need to think about namespaces. In Java? There are two versions of most functions: one that deals with namespaces, and one that doesn't.
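
Here's what that duplication looks like in practice; the namespace URI is illustrative:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TwoOfEverything {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // The namespace-unaware API...
        Element a = doc.createElement("foo");
        a.setAttribute("bar", "baz");
        // ...and its namespace-aware twin, which you must remember to
        // use consistently, prefixed qualified name and all.
        Element b = doc.createElementNS("http://example.com/ns", "ex:foo");
        b.setAttributeNS("http://example.com/ns", "ex:bar", "baz");
    }
}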

Let's compare traversing the DOM. In Haskell, I can easily add an attribute to every element in a tree like so:

addAttr :: Name -> Text -> Node -> Node
addAttr name value (NodeElement (Element tag attrs children)) =
    NodeElement (Element tag ((name, value) : attrs) (map (addAttr name value) children))
addAttr _ _ node = node

Here, we get to leverage pattern matching extensively, as well as use persistent data structures easily. The equivalent in Java is:
public static void addAttr(String name, String value, Node node) {
    if (node.getNodeType() != Node.ELEMENT_NODE) return;
    Element element = (Element) node;
    element.setAttribute(name, value);

    NodeList nl = element.getChildNodes();
    for (int i = 0; i < nl.getLength(); ++i) {
        addAttr(name, value, nl.item(i));
    }
}

Note that we have to use a type cast, which if not checked correctly could result in a runtime exception. We're also modifying the original value instead of generating a new one, and we have no support for namespaces.

The magic feature that's missing here is sum types. There is no good replacement in Java, and it is precisely the right tool to model nodes in XML. Add to that the requirement to use mutable data structures, and it's painful. And don't even get me started on the difference between a null string and an empty string in the API.
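
To see why, here's a hypothetical sketch (not from any real library) of the closest Java encoding of a sum type: an abstract class, one subclass per constructor, and a visitor standing in for pattern matching.

// The closest Java analogue to a sum type, trimmed to two constructors.
abstract class XmlNode {
    abstract <R> R match(Visitor<R> v);

    interface Visitor<R> {
        R element(String tag);
        R text(String content);
    }

    static final class Elem extends XmlNode {
        final String tag;
        Elem(String tag) { this.tag = tag; }
        <R> R match(Visitor<R> v) { return v.element(tag); }
    }

    static final class Text extends XmlNode {
        final String content;
        Text(String content) { this.content = content; }
        <R> R match(Visitor<R> v) { return v.text(content); }
    }
}

All of that ceremony buys you roughly what a one-line data declaration gives you in Haskell.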

I've only scratched the surface here, but I'm sure you can extrapolate how painful writing this kind of code in Java is. And I was surprised to find this out: Java is, after all, known for its prowess at manipulating XML.

Which now explains the situation fully: Java is a painful language for the majority of the code we need to write. So instead, we use a bunch of languages designed for those specific tasks, to avoid writing that Java code. As a result, we have messy codebases built from lots of sub-powered languages, resulting in spaghetti code that no one can read.

So... we should only use full-fledged languages?

You might think that I'm advocating never using a non-general-purpose programming language. That's not the case. If you look at Yesod, we have a number of those floating around, in particular Hamlet, an HTML templating language. Let's compare Hamlet to XSLT in the context of this conversation:

  • Hamlet is nothing more than simple syntactic sugar. You can easily trace everything it does to the corresponding Haskell code. XSLT, on the other hand, is a very powerful language, whose features are very difficult to reproduce in Java.
  • Hamlet only has meaning within the context of a Haskell program. It reads variables from the surrounding code and returns a value, which must be used by Haskell. XSLT can exist entirely outside of a Java program.
  • Hamlet has absolutely no side effects. XSLT can be used to generate as many output files as it wants.

You might be surprised at what I'm saying: Hamlet's advantage is that it's less powerful than XSLT! That's precisely the point. Add-on languages like Hamlet are great for simplifying code, removing line noise, and making it easier to maintain the codebase. XSLT, on the other hand, is used as an alternative to Java.

Going back to my file reference example: with a language like Hamlet, there's no problem. We can call back to Haskell to handle the heavy lifting that Hamlet doesn't support. It's so tightly bound with the surrounding program that there is virtually no overhead to this. Also, there would be no problem using the strong data types we know and love within Hamlet; in XSLT, we're forced to switch back to weak datatypes. (Example: my Java code uses the File type for getting the file listing; XSLT just sees strings.)

If I were to write an ePub generator in Haskell (which I'll likely be doing in the not-too-distant future), I would use the xml-hamlet package. That would allow me to easily and concisely produce XML values. In other words, I could replace:

[ NodeElement $ Element "foo" [("bar", "baz")]
    [ NodeContent "some content"
    , NodeElement $ Element "bin" [] []
    ]
]
with
[xml|
<foo bar=baz>
    some content
    <bin>
|]

Of course, both of those are infinitely more pleasant than the Java equivalent:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.newDocument();
Element foo = doc.createElement("foo");
doc.appendChild(foo);
foo.setAttribute("bar", "baz");
foo.appendChild(doc.createTextNode("some content"));
foo.appendChild(doc.createElement("bin"));

And just wait till you have to write this to a file. In Haskell, it's the writeFile function. In Java, I pity you.
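
For the record, the standard route in Java is an identity transformation through the TrAX API; a minimal sketch:

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDoc {
    // Serializing a DOM Document requires an identity "transformation":
    // a Transformer with no stylesheet, fed the document as its source.
    static void write(Document doc, File file) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new DOMSource(doc), new StreamResult(file));
    }
}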

Conclusion

  • Java-the-language and Java-the-libraries make a number of simple tasks very complicated.
  • Therefore, we have a number of helper languages, like XSLT and Ant, to avoid the pain of a pure Java solution.
  • However, these tools are all purposely missing some features we'll inevitably need.
  • The result will be either missing proper handling of some features, or a polyglot codebase that misses the point of using Java in the first place; most likely both.
  • There are great use cases for non-general-purpose languages, but they should have tight integration with a real language.
  • Haskell's flexible syntax and strong type system make it very powerful here.
