Efficient YAML Parsing
January 12, 2010
A few weeks ago, I made some changes so Atom feeds would be generated using HtmlObject, and suddenly I was running out of memory. At the time I made a few simple strictness hacks and set it aside for later.
Now that the release is nearing, I've started attacking memory leaks. The HtmlObject bug was fairly easily solved (it was using too much list appending), but I realized I was unhappy with the performance of my YAML library.
Just as a quick intro, this YAML library includes a binding to libyaml and converts the result to be a Data.Object.Object. The problem with this approach is that, anytime you read the file, it reads and stores the entire contents in memory, even if you only need to collect, say, the names of the entries.
A simple solution
A fairly simple solution to this problem is to run a program to split the single large YAML files into different views. For example, one file for each entry, and then a separate file with the list of all entries. This is most likely something I will want to do regardless, since it will reduce disk read times. Nonetheless, the library should be made more efficient.
In order to test things out, I decided to go for a ridiculously simple benchmark: count how many entries are in my YAML file. You can download the sample file here.
The main issue I'm interested in here is memory, not processing, so I will only mention those results. In particular, I consider the most important number total memory in use as reported by "+RTS -sstderr". If someone know a reason to trust another number instead, please let me know.
The original version of the library ended up using 7MB of memory. Considering the fact that the entire YAML file is only 200KB, there's a lot of fat to trim.
The main issue, however, is the fact that I'm reading the entire file into memory. Let's see how to address that.
I at first considered using unsafeInterleaveIO to get out of this predicament. The library produced a list of events ([Event]). However, I had plenty of reasons to avoid this:
- I'm calling out to a C library, which made things very hairy. There were all kinds of resource management issues to beware of.
- Any step along the way could have generated a failure from the C library, and I didn't want those popping up in pure code. I could have changed the type to [Either YamlException Event], but it seemed like an uphill battle.
- I haven't seen examples of unsafeInterleaveIO used in this context, and I wasn't certain how stable it would be.
So I decided that it would make most sense to have a left-fold enumerator here. In this setup, I would provide the library with a function and accumulating parameter, and the library handles all memory management and exception issues. Others have discussed this topic at-length, so I won't get into more details for now.
In this process, I decided to split the library into two pieces: yaml would be the low-level binding providing the left-fold enumerator interface, while data-object-yaml would provide high-level conversions to data-object types.
The results so far are promising. I now have two benchmarks: the easy and efficient one. The easy benchmark follows the same procedure as before: convert the entire YAML file into an Object and then count the entries. After the rewrite, this benchmark takes 5 MB. Still needs lots of work, but I was gratified to see that I shaved off 2 MB without focusing on this area.
However, the efficient benchmark is where we see the real possibilities of a left-fold enumerator approach. This code generates no intermediate data structure, it simply counts up the entries. Here's some number comparisons between the two benchmarks:
|Total memory used||5-6MB||1MB|
|Maximum residency (bytes)||3,545,840||7,632|
It appears to me that this version runs in constant space, which is what I would hope for.
The downside to the efficient method is that it's rather tedious to program. However, I think it should be relatively straight-forward to develop a parsing framework to ease this burden. After all, we're dealing with incredibly simple data structures here: scalars, sequences and mappings.