Conduits versus enumerators

January 16, 2012

By Michael Snoyman

I've been tempted to write this blog post for a few weeks now. While my blog series on conduits purposely avoided enumerator comparisons, I'm sure the first question on people's minds is, "Why did we need a new solution to the problem?" Now that I've gotten feedback from a number of different people, and since multiple people are looking at novel solutions to the streaming data problem, I think it's time to write this post.

Before we can really address the question, we need to understand the goals we are trying to achieve. We want a solution that will allow us to:

  1. Produce, consume and transform streams of data.
  2. Abstract over the underlying data source and destination.
  3. Handle resources deterministically. This includes exception safety.
  4. Composability: it should be simple to take a few transformers and a consumer and produce a new consumer, for example.
  5. Approachable: it shouldn't take months to master the approach.
  6. Robust: we should be making it difficult for people to shoot themselves in the foot.
  7. Convenient: we don't want a solution that requires radically modifying our code.

The enumerator approach (encompassing the iteratee, enumerator, and iterIO packages, as well as Oleg's original papers) was a breakthrough for solving these issues. It fully addresses issues (1) and (2) above. I would argue it also solves (6) very well. However, it only gives a half solution to (3): while we can deterministically allocate resources in the producer (i.e., the Enumerator), we cannot do so in the consumer (i.e., the Iteratee). As a result, it's impossible to write an exception-safe iterFile function (this has been confirmed in all four variations listed above; I can provide the code if anyone's curious).
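To make the problem concrete, here is a naive (and deliberately not exception-safe) sketch of what an iterFile might look like, written against my recollection of the enumerator API. The point is structural: the hClose only runs if the iteratee is actually fed EOF, and there's nowhere inside the iteratee to attach a bracket around its whole lifetime.

```haskell
import qualified Data.ByteString as S
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Enumerator (Iteratee, Stream (..), continue, yield)
import System.IO (IOMode (WriteMode), hClose, openBinaryFile)

-- Naive sketch: write each incoming chunk to a file.
iterFile :: MonadIO m => FilePath -> Iteratee S.ByteString m ()
iterFile fp = do
    h <- liftIO $ openBinaryFile fp WriteMode
    let step (Chunks cs) = liftIO (mapM_ (S.hPut h) cs) >> continue step
        -- The handle is only closed on a clean EOF. If the Enumerator
        -- throws, or the Iteratee is simply abandoned, hClose never runs.
        step EOF         = liftIO (hClose h) >> yield () EOF
    continue step
```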

(4) also gets only a partial solution in my mind, at least if you consider (7). The use case that finally opened my eyes to the limitations here was producing an HTTP proxy. I won't get into all the gory details, but the basic issue is that enumerator forces a kind of Inversion of Control on its users, making it awkward to interleave different streams. Again, I can provide the code for this for the curious, but it's too involved to be discussed here.

(5) is really a complete failure. I didn't understand enumerators until I wrote a three-part blog series on how to use them, and even to this day, after having written a huge amount of enumerator code, I still get stuck trying to write an Enumeratee. The naming is confusing, and the technique is non-intuitive. Want to produce a stream of data? Then write a function that transforms a Step into an Iteratee. There's a logic to it, but it's not immediate. All of these type synonyms based on a single base type (Iteratee) lead to very difficult-to-parse error messages. (7) also presents problems: it's impossible to catch exceptions in an Iteratee, it's awkward to deal with monad transformer stacks, and so on.
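To give a sense of what I mean by everything being built on a single base type, here is roughly how the core types look in the enumerator package (paraphrased from memory of Data.Enumerator; exact details may vary between versions):

```haskell
import Control.Exception (SomeException)

-- The single base type: an Iteratee is a monadic computation producing a Step.
newtype Iteratee a m b = Iteratee { runIteratee :: m (Step a m b) }

data Step a m b
    = Continue (Stream a -> Iteratee a m b) -- still awaiting input
    | Yield b (Stream a)                    -- finished, with leftover input
    | Error SomeException                   -- failed

data Stream a = Chunks [a] | EOF

-- Producers and transformers are just synonyms over that base type:
-- an Enumerator is a function from a Step to an Iteratee...
type Enumerator a m b = Step a m b -> Iteratee a m b

-- ...and an Enumeratee is an Iteratee that yields another Step.
type Enumeratee ao ai m b = Step ai m b -> Iteratee ao m (Step ai m b)
```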

So after much discussion with others and a lot of thought, I started working on conduits. In my mind, here are the major changes:

  • The names are much more intuitive. I think most people understand Source and Sink immediately, with Conduit only taking a moment longer.
  • Source, Conduit, and Sink are all distinct types, meaning error messages are much clearer. This also means no awkward, unused type variables like we have for Enumerator and Enumeratee.
  • Instead of being a sink transformer, a Source is a type that allows you to pull data from it. One advantage is simplicity. But a more powerful advantage is that there's no inversion of control: if you want, you can just pull data from a source and never write a sink. And if you use the standard connect operator ($$, as I'm guessing everyone does), there is no complicated control flow involved; there's a small example after this list. A side effect of this is that you can actually deal with exceptions fully. (Another minor advantage is that we get nice typeclass instances like Monoid.)
  • There is full control for allocating resources anywhere, via the ResourceT transformer. Not only does this allow us to write code we couldn't write before, it also means we can write all of our code more easily. For example, in the past getting a database connection from a resource pool required using a withPool function, which meant that creating streaming responses from WAI was a huge inversion-of-control exercise. With ResourceT, we can safely "check out" a resource and know it will be returned appropriately.
  • Since we've avoided continuation passing style throughout the types, we can easily modify the monad stacks that our code is living in. One example usage for this would be parsing some data, introducing a new monad transformer (e.g., a ReaderT holding the parsed data) for some internal computation, unwrapping the transformer, and continuing computation. I tried, and failed, to implement a general purpose solution to this problem in enumerator.
  • To deal with the remaining inversion of control cases, we've introduced buffered sources. This is a direct outcome of the third point above, and means massively simplified APIs (compare http from http-enumerator and http from http-conduit).
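As a tiny, concrete illustration of the connect operator and ResourceT working together, copying a file looks something like the following. This is a sketch: the file paths are placeholders, and the exact module layout has moved around between conduit releases.

```haskell
import Control.Monad.Trans.Resource (runResourceT)
import Data.Conduit (($$))
import Data.Conduit.Binary (sinkFile, sourceFile)

-- Connect a Source to a Sink with $$. ResourceT tracks both file handles
-- and guarantees they are released, even if an exception interrupts the
-- transfer partway through.
main :: IO ()
main = runResourceT $ sourceFile "input.txt" $$ sinkFile "output.txt"
```

There's no callback or inversion of control anywhere in sight: you connect the pieces and run them.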

Vague criticisms of the conduit package notwithstanding, the only real downside I know of to the approach is its reliance on mutable state. I've considered reworking parts of the codebase to get rid of the mutable state, but have decided against it, because:

  • Besides a vague "I don't like it," no one has shown me a concrete problem with our usage of mutable variables.
  • Since we allow ST, you can still use conduits from pure code. Besides, the vast majority of conduit code will live in IO anyway.
  • I think the mutable state makes the internals of conduit much more approachable. It's very easy to see what's going on.
  • We'd have to replace the mutable state with some form of CPS, which would likely destroy some of our other advantages (e.g., easy monad stack modification).

So that's my overall conduit vs enumerator breakdown. I would like to point out that not only have a huge percentage of the enumerator-based packages out there been successfully ported to conduit (and simplified in the process!), but we have even added new packages that never existed in enumerator. Conduits may be young, but they are production ready (I'm shipping conduit code to clients already), and far beyond any proof-of-concept stage. They are a viable alternative to enumerators.

Other alternatives

In the first paragraph of this post, I linked to three other alternate approaches. Firstly, none of these address the resource allocation issue. While a few authors have pointed out that they could reuse the ResourceT approach to do so, that hasn't actually been done, and therefore, as they stand, they do not solve the problem.

One of the approaches is actually very close in spirit to conduits. The main difference is its avoidance of mutable state. You can actually get a very good idea about the changes necessary to get rid of mutable state in conduits by reading through that post. I still believe that such a change is not only unnecessary, but would be detrimental.

The other two approaches use coroutines as a basis. I'll focus on pipes, as it got much more attention on Reddit. While I find the approach beautiful, I don't find it practical. It's actually a step backwards from enumerators, as it doesn't address the resource allocation issue at all. It's very nice that it fits snugly into a Category, but I don't think people are going to rush to restructure their code to fit the Category approach. Additionally, while the author claims that all error handling is orthogonal to the package, I don't think it's possible to catch exceptions within a Pipe at all, which makes it a deal breaker for me.

But probably my biggest concern with this package is its distinction between lazy and strict composition. The tutorial makes it clear that if you use the wrong one, your code might not terminate, or might never free an allocated file. (I don't think exceptions were ever considered here.) This screams out to me that the paradigm is brittle and will not compose well in any large project. I think this is something that would become immediately apparent as soon as someone starts writing real code with pipes. (Going back to the original seven points in this post, I'm saying pipes fails at (6).)

My point here is not to pick on pipes: I think it's an interesting concept, and I really do like the code. My point is more fundamental: the problems we are solving are not simple, elegant problems. We're dealing with the real world, where ugliness like asynchronous exceptions is the norm. Elegant solutions are wonderful, but we can't have that elegance at the cost of correctness. Any solution to the problems at hand needs to take into account all the bad stuff that can happen.
