A More Powerful Conduit

March 2, 2012

By Michael Snoyman

I've gotten a few emails recently (most recently from Nathan Howell) about a shortcoming of the Conduit datatype. The issue is that a Conduit can only produce output when it is pushed to. However, if you have a Conduit that could produce a large amount of output for a single input (e.g., a decompressor), this could become memory inefficient.

I came up with a simple solution: allow a Conduit to return a stream of outputs for a single input. In code, this turns into just a single additional constructor for the ConduitResult type:

HaveMore (m (ConduitResult input m output)) (m ()) [output]

I'll go into more detail on the m () bit below, but it says how to close the Conduit early.
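To make the shape of the new constructor concrete, here is a minimal sketch of the types involved. The record layout of Conduit and the Producing and Finished constructors are a simplified model and may differ from the real library; drain and expand are hypothetical names introduced only for this illustration.

```haskell
-- Simplified model of the types discussed here (an assumption, not
-- the library's exact definitions).
data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

data ConduitResult input m output
    = Producing (Conduit input m output) [output]
    | Finished (Maybe input) [output]
    -- New: some output now, an action to get the next result, and an
    -- action to close early.
    | HaveMore (m (ConduitResult input m output)) (m ()) [output]

-- Hypothetical helper: collect everything a single push produces by
-- following HaveMore until the conduit stops producing.
drain :: Monad m => ConduitResult input m output -> m [output]
drain (Producing _ out) = return out
drain (Finished _ out)  = return out
drain (HaveMore pull _close out) = do
    next <- pull
    rest <- drain next
    return (out ++ rest)

-- Hypothetical conduit: for an input n it emits n, n-1, ..., 1, one
-- element per HaveMore, so the whole output never sits in one list.
expand :: Monad m => Conduit Int m Int
expand = Conduit push (return [])
  where
    push n = return (go n)
    go k
        | k <= 0    = Producing expand []
        | otherwise = HaveMore (return (go (k - 1))) (return ()) [k]
```

With these definitions, conduitPush expand 3 >>= drain yields [3,2,1]. A real consumer would of course handle each chunk as it arrives rather than concatenating them as drain does.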

Any time you push to a conduit, it can now say "here's some output, and more is on the way." I've implemented this, and I'm happy with this solution. However, I want to make it better.

There's one way to do it

With the previous API, there was only one way to encode each operation. If you wanted to implement a map, you had to use the Producing constructor with a single element list for the output. A concatMap would look something like:

push input = return $ Producing (Conduit push close) (f input)
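For context, that push function could sit inside a complete definition along the following lines. This is a sketch against a simplified model of the previous API (without HaveMore); concatMapC, mapC and feed are hypothetical names chosen for this example, not functions from the library.

```haskell
-- Simplified model of the previous API (an assumption).
data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

data ConduitResult input m output
    = Producing (Conduit input m output) [output]
    | Finished (Maybe input) [output]

-- concatMap: every input becomes a (possibly empty) list of outputs.
concatMapC :: Monad m => (input -> [output]) -> Conduit input m output
concatMapC f = Conduit push close
  where
    push input = return $ Producing (Conduit push close) (f input)
    close = return []

-- map is concatMap with a single element list.
mapC :: Monad m => (input -> output) -> Conduit input m output
mapC f = concatMapC (\x -> [f x])

-- Hypothetical driver: push each input in turn, then close, and
-- collect all of the output.
feed :: Monad m => Conduit input m output -> [input] -> m [output]
feed c [] = conduitClose c
feed c (x:xs) = do
    res <- conduitPush c x
    case res of
        Producing c' out -> do
            rest <- feed c' xs
            return (out ++ rest)
        Finished _ out -> return out
```

Here feed (concatMapC (\x -> [x, x])) [1, 2] produces [1,1,2,2].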

However, we now have at least two other ways to encode the same thing:

  1. Return a HaveMore constructor which contains all of the output, which will then return the Producing constructor to allow the Conduit to continue.

  2. Return the elements one at a time via HaveMore.
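The three encodings can be put side by side in a small sketch. The types are a simplified model of the ones in this post, and allInProducing, oneHaveMore, elementwise, drain and idle are hypothetical names used only here.

```haskell
-- Simplified model of the types, including HaveMore (an assumption).
data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

data ConduitResult input m output
    = Producing (Conduit input m output) [output]
    | Finished (Maybe input) [output]
    | HaveMore (m (ConduitResult input m output)) (m ()) [output]

-- Three ways to hand the same outputs downstream, given the conduit
-- that should handle the next input.
allInProducing, oneHaveMore, elementwise
    :: Monad m => Conduit i m o -> [o] -> ConduitResult i m o

-- The original encoding: everything in Producing.
allInProducing next out = Producing next out

-- Option 1: everything in a single HaveMore, then an empty Producing.
oneHaveMore next out =
    HaveMore (return (Producing next [])) (return ()) out

-- Option 2: one HaveMore per element, then an empty Producing.
elementwise next []     = Producing next []
elementwise next (o:os) =
    HaveMore (return (elementwise next os)) (return ()) [o]

-- Hypothetical helper: follow HaveMore and collect all output.
drain :: Monad m => ConduitResult i m o -> m [o]
drain (Producing _ out) = return out
drain (Finished _ out)  = return out
drain (HaveMore pull _close out) = do
    rest <- pull >>= drain
    return (out ++ rest)

-- Hypothetical placeholder conduit that finishes immediately.
idle :: Monad m => Conduit i m o
idle = Conduit (\_ -> return (Finished Nothing [])) (return [])
```

Draining any of the three results for the same output list gives back that list, which is exactly why every consumer has to be prepared for all three shapes.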

Having these multiple approaches makes the internals of the library a bit ugly, and since there are multiple codepaths, it increases the likelihood of bugs. I also think it's difficult for new users to see so many options.

There are two separate issues at play, so let's deal with them separately.

All constructors can return output

In the current setup, all three constructors can return output. This was necessary previously, but no longer. If we removed the [output] field from both Producing and Finished, then a user would be forced to use HaveMore when they want to return output.

My concern here is complicating library usage. A previously simple function like map would now require jumping through a few extra hoops. We could address this by keeping the same higher-level interface we had before in conduitState and conduitIO. The downside would be a mismatch between the low-level and high-level APIs.
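As a rough illustration of those extra hoops, here is how map might look if Producing and Finished no longer carried output. This is a sketch of the hypothetical design, not an implemented API; mapC and feed are names invented for the example.

```haskell
-- Hypothetical variant: only HaveMore can carry output.
data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

data ConduitResult input m output
    = Producing (Conduit input m output)
    | Finished (Maybe input)
    | HaveMore (m (ConduitResult input m output)) (m ()) [output]

-- map now has to wrap its single output in a HaveMore whose
-- continuation immediately goes back to Producing.
mapC :: Monad m => (input -> output) -> Conduit input m output
mapC f = Conduit push close
  where
    push input = return $
        HaveMore (return (Producing (Conduit push close)))
                 (return ())
                 [f input]
    close = return []

-- Hypothetical driver: push each input, follow HaveMore, then close.
feed :: Monad m => Conduit i m o -> [i] -> m [o]
feed c [] = conduitClose c
feed c (x:xs) = conduitPush c x >>= go
  where
    go (Producing c') = feed c' xs
    go (Finished _)   = return []
    go (HaveMore pull _close out) = do
        rest <- pull >>= go
        return (out ++ rest)
```

The behaviour is unchanged (feed (mapC (+ 1)) [1, 2, 3] still gives [2,3,4]), but the definition is noticeably noisier than the Producing-based version.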

To chunk, or not to chunk

Another question is chunking. Previously, returning a list of outputs was necessary, since we only had one chance to return output. Now, however, we could just return successive HaveMores. This has the downside of, once again, complicating some implementations. It has an additional downside that it might hurt performance. On the flip side, it may improve performance in some cases, since it would be impossible to return empty lists in a HaveMore.

Should closing give a Source?

And as long as we're on the subject of change, let's look at closing a Conduit. This applies in two circumstances: the feeding source closed, or the consuming sink closed. If the feeding source closes, we want to have an opportunity to produce a bit more output. This is necessary, for example, in the case of compression: we want to build up large chunks of compressed data and then generate output. But the last chunk of output has to be manually flushed once we know there's no more input.

On the flip side, if the consuming sink closes, we don't need to produce any more output as it won't be used. If you look at the definition of HaveMore above, it has a field m (), which is how it's closed. This doesn't allow for any new output to be produced, because a HaveMore would only ever be closed if the consuming Sink closed.

At this point, I see two problems with the way conduitClose works:

  • When closing a Conduit, you can only return a single chunk of values, not a stream of values. I can't think of a use case where you would return a large quantity of output from closing, but this limitation does bother me.

  • In the case of a closed sink, the conduit will still try to produce some extra output which may never be used.

There's an easy solution to both problems: closing a Conduit returns a Source, which provides the last set of data. In the case of a closed Sink, the conduit functions would simply call sourceClose immediately. In the case of large output, we could take advantage of Source's natural streaming abilities.
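A minimal sketch of the idea, assuming a simplified Source type with sourcePull and sourceClose fields. The names sourceList, closeWith, flushAll and discard are hypothetical and exist only for this illustration.

```haskell
-- Simplified model of a Source (an assumption about its shape).
data Source m a = Source
    { sourcePull  :: m (SourceResult m a)
    , sourceClose :: m ()
    }

data SourceResult m a = Open (Source m a) a | Closed

-- Hypothetical helper: a Source that yields the elements of a list.
sourceList :: Monad m => [a] -> Source m a
sourceList xs = Source pull (return ())
  where
    pull = case xs of
        []     -> return Closed
        (y:ys) -> return (Open (sourceList ys) y)

-- Hypothetical close action of a buffering conduit: instead of
-- returning the pending output as a list, it returns a Source.
closeWith :: Monad m => [o] -> m (Source m o)
closeWith pending = return (sourceList pending)

-- The feeding source closed: stream out whatever is left.
flushAll :: Monad m => m (Source m o) -> m [o]
flushAll close = close >>= go
  where
    go src = do
        res <- sourcePull src
        case res of
            Closed      -> return []
            Open src' o -> do
                rest <- go src'
                return (o : rest)

-- The consuming sink closed: nobody wants the output, so close the
-- Source right away without pulling from it.
discard :: Monad m => m (Source m o) -> m ()
discard close = close >>= sourceClose
```

In this model the same close action serves both cases: flushAll pulls the remaining output one element at a time, while discard never pulls at all.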

Feedback wanted

I'm writing this post in hope of getting some good feedback from people. Is my desire for one-way-to-do-things worthwhile, or is it better to complicate the internals of the library in exchange for potentially simpler user code? Does anyone have recommendations for better names for any of the constructors?

Postscript: prior art

While working on this, I reviewed two alternate approaches: enumerator and pipes. Let me explain why I can't reuse their solutions:

  • The Enumeratee type from enumerator is very powerful, much more so than a Conduit. It is a general purpose Iteratee-transformer, capable of doing lots of crazy stuff. That's exactly what I want to avoid for conduit: implementing an Enumeratee is far more complicated than implementing a Conduit, since it requires thinking directly about the inner Iteratee. The simplicity of a Conduit comes from the fact that it is a standalone unit.

  • As usual, pipes looks like a simple, elegant solution. But the big thing it's lacking is proper resource management. Notice how much thought goes into Conduit to ensure that all resources are closed as early as possible, even in the case of early termination. It's true that by using ResourceT, pipes is able to avoid completely losing scarce resources, but holding onto a file handle for too long is not much better. I see no way to adapt any of pipes's approaches to conduit and still maintain our strict resource management.
