I've gotten a few emails recently (most recently from Nathan Howell) about a shortcoming of the Conduit datatype. The issue is that a Conduit can only produce output when it is pushed to. However, if you have a Conduit that could produce a large amount of output for a single input (e.g., a decompressor), this could become memory inefficient.
I came up with a simple solution: allow a Conduit to return a stream of outputs for a single input. In code, this turns into just a single additional constructor:
HaveMore (m (ConduitResult input m output)) (m ()) [output]
I'll go into more detail on the m () bit below, but it says how to close the conduit.
Any time you push to a conduit, it can now say "here's some output, and more is on the way." I've implemented this, and I'm happy with this solution. However, I want to make it better.
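To make the idea concrete, here is a minimal sketch of how a decompressor-like conduit could use HaveMore. It is a simplified model rather than the library's real definitions: the ResourceT layer is left out, the record field names are assumed, and expand and feed are hypothetical names introduced only for this illustration.

```haskell
import Data.Functor.Identity (runIdentity)

-- Simplified model of the result type with the new constructor.
data ConduitResult input m output
    = Producing (Conduit input m output) [output]
    | Finished (Maybe input) [output]
    | HaveMore (m (ConduitResult input m output)) (m ()) [output]

data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

-- Hypothetical run-length expander: (n, x) becomes n copies of x,
-- emitted in chunks of at most chunkSize elements via HaveMore.
expand :: Monad m => Int -> Conduit (Int, a) m a
expand chunkSize = self
  where
    size = max 1 chunkSize -- guard against non-positive chunk sizes
    self = Conduit push (return [])
    push (n, x) = go n
      where
        go remaining
            | remaining <= size =
                return $ Producing self (replicate remaining x)
            | otherwise = return $
                HaveMore (go (remaining - size)) (return ()) (replicate size x)

-- Hypothetical driver: feeds inputs one by one and records each chunk
-- as it is produced, pulling on HaveMore until the conduit wants input.
feed :: Monad m => Conduit i m o -> [i] -> m [[o]]
feed (Conduit _ close) [] = fmap (: []) close
feed (Conduit push _) (i : is) = push i >>= drain
  where
    drain (Producing next out)  = fmap (out :) (feed next is)
    drain (Finished _ out)      = return [out]
    drain (HaveMore pull _ out) = fmap (out :) (pull >>= drain)

example :: [String]
example = runIdentity (feed (expand 2) [(5, 'a'), (1, 'b')])
```

In this model, example evaluates to ["aa","aa","a","b",""]: the first input is delivered as three small chunks instead of one large list, and the trailing empty chunk comes from closing the conduit.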
There's one way to do it
With the previous API, there was only one way to encode each operation. If you wanted to implement a map, you had to use the Producing constructor with a single-element list. A concatMap would look something like:
push input = return $ Producing (Conduit push close) (f input)
However, we now have at least two other ways to encode the same thing:
Return a HaveMore constructor which contains all of the output, which will then return the Producing constructor to allow the conduit to continue.
Return the elements one at a time via successive HaveMore constructors.
Having these multiple approaches makes the internals of the library a bit ugly, and since there are multiple codepaths, it increases the likelihood of bugs. I also think it's difficult for new users to see so many options.
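The three encodings can be put side by side in a small sketch. As before, this uses a simplified model of the types (no ResourceT, assumed field names), and viaProducing, viaHaveMoreAll, viaHaveMoreEach and runPure are hypothetical names used only for illustration.

```haskell
import Data.Functor.Identity (Identity, runIdentity)

data ConduitResult input m output
    = Producing (Conduit input m output) [output]
    | Finished (Maybe input) [output]
    | HaveMore (m (ConduitResult input m output)) (m ()) [output]

data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

-- Encoding 1: the original style, all output carried by Producing.
viaProducing :: Monad m => (i -> [o]) -> Conduit i m o
viaProducing f = self
  where
    self = Conduit push (return [])
    push input = return $ Producing self (f input)

-- Encoding 2: one HaveMore holding all output, then an empty Producing.
viaHaveMoreAll :: Monad m => (i -> [o]) -> Conduit i m o
viaHaveMoreAll f = self
  where
    self = Conduit push (return [])
    push input =
        return $ HaveMore (return (Producing self [])) (return ()) (f input)

-- Encoding 3: one HaveMore per element, then an empty Producing.
viaHaveMoreEach :: Monad m => (i -> [o]) -> Conduit i m o
viaHaveMoreEach f = self
  where
    self = Conduit push (return [])
    push input = go (f input)
    go []       = return $ Producing self []
    go (x : xs) = return $ HaveMore (go xs) (return ()) [x]

-- Hypothetical pure driver that concatenates everything a conduit emits.
runPure :: Conduit i Identity o -> [i] -> [o]
runPure conduit = runIdentity . go conduit
  where
    go c []       = conduitClose c
    go c (i : is) = conduitPush c i >>= drain is
    drain is (Producing next out)  = fmap (out ++) (go next is)
    drain _  (Finished _ out)      = return out
    drain is (HaveMore pull _ out) = fmap (out ++) (pull >>= drain is)
```

All three produce the same stream for the same inputs, which is exactly the redundancy in question: any code that consumes a ConduitResult has to handle every one of these shapes.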
There are two separate issues at play, so let's deal with them separately.
All constructors can return output
In the current setup, all three constructors can return output. This was necessary previously, but no longer. If we removed the [output] field from both Producing and Finished, then a user would be forced to use HaveMore when they want to return output.
My concern here is complicating library usage. A previously simple function like map would now require jumping through a few extra hoops. We could address this by keeping the same higher-level interface we had before in conduitIO. The downside would be a mismatch between the low-level and high-level APIs.
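A rough sketch of what this trade-off might look like, again over a simplified model of the types. The slimmed-down constructors follow the proposal above; simpleConduit is a hypothetical helper standing in for a higher-level interface and is not meant to reflect the actual signature of conduitIO.

```haskell
import Data.Functor.Identity (Identity, runIdentity)

-- Proposed variant: only HaveMore carries output.
data ConduitResult input m output
    = Producing (Conduit input m output)
    | Finished (Maybe input)
    | HaveMore (m (ConduitResult input m output)) (m ()) [output]

data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m [output]
    }

-- Low-level map: the extra hoop is wrapping the output in a HaveMore
-- whose continuation immediately asks for more input.
mapLowLevel :: Monad m => (i -> o) -> Conduit i m o
mapLowLevel f = self
  where
    self = Conduit push (return [])
    push input =
        return $ HaveMore (return (Producing self)) (return ()) [f input]

-- Hypothetical higher-level helper that keeps the old "push returns a
-- list" shape and hides the HaveMore wrapping.
simpleConduit :: Monad m => (i -> m [o]) -> m [o] -> Conduit i m o
simpleConduit push close = self
  where
    self = Conduit push' close
    push' input = do
        out <- push input
        return $ HaveMore (return (Producing self)) (return ()) out

mapHighLevel :: Monad m => (i -> o) -> Conduit i m o
mapHighLevel f = simpleConduit (\i -> return [f i]) (return [])

-- Hypothetical pure driver that concatenates everything a conduit emits.
runPure :: Conduit i Identity o -> [i] -> [o]
runPure conduit = runIdentity . go conduit
  where
    go c []       = conduitClose c
    go c (i : is) = conduitPush c i >>= drain is
    drain is (Producing next)      = go next is
    drain _  (Finished _)          = return []
    drain is (HaveMore pull _ out) = fmap (out ++) (pull >>= drain is)
```

The helper keeps user code short, but it also shows the mismatch: what the user writes (a function returning a list) no longer resembles what the low-level type says.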
To chunk, or not to chunk
Another question is chunking. Previously, returning a list of outputs was necessary, since we only had one chance to return output. Now, however, we could just return successive HaveMores. This has the downside of, once again, complicating some implementations. It has the additional downside that it might hurt performance. On the flip side, it may improve performance in some cases, since it would be impossible to return empty lists.
Should closing give a Source?
And as long as we're on the subject of change, let's look at closing a
Conduit. This applies in two circumstances: the feeding source closed, or the consuming sink closed. If the feeding source closes, we want to have an opportunity to produce a bit more output. This is necessary, for example, in the case of compression: we want to build up large chunks of compressed data and then generate output. But the last chunk of output has to be manually flushed once we know there's no more input.
On the flip side, if the consuming sink closes, we don't need to produce any more output as it won't be used. If you look at the definition of HaveMore above, it has a field m (), which is how it's closed. This doesn't allow for any new output to be produced, because a HaveMore would only ever be closed if the consuming sink closes.
At this point, I see two problems with the way closing works:
When closing a Conduit, you can only return a single chunk of values, not a stream of values. I can't think of a use case where you would return a large quantity of output from closing, but this limitation does bother me.
In the case of a closed sink, the conduit will still try to produce some extra output which may never be used.
There's an easy solution to both problems: closing a Conduit returns a Source, which provides the last set of data. In the case of a closed Sink, the conduit functions would simply call sourceClose immediately. In the case of large output, we could take advantage of Source's natural streaming abilities.
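Here is a minimal sketch of how the two closing paths could behave if closing handed back a Source. The Source and SourceResult types below are a simplified model (no ResourceT), and flushSource, closeAfterUpstream, closeAfterSink and demo are hypothetical names used only for this illustration.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Simplified model of a source: pull one value at a time, or close early.
data SourceResult m a = Open (Source m a) a | Closed

data Source m a = Source
    { sourcePull  :: m (SourceResult m a)
    , sourceClose :: m ()
    }

-- Hypothetical "final flush" source that logs what happens to it.
flushSource :: IORef [String] -> [a] -> Source IO a
flushSource logRef = go
  where
    go xs = Source (pull xs) (modifyIORef logRef ("closed early" :))
    pull []       = return Closed
    pull (x : xs) = do
        modifyIORef logRef ("pulled" :)
        return (Open (go xs) x)

-- Upstream finished: stream out everything the closing source provides.
closeAfterUpstream :: Monad m => m (Source m o) -> m [o]
closeAfterUpstream close = close >>= drain
  where
    drain src = sourcePull src >>= step
    step Closed        = return []
    step (Open next x) = fmap (x :) (drain next)

-- Sink finished: nobody wants the output, so close the source right away.
closeAfterSink :: Monad m => m (Source m o) -> m ()
closeAfterSink close = close >>= sourceClose

-- Runs both paths and returns the flushed output plus each event log.
demo :: IO (String, [String], [String])
demo = do
    logA <- newIORef []
    flushed <- closeAfterUpstream (return (flushSource logA "abc"))
    eventsA <- readIORef logA
    logB <- newIORef []
    closeAfterSink (return (flushSource logB "abc"))
    eventsB <- readIORef logB
    return (flushed, reverse eventsA, reverse eventsB)
```

In the first path every element is pulled and delivered; in the second, nothing is pulled at all and the source is only told to clean up, which addresses both of the problems listed above.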
I'm writing this post in hope of getting some good feedback from people. Is my desire for one-way-to-do-things worthwhile, or is it better to complicate the internals of the library in exchange for potentially simpler user code? Does anyone have recommendations for better names for any of the constructors?
Postscript: prior art
While working on this, I reviewed two alternate approaches: enumerator and pipes. Let me explain why I can't reuse their solutions:
enumerator is very powerful, much more so than a Conduit. It is a general purpose Iteratee-transformer, capable of doing lots of crazy stuff. That's exactly what I want to avoid for conduit: implementing an Enumeratee is far more complicated than implementing a Conduit, since it requires thinking directly about the inner Iteratee. The simplicity of a Conduit comes from the fact that it is a standalone unit.
As usual, pipes looks like a simple, elegant solution. But the big thing it's lacking is proper resource management. Notice how much thought goes into Conduit to ensure that all resources are closed as early as possible, even in the case of early termination. It's true that by using
pipes is able to avoid completely losing scarce resources, but holding onto a file handle for too long is not much better. I see no way to adapt any of pipes's approaches to conduit and still maintain our strict resource management.