Summarizing the conduit questions

March 28, 2012

GravatarBy Michael Snoyman

My last blog post on the prosposed changes to conduit got quite a discussion on Reddit. I really appreciate everyone's input, thank you! In the process of the discussion, a number of questions came up, and I'd like to summarize them here.

By the way, if anyone thinks I'm spending an inordinate amount of time on this stuff... it's because I am :). This is the last remaining blocking issue for the Yesod 1.0 release, so I'm trying to make a decision on this stuff quickly. I don't want to make a bad decision under pressure of course, but if we can come to some conclusions in the next week, it would be very nice to be able to use the shiny new conduit 0.4 in Yesod 1.0.

Most importantly: is this a good change?

The first question is the most important. Is the move towards a single datatype overall a good thing or a bad thing? Obviously the advantages are strong: a single set of instances to maintain, a single fusion operator, less constructors to deal with. However, we need to accept that there are also downsides: error messages are more confusing and code needs to deal with meaningless constructors (e.g., HaveOutput for Sink).

After playing around with this quite a bit, I'm strongly leaning towards saying that the benefits outweigh the costs. The clincher for me was when I was able to reimplement sequence and sequenceSink, and the two functions basically disappeared. Compare the old and new versions.

I'm still hoping to hear from some conduit users on this to make sure the changes won't be off-putting, but I think it's almost certain that the code well be merged into master.

Side point: newtypes?

The idea came up of using newtypes for the Source, Sink and Conduit types to try and have a best of both worlds. I still think it would be beneficial (better error messages, and a nicer Functor instance for Source and Conduit), but at least for now, I think it's too much overhead to have to wrap/unwrap everywhere. There's also an argument to be made that a newtype would hide away the "true nature" of our types, though I'm still on the fence as to whether users should be confronted with the fact that the three types are unified.

Type for second record in NeedInput

The NeedInput constructor has two records. The first takes some input and returns a new Pipe. The second is for indicating that no input is available. Unlike the early termination records for HaveOutput and PipeM, this record gives a Pipe, since it's feasible that we may want to output more values after we've run out of input (the typical example here is a decompression Conduit).

Said another way: the early termination for HaveOutput and PipeM can only ever be called when the upstream Pipe closes, not when the downstream Pipe closes.

Anyway, the idea of this record is that it can't receive any input, since once it's been called, we know that the upstream pipe has closed and won't be providing any more input. There are two ways we could model this: set the i parameter to (), or set it to Void.

However, there's also a third approach: keeping i as it was before. Since we'll never be providing any more input to this Pipe, it's completely irrelevant what the i parameter is. If the Pipe ever requests more input, we'll just call the early termination Pipe again anyway. The advantage to this third approach is that it simplifies some of the internal code, since we don't need to juggle different parameters.

I'm leaning towards the third approach, but all three seem equally feasible.

Separate Leftover constructor?

There's a bit of an inconsistency, in that the Done constructor performs two actions: it returns a result, and gives back any leftover input. Also confusing is that we can only have 0 or 1 leftover values.

We could address the second issue by changing the Maybe to a list. Alternatively, we could solve both issues by introducing a new Leftover constructor, and modifying the Done constructor, like so:

| Done r
| Leftover (Pipe i o m r) i

I've put this in a branch on Github, and it certainly works. However, I think I'm most comfortable leaving code as-is:

  • We don't have the concept of chunking (i.e., dealing with more than one value at a time) anywhere else in the type, so why should the leftovers be different?
  • Right now, we have a nice invariant expressed in the types that you can only return leftovers when computation is complete. I think I like that setup.

Source: to Void or not to Void?

In order to ensure that a Sink never yields output, we set the o parameter to Void. Initially, we set the i parameter on a Source to (), so that runPipe can just provide an infinite stream of unit values.

However, we can just as well set the i parameter to Void, and then call the no-more-input record of NeedInput. I'm not going to try and summarize the arguments back and forth on this one, because there are a lot of them. I will say that I'm leaning towards Void, just because it gives a very nice parallel between Source and Sink.

Fuse operators: unify?

There's now a fusion function (pipe) which can fuse together Sources, Sinks and Conduits. All three fusion operators ($=, =$, and =$=) are simply type-constrained wrappers around it. ($$ also utilizies pipe, but it also calls runPipe on the generated Pipe.) The question is: do we need all three operators, or should we have just one?

The advantage of separate operators is clearer error messages, and more explicit code. However, it hides the fact that all three types are really one and the same. (Again, I'm ambivalent as to whether that hiding is a feature or a bug.) It also means that people have to learn more names.

So should we have a unified fusion operator? And what would it be called?

Note: either way, this next release will still contain the other three, type-constrained fusion operators, if only to ease migration. If we add a unified operator, it would be in addition to those three for now, and likely after a few point releases we would deprecate the three operators.

Bikeshedding: rename the $$& operator

This is likely the easiest. I call $$& the connect-and-resume operator, and it connects a Source to a Sink, gets a result, and also gives back the most recent state of the Source. This allows us to continue computation.

Frankly, I chose a pretty bad name for the operator. (In my defense, I did that on purpose to make sure I didn't become attached to it.) Some other ideas that have been floated are $$- and $$+. I feel no particular drive one way or another here.


comments powered by Disqus