January 29, 2012

By Michael Snoyman

tl;dr: The new Haddocks are available at: http://www.snoyman.com/haddocks/conduit-0.2.0/index.html

Even though it's relatively young, conduit has gotten a lot of real-world usage, and a fair bit of scrutiny. I think we achieved all of our main objectives with the first release, but that doesn't mean we're going to avoid improvements. I asked the community to give their feedback, and here were the main criticisms I've heard:

1. BufferedSource doesn't feel quite right. One complaint was the name bsourceUnpull, but overall people thought it didn't fit in well with the rest of the package.
2. Usage of mutable variables for storing state is suboptimal.
3. The split between Source and PreparedSource isn't very nice.

While I won't call the first issue fully resolved, I would say that conduit 0.1 was a big step in the right direction. Instead of exposing all the internals of BufferedSource, it's now an abstract type. (This does solve the bsourceUnpull name dislike, though that's obviously a minor point.) Overall, dependent packages have moved away from using BufferedSource in any external APIs. In other words, BufferedSource is intended purely as an internal tool. For example, in Warp, we use BufferedSource to parse the request headers, but then convert it back to a Source to pass to the application for request body reading.

I've been opposed to making any changes for the second issue (mutable variables). My belief was that one of the sources of conduit's simplicity relative to enumerators was its usage of mutable state. And in general, I don't believe in changing something until there's hard evidence that it's actually causing problems.

Last week, however, Felipe Lessa found one such concrete problem: using SequencedSink was very slow. Upon investigation, I determined that the problem came from Sink's monadic bind implementation. For each bind, a new mutable variable was allocated, and it had to be checked to determine its state. Unfortunately, a long chain of binds resulted in quadratic complexity, since each action had to check N variables. This clearly needed to be fixed, but there was no way to do so (that I could see) with the previous types.

So I was presented with a dilemma: either continue down the mutable variable path and try to solve the problem, or go in the pure/CPS direction, where I knew a simpler solution existed. The choice was actually pretty easy: go for the pure approach. I had the following reasons:

The main motivation to avoid the change to CPS was to keep the simplicity of the current approach. However, I was about to lose that simplicity anyway.
Like most Haskellers, I do have an innate dislike for mutable variables.
After more work comparing conduits to enumerators, I've come to believe that the main source of confusion in enumerators is that the data producer (Enumerator) is just a consumer-transformer. Since the essence of Source would stay the same in CPS, I think that this change does not hinder our simplicity.
There was strong reason to believe that GHC would be able to optimize CPS code better than mutable variable code.

So I took the plunge and tried out CPS... and I really like the result! The first change is to SourceResult's Open constructor: instead of just returning a new value, it returns a new value and a new Source. This allows us to pass our state in that new Source. There are similar changes to SinkResult and ConduitResult. After this, I benchmarked the old and new versions, comparing both a monadic-bind-intensive Sink and a Sink without any binds. The former had a ten-fold speedup (not surprising due to the decrease in algorithmic complexity), and the latter had a 20% speedup.
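A minimal sketch of the shape of this change, not the package's actual definitions. It assumes the pull and close actions run directly in the base monad (the real package runs them in a resource-managing monad), and the field names sourcePull and sourceClose, as well as the helpers sourceRange and consume, are made up for illustration. See the Haddocks for the real types.

```haskell
-- Simplified for illustration; the real types carry extra machinery.
data SourceResult m a
    = Open (Source m a) a   -- the next Source (carrying updated state) and a value
    | Closed                -- no more data, and no Source left to pull from

data Source m a = Source
    { sourcePull  :: m (SourceResult m a)
    , sourceClose :: m ()
    }

-- Hypothetical example: count from i up to n.
-- The "state" i lives in the closure of each new Source, not in a mutable variable.
sourceRange :: Monad m => Int -> Int -> Source m Int
sourceRange i n = Source
    { sourcePull = return $
        if i > n then Closed else Open (sourceRange (i + 1) n) i
    , sourceClose = return ()
    }

-- Hypothetical helper: drain a Source into a list,
-- always pulling from the Source returned by the previous pull.
consume :: Monad m => Source m a -> m [a]
consume src = do
    res <- sourcePull src
    case res of
        Closed -> return []
        Open next x -> do
            xs <- consume next
            return (x : xs)
```

Each pull hands back a fresh Source, so there is nothing to look up or check between steps; the state simply travels along with the continuation.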

But that wasn't the end of it. This new approach allows us to get rid of the Prepared family of types. Let's take the sourceFile function as an example, which opens a Handle and reads data from a file. In the old approach, we needed to provide the PreparedSource with the Handle in order for the PreparedSource to read from it. Therefore, we had a Source which opened the Handle and passed it to the PreparedSource. In the new approach, we have a Source that opens a handle, reads some data, and returns a new Source that reads from the Handle.
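To make that concrete, here is a rough, line-based sketch using the same simplified types as assumed elsewhere in this post's examples (plain IO instead of the package's resource-managing monad, so there is no exception safety here). The names sourceFileLines, fromHandle, and drain are hypothetical and are not the package's sourceFile.

```haskell
import System.IO

-- Simplified types for illustration only.
data SourceResult m a = Open (Source m a) a | Closed

data Source m a = Source
    { sourcePull  :: m (SourceResult m a)
    , sourceClose :: m ()
    }

-- The initial Source has no Handle yet. Its first pull opens the file
-- and delegates to a Source that already knows the Handle.
sourceFileLines :: FilePath -> Source IO String
sourceFileLines fp = Source
    { sourcePull = do
        h <- openFile fp ReadMode
        sourcePull (fromHandle h)
    , sourceClose = return ()   -- nothing has been opened, so nothing to close
    }

-- Every later Source closes over the Handle; no separate "prepared" type needed.
fromHandle :: Handle -> Source IO String
fromHandle h = Source
    { sourcePull = do
        eof <- hIsEOF h
        if eof
            then hClose h >> return Closed
            else do
                l <- hGetLine h
                return (Open (fromHandle h) l)
    , sourceClose = hClose h
    }

-- Hypothetical helper: read every line, threading the new Source each time.
drain :: Source IO a -> IO [a]
drain src = do
    res <- sourcePull src
    case res of
        Closed -> return []
        Open next x -> do
            xs <- drain next
            return (x : xs)
```

The Handle is only ever visible to the Sources created after the file is opened, which is what lets the two-stage Source/PreparedSource split collapse into a single type.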

So contrary to my original belief, I think this CPS move actually simplifies conduit greatly.

Another, orthogonal change that I put in was better data types in a few places. Previously, if you wanted to use the sourceState function, and had a pull function that returned Closed, you needed to provide a dummy state value. (If you look through current conduit code bases, you'll see a lot of error calls.) Instead, we now have a specialized data type (ConduitStateResult, name suggestions welcome) that avoids this need. Internally, I also cleaned up a number of the types to enforce invariants at the type level.
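The idea behind the specialized result type can be sketched as follows. This is an illustration under assumptions, not the package's API: the type StateResult, its constructors, and the functions sourceStateSketch, countdown, and drain are invented names, and the Source types are the same simplified ones used for illustration elsewhere.

```haskell
-- Simplified types for illustration only.
data SourceResult m a = Open (Source m a) a | Closed

data Source m a = Source
    { sourcePull  :: m (SourceResult m a)
    , sourceClose :: m ()
    }

-- Illustrative result type for a stateful pull function.
data StateResult state output
    = StateOpen state output   -- new state plus a value
    | StateClosed              -- carries no state, so no dummy value is required

-- Build a Source from an initial state and a pull function over that state.
sourceStateSketch
    :: Monad m
    => state
    -> (state -> m (StateResult state output))
    -> Source m output
sourceStateSketch st pull = Source
    { sourcePull = do
        res <- pull st
        return $ case res of
            StateOpen st' x -> Open (sourceStateSketch st' pull) x
            StateClosed     -> Closed
    , sourceClose = return ()
    }

-- Hypothetical usage: count down from n to 1.
countdown :: Monad m => Int -> Source m Int
countdown n = sourceStateSketch n step
  where
    step 0 = return StateClosed          -- no placeholder state or error call
    step i = return (StateOpen (i - 1) i)

-- Hypothetical helper: collect all output from a Source.
drain :: Monad m => Source m a -> m [a]
drain src = do
    res <- sourcePull src
    case res of
        Closed -> return []
        Open next x -> do
            xs <- drain next
            return (x : xs)
```

Because the closed case has no state field at all, the type checker rules out the "dummy state" pattern instead of relying on convention.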

Speaking of invariants, the final simplification is that we now have just one invariant ruling over the whole package: never reuse a Source, Sink, or Conduit. After you pull from a Source, it will give you a new Source. Do not reuse the original Source. If you get a Closed result, there is no new Source, and therefore you cannot pull again or close the Source.
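Here is a small sketch of what following that invariant looks like in consumer code. It again uses simplified stand-in types running directly in IO, and sourceList, takeAndClose, and demo are hypothetical names for illustration, not functions from the package.

```haskell
import Data.IORef

-- Simplified types for illustration only.
data SourceResult m a = Open (Source m a) a | Closed

data Source m a = Source
    { sourcePull  :: m (SourceResult m a)
    , sourceClose :: m ()
    }

-- A list-backed Source that runs the given action when it is closed early.
sourceList :: IO () -> [a] -> Source IO a
sourceList onClose xs = Source
    { sourcePull = return $ case xs of
        []     -> Closed
        y : ys -> Open (sourceList onClose ys) y
    , sourceClose = onClose
    }

-- Take at most n values, then close whatever Source is current.
takeAndClose :: Monad m => Int -> Source m a -> m [a]
takeAndClose n src
    | n <= 0 = sourceClose src >> return []   -- stop early: close the latest Source
    | otherwise = do
        res <- sourcePull src
        case res of
            Closed -> return []               -- no new Source: nothing to pull or close
            Open next x -> do
                xs <- takeAndClose (n - 1) next   -- continue with next, never src
                return (x : xs)

-- Take two of three values and report whether the Source was closed.
demo :: IO ([Int], Bool)
demo = do
    closed <- newIORef False
    xs <- takeAndClose 2 (sourceList (writeIORef closed True) [1, 2, 3])
    wasClosed <- readIORef closed
    return (xs, wasClosed)
```

Note that src is used exactly once in each call: either it is pulled from or it is closed, and every later step works with the Source that the pull returned.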

I encourage everyone to have a look at the Haddocks and give me your feedback.

When will this be released?

Likely some time this week. I don't have any specific changes in mind right now, outside of name adjustments that are suggested by the community.

How this affects users

Anyone programming against the high-level conduit API exclusively will have no breakage. If you're using functions like sourceIO or sinkState, you'll have minimal changes to use the modified datatypes (essentially changing a few constructors and reordering your arguments). If you're coding directly against the low-level types, you'll need to restructure things a bit to pass around continuations.

Please email me (or preferably the Haskell cafe) if you want some help on converting old conduit code to this new set of types. For the most part, it's a mechanical process, and I can give lots of examples from the code I've already migrated.

How this affects Yesod

Yesod 0.10 will be built off of this new-and-improved conduit. In fact, the code is already updated for it. This likely means that the Yesod release will be about a week later than originally anticipated, maybe in the second week of February.
