Conduits: Simplifying ResourceT

February 23, 2012

By Michael Snoyman

This blog post discusses the conduit package. If you are not familiar with it, you can read up on it in the Yesod book chapter on conduits.

Earlier today, Oleg sent an email to the Haskell cafe about regions. Yves Parès sent a response that linked to resource-simple, a package I had not heard of until then. Reading the description on the page reminded me of one of the earlier decisions I made in designing conduits. I'd like to explain that decision, and then explain how we can work around it.

Originally, I had intended that all conduits would live in the IO monad. This is a fair assumption: the majority of the time, we want to use conduits to perform some kind of I/O (otherwise, why not just use lazy lists?). So for my first stab at the problem, I designed a ResourceT transformer that always assumed an IO monad for its base. Then, all three data types in conduit (Source, Sink, and Conduit) assumed that their actions lived in the ResourceT transformer so that they could safely acquire resources.

However, this IO assumption can be limiting. There are plenty of sources, sinks, and conduits which perform no resource allocation at all, and we would like to be able to access them from pure code. For example, xml-conduit provides a great parser and renderer for XML documents; it would be a shame to only be able to access it from IO. We could of course use unsafePerformIO, but we don't mention that in polite company.

I created an elaborate typeclass system around ResourceT, which would allow us to build monad stacks around both IO and ST. Then we could run ST from pure code, with no need to touch that unsafe stuff!

Unfortunately, there are a few downsides to this approach:

ResourceT doesn't really make sense for ST. You can't safely allocate scarce resources in the ST monad, so we're just pretending for the sake of uniformity. Just have a look at how resourceBracket is implemented for ST.
The type complexity really gets in the way. Look at the presence of both with and withIO for an example.
We are still limited in our monad choices, since we need monads that provide mutable references for ResourceT to work. This turns into a performance penalty, as we'll see later.
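To make the last point concrete, here is a minimal sketch of why a ResourceT-style layer needs mutable references. It is not conduit's actual implementation: ToyResource, liftToy, allocate, and runToyResource are hypothetical names, the type is fixed to IO rather than being a transformer, and asynchronous exceptions are ignored. The registry of pending cleanup actions lives in an IORef, which is exactly the kind of facility the base monad has to provide.

```haskell
import Control.Exception (finally)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical toy type, NOT conduit's ResourceT: an IO computation with
-- access to a mutable list of pending cleanup actions.
newtype ToyResource a = ToyResource { unToy :: IORef [IO ()] -> IO a }

instance Functor ToyResource where
  fmap f (ToyResource g) = ToyResource (fmap f . g)

instance Applicative ToyResource where
  pure x = ToyResource (\_ -> pure x)
  ToyResource f <*> ToyResource x = ToyResource (\ref -> f ref <*> x ref)

instance Monad ToyResource where
  ToyResource x >>= k = ToyResource (\ref -> x ref >>= \a -> unToy (k a) ref)

-- Lift a plain IO action into the toy monad.
liftToy :: IO a -> ToyResource a
liftToy io = ToyResource (const io)

-- Acquire a resource and register its release action in the registry.
allocate :: IO a -> (a -> IO ()) -> ToyResource a
allocate acquire release = ToyResource $ \ref -> do
  x <- acquire
  modifyIORef' ref (release x :)
  return x

-- Run the computation, then run every registered cleanup (most recently
-- registered first), even if the computation throws.
runToyResource :: ToyResource a -> IO a
runToyResource (ToyResource f) = do
  ref <- newIORef []
  f ref `finally` (readIORef ref >>= sequence_)
```

Every bind threads the IORef through, so code that never calls allocate still pays for carrying the registry around.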

It turns out there's a simple solution here: don't bake ResourceT into Source, Sink, and Conduit. Instead, only use it for functions that actually allocate scarce resources, such as sourceFile. The one downside to this approach is that type signatures get a little longer:

-- conduit 0.2
sourceFile :: ResourceIO m => FilePath -> Source m ByteString

-- conduit 0.3
sourceFile :: ResourceIO m => FilePath -> Source (ResourceT m) ByteString

However, you can make the argument that this is in fact a Good Thing: we're now explicit in our types as to whether we're performing allocation of scarce resources.

I've put together a separate branch on GitHub for this approach, and have generated some Haddocks. I'm not yet ready to release this code to Hackage, but I wanted to get people's feedback.

Beyond the theoretical issues above, I'm sure there are two big questions people want to ask.

How bad is the breakage?

Not bad at all. The Resource typeclass is completely gone now. You can replace it with Monad. In other words:

-- old
nums :: Resource m => Source m Int
nums = fromList [1..10]

-- new
nums :: Monad m => Source m Int
nums = fromList [1..10]
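To illustrate what a plain Monad constraint buys, here is a rough sketch using a toy stream type. ToySource, toySourceList, and toyFold are hypothetical stand-ins, not conduit's Source or its list functions. Because the source asks nothing of m beyond Monad, the same definition can be consumed in Identity (and so from pure code) or in IO.

```haskell
import Data.Functor.Identity (runIdentity)

-- Hypothetical toy stream type, NOT conduit's Source: each step either ends
-- the stream or yields one value plus the rest of the stream.
newtype ToySource m a = ToySource { pull :: m (Maybe (a, ToySource m a)) }

-- Build a source from a list; needs nothing from m beyond Monad.
toySourceList :: Monad m => [a] -> ToySource m a
toySourceList []       = ToySource (return Nothing)
toySourceList (x : xs) = ToySource (return (Just (x, toySourceList xs)))

-- Strict left fold over a source, standing in for a sink.
toyFold :: Monad m => (b -> a -> b) -> b -> ToySource m a -> m b
toyFold f = go
  where
    go acc src = do
      step <- pull src
      case step of
        Nothing        -> return acc
        Just (x, rest) -> let acc' = f acc x in acc' `seq` go acc' rest

nums :: Monad m => ToySource m Int
nums = toySourceList [1 .. 10]

-- The same source consumed purely...
pureSum :: Int
pureSum = runIdentity (toyFold (+) 0 nums)

-- ...and in IO, with no change to nums.
ioSum :: IO Int
ioSum = toyFold (+) 0 nums
```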

Additionally, the lesser-used ResourceThrow and ResourceUnsafeIO classes have been renamed to MonadThrow and MonadUnsafeIO. These classes are not in any way ResourceT-specific, thus the name change. ResourceIO remains as-is.

You might have to add a few explicit lift calls now, and in some cases will have to change your type signature to include ResourceT. But overall, this is a minor change.
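As a small illustration of the kind of lift call involved, the sketch below uses ReaderT from the transformers package as a stand-in for ResourceT. That substitution is an assumption made purely so the example is self-contained, and baseAction and scaled are hypothetical names. An action fixed to the base monad has to be lifted explicitly once the surrounding code lives in the transformer.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Reader (ReaderT, ask, runReaderT)

-- An action that lives in the base monad and knows nothing about
-- any transformer stacked on top of it.
baseAction :: Int -> IO Int
baseAction n = return (n * 2)

-- ReaderT stands in for ResourceT here (an assumption for illustration).
-- Inside the transformer, the base-monad action needs an explicit lift.
scaled :: ReaderT Int IO Int
scaled = do
  n <- ask
  lift (baseAction n)

-- Peel the transformer off again to get back to the base monad.
runScaled :: Int -> IO Int
runScaled = runReaderT scaled
```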

How does this affect performance?

For code that will still live in the ResourceT transformer, this will have no performance effect. (I made a separate change to optimize the monadic bind implementation of ResourceT, which does improve performance significantly.) However, if you don't need scarce resource allocations, you can now skip out on the ResourceT overhead entirely. In fact, you can skip out on the overhead of IO and ST as well if you just need to perform pure actions.
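The idea of choosing the cheapest monad that does the job can be sketched without conduit at all. The snippet below is only an illustration using base: sumM is a hypothetical helper, written once against Monad and then run in Identity, ST, and IO. It says nothing about relative speed; it just shows that monad-polymorphic code leaves the choice to the caller.

```haskell
import Control.Monad (foldM)
import Control.Monad.ST (runST)
import Data.Functor.Identity (runIdentity)

-- Strict monadic sum, polymorphic in the monad it runs in.
sumM :: Monad m => [Int] -> m Int
sumM = foldM (\acc x -> return $! acc + x) 0

-- Pure result via Identity: no IO or ST involved.
viaIdentity :: Int
viaIdentity = runIdentity (sumM [1 .. 1000])

-- Pure result via ST.
viaST :: Int
viaST = runST (sumM [1 .. 1000])

-- The same computation as an IO action.
viaIO :: IO Int
viaIO = sumM [1 .. 1000]
```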

I implemented a simple Criterion benchmark comparing six different ways of summing up the numbers 1 to 1000:

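import Control.Monad (foldM)
import Control.Monad.ST (runST)
import Criterion.Main (bench, defaultMain, whnf, whnfIO)
import Data.Functor.Identity (runIdentity)
import Data.List (foldl')
import qualified Data.Conduit as C
import qualified Data.Conduit.List as CL

main :: IO ()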
main = defaultMain
    [ bench "bigsum-resourcet-io" (whnfIO $ C.runResourceT $ CL.sourceList [1..1000 :: Int] C.$$ CL.fold (+) 0)
    , bench "bigsum-io" (whnfIO $ CL.sourceList [1..1000 :: Int] C.$$ CL.fold (+) 0)
    , bench "bigsum-st" $ whnf (\i -> (runST $ CL.sourceList [1..1000 :: Int] C.$$ CL.fold (+) i)) 0
    , bench "bigsum-identity" $ whnf (\i -> (runIdentity $ CL.sourceList [1..1000 :: Int] C.$$ CL.fold (+) i)) 0
    , bench "bigsum-foldM" $ whnf (\i -> (runIdentity $ foldM (\a b -> return $! a + b) i [1..1000 :: Int])) 0
    , bench "bigsum-pure" $ whnf (\i -> foldl' (+) i [1..1000 :: Int]) 0
    ]

The results are very promising: moving from ResourceT to the Identity monad brings runtime from 1541us to 409us. Unsurprisingly, a straight foldM is still faster (no conduit overhead at all), and a pure foldl' faster yet, but we're definitely closing the gap.

benchmarking bigsum-resourcet-io
mean: 1.541109 ms, lb 1.536687 ms, ub 1.546054 ms, ci 0.950
std dev: 23.92658 us, lb 20.27472 us, ub 30.16423 us, ci 0.950
found 3 outliers among 100 samples (3.0%)
  2 (2.0%) high mild
  1 (1.0%) high severe
variance introduced by outliers: 8.475%
variance is slightly inflated by outliers

benchmarking bigsum-io
mean: 705.3596 us, lb 703.8689 us, ub 706.8185 us, ci 0.950
std dev: 7.554517 us, lb 6.639072 us, ub 8.699024 us, ci 0.950

benchmarking bigsum-st
mean: 737.1096 us, lb 734.7698 us, ub 739.0198 us, ci 0.950
std dev: 10.77292 us, lb 8.970027 us, ub 13.39969 us, ci 0.950
found 4 outliers among 100 samples (4.0%)
  4 (4.0%) low mild
variance introduced by outliers: 7.532%
variance is slightly inflated by outliers

benchmarking bigsum-identity
mean: 409.2451 us, lb 407.7206 us, ub 411.1361 us, ci 0.950
std dev: 8.671930 us, lb 6.924580 us, ub 12.59325 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
  1 (1.0%) high severe
variance introduced by outliers: 14.217%
variance is moderately inflated by outliers

benchmarking bigsum-foldM
mean: 147.9192 us, lb 146.5067 us, ub 149.2992 us, ci 0.950
std dev: 7.126892 us, lb 6.556864 us, ub 7.773273 us, ci 0.950
variance introduced by outliers: 46.449%
variance is moderately inflated by outliers

benchmarking bigsum-pure
mean: 36.21976 us, lb 36.01451 us, ub 36.39617 us, ci 0.950
std dev: 970.5240 ns, lb 832.2271 ns, ub 1.192476 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
  2 (2.0%) low mild
variance introduced by outliers: 20.947%
variance is moderately inflated by outliers
