This is the third article for the series of "Improving the performance of Warp".
As I wrote in "Sending header and body at once", I will explain how to avoid the
calls in this article.
I designed the simple-sendfile to use as few system calls as possible, since "system calls are evil for network programming in Haskell". To make the story simple, I will only talk about the sendfile system call on Linux.
The type of the
sendfile function in
Network.Sendfile is as follows:
sendfile :: Socket -> FilePath -> FileRange -> IO () -> IO ()
FileRange has two constructors:
Let's consider the case where we send the entire file
EntireFile to the
Unfortunately, the function has to call the
stat() system call
to know the size of the file because the
sendfile() system call on Linux
requires the caller to specify how many bytes to be sent
sendfile() on FreeBSD/MacOS has magic number '0' which indicates
the end of file).
If WAI applications know the file size, they can specify
PartOfFile to avoid the
stat() system call.
It is easy for WAI applications to cache file information
such as size and modification time.
If cache timeout is fast enough (say 10 seconds),
the risk of cache inconsistency problem is not serious.
And because we can safely clean up the cache,
we don't have to worry about leakage.
open() and close()
PartOfFile is specified,
sendfile function calls
sendfile() repeatedly if necessary, and
When I implemented the simple-sendfile package,
I noticed that
close() should also be eliminated.
For this, we should cache file descriptors.
Caching file descriptors works as follows: If a client requests that a file be sent, a file descriptor is opened. And if another client requests the same file shortly thereafter, the file descriptor is reused. At a later time, the file descriptor is closed if no Haskell thread uses it.
Sound easy? Unfortunately, I had no idea on how to safely cache file descriptors. It seems to me that it is difficult to ensure that no Haskell thread uses a file descriptor when closing it.
A typical tactic for this case is reference counter. But I was not sure that I could implement a robust mechanism for a reference counter. What happens if a Haskell thread is killed for unexpected reasons? If we fail to decrement its reference counter, the file descriptor leaks.
Andreas motivated me to consider this issue again
by pointing out that the performance bottleneck of Warp is
close(). I thought this over for a month and
all the necessary pieces came together suddenly.
The timeout manager of Warp
I implemented a cache mechanism for file descriptor based on Warp's timeout system. So, let me explain how Warp's timeouts work first. For security reasons, Warp kills a Haskell thread, which communicates with a client, if the client does not send a significant amount of data for a specified period (30 seconds by default). I think that the heart of Warp's timeout system is the following two points:
- Safe swap and merge algorithm
Suppose that status of connections is described as
To clean up inactive connections,
a dedicated Haskell thread, called the timeout manager, repeatedly inspects the status of each connection.
If status is
Active, the timeout manager turns it to
Inactive, the timeout manager kills its associated Haskell thread.
Each status is refereed by an
To update status through this
atomicity is not necessary because status is just overwritten.
In addition to the timeout manager,
each Haskell thread repeatedly turns its status to
Active through its own
IORef if its connection actively continues.
To check all statuses easily,
the timeout manager uses a list of the
IORef to status.
A Haskell thread spawned for a new connection
tries to 'cons' its new
IORef for an
Active status to the list.
So, the list is a critical section and we need atomicity to keep
the list consistent.
A standard way to keep consistency in Haskell is
But as Michael Snoyman pointed out in "Warp: A Haskell Web Server",
MVar (in threaded RTS) is slow.
This is because each
MVar is protected with a home-brewed spin lock.
So, he used another
IORef to refer the list and
to manipulate it.
atomicModifyIORef is implemented via CAS (Compare-and-Swap),
which is much faster than spin locks.
The following is the outline of the safe swap and merge algorithm:
do xs <- atomicModifyIORef ref (\ys -> (, ys)) -- swap with  xs' <- manipulates_status xs atomicModifyIORef ref (\ys -> (merge xs' ys, ()))
The timeout manager atomically swaps the list with an empty list.
Then it manipulates the list by turning status and/or removing
unnecessary status for killed Haskell threads.
During this process, new connections may be created and
their status are inserted with
corresponding Haskell threads.
Then, the timeout manager atomically merges
the pruned list and the new list.
The algorithm to cache file descriptors
Warp's timeout approach is safe to reuse as a cache mechanism for file descriptors because it does not use reference counters. However, we cannot simply reuse Warp's timeout code for some reasons:
Each Haskell thread has its own status. So, status is not shared.
But we would like to cache file descriptors to avoid
close() by sharing.
So, we need to search a file descriptor for a requested file from
cached ones. Since this look-up should be fast, we should not use a list.
You may think
Data.Map can be used.
Yes, its look-up is O(log N) but there are two reasons why we cannot use it:
Data.Mapis a finite map which cannot contain multiple values for a single key.
Data.Mapdoes not provide a fast pruning method.
Problem 1: because requests are received concurrently,
two or more file descriptors for the same file may be opened.
So, we need to store multiple file descriptors for a single file name.
We can solve this by re-implementing
hold a non-empty list.
This is technically called a "multimap".
Data.Map is based on a binary search tree called "weight
balanced tree". To the best of my best knowledge, there is no way to prune the tree
directly. You may also think that we can convert the tree to a list (
then prune it, and convert the list back to a new tree (
The cost of the first two operations is O(N) but
that of the last one is O(N log N) unfortunately.
One day, I remembered Exercise 3.9 of "Purely Functional Data Structure" -
fromOrdList which constructs
a red-black tree from an ordered list in O(N).
My friends and I have a study meeting on this book every month.
To solve this problem, one guy found a paper by Ralf Hinze,
"Constructing Red-Black Trees".
If you want to know its concrete algorithm,
please read this paper.
Since red-black trees are binary search trees,
we can implement multimap by combining it and non-empty lists.
Fortunately, the list created with
toList is sorted.
So, we can use
fromOrdList to convert the sorted list to a new
Now we have a multimap whose look-up is O(log N) and
pruning is O(N).
The cache mechanism has already been merged into the master branch of Warp, and is awaiting release.
UPDATE: Discussion here is based on my ignorance. Please read also comments.
New functions in simple-sendfile
I explained the
sendfile function and
sendfileWithHeader function in
this article and the previous one, respectively:
sendfile :: Socket -> FilePath -> FileRange -> IO () -> IO () sendfileWithHeader :: Socket -> FilePath -> FileRange -> IO () -> [ByteString] -> IO ()
To avoid the
close() system call, I added two more functions
to the simple-sendfile package:
sendfileFd :: Socket -> Fd -> FileRange -> IO () -> IO () sendfileFdWithHeader :: Socket -> Fd -> FileRange -> IO () -> [ByteString] -> IO ()
Of course, the master branch of Warp uses the last one.
That's all. Thank you for reading this long article.