[iolib-devel] c10k HTTP server with iolib

Sun Oct 18 09:04:36 UTC 2009

On Sat, 17 Oct 2009 20:05:52 -0700
Red Daly <reddaly at gmail.com> wrote:

> I was wondering how best to use IOLib to write an efficient HTTP server that
> can handle perhaps 10,000+ simultaneous connections.  It seems like iolib
> has all the right ingredients: a system-level sockets interface and an I/O
> multiplexer that uses epoll/kqueue for efficiently querying sockets.  There
> is quite a bit of code already written so I was hoping for some advice about
> how this would be best implemented.

Note that the following is based on the kqueue backend, with SBCL and
a version of iolib that is now more than a year old.  It might not
apply to the /dev/poll, epoll, or even select backends.  Also, please
forgive me if I'm stating the obvious, as I have no knowledge of your
background :)

I tried using iolib on NetBSD (which supports kqueue), along with the
multiplexer.  I wrote a very simplistic I/O-bound server around it to
measure performance (no worker threads, just non-blocking I/O in a
single-threaded process, a model I had previously used successfully
for high-performance C+kqueue(2) and JavaScript+libevent(3) servers on
the same OS).

The performance was unfortunately pretty bad compared to using C+kqueue
(i.e. on the order of a few hundred served requests per second versus
thousands, and nearly a thousand with JS), so I made sure the kqueue
backend was being used (it was), and then looked at the code (after
being warned that the multiplexer was the less tested part of iolib).
What I noticed at the time was that timers were not dispatched to
kqueue but to a custom scheduler, and that a kevent(2) syscall was
issued per FD add/remove/state change event.

kqueue allows a single kevent(2) syscall in the main loop to handle
all additions/removals/state changes/notifications of descriptors,
signals and timers.  This is part of what makes it so performant, along
with the fact that the caller only needs to iterate over new state
changes rather than over a full descriptor table.
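
To make the single-call pattern concrete, here is a minimal sketch using
Python's select.kqueue wrapper around kevent(2).  It is not iolib code;
poll_once and demo are hypothetical names chosen for illustration, and
the demo only does real work on systems that provide kqueue (the BSDs,
macOS).  Changes are queued in a list and handed to the kernel in the
same call that collects ready events:

```python
import select
import socket


def poll_once(kq, pending_changes, max_events=64, timeout=0.1):
    """Submit every queued change and fetch ready events in one kevent(2) call."""
    events = kq.control(pending_changes, max_events, timeout)
    pending_changes.clear()  # the kernel has consumed the whole batch
    return events


def demo():
    """Hypothetical demo; returns None where kqueue is unavailable (e.g. Linux)."""
    if not hasattr(select, "kqueue"):
        return None
    a, b = socket.socketpair()
    kq = select.kqueue()
    try:
        # Queue a descriptor registration and a one-shot 10 ms timer together;
        # neither costs a syscall of its own.
        changes = [
            select.kevent(a.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD),
            select.kevent(1, select.KQ_FILTER_TIMER,
                          select.KQ_EV_ADD | select.KQ_EV_ONESHOT, data=10),
        ]
        b.send(b"x")  # make `a` readable before the call
        events = poll_once(kq, changes)
        return a.fileno(), [(ev.ident, ev.filter) for ev in events]
    finally:
        kq.close()
        a.close()
        b.close()
```

A main loop built this way appends to the change list while handling
events and calls poll_once once per iteration, instead of issuing one
kevent(2) per add/remove/state change.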

I admit that I haven't looked at the iolib kqueue backend code again
lately, so it may have improved, and I didn't try to fix it myself.
Library portability is of limited value in my case, and I use complete
C+PHP and C+JavaScript solutions for work; my adventure into CL and
iolib was experimental and a hobby, but I can confirm my growing love
for CL. :)

Another potential performance issue I noticed is the interface itself,
i.e. all the sanity checking: to be as compatible as possible (with
Allegro?), it has to enforce a distinction between various socket types
(bind/listen/accept vs read/write sockets for instance), which adds
overhead.  Also, unlike BSD accept(2), which gives immediate access to
the client's address by storing it into a supplied sockaddr object,
iolib required a separate syscall to obtain the client address/port,
as the interface did not cache that address.  I honestly didn't check
whether iolib makes this possible, but the BSD sockets API also allows
asynchronous non-blocking accept(2)/connect(2), which is important for
non-blocking I/O-bound proxies.
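
The two BSD sockets behaviours mentioned above can be sketched as
follows, using Python's socket module rather than iolib.  The function
name nonblocking_accept_demo is hypothetical, and the loopback address
and timeouts are assumptions made for the sake of a self-contained
example.  The peer address is returned by accept() together with the new
socket, with no separate getpeername() call, and both accept and connect
are driven without blocking:

```python
import errno
import select
import socket


def nonblocking_accept_demo():
    """Return (would_block, peer_seen_by_server, client_local_address)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    conn = None
    try:
        listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        listener.listen(8)
        listener.setblocking(False)

        # Nothing is pending yet, so a non-blocking accept must not block.
        try:
            conn, _ = listener.accept()
            would_block = False
        except BlockingIOError:
            would_block = True

        # Non-blocking connect: returns at once, typically with EINPROGRESS.
        client.setblocking(False)
        rc = client.connect_ex(listener.getsockname())
        if rc not in (0, errno.EINPROGRESS):
            raise OSError(rc, "connect failed")
        # Writability signals that the connect finished; SO_ERROR says how.
        select.select([], [client], [], 5.0)
        err = client.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err:
            raise OSError(err, "connect failed")

        # Readability on the listener means accept() will not block.
        select.select([listener], [], [], 5.0)
        conn, peer = listener.accept()    # peer address comes with the socket
        return would_block, peer, client.getsockname()
    finally:
        for s in (conn, client, listener):
            if s is not None:
                s.close()
```

In a real proxy the select() calls would be replaced by the event loop's
own readiness notifications; they are used here only to keep the sketch
short.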

In the case of my test code, there was also some added overhead, as I
wrote a general-purpose TCP server library for the minimal test HTTP
server to use.  CLOS was used (which itself has some overhead over
struct/closure/lambda-based code because of dynamic runtime dispatch,
although SBCL was pretty good at optimizing CLOS code compared to other
implementations).  It also used a custom buffer in order to use file
descriptors directly instead of streams (especially since non-blocking
I/O was used).  Even so, similar code using a libevent(3) stub class in
interpreted (non-JIT) JavaScript under SpiderMonkey was still faster
(note, however, that I have not tested iolib's own buffering against
mine).  libevent(3) is also able to use a single-syscall kevent(2)
based loop, which greatly helps performance.

CFFI itself also appears to incur some overhead compared to UFFI.  I
had no idea at the time and didn't look into it; it only became
apparent from looking at the resulting assembly and microbenchmarks.
It was probably a non-issue compared to the numerous kevent(2)
syscalls.  Another source of overhead, probably insignificant since it
is CPU-bound, could be iolib's use of CLOS (I found CLOS to be from 1.5
to 10 times slower in some struct+lambda vs class+method tests,
depending on the task and CL implementation).

Another factor was that it was among my first Common Lisp tests, so
the code was probably clumsy :)
In case it can be useful, the test code can be found at:
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/test/httpd.lisp?rev=1.10;content-type=text%2Fplain
Which uses:
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/lib/rw-queue/
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/lib/server/

Even if iolib's multiplexer can't suit your needs with your favorite
backend, that doesn't make iolib useless, especially in the case of
application servers.  For instance:

More recently I was playing with ECL, which supports POSIX threads and
has a contrib library providing a simple, SBCL-compatible BSD sockets
API.  I wrote a simple multithreaded test server in which a pool of
ready threads accept new connections themselves, serve the client, and
then go back to accept mode when done.  This was actually meant to test
ECL itself, and is very minimal (it isn't flexible and doesn't even
implement input timeouts!), but it can serve to demonstrate the idea,
which could also be implemented using SBCL and its native sockets, or
iolib.  The performance was very decent for an application-type server
(also note that the bugs mentioned in the comments have since been
fixed in ECL):
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/test/ecl-server2.lisp?rev=1.10;content-type=text%2Fplain
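
For readers who don't want to follow the link, the accept-pool idea can
be sketched in a few lines.  This is not the ECL code linked above; it is
a Python illustration of the same model, and echo_handler, worker and
start_pool are hypothetical names.  Every pool thread blocks in accept()
on the shared listening socket, serves one client, then returns to
accept():

```python
import socket
import threading


def echo_handler(conn, peer):
    """Hypothetical per-client handler: echo one chunk back."""
    conn.settimeout(5.0)          # crude input timeout
    data = conn.recv(1024)
    if data:
        conn.sendall(data)


def worker(listener, handler):
    """Each pool thread blocks in accept(), serves the client, then loops."""
    while True:
        try:
            conn, peer = listener.accept()
        except OSError:
            return                # listener closed or failed: stop this worker
        with conn:
            try:
                handler(conn, peer)
            except OSError:
                pass              # a broken client must not kill the worker


def start_pool(handler, workers=4):
    """Bind a loopback listener and start a fixed pool of accepting threads."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(128)
    threads = [
        threading.Thread(target=worker, args=(listener, handler), daemon=True)
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    return listener, threads
```

The pool size is fixed here; a production server would add a manager
that grows and shrinks the pool based on load.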

The above does not require an efficient multiplexer.  The method it
uses is similar to that of mmlib/mmserver|js-appserv and Apache, where
generally a manager thread/process uses heuristics to grow and shrink
the pool of processes/threads as necessary.  In the case of ECL, a
libevent(3)/kqueue(2) C-based main loop could even invoke CL functions
if optimal multiplexing were a must, as ECL compiles CL to C (SBCL's
compiler is more efficient however, especially where CLOS is involved).

In general, CPU-bound applications (HTTP application servers and
database servers often are) use one of two models.  The first is a pool
of processes, if optimal reliability and security are a must (it
permits privilege separation, avoids resource leaks by occasionally
recycling processes, generally confines a bug to the affected instance,
and makes the need for reentrant and thread-safe libraries a
non-issue).  The second is a pool of threads (generally with languages
that minimize buffer overflows and support a GC, with a master process
managing the threaded processes).  I/O-bound applications, on the other
hand, are the ones needing optimal multiplexing with non-blocking
asynchronous I/O, often in a single thread/process (e.g. frontend HTTP
servers/proxies such as lighttpd or nginx, IRCD, etc).

For very busy dynamic sites, as load grows, a farm of CPU-bound
application servers can be set up, with a few frontend I/O-bound HTTP
servers proxying dynamic requests to them (via FastCGI, or most
commonly today HTTP, especially with Keep-Alive support) and performing
load balancing (which is sometimes done at an upper layer).  In this
sense it is not necessary for a single all-purpose HTTP server to
handle both very efficient multiplexing and CPU-bound worker threads
simultaneously (the latter are usually better kept separate for the
purposes of redundancy and application-specific configuration)...

That said, if you want to implement an I/O-bound server, I hope the
backends you'll need to use provide better performance than the kqueue
one did for me back then.  Working on improving it would be interesting,
but I'm afraid I don't have the time or motivation to take up the task
at the moment.  As for the interface-specific improvements, I can (and
did) suggest a few changes but have no authority to change the API,
which seems to have been thought out with valid compatibility concerns.
-- 
Matt