Last week I was building VMware images for database and web server
appliances that would host a fairly large Django application. The
application is backed by a PostgreSQL database and I was looking for
some info on compiling/configuring pgpool on Debian (I like to compile
stuff when I can, especially when the last version of Debian is 2
years old). Googling around I came across some very interesting posts
on mailing lists and SO regarding Django, PostgeSQL and connection
pooling.
Among other things, people seem to have a notion that using pooling
middleware won’t accomplish much as the web server still needs to open
a TCP connection and that is the source of a noticeable overhead for
each request. So they’ve come up with solutions to avoid opening TCP
connections as much as possible, trying to accomplish something
similar to what SQLAlchemy’s connection pool does, keeping the web
server connected to the database with multiple connections at all
times. These solutions of course range from changing the code in
django.db.backends.... to monkey-patching it.
When you change Django’s code you’ve just created a fork of a growing
and evolving open source project and based your own project around
that fork you have to maintain yourself. Monkey-patching is not as
bad, but comes very close regarding maintenance problems. And is all
that really necessary?
The Overhead Doesn’t Stem From Opening TCP Connections
Every time you open a database connection (session) to execute some
SQL on a Postgres database the Postgres server spawns a new process
and upon closing the connection (session) from your application that
process is shut down. With the way Django handles database sessions
this is repeated for every request. Which means for every request
Postgres will have to spawn a new worker process that will last for
the duration of the database session involved in responding to that
HTTP request. The overhead involved in opening a TCP connection to a
process running on the same machine or on the same network is not much
compared to the overhead involved in spawning a new process.
Solutions Do Exist
So if the overhead for each request comes from spawning new processes
then the obvious solution would be to keep that at a minimum level. If
you keep the connections to the database server open and reuse them
for every request then the processes spawned at the time the
connections were established would be reused as the connections are
reused.
But you don’t have to keep your web server connected to the database
server to achieve this. Two more popular solutions are pgpool II and
pgbouncer. Both are designed as sort of middleware proxies that sit
between your application and your database. Pgpool is more of a
replication and load balancing solution than a connection pool. It
works as a connection pool because at each connection opened by your
application to pgpool it will have a separate process handling that
connection, but it will keep those processes alive and connected to
the Postgres server even after your application closes those
connections. So using it would have the effect of lowering the net
amount of new processes created to serve a certain number of requests
to your web application. Pgbouncer on the other hand handles all the
requests between your application and Postgres in a highly efficient
asynchronous manner by utilizing libevent and not using
multiprocessing at all, and it will keep the initially opened
connections for a longer time after your application closes them, so
making a new connection to pgbouncer will rarely result in Postges
spawning a process.
The Proof
To prove that solutions like pgbouncer do make a difference I created
a simple test scenario. A very small Django project with one page
displaying 5 rows from a table in a PostgreSQL database. Both the
database server and the web server (Apache with mod-wsgi in daemon
mode) running on one small VM with 1GB of RAM and 4 CPU cores
assigned.
In the first test I configured the application to connect to the
database server directly and put the page under
siege with 1, 5, 15,
50, 100 and 200 concurrent requests, each session lasting for 1
minute. Then I repeated the process with the application configured to
connect to pgbouncer instead. The results show something close to a
50% increase in responsiveness: