Sunday, November 14, 2010

Multithreaded Python, extensions, and static data


The GIL and Context Switching

I use Boost.Python to write C++ extensions for Python, and I work on an I/O-bound Python program. Python has a global interpreter lock ("the GIL"), which means that in a multi-threaded program, only one thread can be executing code inside the Python interpreter at a time. Now, a thread can drop the GIL, and the built-in Python read and write routines do this so that while one thread is doing I/O, another thread can run. However, due to a peculiarity in how the GIL is implemented,1 even though the actual I/O takes place during system calls that drop the GIL, the need to re-acquire the GIL after every I/O operation was killing our performance.
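The serializing effect of the GIL is easy to see even without extensions. Here's a minimal sketch (the loop size is arbitrary) showing that two threads doing pure-bytecode work finish no faster than one thread doing the same work twice, because the GIL only lets one of them run at a time:

```python
import threading
import time

def count(n):
    # A pure-bytecode busy loop; it holds the GIL except at
    # the interpreter's periodic check intervals.
    while n > 0:
        n -= 1

N = 5000000

# Run the work twice, sequentially.
start = time.time()
count(N)
count(N)
sequential = time.time() - start

# Run the same total work on two threads. The GIL serializes the
# bytecode, so this is no faster -- and the GIL hand-offs between
# threads add overhead on top.
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

print("sequential: %.2fs  threaded: %.2fs" % (sequential, threaded))
```

On the multicore machines of this era the threaded version is often *slower* than the sequential one, which is exactly the pathology Beazley's talk digs into.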

For instance, one of the things that the application does a lot of is logging. Doing the logging synchronously -- as the code is executing and it wants to write something to its log, it needs to wait for the write to the logfile to finish before it can continue going about its business -- turned out to be a bottleneck. My first attempt to do something about that was to spawn off a separate thread for the logfile, and have logging look something like this:

import threading

class Logger:
    def __init__(self):
        self._buffer = []
        self._file = open("foo.log", "w")
        self._lock = threading.Lock()
        self._hasMessages = threading.Condition(self._lock)

    def writeEntriesForever(self):
        while True:
            with self._lock:
                while len(self._buffer) == 0:
                    self._hasMessages.wait()
                messages = self._buffer
                self._buffer = []
            # Do the actual I/O outside the lock, so log() isn't blocked.
            for message in messages:
                self._file.write(message)
            self._file.flush()

    def log(self, message):
        with self._lock:
            self._buffer.append(message)
            self._hasMessages.notify()

def main():
    logger = Logger()
    t = threading.Thread(target=logger.writeEntriesForever)
    t.daemon = True
    t.start()

That did get the I/O off of the main thread. However, while the write call in Logger.writeEntriesForever would make the logging thread drop the GIL, allowing the main thread to continue executing, the logging thread would need to reacquire the GIL when write returned. Now, it'd drop the GIL again while waiting, which is where the thread would spend most of its time, but then it'd need to acquire the GIL again between the end of wait and the start of write. All of these context switches were almost completely negating any performance win from offloading the actual I/O to a separate thread.


Enter Boost.Python. The GIL is only needed when using the Python interpreter, so if the entire body of writeEntriesForever doesn't need the interpreter, the thread can drop the GIL as soon as it enters that method and never reacquire it. This means writing that method in some language other than Python, which is exactly what Boost.Python makes easy.

The way Boost.Python works, you wind up compiling and linking your C++ code into a dynamic library, and that library is a Python extension. In the example above, the new Python code would look like:

import threading
from logging_extension import Logger

def main():
    logger = Logger()
    t = threading.Thread(target=logger.writeEntriesForever)
    t.daemon = True
    t.start()


and then you'd have C++ code that would look something like:

#include <fstream>
#include <list>
#include <memory>
#include <ostream>
#include <string>
#include <boost/python.hpp>
#include <boost/thread/condition.hpp>
#include <boost/thread/mutex.hpp>

// The GIL will be released when an instance of this class comes into scope
// and reacquired when it goes out of scope.
class ScopedGILRelease {
public:
    inline ScopedGILRelease() { m_thread_state = PyEval_SaveThread(); }
    inline ~ScopedGILRelease() {
        PyEval_RestoreThread(m_thread_state);
        m_thread_state = NULL;
    }
private:
    PyThreadState *m_thread_state;
};

class Logger {
public:
    Logger();
    void log(boost::python::str message);
    void writeEntriesForever();

private:
    typedef std::list<std::string> LogBuffer;
    std::filebuf fb;
    std::ostream file;
    std::auto_ptr<LogBuffer> buffer;
    boost::condition hasMessages;
    boost::mutex mutex;
};

Logger::Logger() : file(&fb) {
    fb.open("foo.log", std::ios::out);
    buffer.reset(new LogBuffer);
}

void Logger::log(boost::python::str message) {
    // Extract the string while we still hold the GIL.
    std::string text = boost::python::extract<std::string>(message);
    ScopedGILRelease noGIL; // Drop GIL before acquiring mutex to avoid deadlock.
    boost::mutex::scoped_lock lock(mutex);
    buffer->push_back(text);
    hasMessages.notify_one();
}

void Logger::writeEntriesForever() {
    ScopedGILRelease noGIL;
    while (true) {
        std::auto_ptr<LogBuffer> messages;
        {
            boost::mutex::scoped_lock lock(mutex);
            while (buffer->empty()) hasMessages.wait(lock);
            messages = buffer; // auto_ptr assignment transfers ownership.
            buffer.reset(new LogBuffer);
        }
        for (LogBuffer::iterator i = messages->begin(); i != messages->end(); ++i) {
            file << *i;
        }
        file.flush();
    }
}

BOOST_PYTHON_MODULE(logging_extension) {
    using namespace boost::python;
    class_<Logger, boost::noncopyable>("Logger")
        .def("log", &Logger::log)
        .def("writeEntriesForever", &Logger::writeEntriesForever);
}
Python Extensions and Static Data

So, that all worked fine and dandy until I did three things.

Multiple Extensions

The first thing that caused a problem is that I also decided to move the code for interacting with Oracle into a Boost.Python extension. For the usual reasons, I didn't want to have that code and the logging code in one big honking library of doom, so I put it in its own extension; there were now two separate extension libraries, one for logging and one for the database code.

Static Data

The second thing that caused a problem is that our logging code is actually more complicated than shown above. We have a syslog-like framework where there are different categories of log message, and the app can be configured so that different categories have different log levels. There are a lot of LOG_DEBUG statements in the application, but if none of the logging categories are configured to be that verbose, those statements will never actually make it into the log.

Since logging settings are application-wide, and it'd be ugly to have to pass around a "logging state" object everywhere (or for that matter, an instance of Logger), I used static data for that:

    static std::map<std::string, Logger *> theLoggers; // Map logging destination to the logger object.
    static std::map<Category, Level> theSettings;      // Map logging category to its log level.
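In Python terms, the filtering behaves roughly like this (the names and level values here are hypothetical, not the real framework's):

```python
# Hypothetical log levels and per-category settings.
LEVEL_DEBUG, LEVEL_INFO, LEVEL_ERROR = 0, 1, 2
settings = {"database": LEVEL_INFO}  # category -> minimum level to log

def do_log(level, category, message):
    # Returns True if the message was actually written.
    if level < settings.get(category, LEVEL_ERROR):
        return False
    print(message)
    return True

do_log(LEVEL_DEBUG, "database", "too verbose, dropped")  # returns False
do_log(LEVEL_INFO, "database", "makes it into the log")  # returns True
```

The key point is that the check consults application-wide state -- which is exactly why the C++ version keeps it in static data.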

Using One Extension From Another

The third thing was that I wanted to actually log things from inside the database interaction code. Simply including the logging headers from inside the DB sources didn't work:

#include "../logging/logging.hpp"
logging::doLog(logging::LEVEL_DEBUG, logging::CATEGORY_DATABASE, "Hello, world!");

It would compile, but it wouldn't run, because the symbols from the logging library were unresolved. Okay, easy enough to fix. I added the logging library to the link line for the database extension and went about my merry business.

Symbol Visibility

This looked like it worked, but none of the messages from the database code were actually making it into the log! I thought I must be doing something threading-related incorrectly, or something. But, eventually, while debugging in GDB, I noticed an odd message.

(gdb) b
Breakpoint 1 at 0x1234567: file logging.cpp, line 6. (2 locations)

2 locations? Oh. Well, that was the problem. To load a dynamic library at runtime on Linux, Python uses the dlopen function. The documentation for dlopen mentions, in the description of the flags argument, that the RTLD_LOCAL flag (the default) means that "symbols defined in this library are not made available to resolve references in subsequently loaded libraries." This meant that when Python loaded the database extension, the dynamic linker mapped in a second copy of the logging library (because the database extension was linked against it), ignoring the copy pulled in when Python dlopen()ed the logging extension with RTLD_LOCAL. So when Python routines called logging functions, they got one copy of the static data, while when the Oracle routines called logging functions, they got their own copy of the static data! The database code wasn't seeing any of the logging settings changes made from the Python code.

My solution was to stop linking the database extension against the logging library, and instead to create a new Python module, with a dire warning in a comment that before importing any Boost.Python extensions, one must import this file:

###---*** IMPORTANT! Before importing any Boost.Python modules, you *must*
###---*** import this module first, e.g.:
###---***    import pre_c_import
###---***    import c_extensions.logging
###---*** If you don't, your import will fail with unresolved symbol errors.
###---*** Whenever you add a new extension module, you should add it to
###---*** the imports below.
import sys, DLFCN
current = sys.getdlopenflags()
sys.setdlopenflags(current | DLFCN.RTLD_GLOBAL)

import c_extensions.logging


RTLD_GLOBAL is the opposite of RTLD_LOCAL, so now the extensions were able to see each other's symbols, and everything was happy.
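For what it's worth, the same trick can be written with the RTLD_* constants from the os module (available on newer Pythons, where the DLFCN module is gone), and it's tidy to restore the default flags afterwards. A sketch, with the actual extension imports elided:

```python
import os
import sys

old = sys.getdlopenflags()
sys.setdlopenflags(old | os.RTLD_GLOBAL)
try:
    # import c_extensions.logging and the other extensions here
    pass
finally:
    # Restore the default so unrelated later imports aren't affected.
    sys.setdlopenflags(old)
```

Restoring the flags keeps RTLD_GLOBAL from leaking into every subsequent dlopen the process does, which limits the chance of unrelated symbol collisions.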

  1. Dave Beazley has a good explanation, and an hour-long video, describing the issues.