
Lock-Free Programming Notes

Lock-Free Queue

15 June 2011

Mundane Implementation

A typical thread-safe queue can be written by simply making the various member functions lock a mutex. After ensuring that only one function gets to run at a time, the implementation can be a straightforward circular buffer. Here is an example from the Boost library, which illustrates the soon-to-be standard C++ threading constructs.

Actually, the listing does not match the current documentation: the Boost library was rewritten to track the proposed standard, and presumably it still supports the older names and constructs for backward compatibility.

// Copyright (C) 2001-2003
// William E. Kempf
//
//  Distributed under the Boost Software License, Version 1.0. (See accompanying 
//  file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

#include <boost/thread/condition.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/thread.hpp>
#include <iostream>
#include <vector>

class bounded_buffer : private boost::noncopyable
{
public:
    typedef boost::mutex::scoped_lock lock;
    bounded_buffer(int n) : begin(0), end(0), buffered(0), circular_buf(n) { }
    void send (int m) {
        lock lk(monitor);
        while (buffered == circular_buf.size())
            buffer_not_full.wait(lk);
        circular_buf[end] = m;
        end = (end+1) % circular_buf.size();
        ++buffered;
        buffer_not_empty.notify_one();
    }
    int receive() {
        lock lk(monitor);
        while (buffered == 0)
            buffer_not_empty.wait(lk);
        int i = circular_buf[begin];
        begin = (begin+1) % circular_buf.size();
        --buffered;
        buffer_not_full.notify_one();
        return i;
    }
private:
    int begin, end, buffered;
    std::vector<int> circular_buf;
    boost::condition buffer_not_full, buffer_not_empty;
    boost::mutex monitor;
};

This embodies a technique called a monitor, which describes a set of functions as “belonging to” the same monitor and ensures that only one function in that set can be executing at a time. The concept maps well to C++ member functions, but there is no native way to declare that a class acts as a monitor. Instead, a scoped lock lk is declared at the top of each function’s body.

But monitors are more than just a pattern of using mutexes. They also work with condition variables, which allow functions inside a monitor to block and resume based on activity of other functions in the same monitor. The big idea is that a monitor can have one thread of execution at a time, and a thread that is blocked is not executing. Blocking on a condition variable allows the monitor to run something else, even though the function has not finished and released the mutex.

The proposed C++ standard threading library supports the monitor concept, providing a mutex class and a condition-variable class that works with it. This example uses condition variables to handle the empty and full conditions in the class. If the buffer is full, the send function waits until something is removed. Likewise, if the buffer is empty, the receive function waits until something is added.
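For comparison, here is a minimal sketch of what the same class might look like written against the proposed-standard names (unique_lock, condition_variable). This is my own illustration, not taken from the Boost documentation.

#include <condition_variable>
#include <mutex>
#include <vector>

class bounded_buffer_std
{
public:
    bounded_buffer_std(int n) : begin(0), end(0), buffered(0), circular_buf(n) { }
    void send (int m) {
        std::unique_lock<std::mutex> lk(monitor);    // scoped_lock becomes unique_lock
        while (buffered == (int)circular_buf.size())
            buffer_not_full.wait(lk);                // condition becomes condition_variable
        circular_buf[end] = m;
        end = (end+1) % circular_buf.size();
        ++buffered;
        buffer_not_empty.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lk(monitor);
        while (buffered == 0)
            buffer_not_empty.wait(lk);
        int i = circular_buf[begin];
        begin = (begin+1) % circular_buf.size();
        --buffered;
        buffer_not_full.notify_one();
        return i;
    }
private:
    int begin, end, buffered;
    std::vector<int> circular_buf;
    std::condition_variable buffer_not_full, buffer_not_empty;
    std::mutex monitor;
};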

The best feature of this code is that it is simple and not very error-prone to write. The entire state of the instance is protected, since only one member function at a time can be running on any thread. So you write ordinary code and don’t have to worry about the values changing while you are using them. The required blocking is handled by condition variables, which are meant to do exactly that.

However, the code is inherently inefficient. Different threads use the queue object, but only one thread at a time can execute code within it. The code cannot run in parallel. Rather, the threads take turns, running one at a time. If the different threads have a lot to do and accessing this queue instance is a very small part of it, there will be low contention and you expect that one user will not find the monitor already occupied. But if the queue is used a lot, this becomes a problem. It is like a stop-light at an intersection: only one thread goes through at a time and the other waits its turn.

Even in the low contention case, you see that every function call must acquire and release the lock. Presumably the implementation is good enough to avoid making Operating System calls in the case where it does not have to block. But that still involves costly “atomic” instructions which sap the strength from the CPU. Likewise, every call to either function calls notify_one on a condition variable. I expect that the implementation of that also needs to use atomic instructions.
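To see where that cost comes from, here is a hypothetical sketch (not the Boost implementation) of the fast path of acquiring an uncontended lock. Even when it succeeds immediately, the interlocked compare-exchange is exactly the kind of atomic instruction the text refers to.

#include <intrin.h>

long lock_word = 0;   // 0 = free, 1 = held

bool try_acquire() {
    // the atomic read-modify-write serializes the pipeline even when it succeeds at once
    return _InterlockedCompareExchange(&lock_word, 1, 0) == 0;
}

void release() {
    _InterlockedExchange(&lock_word, 0);
}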

Wait-Free Implementation

In contrast, here is an implementation I coded that is both lock-free and wait-free.

#include <cstdlib>
#include <vector>
#include <Windows.h>    // need OS Event stuff
#include <emmintrin.h>  // need memory fences
#include "boost/utility.hpp"

/*
Features:  1 thread reads, 1 thread writes.  Max size fixed at construction time.
It intentionally blocks if empty or full.
Besides that, the steady-flow state where one thread is adding items and another thread
is removing items is both "lock free" and "wait free" and furthermore does not make
an expensive OS call.

This is both Win32 and x86/x64 CPU specific.
*/


template <typename T>
volatile T& asVolatile (T& x)
 { return x; }


template <typename ItemT>
class Queue_1R1W : boost::noncopyable
{
   std::vector<ItemT> data;
   size_t head, tail;  // head==tail means empty, values increment when advancing
   // tail points one past the end; e.g. like the end() concept.
   // head points to next element to consume (if present)
   // watch for wrapping.  store normalized [0 through Size-1].
   // ===== OS Stuff
   HANDLE evEmpty, evFull;
   volatile bool need_consumer_wakeup;
   volatile bool need_producer_wakeup;
   bool is_full_v() const
      {
      // make sure to really re-check 'head' since consumer is changing it.
      return (tail+1)%data.size() == asVolatile(head);
      }
   void block_when_full()
      {
      while (is_full()) {
         // This code is nearly the same on producer and consumer ends.  The Real Abstraction
         // identified is a high-performance lazy event signaling primitive, which should be
         // implemented in the Platform/OS abstraction layer.
         ++full_count;
         ResetEvent (evFull);
         need_producer_wakeup= true;
         _ReadWriteBarrier();
         _mm_mfence();
         if (is_full_v())
            WaitForSingleObject (evFull, INFINITE);
         need_producer_wakeup= false;
         }
      }
   void push_advance()
    {
    // common code for push_back variations
    asVolatile(tail)= (tail+1) % data.size();
    _WriteBarrier();
    _mm_mfence();
    if (need_consumer_wakeup)  SetEvent(evEmpty);
    }
public:
   // ===== Housekeeping
   explicit Queue_1R1W (size_t elcount) : data(elcount), head(0), tail(0), evEmpty(0), evFull(0), need_consumer_wakeup(false), need_producer_wakeup(false),
        empty_count(0), full_count(0)
      {
      evEmpty= CreateEventW (0, TRUE/*manual reset*/, TRUE/*stays on except when waiting*/, 0);
      evFull= CreateEventW (0, TRUE/*manual reset*/, TRUE/*stays on except when waiting*/, 0);
      }
   ~Queue_1R1W()
      {
      CloseHandle(evEmpty);
      CloseHandle(evFull);
      }

   // ======== Producer Functions
   bool is_full() const
      {
      return (tail+1)%data.size() == head;
      }

   void push_back (const ItemT& el)
      {
      block_when_full();
      data[tail]= el;
      push_advance();
      }
   void push_back (ItemT&& el)
      {
      block_when_full();
      data[tail]= std::move(el);
      push_advance();
      }

   // ======= Consumer Functions
   bool is_empty() const
      {
      return head==tail;
      }
   bool is_empty_v() const
      {
      return head==asVolatile(tail);
      }
   ItemT pop_front()
      {
      while (is_empty()) {
         ++empty_count;
         ResetEvent (evEmpty);
         need_consumer_wakeup= true;
         _ReadWriteBarrier();
         _mm_mfence();
         if (is_empty_v())
            WaitForSingleObject (evEmpty, INFINITE);
         need_consumer_wakeup= false;
         }
      ItemT retval (data[head]);
      data[head]= ItemT();
      asVolatile(head)= (head+1) % data.size();
      _ReadWriteBarrier();
      _mm_mfence();
      if (need_producer_wakeup)  SetEvent(evFull);
      return retval;
      }
  // debug and testing
  volatile int empty_count, full_count;
};

This implementation runs fully concurrently. One thread can be adding an element at the exact same time that another thread is removing an element, and they don’t get in each other’s way at all!

Obviously, the class is defined as blocking in the empty-read or full-write cases. But when elements are buffered, both threads run with nearly full efficiency.

First, there are clearly no locks at all, so there is no expense of acquiring and releasing them. The signaling of the other thread is lazy: the call is made only when it is certainly needed. Otherwise, there are not only no Operating System calls, but no atomic instructions either!

However, each function has some features that make it slower than a non-threaded implementation would be. The footprint, though, is significantly lighter than in the first example.

Although there are no atomic instructions, there is a “memory fence” encountered once in each function. On some x86 CPUs this might cost the same as one atomic instruction. But in principle the cost is much less: rather than flushing the entire execution pipeline, it only has to prevent the internal optimization of allowing reads to take priority over writes. I expect that the newer the chip, the better the relative performance difference. And remember, this is only needed once per call.
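As a stand-alone illustration (the names here are mine, not part of the class), this is the store-then-load handshake the fence protects. Each side first stores its flag or index and then loads what the other side last wrote; without the fence, the CPU may satisfy the load before the store becomes visible, and both threads could decide to sleep at the same moment.

#include <emmintrin.h>

volatile bool consumer_wants_wakeup= false;
volatile int  published_tail= 0;

bool consumer_should_sleep()
   {
   consumer_wants_wakeup= true;    // store: tell the producer a wakeup is needed
   _mm_mfence();                   // make that store visible before the load below
   return published_tail == 0;     // load: re-check the producer's progress
   }

void producer_after_push (int new_tail)
   {
   published_tail= new_tail;       // store: publish the new item
   _mm_mfence();                   // make it visible before checking the flag
   if (consumer_wants_wakeup)
      { /* SetEvent(evEmpty) in the real class */ }
   }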

The other cost relates to disabling compiler optimizations. The use of volatile makes sure that the variable is actually read or written to and not cached in a register with the expectation that it doesn’t change by other means. A related construct, _ReadWriteBarrier, prevents the compiler from re-arranging instructions around this point.
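Here is a small made-up illustration of what the volatile access is guarding against: without it, the compiler is free to hoist the load of tail out of the loop and spin on a stale register copy (this uses the asVolatile helper from the listing above).

size_t head= 0, tail= 0;   // imagine another thread advancing 'tail'

void bad_wait()  { while (head == tail) { } }              // load of 'tail' may be hoisted out of the loop
void good_wait() { while (head == asVolatile(tail)) { } }  // forces a fresh load on every pass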

These have very little effect, and are sometimes there “just in case” as the compiler wasn’t making optimizations that break the code anyway. Future compilers might be smarter.

This version is everything that the first version is not. The reverse is also true. This version is tricky to write correctly. It is specific to exactly two threads, and does not allow any number of readers and writers like the first listing would. Worse, the first listing is portable, using (eventual) standard C++ library functions. This wait-free version is specific to the compiler, specific to the operating system, and specific to the machine architecture.

Benchmarks

Comparing the wall-clock time of a program that runs two threads and pumps a bunch of stuff between them, using one queue implementation or the other, the two versions take the same amount of time when the queue is fairly large.

The real point here is that the queue code is not very long, so even in a simple benchmark program it does not take a significant portion of the program execution. In fact, changing the test record type makes an overwhelming difference! If the queue is declared to hold shared_ptrs of records that contain a std::string, then the producer must allocate memory and the consumer frees it, and both hit the shared_ptr atomic counter’s overhead. Changing to an ItemT type that contains just an int made the program run 5× faster.
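For illustration only (the actual record types are not shown here), the two shapes being compared are roughly:

#include <memory>
#include <string>

struct PlainRec { int n; };                        // trivially copyable: cheap to push through the queue
struct HeavyRec { int n; std::string payload; };   // each item allocates in the producer and frees in the consumer

// Queue_1R1W<PlainRec>                  q_fast (1000);   // the version that ran about 5x faster
// Queue_1R1W<std::shared_ptr<HeavyRec>> q_slow (1000);   // adds shared_ptr refcount traffic on top of the allocations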

Making the queue larger reduces the number of times that the queue becomes full, and thus cuts how often the producer must be blocked. Making the queue “large enough” doubles the speed, and making it larger still does not further improve matters. The queue is empty more than it is full, since the producer is a little slower than the consumer, which does nothing with the data it receives. Depending on the situation, it is natural that one or the other will be faster, as they won’t run at exactly the same speed. So rather than waking the consumer immediately when a new item is ready, having a hysteresis or “high/low-water mark” would improve performance on a busy server that had other stuff to do as well. Wait for a few items to be ready, not just one, before re-enabling the consumer thread. Of course, this increases latency, and is a de-optimization if cores would be idle anyway.
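As a rough sketch only, the push side of that idea might look like the following. wake_threshold is an invented member, and a real version would also need a timeout or an explicit flush, so that a producer going quiet does not leave the consumer parked behind a partly filled queue.

   void push_advance()
      {
      asVolatile(tail)= (tail+1) % data.size();
      _WriteBarrier();
      _mm_mfence();
      // only signal once several items are buffered, not after every single push
      size_t buffered= (tail + data.size() - head) % data.size();
      if (need_consumer_wakeup && buffered >= wake_threshold)
         SetEvent(evEmpty);
      }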

A surprising result is that the Boost version runs significantly faster, up to 3×, when the queue is made very small and blocking is frequent. Presumably, the implementation of condition variables is often faster than an OS call to use kernel Event objects. In fact, the proposed standard has different mutex types for efficiency: the plain mutex used here is very minimal, and does not support try-lock, locking on multiple mutexes at once, or even recursive locks. Likewise, there are two implementations of condition variables to choose from, one of which is optimized for use only with this most simplistic mutex.
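Presumably the two condition-variable flavors being referred to are these, in the standard-tracking Boost API:

#include <boost/thread/condition_variable.hpp>

boost::condition_variable      cv;       // works only with unique_lock<boost::mutex>; can be implemented more leanly
boost::condition_variable_any  cv_any;   // works with any lockable type, at some extra cost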

The condition-variable implementation does some logic in the user-mode calling thread before deciding it really has to block, and this not only avoids the OS call, but has the same result as a spin-lock: a slight delay will allow the other thread to produce something, and the thread doesn’t need to be benched after all.

Confirming this idea, adding a small delay to the code made that case speed up to nearly the point given by the large queue. With this version, the program uses less memory: a small queue works just as well as a large one.

   void spin_wait() const
      {
      static volatile int x= 0;
      for (int i= 0;  i<10;  ++i)  ++x;  // touch the volatile so the delay loop is not optimized away
      }
   ItemT pop_front()
      {
      bool first_wait= true;
      while (is_empty_v()) {
         ++empty_count;
         if (first_wait) {
            first_wait= false;
            spin_wait();
            continue;
            }
         ResetEvent (evEmpty);
         …

Since the producer and consumer run at nearly the same speed already, slowing down the slightly faster side lets them run in step and prevents the expensive blocking. This gives the same benefit as allowing the producer to fill the queue for a while before resuming the consumer, but without the latency hit. Naturally, this only pays off when the needed wait is much briefer than the time taken to block and unblock the thread.

Looking not at the total program wall-clock time, but at the cycle count of executing just the producer’s call to add an item to the queue, in the case where the queue is not full, the Boost version takes 11602 cycles and the lock-free version takes 4209 cycles. The lock-free version is simply 2¾× faster. That comes to 1.28 µs vs. 3.51 µs (at the roughly 3.3 GHz clock those figures imply). Over a million calls, it adds up to a savings of over two seconds!

But… the cycle time measurement doesn’t agree with the wall-clock time predicted for a large number of iterations. Not by a long shot. Is the cycle counter not returning the units I expect?

It seems that if the code reports the results of each call to cout as it goes, then it runs much slower than if it called the function in a tight loop without doing anything else.

 for (int loop= 0;  loop < time_count;  ++loop) {
    S1 rec;
    rec.n= loop;
    unsigned __int64 before= __rdtsc();
    q.push_back (std::move(rec));
    times[loop]= __rdtsc()-before;
    // cout << "time " << loop << ": " << times[loop] << endl;
    // oops, don't call that now!
    }

 for (int loop= 0;  loop < time_count;  ++loop) {
    cout << "time " << loop << ": " << times[loop] << endl;
    }

In the listing above, enabling the commented-out cout line makes the call to push_back run much slower. How much slower? In the simple Boost implementation, 11602 cycles vs. 180 cycles is about 64× slower! In the lock-free implementation, 4209 cycles vs. 149 cycles, or a mere 28× slower.

Now, which case is more common in real code? Will the producer be calling the single function in a tight loop and not doing anything else (like, say, actually getting the data from somewhere)? Or will it call queue.push_back(rec) as a small part of its work, basically shipping the completed product to the next guy in the assembly line?

The normal way of benchmarking something, at least by old tradition, is to strip out everything else and just test the thing you are interested in timing. But as we see here, that gives totally skewed results. If called in a tight loop with nothing else, presumably the CPU caches a great deal (like the decoded instructions). Interestingly, this case minimizes the speed difference between the two implementations (180 vs. 149 cycles). When called in a more realistic situation, the speed difference is much greater.

On modern computers, benchmarking is black magic. As we see here, changing something that’s not in the function makes the function run 64 times faster!

Now consider code that really does execute a small-ish loop many times. Perhaps moving a “heavy” call away from there can make the code speed up enormously. The heavy call might be a synchronization construct of some kind. For example, not having to “signal” each time through the loop might cause this kind of speed-up not just because the library routine is thought to contain expensive atomic instructions, but simply because it is more stuff to fit into the CPU’s limited attention span.

This is continued in part 2.