In a multi-threaded program, we have to make sure that shared data is thread safe. Usually we employ some kind of lock (mutex, semaphore, etc.) to achieve this. However, locks themselves are expensive. If the lock can be removed while the data stays thread safe, performance improves a lot. There has been a lot of research on lock-free data structures. Most of it requires the hardware to provide CAS (compare-and-swap). These methods are intricate in terms of both implementation and correctness, and they also have to deal with problems such as the ABA problem. Those methods target the general case: many threads, each with read and write access to the shared data. But when we look into a special case and study its properties, we can find a simpler and more efficient lock-free data structure.
The special case is single producer, single consumer. This is a simple scenario, and it is not an uncommon one. In this scenario:
- There are only two threads accessing the shared data.
- One of them only reads the data. The other only modifies the data.
- Data is written at one end while it's read at the other end.
A queue is usually used to pass data between threads, but I would like to use a ring buffer instead. This is because:
- A queue is usually implemented as a linked list, so there is a memory allocation for every element added. Memory allocation is expensive, and dynamic allocation in C/C++ typically involves a global lock (you can search for "lock-free malloc" if you're interested).
- A ring buffer is usually implemented on top of an array, so we can allocate memory in bulk and reduce the number of allocations. E.g. instead of allocating 10 elements 10 times in a queue, we can allocate an array of 10 elements only once.
- An array also gives us spatial locality, which is hard to achieve with a list-based queue.
There is a drawback to an array: its size is fixed, so we cannot put more elements into the ring buffer when it is full. But there is a way around it.
First, let's look at a simple implementation of a lock-free ring buffer in a single producer, single consumer context (refer to this link for more information about ring buffers). Then we will see how it works and why it works. Finally, I'll address the limitations and improvements.
This is the implementation. It may be improved later.
class RingBuffer
{
public:
    RingBuffer();
    // Returns 1 and stores the element in *result on success, 0 if empty.
    int read(int *result);
    // Returns 1 on success, 0 if the buffer is full.
    int write(int element);

private:
    unsigned read_index;
    unsigned write_index;
    static const unsigned buffer_size = 256;
    int buffer[buffer_size];
};
RingBuffer::RingBuffer() : read_index(0), write_index(0)
{
}
int RingBuffer::read(int *result)
{
    unsigned local_read = read_index;
    if (local_read == write_index)   // empty
        return 0;
    ++local_read;
    if (local_read == buffer_size)   // wrap around
        local_read = 0;
    *result = buffer[local_read];
    read_index = local_read;         // publish: the writer may now reuse this cell
    return 1;
}
int RingBuffer::write(int element)
{
    unsigned local_write = write_index;
    ++local_write;
    if (local_write == buffer_size)  // wrap around
        local_write = 0;
    if (local_write != read_index) { // not full
        buffer[local_write] = element;
        write_index = local_write;   // publish: the reader may now read this cell
        return 1;
    }
    return 0;
}
There are only two variables that we're interested in: read_index and write_index. The array buffer is where the shared data lives, and we need to guarantee that access to the data in buffer is thread safe. That is what read_index and write_index are for: writes never go past read_index, and reads never go past write_index.
Note that with a single producer and a single consumer, there is only one thread reading (the reader thread) and only one thread writing (the writer thread). The reader thread only calls RingBuffer::read() and the writer thread only calls RingBuffer::write(). So read_index is only changed by the reader thread, and write_index is only changed by the writer thread. Also note that each index is changed only after the actual read/write is done. So in read(), before read_index is changed, the writer thread won't write to any cell past read_index, which is exactly where the reader thread reads data; hence the writer thread cannot overwrite unread data. The same holds in write(): the reader thread won't read from any cell past write_index.
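To make this concrete, here is a minimal producer/consumer demo using the class above. It is only a sketch: std::thread (C++11) and the spin loops are my additions for the demo, not part of the implementation.

#include <cstdio>
#include <thread>

int main()
{
    RingBuffer rb;

    // Writer thread: the only caller of write().
    std::thread writer([&rb] {
        for (int i = 0; i < 100000; ++i)
            while (!rb.write(i))      // buffer full: spin until a cell frees up
                ;
    });

    // Reader thread: the only caller of read().
    std::thread reader([&rb] {
        for (int i = 0; i < 100000; ++i) {
            int value;
            while (!rb.read(&value))  // buffer empty: spin until data arrives
                ;
            if (value != i)
                std::printf("unexpected value at %d\n", i);
        }
    });

    writer.join();
    reader.join();
    std::printf("done\n");
    return 0;
}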
Let's take an example to see how it works. Say the size of the ring buffer is 256, write_index = 200 and read_index = 100. The writer thread has already written up to cell 200, and the reader thread has read everything up to cell 100:
  0 ............ 99 | 100 ............ 200 | 201 ............ 255
    safe to write   |  available to read   |    safe to write
The cells marked safe to write are where the writer thread may store new elements, and the cells marked available to read belong to the reader thread. The reader thread first checks whether read_index (100) is equal to write_index (200). It is not, so it reads the element at cell 101. When read_index reaches 200, the reader thread gets no more elements, because read() simply returns when read_index == write_index; the reader thread never gets a chance to read past cell 200 into the writable cells. So the reader thread either reads valid data or reads nothing at all. The analysis is the same for the writer thread: it never overwrites the reader's cells, including cell 100, and always writes to a cell that is safe to write to.
But why are there local_read and local_write? While the reader thread is reading from the buffer, the writer thread may be trying to write to it. Since read_index is what prevents the writer thread from writing into the reader's cells, it must not be changed until the reader thread has finished reading. So we use local_read to tell the reader thread where to read the element from. Once the reader thread has copied the element out and read_index is assigned the value of local_read, the writer thread may immediately write to that cell; no data is corrupted, because the element in that cell has already been read.
That is how read_index and write_index together protect the shared data. But to make this actually work as we expect, we need to address some technical details. There are two major problems to take into consideration: atomic operations and reordering.
- atomic operations
Both the reader thread and the writer thread can access read_index and write_index. Reads and writes of these two variables must be atomic for this scheme to work; otherwise one thread may see a half-updated value. On x86/x86_64, integer reads and writes are atomic as long as the integers are properly aligned, so we have to check how the compiler aligns them. I believe most compilers align them correctly.
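If you want the build to confirm this assumption, a compile-time check along these lines works. This is my addition, not part of the original code (static_assert and alignof are C++11):

// On x86/x86_64, a naturally aligned 4-byte integer is read and written
// atomically. Fail the build if unsigned isn't such a type.
static_assert(sizeof(unsigned) == 4 && alignof(unsigned) == 4,
              "unsigned is not a naturally aligned 4-byte integer");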
- reordering
- Compiler reordering
The compiler may reorder instructions to optimize the program. Let's take a look at this snippet from read():
*result = buffer[local_read];    // inst. 1
read_index = local_read;         // inst. 2
The compiler may emit inst. 2 before inst. 1. If so, read_index is changed before the reader thread actually reads the element. If the writer thread happens to run at that moment, can it write to the cell at read_index? No: write() never writes into the cell read_index points at. So we don't need to worry about compiler reordering here.
But it is not OK in write().
buffer[local_write] = element;   // inst. 3
write_index = local_write;       // inst. 4
If inst. 4 is moved above inst. 3, write_index is updated before the writer thread writes to buffer[local_write]. That would allow the reader thread to read from buffer[local_write] and get stale data. To deal with it, I make local_write volatile, which tells the compiler not to optimize reads/writes of local_write and makes sure inst. 3 is executed before inst. 4. We don't need to make read_index or write_index volatile. (Refer to this and this if you're interested in volatile.)
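Concretely, the only change is the declaration of local_write in write():

int RingBuffer::write(int element)
{
    // volatile: the compiler won't reorder accesses to local_write,
    // which keeps inst. 3 before inst. 4.
    volatile unsigned local_write = write_index;
    ++local_write;
    if (local_write == buffer_size)
        local_write = 0;
    if (local_write != read_index) {
        buffer[local_write] = element;   // inst. 3
        write_index = local_write;       // inst. 4
        return 1;
    }
    return 0;
}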
- CPU reordering
We still assume we are on x86/x86_64. The CPU won't move reads around other reads or writes around other writes, and it won't move writes before reads; it only moves reads ahead of writes. Let's look at read(). It reads read_index into local_read, modifies local_read, reads the element at local_read into the result, and then updates read_index. The update of read_index is a write and reading the element is a read, so the CPU won't execute the update before the element is stored into result. And even if the update of read_index did run before reading the element, due to compiler reordering, we know the writer thread never writes to the cell at read_index, so the data is still safe. In write(), inst. 3 and inst. 4 are both writes, and the CPU keeps writes in order, so the element is stored before write_index is updated.
There are some limitations to this implementation.
- We rely on the atomicity and ordering guarantees of x86/x86_64. They do not necessarily hold on other platforms. (A portable variant is sketched after this list.)
- It only reads/writes one element at a time. It would be better to have batch interfaces such as int read(int *array, int max_size) and int write(int *array, int max_size). (A possible write is sketched after this list.)
- The array won't grow. This implementation of the ring buffer can't hold more than 255 elements. But what if the writer thread writes more than that and the reader thread cannot catch up? We need a way to "expand" the ring buffer. The idea comes from a possible implementation of deque in C++, but instead of a vector of vectors, a list of such ring buffers is used: if the current ring buffer is full, we append a new ring buffer to the list and the writer thread writes to the new one. That's it. Again, we need a lock-free list, and when a ring buffer becomes empty after reading, we need to release it if it was dynamically allocated. There are two articles regarding lock-free queues: http://www.drdobbs.com/parallel/208801974 and http://www.drdobbs.com/parallel/210604448?pgno=1. I believe you can implement it, and then you won't need to worry about the capacity of a single ring buffer.
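For the first limitation, here is a minimal sketch of a portable variant that replaces the bare indices with C++11 std::atomic and acquire/release ordering. This is my own adaptation, not part of the original implementation; the class name and the choice of memory orders are assumptions:

#include <atomic>

class RingBufferPortable
{
public:
    RingBufferPortable() : read_index(0), write_index(0) {}

    int read(int *result)
    {
        unsigned local_read = read_index.load(std::memory_order_relaxed);
        // acquire: see the element the writer published before write_index
        if (local_read == write_index.load(std::memory_order_acquire))
            return 0;
        ++local_read;
        if (local_read == buffer_size)
            local_read = 0;
        *result = buffer[local_read];
        // release: the writer may reuse this cell once it sees the new read_index
        read_index.store(local_read, std::memory_order_release);
        return 1;
    }

    int write(int element)
    {
        unsigned local_write = write_index.load(std::memory_order_relaxed);
        ++local_write;
        if (local_write == buffer_size)
            local_write = 0;
        // acquire: make sure the reader is really done with this cell
        if (local_write != read_index.load(std::memory_order_acquire)) {
            buffer[local_write] = element;
            // release: publish the element before advancing write_index
            write_index.store(local_write, std::memory_order_release);
            return 1;
        }
        return 0;
    }

private:
    std::atomic<unsigned> read_index;
    std::atomic<unsigned> write_index;
    static const unsigned buffer_size = 256;
    int buffer[buffer_size];
};

On x86/x86_64 these loads and stores compile to plain moves, so this costs about the same as the original; on weaker architectures the compiler emits the barriers the original code relies on the hardware for.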
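For the second limitation, here is one possible shape of a batch write. Again, this is only a sketch: the signature follows the suggestion above, it assumes the overload is declared in the class, and the same compiler-reordering caveat as in the single-element write() applies.

int RingBuffer::write(int *array, int max_size)
{
    int written = 0;
    unsigned local_write = write_index;
    while (written < max_size) {
        unsigned next = local_write + 1;
        if (next == buffer_size)      // wrap around
            next = 0;
        if (next == read_index)       // buffer full: stop early
            break;
        buffer[next] = array[written++];
        local_write = next;
    }
    write_index = local_write;        // publish all new elements at once
    return written;
}

Publishing write_index once at the end means the reader sees the whole batch written so far, and the shared index is touched only once per call.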