Hi. Please bear with me, as this is a serious question. I cannot figure out the popularity of searching for the ultimate "lock free" algorithm for queues, etc. I just finished reading the article Yet another implementation of a lock-free circular array queue, and followed up on some of the references and Googled some other articles going back almost 20 years. For reference, see An Optimistic Approach to Lock-Free FIFO Queues.
Here's my confusion: all of the articles that address the N-producer, M-consumer case end up creating some sort of loop, a
while(true){}
or
do{}while (CAS...)
block to wrap a set of instructions. Both referenced articles demonstrate this. Effectively, they replace a blocking operation with a "spin loop" that retries until the operation succeeds without contention from other threads.
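To make the shape of that loop concrete, here is a minimal sketch of the classic do-while-CAS pattern, a lock-free stack push/pop (a Treiber stack) using C++ `std::atomic`. The names (`push`, `pop_or`, `Node`) are mine, not from either article; it is an illustration of the retry shape, not a production queue (in particular, reclaiming nodes under real concurrency needs extra care):

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

// The do{...}while(!CAS) shape: build the new state privately, then
// publish it only if no other thread changed `head` in the meantime.
void push(int v) {
    Node* n = new Node{v, nullptr};
    Node* old = head.load(std::memory_order_relaxed);
    do {
        n->next = old;  // re-link against the latest observed head
    } while (!head.compare_exchange_weak(old, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}

int pop_or(int fallback) {
    Node* old = head.load(std::memory_order_acquire);
    while (old && !head.compare_exchange_weak(old, old->next,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
        ;  // CAS failed: someone else won; retry with the new head
    if (!old) return fallback;
    int v = old->value;
    delete old;  // safe only in a single-threaded demo, not under contention
    return v;
}
```

Each failed CAS means another thread got in first, so the loop re-reads and tries again; this is exactly the "retry until no contention" behavior the articles rely on.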
Back in the '60s, we used to call this a "spin lock": protecting a block of code using a "read / modify / write" hardware instruction to "test / set" a condition. Early IBM/360s had "TS" (Test and Set). PDP-10s had "AOSE" (Add One and Skip if Equal to zero), allowing similar logic.
Many of the algorithms I've seen use complicated methods to detect an inconsistent queue state or the "ABA" problem: basically, discovering that a pointer has been reused and no longer points to what you think it does. Object serial numbers are usually introduced to alleviate this problem.
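For readers unfamiliar with the serial-number trick: the value is paired with a monotonically increasing tag, and the CAS compares both together, so even if the value field returns to an old bit pattern, the tag will differ and a stale CAS fails. A minimal sketch (the names `Tagged`, `cell`, `tagged_store` are illustrative, and real implementations typically pack a pointer plus counter, or use a double-width CAS):

```cpp
#include <atomic>
#include <cstdint>

struct Tagged {
    uint32_t value;
    uint32_t tag;    // serial number, incremented on every update
};

std::atomic<uint64_t> cell{0};

static uint64_t pack(Tagged t) {
    return (uint64_t(t.tag) << 32) | t.value;
}
static Tagged unpack(uint64_t raw) {
    return Tagged{ uint32_t(raw & 0xffffffffu), uint32_t(raw >> 32) };
}

// Install `v` over the state we observed; the bumped tag ensures that a
// thread holding a stale snapshot cannot succeed even if the value field
// happens to match (the ABA case).
bool tagged_store(uint32_t v) {
    uint64_t old = cell.load();
    Tagged t = unpack(old);
    Tagged next{ v, t.tag + 1 };
    return cell.compare_exchange_strong(old, pack(next));
}
```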
Now my question. Why go through all these convoluted steps? Why isn't the simplest solution used instead? I mean, I get it: you want to avoid system calls for locking (mutexes, condition variables, et al.) in the "low contention" cases, when collisions on the queue (or whatever) are rare. But if folks are willing to use a compute-bound spin loop in the contention cases, why not just
...
while (!CAS(&m_lock_taken, 0, 1));   // spin until we atomically take the lock
/* ... critical section: manipulate the queue ... */
m_lock_taken = 0;                    // release the lock
...
A simple spin, no complex detection or unwinding. It has all the drawbacks of a spin loop: it chews up CPU time doing nothing until the lock is released, it probably causes memory contention on the m_lock_taken variable, and worse, when N+M exceeds the number of processors / cores in the system, a spinning thread is burning execution cycles that could be given to the thread that owns the lock and was preempted by an event or the scheduler. I simply do not understand the need for all the complexity.
And if you care about chewing up CPU time, then what you want is the scheme where the lock is taken quickly when there is no contention and a waitable object (mutex, condition variable) is used when the rare contention happens.
It has always been my experience that when multiple threads need to access a structure / object that needs more than one data item to be in a consistent state before the object is "valid", that sequence needs to be protected by an "interlock", blocking or looping.
All the algorithms I've seen so far end up implementing just this simple case: loop until the complex set of operations yields a consistent object state. Yet there are far simpler ways of doing exactly that.
Discuss?