Used/free stacks for workers #3
base: master
Conversation
Now workers are tracked in used/free stacks, and ringbuf_consume() only iterates through in-use workers. Previously, in-use workers were tracked with a "registered" flag in ringbuf_worker_t, and ringbuf_consume() iterated through all possible workers. This could be inefficient if there were far more potential workers than actual workers. Presently, ringbuf_worker_t.next is a pointer. If the ABA problem is considered important, it can be changed to an offset into ringbuf_t.workers[] and a counter.
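For illustration, the offset + counter variant could look roughly like the Treiber-style stack below. This is only a sketch of the idea, not the exact code in this diff; the names are illustrative.

```c
/*
 * Sketch only: worker records live in an array, "next" is an index into
 * that array rather than a pointer, and the stack head carries a counter
 * (tag) so that an intervening pop/push cycle cannot be mistaken for
 * "nothing changed" (the ABA problem).
 */
#include <stdatomic.h>
#include <stdint.h>

#define	WORKER_NONE	UINT32_MAX	/* end-of-stack marker */

typedef struct {
	uint32_t	idx;	/* offset into the workers[] array */
	uint32_t	tag;	/* bumped on every pop to defeat ABA */
} wstack_head_t;

typedef struct {
	_Atomic uint32_t	next;	/* index of the next worker on the stack */
	/* ... seen_off, etc. ... */
} worker_rec_t;

typedef struct {
	_Atomic wstack_head_t	free_head;
	worker_rec_t		workers[];
} wstacks_t;

/* Pop a worker record off the free stack; returns WORKER_NONE if empty. */
static uint32_t
wstack_pop(wstacks_t *ws)
{
	wstack_head_t oldh = atomic_load(&ws->free_head), newh;

	do {
		if (oldh.idx == WORKER_NONE)
			return WORKER_NONE;
		newh.idx = atomic_load(&ws->workers[oldh.idx].next);
		newh.tag = oldh.tag + 1;
	} while (!atomic_compare_exchange_weak(&ws->free_head, &oldh, newh));
	return oldh.idx;
}

/* Push a worker record back onto the free stack. */
static void
wstack_push(wstacks_t *ws, uint32_t idx)
{
	wstack_head_t oldh = atomic_load(&ws->free_head), newh;

	do {
		atomic_store(&ws->workers[idx].next, oldh.idx);
		newh.idx = idx;
		newh.tag = oldh.tag;	/* bumping on pop is enough for a stack */
	} while (!atomic_compare_exchange_weak(&ws->free_head, &oldh, newh));
}
```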
My initial implementation was similar, but I later changed it. The main reason is that I do not want the implementation to assume the use of threads. In fact, my primary use case for this ring buffer is with processes and shared memory. The current API (or rather, worker structures in an array) is also a little more process crash/exit friendly, as a process can come back and re-claim its slot. Walking the pointers would be problematic with processes. Also, the ABA problem must be handled, otherwise you can get livelocks (the stress test is able to trigger them, if you have enough cores). So, I think we need a better solution. Can you perhaps describe your use case a little bit more? Normally, I would expect the number of producers to match or not exceed the number of logical processors/cores (unless you have a 2000s-style application running with tons of blocking threads). Are you dealing with some kind of virtual instances which cannot easily share a ringbuf worker?
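Roughly, the process/shared-memory pattern I have in mind looks like the sketch below: each process owns a fixed worker slot, so after a crash it simply registers the same slot again. The shm layout and the setup handshake here are simplified illustrations, not a prescribed way to use the library.

```c
/* Sketch: per-process fixed worker slot in a shared-memory ring buffer. */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "ringbuf.h"

#define	NWORKERS	8
#define	BUF_LEN		4096

int
main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	const unsigned slot = (unsigned)atoi(argv[1]);	/* fixed per process */
	size_t rbuf_size, wrk_size;

	ringbuf_get_sizes(NWORKERS, &rbuf_size, &wrk_size);

	/* Map the control structure plus the data buffer from shared memory. */
	int fd = shm_open("/myring", O_RDWR | O_CREAT, 0600);
	ftruncate(fd, rbuf_size + BUF_LEN);
	void *base = mmap(NULL, rbuf_size + BUF_LEN,
	    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	ringbuf_t *rbuf = base;
	char *data = (char *)base + rbuf_size;

	/* Simplified: the process owning slot 0 sets up the ring buffer. */
	if (slot == 0)
		ringbuf_setup(rbuf, NWORKERS, BUF_LEN);

	/* Register -- or, after a crash, re-claim -- worker slot `slot`. */
	ringbuf_worker_t *w = ringbuf_register(rbuf, slot);

	const char msg[] = "hello";
	ssize_t off = ringbuf_acquire(rbuf, w, sizeof(msg));
	if (off != -1) {
		memcpy(&data[off], msg, sizeof(msg));
		ringbuf_produce(rbuf, w);
	}
	return 0;
}
```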
Your point about crashed applications is well taken. I don't have a clean solution for that yet. Mostly, my use case is a general communications library, one that doesn't know how many clients it'll have. Shared memory tends to be allocated 4 KiB at a time; 512 instances of my current per-client structure fit into 4 KiB, so my initial max-clients was 512. That's a lot to run through every time a message is consumed. The only better idea I have is a big rethink, i.e. to use the method described in "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" (available here, among other places), and to store the referred-to list nodes using ring-buffer memory, between the user-supplied data areas. Then there wouldn't be any need to allocate worker structures. I was imagining a variant of that algorithm with, in addition to Head and Tail pointers, PendingHead and PendingTail pointers. One upside is that you'd get multi-consumer for free. Next weekend, hopefully :-)
OK, a few ideas:
On the re-think: it depends on what problem you are trying to solve. There is a reason why this library is called a ring buffer (rather than a queue). :) If you just need a queue of pointers -- yes, there are better algorithms. One reason I implemented such a ringbuf is that you don't have to do malloc + memcpy + produce. It's a true buffer, so you can also send a stream of data without the need to encode the message lengths. That is, I can do: ringbuf_consume() (as much as it can) + write() + ringbuf_release(). This maximises the throughput (think of I/O operations).
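Concretely, the consumer side I have in mind looks roughly like this (setup and error handling elided; `rbuf`, `data` and `fd` are assumed to be set up elsewhere):

```c
/*
 * Sketch of the consume-as-much-as-possible + write() + release pattern.
 * `data` is the backing byte buffer that `rbuf` manages offsets into.
 */
#include <unistd.h>
#include "ringbuf.h"

static void
drain_to_fd(ringbuf_t *rbuf, const char *data, int fd)
{
	size_t off, len;

	/* Grab the largest contiguous ready-to-consume range. */
	while ((len = ringbuf_consume(rbuf, &off)) != 0) {
		size_t done = 0;

		while (done < len) {
			ssize_t n = write(fd, &data[off + done], len - done);
			if (n == -1)
				return;	/* real code: handle EINTR/EAGAIN */
			done += (size_t)n;
		}
		/* Mark the consumed bytes as free for producers to reuse. */
		ringbuf_release(rbuf, len);
	}
}
```

The point is that the consumer can push an arbitrarily large contiguous chunk to write() in one go, with no per-message framing.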
I wasn't talking about dynamically allocating nodes. Instead of using malloc(), I was talking about allocating the node-like structures from ring-buffer memory, just before the user-supplied data of each acquire/produce. The ring buffer becomes less of a stream, but it no longer requires worker structures, crashed producers don't require extra handling, and you get multi-consumer for free.
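To make that concrete, the per-record header I'm imagining would look something like the sketch below. This is purely speculative design, not anything in the library.

```c
/*
 * Speculative sketch: a node header carved out of ring-buffer memory,
 * placed immediately before the user data of each acquire/produce.
 * Every acquire reserves sizeof(node_hdr_t) + len bytes and links the
 * header into an MS-queue-style list, so no worker structures are needed.
 */
#include <stdatomic.h>
#include <stdint.h>

typedef struct node_hdr {
	_Atomic uint32_t	next;	/* buffer offset of the next node, or "none" */
	uint32_t		len;	/* length of the user data that follows */
	_Atomic uint32_t	ready;	/* set by the producer once the data is filled */
} node_hdr_t;

typedef struct {
	_Atomic uint32_t	head;		/* oldest consumable node */
	_Atomic uint32_t	tail;		/* newest produced node */
	_Atomic uint32_t	pending_tail;	/* newest acquired-but-unfilled node */
	/* ... space accounting ... */
	uint8_t			data[];		/* node headers + user data live here */
} nodebuf_t;
```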
BTW...in your copious spare time...can you take a look at my full_buffer branch & see if you spot why it's not working? In addition to wanting to fill the buffer, I have a need for ringbuf_acquire() to tell me if the buffer was empty, i.e. so that I can wake up a per-client message-receiving thread. My code is pretty simple, so I must be doing something really dumb. Thanks for any time you can spend on this.
I can change the worker-pointers to be an offset within the workers[] array. I can also change the allocation of workers so that it happens in ringbuf_acquire(), and the deallocation so that it happens in ringbuf_produce(). That was actually my original implementation, but I thought it was overkill, not thinking of the crashed-applications problem you pointed out. I will re-create that version and submit it here. And thanks for looking at the full_buffer branch. I know it's racy, but the value of "used" should only ever be too large, never too small, so it should be safe. The problem I have is that it ends up way too large, nearly twice the size of the buffer, and I haven't been able to figure out why. If I intend to write more lock-free code in the future, I really need to figure out why this is happening. I spent the weekend reading up on lock-free programming in general, to make sure I wasn't missing anything fundamental, so that I can produce higher-quality pull requests. So I have no new code to submit yet. Thanks for the time you've spent on this.
…re(). This not only decreases the amount of time that a worker-record remains on the used-stack (i.e. it helps make it more crash-resistant), but it also removes the need to register workers. It passes the tests, but not the stress-test.
The downside is that failed calls to ringbuf_acquire() consume a worker-record, and ringbuf_consume() must be called to free up worker-records on the used-workers stack that are no longer in use. I'm pretty sure it's not safe to try to remove used-workers from the stack except in ringbuf_consume(). This probably isn't a problem in practice, but I'm still not happy about it.
It helps, but stress-test performance is still far below that of the old version of the code, and it is also inconsistent.
Your latest diff is a conceptually nice design and a clean piece of code. Unfortunately, it is not efficient. Atomic operations are expensive. It depends a lot on the CPU (and how many of them you have!), but your latency might be 10-30x higher compared to a primitive instruction, and some measurements claim the cost to be in the range of 300 cycles, although that is not a good/meaningful measurement. Atomic operations can generate a lot of inter-CPU/core traffic, especially if you have cache-line contention, so it all "depends" and the overall impact can be greater (see [1] or similar papers if you want a more in-depth view). Your change means 3 atomic operations and now 2 contention points on the producer side. That is significantly more overhead than 1 atomic operation (and a contention point). So, let's keep it simple and efficient. :) [1] https://htor.inf.ethz.ch/publications/img/atomic-bench.pdf
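If you want a rough feel for the uncontended cost on your own machine, a crude micro-benchmark like the one below works. It only shows the single-core cost; the cross-core traffic under cache-line contention, which is the bigger problem here, needs a multi-threaded test.

```c
/* Crude single-threaded comparison: plain increment vs. atomic fetch_add. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define	ITERS	100000000UL

static uint64_t
now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int
main(void)
{
	static _Atomic unsigned long acnt;
	static volatile unsigned long cnt;
	uint64_t t0, t1, t2;

	t0 = now_ns();
	for (unsigned long i = 0; i < ITERS; i++)
		cnt = cnt + 1;			/* plain read-modify-write */
	t1 = now_ns();
	for (unsigned long i = 0; i < ITERS; i++)
		atomic_fetch_add(&acnt, 1);	/* locked read-modify-write */
	t2 = now_ns();

	printf("plain:  %.2f ns/op\n", (double)(t1 - t0) / ITERS);
	printf("atomic: %.2f ns/op\n", (double)(t2 - t1) / ITERS);
	return 0;
}
```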
Yes, I figured that out after the fact.
Now workers are tracked in used/free stacks, and ringbuf_consume() only iterates through in-use workers.
Previously, in-use workers were tracked with a "registered" flag in ringbuf_worker_t, and ringbuf_consume() iterated through all possible workers. This could be inefficient if there were far more potential workers than actual workers.
Presently, ringbuf_worker_t.next is a pointer. If the ABA Problem is considered to be important, it can be changed to an offset into ringbuf_t.workers[] and a counter.
Finally, the "unsigned i" parameter can be removed from ringbuf_register(); I haven't done that yet, as I didn't want to change the API. But it seems helpful not to force ring-buffer clients to coordinate their index numbers among each other. (My expected use of ringbuf involves communication between applications using ring buffers allocated from shared memory, so such coordination is even more difficult.)
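For illustration, with the current API the only way for unrelated clients to agree on indices is something like a shared slot counter; the `shm_hdr_t` and `claim_worker` names below are hypothetical, not part of the library.

```c
/*
 * Hypothetical illustration of the coordination burden described above:
 * cooperating processes hand out unique worker indices via an atomic
 * counter placed next to the ring buffer in the shared-memory segment.
 */
#include <stdatomic.h>
#include <stddef.h>
#include "ringbuf.h"

typedef struct {
	_Atomic unsigned	next_slot;	/* hypothetical slot allocator */
	/* the ringbuf_t follows in the same mapping */
} shm_hdr_t;

static ringbuf_worker_t *
claim_worker(shm_hdr_t *hdr, ringbuf_t *rbuf, unsigned nworkers)
{
	unsigned i = atomic_fetch_add(&hdr->next_slot, 1);

	if (i >= nworkers)
		return NULL;	/* out of slots; nothing here ever recycles them */
	return ringbuf_register(rbuf, i);
}
```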