Used/free stacks for workers #3
base: master
Conversation
Now workers are tracked in used/free stacks, and ringbuf_consume() only iterates through in-use workers. Previously, in-use workers were tracked with a "registered" flag in ringbuf_worker_t, and ringbuf_consume() iterated through all possible workers. This could be inefficient if there were far more potential workers than actual workers. Presently, ringbuf_worker_t.next is a pointer. If the ABA problem is considered important, it can be changed to an offset into ringbuf_t.workers[] and a counter.
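For illustration, the offset + counter variant could look roughly like the Treiber-style stack below. This is only a sketch of the idea, not the exact code in this diff; the names are illustrative.

```c
/*
 * Sketch only: worker records live in an array, "next" is an index into
 * that array rather than a pointer, and the stack head carries a counter
 * (tag) so that an intervening pop/push cycle cannot be mistaken for
 * "nothing changed" (the ABA problem).
 */
#include <stdatomic.h>
#include <stdint.h>

#define	WORKER_NONE	UINT32_MAX	/* end-of-stack marker */

typedef struct {
	uint32_t	idx;	/* offset into the workers[] array */
	uint32_t	tag;	/* bumped on every pop to defeat ABA */
} wstack_head_t;

typedef struct {
	_Atomic uint32_t	next;	/* index of the next worker on the stack */
	/* ... seen_off, etc. ... */
} worker_rec_t;

typedef struct {
	_Atomic wstack_head_t	free_head;
	worker_rec_t		workers[];
} wstacks_t;

/* Pop a worker record off the free stack; returns WORKER_NONE if empty. */
static uint32_t
wstack_pop(wstacks_t *ws)
{
	wstack_head_t oldh = atomic_load(&ws->free_head), newh;

	do {
		if (oldh.idx == WORKER_NONE)
			return WORKER_NONE;
		newh.idx = atomic_load(&ws->workers[oldh.idx].next);
		newh.tag = oldh.tag + 1;
	} while (!atomic_compare_exchange_weak(&ws->free_head, &oldh, newh));
	return oldh.idx;
}

/* Push a worker record back onto the free stack. */
static void
wstack_push(wstacks_t *ws, uint32_t idx)
{
	wstack_head_t oldh = atomic_load(&ws->free_head), newh;

	do {
		atomic_store(&ws->workers[idx].next, oldh.idx);
		newh.idx = idx;
		newh.tag = oldh.tag;	/* bumping on pop is enough for a stack */
	} while (!atomic_compare_exchange_weak(&ws->free_head, &oldh, newh));
}
```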
My initial implementation was similar, but I later changed it. The main reason is that I do not want the implementation to assume the use of threads. In fact, my primary use case for this ring buffer is with processes and shared memory. The current API (or rather, worker structures in an array) is also a little more process crash/exit friendly, as a process can come back and re-claim its slot. Walking the pointers would be problematic with processes. Also, the ABA problem must be handled, otherwise you can get livelocks (the stress test is able to trigger them, if you have enough cores). So, I think we need a better solution. Can you perhaps describe your use case a little bit more? Normally, I would expect the number of producers to match or not exceed the number of logical processors/cores (unless you have a 2000s-style application running with tons of blocking threads). Are you dealing with some kind of virtual instances which cannot easily share a ringbuf worker?
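Roughly, the process/shared-memory pattern I have in mind looks like the sketch below: each process owns a fixed worker slot, so after a crash it simply registers the same slot again. The shm layout and the setup handshake here are simplified illustrations, not a prescribed way to use the library.

```c
/* Sketch: per-process fixed worker slot in a shared-memory ring buffer. */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "ringbuf.h"

#define	NWORKERS	8
#define	BUF_LEN		4096

int
main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	const unsigned slot = (unsigned)atoi(argv[1]);	/* fixed per process */
	size_t rbuf_size, wrk_size;

	ringbuf_get_sizes(NWORKERS, &rbuf_size, &wrk_size);

	/* Map the control structure plus the data buffer from shared memory. */
	int fd = shm_open("/myring", O_RDWR | O_CREAT, 0600);
	ftruncate(fd, rbuf_size + BUF_LEN);
	void *base = mmap(NULL, rbuf_size + BUF_LEN,
	    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	ringbuf_t *rbuf = base;
	char *data = (char *)base + rbuf_size;

	/* Simplified: the process owning slot 0 sets up the ring buffer. */
	if (slot == 0)
		ringbuf_setup(rbuf, NWORKERS, BUF_LEN);

	/* Register -- or, after a crash, re-claim -- worker slot `slot`. */
	ringbuf_worker_t *w = ringbuf_register(rbuf, slot);

	const char msg[] = "hello";
	ssize_t off = ringbuf_acquire(rbuf, w, sizeof(msg));
	if (off != -1) {
		memcpy(&data[off], msg, sizeof(msg));
		ringbuf_produce(rbuf, w);
	}
	return 0;
}
```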
Your point about crashed applications is well taken. I don't have a clean solution for that yet. Mostly, my use case is a general communications library, one that doesn't know how many clients it'll have. Shared memory tends to be allocated 4 KiB at a time; 512 instances of my current per-client structure fit into 4 KiB, so my initial max-clients was 512. That's a lot to run through every time a message is consumed. The only better idea I have is a big rethink, i.e. to use the method described in "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" (available here, among other places), and to store the referred-to list nodes using ring-buffer memory, between the user-supplied data areas. Then there wouldn't be any need to allocate worker structures. I was imagining a variant of that algorithm with, in addition to Head and Tail pointers, PendingHead and PendingTail pointers. One upside is that you'd get multi-consumer for free. Next weekend, hopefully :-)
OK, a few ideas:
On the re-think: it depends on what problem you are trying to solve. There is a reason why this library is called a ring buffer (rather than a queue). :) If you just need a queue of pointers -- yes, there are better algorithms. One reason I implemented such a ringbuf is that you don't have to do malloc + memcpy + produce. It's a true buffer, so you can also send a stream of data without the need to encode the message lengths. That is, I can do: ringbuf_consume() (as much as it can) + write() + ringbuf_release(). This maximises the throughput (think of I/O operations).
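Concretely, the consumer side I have in mind looks roughly like this (setup and error handling elided; `rbuf`, `data` and `fd` are assumed to be set up elsewhere):

```c
/*
 * Sketch of the consume-as-much-as-possible + write() + release pattern.
 * `data` is the backing byte buffer that `rbuf` manages offsets into.
 */
#include <unistd.h>
#include "ringbuf.h"

static void
drain_to_fd(ringbuf_t *rbuf, const char *data, int fd)
{
	size_t off, len;

	/* Grab the largest contiguous ready-to-consume range. */
	while ((len = ringbuf_consume(rbuf, &off)) != 0) {
		size_t done = 0;

		while (done < len) {
			ssize_t n = write(fd, &data[off + done], len - done);
			if (n == -1)
				return;	/* real code: handle EINTR/EAGAIN */
			done += (size_t)n;
		}
		/* Mark the consumed bytes as free for producers to reuse. */
		ringbuf_release(rbuf, len);
	}
}
```

The point is that the consumer can push an arbitrarily large contiguous chunk to write() in one go, with no per-message framing.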
I wasn't talking about dynamically allocating nodes. Instead of using malloc(), I was talking about allocating the node-like structures from ring-buffer memory, just before the user-supplied data of each acquire/produce. The ring buffer becomes less of a stream, but it no longer requires worker structures, crashed producers don't require extra handling, and you get multi-consumer for free.
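To make that concrete, the per-record header I'm imagining would look something like the sketch below. This is purely speculative design, not anything in the library.

```c
/*
 * Speculative sketch: a node header carved out of ring-buffer memory,
 * placed immediately before the user data of each acquire/produce.
 * Every acquire reserves sizeof(node_hdr_t) + len bytes and links the
 * header into an MS-queue-style list, so no worker structures are needed.
 */
#include <stdatomic.h>
#include <stdint.h>

typedef struct node_hdr {
	_Atomic uint32_t	next;	/* buffer offset of the next node, or "none" */
	uint32_t		len;	/* length of the user data that follows */
	_Atomic uint32_t	ready;	/* set by the producer once the data is filled */
} node_hdr_t;

typedef struct {
	_Atomic uint32_t	head;		/* oldest consumable node */
	_Atomic uint32_t	tail;		/* newest produced node */
	_Atomic uint32_t	pending_tail;	/* newest acquired-but-unfilled node */
	/* ... space accounting ... */
	uint8_t			data[];		/* node headers + user data live here */
} nodebuf_t;
```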
BTW...in your copious spare time...can you take a look at my full_buffer branch & see if you spot why it's not working? In addition to wanting to fill the buffer, I have a need for ringbuf_acquire() to tell me if the buffer was empty, i.e. so that I can wake up a per-client message-receiving thread. My code is pretty simple, so I must be doing something really dumb. Thanks for any time you can spend on this.
I can change the worker-pointers to be an offset within the workers[] array. I can also change the allocation of workers so that it happens in ringbuf_acquire(), and the deallocation so that it happens in ringbuf_produce(). That was actually my original implementation, but I thought it was overkill, not thinking of the crashed-applications problem you pointed out. I will re-create that version and submit it here. And thanks for looking at the full_buffer branch. I know it's racy, but the value of "used" should only ever be too large, never too small, so it should be safe. The problem I have is that it ends up way too large, nearly twice the size of the buffer, and I haven't been able to figure out why. If I intend to write more lock-free code in the future, I really need to figure out why this is happening. I spent the weekend reading up on lock-free programming in general, to make sure I wasn't missing anything fundamental, so that I can produce higher-quality pull requests. So I have no new code to submit yet. Thanks for the time you've spent on this.
…re(). This not only decreases the amount of time that a worker-record remains on the used-stack (i.e. it helps make it more crash-resistant), but it also removes the need to register workers. It passes the tests, but not the stress-test.
The downside is that failed calls to ringbuf_acquire() consume a worker-record, and ringbuf_consume() must be called to free up worker-records on the used-workers stack that are no longer in use. I'm pretty sure it's not safe to try to remove used-workers from the stack except in ringbuf_consume(). This probably isn't a problem in practice, but I'm still not happy about it.
It helps, but stress-test performance is still far below that of the old version of the code, and it is also inconsistent.
Your latest diff is a conceptually nice design and a clean piece of code. Unfortunately, it is not efficient. Atomic operations are expensive. It depends a lot on the CPU (and how many of them you have!), but your latency might be 10-30x higher compared to a primitive instruction, and some measurements claim the cost to be in the range of 300 cycles, although that is not a good/meaningful measurement. Atomic operations can generate a lot of inter-CPU/core traffic, especially if you have cache-line contention, so it all "depends" and the overall impact can be greater (see [1] or similar papers if you want a more in-depth view). Your change means 3 atomic operations and now 2 contention points on the producer side. That is significantly more overhead than 1 atomic operation (and a contention point). So, let's keep it simple and efficient. :) [1] https://htor.inf.ethz.ch/publications/img/atomic-bench.pdf
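If you want a rough feel for the uncontended cost on your own machine, a crude micro-benchmark like the one below works. It only shows the single-core cost; the cross-core traffic under cache-line contention, which is the bigger problem here, needs a multi-threaded test.

```c
/* Crude single-threaded comparison: plain increment vs. atomic fetch_add. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define	ITERS	100000000UL

static uint64_t
now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int
main(void)
{
	static _Atomic unsigned long acnt;
	static volatile unsigned long cnt;
	uint64_t t0, t1, t2;

	t0 = now_ns();
	for (unsigned long i = 0; i < ITERS; i++)
		cnt = cnt + 1;			/* plain read-modify-write */
	t1 = now_ns();
	for (unsigned long i = 0; i < ITERS; i++)
		atomic_fetch_add(&acnt, 1);	/* locked read-modify-write */
	t2 = now_ns();

	printf("plain:  %.2f ns/op\n", (double)(t1 - t0) / ITERS);
	printf("atomic: %.2f ns/op\n", (double)(t2 - t1) / ITERS);
	return 0;
}
```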
Yes, I figured that out after the fact.
Now workers are tracked in used/free stacks, and ringbuf_consume() only iterates through in-use workers.
Previously, in-use workers were tracked with a "registered" flag in ringbuf_worker_t, and ringbuf_consume() iterated through all possible workers. This could be inefficient if there were far more potential workers than actual workers.
Presently, ringbuf_worker_t.next is a pointer. If the ABA Problem is considered to be important, it can be changed to an offset into ringbuf_t.workers[] and a counter.
Finally, the "unsigned i" parameter can be removed from ringbuf_register(); I haven't done that yet, as I didn't want to change the API. But it seems helpful not to force ring-buffer clients to coordinate their index numbers among each other. (My expected use of ringbuf involves communication between applications using ring buffers allocated from shared memory, so such coordination is even more difficult.)
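For illustration, with the current API the only way for unrelated clients to agree on indices is something like a shared slot counter; the `shm_hdr_t` and `claim_worker` names below are hypothetical, not part of the library.

```c
/*
 * Hypothetical illustration of the coordination burden described above:
 * cooperating processes hand out unique worker indices via an atomic
 * counter placed next to the ring buffer in the shared-memory segment.
 */
#include <stdatomic.h>
#include <stddef.h>
#include "ringbuf.h"

typedef struct {
	_Atomic unsigned	next_slot;	/* hypothetical slot allocator */
	/* the ringbuf_t follows in the same mapping */
} shm_hdr_t;

static ringbuf_worker_t *
claim_worker(shm_hdr_t *hdr, ringbuf_t *rbuf, unsigned nworkers)
{
	unsigned i = atomic_fetch_add(&hdr->next_slot, 1);

	if (i >= nworkers)
		return NULL;	/* out of slots; nothing here ever recycles them */
	return ringbuf_register(rbuf, i);
}
```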