Io thread work offload #763
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           unstable     #763    +/-   ##
============================================
- Coverage     70.17%   70.16%   -0.02%
============================================
  Files           112      112
  Lines         60536    61263     +727
============================================
+ Hits          42483    42986     +503
- Misses        18053    18277     +224
```
Force-pushed from 9b06940 to a47dbee
Looks good to me in general.
Free offload adds some complexity. It would be good to know how important that one is for performance.
It seems this PR could have been three separate PRs:
- poll offload
- command lookup offload
- free offload
Force-pushed from e918ea3 to 5d11ff8
Force-pushed from 5d11ff8 to a941909
Force-pushed from e36db8e to 8b14ebe
Overall looks fine. Added some small comments/suggestions.
Force-pushed from 8b14ebe to 61d0d18
Thanks for moving the mutex to AE. I agree it's better.
I have no other comments on top of what Viktor said. I'll kick off some more tests tomorrow, and hopefully we can merge it soon!
Signed-off-by: Uri Yagelnik <[email protected]>
Force-pushed from 61d0d18 to 2753365
This failed twice; it seems like it might be an issue. Going to kick off wider tests now though.
All tests: https://github.com/valkey-io/valkey/actions/runs/9985021048
We expect the second one to have some failures.
Signed-off-by: Uri Yagelnik <[email protected]>
Force-pushed from 2753365 to 570e7cb
There were 4 errors in the tests:
Pending tests to make sure issues are fixed: https://github.com/valkey-io/valkey/actions/runs/9998114971
Tests were all green, even though we have some existing failures; going to merge this.
### IO-Threads Work Offloading
This PR is the 2nd of 3 PRs intended to achieve the goal of 1M requests per second.
(1st PR: #758)
This PR offloads additional work to the I/O threads, beyond the current read-parse/write operations, to better utilize the I/O threads and reduce the load on the main thread.
It contains the following 3 commits:
### Poll Offload
Currently, the main thread is responsible for executing the poll-wait system call, while the IO threads wait for tasks from the main thread. The poll-wait operation is expensive and can consume up to 30% of the main thread's time. We could have let the IO threads do the poll-wait by themselves, with each thread listening to some of the clients and notifying the main thread when a client's command is ready to execute.
However, the current approach, where the main thread listens for events from the network, has several benefits. The main thread remains in charge, allowing it to know the state of each client (idle/read/write/close) at any given time. Additionally, it makes the threads flexible, enabling us to drain an IO thread's job queue and stop a thread when the load is light without modifying the event loop and moving its clients to a different IO thread. Furthermore, with this approach, the IO threads don't need to wait for both messages from the network and from the main thread; instead, the threads wait only for tasks from the main thread.
To enjoy the benefits of both the main thread remaining in charge and the poll being offloaded, we propose offloading the poll-wait as a single-time, non-blocking job to one of the IO threads. The IO thread will perform a non-blocking poll-wait call while the main thread processes the client commands. Later, in `aeProcessEvents`, instead of sleeping on the poll, we check for the IO thread's poll-wait results. The poll-wait will be offloaded in `beforeSleep` only when there are ready events for the main thread to process. If no events are pending, the main thread will revert to the current behavior and sleep on the poll by itself.

**Implementation Details**
A new callback, `custompoll`, was added to the `aeEventLoop`. When it is not `NULL`, ae will call the `custompoll` callback instead of `aeApiPoll`. When the poll is offloaded, we set `custompoll` to `getIOThreadPollResults` and send a poll job to the thread. The thread takes a mutex and makes a non-blocking call (with timeout 0) to `aePoll`, which populates the fired-events array. The IO thread sets `server.io_fired_events` to the returned `numevents`; later, the main thread's `custompoll` call returns `server.io_fired_events` and sets `custompoll` back to `NULL`.

To ensure thread safety when accessing `server.el`, all functions that modify the event loop's events were wrapped with a mutex to ensure mutual exclusion when modifying the events.
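A minimal sketch of this handshake, assuming stand-in definitions (the struct fields, the trivial poll body, and the synchronous `pthread_join` below are illustrative simplifications, not valkey's actual `ae.c` code):

```c
/* Sketch of the custompoll handshake described above. The names
 * custompoll, getIOThreadPollResults and io_fired_events follow the
 * PR text; the structs and the trivial "poll" are stand-ins. */
#include <pthread.h>
#include <stdio.h>

typedef struct aeEventLoop aeEventLoop;
typedef int (*aeCustomPollProc)(aeEventLoop *el);

struct aeEventLoop {
    aeCustomPollProc custompoll; /* when non-NULL, used instead of aeApiPoll */
    pthread_mutex_t poll_mutex;  /* guards event-loop mutation vs. offloaded poll */
};

static int io_fired_events; /* stand-in for server.io_fired_events */

/* Runs on an IO thread: a non-blocking poll taken under the mutex. */
static void *ioThreadPollJob(void *arg) {
    aeEventLoop *el = arg;
    pthread_mutex_lock(&el->poll_mutex);
    io_fired_events = 0; /* stand-in for aePoll(el, 0), i.e. timeout 0 */
    pthread_mutex_unlock(&el->poll_mutex);
    return NULL;
}

/* Installed as el->custompoll while a poll job is in flight: the main
 * thread consumes the IO thread's result and uninstalls itself. */
static int getIOThreadPollResults(aeEventLoop *el) {
    int numevents = io_fired_events;
    el->custompoll = NULL; /* revert to the normal aeApiPoll path */
    return numevents;
}

int main(void) {
    aeEventLoop el = {0};
    pthread_mutex_init(&el.poll_mutex, NULL);

    /* beforeSleep: offload the poll, then keep doing main-thread work. */
    el.custompoll = getIOThreadPollResults;
    pthread_t io;
    pthread_create(&io, NULL, ioThreadPollJob, &el);
    pthread_join(&io, NULL); /* real code overlaps this with command processing */

    /* aeProcessEvents: consult custompoll instead of sleeping on the poll. */
    int numevents = el.custompoll ? el.custompoll(&el) : 0 /* aeApiPoll(...) */;
    printf("fired events: %d\n", numevents);
    return 0;
}
```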
### Command Lookup Offload
As the IO thread parses the command from the client's `querybuf`, it can perform a command lookup in the commands dictionary, which can consume up to ~5% of the main-thread runtime.

**Implementation details**
The IO thread will store the looked-up command in the client's new `io_parsed_cmd` field. We can't use `c->cmd` for that, since we use `c->cmd` to check whether a command was reprocessed or not.

To ensure thread safety when accessing the command dictionary, we make sure the main thread isn't changing the dictionary while IO threads are accessing it. This is accomplished by introducing a new flag called `no_incremental_rehash` for the commands `dictType`. When performing `dictResize`, we will rehash the entire dictionary in place rather than deferring the process.
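A minimal sketch of the idea, assuming simplified stand-ins for the real structures (the `serverCommand` array and linear lookup below are illustrative; only `io_parsed_cmd` and the split between IO-thread parsing and main-thread execution come from this PR):

```c
/* Sketch of the command-lookup offload. The client struct and command
 * table are simplified stand-ins; io_parsed_cmd is the field the PR
 * adds, while the rest is illustrative. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct serverCommand { const char *name; } serverCommand;

typedef struct client {
    serverCommand *cmd;           /* set by the main thread; also used to
                                   * detect reprocessed commands */
    serverCommand *io_parsed_cmd; /* filled in by the IO thread at parse time */
} client;

static serverCommand commands[] = {{"get"}, {"set"}, {"del"}};

/* Safe for IO threads to call concurrently only because the
 * no_incremental_rehash flag keeps the table out of a mid-rehash state. */
static serverCommand *lookupCommand(const char *name) {
    for (size_t i = 0; i < sizeof(commands) / sizeof(commands[0]); i++)
        if (strcmp(commands[i].name, name) == 0) return &commands[i];
    return NULL;
}

/* IO thread: resolve the command while parsing the query buffer. */
static void ioThreadParse(client *c, const char *argv0) {
    c->io_parsed_cmd = lookupCommand(argv0);
}

/* Main thread: reuse the IO thread's lookup instead of repeating it. */
static void processCommand(client *c, const char *argv0) {
    c->cmd = c->io_parsed_cmd ? c->io_parsed_cmd : lookupCommand(argv0);
    if (c->cmd) printf("executing %s\n", c->cmd->name);
}

int main(void) {
    client c = {0};
    ioThreadParse(&c, "get");  /* would run on an IO thread */
    processCommand(&c, "get"); /* runs on the main thread */
    return 0;
}
```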
### Free Offload

Since the command arguments are allocated by the I/O thread, it would be beneficial if they were also freed by the same thread. If the main thread frees objects allocated by the I/O thread, two issues arise:

1. During the freeing process, the main thread needs to access the SDS pointed to by the object to get its length.
2. With Jemalloc, each thread manages a thread-local pool (`tcache`) of buffers for quick reallocation without accessing the arena. If the main thread constantly frees objects allocated by other threads, those threads will have to frequently access the shared arena to obtain new memory allocations.

**Implementation Details**
When freeing the client's argv, we will send the argv array to the thread that allocated it. The thread will be identified by the client ID. When freeing an object during `dbOverwrite`, we will offload the object free as well. We will extend this to offload the free during `dbDelete` in a future PR, as its effects on defrag/memory evictions need to be studied.
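A minimal sketch of the routing, assuming a hypothetical per-thread job queue (`io_thread_queue`, `threadForClient`, and `offloadFree` are illustrative names, not valkey's; only the client-ID-based routing and the tcache rationale come from this PR):

```c
/* Sketch of routing a free back to the allocating IO thread. The job
 * queue here is a minimal single-threaded stand-in for valkey's real
 * IO-thread queues. */
#include <stdio.h>
#include <stdlib.h>

#define IO_THREADS_NUM 4

typedef struct freeJob { void *ptr; struct freeJob *next; } freeJob;
static freeJob *io_thread_queue[IO_THREADS_NUM]; /* one queue per IO thread */

/* Same client -> thread mapping used when the argv was allocated, so
 * the buffer returns to the tcache of the thread that created it. */
static int threadForClient(unsigned long long client_id) {
    return (int)(client_id % IO_THREADS_NUM);
}

/* Main thread: instead of freeing in place, enqueue the pointer for
 * the owning IO thread. */
static void offloadFree(unsigned long long client_id, void *ptr) {
    freeJob *job = malloc(sizeof(*job));
    job->ptr = ptr;
    int tid = threadForClient(client_id);
    job->next = io_thread_queue[tid]; /* real code: a synchronized push */
    io_thread_queue[tid] = job;
}

/* Run on the IO thread: drain its queue and free with its own tcache. */
static void drainFreeQueue(int tid) {
    while (io_thread_queue[tid]) {
        freeJob *job = io_thread_queue[tid];
        io_thread_queue[tid] = job->next;
        free(job->ptr); /* real code frees the argv array / object here */
        free(job);
    }
}

int main(void) {
    void *argv = malloc(128);            /* pretend an IO thread allocated this */
    offloadFree(42, argv);               /* main thread defers the free */
    drainFreeQueue(threadForClient(42)); /* would run on that IO thread */
    printf("freed on thread %d\n", threadForClient(42));
    return 0;
}
```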