Multithreading performance #328
Replies: 27 comments 100 replies
-
I'd like to see some thoughts from @pitrou on this, since IIRC he touched it last. :-)
-
Apart from the fact that I don't understand where the mentioned "priorities" come from, it seems to me that the only way to evaluate this proposal is for someone to write the code and gather metrics of interest.
-
@Sophist-UK Could you run Beazley's original experiment with Python 3.10? It's a very small amount of pure Python code. In his talk he claims a 7x slowdown when the server thread is competing with one CPU-bound thread. That should be simple to do. You can forget about the ZeroMQ version -- in fact any of the versions (e.g. Python + blocking sockets) would be fine.
-
@gvanrossum Guido, I am really not sure I have the skills. I have never done anything like this before and have no idea where to start. If it needs Python/Windows skills, then I would probably be OK; if it needs C/C++ skills then I am instantly out of my depth, with way too big a learning curve.

I will try to find the time in the next few days to research what David ran and try to run it again. I have an aging 4th-gen i7 laptop with 4 cores and 8 hyperthreads, so I probably have the hardware. I have Python 3.9 64-bit installed and can easily install 3.10 alongside.

However, what I am trying to solve is not necessarily what David was looking at. It is some years since I last worked on the multithreading app and reached out to David to understand the GIL issues, so I am not quite sure what my issues were then, and I have no idea whether they still exist several years later. But I do understand the sorts of issues that might arise with one or more CPU-bound threads, i.e. CPU threads not getting fair shares, and I/O threads getting no or infrequent CPU.

I probably have the Python skills to tweak existing test code to test for the things we are talking about (except hooking into functionality that would e.g. record the end timestamp of an I/O and the timestamp when the thread that issued that I/O next gets scheduled). But if I need to start tweaking C code and recompiling Python, I am going to be lost.
-
Could you reconstruct it from first principles given David's slides? He gave some pseudo-code; maybe you could turn that into working real code.
-
I have downloaded the UDP code and had a quick look - it looks good as a basis. I will need to tweak the code so that we can see the CPU-bound thread performance as a function of the number of threads. I would also like to check that there is no performance difference between running the UDP echo server in the foreground vs. the background. And I probably want to allow the client to be throttled or not, so as to see how the CPU-intensive workload is impacted by zero, low or high volumes of echoes, and to capture and plot a histogram of the response times as well as the total throughput. But none of this should be that difficult. I will post results here once I have them.
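For concreteness, here is a minimal sketch of the kind of harness being described - a UDP echo server plus a throttleable client that records per-request round-trip times for a histogram. It is not the actual benchmark code; the host, port, message size and throttle value are illustrative assumptions.

```python
# Minimal sketch (not the actual benchmark code): UDP echo server plus a
# throttleable client that records per-request round-trip times.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9999          # assumed values, not from the script

def echo_server(stop):
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((HOST, PORT))
    srv.settimeout(0.1)                 # so the loop can notice the stop flag
    while not stop.is_set():
        try:
            data, addr = srv.recvfrom(65536)
        except socket.timeout:
            continue
        srv.sendto(data, addr)
    srv.close()

def client(n=1000, throttle=0.001):
    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    msg = b"x" * 1024
    rtts = []
    for _ in range(n):
        t0 = time.perf_counter()
        cli.sendto(msg, (HOST, PORT))
        cli.recvfrom(65536)
        rtts.append(time.perf_counter() - t0)
        if throttle:
            time.sleep(throttle)        # optional throttling of echo traffic
    cli.close()
    return rtts

if __name__ == "__main__":
    stop = threading.Event()
    threading.Thread(target=echo_server, args=(stop,), daemon=True).start()
    time.sleep(0.2)                     # give the server a moment to bind
    rtts = client()
    stop.set()
    print(f"median RTT: {sorted(rtts)[len(rtts) // 2] * 1e6:.0f} us")
```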
-
That said, I am not sure that I have any results to compare these runs to. We will have to evaluate them on their own merits. But if I install various key versions of Python 3 and see what the differences are, that might give us some insights too.
-
How do you determine that a thread is I/O bound? Assuming that a thread which releases the GIL, rather than hitting the end of its interpreter-loop time slice, is doing so for I/O is unrealistic. C API users release the GIL all the time to enable CPU parallelism while running pure non-Python code, not just for potentially blocking I/O. The interpreter loop's time slices seem like an insufficient proxy for CPU-bound vs. I/O-bound. Example: a thread running ...

If this requires all C API users to update their code to use new explicit-purpose GIL-releasing APIs stating the reason (blocking vs. non-Python CPU) they are releasing it, adoption could be a challenge, and you may find that an "unknown" scheduling state, falling between or acting as both I/O- and CPU-bound, becomes necessary. I'm not suggesting more informative GIL-releasing APIs are necessarily a bad idea, just that they wouldn't magically win on existing code. https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock
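A small Python-level illustration of this point, assuming a machine with several free cores: hashing a large buffer is CPU-bound C code that releases the GIL, so it scales across threads just like "I/O-bound" code would, while an equivalent pure-Python loop does not. The buffer size, iteration counts and thread count are arbitrary choices.

```python
# Illustration: a thread can release the GIL for CPU-heavy C code (hashlib on
# large buffers), so "releases the GIL" does not imply "I/O bound".
import hashlib
import threading
import time

BUF = b"x" * (32 * 1024 * 1024)

def c_cpu_work():
    # hashlib releases the GIL while hashing large buffers: CPU-bound, yet it
    # looks exactly like a "GIL-releasing" (supposedly I/O-ish) thread.
    for _ in range(20):
        hashlib.sha256(BUF).digest()

def py_cpu_work():
    # Pure-Python CPU work holds the GIL except at switch-interval boundaries.
    x = 0
    for i in range(5_000_000):
        x += i * i

def timed(target, nthreads):
    threads = [threading.Thread(target=target) for _ in range(nthreads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    for work in (c_cpu_work, py_cpu_work):
        one, four = timed(work, 1), timed(work, 4)
        print(f"{work.__name__}: 1 thread {one:.2f}s, 4 threads {four:.2f}s")
```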
-
I am making good progress with adding some stuff to @pitrou's ccbench code, e.g. improved environmental reporting at the start of the run, the ability to set process affinity, and automatically setting a high O/S process priority so that the impact of other system activity is minimised. This script now has the basic functionality that I think is needed for some benchmarking, and I now need to think about:
A couple of key things you might want to try if you decide to check out the new script:
Any and all feedback on this new script and on the next steps is very welcome, as I do not yet have a clear vision of what I need to achieve next. I will post an example output in a subsequent comment in a few minutes.
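For reference, one way to pin a benchmark process and raise its priority from Python is via the third-party psutil package. This is a sketch only; the updated script may well do it differently (e.g. via os-level or ctypes calls), and the core numbers are illustrative.

```python
# Pin the current process to a set of cores and raise its priority so that
# background system activity interferes less with the benchmark.
import sys
import psutil

def isolate_process(cores):
    p = psutil.Process()
    p.cpu_affinity(list(cores))          # restrict the process to these cores
    if sys.platform == "win32":
        p.nice(psutil.HIGH_PRIORITY_CLASS)
    else:
        try:
            p.nice(-10)                  # needs sufficient privileges on Unix
        except psutil.AccessDenied:
            pass

if __name__ == "__main__":
    isolate_process(range(3))            # e.g. the first three cores (illustrative)
    print(psutil.Process().cpu_affinity())
```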
-
Example output from `ccbench_updated.py -a 3`:
-
I wouldn't be so sure that the pi calculation is Python. It looks as if it is Python, but once you have a few hundred digits it spends all its time in the long integer code, which is all C.
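To illustrate the point (this is not necessarily the exact ccbench workload), a Machin-formula pi computation written in "pure Python" spends essentially all of its time in CPython's C long-integer arithmetic once the precision reaches a few hundred digits, and that C code runs with the GIL held.

```python
# Illustrative only: pi via Machin's formula using only Python ints. The work
# is dominated by big-integer multiplication/division implemented in C.
def arctan_inv(x, unity):
    """arctan(1/x) scaled by `unity`, using only integer arithmetic."""
    total = term = unity // x
    x2 = x * x
    n, sign = 3, -1
    while term:
        term //= x2
        total += sign * (term // n)
        sign = -sign
        n += 2
    return total

def pi_digits(digits):
    unity = 10 ** (digits + 10)                      # extra guard digits
    pi = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return pi // 10 ** 10                            # drop the guard digits

if __name__ == "__main__":
    print(str(pi_digits(500))[:50])                  # 3141592653...
```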
-
OK, I have tweaked the code a little more (to give the standard deviation on throughput and to run the latency/bandwidth clients at high priority with full affinity) and rerun the example. I am enclosing the file and the example results here and will give my first-cut analysis in the next comment.

`ccbench_updated.py -a 3`
-
Here is my first cut analysis of the above example:
**Conclusions (personal view)**

So what conclusions might potentially be drawn from this about whether there is a need for a GIL scheduler?
Obviously, these are my personal observations and I am as capable as the next person of drawing false conclusions - so please chip in with your own observations and/or suggestions on how to proceed from here.
-
I have now added reporting of CPU usage, and it turns out that this is particularly insightful.

**Additional observations**

**Throughput**

For the pi/regex workloads we don't really learn anything new from the CPU information, but I believe that the results from the compression/hash tests tell us something quite important. As the thread count builds up to the number of cores, we see the CPU usage scaling proportionately - we are maxing out the CPUs with these GIL-releasing, CPU-heavy workloads - yet the throughput is not increasing. I don't think that this is some sort of bottleneck constraining execution, because if it were, the total CPU would not be proportional to the number of cores. So it seems that something is wasting a lot of CPU cycles, and one possible cause could be the competition-based approach to acquiring the GIL.

If that is the case, then massively increasing the switch interval from 0.005s to (say) 0.1s, with a corresponding reduction in the number of timeslice-based GIL switches, should reduce this overhead and produce a significant improvement in throughput. I tried this and the results are shown below. We definitely see an improvement for some GIL-releasing workloads, but not so much for others, and at this stage I do not have an explanation for the difference. For the hashing workloads, for example, 2 threads now give 100% more throughput rather than only 60% more, but the gain quickly tails off for 3 or more threads, and I have no theory about this either. (I did try a run with a switch interval of 0.5s and that was no better - but with a 2s run time, a switch interval of 1/4 of that may introduce other issues.)

If we look at the base results, the numbers are really very bad indeed. For the bz workload (which is the worst), the incremental figures as you add threads look like this:

1st thread: 100% throughput for 100% CPU (baseline)

If we combine the 5th to 16th threads into a single group, we get c. 56% more throughput for 400% more CPU. I think we can all agree that these are really not good figures and that they represent a massive opportunity for improving the performance of GIL-releasing C-based modules. IMO a good starting point would be to move away from a GIL handover approach based on competing for the GIL to one where the GIL is handed off explicitly to a chosen thread.

**Bandwidth**

The bandwidth results are also a bit odd, in that we see a tail-off in the additional CPU usage.

Edit: I now have a theory that the CPU-soaker threads are not getting sufficient access to the GIL to keep the non-GIL CPU load in a runnable state. As per some later discussions, which suggest the same thing from another direction, I intend (tomorrow?) to add some sleep statements to the client code in order to reduce the intensity of the latency and throughput traffic, and see whether that changes the measurements.

**Results with CPU information added**

`ccbench_updated.py -a 8`

**Throughput results with switch interval of 0.1 sec**

`ccbench_updated.py -a 8 -t -I 0.1`
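For anyone wanting to reproduce the switch-interval experiment, the underlying knob is exposed in the standard library; the `-I 0.1` option above is assumed to be the updated script's wrapper around this call.

```python
# Inspect and change CPython's thread switch interval at runtime.
import sys

print(sys.getswitchinterval())   # default is 0.005 seconds
sys.setswitchinterval(0.1)       # let each thread run for roughly 100 ms
print(sys.getswitchinterval())
```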
-
I have a question regarding David Beazley's original tests. He said it specifically wasn't a throughput or latency test. His client was (in pseudo-code) something like:

```python
import time

def client(n=10000):
    msg = b"xxxxxxxx" * 1024
    for i in range(n):
        rpc(msg)           # a blocking request/response round-trip
        time.sleep(0.001)
```

The server was a trivial echo server, with 0 or 1 pure-Python CPU-consuming threads. (I don't think he showed the CPU-consuming code, but we could assume it's as stupid as ....) Which of the benchmarks in your (or Antoine's) code corresponds to this?
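For reference, a runnable reconstruction sketch of that experiment (not David's original code) might look like the following: a trivial threaded echo server, an optional pure-Python CPU-bound thread competing for the GIL, and the client above timing its 10,000 round-trips. The port number and the CPU-bound loop are assumptions.

```python
# Reconstruction sketch of the echo-server-vs-CPU-thread experiment.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 25000

def echo_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    while True:
        data = conn.recv(65536)
        if not data:
            break
        conn.sendall(data)

def cpu_bound():
    # Deliberately dumb pure-Python CPU load competing for the GIL.
    while True:
        sum(range(10000))

def client(n=10000):
    sock = socket.create_connection((HOST, PORT))
    msg = b"xxxxxxxx" * 1024
    t0 = time.perf_counter()
    for _ in range(n):
        sock.sendall(msg)                 # the "rpc": send and read the echo back
        received = 0
        while received < len(msg):
            received += len(sock.recv(65536))
        time.sleep(0.001)
    print(f"{n} echoes in {time.perf_counter() - t0:.1f}s")

if __name__ == "__main__":
    threading.Thread(target=echo_server, daemon=True).start()
    time.sleep(0.2)
    threading.Thread(target=cpu_bound, daemon=True).start()  # comment out for the baseline run
    client()
```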
-
I would like to make a few general comments about some of the things in this discussion as they pertain to my past work:
-
For reference, for anyone whom it might interest, here's the discussion about fixing the GIL the first time: https://mail.python.org/archives/list/[email protected]/thread/B5MUWPLGDBWXTUSKZEJLPAIGYM32XMPB/
-
Updated bandwidth measurements with a delay of 0.01s.

**Analysis**

We can see that the CPU levels for GIL-releasing C code are almost where we would expect, at +700% for 8+ threads (rather than the c. 460% on previous runs, where there was no delay between sending a previous echo response and receiving the next one). This presumably allows the echo thread to give up the GIL so that the CPU-soak threads waiting for it can acquire it, get started on their next calculation and release the GIL again. I don't think this has much, if any, impact on the analysis of the shortcomings of the existing GIL - it just resolves the previously unexplained observation that the CPU-soakers were not soaking up all the available CPU as we expected them to.

**Results**

`ccbench_updated.py -a 8 -b`
-
I sometimes wonder if half the problem is the whole terminology of "priorities." It's like a magnet for complexity. Threads already have a concept of "daemonic." Maybe you could introduce a new concept like "holiness." When you create a new thread, make it "holy" by default--kind of like the "holy hand grenade" from Monty Python. If desired, a thread can declare itself "unholy" by calling some function like .... If anyone complains about further strangeness, you could just say "well, what were you expecting from a bunch of unholy threads?"
-
I am not sure that there is much more analysis I can do, in which case the time has come to decide whether this is an issue that needs addressing and, if so, what the next steps are...
-
Chiming in as a non-expert: I doubt I can contribute anything to the already heavily-researched field of schedulers, but I'm game to implement a different thread scheduler in CPython. I will probably use an algorithm that is already well received rather than conceiving a new one. In a few weeks' time I will have loads of free time to research and hack on the C parts of this. Maybe I can finally put my rudimentary knowledge of the dinosaur OS book to some use :).
-
Going back to David's original bpo, https://bugs.python.org/issue7946 (which you are also on): the most straightforward way I see, recommended by Gregory, is to update the BFS implementation for ...

Some of the arguments against the BFS implementation no longer apply, though, like ....

Reading through the bpo, I'm not sure why nobody accepted David's implementation. Was it just due to a lack of conclusive benchmarks? Maybe @dabeaz could shed some light here, please. His implementation seems least likely to break anything in our codebase, despite purportedly performing worse than the full BFS.

Disclaimer: my efforts may be fruitless, or it may take months. I don't see this being an easy task. I'm not at all an expert in the area of Python runtime state (unfortunately, I maintain other parts of CPython).
-
I implemented my "holy" thread idea. It wholly solves my original performance problem: https://github.com/dabeaz-fork/cpython/blob/main/HOLY.md
-
I think we have pretty much reached the end of this discussion. I have demonstrated that the current GIL continues to have some VERY poor performance characteristics, and I cannot express just how disappointed I am at the apparent lack of interest from the community in implementing a solution, despite the involvement in this thread of both @gvanrossum (BDFL of Python) and @pitrou (the author of the current GIL implementation). So, for a final time, I will ask again: is there anyone interested in attempting to address this issue?
-
According to a comment by reddit user skeeto (who may or may not be @skeeto here) in a thread about ...
-
Yup, that's me. I'm still surprised this significant GIL change went unannounced, and that 7 months after release it seems nobody has even noticed. I just checked again now, and the GIL still works this way on CPython main.
-
FYI, the Faster CPython team (my team at Microsoft) has its own dedicated host(s) where we run benchmarks. We post generally useful results in this repo (https://github.com/faster-cpython/ideas/tree/main/benchmark-results), but no fancy UI. Also see https://github.com/faster-cpython/ideas/wiki/All-About-Python-Benchmarking (and the related "tables" pages), which I added just yesterday and plan on filling in further. In fact, anyone can pitch in.
It is running the pyperformance default suite. See https://pyperformance.readthedocs.io/usage.html and https://pyperformance.readthedocs.io/benchmarks.html. Also note that pyperformance supports running external benchmarks. The Pyston benchmark suite has benchmarks like that: https://github.com/pyston/python-macrobenchmarks.
There are definitely gaps in the benchmark suite. You can open issues and submit PRs to pyperformance if you have ideas on benchmarks we could add.
Agreed! 😄 We'd love help in adding good benchmarks, if you have some time. It seems like you have valuable insight.
-
Some years ago a guy named Dave Beazley instrumented Python and undertook an analysis of multithreading, and of why multi-threaded apps had poor perceived performance as seen by users, particularly when there were one or more CPU-bound threads. In essence, Python released the GIL, all waiting threads became runnable, and whichever thread got in first would grab the GIL. This should in theory have been random rather than deterministic, but what Dave Beazley showed was that at the end of a time interval the currently executing thread was (obviously) already scheduled by the operating system, so it got in first and grabbed the GIL back - and this would happen multiple times in a row. In 2009 a change was made for Python 3.2 which implemented "FORCED_SWITCHING" (commit) to try to avoid the CPU being given back to the same thread again and again, but the alternative thread that got the GIL was still random/pseudo-random.
In 2010 David Beazley ran his analysis again against Python 3.2 and noted a different set of issues. He raised these in https://bugs.python.org/issue7946. Soon after it was opened, someone actually implemented a full Linux BFS scheduler as a test (which I personally think was way too complex); it wasn't beneficial, and there were (understandable) concerns about a scheduler needing to be platform agnostic. The subject was then not mentioned for about 10 years, despite the consensus being that a scheduler was probably what was needed, essentially to provide a deterministic solution; it was then mentioned in passing in the last year or so before the issue was closed.
I am opening this discussion to suggest a specific and (IMO) easily implemented algorithm for a GIL scheduler. Unfortunately I am a Python programmer (and computer science nerd) and not a C programmer, so whilst I have had real-life experience of the difficulties of delivering a Python app with a consistently snappy GUI whilst doing a whole load of stuff in background threads, creating this GIL enhancement - or even undertaking the analysis to confirm that it is still needed - is beyond my skills.
The difficulties I have experienced can be summarised as:
a. Inability to assign an operating system priority to a thread (even though many operating systems allow this).
b. An inability for I/O-bound and CPU-bound threads to be differentiated and handled differently.
c. The random nature of who gets the GIL next.
d. The fluctuating nature of who gets the GIL next depending on the GIL implementation in the version of Python the code is run against.
Here is my wish list for a GIL scheduler, some but not all of which were discussed in the above issue 7946:
So, to the details of my proposal:
Note: This is an entirely different approach to the current scheme, where threads compete for the GIL - the GIL scheduler would normally execute in the thread that is currently giving up the GIL, and it would choose a new thread to run and signal it to take the GIL. I anticipate that the CPU used for scheduling will in most cases be less than the CPU saved by not having all threads competing.
I should add that this type of simple scheduler is not a new one - it is pretty much a Priority / Foreground-Background / Round-Robin / Cooperative / Pre-Emptive scheduler (see Wikipedia) combining several well-understood algorithms that I first came across many decades ago in my mainframe days.
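To make the hand-off idea concrete, here is a pure-Python model (not CPython internals, and not a proposed implementation): the releasing thread selects the next runner by priority and wakes only that thread, instead of dropping the lock and letting every waiter compete. The priority numbering and round-robin tie-break are illustrative assumptions.

```python
# Pure-Python model of an explicit hand-off lock: release() chooses the
# successor and wakes only that thread, so waiters never race for the lock.
import heapq
import threading

class HandoffLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None
        self._waiters = []          # heap of (priority, arrival, thread, event)
        self._counter = 0           # arrival order gives a round-robin tie-break

    def acquire(self, priority=10):
        me = threading.current_thread()
        with self._cond:
            if self._owner is None:
                self._owner = me
                return
            event = threading.Event()
            self._counter += 1
            heapq.heappush(self._waiters, (priority, self._counter, me, event))
        event.wait()                # woken only when explicitly handed the lock

    def release(self):
        with self._cond:
            if not self._waiters:
                self._owner = None
                return
            # The releasing thread *chooses* the successor: lowest priority
            # number first, oldest waiter first within the same priority.
            _, _, successor, event = heapq.heappop(self._waiters)
            self._owner = successor  # ownership is transferred, not contended
            event.set()              # wake only the chosen thread
```

The point of the model is purely structural: the lock is never dropped into a free-for-all; the departing thread decides who runs next, which is exactly where a priority or I/O-vs-CPU policy could be plugged in.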
In general terms this seems like a very simple thing to implement, but of course reality is rarely quite that easy.
And of course, I might not be up to date with the current state of the GIL, or I might be off the mark here in other ways, but I want to put this forward and see if people think it is a good idea and whether there is any interest in implementing it.
Thanks.