Multithreading performance #328
Replies: 27 comments 100 replies
-
I'd like to see some thoughts from @pitrou on this, since IIRC he touched it last. :-)
-
Apart from the fact that I don't understand where the mentioned "priorities" come from, it seems to me that the only way to evaluate this proposal is for someone to write the code and gather metrics of interest.
-
@Sophist-UK Could you run Beazley's original experiment with Python 3.10? It's a very small amount of pure Python code. In his talk he claims a 7x slowdown when the server thread is competing with one CPU-bound thread. That should be simple to do. You can forget about the ZeroMQ version -- in fact any of the versions (e.g. Python + blocking sockets) would be fine.
-
@gvanrossum Guido, I am really not sure I have the skills. I have never done anything like this before and have no idea where to start. If it needs Python/Windows skills, then I would probably be OK; if it needs C/C++ skills then I am instantly out of my depth, with way too big a learning curve.

I will try to find the time in the next few days to research what David ran and try to run it again. I have an aging 4th-gen i7 laptop with 4 cores and 8 hyperthreads, so I probably have the hardware. I have Python 3.9 64-bit installed and can easily install 3.10 alongside.

However, what I am trying to solve is not necessarily what David was looking at. It is some years since I last worked on the multithreading app and reached out to David to understand the GIL issues, so I am not quite sure what my issues were then, and I have no idea whether they still exist several years later. But I do understand the sorts of issues that might arise with one or more CPU-bound threads, i.e. CPU threads not getting fair shares, and I/O threads getting no or infrequent CPU.

I probably have the Python skills to tweak existing test code to test for the things we are talking about (except hooking into functionality that would e.g. record the end timestamp of an I/O and the timestamp when the thread that issued that I/O next gets scheduled). But if I need to start tweaking C code and recompiling Python, I am going to be lost.
-
Could you reconstruct it from first principles given David's slides? He gave some pseudo-code; maybe you could turn that into working real code.
-
I have downloaded the UDP code and had a quick look - it looks good as a basis. I will need to tweak the code so that we can see the CPU-bound thread performance as a function of the number of threads. I would also like to check that there is no performance difference between running the UDP echo server in the foreground vs. the background. And I probably want to allow the client to be throttled or not, so as to see how the CPU-intensive workload is impacted by zero, low or high volumes of echoes, and to capture and plot a histogram of the response times as well as the total throughput. But none of this should be that difficult. I will post results here once I have them.
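For concreteness, here is a minimal sketch of the kind of harness being described - a UDP echo server plus a throttleable client that records per-request round-trip times for a histogram. It is not the actual benchmark code; the host, port, message size and throttle value are illustrative assumptions.

```python
# Minimal sketch (not the actual benchmark code): UDP echo server plus a
# throttleable client that records per-request round-trip times.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9999          # assumed values, not from the script

def echo_server(stop):
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((HOST, PORT))
    srv.settimeout(0.1)                 # so the loop can notice the stop flag
    while not stop.is_set():
        try:
            data, addr = srv.recvfrom(65536)
        except socket.timeout:
            continue
        srv.sendto(data, addr)
    srv.close()

def client(n=1000, throttle=0.001):
    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    msg = b"x" * 1024
    rtts = []
    for _ in range(n):
        t0 = time.perf_counter()
        cli.sendto(msg, (HOST, PORT))
        cli.recvfrom(65536)
        rtts.append(time.perf_counter() - t0)
        if throttle:
            time.sleep(throttle)        # optional throttling of echo traffic
    cli.close()
    return rtts

if __name__ == "__main__":
    stop = threading.Event()
    threading.Thread(target=echo_server, args=(stop,), daemon=True).start()
    time.sleep(0.2)                     # give the server a moment to bind
    rtts = client()
    stop.set()
    print(f"median RTT: {sorted(rtts)[len(rtts) // 2] * 1e6:.0f} us")
```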
-
That said, I am not sure that I have any results to compare these runs to. We will have to evaluate them on their own merits. But if I install various key versions of Python 3 and see what the differences are, that might give us some insights too.
-
How do you determine that a thread is I/O bound? Assuming that a thread which releases the GIL, rather than hitting the end of its interpreter-loop time slice, is doing so for I/O is unrealistic. C API users release the GIL all the time to enable CPU parallelism while running pure non-Python code, not just for potentially blocking I/O. The interpreter loop's time slices seem like an insufficient proxy for CPU-bound vs. I/O-bound. Example: a thread running ...

If this requires all C API users to update their code to use new explicit-purpose GIL-releasing APIs stating the reason (blocking vs. non-Python CPU) they are releasing it, adoption could be a challenge, and you may find that an "unknown" scheduling state, falling between or acting as both I/O- and CPU-bound, becomes necessary. I'm not suggesting more informative GIL-releasing APIs are necessarily a bad idea, just that they wouldn't magically win on existing code. https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock
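A small Python-level illustration of this point, assuming a machine with several free cores: hashing a large buffer is CPU-bound C code that releases the GIL, so it scales across threads just like "I/O-bound" code would, while an equivalent pure-Python loop does not. The buffer size, iteration counts and thread count are arbitrary choices.

```python
# Illustration: a thread can release the GIL for CPU-heavy C code (hashlib on
# large buffers), so "releases the GIL" does not imply "I/O bound".
import hashlib
import threading
import time

BUF = b"x" * (32 * 1024 * 1024)

def c_cpu_work():
    # hashlib releases the GIL while hashing large buffers: CPU-bound, yet it
    # looks exactly like a "GIL-releasing" (supposedly I/O-ish) thread.
    for _ in range(20):
        hashlib.sha256(BUF).digest()

def py_cpu_work():
    # Pure-Python CPU work holds the GIL except at switch-interval boundaries.
    x = 0
    for i in range(5_000_000):
        x += i * i

def timed(target, nthreads):
    threads = [threading.Thread(target=target) for _ in range(nthreads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    for work in (c_cpu_work, py_cpu_work):
        one, four = timed(work, 1), timed(work, 4)
        print(f"{work.__name__}: 1 thread {one:.2f}s, 4 threads {four:.2f}s")
```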
-
I am making good progress with adding some stuff to @pitrou's ccbench code, e.g. improved environmental reporting at the start of the run, the ability to set process affinity, and automatically setting a high O/S process priority so that the impact of other system activity is minimised. This script now has the basic functionality that I think is needed for some benchmarking, and I now need to think about:
A couple of key things you might want to try if you decide to check out the new script:
Any and all feedback on this new script and on the next steps is very welcome, as I do not yet have a clear vision of what I need to achieve next. I will post an example output in a subsequent comment in a few minutes.
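For reference, one way to pin a benchmark process and raise its priority from Python is via the third-party psutil package. This is a sketch only; the updated script may well do it differently (e.g. via os-level or ctypes calls), and the core numbers are illustrative.

```python
# Pin the current process to a set of cores and raise its priority so that
# background system activity interferes less with the benchmark.
import sys
import psutil

def isolate_process(cores):
    p = psutil.Process()
    p.cpu_affinity(list(cores))          # restrict the process to these cores
    if sys.platform == "win32":
        p.nice(psutil.HIGH_PRIORITY_CLASS)
    else:
        try:
            p.nice(-10)                  # needs sufficient privileges on Unix
        except psutil.AccessDenied:
            pass

if __name__ == "__main__":
    isolate_process(range(3))            # e.g. the first three cores (illustrative)
    print(psutil.Process().cpu_affinity())
```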
-
Example output from `ccbench_updated.py -a 3`:
-
I wouldn't be so sure that the pi calculation is Python. It looks as if it is Python, but once you have a few hundred digits it spends all its time in the long integer code, which is all C.
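To illustrate the point (this is not necessarily the exact ccbench workload), a Machin-formula pi computation written in "pure Python" spends essentially all of its time in CPython's C long-integer arithmetic once the precision reaches a few hundred digits, and that C code runs with the GIL held.

```python
# Illustrative only: pi via Machin's formula using only Python ints. The work
# is dominated by big-integer multiplication/division implemented in C.
def arctan_inv(x, unity):
    """arctan(1/x) scaled by `unity`, using only integer arithmetic."""
    total = term = unity // x
    x2 = x * x
    n, sign = 3, -1
    while term:
        term //= x2
        total += sign * (term // n)
        sign = -sign
        n += 2
    return total

def pi_digits(digits):
    unity = 10 ** (digits + 10)                      # extra guard digits
    pi = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return pi // 10 ** 10                            # drop the guard digits

if __name__ == "__main__":
    print(str(pi_digits(500))[:50])                  # 3141592653...
```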
-
OK, I have tweaked the code a little more (to give the standard deviation on throughput and to run the latency/bandwidth clients at high priority with full affinity) and rerun the example. I am enclosing the file and the example results here and will give my first-cut analysis in the next comment.

`ccbench_updated.py -a 3`
-
Here is my first cut analysis of the above example:
**Conclusions (personal view)**

So what conclusions might potentially be drawn from this about whether there is a need for a GIL scheduler?
Obviously, these are my personal observations and I am as capable as the next person of drawing false conclusions - so please chip in with your own observations and/or suggestions on how to proceed from here.
-
I have now added reporting of CPU usage, and it turns out that this is particularly insightful.

**Additional observations**

**Throughput**

For the pi/regex workloads we don't really learn anything new from the CPU information, but I believe that the results from the compression/hash tests tell us something quite important. As the thread count builds up to the number of cores, we see the CPU usage scaling proportionately - we are maxing out the CPUs with these GIL-releasing, CPU-heavy workloads - yet the throughput is not increasing. I don't think that this is some sort of bottleneck constraining execution, because if it were, the total CPU would not be proportional to the number of cores. So it seems that something is wasting a lot of CPU cycles, and one possible cause could be the competition-based approach to acquiring the GIL.

If that is the case, then massively increasing the switch interval from 0.005s to (say) 0.1s, with a corresponding reduction in the number of timeslice-based GIL switches, should reduce this overhead and produce a significant improvement in throughput. I tried this and the results are shown below. We definitely see an improvement for some GIL-releasing workloads, but not so much for others, and at this stage I do not have an explanation for the difference. For the hashing workloads, for example, 2 threads now give 100% more throughput rather than only 60% more, but the gain quickly tails off for 3 or more threads, and I have no theory about this either. (I did try a run with a switch interval of 0.5s and that was no better - but with a 2s run time, a switch interval of 1/4 of that may introduce other issues.)

If we look at the base results, the numbers are really very bad indeed. For the bz workload (which is the worst), the incremental figures as you add threads look like this:

1st thread: 100% throughput for 100% CPU (baseline)

If we combine the 5th to 16th threads into a single group, we get c. 56% more throughput for 400% more CPU. I think we can all agree that these are really not good figures and that they represent a massive opportunity for improving the performance of GIL-releasing C-based modules. IMO a good starting point would be to move away from a GIL handover approach based on competing for the GIL to one where the GIL is handed off explicitly to a chosen thread.

**Bandwidth**

The bandwidth results are also a bit odd, in that we see a tail-off in the additional CPU usage.

Edit: I now have a theory that the CPU-soaker threads are not getting sufficient access to the GIL to keep the non-GIL CPU load in a runnable state. As per some later discussions, which suggest the same thing from another direction, I intend (tomorrow?) to add some sleep statements to the client code in order to reduce the intensity of the latency and throughput traffic, and see whether that changes the measurements.

**Results with CPU information added**

`ccbench_updated.py -a 8`

**Throughput results with switch interval of 0.1 sec**

`ccbench_updated.py -a 8 -t -I 0.1`
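For anyone wanting to reproduce the switch-interval experiment, the underlying knob is exposed in the standard library; the `-I 0.1` option above is assumed to be the updated script's wrapper around this call.

```python
# Inspect and change CPython's thread switch interval at runtime.
import sys

print(sys.getswitchinterval())   # default is 0.005 seconds
sys.setswitchinterval(0.1)       # let each thread run for roughly 100 ms
print(sys.getswitchinterval())
```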
-
I have a question regarding David Beazley's original tests. He said it specifically wasn't a throughput or latency test. His client was (in pseudo-code) something like:

```python
import time

def client(n=10000):
    msg = b"xxxxxxxx" * 1024
    for i in range(n):
        rpc(msg)           # a blocking request/response round-trip
        time.sleep(0.001)
```

The server was a trivial echo server, with 0 or 1 pure-Python CPU-consuming threads. (I don't think he showed the CPU-consuming code, but we could assume it's as stupid as ....) Which of the benchmarks in your (or Antoine's) code corresponds to this?
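For reference, a runnable reconstruction sketch of that experiment (not David's original code) might look like the following: a trivial threaded echo server, an optional pure-Python CPU-bound thread competing for the GIL, and the client above timing its 10,000 round-trips. The port number and the CPU-bound loop are assumptions.

```python
# Reconstruction sketch of the echo-server-vs-CPU-thread experiment.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 25000

def echo_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    while True:
        data = conn.recv(65536)
        if not data:
            break
        conn.sendall(data)

def cpu_bound():
    # Deliberately dumb pure-Python CPU load competing for the GIL.
    while True:
        sum(range(10000))

def client(n=10000):
    sock = socket.create_connection((HOST, PORT))
    msg = b"xxxxxxxx" * 1024
    t0 = time.perf_counter()
    for _ in range(n):
        sock.sendall(msg)                 # the "rpc": send and read the echo back
        received = 0
        while received < len(msg):
            received += len(sock.recv(65536))
        time.sleep(0.001)
    print(f"{n} echoes in {time.perf_counter() - t0:.1f}s")

if __name__ == "__main__":
    threading.Thread(target=echo_server, daemon=True).start()
    time.sleep(0.2)
    threading.Thread(target=cpu_bound, daemon=True).start()  # comment out for the baseline run
    client()
```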
-
I would like to make a few general comments about some of the things in this discussion as they pertain to my past work:
-
For reference, for anyone whom it might interest, here's the discussion about fixing the GIL the first time: https://mail.python.org/archives/list/[email protected]/thread/B5MUWPLGDBWXTUSKZEJLPAIGYM32XMPB/
-
Updated bandwidth measurements with a delay of 0.01s.

**Analysis**

We can see that the CPU levels for GIL-releasing C code are almost where we would expect, at +700% for 8+ threads (rather than the c. 460% on previous runs, where there was no delay between sending a previous echo response and receiving the next one). This presumably allows the echo thread to give up the GIL so that the CPU-soak threads waiting for it can acquire it, get started on their next calculation and release the GIL again. I don't think this has much, if any, impact on the analysis of the shortcomings of the existing GIL - it just resolves the previously unexplained observation that the CPU-soakers were not soaking up all the available CPU as we expected them to.

**Results**

`ccbench_updated.py -a 8 -b`
-
I sometimes wonder if half the problem is the whole terminology of "priorities." It's like a magnet for complexity. Threads already have a concept of "daemonic." Maybe you could introduce a new concept like "holiness." When you create a new thread, make it "holy" by default--kind of like the "holy hand grenade" from Monty Python. If desired, a thread can declare itself "unholy" by calling some function like .... If anyone complains about further strangeness, you could just say "well, what were you expecting from a bunch of unholy threads?"
-
I am not sure that there is much more analysis I can do, in which case the time has come to decide whether this is an issue that needs addressing and, if so, what the next steps are...
-
Chiming in as a non-expert: I doubt I can contribute anything to the already heavily-researched field of schedulers, but I'm game to implement a different thread scheduler in CPython. I will probably use an algorithm that is already well received rather than conceiving a new one. In a few weeks' time I will have loads of free time to research and hack on the C parts of this. Maybe I can finally put my rudimentary knowledge of the dinosaur OS book to some use :).
-
Going back to David's original bpo, https://bugs.python.org/issue7946 (which you are also on): the most straightforward way I see, recommended by Gregory, is to update the BFS implementation for ...

Some of the arguments against the BFS implementation no longer apply, though, like ....

Reading through the bpo, I'm not sure why nobody accepted David's implementation. Was it just due to a lack of conclusive benchmarks? Maybe @dabeaz could shed some light here, please. His implementation seems least likely to break anything in our codebase, despite purportedly performing worse than the full BFS.

Disclaimer: my efforts may be fruitless, or it may take months. I don't see this being an easy task. I'm not at all an expert in the area of Python runtime state (unfortunately, I maintain other parts of CPython).
-
I implemented my "holy" thread idea. It wholly solves my original performance problem: https://github.com/dabeaz-fork/cpython/blob/main/HOLY.md
-
I think we have pretty much reached the end of this discussion. I have demonstrated that the current GIL continues to have some VERY poor performance characteristics, and I cannot express just how disappointed I am at the apparent lack of interest from the community in implementing a solution, despite the involvement in this thread of both @gvanrossum (BDFL of Python) and @pitrou (the author of the current GIL implementation). So, for a final time, I will ask again: is there anyone interested in attempting to address this issue?
-
According to a comment by reddit user skeeto (who may or may not be @skeeto here) in a thread about ...
-
Yup, that's me. I'm still surprised this significant GIL change went unannounced, and that 7 months after release it seems nobody has even noticed. I just checked again now, and the GIL still works this way on CPython main.
-
FYI, the Faster CPython team (my team at Microsoft) has its own dedicated host(s) where we run benchmarks. We post generally useful results in this repo (https://github.com/faster-cpython/ideas/tree/main/benchmark-results), but no fancy UI. Also see https://github.com/faster-cpython/ideas/wiki/All-About-Python-Benchmarking (and the related "tables" pages), which I added just yesterday and plan on filling in further. In fact, anyone can pitch in.
It is running the pyperformance default suite. See https://pyperformance.readthedocs.io/usage.html and https://pyperformance.readthedocs.io/benchmarks.html. Also note that pyperformance supports running external benchmarks. The Pyston benchmark suite has benchmarks like that: https://github.com/pyston/python-macrobenchmarks.
There are definitely gaps in the benchmark suite. You can open issues and submit PRs to pyperformance if you have ideas on benchmarks we could add.
Agreed! 😄 We'd love help in adding good benchmarks, if you have some time. It seems like you have valuable insight.
-
Some years ago a guy named Dave Beazley instrumented Python and undertook an analysis of multithreading, and of why multi-threaded apps had poor perceived performance as seen by users, particularly when there were one or more CPU-bound threads. In essence, Python released the GIL, all waiting threads became runnable, and whichever thread got in first would grab the GIL. This should in theory have been random rather than deterministic, but what Dave Beazley showed was that at the end of a time interval the currently executing thread was (obviously) already scheduled by the operating system, so it got in first and grabbed the GIL back - and this would happen multiple times in a row. In 2009 a change was made for Python 3.2 which implemented "FORCED_SWITCHING" (commit) to try to avoid the CPU being given back to the same thread again and again, but the alternative thread that got the GIL was still random/pseudo-random.
In 2010 David Beazley ran his analysis again against Python 3.2 and noted a different set of issues. He raised these in https://bugs.python.org/issue7946. Soon after it was opened, someone actually implemented a full Linux BFS scheduler as a test (which I personally think was way too complex); it wasn't beneficial, and there were (understandable) concerns about a scheduler needing to be platform agnostic. The subject was then not mentioned for about 10 years, despite the consensus being that a scheduler was probably what was needed, essentially to provide a deterministic solution; it was then mentioned in passing in the last year or so before the issue was closed.
I am opening this discussion to suggest a specific and (IMO) easily implemented algorithm for a GIL scheduler. Unfortunately I am a Python programmer (and computer science nerd) and not a C programmer, so whilst I have had real-life experience of the difficulties of delivering a Python app with a consistently snappy GUI whilst doing a whole load of stuff in background threads, creating this GIL enhancement - or even undertaking the analysis to confirm that it is still needed - is beyond my skills.
The difficulties I have experienced can be summarised as:
a. Inability to assign an operating system priority to a thread (even though many operating systems allow this).
b. An inability for I/O-bound and CPU-bound threads to be differentiated and handled differently.
c. The random nature of who gets the GIL next.
d. The fluctuating nature of who gets the GIL next depending on the GIL implementation in the version of Python the code is run against.
Here is my wish list for a GIL scheduler, some but not all of which were discussed in the above issue 7946:
So, to the details of my proposal:
Note: This is an entirely different approach to the current scheme, where threads compete for the GIL - the GIL scheduler would normally execute in the thread that is currently giving up the GIL, and it would choose a new thread to run and signal it to take the GIL. I anticipate that the CPU used for scheduling will in most cases be less than the CPU saved by not having all threads competing.
I should add that this type of simple scheduler is not a new one - it is pretty much a Priority / Foreground-Background / Round-Robin / Cooperative / Pre-Emptive scheduler (see Wikipedia) combining several well-understood algorithms that I first came across many decades ago in my mainframe days.
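To make the hand-off idea concrete, here is a pure-Python model (not CPython internals, and not a proposed implementation): the releasing thread selects the next runner by priority and wakes only that thread, instead of dropping the lock and letting every waiter compete. The priority numbering and round-robin tie-break are illustrative assumptions.

```python
# Pure-Python model of an explicit hand-off lock: release() chooses the
# successor and wakes only that thread, so waiters never race for the lock.
import heapq
import threading

class HandoffLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None
        self._waiters = []          # heap of (priority, arrival, thread, event)
        self._counter = 0           # arrival order gives a round-robin tie-break

    def acquire(self, priority=10):
        me = threading.current_thread()
        with self._cond:
            if self._owner is None:
                self._owner = me
                return
            event = threading.Event()
            self._counter += 1
            heapq.heappush(self._waiters, (priority, self._counter, me, event))
        event.wait()                # woken only when explicitly handed the lock

    def release(self):
        with self._cond:
            if not self._waiters:
                self._owner = None
                return
            # The releasing thread *chooses* the successor: lowest priority
            # number first, oldest waiter first within the same priority.
            _, _, successor, event = heapq.heappop(self._waiters)
            self._owner = successor  # ownership is transferred, not contended
            event.set()              # wake only the chosen thread
```

The point of the model is purely structural: the lock is never dropped into a free-for-all; the departing thread decides who runs next, which is exactly where a priority or I/O-vs-CPU policy could be plugged in.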
In general terms this seems like a very simple thing to implement, but of course reality is rarely quite that easy.
And of course, I might not be up to date with the current state of the GIL, or I might be off the mark here in other ways, but I want to put this forward and see if people think it is a good idea and whether there is any interest in implementing it.
Thanks.