diff --git a/config.yaml b/config.yaml index dd0cf1be..61f7588a 100644 --- a/config.yaml +++ b/config.yaml @@ -64,11 +64,10 @@ episodes: - profiling-lines.md - profiling-conclusion.md - optimisation-introduction.md +- optimisation-data-structures-algorithms.md +- optimisation-minimise-python.md - optimisation-use-latest.md - optimisation-memory.md -- optimisation-list-tuple.md -- optimisation-dict-set.md -- optimisation-minimise-python.md - optimisation-conclusion.md # Information for Learners diff --git a/fig/viztracer-example.png b/fig/viztracer-example.png new file mode 100644 index 00000000..08b6b9bb Binary files /dev/null and b/fig/viztracer-example.png differ diff --git a/md5sum.txt b/md5sum.txt index d75d0dfb..da0347ca 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -1,19 +1,18 @@ "file" "checksum" "built" "date" "CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2024-01-03" "LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-01-03" -"config.yaml" "71b7cc873eb97b0f2c6a1f8d878a817f" "site/built/config.yaml" "2024-01-29" +"config.yaml" "107b738ce1400fd80598278d369d73d1" "site/built/config.yaml" "2024-01-30" "index.md" "5d420b7de3ab84e1eda988e6bc4d58b4" "site/built/index.md" "2024-01-29" "links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2024-01-03" -"episodes/profiling-introduction.md" "a043fb5f1f772b7415f32175810c5f1e" "site/built/profiling-introduction.md" "2024-01-29" -"episodes/profiling-functions.md" "85294a5fa905fc2ea9dd5068164aed40" "site/built/profiling-functions.md" "2024-01-29" -"episodes/profiling-lines.md" "639730e60d1dee7cfa6624a24de92abe" "site/built/profiling-lines.md" "2024-01-29" +"episodes/profiling-introduction.md" "43c2a8c0f88185ca507a1eb1ea9cebe3" "site/built/profiling-introduction.md" "2024-01-30" +"episodes/profiling-functions.md" "f610fd53c1ebab976ff9a0dcdcc45526" "site/built/profiling-functions.md" "2024-01-30" +"episodes/profiling-lines.md" 
"9bbfacf2ba050c1bbedca0d1daacbe77" "site/built/profiling-lines.md" "2024-01-30" "episodes/profiling-conclusion.md" "340969a321636eb94fff540191a511e7" "site/built/profiling-conclusion.md" "2024-01-29" -"episodes/optimisation-introduction.md" "ae3baa53a96cab9c1aace409de6c7634" "site/built/optimisation-introduction.md" "2024-01-29" -"episodes/optimisation-use-latest.md" "33531063e2b4d3b473f3f066cea65a14" "site/built/optimisation-use-latest.md" "2024-01-29" -"episodes/optimisation-memory.md" "ae7bb4df0f5b640f6000d65c1ee145b1" "site/built/optimisation-memory.md" "2024-01-29" -"episodes/optimisation-list-tuple.md" "9e9a398923bf1137ce92fa6e78446746" "site/built/optimisation-list-tuple.md" "2024-01-29" -"episodes/optimisation-dict-set.md" "64b8261d0c29bea3135e48501e6f8b56" "site/built/optimisation-dict-set.md" "2024-01-29" -"episodes/optimisation-minimise-python.md" "efab1af49121b0a197dab94e49b6ff30" "site/built/optimisation-minimise-python.md" "2024-01-29" +"episodes/optimisation-introduction.md" "e654a1c147b600a271682b31773b9474" "site/built/optimisation-introduction.md" "2024-01-30" +"episodes/optimisation-data-structures-algorithms.md" "35babab92cb48f2a462e48298d6c8235" "site/built/optimisation-data-structures-algorithms.md" "2024-01-30" +"episodes/optimisation-minimise-python.md" "92567c502a88fac1327bfe4d5da57c5e" "site/built/optimisation-minimise-python.md" "2024-01-30" +"episodes/optimisation-use-latest.md" "829f7a813b0a9a131fa22e6dbb534cf7" "site/built/optimisation-use-latest.md" "2024-01-30" +"episodes/optimisation-memory.md" "327cb08d4f7a7b10d80abbc2442f35dd" "site/built/optimisation-memory.md" "2024-01-30" "episodes/optimisation-conclusion.md" "e4a79aa1713310c75bc0ae9e258641c2" "site/built/optimisation-conclusion.md" "2024-01-29" "instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2024-01-03" "learners/setup.md" "50d49ff7eb0ea2d12d75773ce1decd45" "site/built/setup.md" "2024-01-29" diff --git 
a/optimisation-dict-set.md b/optimisation-data-structures-algorithms.md
similarity index 57%
rename from optimisation-dict-set.md
rename to optimisation-data-structures-algorithms.md
index 4bc8c0a2..822eac26 100644
--- a/optimisation-dict-set.md
+++ b/optimisation-data-structures-algorithms.md
@@ -1,25 +1,203 @@
---
-title: "Dictionaries & Sets"
+title: "Data Structures & Algorithms"
teaching: 0
exercises: 0
---

:::::::::::::::::::::::::::::::::::::: questions

+- What's the most efficient way to construct a list?
+- When should Tuples be used?
+- When should generator functions be used?
- When are sets appropriate?
-- How are sets used in Python?
- What is the best way to search a list?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

-- Able to identify appropriate use-cases for dictionaries and sets
-- Able to use dictionaries and sets effectively
+- Able to summarise how Lists and Tuples work behind the scenes.
+- Able to identify appropriate use-cases for tuples.
+- Able to use generator functions in appropriate situations.
+- Able to utilise dictionaries and sets effectively
+- Able to use `bisect_left()` to perform a binary search of a list or array

::::::::::::::::::::::::::::::::::::::::::::::::

+## Lists
+
+Lists are a fundamental data structure within Python.
+
+The Python list is implemented as a form of the dynamic array found within many programming languages under different names (C++: `std::vector`, Java: `ArrayList`, R: `vector`, Julia: `Vector`).
+
+Lists allow direct and sequential element access, with the convenience of appending items.
+
+This is achieved by internally storing items in a static array.
+This array, however, can be longer than the `List`.
+When an item is added, the `List` checks whether it has enough spare space to add the item to the end.
+If it doesn't, it will reallocate a larger array, copy across the elements, and deallocate the old array.
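+The over-allocation can be observed from Python with `sys.getsizeof()` (a minimal sketch; the exact sizes and resize counts are CPython implementation details and will vary between versions):
+
+```python
import sys

li = []
sizes = []
for i in range(1000):
    li.append(i)
    sizes.append(sys.getsizeof(li))

# Each growth in getsizeof() marks a reallocation of the backing array;
# 1000 appends trigger far fewer than 1000 resizes.
resizes = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur > prev)
print(f"1000 appends caused {resizes} internal resizes")
+```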
+
+The growth is dependent on the implementation's growth factor.
+CPython for example uses [`newsize + (newsize >> 3) + 6`](https://github.com/python/cpython/blob/a571a2fd3fdaeafdfd71f3d80ed5a3b22b63d0f7/Objects/listobject.c#L74), which works out to an over-allocation of roughly 12.5%.
+
+![The relationship between the number of appends to an empty list, and the number of internal resizes in CPython.](episodes/fig/cpython_list_allocations.png){alt='A line graph displaying the relationship between the number of calls to append() and the number of internal resizes of a CPython list. It has a logarithmic relationship, at 1 million appends there have been 84 internal resizes.'}
+
+This has two implications:
+
+* If you are creating large static lists, they will use up to 12.5% excess memory.
+* If you are growing a list with `append()`, there will be large amounts of redundant allocations and copies as the list grows.
+
+### List Comprehension
+
+If creating a list via `append()` is undesirable, the natural alternative is to use list comprehension.
+
+List comprehension can be twice as fast at building lists as using `append()`.
+This is primarily because list comprehension allows Python to offload much of the computation into faster C code.
+General Python loops, in contrast, can be used for much more, so they remain in Python bytecode during computation, which has additional overheads.
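+One way to see where the loop's overhead comes from is to inspect the bytecode with the standard library's `dis` module (a sketch; the exact opcodes printed depend on your Python version):
+
+```python
import dis

def with_append():
    li = []
    for i in range(10):
        li.append(i)  # a method lookup and call executed on every iteration
    return li

def with_comprehension():
    # the loop body is dispatched to a single specialised code object
    return [i for i in range(10)]

dis.dis(with_append)
dis.dis(with_comprehension)
+```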
+
+This can be demonstrated with the below benchmark:
+
+```python
+from timeit import timeit
+
+def list_append():
+    li = []
+    for i in range(100000):
+        li.append(i)
+
+def list_preallocate():
+    li = [0]*100000
+    for i in range(100000):
+        li[i] = i
+
+def list_comprehension():
+    li = [i for i in range(100000)]
+
+repeats = 1000
+print(f"Append: {timeit(list_append, number=repeats):.2f}ms")
+print(f"Preallocate: {timeit(list_preallocate, number=repeats):.2f}ms")
+print(f"Comprehension: {timeit(list_comprehension, number=repeats):.2f}ms")
+```
+
+`timeit` is used to run each function 1000 times, providing the below averages:
+
+```output
+Append: 3.50ms
+Preallocate: 2.48ms
+Comprehension: 1.69ms
+```
+
+Results will vary between Python versions, hardware and list lengths, but in this example list comprehension was 2x faster, with pre-allocation faring in the middle. Although this is milliseconds, it can soon add up if you are regularly creating lists.
+
+## Tuples
+
+In contrast, Python's Tuples are immutable static arrays (similar to strings); their elements cannot be modified and they cannot be resized.
+
+Their potential use-cases are greatly reduced by these two limitations; they are only suitable for groups of immutable properties.
+
+Tuples can still be joined with the `+` operator, similar to appending lists, however the result is always a newly allocated tuple (without a list's over-allocation).
+
+Python caches a large number of short (1-20 element) tuples. This greatly reduces the cost of creating and destroying them during execution, at the cost of a slight memory overhead.
+
+This can be easily demonstrated with Python's `timeit` module in your console.
+
+```sh
+>python -m timeit "li = [0,1,2,3,4,5]"
+10000000 loops, best of 5: 26.4 nsec per loop
+
+>python -m timeit "tu = (0,1,2,3,4,5)"
+50000000 loops, best of 5: 7.99 nsec per loop
+```
+
+It takes 3x as long to allocate a short list as a tuple of equal length.
This gap only grows with the length, as the tuple cost remains roughly static whereas the cost of allocating the list grows slightly.
+
+
+## Generator Functions
+
+You may not even require your data to be stored in a list or tuple if it is only accessed once and in sequence.
+
+Generators are special functions that use `yield` rather than `return`. Each time the generator is called, it resumes computation until the next `yield` statement is hit, to return the next value.
+
+This avoids needing to allocate a data structure, and can greatly reduce memory utilisation.
+
+Common examples for generators include:
+
+* Reading from a large file that may not fit in memory.
+* Any generated sequence where the required length is unknown.
+
+The below example demonstrates how a generator function (`fibonacci_generator()`) differs from one that simply returns a constructed list (`fibonacci_list()`).
+
+```python
+from timeit import timeit
+
+N = 1000000
+repeats = 1000
+
+def fibonacci_generator():
+    a = 0
+    b = 1
+    while True:
+        yield b
+        a, b = b, a + b
+
+def fibonacci_list(max_val):
+    rtn = []
+    a = 0
+    b = 1
+    while b < max_val:
+        rtn.append(b)
+        a, b = b, a + b
+    return rtn
+
+def test_generator():
+    t = 0
+    max_val = N
+    for i in fibonacci_generator():
+        if i > max_val:
+            break
+        t += i
+
+def test_list():
+    li = fibonacci_list(N)
+    t = 0
+    for i in li:
+        t += i
+
+def test_list_long():
+    t = 0
+    max_val = N
+    li = fibonacci_list(max_val * 10)
+    for i in li:
+        if i > max_val:
+            break
+        t += i
+
+print(f"Gen: {timeit(test_generator, number=repeats):.5f}ms")
+print(f"List: {timeit(test_list, number=repeats):.5f}ms")
+print(f"List_long: {timeit(test_list_long, number=repeats):.5f}ms")
+```
+
+The performance of `test_generator()` and `test_list()` is comparable; however, `test_list_long()`, which generates a list with 5 extra elements (35 vs 30), is consistently slower.
+
+```output
+Gen: 0.00251ms
+List: 0.00256ms
+List_long: 0.00332ms
+```
+
+Unlike list comprehensions, generator functions will normally involve a Python loop. Therefore, their performance is typically slower than constructing a list, where much of the computation can be offloaded to the CPython backend.
+
+::::::::::::::::::::::::::::::::::::: callout
+
+The use of `max_val` in the previous example moves the value of `N` from global to local scope.
+
+The Python interpreter checks local scope first when finding variables; this makes accessing local-scope variables slightly faster than global-scope ones, which is most visible when a variable is being accessed regularly, such as within a loop.
+
+Replacing the use of `max_val` with `N` inside `test_generator()` causes the function to consistently perform a little slower than `test_list()`, whereas before the change it would normally be a little faster.
+
+:::::::::::::::::::::::::::::::::::::::::::::
+
+
## Dictionaries

Dictionaries are another fundamental Python data-structure.
@@ -163,15 +341,15 @@ uniqueListSort: 2.67ms
:::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::

-## Checking Existence
+## Searching

-Independent of the performance to construct a unique set (as covered in the previous), it's worth identifying the performance to search the data-structure to retrieve an item or check whether it exists.
+Independent of the performance to construct a unique set (as covered in the previous section), it's worth identifying the performance of searching the data-structure to retrieve an item or check whether it exists.

The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all inserted items have collided, this would mean checking every single item.
In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.

In contrast, if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore the worst case is similar to that of the hashing data-structure; however, it is guaranteed in cases where the item is missing. Similarly, on average we would expect an item to be found halfway through the list, meaning that an average search will require checking half of the items.

-If the list or array is however sorted a binary search can be used. A binary search divides the list in half and checks which half the target item would be found in, this continues recursively until the search is exhausted whereby the item should be found or dismissed. This is significantly faster than performing a linear search of the list, checking `log N` items every time.
+If however the list or array is sorted, a binary search can be used. A binary search divides the list in half and checks which half the target item would be found in; this continues recursively until the search is exhausted, whereby the item is either found or dismissed. This is significantly faster than performing a linear search of the list, checking a total of `log N` items every time.

The below code demonstrates these approaches and their performance.
@@ -232,10 +410,13 @@ binary_search_list: 5.79ms

These results are subject to change based on the number of items and the proportion of searched items that exist within the list. However, the pattern is likely to remain the same. Linear searches should be avoided!

+
::::::::::::::::::::::::::::::::::::: keypoints

+- List comprehension should be preferred when constructing lists.
+- Where appropriate, Tuples and Generator functions should be preferred over Python lists.
- Dictionaries and sets are appropriate for storing a collection of unique data with no intrinsic order for random access.
- When used appropriately, dictionaries and sets are significantly faster than lists.
-- If a list or array is used in-place of a set, it should be sorted and searched using `bisect_left()` (binary search).
+- If searching a list or array is required, it should be sorted and searched using `bisect_left()` (binary search).

::::::::::::::::::::::::::::::::::::::::::::::::

diff --git a/optimisation-introduction.md b/optimisation-introduction.md
index 049ae5fc..52f863be 100644
--- a/optimisation-introduction.md
+++ b/optimisation-introduction.md
@@ -25,10 +25,10 @@ Now that you're able to find the most expensive components of your code with pro

In order to optimise code for performance, it is necessary to have an understanding of what a computer is doing to execute it.

-Even a high-level understanding of a typical computer architecture, the most common data-structures/algorithms and how Python executes your code, enable the identification of suboptimal approaches. If you have learned to write code informally out of necessity, to get something to work, it's not uncommon to have collected some bad habits along the way.
+Even a high-level understanding of a typical computer architecture, the most common data-structures and algorithms, and how Python executes your code enables the identification of suboptimal approaches. If you have learned to write code informally out of necessity, to get something to work, it's not uncommon to have collected some bad habits along the way.

+
-The remaining content is often abstract knowledge, that is transferable to the vast majority of programming languages.
+The remaining content is often abstract knowledge that is transferable to the vast majority of programming languages.
This is because the hardware architecture, data-structures and algorithms used are common to many languages, and they hold some of the greatest influence over performance bottlenecks.

## Premature Optimisation
@@ -36,32 +36,91 @@ The remaining content is often abstract knowledge, that is transferable to the v

This classic quote among computer scientists states: when considering optimisation, it is important to focus on the potential impact, both to the performance and maintainability of the code.

-Profiling is a valuable tool in this cause. Should effort be expended to optimise a component which occupies 1% of the runtime? Or would that time be better spent focusing on the mostly components?
+Profiling is a valuable tool to this end. Should effort be expended to optimise a component which occupies 1% of the runtime? Or would that time be better spent focusing on the most expensive components?

Advanced optimisations, mostly outside the scope of this course, can increase the cost of maintenance by obfuscating what code is doing. Even if you are a solo developer working on private code, your future self should be able to easily comprehend your implementation.

Therefore, the balance between the impact to both performance and maintainability should be considered when optimising code.

-This is not to say, don't consider optimisation when first writing code. The selection of appropriate algorithms and data-structures as will be covered is good practice, simply don't fret over a need to micro-optimise every small component of the code that you write.
+This is not to say that you shouldn't consider performance when first writing code. The selection of appropriate algorithms and data-structures covered in this course forms good practice; simply don't fret over a need to micro-optimise every small component of the code that you write.
+
+
+## Ensuring Reproducible Results
+
+
+When optimising your code, you are making speculative changes.
It's easy to make mistakes, many of which can be subtle. Therefore, it's important to have a strategy in place to check that the outputs remain correct.
+
+There are a plethora of methods for testing code. The most common is unit testing; the Python package [pytest](https://docs.pytest.org/en/latest/) provides an easy-to-use unit testing framework.
+
+Python files containing tests are created; their filenames must begin with `test`.
+
+Within this file, any functions whose names begin with `test` are considered tests that can be executed by pytest.
+
+The `assert` keyword is used to test whether a condition evaluates to `True`.
+
+```python
+# file: test_demonstration.py
+
+# A simple function to be tested, this could instead be an imported package
+def squared(x):
+    return x**2
+
+# A simple test case
+def test_example():
+    assert squared(5) == 24
+```
+
+When `py.test` is called inside a working directory, it will then recursively find and execute all the available tests.
+
+```sh
+>py.test
+================================================= test session starts =================================================
+platform win32 -- Python 3.10.12, pytest-7.3.1, pluggy-1.3.0
+rootdir: C:\demo
+plugins: anyio-4.0.0, cov-4.1.0, xdoctest-1.1.2
+collected 1 item
+
+test_demonstration.py F                                                                                          [100%]
+
+====================================================== FAILURES =======================================================
+____________________________________________________ test_example _____________________________________________________
+
+    def test_example():
+>       assert squared(5) == 24
+E       assert 25 == 24
+E        +  where 25 = squared(5)
+
+test_demonstration.py:9: AssertionError
+=============================================== short test summary info ===============================================
+FAILED test_demonstration.py::test_example - assert 25 == 24
+================================================== 1 failed in 0.07s ==================================================
+```
+
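+The same idea scales to guarding an optimisation: keep the original implementation around as a reference, and assert that the optimised version agrees with it across a range of inputs. The functions below are hypothetical stand-ins, not code from this lesson:
+
+```python
# file: test_optimised.py (hypothetical)

def sum_of_squares_reference(values):
    # The original, straightforward implementation.
    total = 0
    for v in values:
        total += v ** 2
    return total

def sum_of_squares_optimised(values):
    # The optimised rewrite under test.
    return sum(v * v for v in values)

def test_optimised_matches_reference():
    for case in ([], [1], [3, 1, 4, 1, 5], list(range(1000))):
        assert sum_of_squares_optimised(case) == sum_of_squares_reference(case)
+```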
+Whilst not designed for benchmarking, it does provide the total time the test suite took to execute. In some cases this may help identify whether the optimisations have had a significant impact on performance.
+
+This is only the simplest introduction to using pytest; it has advanced features common to unit testing frameworks, such as fixtures, mocking and test skipping.
+[Pytest's documentation](https://docs.pytest.org/en/latest/how-to/index.html) covers all this and more.
+
+
## Coming Up

In the remainder of this course we will cover:

-- todo
-
-
+- Data Structures & Algorithms
+  - Lists & Tuples
+  - Dictionaries & Sets
+  - Generator Functions
+  - Searching
+- How Python Executes
+  - Why less Python is often faster
+  - How to use NumPy for performance
+  - How to get the most from pandas
+- Newer is Often Faster
+  - Keeping Python and packages up to date
+- How Computer Hardware Affects Performance
+  - Why accessing some variables can be faster than others
+  - Putting latencies in perspective

::::::::::::::::::::::::::::::::::::: keypoints

diff --git a/optimisation-list-tuple.md b/optimisation-list-tuple.md
deleted file mode 100644
index abb2a487..00000000
--- a/optimisation-list-tuple.md
+++ /dev/null
@@ -1,202 +0,0 @@
----
-title: "Lists (& Tuples)"
-teaching: 0
-exercises: 0
----
-
-:::::::::::::::::::::::::::::::::::::: questions
-
-- What's the most efficient way to construct a list?
-- When should Tuples be used?
-- When should generator functions be used?
-
-::::::::::::::::::::::::::::::::::::::::::::::::
-
-::::::::::::::::::::::::::::::::::::: objectives
-
-- Able to summarise how Lists and Tuples work behind the scenes.
-- Able to identify appropriate use-cases for tuples.
-- Able to use generator functions in appropriate situations.
-
-::::::::::::::::::::::::::::::::::::::::::::::::
-
-## Lists
-
-Lists are a fundamental data structure within Python.
- -It is implemented as a form of dynamic array found within many programming languages by different names (C++: `std::vector`, Java: `ArrayList`, R: `vector`, Julia: `Vector`). - -They allows direct and sequential element access, with the convenience to append items. - -This is achieved by internally storing items in a static array. -This array however can be longer than the `List`. -When an item is added, the `List` checks whether it has enough spare space to add the item to the end. -If it doesn't, it will reallocate a larger array, copy across the elements, and deallocate the old array. - -The growth is dependent on the implementation's growth factor. -CPython for example uses [`newsize + (newsize >> 3) + 6`](https://github.com/python/cpython/blob/a571a2fd3fdaeafdfd71f3d80ed5a3b22b63d0f7/Objects/listobject.c#L74), which works out to an over allocation of roughly ~12.5%. - -![The relationship between the number of appends to an empty list, and the number of internal resizes in CPython.](episodes/fig/cpython_list_allocations.png){alt='A line graph displaying the relationship between the number of calls to append() and the number of internal resizes of a CPython list. It has a logarithmic relationship, at 1 million appends there have been 84 internal resizes.'} - -This has two implications: - -* If you are creating large static lists, they will use upto 12.5% excess memory. -* If you are growing a list with `append()`, there will be large amounts of redundant allocations and copies as the list grows. - -### List Comprehension - -If creating a list via `append()` is undesirable, the natural alternative is to use list-comprehension. - -List comprehension can be twice as fast at building lists than using `append()`. -This is primarily because list-comprehension allows Python to offload much of the computation into faster C code. 
-General python loops in contrast can be used for much more, so they remain in Python bytecode during computation which has additional overheads. - -This can be demonstrated with the below benchmark: - -```python -from timeit import timeit - -def list_append(): - li = [] - for i in range(100000): - li.append(i) - -def list_preallocate(): - li = [0]*100000 - for i in range(100000): - li[i] = i - -def list_comprehension(): - li = [i for i in range(100000)] - -repeats = 1000 -print(f"Append: {timeit(list_append, number=repeats):.2f}ms") -print(f"Preallocate: {timeit(list_preallocate, number=repeats):.2f}ms") -print(f"Comprehension: {timeit(list_comprehension, number=repeats):.2f}ms") -``` - -`timeit` is used to run each function 1000 times, providing the below averages: - -```output -Append: 3.50ms -Preallocate: 2.48ms -Comprehension: 1.69ms -``` - -Results will vary between Python versions, hardware and list lengths. But in this example list comprehension was 2x faster, with pre-allocate fairing in the middle. Although this is milliseconds, this can soon add up if you are regularly creating lists. - -## Tuples - -In contrast, Python's Tuples are immutable static arrays (similar to strings), their elements cannot be modified and they cannot be resized. - -Their potential use-cases are greatly reduced due to these two limitations, they are only suitable for groups of immutable properties. - -Tuples can still be joined with the `+` operator, similar to appending lists, however the result is always a newly allocated tuple (without a list's over-allocation). - -Python caches a large number of short (1-20 element) tuples. This greatly reduces the cost of creating and destroying them during execution at the cost of a slight memory overhead. - -This can be easily demonstrated with Python's `timeit` module in your console. 
- -```sh ->python -m timeit "li = [0,1,2,3,4,5]" -10000000 loops, best of 5: 26.4 nsec per loop - ->python -m timeit "tu = (0,1,2,3,4,5)" -50000000 loops, best of 5: 7.99 nsec per loop -``` - -It takes 3x as long to allocate a short list than a tuple of equal length. This gap only grows with the length, as the tuple cost remains roughly static whereas the cost of allocating the list grows slightly. - - -## Generator Functions - -You may not even require your data be stored in a list or tuple if it is only accessed once and in sequence. - -Generators are special functions, that use `yield` rather than `return`. Each time the generator is called, it resumes computation until the next `yield` statement is hit to return the next value. - -This avoids needing to allocate a data structure, and can greatly reduce memory utilisation. - -Common examples for generators include: -* Reading from a large file that may not fit in memory. -* Any generated sequence where the required length is unknown. - -The below example demonstrates how a generator function (`fibonnaci_generator()`) differs from one that simply returns a constructed list (`fibonacci_list()`). 
- -```python -from timeit import timeit - -N = 1000000 -repeats = 1000 - -def fibonacci_generator(): - a=0 - b=1 - while True: - yield b - a,b= b,a+b - -def fibonacci_list(max_val): - rtn = [] - a=0 - b=1 - while b < max_val: - rtn.append(b) - a,b= b,a+b - return rtn - -def test_generator(): - t = 0 - max_val = N - for i in fibonacci_generator(): - if i > max_val: - break - t += i - -def test_list(): - li = fibonacci_list(N) - t = 0 - for i in li: - t += i - -def test_list_long(): - t = 0 - max_val = N - li = fibonacci_list(max_val*10) - for i in li: - if i > max_val: - break - t += i - -print(f"Gen: {timeit(test_generator, number=repeats):.5f}ms") -print(f"List: {timeit(test_list, number=repeats):.5f}ms") -print(f"List_long: {timeit(test_list_long, number=repeats):.5f}ms") -``` - -The performance of `test_generator()` and `test_list()` are comparable, however `test_long_list()` which generates a list with 5 extra elements (35 vs 30) is consistently slower. - -```output -Gen: 0.00251ms -List: 0.00256ms -List_long: 0.00332ms -``` - -Unlike list comprehension, a generator function will normally involve a Python loop. Therefore, their performance is typically slower than constructing a list where much of the computation can be offloaded to the CPython backend. - -::::::::::::::::::::::::::::::::::::: callout - -The use of `max_val` in the previous example moves the value of `N` from global to local scope. - -The Python interpreter checks local scope first when finding variables, therefore this makes accessing local scope variables slightly faster than global scope, this is most visible when a variable is being accessed regularly such as within a loop. - -Replacing the use of `max_val` with `N` inside `test_generator()` causes the function to consistently perform a little slower than `test_list()`, whereas before the change it would normally be a little faster. 
-
-:::::::::::::::::::::::::::::::::::::::::::::
-
-
-
-::::::::::::::::::::::::::::::::::::: keypoints
-
-- List comprehension should be preferred when constructing lists.
-- Where appropriate, Tuples and Generator functions should be preferred over Python lists.
-
-::::::::::::::::::::::::::::::::::::::::::::::::

diff --git a/optimisation-memory.md b/optimisation-memory.md
index f6d8fb6a..505140f1 100644
--- a/optimisation-memory.md
+++ b/optimisation-memory.md
@@ -23,11 +23,10 @@ exercises: 0

The storage and movement of data play a large role in the performance of executing software.

-
When reading a variable, to perform an operation with it, the CPU will first look in its registers. These exist per core; they are where computation is actually performed. Accessing them is incredibly fast, but there is only enough storage for around 32 variables (a typical number, e.g. 4 bytes each). As the register file is so small, most variables won't be found and the CPU's caches will be searched.

-It will first check the current processing core's L1 cache, this small cache (typically 64 KB per physical core) is the smallest and fastest to access cache on a processor.
+It will first check the current processing core's L1 (Level 1) cache; this small cache (typically 64 KB per physical core) is the smallest and fastest to access on a processor.

If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache is slower to access but larger than L1 (typically 1-3MB per core). This process then repeats for the L3 cache, which may be shared among all cores of the processor. This cache again has higher latency to access, but increased size (typically slightly larger than the total L2 cache size).

If the variable has not been found in any of the CPU's caches, the CPU will look to the computer's RAM.
This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard).
@@ -53,7 +52,6 @@ Python as a programming language, does not give you enough control to carefully

However, all is not lost; packages such as `numpy` and `pandas`, implemented in C/C++, enable Python users to take advantage of efficient memory accesses (when they are used correctly).

-*More on this later*

:::::::::::::::::::::::::::::::::::::::::::::

+Python is an interpreted programming language. When you execute your `.py` file, the (default) CPython back-end compiles your Python source code to an intermediate bytecode. This bytecode is then interpreted in software at runtime, generating instructions for the processor as necessary. This interpretation stage, and other features of the language, harm the performance of Python (whilst improving its usability).

-In comparison, many languages such as C/C++ compile directly to machine code. This allows them to better exploit hardware nuance to achieve fast performance, at the cost of compiled software not being cross-platform.
+In comparison, many languages such as C/C++ compile directly to machine code. This allows the compiler to perform low-level optimisations that better exploit hardware nuance to achieve fast performance. This however comes at the cost of compiled software not being cross-platform.

-Whilst Python will rarely be as fast as compiled languages like C/C++, it is possible to take advantage of the CPython back-end and other libraries such as NumPy and pandas that have been written in compiled languages to expose this performance.
+Whilst Python will rarely be as fast as compiled languages like C/C++, it is possible to take advantage of the CPython back-end and packages such as NumPy and Pandas that have been written in compiled languages to expose this performance.

-A simple example of this would be to search a list.
+A simple example of this would be to perform a linear search of a list (in the previous episode we did say this is not recommended).
 
 The below example creates a list of 2500 integers in the inclusive-exclusive range `[0, 5000)`.
 It then searches for all of the even numbers in that range.
 `searchlistPython()` is implemented manually, iterating `ls` and checking each individual item in Python code.
@@ -316,7 +316,9 @@ Passing a Python list to `numpy.random.choice()` is 65.6x slower than passing a
 200000 loops, best of 5: 1.22 usec per loop
 ```
 
-Regardless, for this simple application of `numpy.random.choice()`, merely using `numpy.random.randint(len())` is around 4x faster regardless whether a Python list or NumPy array is used. With `numpy.random.choice()` being such a general function (it has many of possible parameters), there is significant internal branching. If you don't require this advanced functionality and are calling a function regularly, it can be worthwhile considering using a more limited function.
+Regardless, for this simple application of `numpy.random.choice()`, merely using `numpy.random.randint(len())` is around 4x faster regardless of whether a Python list or NumPy array is used.
+
+With `numpy.random.choice()` being such a general function (it has many possible parameters), there is significant internal branching. If you don't require this advanced functionality and are calling a function regularly, it can be worthwhile considering a more limited function.
 
 There is, however, a trade-off: using `numpy.random.choice()` can be clearer to someone reading your code, and is more difficult to use incorrectly.
@@ -413,7 +415,7 @@ Pandas' methods by default operate on columns. Each column or series can be thou
 Following the theme of this episode, iterating over the rows of a data frame using a `for` loop is not advised.
 The pythonic iteration will be slower than other approaches.
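The warning above, that Python-level row iteration is slower than other approaches, can be illustrated with a small sketch (the data frame and column names here are hypothetical, not taken from the lesson's own benchmark): summing two columns row-by-row with `iterrows()` is compared against the equivalent vectorised column operation.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame; column names are illustrative only.
n = 10_000
df = pd.DataFrame({"a": np.arange(n), "b": np.arange(n)})

def row_loop():
    # A Python-level loop: pandas must construct a Series per row.
    return [row["a"] + row["b"] for _, row in df.iterrows()]

def vectorised():
    # The addition runs in compiled NumPy code over whole columns.
    return (df["a"] + df["b"]).tolist()

# Both produce identical results; the vectorised form is typically
# orders of magnitude faster on large frames.
assert row_loop() == vectorised()
```

Timing these two functions with `timeit` (as the later examples in this episode do) shows the gap growing with the number of rows, since the per-row `Series` construction cost is paid on every iteration.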
-Pandas allows it's own methods to be applied to rows in many cases by passing `axis=1`, where available these functions should be preferred over manual loops. Where you can't find a suitable method `apply()` can be similar the `map()`/`vectorize()` to apply your own function to rows.
+Pandas allows its own methods to be applied to rows in many cases by passing `axis=1`; where available, these functions should be preferred over manual loops. Where you can't find a suitable method, `apply()` can be used, which is similar to `map()`/`vectorize()`, to apply your own function to rows.
 
 ```python
 from timeit import timeit
@@ -469,7 +471,7 @@ for_iterrows: 1677.14ms
 pandas_apply: 390.49ms
 ```
 
-**However**, rows don't exist in memory as arrays (columns do!), so `apply()` does not take advantage of NumPys vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns NumPy.
+However, rows don't exist in memory as arrays (columns do!), so `apply()` does not take advantage of NumPy's vectorisation. You may be able to go a step further and avoid explicitly operating on rows entirely by passing only the required columns to NumPy.
 
 ```python
 def vectorize():
@@ -504,7 +506,7 @@ Whilst still nearly 100x slower than pure vectorisation, it's twice as fast as `
 to_dict: 131.15ms
 ```
 
-This is because indexing into Pandas' `Series` (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large data-table.
+This is because indexing into Pandas' `Series` (rows) is significantly slower than indexing into a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example); however, the stark difference in access speed is more than enough to overcome that cost for any large dataframe.
 ```python
 from timeit import timeit
diff --git a/optimisation-use-latest.md b/optimisation-use-latest.md
index 2b62e37c..2b623957 100644
--- a/optimisation-use-latest.md
+++ b/optimisation-use-latest.md
@@ -29,7 +29,7 @@ It's important to use the latest Python wherever feasible. In addition to new fe
 Future proposals, such as changes to the [JIT](https://tonybaloney.github.io/posts/python-gets-a-jit.html) and [GIL](https://peps.python.org/pep-0703/), will provide further improvements to performance.
 
-Similarly, major packages particularly those with a performance focus such as NumPy and Pandas should be kept up to date for similar reasons.
+Similarly, major packages with a performance focus, such as NumPy and Pandas, should be kept up to date for the same reasons.
 
 These improvements are often free, requiring minimal changes to any code (unlike the jump from Python 2 to Python 3).
@@ -40,67 +40,10 @@ Performance regressions within major packages should be considered rare, they of
 However, the more packages and language features your code touches, and the older the Python it currently uses, the greater the chance of incompatibilities making it difficult to upgrade.
 
-When updating, it's important to have tests in place, to validate the correctness of your code.
-A single small dependent package could introduce a breaking change.
+Similar to optimising, when updating, it's important to have tests in place to validate the correctness of your code before and after changes.
+An update to a single small dependent package could introduce a breaking change.
 This could cause your code to crash, or, worse, subtly change your results.
-
-When optimising your code, these tests will come in handy too.
-Mistakes are easily introduced when updating code that wasn't written recently, even for experienced programmers, so be sure that they will be found.
-
-## Ensuring Reproducible Results
-
-There are a plethora of methods for testing code.
-The most common is the package [pytest](https://docs.pytest.org/en/latest/) which provides an easy to use unit testing framework.
-
-Python files containing tests are created, their filename must begin with `test`.
-
-Within this file, any functions that begin `test` are considered tests that can be executed by pytest.
-
-The `assert` keyword is used, to test whether a condition evaluates to `True`.
-
-```python
-# file: test_demonstration.py
-
-# A simple function to be tested, this could instead be an imported package
-def squared(x):
-    return x**2
-
-# A simple test case
-def test_example():
-    assert squared(5) == 24
-```
-
-When `py.test` is called inside a working directory, it will then recursively find and execute all the available tests.
-
-```sh
->py.test
-================================================= test session starts =================================================
-platform win32 -- Python 3.10.12, pytest-7.3.1, pluggy-1.3.0
-rootdir: C:\demo
-plugins: anyio-4.0.0, cov-4.1.0, xdoctest-1.1.2
-collected 1 item
-
-test_demonstration.py F                                                                        [100%]
-
-====================================================== FAILURES =======================================================
-____________________________________________________ test_example _____________________________________________________
-
-    def test_example():
->       assert squared(5) == 24
-E       assert 25 == 24
-E        +  where 25 = squared(5)
-
-test_demonstration.py:9: AssertionError
-=============================================== short test summary info ===============================================
-FAILED test_demonstration.py::test_example - assert 25 == 24
-================================================== 1 failed in 0.07s ==================================================
-```
-
-This is only the simplest introduction to using pytest, it has advanced features such as fixtures, mocking and test skipping.
-[Pytest's documentation](https://docs.pytest.org/en/latest/how-to/index.html) covers all this and more.
-
-
-
 ## Updating Python & Packages
diff --git a/profiling-functions.md b/profiling-functions.md
index 2aa851e7..1de4f3a9 100644
--- a/profiling-functions.md
+++ b/profiling-functions.md
@@ -35,6 +35,59 @@ This allows functions that occupy a disproportionate amount of the total runtime
 In this episode we will cover the usage of the function-level profiler `cProfile`, how its output can be visualised with `snakeviz`, and how the output can be interpreted.
 
+
+::::::::::::::::::::::::::::::::::::: callout
+
+## What is a Call Stack?
+
+The call stack keeps track of the active hierarchy of function calls and their associated variables.
+
+As a stack, it is a last-in, first-out (LIFO) data structure.
+
+When a function is called, a frame to track its variables and metadata is pushed to the call stack.
+When that same function finishes and returns, it is popped from the stack and variables local to the function are dropped.
+
+If you've ever seen a stack overflow error, this refers to the call stack becoming too large.
+These are typically caused by recursive algorithms, whereby a function calls itself, that don't terminate early enough.
+
+Within Python, the current call stack can be printed using the core `traceback` package; `traceback.print_stack()` will print the current call stack.
+
+The below example:
+
+```python
+import traceback
+
+def a():
+    b1()
+    b2()
+def b1():
+    pass
+def b2():
+    c()
+def c():
+    traceback.print_stack()
+
+a()
+```
+
+Prints the following call stack:
+
+```output
+  File "C:\call_stack.py", line 13, in <module>
+    a()
+  File "C:\call_stack.py", line 5, in a
+    b2()
+  File "C:\call_stack.py", line 9, in b2
+    c()
+  File "C:\call_stack.py", line 11, in c
+    traceback.print_stack()
+```
+
+In this instance the base of the stack is printed first; other visualisations of call stacks may use the reverse ordering.
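The stack overflow behaviour mentioned above can be demonstrated with a short sketch. Rather than letting the call stack actually overflow, CPython guards it with a configurable recursion limit (queryable via `sys.getrecursionlimit()`, typically 1000 frames) and raises `RecursionError` when a runaway recursion exceeds it:

```python
import sys

def recurse(depth=0):
    # Every call pushes another frame onto the call stack;
    # with no base case, the stack would grow indefinitely.
    return recurse(depth + 1)

try:
    recurse()
except RecursionError:
    # CPython raises RecursionError once the frame count
    # exceeds the interpreter's recursion limit.
    print("RecursionError raised; limit is", sys.getrecursionlimit())
```

The limit can be raised with `sys.setrecursionlimit()`, but a genuinely unbounded recursion will simply hit any new limit too.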
+
+:::::::::::::::::::::::::::::::::::::::::::::
+
 
 ## cProfile
@@ -299,8 +352,8 @@ Maybe we could investigate this further with line profiling!
 
 ::::::::::::::::::::::::::::::::::::: keypoints
 
-- A python program can be function level profiled with `cProfile` via `python -m cProfile -o