Screen optimizations #160

eldipa · 2022-07-14T03:17:45Z

What is this PR about?

Optimization. My goal was to make pyte faster and lighter, especially
for large geometries (think of a screen of 240x800 or 2400x8000).

Results (overview of the results)

For large geometries (240x800, 2400x8000), Screen.display runs several orders
of magnitude faster and consumes between 1.10 and 50.0 times less memory.

For smaller geometries, the minimum improvement was a 2x speedup.

Stream.feed is now between 1.10 and 7.30 times faster and, if
Screen is tuned (track_dirty_lines set to False and
disable_display_graphic set to True), the speedup is between 1.14 and 12.0.

However, the mc.input test regresses and is up to 4 times slower with
the default configuration.

For memory usage, Stream.feed is between 1.10 and 17.0 times lighter
and up to 44.0 times lighter if Screen is tuned.

Screen.reset is between 1.10 and 1.50 times slower, but several cases
(not all) improve if the Screen is tuned.

Context (background)

byexample executes snippets
of code using real interpreters (Python, Ruby, Java) and captures the
output so that it can check whether it matches the expected output.

While most of the interpreters are "terminal-naive", some are
"terminal-aware" and they will output escape and control sequences
which obviously are not of interest.

byexample uses pyte for handling those cases (thanks for such a great
lib!!).

Unfortunately, using a screen introduces artifacts due to the hard
boundaries of the screen (24x80). Think of very long lines that are
"unexpectedly" cut into two lines (a cut that happens because the screen
has a finite width).

A simple and elegant solution would be to create a screen with a larger
geometry, where these artifacts are much rarer.

Sadly, while pyte implements a sparse buffer, most of its algorithms
are not aware of it and do not take advantage of it, which makes the
terminal emulation really slow and memory-hungry.

This is the motivation for this PR: make it perform better!!

Note for the reviewers

This PR is not trivial. The commits are as simple to understand as I
could make them, but it is still quite a large PR.

So I will be available for discussion and explanation of each commit so
I can guide the review process.

Contributions

- Upgrade pyperf.
- Extended the benchmark tests to cover Screen.display,
  Screen.resize and Screen.reset under different geometries (24x80,
  240x800, 2400x8000, 24x8000, 2400x80). With these, the benchmark takes
  much more time (sorry!) but it gives a deeper view of how pyte works.
- Fixed a bug in the benchmarks, which used Stream instead of
  ByteStream. Using the former led to an incorrect interpretation
  of the new lines; using ByteStream fixes that and is aligned
  with the test_input_output.py tests.
- Optimize Screen.display to work (approx) linearly with the input
  and not with the size of the screen (quadratic). This improved both
  runtime and memory by a lot (especially for large geometries).
- Implement Screen.compressed_display, which works similarly to
  Screen.display but allows to "strip" empty space from the left or
  right and to "filter" empty lines at the top and bottom of the screen,
  reducing time and memory.
- Optimize Screen.draw by caching attributes and methods (the
  same optimizations already present in Stream._parser_fsm).
- Refactor Char's foreground, bold, blink (...) out into a separate
  namedtuple, CharStyle. When possible, the same style is reused for
  multiple characters, reducing memory usage at the expense of an
  additional lookup (instead of char.fg you have char.style.fg).
- Make Char a mutable object, allowing changes to the data and
  width fields to be made in place instead of creating a new Char object.
- Sparse-aware algorithms for Screen.index and Screen.reverse_index,
  which indirectly improved Screen.draw and Stream.feed.
- Sparse-aware algorithms for Screen.resize.
- Sparse-aware algorithms for Screen.tabstop.
- Sparse-aware algorithms for HistoryScreen.prev_page and
  HistoryScreen.next_page.
- Sparse-aware algorithms for Screen.insert_characters,
  Screen.delete_characters, Screen.insert_lines and
  Screen.delete_lines, which improved the performance of "terminal
  aware" programs.
- Statistics about Screen's buffer and lines to give insight into
  the sparsity and usage of these elements. (The API is not standard,
  like DebugScreen's.)
- Make the public attribute Screen.buffer return a BufferView.
  Retrieving lines from it yields LineView objects instead of Line objects.
  This adds an overhead to user code but allows a separation between the
  public part and the internals. Iterating over a LineView still yields
  Char objects as usual (the penalty would be too high otherwise).
- Make the top and bottom queues of the public Screen.history return
  LineView and not Line objects.
- Make the private attribute Screen._buffer a dict and not a
  defaultdict. This prevents adding entries unintentionally, which would
  make the buffer less sparse and therefore slower.
- If the current cursor attributes (style) match the default
  attributes of the screen, do not write explicit spaces in the erase
  methods (Screen.erase_characters, Screen.erase_in_line and
  Screen.erase_in_display) but just remove the chars from the buffer.
  This speeds up other algorithms and keeps the sparsity high (and
  consumes less memory).
- When disable_display_graphic is True, prevent
  Screen.select_graphic_rendition from changing the cursor attributes
  (style). If the cursor attrs don't change, we can optimize the erase
  methods. The flag is False by default.
- When track_dirty_lines is False, use a NullSet for the Screen.dirty
  attribute so that it does not consume any memory and discards any
  element, effectively disabling the dirty functionality. This saves
  time and memory for large geometries. The flag is True by default.
- Make Screen.margin always a Margin object so we can avoid
  checking if it is None or not.
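
As an illustration of the track_dirty_lines item above, a set-like object that discards everything could look like the sketch below. This is only a minimal sketch of the idea: the class body and the choice of methods are assumptions made for this example, not the code of the PR.

```python
class NullSet:
    """Set-like object that silently discards every element (illustrative)."""

    __slots__ = ()  # no per-instance storage at all

    def add(self, item):
        pass  # drop the element, nothing is stored

    def update(self, items):
        pass  # drop all the elements

    def discard(self, item):
        pass

    def clear(self):
        pass

    def __contains__(self, item):
        return False

    def __iter__(self):
        return iter(())

    def __len__(self):
        return 0


# The screen code can keep calling dirty.add(...) unconditionally.
dirty = NullSet()
dirty.add(5)
dirty.update(range(2400))
```

Because the object keeps the same interface as a set, the code that marks lines as dirty needs no extra branches; it simply stops costing memory.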

Compatibility changes

The following are changes in the API that may break user code. Special
care was taken to keep such breakage to a minimum.

- Char is no longer a namedtuple, so things like _replace are
  gone. If necessary we could reimplement the namedtuple API, but I
  don't think users rely on it.
- Char is mutable but the user must not rely on this: changes to a
  character have undefined behaviour. The user must always use the
  API provided by Screen.
- Char no longer has attributes for fg, bg, bold. Instead, it has a
  single read-only CharStyle. The Char class implements fg, bg, bold
  as properties that do the lookup in the style behind the scenes, so
  user code should not break.
- Screen.buffer is now a property that returns a BufferView with an
  API similar to a dictionary. It yields LineView objects instead of
  Line objects. These in turn yield Char objects (not views). Users can
  still iterate over the lines and chars as if the buffer were a dense
  array and not the sparse array it really is.
  Like any view, these are valid only until the next modification of the
  screen. This change may break user code if it uses buffer in another
  way.
- The top and bottom queues of HistoryScreen.history contain
  LineView and not Line objects. This may break user code.
- Screen.margin is always a Margin object: the None value is no
  longer supported.
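
To make the Char and CharStyle change more concrete, here is a rough sketch of a mutable Char that shares a style object and exposes fg, bg and bold as properties. The field list of CharStyle and the constructor signature are assumptions made for this example; the real classes carry more attributes.

```python
from collections import namedtuple

# Hypothetical, shortened field list; only for illustration.
CharStyle = namedtuple("CharStyle", ["fg", "bg", "bold"])


class Char:
    """Mutable character whose style lives in a shared CharStyle."""

    __slots__ = ("data", "width", "_style")

    def __init__(self, data, width, style):
        self.data = data
        self.width = width
        self._style = style

    @property
    def style(self):
        return self._style  # exposed read-only

    # Old-style attribute access keeps working through properties.
    @property
    def fg(self):
        return self._style.fg

    @property
    def bg(self):
        return self._style.bg

    @property
    def bold(self):
        return self._style.bold


default_style = CharStyle(fg="default", bg="default", bold=False)
a = Char("a", 1, default_style)
b = Char("b", 1, default_style)  # same style object, not a copy
a.data = "A"                     # in-place change, no new Char created
```

Code that reads char.fg keeps working, while code that calls char._replace(...) would need to be adapted.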

TL;DR - Numbers overview

The following is an overview of the numbers obtained. To keep this post
as short as possible, some results were omitted
(omitted rows are marked with :::).

Full benchmark results are left attached in this commit. People are encouraged to run
their own benchmarks for cross-validation.

`Screen.display`

Screen.display was optimized to generate large chunks of spaces
very quickly.

For large geometries, this has a huge impact on the performance:

+----------------------------------------------------------+----------+-----------------------------------------+
| Benchmark                                                | 0.8.1    | 0.8.1+screen-optimizations+default-conf |
+==========================================================+==========+=========================================+
| [screen_display 2400x8000] mc.input->Screen              | 6.69 sec | 9.95 us: 672459.98x faster              |
| [screen_display 2400x8000] mc.input->HistoryScreen       | 6.60 sec | 10.3 us: 638209.07x faster              |
| [screen_display 2400x8000] vi.input->HistoryScreen       | 6.80 sec | 132 us: 51648.56x faster                |
| [screen_display 2400x8000] vi.input->Screen              | 6.60 sec | 130 us: 50691.68x faster                |
    :::                 ::::                                    ::::         :::
| [screen_display 240x800] ls.input->HistoryScreen         | 70.1 ms  | 248 us: 283.21x faster                  |
| [screen_display 240x800] ls.input->Screen                | 69.1 ms  | 244 us: 282.76x faster                  |
| [screen_display 240x800] top.input->Screen               | 68.2 ms  | 249 us: 273.98x faster                  |
| [screen_display 240x800] top.input->HistoryScreen        | 67.3 ms  | 257 us: 261.64x faster                  |
    :::                 ::::                                    ::::         :::
| [screen_display 24x80] find-etc.input->HistoryScreen     | 670 us   | 101 us: 6.63x faster                    |
| [screen_display 24x80] find-etc.input->Screen            | 640 us   | 97.9 us: 6.54x faster                   |
| [screen_display 24x80] vi.input->HistoryScreen           | 606 us   | 115 us: 5.27x faster                    |
| [screen_display 24x80] vi.input->Screen                  | 584 us   | 117 us: 5.00x faster                    |
| [screen_display 24x80] cat-gpl3.input->Screen            | 605 us   | 189 us: 3.21x faster                    |
| [screen_display 24x80] cat-gpl3.input->HistoryScreen     | 605 us   | 195 us: 3.11x faster                    |
| [screen_display 24x80] ls.input->HistoryScreen           | 609 us   | 221 us: 2.75x faster                    |
| [screen_display 24x80] ls.input->Screen                  | 585 us   | 221 us: 2.65x faster                    |
| [screen_display 24x80] top.input->Screen                 | 567 us   | 244 us: 2.33x faster                    |
| [screen_display 24x80] top.input->HistoryScreen          | 580 us   | 251 us: 2.32x faster                    |
| [screen_display 24x80] htop.input->HistoryScreen         | 564 us   | 269 us: 2.10x faster                    |
| [screen_display 24x80] htop.input->Screen                | 559 us   | 273 us: 2.05x faster                    |

Screen.display takes advantage of the sparsity of the screen and therefore
it indirectly benefited from the optimizations done across Screen
to avoid filling the buffer with false entries.
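
A "false entry" is an entry that lands in the buffer although nothing was written there. The sketch below shows, with plain Python containers and not with pyte's actual classes, how a mere read can create such entries in a defaultdict while a dict stays sparse.

```python
from collections import defaultdict

ROWS = 2400

# With a defaultdict, just looking at a row creates it.
eager = defaultdict(dict)
for y in range(ROWS):
    _ = eager[y]  # read-only intent, but an empty line is inserted

# With a plain dict, reads have to be explicit and leave no trace.
sparse = {}
for y in range(ROWS):
    _ = sparse.get(y)  # returns None, nothing is inserted

false_entries = len(eager)
real_entries = len(sparse)
```

Every false entry is one more item that sparse-aware algorithms have to visit, so keeping them out helps both time and memory.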

Screen.display was also optimized for memory (tracemalloc) by avoiding
the append of each space character separately when they can be
appended in a single chunk.
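
The chunking idea can be sketched as below. This is a simplified illustration and not the PR's implementation: it assumes the buffer is a dict of rows, each row a dict from column to a one-column-wide character, and it ignores wide characters.

```python
def render_line(line, columns):
    """Render a sparse line (dict: column -> character) as a string."""
    parts = []
    prev = 0  # first column not yet rendered
    for x in sorted(line):
        if x >= columns:
            break
        gap = x - prev
        if gap:
            parts.append(" " * gap)  # one chunk for the whole gap
        parts.append(line[x])
        prev = x + 1
    if prev < columns:
        parts.append(" " * (columns - prev))  # trailing blanks in one chunk
    return "".join(parts)


def render_display(buffer, lines, columns):
    """Render every row; rows missing from the buffer are fully blank."""
    empty = " " * columns  # built once and shared by all the empty rows
    return [
        render_line(buffer[y], columns) if y in buffer else empty
        for y in range(lines)
    ]


display = render_display({1: {2: "x"}}, lines=3, columns=4)
```

The work is proportional to the number of characters actually stored plus the number of rows, instead of lines times columns.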

+--------------------------------------------------------------+-------------------+-----------------------------------------------------+
| Benchmark                                                    | 0.8.1.tracemalloc | 0.8.1+screen-optimizations+default-conf.tracemalloc |
+==============================================================+===================+=====================================================+
| [screen_display 2400x8000] ls.input->HistoryScreen           | 19.7 MB           | 408.1 kB: 49.43x faster                             |
| [screen_display 2400x8000] mc.input->HistoryScreen           | 19.7 MB           | 411.4 kB: 49.04x faster                             |
| [screen_display 2400x8000] mc.input->Screen                  | 19.7 MB           | 411.4 kB: 49.04x faster                             |
| [screen_display 2400x8000] ls.input->Screen                  | 19.7 MB           | 411.5 kB: 49.03x faster                             |
| [screen_display 2400x8000] vi.input->HistoryScreen           | 18.5 MB           | 404.7 kB: 46.84x faster                             |
| [screen_display 2400x8000] top.input->HistoryScreen          | 18.5 MB           | 408.3 kB: 46.43x faster                             |
| [screen_display 2400x8000] vi.input->Screen                  | 18.5 MB           | 408.5 kB: 46.40x faster                             |
| [screen_display 2400x8000] top.input->Screen                 | 18.5 MB           | 411.5 kB: 46.07x faster                             |
| [screen_display 2400x8000] htop.input->Screen                | 18.5 MB           | 1102.6 kB: 17.19x faster                            |
| [screen_display 2400x8000] htop.input->HistoryScreen         | 18.5 MB           | 1103.2 kB: 17.18x faster                            |
| [screen_display 2400x8000] cat-gpl3.input->HistoryScreen     | 19.5 MB           | 5392.6 kB: 3.70x faster                             |
| [screen_display 2400x8000] cat-gpl3.input->Screen            | 19.5 MB           | 5392.0 kB: 3.70x faster                             |
| [screen_display 240x800] mc.input->Screen                    | 513.2 kB          | 403.5 kB: 1.27x faster                              |
| [screen_display 240x800] ls.input->Screen                    | 517.0 kB          | 411.5 kB: 1.26x faster                              |
| [screen_display 240x800] ls.input->HistoryScreen             | 511.7 kB          | 408.1 kB: 1.25x faster                              |
| [screen_display 240x800] mc.input->HistoryScreen             | 510.4 kB          | 411.4 kB: 1.24x faster                              |
| [screen_display 2400x8000] find-etc.input->HistoryScreen     | 18.7 MB           | 16.6 MB: 1.12x faster                               |
| [screen_display 2400x8000] find-etc.input->Screen            | 18.7 MB           | 16.6 MB: 1.12x faster                               |

The only two regressions are:

| [screen_display 240x800] htop.input->HistoryScreen           | 408.7 kB          | 487.3 kB: 1.19x slower                              |
| [screen_display 240x800] htop.input->Screen                  | 405.8 kB          | 486.2 kB: 1.20x slower                              |

I am not sure why this happens.

`Stream.feed`

Stream.feed was not modified but its runtime depends on Screen's
performance.

For terminal programs that just write into the terminal, like
cat-gpl3 and find-etc, Stream.feed merely sends the input
to Screen.draw for rendering.

The method Screen.draw was optimized to avoid modifying the cursor
inside its loop and to update it only once, on exit. This saves a
few lookups.
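
A toy version of that pattern is shown below. TinyScreen and Cursor are hypothetical classes written only for this example (single line, no wrapping, no styles); they are not pyte's classes.

```python
class Cursor:
    def __init__(self):
        self.x = 0


class TinyScreen:
    """Toy single-line screen, only to illustrate the cursor caching."""

    def __init__(self, columns):
        self.columns = columns
        self.cursor = Cursor()
        self.line = {}

    def draw(self, data):
        # Cache attribute lookups in locals before entering the loop.
        line = self.line
        last = self.columns - 1
        x = self.cursor.x
        for char in data:
            line[x] = char
            # Advance the local copy; clamp at the last column
            # (a real screen would wrap or scroll here).
            if x < last:
                x += 1
        # Write the cursor back once, on exit.
        self.cursor.x = x


screen = TinyScreen(columns=10)
screen.draw("abc")
```

Inside the loop only local variables are touched, which in CPython is cheaper than repeated self.cursor.x lookups and stores.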

While not frequently called, Screen.index was the next bottleneck
for Screen.draw: it moves all the lines of the screen, which means
that all the entries of the buffer are rewritten.

Screen.index and Screen.reverse_index were optimized to take advantage
of the sparsity and to avoid adding false entries.
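
As a rough sketch of what "sparse-aware" means for an index (scroll up by one line), the function below moves only the rows that exist in the buffer. The function name, its signature and the dict-of-rows layout are assumptions made for illustration; the PR's code handles more cases (history, dirty tracking, cursor position).

```python
def index_sparse(buffer, top, bottom):
    """Scroll the region [top, bottom] up by one line, in place.

    buffer maps a row number to its non-empty line; missing rows are blank.
    """
    buffer.pop(top, None)  # the top line scrolls out of the region
    # Visit only existing rows, in ascending order, so that the target
    # row y - 1 has always been vacated before it is written.
    for y in sorted(y for y in buffer if top < y <= bottom):
        buffer[y - 1] = buffer.pop(y)


rows = {0: "a", 2: "c", 5: "f", 9: "z"}
index_sparse(rows, top=0, bottom=5)
```

Blank rows never get an entry, so the cost depends on the number of non-empty rows in the region and not on the height of the screen.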

These optimizations resulted in a speedup across the tests:

+----------------------------------------------------------+----------+-----------------------------------------+
| Benchmark                                                | 0.8.1    | 0.8.1+screen-optimizations+default-conf |
+==========================================================+==========+=========================================+
| [stream_feed 2400x8000] vi.input->HistoryScreen          | 49.4 ms  | 6.70 ms: 7.38x faster                   |
| [stream_feed 2400x8000] top.input->HistoryScreen         | 7.35 ms  | 1.31 ms: 5.62x faster                   |
| [stream_feed 2400x8000] find-etc.input->HistoryScreen    | 2.92 sec | 543 ms: 5.38x faster                    |
| [stream_feed 240x800] ls.input->HistoryScreen            | 9.29 ms  | 1.75 ms: 5.31x faster                   |
| [stream_feed 24x80] top.input->HistoryScreen             | 6.61 ms  | 1.25 ms: 5.30x faster                   |
| [stream_feed 240x800] top.input->HistoryScreen           | 6.57 ms  | 1.24 ms: 5.29x faster                   |
| [stream_feed 240x800] cat-gpl3.input->HistoryScreen      | 215 ms   | 43.3 ms: 4.97x faster                   |
| [stream_feed 24x80] ls.input->HistoryScreen              | 6.26 ms  | 1.34 ms: 4.68x faster                   |
| [stream_feed 24x80] cat-gpl3.input->HistoryScreen        | 140 ms   | 31.9 ms: 4.38x faster                   |
| [stream_feed 240x800] find-etc.input->HistoryScreen      | 532 ms   | 123 ms: 4.32x faster                    |
| [stream_feed 2400x8000] vi.input->Screen                 | 13.7 ms  | 3.53 ms: 3.88x faster                   |
| [stream_feed 24x80] find-etc.input->HistoryScreen        | 294 ms   | 81.1 ms: 3.62x faster                   |
| [stream_feed 2400x8000] htop.input->HistoryScreen        | 122 ms   | 34.4 ms: 3.54x faster                   |
    :::                 ::::                                    ::::         :::
| [stream_feed 240x800] vi.input->Screen                   | 5.39 ms  | 2.67 ms: 2.02x faster                   |
| [stream_feed 24x80] mc.input->HistoryScreen              | 44.3 ms  | 24.4 ms: 1.82x faster                   |
| [stream_feed 2400x8000] htop.input->Screen               | 38.1 ms  | 21.4 ms: 1.78x faster                   |
| [stream_feed 240x800] htop.input->Screen                 | 23.2 ms  | 13.2 ms: 1.76x faster                   |
| [stream_feed 240x800] find-etc.input->Screen             | 134 ms   | 77.0 ms: 1.74x faster                   |
| [stream_feed 24x80] vi.input->Screen                     | 4.45 ms  | 2.57 ms: 1.73x faster                   |
| [stream_feed 24x80] htop.input->Screen                   | 20.8 ms  | 12.2 ms: 1.71x faster                   |
| [stream_feed 2400x8000] cat-gpl3.input->HistoryScreen    | 262 ms   | 157 ms: 1.67x faster                    |
| [stream_feed 240x800] mc.input->HistoryScreen            | 63.7 ms  | 43.8 ms: 1.45x faster                   |
| [stream_feed 2400x8000] ls.input->Screen                 | 7.77 ms  | 5.61 ms: 1.38x faster                   |
| [stream_feed 24x80] mc.input->Screen                     | 17.6 ms  | 13.1 ms: 1.34x faster                   |
| [stream_feed 2400x8000] find-etc.input->Screen           | 616 ms   | 501 ms: 1.23x faster                    |
| [stream_feed 2400x8000] cat-gpl3.input->Screen           | 170 ms   | 143 ms: 1.19x faster                    |
| [stream_feed 2400x8000] mc.input->HistoryScreen          | 259 ms   | 285 ms: 1.10x slower                    |
| [stream_feed 240x800] mc.input->Screen                   | 23.3 ms  | 32.3 ms: 1.39x slower                   |
| [stream_feed 2400x8000] mc.input->Screen                 | 71.2 ms  | 281 ms: 3.94x slower                    |

The mc.input test, however, took much more time.

When track_dirty_lines is False and disable_display_graphic is True,
the overall performance increases even further.

+----------------------------------------------------------+----------+----------------------------------------+
| Benchmark                                                | 0.8.1    | 0.8.1+screen-optimizations+custom-conf |
+==========================================================+==========+========================================+
| [stream_feed 2400x8000] mc.input->HistoryScreen          | 259 ms   | 21.2 ms: 12.19x faster                 |
| [stream_feed 2400x8000] vi.input->HistoryScreen          | 49.4 ms  | 5.52 ms: 8.95x faster                  |
| [stream_feed 2400x8000] mc.input->Screen                 | 71.2 ms  | 10.8 ms: 6.60x faster                  |
| [stream_feed 2400x8000] find-etc.input->HistoryScreen    | 2.92 sec | 464 ms: 6.29x faster                   |
| [stream_feed 2400x8000] top.input->HistoryScreen         | 7.35 ms  | 1.27 ms: 5.80x faster                  |
| [stream_feed 2400x8000] htop.input->HistoryScreen        | 122 ms   | 22.3 ms: 5.45x faster                  |
| [stream_feed 2400x8000] vi.input->Screen                 | 13.7 ms  | 2.52 ms: 5.43x faster                  |
| [stream_feed 24x80] top.input->HistoryScreen             | 6.61 ms  | 1.23 ms: 5.38x faster                  |
| [stream_feed 240x800] ls.input->HistoryScreen            | 9.29 ms  | 1.73 ms: 5.37x faster                  |
| [stream_feed 240x800] cat-gpl3.input->HistoryScreen      | 215 ms   | 42.0 ms: 5.12x faster                  |
    :::                 ::::                                    ::::         :::
| [stream_feed 24x80] cat-gpl3.input->Screen               | 46.3 ms  | 17.2 ms: 2.69x faster                  |
| [stream_feed 240x800] top.input->Screen                  | 2.39 ms  | 913 us: 2.61x faster                   |
| [stream_feed 24x80] top.input->Screen                    | 2.36 ms  | 914 us: 2.58x faster                   |
| [stream_feed 2400x8000] ls.input->HistoryScreen          | 13.4 ms  | 5.39 ms: 2.50x faster                  |
| [stream_feed 24x80] htop.input->HistoryScreen            | 55.3 ms  | 22.6 ms: 2.44x faster                  |
    :::                 ::::                                    ::::         :::
| [stream_feed 240x800] vi.input->Screen                   | 5.39 ms  | 2.57 ms: 2.10x faster                  |
| [stream_feed 24x80] mc.input->HistoryScreen              | 44.3 ms  | 21.3 ms: 2.08x faster                  |
| [stream_feed 2400x8000] cat-gpl3.input->HistoryScreen    | 262 ms   | 132 ms: 1.99x faster                   |
| [stream_feed 240x800] find-etc.input->Screen             | 134 ms   | 76.3 ms: 1.76x faster                  |
| [stream_feed 24x80] vi.input->Screen                     | 4.45 ms  | 2.54 ms: 1.75x faster                  |
| [stream_feed 24x80] mc.input->Screen                     | 17.6 ms  | 10.6 ms: 1.66x faster                  |
| [stream_feed 2400x8000] ls.input->Screen                 | 7.77 ms  | 4.85 ms: 1.60x faster                  |
| [stream_feed 2400x8000] find-etc.input->Screen           | 616 ms   | 422 ms: 1.46x faster                   |
| [stream_feed 2400x8000] cat-gpl3.input->Screen           | 170 ms   | 118 ms: 1.44x faster                   |

On memory there is an improvement too:

+--------------------------------------------------------------+-------------------+-----------------------------------------------------+
| Benchmark                                                    | 0.8.1.tracemalloc | 0.8.1+screen-optimizations+default-conf.tracemalloc |
+==============================================================+===================+=====================================================+
| [stream_feed 2400x8000] vi.input->HistoryScreen              | 11.7 MB           | 686.8 kB: 17.45x faster                             |
| [stream_feed 2400x8000] vi.input->Screen                     | 4742.1 kB         | 538.0 kB: 8.81x faster                              |
| [stream_feed 2400x8000] htop.input->HistoryScreen            | 14.5 MB           | 3552.1 kB: 4.18x faster                             |
| [stream_feed 240x800] vi.input->HistoryScreen                | 2679.0 kB         | 686.8 kB: 3.90x faster                              |
| [stream_feed 2400x8000] top.input->HistoryScreen             | 2120.7 kB         | 611.8 kB: 3.47x faster                              |
| [stream_feed 2400x8000] htop.input->Screen                   | 11.5 MB           | 3408.2 kB: 3.45x faster                             |
| [stream_feed 2400x8000] top.input->Screen                    | 2155.1 kB         | 680.2 kB: 3.17x faster                              |
| [stream_feed 240x800] htop.input->HistoryScreen              | 2189.7 kB         | 1005.8 kB: 2.18x faster                             |
| [stream_feed 240x800] vi.input->Screen                       | 1107.7 kB         | 536.1 kB: 2.07x faster                              |
| [stream_feed 240x800] htop.input->Screen                     | 1782.7 kB         | 990.6 kB: 1.80x faster                              |
            :::                 ::::                                    ::::         :::
| [stream_feed 240x800] find-etc.input->HistoryScreen          | 2233.5 kB         | 1502.4 kB: 1.49x faster                             |
| [stream_feed 24x80] ls.input->HistoryScreen                  | 1554.3 kB         | 1086.0 kB: 1.43x faster                             |
| [stream_feed 24x80] cat-gpl3.input->HistoryScreen            | 1354.0 kB         | 960.1 kB: 1.41x faster                              |
| [stream_feed 24x80] top.input->Screen                        | 948.2 kB          | 680.2 kB: 1.39x faster                              |
| [stream_feed 24x80] vi.input->HistoryScreen                  | 954.7 kB          | 686.8 kB: 1.39x faster                              |
| [stream_feed 24x80] find-etc.input->HistoryScreen            | 1017.6 kB         | 774.9 kB: 1.31x faster                              |
| [stream_feed 240x800] top.input->HistoryScreen               | 763.6 kB          | 653.6 kB: 1.17x faster                              |
| [stream_feed 24x80] mc.input->HistoryScreen                  | 485.0 kB          | 417.6 kB: 1.16x faster                              |
| [stream_feed 24x80] htop.input->Screen                       | 936.1 kB          | 814.3 kB: 1.15x faster                              |
| [stream_feed 24x80] mc.input->Screen                         | 722.3 kB          | 651.6 kB: 1.11x faster                              |

The following are the tests that show regression on memory usage.

| [stream_feed 240x800] mc.input->Screen                       | 1842.2 kB         | 2577.4 kB: 1.40x slower                             |
| [stream_feed 240x800] mc.input->HistoryScreen                | 1793.2 kB         | 2548.1 kB: 1.42x slower                             |
| [stream_feed 2400x8000] mc.input->HistoryScreen              | 13.6 MB           | 22.3 MB: 1.64x slower                               |
| [stream_feed 2400x8000] ls.input->HistoryScreen              | 8422.1 kB         | 13.7 MB: 1.67x slower                               |
| [stream_feed 2400x8000] mc.input->Screen                     | 12.2 MB           | 22.3 MB: 1.82x slower                               |

When track_dirty_lines is False and disable_display_graphic is True, the memory usage improves even further:

+--------------------------------------------------------------+-------------------+----------------------------------------------------+
| Benchmark                                                    | 0.8.1.tracemalloc | 0.8.1+screen-optimizations+custom-conf.tracemalloc |
+==============================================================+===================+====================================================+
| [stream_feed 2400x8000] mc.input->HistoryScreen              | 13.6 MB           | 414.1 kB: 33.60x faster                            |
| [stream_feed 2400x8000] mc.input->Screen                     | 12.2 MB           | 447.6 kB: 27.98x faster                            |
| [stream_feed 2400x8000] htop.input->Screen                   | 11.5 MB           | 600.5 kB: 19.59x faster                            |
| [stream_feed 2400x8000] vi.input->HistoryScreen              | 11.7 MB           | 665.4 kB: 18.01x faster                            |
| [stream_feed 2400x8000] htop.input->HistoryScreen            | 14.5 MB           | 1009.0 kB: 14.73x faster                           |
| [stream_feed 2400x8000] vi.input->Screen                     | 4742.1 kB         | 522.4 kB: 9.08x faster                             |
| [stream_feed 240x800] mc.input->HistoryScreen                | 1793.2 kB         | 417.4 kB: 4.30x faster                             |
| [stream_feed 240x800] mc.input->Screen                       | 1842.2 kB         | 447.6 kB: 4.12x faster                             |
| [stream_feed 240x800] vi.input->HistoryScreen                | 2679.0 kB         | 652.6 kB: 4.11x faster                             |
| [stream_feed 2400x8000] top.input->HistoryScreen             | 2120.7 kB         | 653.6 kB: 3.24x faster                             |
| [stream_feed 2400x8000] top.input->Screen                    | 2155.1 kB         | 680.6 kB: 3.17x faster                             |
| [stream_feed 240x800] htop.input->Screen                     | 1782.7 kB         | 600.5 kB: 2.97x faster                             |
| [stream_feed 240x800] htop.input->HistoryScreen              | 2189.7 kB         | 785.0 kB: 2.79x faster                             |
| [stream_feed 240x800] vi.input->Screen                       | 1107.7 kB         | 522.4 kB: 2.12x faster                             |
| [stream_feed 2400x8000] cat-gpl3.input->HistoryScreen        | 20.3 MB           | 11.8 MB: 1.72x faster                              |
            :::                 ::::                                    ::::         :::
| [stream_feed 24x80] find-etc.input->HistoryScreen            | 1017.6 kB         | 774.8 kB: 1.31x faster                             |
| [stream_feed 240x800] top.input->HistoryScreen               | 763.6 kB          | 653.6 kB: 1.17x faster                             |
| [stream_feed 24x80] mc.input->HistoryScreen                  | 485.0 kB          | 422.0 kB: 1.15x faster                             |

However, we still have some regressions:

| [stream_feed 24x80] htop.input->HistoryScreen                | 863.7 kB          | 1009.0 kB: 1.17x slower                            |
| [stream_feed 2400x8000] ls.input->HistoryScreen              | 8422.1 kB         | 13.7 MB: 1.67x slower                              |

`Screen.reset`

For Screen.reset we have regressions, some minor, some not so
minor:

+----------------------------------------------------------+----------+-----------------------------------------+
| Benchmark                                                | 0.8.1    | 0.8.1+screen-optimizations+default-conf |
+==========================================================+==========+=========================================+
| [screen_reset 2400x8000] ls.input->HistoryScreen         | 65.4 us  | 68.9 us: 1.05x slower                   |
| [screen_reset 2400x8000] mc.input->Screen                | 51.9 us  | 54.8 us: 1.06x slower                   |
| [screen_reset 2400x8000] top.input->HistoryScreen        | 65.6 us  | 69.5 us: 1.06x slower                   |
    :::                 ::::                                    ::::         :::
| [screen_reset 24x80] cat-gpl3.input->HistoryScreen       | 13.2 us  | 15.4 us: 1.17x slower                   |
| [screen_reset 24x80] vi.input->HistoryScreen             | 13.0 us  | 15.3 us: 1.18x slower                   |
| [screen_reset 240x800] htop.input->Screen                | 4.87 us  | 5.78 us: 1.19x slower                   |
| [screen_reset 24x80] mc.input->HistoryScreen             | 13.1 us  | 15.7 us: 1.19x slower                   |
| [screen_reset 240x800] mc.input->Screen                  | 4.81 us  | 5.75 us: 1.20x slower                   |
| [screen_reset 24x80] ls.input->HistoryScreen             | 13.0 us  | 15.5 us: 1.20x slower                   |
| [screen_reset 24x80] find-etc.input->HistoryScreen       | 13.0 us  | 15.6 us: 1.20x slower                   |
| [screen_reset 24x80] htop.input->HistoryScreen           | 12.9 us  | 15.5 us: 1.21x slower                   |
| [screen_reset 240x800] find-etc.input->Screen            | 4.86 us  | 5.87 us: 1.21x slower                   |
| [screen_reset 240x800] htop.input->HistoryScreen         | 15.6 us  | 18.9 us: 1.21x slower                   |
| [screen_reset 240x800] vi.input->Screen                  | 4.83 us  | 5.87 us: 1.22x slower                   |
| [screen_reset 240x800] top.input->Screen                 | 4.72 us  | 5.77 us: 1.22x slower                   |
    :::                 ::::                                    ::::         :::
| [screen_reset 240x800] ls.input->Screen                  | 4.79 us  | 5.86 us: 1.22x slower                   |
| [screen_reset 240x800] cat-gpl3.input->Screen            | 4.79 us  | 5.89 us: 1.23x slower                   |
| [screen_reset 24x80] vi.input->Screen                    | 2.05 us  | 3.05 us: 1.49x slower                   |
| [screen_reset 24x80] mc.input->Screen                    | 2.04 us  | 3.05 us: 1.49x slower                   |
| [screen_reset 24x80] ls.input->Screen                    | 2.01 us  | 3.01 us: 1.50x slower                   |
| [screen_reset 24x80] htop.input->Screen                  | 2.02 us  | 3.06 us: 1.51x slower                   |
| [screen_reset 24x80] cat-gpl3.input->Screen              | 2.03 us  | 3.07 us: 1.52x slower                   |
| [screen_reset 24x80] top.input->Screen                   | 2.03 us  | 3.11 us: 1.53x slower                   |
| [screen_reset 24x80] find-etc.input->Screen              | 2.00 us  | 3.06 us: 1.53x slower                   |

However, when track_dirty_lines is False and disable_display_graphic is True,
things improve (but we still have regressions):

+----------------------------------------------------------+----------+----------------------------------------+
| Benchmark                                                | 0.8.1    | 0.8.1+screen-optimizations+custom-conf |
+==========================================================+==========+========================================+
| [screen_reset 2400x8000] find-etc.input->Screen          | 51.3 us  | 15.2 us: 3.38x faster                  |
| [screen_reset 2400x8000] mc.input->Screen                | 51.9 us  | 15.5 us: 3.35x faster                  |
| [screen_reset 2400x8000] vi.input->Screen                | 52.8 us  | 15.9 us: 3.32x faster                  |
    :::                 ::::                                    ::::         :::
| [screen_reset 2400x8000] cat-gpl3.input->HistoryScreen   | 66.5 us  | 29.9 us: 2.22x faster                  |
| [screen_reset 2400x8000] htop.input->HistoryScreen       | 64.6 us  | 29.4 us: 2.20x faster                  |
| [screen_reset 240x800] htop.input->Screen                | 4.87 us  | 4.58 us: 1.06x faster                  |
| [screen_reset 240x800] find-etc.input->Screen            | 4.86 us  | 4.62 us: 1.05x faster                  |
| [screen_reset 240x800] cat-gpl3.input->HistoryScreen     | 16.0 us  | 17.0 us: 1.06x slower                  |
| [screen_reset 240x800] mc.input->HistoryScreen           | 16.0 us  | 17.1 us: 1.07x slower                  |
| [screen_reset 240x800] find-etc.input->HistoryScreen     | 16.1 us  | 17.3 us: 1.07x slower                  |
| [screen_reset 240x800] top.input->HistoryScreen          | 15.9 us  | 17.1 us: 1.08x slower                  |
| [screen_reset 240x800] htop.input->HistoryScreen         | 15.6 us  | 17.5 us: 1.12x slower                  |
    :::                 ::::                                    ::::         :::
| [screen_reset 24x80] htop.input->HistoryScreen           | 12.9 us  | 15.3 us: 1.19x slower                  |
| [screen_reset 24x80] htop.input->Screen                  | 2.02 us  | 2.89 us: 1.43x slower                  |
| [screen_reset 24x80] top.input->Screen                   | 2.03 us  | 2.92 us: 1.44x slower                  |
| [screen_reset 24x80] mc.input->Screen                    | 2.04 us  | 2.94 us: 1.44x slower                  |
| [screen_reset 24x80] ls.input->Screen                    | 2.01 us  | 2.90 us: 1.44x slower                  |
| [screen_reset 24x80] vi.input->Screen                    | 2.05 us  | 2.96 us: 1.44x slower                  |
| [screen_reset 24x80] find-etc.input->Screen              | 2.00 us  | 2.90 us: 1.45x slower                  |
| [screen_reset 24x80] cat-gpl3.input->Screen              | 2.03 us  | 2.96 us: 1.46x slower                  |

Since 0.8.1, pyte no longer supports Python 2.x, so it makes sense to upgrade one of its dev dependencies, pyperf.

Receive the screen geometry to test via the environment, with a default of 24 lines by 80 columns. Add this and the input file to the Runner's metadata so they are preserved in the log file (if any).
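
A minimal sketch of reading a geometry from the environment with a 24x80 fallback. The variable name `GEOMETRY`, the `LINESxCOLUMNS` format and the helper `geometry_from_environ` are assumptions made for illustration; the commit message does not say how the benchmark encodes it.

```python
import os


def geometry_from_environ(environ=None, name="GEOMETRY", default="24x80"):
    """Parse a 'LINESxCOLUMNS' string taken from the environment.

    The variable name and the format are assumptions made for this sketch.
    """
    if environ is None:
        environ = os.environ
    lines, columns = environ.get(name, default).lower().split("x")
    return int(lines), int(columns)


# An empty mapping falls back to the default geometry.
print(geometry_from_environ({}))
print(geometry_from_environ({"GEOMETRY": "240x800"}))
```

Passing a mapping explicitly keeps the helper easy to test without touching the real process environment.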

Implement three more benchmark scenarios for testing screen.display, screen.reset and screen.resize. For the standard 24x80 geometry these methods have a negligible cost; for larger geometries, however, they can be up to 100 times slower than stream.feed, so benchmarking them is important. The metadata is now stored so that each bench_func call encodes which scenario we are testing, with which screen class and geometry.

A shell script to run all the captured input files under different terminal geometries (24x80, 240x800, 2400x8000, 24x8000 and 2400x80). These settings aim to stress pyte with larger and larger screens (by a factor of 10 on both dimensions and on each dimension separately).

The input files in tests/captured must be loaded with ByteStream and not Stream; otherwise the \r characters are lost and the benchmark results may not reflect real scenarios.

The former `for x in range(...)` implementation iterated over all the possible indexes (for columns and lines), wasting cycles because some of those indexes (and in some cases most of them) pointed to non-existing entries. These non-existing entries were faked and a default character was returned in their place. This commit instead makes display iterate over the existing entries only. When a gap between two entries is detected, the gap is filled with the same default character without having to pay for indexing non-entries.

Note: I found that in the current implementation of screen, screen.buffer may have entries (chars in a line) outside the width of the screen. In the display method those are filtered out; however, I'm not sure whether this is a real bug that went unnoticed because we never iterated over the data entries. If it is, we may be wasting space by keeping in memory chars that are outside the screen.
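
A minimal sketch of the gap-filling idea, assuming a line is stored as a plain dict that maps a column index to a single-width character. The real Line and Char types in the PR carry more state, and `render_sparse_line` is a hypothetical helper, not part of pyte.

```python
def render_sparse_line(line, columns, pad=" "):
    """Render a sparse line (dict: column -> char) as a string of `columns` chars."""
    out = []
    prev = -1  # last column already emitted
    # Visit only the entries that exist, in column order.
    for x in sorted(line):
        if x >= columns:
            break  # entries outside the screen width are filtered out
        out.append(pad * (x - prev - 1))  # fill the whole gap in one step
        out.append(line[x])
        prev = x
    out.append(pad * (columns - prev - 1))  # pad the tail of the line
    return "".join(out)


# Column 90 lies outside a 10-column screen and is ignored.
line = {2: "h", 3: "i", 7: "!", 90: "x"}
print(repr(render_sparse_line(line, 10)))
```

The cost is proportional to the number of stored entries rather than to the screen width, which is where wide geometries benefit.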

Python generators (yield) and function calls are slower than plain for-loops. Inlining the code makes screen.display between x1 and x1.8 faster.

The assert that checks the width of each char is removed from screen.display and moved into the tests. This keeps the test suite at the same quality while making screen.display ~x1.7 faster.

Instead of computing it on each screen.display, compute the width of the char once in screen.draw and store it in the Char tuple. This makes screen.display ~x1.10 to ~x1.20 faster and makes stream.feed only ~x1.01 slower in the worst case. This negative impact is due to the change in screen.draw, but measurements in my lab show inconsistent results (stream.feed didn't show a consistent performance regression and ~x1.01 slower was the worst value that I got).

Fetch some attributes that were frequently accessed in the for-loop of screen.draw once, to avoid accessing them on each iteration. Most of them remain constant within the draw() method anyway. Others, like cursor.x, cursor.y and line, are updated infrequently inside the for-loop, so it is still faster to pre-fetch them outside and update them when needed than to access them on each iteration. Benchmark results show stream.feed is x1.20 to x2.0 faster with these optimizations. Benchmark files that have more control sequences (like htop, mc and vim) show a lower improvement, as the parsing of these sequences dominates the runtime of stream.feed.
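
The pre-fetch pattern can be sketched as follows. `TinyScreen` and `Cursor` are hypothetical stand-ins with only the fields the sketch needs; the real draw() also handles wrapping, character widths and styles.

```python
class Cursor:
    def __init__(self):
        self.x = 0


class TinyScreen:
    """Hypothetical stand-in for a screen; not pyte's Screen."""

    def __init__(self, columns):
        self.columns = columns
        self.cursor = Cursor()
        self.line = {}

    def draw(self, data):
        # Fetch the attributes once, outside the loop.
        columns = self.columns
        line = self.line
        cursor_x = self.cursor.x
        for char in data:
            if cursor_x >= columns:
                break  # this sketch ignores line wrapping
            line[cursor_x] = char
            cursor_x += 1
        # Publish the local copy once, when outside code can observe it.
        self.cursor.x = cursor_x


screen = TinyScreen(columns=5)
screen.draw("hello world")
print(screen.cursor.x, "".join(screen.line[x] for x in sorted(screen.line)))
```

Local variable access avoids a repeated attribute lookup per iteration, which is the effect the commit message describes.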

Instead of checking whether cursor_x > columns at the end of each iteration and setting cursor_x to min(cursor_x, columns), delay that decision to the beginning of the next iteration or to the end of the for-loop. This removes one "if" statement at the end of the for-loop body and allows us to use the local variable cursor_x all the time without having to update cursor.x. The update only happens before insert_characters() and at the end of the draw() method, when cursor.x becomes visible to code outside draw() and therefore must hold the latest value of cursor_x. This optimization makes stream.feed between x1.05 and x1.14 faster. As with any optimization of draw(), the use cases that improve the most are the ones with very few control sequences in their input (so stream.feed is dominated by screen.draw and not by stream._parse_fsm).

Make Char mutable (an ordinary object) so we can modify each char in place and avoid calling _replace. This commit only changes the Char class and implements some methods to emulate the namedtuple API. In theory it would be possible to emulate the whole namedtuple API, but it is unclear whether it is worth it. With only a partial emulation, user code may break. Using a plain object instead of a namedtuple added a memory usage regression of x1.20 for the htop and mc benchmark files when HistoryScreen was used. The rest of the benchmarks didn't change significantly (although they are expected to be slightly less efficient).

Reduce the memory footprint by reusing/sharing the same CharStyle object among different Char instances. A specialized _replace_data changes the data and width of the char but not its style. This reduces the footprint between x1.05 and x1.30 with respect to the 0.8.1-memory.json baseline results.
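
A minimal sketch of sharing one immutable style object among many mutable chars. The reduced CharStyle field set and the use of `__slots__` are assumptions for illustration; the PR's actual classes may differ.

```python
from collections import namedtuple

# Hypothetical, reduced field set; the PR's CharStyle has more attributes.
CharStyle = namedtuple("CharStyle", ["fg", "bg", "bold"])


class Char:
    # __slots__ avoids a per-instance __dict__ (an assumption of this sketch).
    __slots__ = ("data", "width", "style")

    def __init__(self, data, width, style):
        self.data = data
        self.width = width
        self.style = style


default_style = CharStyle("default", "default", False)
a = Char("a", 1, default_style)
b = Char("b", 1, default_style)  # shares the same style object as `a`

a.data = "A"  # change data in place; the shared style is left untouched
print(a.style is b.style, a.data, b.data)
```

Because the style is immutable, sharing it is safe: changing the style of one char means pointing it at a different CharStyle, not editing the shared one.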

…ter; regress on mem)

Instead of calling _replace() to create a new Char object, modify the existing one. For that, the Line (formerly StaticDefaultDict) is in charge of fetching the char and doing the modifications. Only if no Char is found is a Char created and inserted in the Line (dict). See write_data(). In some cases we need to get a Char, read it and then update it, so a copy of the Line's default char is returned and added to the line. A copy is required because Chars are now mutable. See char_at().

The Char API changed: _asdict is renamed to as_dict and _replace to copy_and_change; _replace_data is removed and a copy method is added. The constructor also changed: it now requires data, width and style. The former way of constructing a Char is available through the from_attributes class method.

This commit improved the runtime of stream.feed by x1.20 to x1.90 (faster); however, a regression in the memory footprint was found (between x1.10 and x1.50). I don't have an explanation for this last point.
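
The write_data() and char_at() behaviour described above can be sketched like this. The method names come from the commit message, but the signatures, the dict-based Line and the plain-string style are assumptions made for illustration.

```python
class Char:
    __slots__ = ("data", "width", "style")

    def __init__(self, data, width, style):
        self.data = data
        self.width = width
        self.style = style

    def copy(self):
        return Char(self.data, self.width, self.style)


class Line(dict):
    """Sparse line: maps a column to a Char; signatures are assumed."""

    def __init__(self, default):
        super().__init__()
        self.default = default

    def write_data(self, x, data, width, style):
        char = self.get(x)
        if char is None:
            # Only a missing cell costs an allocation.
            self[x] = Char(data, width, style)
        else:
            # An existing cell is updated in place.
            char.data, char.width, char.style = data, width, style

    def char_at(self, x):
        char = self.get(x)
        if char is None:
            # Hand out a copy: the default char itself must never be mutated.
            char = self[x] = self.default.copy()
        return char


line = Line(default=Char(" ", 1, "plain"))
line.write_data(0, "a", 1, "plain")
first = line[0]
line.write_data(0, "b", 1, "bold")  # reuses the same Char object
cell = line.char_at(3)              # creates an entry from the default
print(line[0] is first, first.data, cell is line.default)
```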

The test was using a legacy API of screen.buffer from when the buffer was a dense matrix. Now that it is sparse we can no longer use len(screen.buffer) or buffer[-1].

This improvement has a slightly negative impact on small geometries (x1.01 to x1.05 slower) but improves larger geometries and almost all the HistoryScreen cases (x1.10 to x1.20).

It is handy to get some stats about the layout and chars locations in the lines/buffer and see how sparse they are. The statistics are not part of the stable API so they may change between versions.

This layer of abstraction will allow us to change the real buffer without breaking the public API.

Because screen._buffer is a defaultdict, an access to a non-existent element has the side effect of creating it.

screen.default_char is always the space character so instead of using screen.default_char for padding we use the space character directly.

Because the default char of a line is always the space character, we don't need to overwrite it with screen.default_char; we just need to change its style.

If the line after the margin's top was not empty, the for-loop used that entry to overwrite the top line, working as expected. But when the line after the top was empty, the top line was left untouched, so an explicit pop is required. A similar issue happens in reverse_index. Both bugs were masked by a side effect of iterating over screen.buffer: on each line lookup, if no such line exists, a new line is added to the buffer. This newly added line was then used to overwrite the top on an index() call.

If the line's default char and the cursor's attributes are the same, delete the chars instead of writing spaces into the line. This should be equivalent from the user's perspective. This applies to erase_characters, erase_in_line and erase_in_display. This optimization reduced the number of false lines (lines without any char) and the number of blank lines (lines with only spaces). With fewer entries in the buffer, the rest of the iterations can take advantage of the sparsity of the buffer and process much less.
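
A minimal sketch of the erase-by-deletion rule, assuming a line is a dict that maps a column to a `(data, style)` pair and styles can be compared with `==`. The free function `erase_characters` here is a simplified stand-in, not the Screen method of the same name.

```python
def erase_characters(line, start, count, default_style, cursor_style):
    """Erase `count` cells of `line` (dict: column -> (data, style)) from `start`."""
    end = start + count
    if cursor_style == default_style:
        # A blank cell would look exactly like a missing one: drop the entries.
        for x in [x for x in line if start <= x < end]:
            del line[x]
    else:
        # The blanks carry a visible style, so they have to be stored.
        for x in range(start, end):
            line[x] = (" ", cursor_style)


line = {0: ("a", "plain"), 1: ("b", "plain"), 5: ("c", "plain")}
erase_characters(line, 0, 3, default_style="plain", cursor_style="plain")
after_delete = dict(line)

erase_characters(line, 0, 2, default_style="plain", cursor_style="red")
print(after_delete, line)
```

In the first call the line becomes sparser; in the second the red blanks must stay in the dict because a missing entry would render with the default style.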

We avoid a full scan and do a sparse iteration instead. This also avoids adding empty lines into the buffer as real entries when a non-existing entry has the same effect and consumes less memory.

By default Screen tracks which lines are dirty and should be of interest to the user. This functionality may not be of interest for all use cases, and the tracking is expensive, especially for large geometries. If track_dirty_lines is set to False, the screen.dirty attribute becomes a dummy or null set that is always empty, saving memory and time.
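
One way such a null set could look, as a sketch only. The class name `NullSet` and the chosen subset of set methods are assumptions; the PR may implement the dummy differently.

```python
class NullSet:
    """Set-like object that silently discards everything added to it."""

    __slots__ = ()

    def add(self, item):
        pass

    def update(self, items):
        pass

    def discard(self, item):
        pass

    def clear(self):
        pass

    def __contains__(self, item):
        return False

    def __iter__(self):
        return iter(())

    def __len__(self):
        return 0


dirty = NullSet()
dirty.add(3)
dirty.update(range(2400))  # nothing is stored, whatever the geometry
print(len(dirty), 3 in dirty, list(dirty))
```

Code that only calls `add` or `update` on the dirty set keeps working unchanged, while code that reads it always sees an empty collection.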

Instead of using None as a special case, set the margins to (0, lines-1) by default. This may break user code if the user expects None as a valid value or sets it. If required, we could make screen.margins a property and hide the internal implementation.

Using a defaultdict can easily introduce false entries in the buffer, making it less sparse. The new Buffer class supports all the dict operations but does not add an entry if the key is missing. To add a new entry (a new line), call buffer.line_at(y). This is equivalent to dict.setdefault(y, new_line()) but avoids the call to new_line() if an entry y exists.
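
The difference between the two behaviours can be sketched as follows. Only the name `line_at` comes from the text; the constructor argument and the dict subclassing are assumptions made for illustration.

```python
from collections import defaultdict


class Buffer(dict):
    """Dict of lines that never adds an entry on a plain lookup."""

    def __init__(self, new_line):
        super().__init__()
        self._new_line = new_line

    def line_at(self, y):
        try:
            return self[y]
        except KeyError:
            # The factory is called only when the line is really missing.
            line = self[y] = self._new_line()
            return line


calls = []


def new_line():
    calls.append(1)  # count how often a line is actually built
    return {}


dd = defaultdict(dict)
dd[5]  # a read is enough to create a false entry

buf = Buffer(new_line)
missing = buf.get(5)  # a read leaves the buffer untouched
created = buf.line_at(5)
again = buf.line_at(5)
print(5 in dd, missing, created is again, len(calls))
```

With `dict.setdefault(y, new_line())` the argument is evaluated before the call, so a line would be built on every access even when it is then thrown away.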

When the graphic attributes are disabled, select_graphic_rendition always sets the default style. With this, the lines' default char and the cursor's char will always match and the screen will optimize the erase* methods.

…ttom empty lines

This is an optimization of screen.display that makes it possible to strip left/right spaces from the lines and/or filter out whole empty lines at the top/bottom. It is implemented opportunistically, so it may not strip/filter everything that is expected. In particular, lines with left or right spaces that have non-default attributes are not stripped, and lines at the top or bottom that contain only spaces are not filtered (even if lstrip/rstrip is set). This implementation is meant to be used when the screen has a very large geometry and screen.display wastes too much time and memory padding long lines and/or filling in a lot of empty lines.

eldipa · 2022-07-16T18:58:08Z

@superbobry, the PR is quite long but I have no problem reviewing it with you, piece by piece. Despite its length I think that the changes are a solid improvement (ok, I'm a little biased here).

A possible first round for the review process could be:

review 8513fe8 to eec4a2e: changes on how the benchmark was done
from there to 020fce6 for Screen.display optimizations
from there to 9721698 for Screen.draw optimizations

This walkthrough covers ~30% of the PR. From there we can think about the next rounds of review.

Thanks for your time!

Moult · 2024-12-23T01:58:50Z

Bump - any chance of reviewing this? I've also been experiencing performance issues (I'm using feed). It's 26 commits behind but it looks pretty easy to sync up.

eldipa · 2024-12-23T15:01:30Z

> Bump - any chance of reviewing this? I've also been experiencing performance issues (I'm using feed). It's 26 commits behind but it looks pretty easy to sync up.

Hi @Moult, I'm ok to assist in the review of the PR but I'm not sure whether the owners of the project intend to merge it. Until then, doing a rebase + fixing the conflicts may yield no result.

If you want, you could try to use the PR directly to see if there is an improvement in the performance or not. Depending on your use case, you may also try a fork that I made where I removed features in order to be a little faster.

PR branch (compatible with the functionality of pyte): https://github.com/byexamples/pyte/tree/Screen-Optimizations
Fork (with less features but faster): https://github.com/byexamples/termscraper

Moult · 2024-12-23T22:20:40Z

Thanks! I've actually tested it and I can confirm there are really, really significant speedups to feed(), and as far as I can tell (at least in the realm of NetHack ttyrecs) there are no regressions (though I did spot incorrect example code for looping through BufferView in the docstring). I don't use display so I can't really comment there.

One minor (major?) issue for me is that the Char namedtuple has changed into a class. I write a ttyrec player, and so to implement a "rewind" feature, I copy the contents of buffer (or now, since there is a BufferView, I copy _buffer). The structure of buffer used to be basically dict[dict[namedtuple]] whereas now it's dict[dict[class[namedtuple, namedtuple]]] and that absolutely kills performance if one needed to do a copy of buffer 500,000 times like myself.

I naively reintroduced the Char namedtuple and tweaked write_data to use it and it seems to work fine again, and now super fast buffer copying via this code works. I'd love to know if this has side effects or why a Char class was chosen:

# Example copy which is only fast if Char is a namedtuple, and not possible if Char is a class
buffer_copy = {y: dict(row) for y, row in self.screen._buffer.items()}
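
A small sketch of why that two-level copy is a safe snapshot only when the cells are immutable. `CharTuple` and `CharObject` are hypothetical, reduced stand-ins for the two Char designs, not the classes from pyte or from the PR.

```python
from collections import namedtuple

CharTuple = namedtuple("CharTuple", ["data", "fg"])  # hypothetical, reduced fields


class CharObject:
    def __init__(self, data, fg):
        self.data = data
        self.fg = fg


def snapshot(buffer):
    """Copy the two dict levels only; the cells themselves are shared."""
    return {y: dict(row) for y, row in buffer.items()}


# Immutable cells: an update replaces the cell, so the snapshot is unaffected.
tuples = {0: {0: CharTuple("a", "default")}}
saved = snapshot(tuples)
tuples[0][0] = tuples[0][0]._replace(data="b")

# Mutable cells: an in-place update is visible through the snapshot too.
objects = {0: {0: CharObject("a", "default")}}
aliased = snapshot(objects)
objects[0][0].data = "b"

print(saved[0][0].data, aliased[0][0].data)
```

With mutable cells a faithful snapshot has to copy every cell as well, which is presumably where the extra cost for repeated copies comes from.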

eldipa added 30 commits June 17, 2022 15:25

Upgrade pyperf (drop support for Python 2.x)

8513fe8

Since 0.8.1, pyte no longer supports Python 2.x, so it makes sense to upgrade one of its dev dependencies, pyperf.

Allow change the screen geometry

cabc0a5

Receive the screen geometry to test via the environment, with a default of 24 lines by 80 columns. Add this and the input file to the Runner's metadata so they are preserved in the log file (if any).

Fix benchmark.py using ByteStream and not Stream

e0b0e8b

The input files in tests/captured must be loaded with ByteStream and not Stream; otherwise the \r characters are lost and the benchmark results may not reflect real scenarios.

Enable optionally tracemalloc on full benchmark

eec4a2e

Inline generator into display inner loop

b3b7db4

Python generators (yield) and function calls are slower than plain for-loops. Inlining the code makes screen.display between x1 and x1.8 faster.

Move assert out of prod code

de59245

The assert that checks the width of each char is removed from screen.display and moved into the tests. This keeps the test suite at the same quality while making screen.display ~x1.7 faster.

Refactor Char's style in a separated namedtuple object.

e881d25

Fix test_reverse_index (history) due old API

e49fb3f

The test was using a legacy API of screen.buffer from when the buffer was a dense matrix. Now that it is sparse we can no longer use len(screen.buffer) or buffer[-1].

Use binary search over non-empty lines on index/reverse_index

d94299d

This improvement has a slightly negative impact on small geometries (x1.01 to x1.05 slower) but improves larger geometries and almost all the HistoryScreen cases (x1.10 to x1.20).
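
A minimal sketch of a binary search over the non-empty lines, assuming their line numbers are kept as a sorted list. `lines_in_margins` is a hypothetical helper; the commit itself is not shown here, so this only illustrates the technique named in its title.

```python
from bisect import bisect_left, bisect_right


def lines_in_margins(non_empty, top, bottom):
    """Return the non-empty line numbers y with top <= y <= bottom.

    `non_empty` must be sorted in ascending order.
    """
    begin = bisect_left(non_empty, top)
    end = bisect_right(non_empty, bottom)
    return non_empty[begin:end]


non_empty = sorted([3, 17, 42, 900, 2399])  # keys of a sparse buffer
print(lines_in_margins(non_empty, 10, 1000))  # [17, 42, 900]
```

Locating the affected range takes O(log n) in the number of non-empty lines instead of a scan over every line between the margins. The sort itself has a cost, which may explain the small slowdown on 24x80 screens.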

Minor optimizations.

912028f

Calculate statistics about buffer's and lines' internals (no stable API)

4c04935

It is handy to get some stats about the layout and chars locations in the lines/buffer and see how sparse they are. The statistics are not part of the stable API so they may change between versions.

On screen.buffer return a read-only view (BufferView/LineView)

84cd21f

This layer of abstraction will allow us to change the real buffer without breaking the public API.

Minor lookup prefetch.

5ae46bc

Do not unintentionally create empty lines

da66a7e

Because screen._buffer is a defaultdict, an access to a non-existent element has the side effect of creating it.

Add blankcs Stats

1fd373a

Use a space for padding screen.display

b8250ea

screen.default_char is always the space character so instead of using screen.default_char for padding we use the space character directly.

Replace line's default style instead overwriting its char

8d71528

Because the default char of a line is always the space character, we don't need to overwrite it with screen.default_char; we just need to change its style.

BufferView not longer add new lines on iteration.

01f96bc

Impl prev_page/next_page with sparse iteration

4a15d3a

We avoid a full scan and do a sparse iteration instead. This also avoids adding empty lines into the buffer as real entries when a non-existing entry has the same effect and consumes less memory.

eldipa added 21 commits July 8, 2022 19:28

Test sparsity on insert_characters/delete_characters/erase_characters

3056742

Impl repr of a Char

f96ab6b

Extend erase_* meth tests for sparsity and cursor attr usage (fixed a…

70763b6

… bug)

Make erase_in_display conformant (test with non-default cursor attrs)

51e79a6

Sparse iteration of insert_characters and delete_characters

47d7c62

Add more checks to tests; fix bug on after_event and make it sparse-a…

b4258e1

…ware

Impl sparse iteration for insert_lines/delete_lines; fix insert_chara…

f178712

…cters

Sparse iter for erase_characters/erase_in_line

05c8c2c

Impl resize with sparse iter

80aa50a

Binary search for the tabstop

065b31d

Optionally disable display graphic attributes

e3fdf41

When the graphic attributes are disabled, select_graphic_rendition always sets the default style. With this, the lines' default char and the cursor's char will always match and the screen will optimize the erase* methods.

Replace explicit for-loops with map calls

7839ded

Pass track_dirty_lines and disable_display_graphic to HistoryScreen

2ca29a5

Document what are and how to interpret the LineStats and BufferStats

c589265

Fuzzy tests *_characters and *_lines methods

1b42c89

Improve the docs (plus minor fixes on Line and BufferView)

c65ac82

Optimize the insert/delete characters/lines with map-loops

ba980a0

eldipa mentioned this pull request Jul 14, 2022

Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseded by #160) #158

Closed

eldipa mentioned this pull request Jul 16, 2022

Assigning to Screen.buffer to restore its state #155

Open

superbobry force-pushed the master branch 2 times, most recently from 6cea8ec to 259ee02 Compare November 12, 2023 11:33

mumu-lhl mentioned this pull request May 12, 2024

Improve performance mumu-lhl/eaf-pyqterminal#31

Open
