Add message history and retransmission #3199

Draft
wants to merge 28 commits into main

Conversation

@afullerx afullerx commented Jun 10, 2024

This PR attempts to resolve #3143 by adding a message history to the outbox and retransmitting missed messages in order to resynchronize the client's state during a reconnection. If this isn't possible, a reload of the page is triggered. The goal is to ensure a connected client's state is never out of sync with the server.

For the auto-index page, a history duration of 30 seconds was arbitrarily chosen. Since this value only determines whether the UI is resynchronized by resending messages or by a page reload, the UI should stay properly synchronized regardless of its value.

For a ui.page, the history duration is computed based on the expected lifetime of the client object. Currently, with the default reconnect_timeout = 3.0, this is a max of 9 seconds. With this change, a re-evaluation of this default could be warranted. Now that UI state can be resynchronized indefinitely, discarding the user's page after only 5-9s of disconnection seems premature. See #3143 (comment) for more.
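
For context, here is a minimal sketch of the history mechanism described above (illustrative only, not the PR's actual code). The names _history and next_message_id echo snippets quoted later in this conversation; the entry layout, class name, and pruning strategy are assumptions:

import time
from collections import deque
from typing import Any, Deque, Tuple

HistoryEntry = Tuple[int, float, Any]  # (message_id, timestamp, message) -- assumed layout

class MessageHistory:
    """Keep recent outgoing messages so a reconnecting client can be replayed what it missed."""

    def __init__(self, history_duration: float = 30.0) -> None:
        self.history_duration = history_duration
        self.next_message_id = 0
        self._history: Deque[HistoryEntry] = deque()

    def append(self, message: Any) -> None:
        self._history.append((self.next_message_id, time.time(), message))
        self.next_message_id += 1
        cutoff = time.time() - self.history_duration
        while self._history and self._history[0][1] < cutoff:
            self._history.popleft()  # prune entries older than the history duration

    def missed_messages(self, last_message_id: int) -> list:
        """Messages to replay on reconnect; if the first missed one was already pruned, the page must reload."""
        return [message for mid, _, message in self._history if mid > last_message_id]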


Open tasks (October 24, 2024):

  • message_history_length isn't being used
  • handle reconnect when next message ID has already been pruned
  • fix failing pytests
  • fix test_no_object_duplication_on_index_client
  • Should the auto-index client reload when trying to reconnect (because there is no message history)? -> No.
  • thorough test
  • test On Air

@afullerx afullerx marked this pull request as draft June 10, 2024 05:20
@afullerx

I discovered a potential edge case where the client can get out of sync. Converted to draft until I can investigate.

Also, I got an email about a failed test with Python 3.9. It looks like the failure happens before the tests are even run. Not sure what to do about this.

@falkoschindler

Thanks for starting this pull request, @afullerx! We're looking forward to reviewing your implementation once it's ready.

Regarding the failing test: sometimes one of the "startup tests" fails because of some caching that takes longer than expected. This can safely be ignored; next time the test will probably pass.

@afullerx

OK, I believe this pull request is good to go. The desync I was seeing was caused by two new issues I discovered in the current codebase. 

One is a race condition when multiple clients are connecting to an auto-index page. 

The other is due to a gap in time between when the webpage is generated and when updates can be received. This could actually be fixed using the new message history, but I think it's best left for a future PR. 

I'll submit issues and/or pull requests once this one is done.

@afullerx afullerx marked this pull request as ready for review June 10, 2024 08:32
@falkoschindler falkoschindler self-requested a review June 10, 2024 09:12
@afullerx

Regarding the pre-existing issue with missed updates due to the gap between page render and websocket connection: I realized I could fix it by simply including the client's initial last_message_id in the page render. Now the message history protects the initial connection as well. I'm not sure how long this gap can be, so I set the minimum history durations to 30 seconds. Maybe they should be longer.

@falkoschindler falkoschindler left a comment

Ok, I finally had a chance to take a look into your code. Amazing work!
Just a few thoughts:

  • Somehow a retransmission ID is added to every message from the message history, which is then broadcast to all clients, where it is checked against the expected retransmission ID. On the server:

    for i in range(start, len(self._history)):
        args = self._history[i][2]
        args[1]['retransmit_id'] = retransmit_id
        self.enqueue_message('retransmit', args, '')

    And on the client:

    if (
      data.message_id <= window.last_message_id ||
      ("retransmit_id" in data && data.retransmit_id != window.retransmitId)
    ) {
      return;
    }

    This seems like a lot of overhead. Can't we pass the socket ID of the handshaking client to synchronize() and send a custom "retransmit" message containing all missed messages? This way we wouldn't need to manipulate messages and filter them on the client. (A rough sketch of this idea follows after this list.)
  • What do you think about additional CPU and memory consumption? Now that we keep every message for at least 30 seconds, this can accumulate quickly when, e.g., streaming 3D data. Should we make the history length configurable?
  • We should check how the new retransmission works with ui.scene and ui.leaflet, because they use a separate "init" message for initialization. (Maybe we can solve their initialization problem more elegantly by introducing an on_handshake method to ui.element that is called whenever a client handshakes... But that's probably out of scope of this pull request.)
  • Before merging, @rodja and I should check if it works seamlessly with NiceGUI On Air.
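
Referring back to the first bullet: a rough sketch of the bundled-retransmit suggestion, written as a free function so the example is self-contained. The history-entry layout and the emit callable are assumptions; only the idea of a single targeted "retransmit" event is taken from the comment above.

from typing import Any, Awaitable, Callable, List, Tuple

HistoryEntry = Tuple[int, float, Any]  # (message_id, timestamp, message) -- assumed layout

async def synchronize(history: List[HistoryEntry], last_message_id: int, socket_id: str,
                      emit: Callable[[str, Any, str], Awaitable[None]]) -> bool:
    """Bundle all missed messages into a single 'retransmit' event addressed to one socket."""
    missed = [(mid, message) for mid, _, message in history if mid > last_message_id]
    if missed and missed[0][0] != last_message_id + 1:
        return False  # the history no longer reaches back far enough, so the client should reload
    # one event targeted at the handshaking socket instead of re-broadcasting every
    # historic message to all clients and filtering on the client side:
    await emit('retransmit', {'messages': [message for _, message in missed]}, socket_id)
    return True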

@afullerx

afullerx commented Jun 17, 2024

Thanks for the feedback. Good idea about bundling the retransmissions into a single special message. However, I didn't see any way to send a message directly to a client connected via Air. We can still get almost all the benefit, as other clients will only need to filter a single infrequent message instead of checking every message.

I did think the history duration deserved a config option, but decided it wasn't my place to make that decision. I'll add a message_history_duration option. Setting it to zero will completely disable it and restore the previous behavior.

I'll also do some testing with ui.scene and ui.leaflet.

@afullerx

After being short on time for a bit, I was finally able to implement the improvements. I should be able to push the changes in the next couple days after I do some final testing.

@afullerx

afullerx commented Jul 4, 2024

I decided it's probably better to allow the user to configure the maximum number of history entries (message_buffer_max) rather than the history duration. This correlates better with both the memory needed and the size of the message backlog the client will have to handle. With a default of 1000 entries, this resulted in, at most, 2-3 MB of additional RAM usage with the message types I tested. 

I did some profiling of the message handling overhead, and it seemed pretty negligible. For example, on average, calls to _append_history() only took ~10μs. 

I realized the message history isn't needed to cover the initial connection for ui.page since outbox will hold messages until the connection is established. So, I set the history duration to the expected lifetime of the page. 

Since core.app.config isn't always available during the initialization of outbox, setting up the history buffer in loop() was the next best option. 
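
A minimal sketch of this deferred setup, assuming the option is exposed as core.app.config.message_buffer_max and that a value of 0 disables the history; not the PR's literal code:

from collections import deque
from nicegui import core

class Outbox:
    def __init__(self, client) -> None:
        self.client = client
        self._history = None  # core.app.config may not be available yet, so defer creating the deque

    async def loop(self) -> None:
        # by the time the send loop starts, the app config is available:
        maxlen = core.app.config.message_buffer_max
        self._history = deque(maxlen=maxlen) if maxlen > 0 else None  # 0 disables the history
        ...  # the regular message-sending loop continues as before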

As far as I can tell, ui.scene and ui.leaflet are working fine with these changes. 

As a possible enhancement, when sync fails, instead of reloading the page, we could dump the entire state of the page (as we do on page render) and send it in a message. We would then just replace the element object with the up-to-date one. This is much faster and more seamless than a full page reload and, for a ui.page, doesn't result in the loss of state. I experimented with this, and it worked very well, but it seemed a little too experimental to include. I can imagine problems with components like ui.scene. I'm interested in your thoughts on this. I'd love to go this route if it's not going to cause too many problems.

@afullerx

afullerx commented Jul 7, 2024

Regarding the enhancement I mentioned in my previous post, if we can do a full state resync without a page reload when synchronize() fails, we could probably do without the message_buffer_max config option and make do with a smaller fixed-size history buffer. Since state would be seamlessly synced either way, it would just be about sizing it to the point where it stops being more efficient to sync through replaying messages. 

While the special "init" event used by ui.scene and ui.leaflet isn't an issue for this PR in its current form, it would need to be reworked to implement this enhancement. We would need them to transmit their state whenever we do a full state resync, in addition to the initial page load.

Ultimately, I'm not sure if this would work out or not, but I think there's enough merit in the idea that I should take some time to fully explore it.

@afullerx afullerx marked this pull request as draft July 7, 2024 04:09
@falkoschindler

Oh, wow, this PR keeps on growing... But it is certainly a good idea to re-evaluate our options and to think about the best path forward, before spending more time with the implementation or even merging something that hinders us later.

The special initialization of ui.scene and ui.leaflet is definitely something I'd love to get rid of. I expect this to be quite some work since we need to re-implement (part of) their object model in Python. But maybe I overestimate its complexity and it would be a valuable groundwork before continuing to work on the general message transmission.

@afullerx

afullerx commented Jul 13, 2024

I decided that doing a full-state resync without reloading is going to be a no-go. I was able to get it working pretty well in most cases by having ui.leaflet and ui.scene clear their contents and basically reinitialize. But since they can have their state altered by arbitrary method calls (e.g., the "change icon" example for leaflet), this wouldn't be safe to do in all cases. 

Anyway, I believe this PR is ready for review again. Some other improvements I made: 

  • Fixed compatibility with On Air by also adding synchronize() to its "handshake" handler.
  • The "sync" event is emitted directly by synchronize() now. This fixes an edge case with out-of-order messages when messages are queued while the sync is performed. 
  • The front end now keeps a list of all its past socket IDs. This is then used by synchronize() to filter out messages intended for other targets (a rough sketch of this follows below).
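
A rough sketch of the server-side filtering described in the last bullet, again written as a free function to keep it self-contained; the history-entry layout and the emit callable are assumptions, not the PR's actual code:

from typing import Any, Awaitable, Callable, List, Tuple

TargetedEntry = Tuple[int, str, Any]  # (message_id, target_socket_id, message) -- assumed layout

async def synchronize(history: List[TargetedEntry], last_message_id: int, socket_ids: List[str],
                      emit: Callable[[str, Any, str], Awaitable[None]]) -> None:
    """Replay missed messages, skipping entries that were addressed to other clients."""
    missed = [message for mid, target, message in history
              if mid > last_message_id and (not target or target in socket_ids)]
    await emit('sync', {'messages': missed}, '')  # the 'sync' event is emitted here directly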

@afullerx afullerx marked this pull request as ready for review July 13, 2024 03:33

@falkoschindler falkoschindler left a comment

Hi @afullerx, I finally found enough time and focus to review your pull request.

I made just a few minor changes:

  • I made use of the Message type to simplify argument lists and type annotations a bit.
  • I think _message_count should always be increased when emitting a message.
  • Instead of ignoring a type error we can safely assert that self._history is not None.
  • I restructured the sync method in JavaScript using early exits and destructuring.

Apart from that, I have some more thoughts I'd like to clarify:

  • As far as I understand, setting message_buffer_max to 0 disables the deque, which behaves differently than setting maxlen=0? Or could we assume we always work with a deque, just sometimes with zero length?

  • I thought about creating the deque in the initializer with a default length of 1000, and changing it in loop() according to message_buffer_max. The maxlen attribute is read-only, but we could create a copy like d = deque(d, maxlen=...). But what would we do if the current deque already contains more messages than the new maxlen? (See the small example after this list.)

  • > The front end now keeps a list of all its past socket IDs. This is then used by synchronize() to filter out messages intended for other targets.

    We should probably prune these socket IDs...

  • Maybe there is a better parameter name than message_buffer_max. Maybe message_history_length?

  • In client.js we compare msg.target against window.socket.id. I think we can avoid sending sync messages to the wrong clients in the first place like this: await self._emit('sync', {...}, socket_ids[-1]).

  • You're adding message_id to data and removing it again on the client. Couldn't this interfere with the rest of the payload? Maybe it's better to keep this attribute separate, even if this would complicate the data structure of a history item once again.
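
Regarding the maxlen question in the second bullet: a bounded copy of a longer deque silently keeps only the newest entries, so the "more messages than the new maxlen" case effectively prunes itself (a quick illustration, not PR code):

from collections import deque

d = deque(range(5))       # deque([0, 1, 2, 3, 4]), the unbounded history
d = deque(d, maxlen=3)    # maxlen is read-only, so make a bounded copy
print(d)                  # deque([2, 3, 4], maxlen=3): the oldest entries are dropped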

@falkoschindler falkoschindler added the enhancement New feature or request label Jul 27, 2024
@afullerx

I made additional changes to address some of the remaining concerns. Most of them are explained by my previous post and the commit messages. 

One additional change: I realized that previous socket IDs only need to be kept by the client temporarily. Once the sync operation is complete, all previous socket IDs become irrelevant.

As for maintaining the length of the history using the maxlen parameter, I ultimately decided to leave it alone. None of the alternatives seem clearly superior to the current implementation.

I didn't change the "sync" message to be emitted directly to the client's socket ID, because this didn't work with On Air in my previous testing.

@falkoschindler

Sorry, @afullerx, your pull request is not forgotten. I just didn't find the time and focus to dive into the details of this implementation once again, especially analyzing the issue with NiceGUI On Air. But it's still on my list. I'll be traveling over the next two weeks, so I hope to continue working on it by the end of September.

@falkoschindler

falkoschindler commented Oct 12, 2024

I finally had another look into this pull request. It looks like we're almost good. I'm just experimenting with sending sync messages directly to the socket ID doing the handshake, so that we can remove the window.socketIds array. Apart from that, NiceGUI On Air doesn't seem to work at all:

  File "/Users/falko/Projects/nicegui/nicegui/air.py", line 127, in _handle_handshake
    await client.outbox.synchronize(data['last_message_id'], data['socket_ids'])
                                    ~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'last_message_id'

It looks like we need to adjust the On Air server implementation and test with a local server instance.

@afullerx

Thanks for revisiting this, @falkoschindler. I believe the KeyError exception is due to the browser using the unmodified nicegui.js. I was able to get it working using the "Redirector" and "CORS Unblock" browser extensions to force the use of the updated version. I imagine you have a more straightforward way of testing this, and that is what you meant by "test with a local server instance."

The only thing that didn't work with On Air was emitting the sync message directly to the client using its socket ID. I assumed this was because the On Air server wasn't properly forwarding messages targeted in this way.

Even if this were remedied, I believe we would still need window.socketIds to keep track of the client's past SIDs. Since the history can contain messages targeted specifically at other clients, we need a way to filter them out. As an alternative to the current approach, we could send all messages and filter them on the client.

@falkoschindler falkoschindler self-assigned this Oct 17, 2024
@falkoschindler

While discussing this pull request with @rodja, we decided to simplify the whole retransmission logic by excluding the shared auto-index page. We can include it later if we really want to. But for the moment we chose simplicity over completeness.

This way every message is sent to one client only and we can simply keep it in the already existing message queue. A message_index marks the current position in the queue from where to send the next message. If a client reconnects, it simply asks the outbox to move the index back to the position of the next expected message ID. All older messages can be pruned.
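
A minimal sketch of this rewind-and-prune idea; the rewind helper and its name are hypothetical, while the tuple layout follows the messages.append() call quoted further down in this thread:

from typing import Any, List, Tuple

Message = Tuple[str, int, float, str, Any]  # (client_id, message_id, timestamp, type, data)

def rewind(messages: List[Message], next_expected_id: int, next_message_id: int) -> bool:
    """Prune the queue so that sending resumes at the next message the reconnecting client expects."""
    if next_expected_id == next_message_id:
        return True  # the client missed nothing; keep sending from the current position
    for index, message in enumerate(messages):
        if message[1] == next_expected_id:  # message[1] is the message ID
            del messages[:index]            # everything older is no longer needed
            return True
    return False  # the expected message was already pruned, so trigger a page reload instead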

It still needs some testing with On Air and elements like ui.scene. But I'm optimistic.

@falkoschindler

Local tests with ui.log and ui.scene seem to work well:

import random
import time
from nicegui import ui

@ui.page('/', reconnect_timeout=10.0)
def page():
    log = ui.log()
    ui.timer(1.0, lambda: log.push(f'{time.time():.0f}'))

    scene = ui.scene()
    ui.timer(1.0, lambda: scene.sphere().scale(0.5).move(random.random() - 0.5, random.random() - 0.5, random.random()))

@falkoschindler falkoschindler marked this pull request as draft October 24, 2024 18:13
@falkoschindler

I forgot to handle the case when a client reconnects too late and the message history isn't long enough. I'll add that tomorrow.

@falkoschindler

falkoschindler commented Oct 25, 2024

Apparently, updates based on running methods like "update_grid" are broken:

grid = ui.aggrid({'columnDefs': [{'field': 'name'}], 'rowData': []})

def update():
    grid.options['rowData'].append({'name': 'Alice'})
    grid.update()

ui.button('Update', on_click=update)

The update message might be enqueued in the wrong place. But changing

self.messages.append((self.client.id, self.next_message_id, time.time(), 'update', data))

to

self.messages.insert(self._message_index, (self.client.id, self.next_message_id, time.time(), 'update', data))

didn't help immediately.

@falkoschindler

Ah, inserting the update message is basically correct, but it messes up the order of message IDs.

@falkoschindler

@rodja Tests are green, ready for review.
But before merging I'd like to test a little more, also On Air and on some of our robots.

@falkoschindler falkoschindler added this to the 2.6 milestone Oct 26, 2024
@falkoschindler falkoschindler modified the milestones: 2.6, 2.7 Nov 22, 2024
Labels
enhancement New feature or request
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Confusing desync with blocking calls
2 participants