Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ratelimit set_presence updates #18000

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

rda0
Copy link

@rda0 rda0 commented Dec 5, 2024

This adds rate-limiting applied for presence updates per user.

This is my first contribution to synapse. The default settings probably need guidance from developers and should be tested on other server configurations. I will post a test report in the comments.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct
    (run the linters)

@rda0 rda0 requested a review from a team as a code owner December 5, 2024 18:38
@rda0 rda0 force-pushed the set-presence-ratelimit branch 2 times, most recently from 9ee8357 to 9a48930 Compare December 5, 2024 18:56
@rda0 rda0 force-pushed the set-presence-ratelimit branch from 9a48930 to aed1b1e Compare December 5, 2024 18:57
@rda0
Copy link
Author

rda0 commented Dec 5, 2024

I tested this PR on our production server (using v1.120.2) with 340 active users today. Here follows a test report that appears to fix #16843 for us. It shows the changes in CPU usage patterns when enabling or disabling the patch during our peak usage times.

Test timeline

2024-12-05T11:20:27+0100  systemctl restart nginx.service
2024-12-05T12:15:44+0100  systemctl restart netdata.service
2024-12-05T12:19:04+0100  systemctl restart synapse-phys.target                # 1 restart v1.120.2
2024-12-05T12:22:01+0100  mv env env_orig; mv env_patched env
2024-12-05T12:26:56+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml
2024-12-05T12:30:11+0100  mv env env_patched; mv env_orig env
2024-12-05T12:30:18+0100  systemctl restart synapse-phys.target                # 2 restart v1.120.2 patched
2024-12-05T12:30:51+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set per_second: 0.02
2024-12-05T12:45:03+0100  systemctl restart synapse-phys.target                # 3 restart v1.120.2
2024-12-05T12:55:56+0100  systemctl restart synapse-phys.target                # 4 restart v1.120.2
2024-12-05T12:57:11+0100  mv env env_orig; mv env_patched env
2024-12-05T12:57:17+0100  systemctl restart synapse-phys.target                # 5 restart v1.120.2 patched (0.02)
2024-12-05T13:05:00+0100  cd /etc/opt/synapse/
2024-12-05T13:05:06+0100  vim phys.ethz.ch.homeserver.yaml                     # set per_second: 0.002
2024-12-05T13:05:34+0100  systemctl restart synapse-phys.target                # 6 restart v1.120.2 patched (0.002)
2024-12-05T13:52:17+0100  mv env env_patched; mv env_orig env
2024-12-05T13:52:20+0100  systemctl restart synapse-phys.target                # 7 restart v1.120.2
2024-12-05T13:58:36+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set per_second: 0.01?
2024-12-05T13:59:22+0100  mv env env_orig; mv env_patched env
2024-12-05T13:59:25+0100  systemctl restart synapse-phys.target                # 8 restart v1.120.2 patched (0.01?)
2024-12-05T14:05:43+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set ?
2024-12-05T14:17:19+0100  systemctl restart synapse-phys.target                # 9 restart v1.120.2 patched (?)
2024-12-05T14:18:32+0100  vim env/lib/python3.11/site-packages/synapse/rest/client/sync.py    # logger.info -> logger.debug (disabling the "User set_presence ratelimit exceeded; ignoring it." log spam, synapse log level was always set to info during these tests)
2024-12-05T14:19:09+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set per_second: 0.02?
2024-12-05T14:20:27+0100  systemctl restart synapse-phys.target                # 10 restart v1.120.2 patched (0.02?)
2024-12-05T15:12:21+0100  mv env env_patched; mv env_orig env
2024-12-05T15:12:27+0100  systemctl restart synapse-phys.target                # 11 restart v1.120.2
2024-12-05T15:13:41+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # unset (use defaults: per_second: 0.1)
2024-12-05T15:18:38+0100  mv env env_orig; mv env_patched env
2024-12-05T15:18:41+0100  systemctl restart synapse-phys.target                # 12 restart v1.120.2 patched (0.1)

Numbered screenshots of CPU usage patterns

1
1
2
2
3
3
4
4
5
5
6
6
6 (repeating pattern)
6_repeating_pattern
7
7
8
8
9
9
10 (includes a code change of the debug logging from logger.info -> logger.debug (disabling the "User set_presence ratelimit exceeded; ignoring it." log spam, synapse log level was always set to info during these tests))
10
10 (repeating pattern)
10_repeating_pattern
11
11
12
12
12 (repeating pattern)
12_repeating_pattern

@rda0
Copy link
Author

rda0 commented Dec 10, 2024

After 2.5 work days the issue (#16843) did not reappear:

2024-12-10 13_53_02-View panel - phd-matrix synapse prometheus 20231213 - linux - Dashboards - Grafa

The day with the red line was the day of testing #18000 (comment). After the red line the PR is live with v1.120.2 using defaults (per_second: 0.1).

Copy link
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me! This is only thing missing are unit tests (though thank you for the proof-in-production graphs!).

At minimum, could you write two tests:

  1. Send a presence update, check that it went through, immediately send another and check that it was ignored.
  2. Send a presence update, check that it went through, advancing time a sufficient amount, send another presence update and check that it also worked.

Presence-related tests go in PresenceUpdateTestCase. Here is an example of advancing time in a test. Time won't advance otherwise.

I'd recommend switching the presence state (offline -> online, etc.) so you can check that a change occurred.

Let me know or shout in #synapse-dev:matrix.org if you need further guidance!

Comment on lines +1873 to +1875
The `rc_set_presence.per_user` sets ratelimiting how often a specific users' presence
updates are evaluated. Ratelimited presence updates are ignored.
`per_user` defaults to `per_second: 0.1`, `burst_count: 1`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight clarification:

Suggested change
The `rc_set_presence.per_user` sets ratelimiting how often a specific users' presence
updates are evaluated. Ratelimited presence updates are ignored.
`per_user` defaults to `per_second: 0.1`, `burst_count: 1`.
The `rc_set_presence.per_user` option sets rate limits on how often a specific
users' presence updates are evaluated. Ratelimited presence updates are
ignored, and no error is returned to the client.
`per_user` defaults to `per_second: 0.1`, `burst_count: 1`.

Comment on lines +1881 to +1882
per_second: 0.1
burst_count: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the example configuration should be something other than the default.

Suggested change
per_second: 0.1
burst_count: 1
per_second: 0.05
burst_count: 0.5

@github-actions github-actions bot deployed to PR Documentation Preview December 20, 2024 18:47 Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rate-Limiting for set_presence changes
2 participants