Specify concurrency per background job type #18006

Open
anoadragon453 opened this issue Dec 6, 2024 · 0 comments
Proposal

In our production clusters, we occasionally see degraded performance on some hosts. This has been traced back to five room purges running concurrently, which starve the database and crowd out any other work that needs doing. These purges are triggered by enabling the forgotten_room_retention_period config option.

This option works by scheduling the purge when a user forgets the room (if they're the last one to do so):

# If everyone locally has left the room, then there is no reason for us to keep the
# room around and we automatically purge room after a little bit
if (
    not do_not_schedule_purge
    and self._forgotten_room_retention_period
    and await self.store.is_locally_forgotten_room(room_id)
):
    await self.hs.get_task_scheduler().schedule_task(
        PURGE_ROOM_ACTION_NAME,
        resource_id=room_id,
        timestamp=self.clock.time_msec()
        + self._forgotten_room_retention_period,
    )

The TaskScheduler was chosen for this so that if Synapse restarts during the purge, the room isn't left in a half-deleted state, as the task will be resumed again on startup. We have a few different types of jobs that can be queued using the TaskScheduler:

  • Purging an entire room
  • Purging history from a room
  • Deleting old to-device messages
  • Redacting the events of a given user in a set of given rooms

The resources these task types require vary wildly. For instance, purging a large room with lots of message history and state events will take much longer than deleting old to-device messages for a fairly inactive user. You probably want a low concurrency limit for the former, and a high one for the latter.

The global task concurrency is set here:

# Maximum number of tasks that can run at the same time
MAX_CONCURRENT_RUNNING_TASKS = 5

To work around this problem, we could introduce a hardcoded concurrency limit per task type: perhaps 1 or 2 for room purges, and something much higher for to-device message deletion. We could keep an overall concurrency limit as well, but prevent a task from starting if its type is already at that type's limit (allowing a task of a different type to run instead).

For this, I think we would need to consolidate where task types are defined, in synapse/util/task_scheduler.py; right now they're simply defined on the fly in each file. We could then assign each type a concurrency limit.
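One possible shape for this is sketched below. The TaskTypeLimiter class, the registry dictionary, the action-name strings, and the specific limits are all invented for illustration; this is not Synapse's actual API, just a minimal model of the proposed behaviour (a global cap plus per-type caps):

```python
from collections import Counter

# Overall cap, mirroring the existing constant in synapse/util/task_scheduler.py
MAX_CONCURRENT_RUNNING_TASKS = 5

# Hypothetical consolidated registry: action name -> per-type concurrency limit.
# Names and numbers are illustrative only.
TASK_TYPE_CONCURRENCY = {
    "purge_room": 1,  # heavy: purging large rooms starves the database
    "purge_history": 2,
    "delete_old_to_device_messages": 5,  # cheap: many can run at once
    "redact_user_events": 2,
}


class TaskTypeLimiter:
    """Tracks running tasks and decides whether a new one may start."""

    def __init__(self) -> None:
        self._running = Counter()

    def can_start(self, action: str) -> bool:
        # Respect the global cap first...
        if sum(self._running.values()) >= MAX_CONCURRENT_RUNNING_TASKS:
            return False
        # ...then the per-type cap (defaulting to 1 for unknown types).
        per_type_limit = TASK_TYPE_CONCURRENCY.get(action, 1)
        return self._running[action] < per_type_limit

    def task_started(self, action: str) -> None:
        self._running[action] += 1

    def task_finished(self, action: str) -> None:
        self._running[action] -= 1
```

With this in place, a running purge_room task would block further room purges but still leave headroom for to-device deletions, which is the behaviour the proposal is after.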

An additional benefit is that smaller tasks could still run while large ones slowly make progress.

Potential Issues

  • This is still a fairly coarse approach. For instance, purging small rooms will be much quicker than purging large ones. A better solution would require us to estimate a task's resource impact. Perhaps we could do that in the future, but this proposal is at least a step in the right direction.
  • This is still a hardcoded solution. Future work could involve allowing each limit to be configurable.