
Node out of sync post webhook triggers degrade response times over all endpoints #849

Closed
jpalvarezl opened this issue Apr 11, 2022 · 1 comment

jpalvarezl commented Apr 11, 2022

Problem Statement

Describe the bug
Whenever nodes go out of sync, the tx-service builds up a backlog of webhook triggers, which is clearly visible in our Kibana dashboard for the CGW. As soon as this starts, the worst-case response times of the CGW degrade drastically.

To Reproduce

Expected behavior
The CGW, which has async handlers for all endpoints and a connection pool to Redis, is expected to handle the increased number of calls to the webhook endpoint.
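
For illustration, a minimal sketch of that setup, assuming Rocket 0.5 with bb8/bb8-redis for the async pool; the crate choices, route, and key names are stand-ins for this example, not the CGW's actual code:

```rust
// Sketch only (assumed crates: rocket 0.5, bb8, bb8-redis, redis with async support):
// async handlers sharing a single Redis connection pool via Rocket managed state.
use bb8_redis::{bb8, RedisConnectionManager};
use redis::AsyncCommands;
use rocket::{get, routes, State};

type RedisPool = bb8::Pool<RedisConnectionManager>;

#[get("/about")]
async fn about(pool: &State<RedisPool>) -> String {
    // Checking out a connection awaits until one is free; if the pool is
    // exhausted and no timeout is configured, the whole request waits here.
    let mut conn = pool.get().await.expect("no Redis connection available");
    let cached: Option<String> = conn.get("about").await.ok().flatten();
    cached.unwrap_or_else(|| "cache miss".to_string())
}

#[rocket::main]
async fn main() -> Result<(), rocket::Error> {
    let manager = RedisConnectionManager::new("redis://127.0.0.1/")
        .expect("invalid Redis URL");
    let pool = bb8::Pool::builder()
        .build(manager)
        .await
        .expect("failed to build Redis pool");

    let _rocket = rocket::build()
        .manage(pool)
        .mount("/", routes![about])
        .launch()
        .await?;
    Ok(())
}
```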

Environment (please complete the following information):

  • Production; happens during bursts of cache invalidations

Additional context

  1. We currently have the connection pool size hardcoded to 15; we should evaluate whether this value is appropriate for our setup (see the configuration sketch after this list).
    Rocket worker reference

  2. We could also set a connection checkout timeout on the pool so that requests do not block for too long waiting for a free connection (also shown in the sketch below).

  3. We should consider a way to mitigate the excessive time it takes to look up keys in our Redis setup. In the past we attempted to introduce a multiple-database setup, which was considered an anti-pattern; we need to find the correct way to reduce the Redis response time. Link to issue (see also associated PRs): Redis database id selection for CacheResponse and RequestCached #344
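
A sketch of points 1 and 2, again assuming a bb8-style pool: the pool size (currently hardcoded to 15) and a checkout timeout become explicit builder settings. The crate and the concrete values are placeholders to be tuned, not recommendations:

```rust
// Hypothetical pool construction: cap the number of Redis connections and
// bound how long a request may wait for a free one instead of blocking indefinitely.
use std::time::Duration;

use bb8_redis::{bb8, RedisConnectionManager};

async fn build_pool(redis_url: &str) -> bb8::Pool<RedisConnectionManager> {
    let manager = RedisConnectionManager::new(redis_url).expect("invalid Redis URL");
    bb8::Pool::builder()
        .max_size(15)                                   // evaluate against worker count / load
        .connection_timeout(Duration::from_millis(250)) // fail fast instead of queueing forever
        .build(manager)
        .await
        .expect("failed to build Redis pool")
}
```

For point 3, one common alternative to multiple logical databases is namespacing keys with prefixes within a single database, though whether that actually reduces lookup time in our setup still needs to be evaluated.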

@jpalvarezl added the "bug (Something isn't working)" label on Apr 11, 2022
@fmrsabino self-assigned this on Apr 14, 2022
@rmeissner added the "Critical" (https://github.com/gnosis/safe/wiki/Bug-priorities) label on Apr 27, 2022
@rmeissner removed the "bug (Something isn't working)" and "Critical" (https://github.com/gnosis/safe/wiki/Bug-priorities) labels on May 12, 2022
@fmrsabino (Contributor) commented:

Part of this issue was mitigated via a new Redis instance for mainnet.

However, this issue was not fixed and is currently part of a bigger refactor around caching for the Safe Client Gateway.

@fmrsabino closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 11, 2022