
Node out of sync post webhook triggers degrade response times over all endpoints #849

Closed
jpalvarezl opened this issue Apr 11, 2022 · 1 comment

jpalvarezl commented Apr 11, 2022

Problem Statement

Describe the bug
Whenever nodes go out of sync, the tx-service builds up a backlog of webhook triggers, which is clearly visible in our Kibana dashboard for the CGW. As soon as this starts, the worst-case response times of the CGW degrade drastically.

To Reproduce

Expected behavior
The CGW, which has async handlers for all endpoints and a connection pool to Redis, is expected to handle the increased number of calls to the webhook endpoint.
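
For illustration, a minimal sketch of that setup, assuming Rocket 0.5 with bb8/bb8-redis for the async pool; the crate choices, route, and key names are stand-ins for this example, not the CGW's actual code:

```rust
// Sketch only (assumed crates: rocket 0.5, bb8, bb8-redis, redis with async support):
// async handlers sharing a single Redis connection pool via Rocket managed state.
use bb8_redis::{bb8, RedisConnectionManager};
use redis::AsyncCommands;
use rocket::{get, routes, State};

type RedisPool = bb8::Pool<RedisConnectionManager>;

#[get("/about")]
async fn about(pool: &State<RedisPool>) -> String {
    // Checking out a connection awaits until one is free; if the pool is
    // exhausted and no timeout is configured, the whole request waits here.
    let mut conn = pool.get().await.expect("no Redis connection available");
    let cached: Option<String> = conn.get("about").await.ok().flatten();
    cached.unwrap_or_else(|| "cache miss".to_string())
}

#[rocket::main]
async fn main() -> Result<(), rocket::Error> {
    let manager = RedisConnectionManager::new("redis://127.0.0.1/")
        .expect("invalid Redis URL");
    let pool = bb8::Pool::builder()
        .build(manager)
        .await
        .expect("failed to build Redis pool");

    let _rocket = rocket::build()
        .manage(pool)
        .mount("/", routes![about])
        .launch()
        .await?;
    Ok(())
}
```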

Environment (please complete the following information):

  • Production; happens during bursts of cache invalidations

Additional context

  1. We currently have the connection pool size hardcoded to 15; we should evaluate whether this value is appropriate for our setup (see the configuration sketch after this list).
    Rocket worker reference

  2. We could also set a connection checkout timeout on the pool so that requests do not block for too long waiting for a free connection (also shown in the sketch below).

  3. We should consider a way to mitigate the excessive time it takes to look up keys in our Redis setup. In the past we attempted to introduce a multiple-database setup, which was considered an anti-pattern; we need to find the correct way to reduce the Redis response time. Link to issue (see also associated PRs): Redis database id selection for CacheResponse and RequestCached #344
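
A sketch of points 1 and 2, again assuming a bb8-style pool: the pool size (currently hardcoded to 15) and a checkout timeout become explicit builder settings. The crate and the concrete values are placeholders to be tuned, not recommendations:

```rust
// Hypothetical pool construction: cap the number of Redis connections and
// bound how long a request may wait for a free one instead of blocking indefinitely.
use std::time::Duration;

use bb8_redis::{bb8, RedisConnectionManager};

async fn build_pool(redis_url: &str) -> bb8::Pool<RedisConnectionManager> {
    let manager = RedisConnectionManager::new(redis_url).expect("invalid Redis URL");
    bb8::Pool::builder()
        .max_size(15)                                   // evaluate against worker count / load
        .connection_timeout(Duration::from_millis(250)) // fail fast instead of queueing forever
        .build(manager)
        .await
        .expect("failed to build Redis pool")
}
```

For point 3, one common alternative to multiple logical databases is namespacing keys with prefixes within a single database, though whether that actually reduces lookup time in our setup still needs to be evaluated.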

@jpalvarezl added the "bug (Something isn't working)" label on Apr 11, 2022
@fmrsabino self-assigned this on Apr 14, 2022
@rmeissner added the "Critical" (https://github.com/gnosis/safe/wiki/Bug-priorities) label on Apr 27, 2022
@rmeissner removed the "bug (Something isn't working)" and "Critical" (https://github.com/gnosis/safe/wiki/Bug-priorities) labels on May 12, 2022
@fmrsabino (Contributor) commented:

Part of this issue was mitigated via a new Redis instance for mainnet.

However, this issue was not fixed and is currently part of a bigger refactor around caching for the Safe Client Gateway.

@fmrsabino closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 11, 2022