This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

L1 scoring ends up with no peers on bifrost-stage1-ny #48

Closed
lidel opened this issue Feb 24, 2023 · 3 comments

lidel commented Feb 24, 2023

Problem

After deploying bifrost-gateway 2023-02-24-c305b3b (with #36) to bifrost-stage1-ny, traffic died very quickly:

(screenshot: 2023-02-24-234420_3440x1440_scrot)

At the same time, the Nginx instance in front of bifrost-gateway started returning 502 for nearly every request:

(screenshot: bifrost-gw staging metrics, Bifrost dashboard, Grafana)

Checked the Nginx logs; this happened because Caboose has no available strn backend:

> GET /ipfs/QmSr3uQLnXRTfhrpwcNTs14pfphf6yHDHXq4CD8MRBC7mf/17_036.jpg HTTP/1.1
< HTTP/1.1 502 Bad Gateway
failed to resolve /ipfs/QmSr3uQLnXRTfhrpwcNTs14pfphf6yHDHXq4CD8MRBC7mf/17_036.jpg: Bad Gateway: no available strn backend

Some ideas

  • I was not able to reproduce this locally. I tested with ab -k -n 10000 -c 1000 -w "http://en.wikipedia-on-ipfs.org.ipns.localhost:8081/wiki/Archaeology" against BLOCK_CACHE_SIZE=2 ./bifrost-gateway, but that single page is easy to find, and the test does not exercise a diverse set of CIDs that would penalize the entire pool
  • we are still running against https://orchestrator.strn.pl/nodes/nearby?count=1000&core=true, which returns only ~40 L1s. It could be that, given very diverse traffic, we quickly run out of useful peers.

@willscott @aarshkshah1992, pinging you as you are more familiar with #36.

@willscott
Contributor

@aarshkshah1992 - I hacked around this to get traffic flowing over the weekend for metrics. The compromise is described in #49.

We should figure out how to get the weighting algorithm tuned in.

  • maybe test using the ab (Apache Benchmark) client against actual Saturn to understand the actual failure rate, and tune the weighting changes to work for those rates.
  • maybe think about up/down voting in relation to the observed 'average' failure rate Caboose is currently seeing (dynamic up/down weighting #50)

@aarshkshah1992
Contributor

@willscott @lidel Honestly, this is more a function of the orchestrator simply not having deployed the bifrost changes to enough L1s, and of Caboose not having a disaster-recovery fallback in place (like the DHT). Let me raise a PR.

@willscott
Contributor

This is resolved at this point.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in bifrost-gateway Mar 8, 2023