This is a non-urgent follow-on ticket from https://github.com/sul-dlss/operations-tasks/issues/3100#issuecomment-1261547020

We did some inadvertent load testing against our prod instance of SDR during the summer of 2022 (see here, here, here). But it revealed that we might have a relatively low-effort way to get into intentional load testing of SDR, which we don't currently have any tooling for, afaik. For the Catalhoyuk remediation we just sequentially ran 144k small metadata updates against existing druids, and this sometimes produced enough activity to run afoul of the inter-firewall-zone network timeout that we suspect caused our preservation storage headaches.
Because it's very likely that normal human and gbooks accessioning activity would've ultimately led to the same errors in prod, maybe at a slower rate, it seems doubtful to me that we really needed that sdr-api-based remediation script in prod to trigger the problems we saw. However, the volume of both human and gbooks accessioning in our test environments is much lower than it is in prod. Scripting a way to have sdr-api send a bunch of test updates to QA or stage might be a relatively easy way to apply more load on demand. I think it'd be fine to start with one thread of sequential updates, which would already be more than what we have now, but configurable parallelization might be a nice next step (rough sketch below).
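To make that concrete, here's a minimal sketch of what an on-demand load script could look like. Everything sdr-api-specific in it is an assumption: the host, the object-update path, the bearer-token auth, and the payload shape are placeholders for whatever sdr-api actually expects; the point is just the sequential-vs-threaded structure.

```python
# Sketch of an on-demand load script for SDR QA/stage.
# The endpoint path, auth scheme, and payload below are hypothetical --
# check sdr-api / sdr-client for the real contract before using.
import concurrent.futures
import requests

SDR_API_BASE = "https://sdr-api-stage.example.edu"  # hypothetical host
TOKEN = "..."  # obtain however sdr-api issues tokens
DRUIDS = ["druid:bb000bb0000"]  # seed with existing test druids

def update_metadata(druid):
    """Send one small metadata update; return (druid, HTTP status)."""
    resp = requests.patch(
        f"{SDR_API_BASE}/v1/objects/{druid}",       # hypothetical path
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"label": f"load-test touch of {druid}"},  # hypothetical payload
        timeout=60,
    )
    return druid, resp.status_code

def run(druids, workers=1):
    """workers=1 is the sequential baseline; raise it for parallel load."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for druid, status in pool.map(update_metadata, druids):
            print(druid, status)

if __name__ == "__main__":
    run(DRUIDS, workers=1)
```

Running with `workers=1` mimics the sequential Catalhoyuk-style load; bumping it up is the configurable-parallelization step.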
We could also consider some automated repetition or parallelization of infrastructure-integration-tests, but the resource limitations of an individual developer's laptop running a bunch of browsers might keep us from usefully load testing SDR that way. Still, maybe worth a spike ticket in infrastructure-integration-tests? 🤷
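If someone does pick up that spike, the crudest version is just launching several copies of the suite at once and repeating. A throwaway sketch, assuming the suite is invoked with `bundle exec rspec` (worth confirming against the repo) and keeping in mind the laptop-resource caveat above:

```python
# Rough sketch: repeat/parallelize infrastructure-integration-tests runs.
# Assumes `bundle exec rspec` runs the suite; N parallel browser sessions
# may exhaust a laptop well before they meaningfully stress SDR.
import subprocess
import sys

def run_suite(n_parallel=2, repetitions=1):
    for _ in range(repetitions):
        procs = [
            subprocess.Popen(["bundle", "exec", "rspec"])
            for _ in range(n_parallel)
        ]
        codes = [p.wait() for p in procs]
        if any(code != 0 for code in codes):
            sys.exit("at least one test run failed")

if __name__ == "__main__":
    run_suite(n_parallel=2, repetitions=3)
```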
FWIW, I've run bulk actions in stage with ~5000 items, similar in size to the Catalhoyuk batches and with similar types of updates (metadata-only changes). I've been doing things like that for quite a while whenever it was relevant to an FR ticket and/or workcycle work. I don't think there was ever a time that load testing on stage reproduced a problem from prod in the same way, though maybe on Cocina and Ceph the environments will behave more similarly.
There were definitely things that would work on stage but not on prod, or vice versa, or that would work the same in both. I don't remember having much success getting stage to give the same error as prod under similar loads. Usually stage would handle batches of small updates better than prod, possibly because the overall Fedora "database" (if you can call it that) is so much smaller. Without more thorough investigation it was hard to say.