Network errors cause mmsIDs to fail to index #2320
Comments
Just noting that the network errors observed in the traject logs on bibdata-alma-worker{1,2} the week of 3/18 take a different form and don't log a specific MMS ID. They show up as:
A third permutation:
In the open house on 3/26/2024 we checked the physical location of the solr-prod and bibdata-prod VMs. We noticed that the lib-solr-prod8 boxes are in Forrestal, while the bibdata-alma-workers and bibdata web servers are in New South. Ops informed us that an update to private, non-routable IPs failed for host 1a, which is in Forrestal, and that they are currently working on this upgrade. The upgrade started on March 11th, 2024, which is very close to the time we started seeing the indexing failures. We will monitor the logs today (3/26/2024) and tomorrow and let Ops know if moving the VMs fixed the issue. Thank you to everyone who worked on this during the Ansible open house: @VickieKarasic @acozine @kayiwa @carolyncole @sandbergja @maxkadel @leefaisonr @christinach
The last timestamp of a network-related error on bibdata-alma-worker{1,2} is
The logs from yesterday indicate that the last time the issue was triggered was at 2024-03-26T14:52:49-04:00. We will keep an eye on the logs to see whether the network issue is fixed and no errors occur today. We moved the bibdata worker VMs to the same physical location as the solr-prod VMs at 12:40pm.
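For anyone repeating this check against later log entries, here is a minimal Ruby sketch of the comparison between an error timestamp and the time of the VM move. It assumes the move happened on 2024-03-26 in the same UTC offset as the log timestamps; the comment above only gives the time of day, so the date and offset are assumptions, and `errors_after` is a hypothetical helper name.

```ruby
require "time"

# Return the timestamps that fall strictly after the cutoff.
# HYPOTHETICAL helper, not part of bibdata or traject.
def errors_after(timestamps, cutoff)
  timestamps.select { |stamp| stamp > cutoff }
end

# Last network error reported in the traject logs (from the comment above).
last_error = Time.iso8601("2024-03-26T14:52:49-04:00")

# ASSUMPTION: the 12:40pm move took place on 2024-03-26, offset -04:00.
vm_move = Time.iso8601("2024-03-26T12:40:00-04:00")

later_errors = errors_after([last_error], vm_move)
seconds_after_move = (last_error - vm_move).to_i

puts "errors after the move: #{later_errors.size}"
puts "seconds between move and last error: #{seconds_after_move}"
```

Parsing with `Time.iso8601` keeps the UTC offset from the log line, so the comparison stays correct even if the cutoff is written in a different offset.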
Added two more logs from the traject error logs on worker1 and worker2. The second error, at 2024-03-27T21:52:52-04:00, occurs when
@maxkadel and I are looking in Datadog at the time window March 27, 2024, 21:51 - 22:04, for the lib-solr-prod6 error:
Today Operations added 100 GB to each of the solr-prod boxes to avoid one of the above errors: failing to connect and build the replica because there is no space left. See the docs from the incident report.
On March 28th, 2024:
On March 29th, 2024:
Suggestion 1: Add a feature to the admin UI in bibdata that shows information from the Indexing Manager, for example: the last_dump indexed, the timestamp of the event, and the in_progress boolean.
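To make the suggestion concrete, here is a minimal plain-Ruby sketch of the rows such an admin view might display. `IndexingStatus` and its attribute names are hypothetical stand-ins for whatever the Indexing Manager actually exposes, and the sample values are placeholders, not real data.

```ruby
# HYPOTHETICAL stand-in for the Indexing Manager record; the real
# model and column names in bibdata may differ.
IndexingStatus = Struct.new(:last_dump_indexed, :event_timestamp, :in_progress,
                            keyword_init: true)

# Build label/value pairs that an admin page could render as a table.
def indexing_status_rows(status)
  [
    ["Last dump indexed", status.last_dump_indexed.to_s],
    ["Event timestamp", status.event_timestamp.to_s],
    ["In progress", status.in_progress ? "yes" : "no"]
  ]
end

# Placeholder values for illustration only.
status = IndexingStatus.new(
  last_dump_indexed: "example-dump-id",
  event_timestamp: "2024-03-29T00:00:00-04:00",
  in_progress: false
)

rows = indexing_status_rows(status)
rows.each { |label, value| puts "#{label}: #{value}" }
```

Keeping the row-building in a plain method, separate from the view, would make it easy to test without loading the Rails environment.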
On March 14th, we noticed some network errors that randomly caused a number of mms_ids to fail to index. These are not errors related to the data.
The network errors can be found in the traject logs on both bibdata-alma-workers with a March 14th timestamp.
example log:
Acceptance Criteria
Implementation Notes
Traject does not include any Rails helpers because it doesn't load the Rails environment outside of the test setting.
Honeybadger error:
honeybadger#103044592