
Network errors cause mmsIDs to fail to index #2320

Closed
4 tasks done
christinach opened this issue Mar 18, 2024 · 10 comments
Assignees
Labels
investigate Tickets related to work that needs investigation

Comments

@christinach
Member

christinach commented Mar 18, 2024

On March 14th, we noticed some network errors that randomly caused a number of mms_ids to fail to index. These are not errors related to the data.

The network errors can be found in the traject logs from both bibdata-alma-workers with a March 14th time stamp.

example log:

```
2024-03-14T16:14:45-04:00 [ERROR] 841218 : Could not add record 99130256719806421 at source file position 4339: Exception: Errno::ECONNREFUSED: Connection refused - Connection refused - connect(2) for "lib-solr8-prod.princeton.edu" port 8983 (lib-solr8-prod.princeton.edu:8983)
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:623:in `rescue in create_socket'
    Caused by
    Errno::ECONNREFUSED: Connection refused - connect(2) for "lib-solr8-prod.princeton.edu" port 8983
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:611:in `initialize'
```
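Since these `ECONNREFUSED` failures are transient rather than data problems, one possible mitigation is to wrap each Solr write in a bounded retry with exponential backoff. This is only a sketch: `with_network_retries`, its parameters, and the error list are illustrative assumptions, not existing traject or bibdata code.

```ruby
require "socket" # defines SocketError; Errno constants are built in

# Illustrative sketch: retry a Solr write a few times when the failure is a
# transient network error, instead of failing the record on the first attempt.
NETWORK_ERRORS = [Errno::ECONNREFUSED, SocketError].freeze

def with_network_retries(max_attempts: 3, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *NETWORK_ERRORS
    raise if attempts >= max_attempts     # give up; let traject log the failure
    sleep(base_delay * 2**(attempts - 1)) # backoff: 1s, 2s, 4s, ...
    retry
  end
end
```

A writer would call `with_network_retries { post_batch_to_solr(batch) }` (where `post_batch_to_solr` stands in for the actual HTTP call), so an mms_id is only reported as failed after several attempts.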

Acceptance Criteria

Implementation Notes

Traject does not include any Rails helpers because it does not load the Rails environment outside of tests.

Honeybadger error:

honeybadger#103044592

@kevinreiss kevinreiss added the investigate Tickets related to work that needs investigation label Mar 19, 2024
@kevinreiss
Member

kevinreiss commented Mar 25, 2024

Just noting that the network errors observed in the traject logs on bibdata-alma-worker{1,2} the week of 3/18 take a different form and don't log a specific MMS ID. They show up as:

```
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:625:in `rescue in create_socket'
    Caused by
    SocketError: getaddrinfo: Temporary failure in name resolution
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:611:in `initialize'
```

@kevinreiss
Member

A third permutation:

```
2024-03-24T20:20:19-04:00 [FATAL] 2007455 : Traject::CommandLine: Unexpected exception, terminating execution: Exception: SocketError: getaddrinfo: Temporary failure in name resolution (lib-solr8-prod.princeton.edu:8983)
```

@christinach
Member Author

christinach commented Mar 26, 2024

In the open house on 3/26/2024 we checked the physical location of the solr-prod and bibdata-prod VMs. We noticed that the lib-solr8-prod boxes are in Forrestal while the bibdata-alma-workers and bibdata web servers are in New South.
We moved bibdata-alma-worker{1,2} from New South to Forrestal.

Ops informed us about a failed update to private, non-routable IPs on host 1a, which is in Forrestal; they are currently working on this upgrade. The upgrade started on March 11th, 2024, which is very close to when we started seeing indexing failures caused by `Temporary failure in name resolution (lib-solr8-prod.princeton.edu:8983)`.

We will monitor the logs today (3/26/2024) and tomorrow and let Ops know if moving the VMs fixed the issue.

Thank you everyone who worked on this during the Ansible open house: @VickieKarasic @acozine @kayiwa @carolyncole @sandbergja @maxkadel @leefaisonr @christinach

@kevinreiss
Member

Last timestamp of a network-related error on bibdata-alma-worker{1,2} is:

```
2024-03-26T14:52:49-04:00 [FATAL] 2651102 : Traject::CommandLine: Unexpected exception, terminating execution: Exception: HTTPClient::ReceiveTimeoutError: execution expired
```

@kevinreiss kevinreiss self-assigned this Mar 27, 2024
@christinach
Member Author

The logs from yesterday indicate that the issue was last triggered at 2024-03-26T14:52:49-04:00. We will keep watching to confirm the network issue is fixed and no errors appear today. We moved the bibdata worker VMs to the same physical location as the solr-prod VMs at 12:40pm.
My understanding is that since that error occurred after the move, if the issue stopped after 2024-03-26T14:52:49-04:00 it was for a different reason, not because of moving the VMs to the same location.

@christinach
Member Author

Added two more errors from the traject error logs on worker1 and worker2; both have the timestamp 2024-03-27T21:52:52-04:00.
The first error is a Solr 503 response caused by `.."msg":"No registered leader was found after waiting for 4000ms ,..`

The second error, at the same timestamp, is a Solr 500 response caused by `Error from server at http://lib-solr-prod6.princeton.edu:8983/solr/catalog-alma-production3_shard1_replica_n7/: null\n\n\n\nrequest: http://lib-solr-prod6.princeton.edu:8983/solr/catalog-alma-production3_shard1_replica_n7/\nRemote error message: java.io.IOException: No space left on device"`
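The `No space left on device` failure suggests a pre-flight check could help: Solr's node metrics API (`/solr/admin/metrics?group=node`) reports `CONTAINER.fs.usableSpace` in bytes. A rough sketch follows; the host name, port, and threshold are illustrative assumptions, and parsing is split from the HTTP fetch so it can be exercised without a live Solr node.

```ruby
require "json"
require "net/http"

# Sketch: read usable disk space from Solr's node metrics API before a large
# index run, so an indexing job can bail out early instead of failing mid-run.
def usable_disk_bytes(metrics_json)
  JSON.parse(metrics_json).dig("metrics", "solr.node", "CONTAINER.fs.usableSpace")
end

# Illustrative threshold: require at least 50 GB free on the Solr node.
def solr_disk_ok?(host, min_bytes: 50 * 1024**3)
  uri = URI("http://#{host}:8983/solr/admin/metrics?group=node&prefix=CONTAINER.fs")
  usable_disk_bytes(Net::HTTP.get(uri)) > min_bytes
end
```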

@christinach
Member Author

@maxkadel and I are looking in Datadog; time window: March 27, 2024, 21:51 to 22:04.

lib-solr-prod6 error:

```
org.apache.solr.common.SolrException: missing content stream
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:63)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:505)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
	at java.lang.Thread.run(Thread.java:750)
```

nginx also logged a related warning in the same window:

```
2024/03/28 01:51:00 [warn] 756033#756033: *13916733 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000884822, client: 128.112.204.56, server: lib-solr8-prod.princeton.edu, request: "POST /solr/catalog-alma-production/update/json HTTP/1.1", host: "lib-solr8-prod.princeton.edu:8983"
```

@christinach
Member Author

Today operations added 100 GB to each of the solr-prod boxes, to avoid the error above where a replica could not connect and be built because there was no space left on the device. See the docs from the incident report.

@christinach
Member Author

On March 28th, 2024:

On March 29th, 2024:

@christinach
Member Author

christinach commented Apr 1, 2024

Suggestion 1: Add a feature to the bibdata admin UI that shows information from the Indexing Manager, for example: the last_dump indexed, the timestamp of the event, and the in_progress boolean.
Suggestion 2: Create a Datadog alert that triggers if the last_dump indexed event timestamp is more than one day old.
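The staleness condition behind Suggestion 2 can be sketched as below; `last_indexed_at` would come from the Indexing Manager's event record, and all names here are illustrative, not bibdata's actual model.

```ruby
require "time"

# Sketch of Suggestion 2's condition: the alert should fire when the most
# recent full-dump indexing event is more than one day old.
STALE_AFTER_SECONDS = 24 * 60 * 60

def dump_index_stale?(last_indexed_at, now: Time.now)
  (now - last_indexed_at) > STALE_AFTER_SECONDS
end
```

A Datadog monitor could then watch a gauge or service check emitted from this predicate on a schedule.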
