
Network errors cause mmsIDs to fail to index #2320

Closed
4 tasks done
christinach opened this issue Mar 18, 2024 · 10 comments
Assignees
Labels
investigate Tickets related to work that needs investigation

Comments

@christinach
Member

christinach commented Mar 18, 2024

On March 14th, we noticed some network errors that randomly caused a number of mms_ids to fail to index. These are not errors related to the data.

The network errors can be found in the traject logs from both bibdata-alma-workers with a March 14th time stamp.

example log:

```
2024-03-14T16:14:45-04:00 [ERROR] 841218 : Could not add record 99130256719806421 at source file position 4339: Exception: Errno::ECONNREFUSED: Connection refused - Connection refused - connect(2) for "lib-solr8-prod.princeton.edu" port 8983 (lib-solr8-prod.princeton.edu:8983)
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:623:in `rescue in create_socket'
    Caused by
    Errno::ECONNREFUSED: Connection refused - connect(2) for "lib-solr8-prod.princeton.edu" port 8983
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:611:in `initialize'
```
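Since these `ECONNREFUSED` failures are transient rather than data problems, one possible mitigation is to wrap each Solr write in a bounded retry with exponential backoff. This is only a sketch: `with_network_retries`, its parameters, and the error list are illustrative assumptions, not existing traject or bibdata code.

```ruby
require "socket" # defines SocketError; Errno constants are built in

# Illustrative sketch: retry a Solr write a few times when the failure is a
# transient network error, instead of failing the record on the first attempt.
NETWORK_ERRORS = [Errno::ECONNREFUSED, SocketError].freeze

def with_network_retries(max_attempts: 3, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *NETWORK_ERRORS
    raise if attempts >= max_attempts     # give up; let traject log the failure
    sleep(base_delay * 2**(attempts - 1)) # backoff: 1s, 2s, 4s, ...
    retry
  end
end
```

A writer would call `with_network_retries { post_batch_to_solr(batch) }` (where `post_batch_to_solr` stands in for the actual HTTP call), so an mms_id is only reported as failed after several attempts.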

Acceptance Criteria

Implementation Notes

Traject does not include any Rails helpers because it does not load the Rails environment outside of tests.

Honeybadger error:

honeybadger#103044592

@kevinreiss kevinreiss added the investigate Tickets related to work that needs investigation label Mar 19, 2024
@kevinreiss
Member

kevinreiss commented Mar 25, 2024

Just noting that the network errors observed in the traject logs on bibdata-alma-worker{1,2} the week of 3/18 take a different form and don't log a specific MMS ID. They show up as:

```
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:625:in `rescue in create_socket'
    Caused by
    SocketError: getaddrinfo: Temporary failure in name resolution
    /opt/bibdata/shared/bundle/ruby/3.1.0/gems/httpclient-2.8.3/lib/httpclient/session.rb:611:in `initialize'
```

@kevinreiss
Member

A third permutation:

```
2024-03-24T20:20:19-04:00 [FATAL] 2007455 : Traject::CommandLine: Unexpected exception, terminating execution: Exception: SocketError: getaddrinfo: Temporary failure in name resolution (lib-solr8-prod.princeton.edu:8983)
```

@christinach
Member Author

christinach commented Mar 26, 2024

In the open house on 3/26/2024 we checked the physical location of the solr-prod and bibdata-prod VMs. We noticed that the lib-solr8-prod boxes are in Forrestal while the bibdata-alma-workers and bibdata web servers are in New South.
We moved bibdata-alma-worker{1,2} from New South to Forrestal.

Ops informed us about a failed update to private, non-routable IPs on host 1a, which is in Forrestal; they are currently working on this upgrade. The upgrade started on March 11th, 2024, which is very close to when we started seeing indexing failures caused by `Temporary failure in name resolution (lib-solr8-prod.princeton.edu:8983)`.

We will monitor the logs today (3/26/2024) and tomorrow and let Ops know if moving the VMs fixed the issue.

Thank you everyone who worked on this during the Ansible open house: @VickieKarasic @acozine @kayiwa @carolyncole @sandbergja @maxkadel @leefaisonr @christinach

@kevinreiss
Member

Last timestamp of a network-related error on bibdata-alma-worker{1,2} is:

```
2024-03-26T14:52:49-04:00 [FATAL] 2651102 : Traject::CommandLine: Unexpected exception, terminating execution: Exception: HTTPClient::ReceiveTimeoutError: execution expired
```

@kevinreiss kevinreiss self-assigned this Mar 27, 2024
@christinach
Member Author

The logs from yesterday indicate that the issue was last triggered at 2024-03-26T14:52:49-04:00. We will keep watching to confirm the network issue is fixed and no errors appear today. We moved the bibdata worker VMs to the same physical location as the solr-prod VMs at 12:40pm.
My understanding is that since that error occurred after the move, if the issue stopped after 2024-03-26T14:52:49-04:00 it was for a different reason, not because of moving the VMs to the same location.

@christinach
Member Author

Added two more errors from the traject error logs on worker1 and worker2; both have the timestamp 2024-03-27T21:52:52-04:00.
The first error is a Solr 503 response caused by `.."msg":"No registered leader was found after waiting for 4000ms ,..`

The second error, at the same timestamp, is a Solr 500 response caused by `Error from server at http://lib-solr-prod6.princeton.edu:8983/solr/catalog-alma-production3_shard1_replica_n7/: null\n\n\n\nrequest: http://lib-solr-prod6.princeton.edu:8983/solr/catalog-alma-production3_shard1_replica_n7/\nRemote error message: java.io.IOException: No space left on device"`
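The `No space left on device` failure suggests a pre-flight check could help: Solr's node metrics API (`/solr/admin/metrics?group=node`) reports `CONTAINER.fs.usableSpace` in bytes. A rough sketch follows; the host name, port, and threshold are illustrative assumptions, and parsing is split from the HTTP fetch so it can be exercised without a live Solr node.

```ruby
require "json"
require "net/http"

# Sketch: read usable disk space from Solr's node metrics API before a large
# index run, so an indexing job can bail out early instead of failing mid-run.
def usable_disk_bytes(metrics_json)
  JSON.parse(metrics_json).dig("metrics", "solr.node", "CONTAINER.fs.usableSpace")
end

# Illustrative threshold: require at least 50 GB free on the Solr node.
def solr_disk_ok?(host, min_bytes: 50 * 1024**3)
  uri = URI("http://#{host}:8983/solr/admin/metrics?group=node&prefix=CONTAINER.fs")
  usable_disk_bytes(Net::HTTP.get(uri)) > min_bytes
end
```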

@christinach
Member Author

@maxkadel and I are looking in Datadog; time window: March 27, 2024, 21:51 to 22:04.

lib-solr-prod6 error:

```
org.apache.solr.common.SolrException: missing content stream
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:63)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:505)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
	at java.lang.Thread.run(Thread.java:750)
```

nginx also logged a related warning in the same window:

```
2024/03/28 01:51:00 [warn] 756033#756033: *13916733 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000884822, client: 128.112.204.56, server: lib-solr8-prod.princeton.edu, request: "POST /solr/catalog-alma-production/update/json HTTP/1.1", host: "lib-solr8-prod.princeton.edu:8983"
```

@christinach
Member Author

Today operations added 100 GB to each of the solr-prod boxes, to avoid the error above where a replica could not connect and be built because there was no space left on the device. See the docs from the incident report.

@christinach
Member Author

On March 28th, 2024:

On March 29th, 2024:

@christinach
Member Author

christinach commented Apr 1, 2024

Suggestion 1: Add a feature to the bibdata admin UI that shows information from the Indexing Manager, for example: the last_dump indexed, the timestamp of the event, and the in_progress boolean.
Suggestion 2: Create a Datadog alert that triggers if the last_dump indexed event timestamp is more than one day old.
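The staleness condition behind Suggestion 2 can be sketched as below; `last_indexed_at` would come from the Indexing Manager's event record, and all names here are illustrative, not bibdata's actual model.

```ruby
require "time"

# Sketch of Suggestion 2's condition: the alert should fire when the most
# recent full-dump indexing event is more than one day old.
STALE_AFTER_SECONDS = 24 * 60 * 60

def dump_index_stale?(last_indexed_at, now: Time.now)
  (now - last_indexed_at) > STALE_AFTER_SECONDS
end
```

A Datadog monitor could then watch a gauge or service check emitted from this predicate on a schedule.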
