Broken Sidekiq <> Datadog Integration #68617

Closed · LindseySaari opened this issue Oct 30, 2023 · 19 comments
@LindseySaari (Contributor)

We encountered an issue with the official Datadog Sidekiq integration on our VA.gov API after upgrading the dogstatsd gem to version 5.6.1. Metrics (only for Sidekiq) stopped coming into Datadog. After downgrading back to 5.6.0, the metrics resumed as expected. For context, our Sidekiq pods operate within EKS. Additionally, we followed the direct Datadog instructions for setting up the Datadog <> Sidekiq integration.

A ticket was opened in the gem repo, and the maintainers have asked us to open a ticket because it may be related to our Sidekiq Pro integration.
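
For reference, the metrics path that broke looks roughly like the sketch below: a dogstatsd-ruby client pointed at the agent's DogStatsD port, with a Sidekiq server middleware emitting counters through it. The initializer path, constant names, and middleware are hypothetical; the actual vets-api wiring may differ.

    # config/initializers/statsd.rb -- hypothetical names; the real setup may differ
    require 'datadog/statsd'
    require 'sidekiq'

    # Client pointed at the Datadog agent's DogStatsD port (8125/udp).
    STATSD = Datadog::Statsd.new(ENV.fetch('STATSD_HOST', 'localhost'), 8125)

    # Minimal Sidekiq server middleware that emits a counter after each
    # successfully processed job.
    class SidekiqStatsMiddleware
      def call(worker, _job, queue)
        yield
        STATSD.increment('sidekiq.jobs.processed',
                         tags: ["queue:#{queue}", "worker:#{worker.class.name}"])
      end
    end

    Sidekiq.configure_server do |config|
      config.server_middleware do |chain|
        chain.add SidekiqStatsMiddleware
      end
    end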

@LindseySaari (Contributor, Author)

I opened an official issue with Datadog.

@LindseySaari (Contributor, Author)

Private Zenhub Image

Link to the ticket opened up with Datadog: https://help.datadoghq.com/hc/requests/1414562

@LindseySaari (Contributor, Author)

I am in communication with a Datadog engineer, working through some back and forth on the configuration that engineers on the Datadog side need in order to analyze the difference in Sidekiq metrics between the working (5.6.0) and broken (5.6.1) versions.
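
As a quick sanity check during that back and forth, something like the following (hypothetical console usage) confirms which client version is actually loaded in a given environment:

    # e.g. in a Rails console on a Sidekiq pod
    require 'datadog/statsd'
    puts Gem.loaded_specs['dogstatsd-ruby'].version  # => 5.6.0 (working) or 5.6.1 (broken)
    puts Datadog::Statsd::VERSION                    # same check, read from the loaded constant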

@laineymajor (Contributor)

Looking at Monday or Tuesday, before the daily sync, to act on this... TBD.

To do

  • research how to send a flare

@laineymajor (Contributor)

Lindsey is actively working this ticket today.
Changes are in review with appropriate team(s).

@laineymajor (Contributor)

Waiting on Datadog to analyze the flare that was sent.

@laineymajor (Contributor)

Waiting on DD. This is carrying over to the new sprint.

@LindseySaari (Contributor, Author)

Datadog is still analyzing the logs this afternoon. The solutions engineer assigned to the ticket will pass along any updates.

@LindseySaari (Contributor, Author)

Update from Datadog

"Just reaching out to give you an update on this one. The team is still reviewing this ticket on our side and will let me know when they have next steps. In the meantime, feel free to reach out if you have any other questions on this ticket."

@LindseySaari (Contributor, Author)

I heard back from Datadog, and they need us to execute a few more steps to help with the debugging process. I will aim to get these changes into staging Friday morning and execute the necessary commands. It's important to start early to maximize the amount of time before the production deploy. Once the necessary information is gathered, this change will need to be reverted and the agent restarted before it goes on the conveyor belt to production.

Steps

  1. Merge in the dogstatsd gem update (see the Gemfile sketch after this list) and make sure it deploys
  2. Merge in the datadog-agent change (this should autosync)
  3. Make sure it syncs
  4. Restart the agent: kubectl rollout restart daemonset datadog-agent
  5. Run: kubectl exec -it ds/datadog-agent -- agent dogstatsd-stats
  6. Run the tcpdump: kubectl exec -it ds/datadog-agent -- tcpdump -i any "udp port 8125" -w output.pcap
  7. Provide these reports to Datadog
  8. Revert the Vets API and Datadog agent changes
  9. Merge and verify the deploy/sync
  10. Restart the agent: kubectl rollout restart daemonset datadog-agent
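
For steps 1 and 8, the gem change itself is just a Gemfile pin along these lines (illustrative only; the actual Gemfile entry and version constraint in vets-api may differ):

    # Gemfile
    gem 'dogstatsd-ruby', '5.6.1'    # version under investigation (step 1)
    # gem 'dogstatsd-ruby', '5.6.0'  # known-good pin to revert to (step 8)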

@LindseySaari (Contributor, Author)

I adjusted the config this morning and ran the tcpdump command. After that, I copied the output/files to my local machine and forwarded them to Datadog.

@laineymajor (Contributor)

Adding this to our next sprint, as we need some additional DevOps help to move this work forward and close it out.

@LindseySaari (Contributor, Author)

Still following up with Datadog. They were mistaken about where the "proxy" resides in our setup. They thought it sat between the agent and the Datadog endpoint, but it is actually on the application side, where we use socat to proxy metrics from our Rails app to the agent.

@laineymajor (Contributor)

@flooose to review ticket and sync with Lindsey if needed.

@LindseySaari (Contributor, Author)

Chris is looking at a possible workaround to test the issue here

@laineymajor (Contributor)

Need time to test in the cluster

@RachalCassity (Member)

The dogstatsd gem 5.6.1 was deployed to prod.

Setting the host to 127.0.0.1 enforces an IPv4 connection.

@LindseySaari (Contributor, Author)

This has been fixed via the change that forces IPv4 by using 127.0.0.1 as the host.
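
For the record, a minimal sketch of that kind of fix, assuming the client host was previously given as a hostname (e.g. localhost) that could resolve to the IPv6 loopback ::1; the actual change in vets-api may instead have been made through an environment variable:

    # config/initializers/statsd.rb -- sketch only
    require 'datadog/statsd'

    # The literal IPv4 loopback forces the DogStatsD UDP socket onto IPv4,
    # avoiding the hostname resolution that, per this thread, stopped working
    # for Sidekiq metrics under the 5.6.1 client.
    STATSD = Datadog::Statsd.new('127.0.0.1', 8125)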
