ripe atlas exporter memory leak #12
FWIW, I am evaluating performance of the atlas exporter, including memory use. I started the test on 2019-12-30 with a single streaming measurement (#23669846). While memory usage has been going up and down, it never crossed 24 MiB (according to atop). We're using the Docker image and scraping every 60s with a curl cronjob for testing. Update 2020-01-14: memory usage is at 17 MiB right now, and we still never saw usage higher than 24 MiB (RSS) during all testing.
Same issue here. Using the Streaming API with 4 measurements on the Docker image inside Kubernetes. Whether limited to 256M or 512M, the memory always fills up.
Same here, although I'm only subscribing to 4 measurements (many probes assigned) |
Hi, thanks for pointing this out. I will have a look at it as soon as possible |
I spent an afternoon digging into this, and I think I've found the issue: leaking goroutines caused by a deadlock between the ripeatlas and golang-socketio libraries. Here's how I got there. After letting the exporter run for a while, memory usage was high, but pprof didn't show especially high heap usage. There were, however, >100k goroutines running, which seemed strange. A stack dump of the running goroutines showed most of them waiting to acquire a mutex in two places: loop.go:77 and loop.go:87. Looking at the code, both of these are trying to acquire the same mutex. This led me to fork the libraries and replace the mutex with github.com/sasha-s/go-deadlock so I could confirm that was the issue. That resulted in the following output:
From this, it seems certain the issue is a deadlock, and it's a strange one since it sits at the intersection of the ripeatlas lib and the golang-socketio lib. Here's what I think is happening:
Hopefully that all makes sense. I don't have too much time to keep digging into this, but I'll work on it a little next week to see if there's a simple fix. In the meantime, I wanted to get the info out there in case the fix is obvious to someone.
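For anyone who wants to reproduce this kind of diagnosis, here is a minimal sketch (not atlas_exporter's actual code) of exposing net/http/pprof in a long-running Go process and watching the goroutine count; the full stack dump that shows the parked goroutines comes from the /debug/pprof/goroutine?debug=2 endpoint:

```go
// Minimal sketch (not atlas_exporter code): expose pprof in a
// long-running process and log the goroutine count so a leak like the
// one described above shows up as a number that only ever grows.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
	"time"
)

func main() {
	// Full goroutine stack dumps are then available at
	// http://localhost:6060/debug/pprof/goroutine?debug=2
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Log the goroutine count once a minute; >100k parked goroutines,
	// as seen above, is a strong hint of a leak rather than load.
	for range time.Tick(time.Minute) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
	}
}
```

The go-deadlock swap mentioned above is mechanical in the same spirit: in a fork of the library, sync.Mutex is replaced with deadlock.Mutex from github.com/sasha-s/go-deadlock, which then reports lock acquisitions that wait suspiciously long.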
It could also be some contention in processing, since I can also find goroutine leaks there:
where I think func2 is the one that sends a result to the channel:
So, I am not sure if the potential deadlock above is a problem, or if there is some other interaction with closing the channel that causes this leak as well, or if it's just pure contention ¯\_(ツ)_/¯ Edit: looking through the full goroutine stack dump, I think they are related. For every unique time (i.e. …
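To make the suspected pattern concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual ripeatlas code) of how blocking sends to a channel that nobody drains leak one goroutine per result:

```go
// Sketch of the suspected leak: each incoming result is handed to a
// goroutine that does a blocking send on an unbuffered channel. If the
// consumer stops reading (or the channel is never drained after an
// error), every send blocks forever and the goroutines pile up.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	results := make(chan string) // stands in for a streaming result channel

	// Simulate the socket.io callback firing once per incoming result.
	for i := 0; i < 1000; i++ {
		msg := fmt.Sprintf("result-%d", i)
		go func() {
			results <- msg // blocks forever: nothing ever reads `results`
		}()
	}

	time.Sleep(time.Second)
	// All 1000 sender goroutines are still parked on the channel send.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```

If that is what's happening, a select with a done/closed case on the sender side, or making sure the consumer keeps draining the channel until it's closed, would stop the pile-up, which would also fit the "interaction with closing the channel" theory above.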
We've forked the ripeatlas and exporter repos and are working on some fixes and testing them. The digitalocean/atlas_exporter repo contains a fork with some of our interim fixes if anyone wants to test it. The current fixes are a little hacky, so if they work, we'll do some work coming up with better solutions and upstreaming them.
Hi @glightfoot, thanks for your effort. I will have a look at the fork and I'm looking forward to your PR 👍
Is there an update on this? I've set a limit of 1GB with systemd and get regular OOM kills even with only one measurement.
It starts with the atlas_error and a few seconds later it gets killed.
@bauruine Have a go with the DigitalOcean fork - I've been running it fine for a few months and I can't see any OOM kills. I am still seeing breaks in the data, though I'm not entirely sure why. I would be very interested in your results. I wonder if this commit on the parent repo might help, as it splits measurements into their own streaming sessions: digitalocean@a58a4ce. I should also sit down and review the improvements made on the DNS-OARC repo, which we (DigitalOcean) also forked for our experiments.
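I haven't dug into digitalocean@a58a4ce itself, but roughly, "one streaming session per measurement" could look like the sketch below. The subscribe() helper, the Result type, and the extra measurement IDs are hypothetical placeholders, not the real ripeatlas API:

```go
// Rough sketch of per-measurement streaming sessions. The subscribe()
// helper and the Result type are hypothetical placeholders, not the
// real ripeatlas API; measurement IDs other than 23669846 are made up.
package main

import "log"

// Result stands in for a parsed RIPE Atlas measurement result.
type Result struct {
	MeasurementID int
}

// subscribe is assumed to open a dedicated streaming connection for a
// single measurement and deliver its results on the returned channel.
func subscribe(msmID int) (<-chan Result, error) {
	ch := make(chan Result)
	// ... open one socket.io stream for msmID and feed ch ...
	return ch, nil
}

func main() {
	measurements := []int{23669846, 1001, 1002}

	for _, id := range measurements {
		id := id // capture the loop variable for the goroutine below
		go func() {
			ch, err := subscribe(id)
			if err != nil {
				log.Printf("msm %d: subscribe failed: %v", id, err)
				return
			}
			for res := range ch {
				log.Printf("msm %d: got result %+v", id, res)
			}
		}()
	}

	select {} // block forever; real code would handle shutdown signals
}
```

The point of the isolation is that a stalled or deadlocked session only wedges its own measurement instead of the single shared socket.io loop.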
Hi @tardoe
Thanks for the feedback @bauruine - were there gaps in the metrics running the previous master build? I suspect I'll have to rework the DO fork's changes into master to sort this.
Yes, it looked exactly the same.
Hello,
Thanks for the tool :)
Since I started using the exporter, I've seen it use lots of memory until it gets OOM killed.
See the attached screenshot:
It uses the RIPE streaming API for < 20 measurements.
I've been installing the exporter via `go get` or building it from master, with the same result. Is there anything I can do to help troubleshoot this?
Is anyone else experiencing the same behavior?
@++
Lodpp
ps: as a band-aid / workaround, I've added a daily cronjob to restart the exporter