
ripe atlas exporter memory leak #12

Open

lodpp opened this issue Aug 20, 2019 · 16 comments

lodpp commented Aug 20, 2019

Hello,

Thanks for the tool :)

Since I started using the exporter, I've seen it consume more and more memory until it gets OOM-killed. See the attached screenshot:
[screenshot: memory usage graph ("memleak")]

It uses the RIPE streaming API for < 20 measurements.

I've tried both installing the exporter via go get and building it from master; the result is the same.

Is there anything I can do to help troubleshoot this?
Is anyone else experiencing the same behavior?

@++
Lodpp

PS: as a band-aid workaround, I've added a daily cron job to restart the exporter.
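
For reference, a restart like that can be scheduled with a single cron entry; the unit name atlas-exporter.service and the time of day are assumptions about the local setup:

# /etc/cron.d/atlas-exporter-restart: daily stopgap restart of the exporter
0 4 * * * root systemctl restart atlas-exporter.service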

horazont commented Jan 3, 2020

FWIW, I am evaluating performance of the atlas exporter including memory use. I started the test on 2019-12-30 with a single streaming measurement (#23669846). While memory usage has been going up and down, it never crossed 24 MiB (according to atop).

We’re using the docker image.

We’re scraping every 60s using a curl cronjob for testing.

Update 2020-01-14: Memory usage is at 17 MiB right now, but we still never saw usage higher than 24 MiB (RSS) during all testing.

Lavaburn commented Mar 5, 2020

Same issue here. Using Streaming API with 4 measurements on the Docker image inside Kubernetes.

Whether limited to 256M or 512M, the memory always fills up.

czerwonk self-assigned this Mar 18, 2020
czerwonk added this to the 1.0.1 milestone Mar 18, 2020
@Lavaburn

Unfortunately I'm not a Go programmer, so I can't help much. I'm wondering, though, whether the leak is in here or in the DNS-OARC library... It definitely seems related to the streaming.

[pprof profile graph: profile006]

nicoduck commented Mar 6, 2021

Same here, although I'm only subscribing to 4 measurements (with many probes assigned).

czerwonk (Owner) commented Mar 8, 2021

Hi, thanks for pointing this out. I will have a look at it as soon as possible

@czerwonk (Owner)

Heap in use after 8 days:
[pprof heap profile graph: profile001]

I will have a look at the code allocating/using the memory in the next few days.

tardoe commented Apr 19, 2021

Further follow-up on this one: we're also seeing the same behaviour. It's causing gaps in the results being served, which makes analysis difficult due to the null values.

[three screenshots from 2021-04-19]

glightfoot (Contributor) commented Apr 30, 2021

I spent an afternoon digging into this, and I think I've found the issue - leaking goroutines caused by a deadlock between the ripeatlas and golang-socketio libraries. Here's how I got there.

After letting the exporter run for a while, memory usage was high, but pprof didn't show especially high heap usage. There were, however, >100k goroutines running, which seemed strange. Getting a stack dump for the running goroutines showed most of them waiting to acquire a mutex in two places:

goroutine 493 [semacquire, 49 minutes]:
sync.runtime_SemacquireMutex(0xc000290e88, 0x0, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000290e84)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/graarh/golang-socketio.closeChannel(0xc000290e30, 0xc000290e00, 0xc00022c7a0, 0x1, 0x1, 0x0, 0x0)
	/go/pkg/mod/github.com/graarh/[email protected]/loop.go:87 +0x1e5
github.com/graarh/golang-socketio.outLoop(0xc000290e30, 0xc000290e00, 0xc0007dd8f0, 0x0)
	/go/pkg/mod/github.com/graarh/[email protected]/loop.go:177 +0x187
created by github.com/graarh/golang-socketio.Dial
	/go/pkg/mod/github.com/graarh/[email protected]/client.go:55 +0x198

and at

goroutine 579 [semacquire, 49 minutes]:
sync.runtime_SemacquireMutex(0xc000290408, 0xc000224f00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000290404)
	/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/graarh/golang-socketio.(*Channel).IsAlive(0xc0002903b0, 0x6fc23ac00)
	/go/pkg/mod/github.com/graarh/[email protected]/loop.go:77 +0x9d
github.com/graarh/golang-socketio.pinger(0xc0002903b0)
	/go/pkg/mod/github.com/graarh/[email protected]/loop.go:190 +0x73
created by github.com/graarh/golang-socketio.Dial
	/go/pkg/mod/github.com/graarh/[email protected]/client.go:56 +0x1ba

Looking at the code, both of these (loop.go:77 and loop.go:87) are trying to acquire the aliveLock on the socketio "Channel" struct (side note: this lock would be better suited to a RWMutex). At this point, it looks like a deadlock, but the cause isn't entirely clear.
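
To illustrate the RWMutex side note: with a sync.RWMutex, the read-only IsAlive path could take a shared lock while the close path keeps exclusive access. This is only a simplified stand-in for the Channel struct, not the actual golang-socketio code:

package main

import (
	"fmt"
	"sync"
)

// channel is a simplified stand-in for the golang-socketio Channel struct.
type channel struct {
	aliveLock sync.RWMutex
	alive     bool
}

// IsAlive only reads state, so a shared (read) lock is enough and
// concurrent IsAlive calls no longer serialize behind each other.
func (c *channel) IsAlive() bool {
	c.aliveLock.RLock()
	defer c.aliveLock.RUnlock()
	return c.alive
}

// close mutates state, so it still takes the exclusive (write) lock.
func (c *channel) close() {
	c.aliveLock.Lock()
	defer c.aliveLock.Unlock()
	c.alive = false
}

func main() {
	c := &channel{alive: true}
	fmt.Println(c.IsAlive()) // true
	c.close()
	fmt.Println(c.IsAlive()) // false
}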

To confirm that the lock was indeed the problem, I forked the libraries and replaced the mutex with github.com/sasha-s/go-deadlock. That resulted in the following output:

POTENTIAL DEADLOCK: Duplicate locking, saw callers this locks in one goroutine:
ERRO[0001] atlas_error: Too many connections             source="stream_strategy.go:107"
current goroutine 485 lock &{{{9 0} 0 0 -1073741824 0}} 
all callers to this lock in the goroutine
INFO[0001] Got result for 0 from probe 0                 source="stream_strategy.go:126"
../golang-socketio/loop.go:88 golang-socketio.closeChannel { c.aliveLock.Lock() } <<<<<
../golang-socketio/loop.go:123 golang-socketio.inLoop { closeChannel(c, m, protocol.ErrorWrongPacket) }

../golang-socketio/loop.go:88 golang-socketio.closeChannel { c.aliveLock.Lock() } <<<<<
../golang-socketio/client.go:65 ripeatlas.(*Stream).MeasurementLatest.func3 { closeChannel(&c.Channel, &c.methods) }
../golang-socketio/client.go:65 golang-socketio.(*Client).Close { closeChannel(&c.Channel, &c.methods) }
/usr/local/Cellar/go/1.16.2/libexec/src/reflect/value.go:476 reflect.Value.call { call(frametype, fn, args, uint32(frametype.size), uint32(retOffset)) }
/usr/local/Cellar/go/1.16.2/libexec/src/reflect/value.go:337 reflect.Value.Call { return v.call("Call", in) }
../golang-socketio/caller.go:74 golang-socketio.(*caller).callFunc { return c.Func.Call(a) }
../golang-socketio/handler.go:79 golang-socketio.(*methods).callLoopEvent { f.callFunc(c, &struct{}{}) }
/usr/local/Cellar/go/1.16.2/libexec/src/sync/mutex.go:74 sync.(*Mutex).Lock { if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) { }
../golang-socketio/loop.go:123 golang-socketio.inLoop { closeChannel(c, m, protocol.ErrorWrongPacket) }

From this, it seems certain the issue is a deadlock, and it's a strange one since it's the intersection of the ripeatlas lib and the golang-socketio lib. Here's what I think is happening (NOTE: channel refers to a golang channel, whereas Channel refers to the Channel struct in the golang-socketio library):

  1. An issue occurs with the stream, resulting in an error being returned from the atlas stream (too many connections, change to volatile, etc). That triggers this handler in the ripeatlas library, which calls the Close method on the golang-socketio connection and closes the channel used for results.
  2. Calling the Close method on the connection results in a call to the closeChannel function in golang-socketio to close the connection, which acquires a lock on aliveLock and defers releasing it until the function completes.
  3. As a part of closing the Channel, the closeChannel function triggers the OnDisconnection event, which triggers the handler configured in the ripeatlas library for OnDisconnection, which also tries to call the Close method, resulting in a deadlock on the aliveLock mutex.

Hopefully that all makes sense. I don't have too much time to keep digging into this, but I'll work on it a little next week to see if there's a simple fix. In the meantime, I wanted to get the info out there in case the fix is obvious to someone.
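
To make the failure mode concrete, here is a minimal, self-contained sketch of the re-entrant locking pattern described in steps 2 and 3. All names are illustrative stand-ins, not the actual golang-socketio or ripeatlas code; running it trips Go's built-in deadlock detector:

package main

import (
	"fmt"
	"sync"
)

// channel is an illustrative stand-in for the golang-socketio Channel.
type channel struct {
	aliveLock       sync.Mutex
	alive           bool
	onDisconnection func()
}

// closeChannel mirrors the pattern described above: it takes aliveLock,
// holds it until it returns, and fires the disconnect handler while the
// lock is still held.
func closeChannel(c *channel) {
	c.aliveLock.Lock()
	defer c.aliveLock.Unlock()

	c.alive = false
	if c.onDisconnection != nil {
		c.onDisconnection() // handler runs with aliveLock held
	}
}

// Close mirrors the client's Close method: it delegates to closeChannel.
func (c *channel) Close() {
	closeChannel(c)
}

func main() {
	c := &channel{alive: true}

	// The ripeatlas-style OnDisconnection handler: on disconnect, call Close again.
	c.onDisconnection = func() {
		fmt.Println("disconnect handler: calling Close() again")
		c.Close() // tries to take aliveLock a second time -> deadlock
	}

	c.Close() // never returns; the runtime reports "all goroutines are asleep - deadlock!"
}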

glightfoot (Contributor) commented May 3, 2021

It could also be some contention in processing, since I can also find goroutine leaks there:

122438 @ 0x43ada5 0x4059fa 0x405755 0x9145b5 0x49e5a7 0x49da79 0x90ae07 0x90b938 0x470ee1
#	0x9145b4	github.com/DNS-OARC/ripeatlas.(*Stream).MeasurementLatest.func2+0x94		/go/pkg/mod/github.com/!d!n!s-!o!a!r!c/[email protected]/stream.go:132
#	0x49e5a6	reflect.Value.call+0x8e6							/usr/local/go/src/reflect/value.go:476
#	0x49da78	reflect.Value.Call+0xb8								/usr/local/go/src/reflect/value.go:337
#	0x90ae06	github.com/graarh/golang-socketio.(*caller).callFunc+0x206			/go/pkg/mod/github.com/graarh/[email protected]/caller.go:74
#	0x90b937	github.com/graarh/golang-socketio.(*methods).processIncomingMessage+0x417	/go/pkg/mod/github.com/graarh/[email protected]/handler.go:107

where I think func2 is the one that sends a result to the channel:

err = c.On("atlas_result", func(h *gosocketio.Channel, r measurement.Result) {
	ch <- &r
})

So I am not sure if the potential deadlock above is the problem, or if there is some other interaction with closing the channel that also causes this leak, or if it's just pure contention ¯\_(ツ)_/¯

Edit: looking through the full goroutine stack dump, I think they are related. For every unique blocked time (e.g. goroutine 130928 [chan send, 14 minutes]: github.com/DNS-OARC/ripeatlas.(*Stream).MeasurementLatest.func2(0xc00027fed0, 0x0, 0x0, 0x139c, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)), there is a corresponding set of goroutines in the deadlock condition described above. It seems like the exporter stops processing messages once an error happens, but the deadlock prevents the channel from being closed. It might be worth checking the state of the connection before sending new messages to the channel, but I am not too familiar with this code as a whole.
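
A hedged sketch of that idea: instead of an unconditional ch <- &r, the callback could select on a cancellation channel so it gives up once the consumer is gone. The names (deliver, results, done) are made up for the sketch and are not part of the ripeatlas API:

package main

import (
	"fmt"
	"time"
)

// result stands in for measurement.Result in this sketch.
type result struct{ msmID int }

// deliver tries to hand a result to the consumer, but gives up as soon as
// done is closed, so the goroutine running the socketio callback can never
// get stuck in a channel send after the stream has errored out.
func deliver(results chan<- *result, done <-chan struct{}, r *result) bool {
	select {
	case results <- r:
		return true
	case <-done:
		return false // consumer is gone; drop the result instead of blocking
	}
}

func main() {
	results := make(chan *result)
	done := make(chan struct{})

	// Simulate the error path: the consumer stops reading and signals done.
	close(done)

	go func() {
		if !deliver(results, done, &result{msmID: 1234}) {
			fmt.Println("stream closed, result dropped")
		}
	}()

	time.Sleep(100 * time.Millisecond) // give the goroutine time to finish
}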

@glightfoot (Contributor)

We've forked the ripeatlas and exporter repos and are working on some fixes and testing them. The digitalocean/atlas_exporter repo contains a fork with some of our interim fixes if anyone wants to test it. The current fixes are a little hacky, so if they work, we'll spend some time coming up with better solutions and upstreaming them.

czerwonk (Owner) commented May 6, 2021

Hi @glightfoot,

thanks for your effort. I will have a look at the fork and am looking forward to your PR 👍

@bauruine

Is there any update on this? I've set a limit of 1GB with systemd and get regular OOM kills even with only one measurement.

Nov 18 06:44:10 prometheus atlas-exporter[77046]: time="2022-11-18T06:44:10+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."
Nov 18 09:18:01 prometheus atlas-exporter[79326]: time="2022-11-18T09:18:01+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."
Nov 18 11:52:11 prometheus atlas-exporter[79848]: time="2022-11-18T11:52:11+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."
Nov 18 14:27:20 prometheus atlas-exporter[79996]: time="2022-11-18T14:27:20+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."
Nov 18 17:01:57 prometheus atlas-exporter[80157]: time="2022-11-18T17:01:57+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."
Nov 18 19:37:56 prometheus atlas-exporter[80358]: time="2022-11-18T19:37:56+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."

It starts with the atlas_error and a few seconds later it gets killed.

Nov 18 17:01:57 prometheus atlas-exporter[80157]: time="2022-11-18T17:01:57+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option."
Nov 18 17:01:57 prometheus atlas-exporter[80157]: time="2022-11-18T17:01:57+01:00" level=info msg="Got result for 0 from probe 0"
Nov 18 17:01:57 prometheus atlas-exporter[80157]: time="2022-11-18T17:01:57+01:00" level=error msg="type  is not supported yet"
Nov 18 17:02:03 prometheus systemd[1]: atlas-exporter.service: Main process exited, code=killed, status=9/KILL
Nov 18 17:02:03 prometheus systemd[1]: atlas-exporter.service: Failed with result 'signal'.
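
For context, a 1GB cap of the kind mentioned above is typically applied with a systemd drop-in such as the following; the drop-in path is an assumption about this setup, while MemoryMax is the standard cgroup-v2 resource-control setting:

# /etc/systemd/system/atlas-exporter.service.d/memory.conf
[Service]
MemoryMax=1G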

tardoe commented Nov 20, 2022

@bauruine Have a go with the DigitalOcean fork - I've been running it fine for a few months and I can't see any OOM kills. I am still seeing breaks in the data, though I'm not entirely sure why. I would be very interested in your results.

I wonder if this commit on the parent repo might help, as it splits measurements into their own streaming sessions: digitalocean@a58a4ce

I should also sit down and review the improvements made in the DNS-OARC repo, which we (DigitalOcean) also forked for our experiments.

@bauruine

Hi @tardoe,
I have been running the DO fork for about a day and I don't see any OOM kills anymore, but I do still see the breaks in the data. I still get the errors in the logs, and they occur at the same time as the breaks in the data.

Nov 22 14:48:21 prometheus atlas-exporter[2705]: time="2022-11-22T14:48:21+01:00" level=error msg="atlas_error: The socket has been switched to volatile because the client is not able to keep up with the results. See the buffering option." source="stream_strategy.go:111"
Nov 22 14:48:21 prometheus atlas-exporter[2705]: time="2022-11-22T14:48:21+01:00" level=error msg="channel closed" source="stream_strategy.go:107"
Nov 22 14:48:21 prometheus atlas-exporter[2705]: time="2022-11-22T14:48:21+01:00" level=info msg="Got result for 0 from probe 0" source="stream_strategy.go:135"
Nov 22 14:48:21 prometheus atlas-exporter[2705]: time="2022-11-22T14:48:21+01:00" level=error msg="type  is not supported yet" source="stream_strategy.go:157"
Nov 22 14:48:51 prometheus atlas-exporter[2705]: time="2022-11-22T14:48:51+01:00" level=info msg="Subscribed to results of measurement #1234" source="stream_strategy.go:57"

tardoe commented Nov 23, 2022

Thanks for the feedback @bauruine - were there gaps in the metrics running the previous master build? I suspect I’ll have to rework the DO fork’s changes into master to sort this.

@bauruine

were there gaps in the metrics running the previous master build?

Yes, it looked exactly the same.
