
[Bug]: Producer failed after running for 48 hours #1519

Closed
ajhunyady opened this issue Aug 27, 2021 · 13 comments
Labels: bug (Something isn't working), Client stability

ajhunyady (Contributor) commented Aug 27, 2021

I have a Python script that periodically pulls GitHub stars and pushes them to a Fluvio topic.
The script ran for 48 hours before it returned an error.

The code connects to Fluvio every 2 minutes (and I assume the connection is automatically closed when the routine completes); there doesn't seem to be a way to close the connection explicitly.

The interesting part is that I have a consumer listening to the same topic on the same machine. The consumer runs in a different process, and it did not get disconnected.

So the server is fine.

Script:

import json
import datetime
import requests
import schedule
from fluvio import Fluvio

def get_stars():
    response = requests.get("https://api.github.com/repos/infinyon/fluvio")
    result = response.json()
    timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

    return timestamp, result['stargazers_count']

def produce_to_fluvio(timestamp, stars):
    # A new connection is opened on every call; the client exposes no explicit close
    fluvio = Fluvio.connect()
    producer = fluvio.topic_producer("fluvio-stars")
    producer.send(timestamp.encode(), ("%s" % stars).encode())


def job():
    timestamp, stars = get_stars()
    print("[%s] %d" % (timestamp, stars))
    produce_to_fluvio(timestamp, stars)


def main():
    job()

    schedule.every(2).minutes.do(job)
    while True:
        schedule.run_pending()

if __name__ == "__main__":
    main()

Error:

[2021-08-27 09:34:31] 604
Traceback (most recent call last):
  File "get-stars.py", line 34, in <module>
    main()
  File "get-stars.py", line 31, in main
    schedule.run_pending()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 780, in run_pending
    default_scheduler.run_pending()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 100, in run_pending
    self._run_job(job)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 172, in _run_job
    ret = job.run()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 661, in run
    ret = self.job_func()
  File "get-stars.py", line 23, in job
    produce_to_fluvio(timestamp, stars)
  File "get-stars.py", line 15, in produce_to_fluvio
    fluvio = Fluvio.connect()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/fluvio/__init__.py", line 195, in connect
    return cls(_Fluvio.connect())
_fluvio_python.Error: Fluvio socket error
Caused by:
server has terminated connection
ajhunyady added the bug label Aug 27, 2021
sehz added this to the 0.9.5 milestone Aug 29, 2021
sehz self-assigned this Aug 29, 2021
tjtelan assigned tjtelan and unassigned sehz Aug 30, 2021
tjtelan modified the milestones: 0.9.5, 0.9.6 Sep 2, 2021
tjtelan (Contributor) commented Sep 4, 2021

As stated in #1560, the current theory is that this error is caused by running out of TCP connections, because the test repeatedly calls Fluvio.connect().

Eventually the script accumulates a large number of TCP connections and hits a system limit, resulting in the error when Fluvio.connect() is unable to open a new TCP connection.


(The following is a summarized dev conversation)

https://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux

Our producer longevity test was modeled after this script (a new connection per message produced), and here's what the TCP state looks like.

Before test:

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -r
     94 ESTABLISHED
     18 LISTEN
      2 TIME_WAIT
      1 established)
      1 Foreign

At test start
(Note TIME_WAIT)

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -r
    207 TIME_WAIT
     94 ESTABLISHED
     18 LISTEN
      1 established)
      1 Foreign

At peak, we see 2244 TCP connections in TIME_WAIT:

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -r
   2244 TIME_WAIT
     90 ESTABLISHED
     18 LISTEN
      1 established)
      1 Foreign
      1 FIN_WAIT2
      1 FIN_WAIT1

Info about TIME_WAIT:

TIME_WAIT is normal. It's a state a socket enters after it has closed, used by the kernel to keep track of packets which may have got lost and turned up late to the party. A high number of TIME_WAIT connections is a symptom of lots of short-lived connections, not something to worry about.
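
For reference, here is a minimal sketch of watching these per-state counts from Python instead of shelling out to netstat. It assumes the psutil package is installed; the function name count_tcp_states is purely illustrative.

# Sketch: count TCP connections per state, roughly equivalent to the
# `netstat -nat | awk '{print $6}' | sort | uniq -c` pipeline above.
# Assumes psutil is installed; may need elevated privileges to see
# sockets owned by other processes.
from collections import Counter

import psutil

def count_tcp_states():
    states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
    for state, count in states.most_common():
        print("%7d %s" % (count, state))

if __name__ == "__main__":
    count_tcp_states()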

tjtelan (Contributor) commented Sep 4, 2021

Workaround / suggestion:
Rewrite the Python driver code to re-use the Fluvio connection. This is what I'm testing with:

#!/usr/bin/env python3

import json
import datetime
import schedule
import requests
import time
from fluvio import Fluvio

class FluvioStarWatcher:
    def __init__(self):
        # Connect once and re-use the same connection and producer for every job run
        self.fluvio = Fluvio.connect()
        self.producer = self.fluvio.topic_producer('fluvio-stars')

    def send(self, key, value):
        self.producer.send(key, value)

    def get_stars(self):
        response = requests.get("https://api.github.com/repos/infinyon/fluvio")
        result = response.json()
        timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

        return timestamp, result['stargazers_count']

    def job(self):
        timestamp, stars = self.get_stars()
        print("[%s] %d" % (timestamp, stars))
        self.send(timestamp.encode(), ("%s" % stars).encode())


def main():
    star_watcher = FluvioStarWatcher()
    star_watcher.job()

    schedule.every(2).minutes.do(star_watcher.job)
    while True:
        schedule.run_pending()
        time.sleep(1)

if __name__ == "__main__":
    main()

tjtelan (Contributor) commented Sep 7, 2021

I've been running the code from #1519 (comment) at the same time as the code from #1526. Both have been running for more than 48 hours, and the TIME_WAIT count stays low.

      5 TIME_WAIT
     42 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

I think this problem could happen again, simply because the fix relies on the user handling connections in a specific way. There isn't an obvious way to explicitly close a connection.
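
In the meantime, one way for a user to at least make the connection's lifetime explicit in their own code is to scope it with a small context manager. This is only a sketch: the client exposes no close() today, so it assumes the underlying socket is released once the Python objects go out of scope and are garbage-collected.

# Sketch only: scope the Fluvio connection so its lifetime is obvious in user
# code. Assumes (not guaranteed by the current client API) that the underlying
# socket is released once these objects are garbage-collected; there is no
# close() to call explicitly as of this issue.
from contextlib import contextmanager

from fluvio import Fluvio

@contextmanager
def fluvio_producer(topic):
    fluvio = Fluvio.connect()
    yield fluvio.topic_producer(topic)

# Usage sketch:
# with fluvio_producer("fluvio-stars") as producer:
#     producer.send(b"2021-09-07 00:00:00", b"640")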

tjtelan (Contributor) commented Sep 7, 2021

Another theory is that this is a simple network connection issue. I re-ran the original code from #1519 (comment), and after 10 minutes or so I don't see a steady rise in TIME_WAIT that would account for a failure at 48 hours.

Baseline:

      8 TIME_WAIT
     39 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

After ~10 min

      5 TIME_WAIT
     39 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

After ~6 hrs

      4 TIME_WAIT
     45 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

It looks like connections are actually closing, and we don't seem to be approaching a crash even without re-using connections.

tjtelan modified the milestones: 0.9.6, 0.9.7 Sep 7, 2021
tjtelan (Contributor) commented Sep 10, 2021

I ran the original code for just shy of 24 hours before the producer crashed. The error is different this time, though.

/tmp/fluvio-stars$ ./stars-orig.py
[2021-09-07 18:31:42] 635
<trimmed output>
[2021-09-08 18:00:48] 636 
Traceback (most recent call last):
  File "/tmp/fluvio-stars/./stars-orig.py", line 37, in <module>
    main()
  File "/tmp/fluvio-stars/./stars-orig.py", line 34, in main
    schedule.run_pending()
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 780, in run_pending
    default_scheduler.run_pending()
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 100, in run_pending
    self._run_job(job)
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 172, in _run_job
    ret = job.run()
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 661, in run
    ret = self.job_func()
  File "/tmp/fluvio-stars/./stars-orig.py", line 26, in job
    produce_to_fluvio(timestamp, stars)
  File "/tmp/fluvio-stars/./stars-orig.py", line 19, in produce_to_fluvio
    producer = fluvio.topic_producer("fluvio-stars")
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/fluvio/__init__.py", line 222, in topic_producer
    return TopicProducer(self._inner.topic_producer(topic))
_fluvio_python.Error: timed out searching metadata Partition failed due to timeout: 60000 ms

For what it's worth, I am testing against a cluster running in k3d on the same host where I'm running this script.

ajhunyady (Contributor, Author) commented:

Looks different from my issue, but still relevant.

tjtelan modified the milestones: 0.9.7, 0.9.6 Sep 10, 2021
sehz modified the milestones: 0.9.6, 0.9.7 Sep 10, 2021
ajhunyady (Contributor, Author) commented Sep 15, 2021

@tjtelan - it looks like there may be different problems lurking here. But regarding your first observation about the many open sockets: why are we not implementing a close API that explicitly closes the connection to the server?

TCP is tunable in various ways, so there are probably other alternatives for dealing with this issue.
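
For context, here is a minimal sketch of inspecting a few of the relevant kernel tunables on a Linux host (standard /proc/sys paths; whether and how to change them depends on the environment, and nothing here modifies them):

# Sketch: print Linux TCP tunables that influence how quickly TIME_WAIT
# sockets are recycled and how many ephemeral ports are available.
# Assumes a Linux host with /proc/sys mounted; read-only.
from pathlib import Path

TUNABLES = [
    "net/ipv4/tcp_fin_timeout",      # seconds a socket lingers in FIN-WAIT-2
    "net/ipv4/tcp_tw_reuse",         # allow reusing TIME_WAIT sockets for new outbound connections
    "net/ipv4/ip_local_port_range",  # ephemeral port range available to clients
]

for name in TUNABLES:
    path = Path("/proc/sys") / name
    value = path.read_text().strip() if path.exists() else "unavailable"
    print("%s = %s" % (name.replace("/", "."), value))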

sehz (Contributor) commented Sep 15, 2021

The TCP connection count is not the issue here, since a connection is only opened every 2 minutes. We just need to re-test with the new version to see if the same problem occurs.

ajhunyady (Contributor, Author) commented Sep 15, 2021

@sehz happy to get the script going again once we have a fix. I'm going to run the old script, even though I like @tjtelan's version better; we need to implement reconnect before I can use it.
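
(For illustration, this is roughly the kind of reconnect wrapper meant here, layered on top of the existing client calls in user code. The retry count and delay are arbitrary, and this is a stopgap rather than a substitute for reconnect support in the client itself.)

# Sketch: re-establish the Fluvio connection and retry a send a few times
# before giving up. Retry parameters are arbitrary; this is user-side code,
# not a replacement for reconnect support inside the client.
import time

from fluvio import Fluvio

def send_with_reconnect(topic, key, value, attempts=3, delay_seconds=5):
    last_error = None
    for attempt in range(attempts):
        try:
            fluvio = Fluvio.connect()          # fresh connection per attempt
            producer = fluvio.topic_producer(topic)
            producer.send(key, value)
            return
        except Exception as error:            # the client raises _fluvio_python.Error
            last_error = error
            print("send attempt %d failed: %s" % (attempt + 1, error))
            time.sleep(delay_seconds)
    raise last_error

# Usage sketch:
# send_with_reconnect("fluvio-stars", b"2021-09-15 00:00:00", b"650")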

sehz (Contributor) commented Sep 15, 2021

Try again; this may be fixed in the latest version. Make sure to get the latest Python client version from @simlay.

ajhunyady (Contributor, Author) commented:

Happy to use whatever is published. I thought I might need to wait for 0.9.7.

tjtelan modified the milestones: 0.9.7, 0.9.8 Sep 15, 2021
tjtelan (Contributor) commented Sep 23, 2021

Closing this as a duplicate. The root cause of this issue is tracked by #770 and is a high priority: the client should try to re-connect when the socket closes (e.g., for network-stability reasons).

Let's create a new issue if we see this behavior after #770 is fixed.

tjtelan closed this as completed Sep 23, 2021