
[Bug]: Producer failed after running for 48 hours #1519

Closed
ajhunyady opened this issue Aug 27, 2021 · 13 comments
Labels: bug (Something isn't working), Client stability

ajhunyady (Contributor) commented Aug 27, 2021

I have a Python script that periodically pulls GitHub stars and pushes them to a Fluvio topic.
The script ran for 48 hours before it returned an error.

The code connects to Fluvio every 2 minutes (and I assume the connection is automatically closed when the routine completes); there doesn't seem to be a way to close the connection explicitly.

The interesting part is that I have a consumer listening to the same topic on the same machine. The consumer runs in a different process, and it did not get disconnected.

So the server is fine.

Script:

import json
import datetime
import requests
import schedule
from fluvio import Fluvio

def get_stars():
    response = requests.get("https://api.github.com/repos/infinyon/fluvio")
    result = response.json()
    timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

    return timestamp, result['stargazers_count']

def produce_to_fluvio(timestamp, stars):
    # A new connection is opened on every call; the client exposes no explicit close
    fluvio = Fluvio.connect()
    producer = fluvio.topic_producer("fluvio-stars")
    producer.send(timestamp.encode(), ("%s" % stars).encode())


def job():
    timestamp, stars = get_stars()
    print("[%s] %d" % (timestamp, stars))
    produce_to_fluvio(timestamp, stars)


def main():
    job()

    schedule.every(2).minutes.do(job)
    while True:
        schedule.run_pending()

if __name__ == "__main__":
    main()

Error:

[2021-08-27 09:34:31] 604
Traceback (most recent call last):
  File "get-stars.py", line 34, in <module>
    main()
  File "get-stars.py", line 31, in main
    schedule.run_pending()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 780, in run_pending
    default_scheduler.run_pending()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 100, in run_pending
    self._run_job(job)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 172, in _run_job
    ret = job.run()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/schedule/__init__.py", line 661, in run
    ret = self.job_func()
  File "get-stars.py", line 23, in job
    produce_to_fluvio(timestamp, stars)
  File "get-stars.py", line 15, in produce_to_fluvio
    fluvio = Fluvio.connect()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/fluvio/__init__.py", line 195, in connect
    return cls(_Fluvio.connect())
_fluvio_python.Error: Fluvio socket error
Caused by:
server has terminated connection
ajhunyady added the bug label Aug 27, 2021
sehz added this to the 0.9.5 milestone Aug 29, 2021
sehz self-assigned this Aug 29, 2021
tjtelan assigned tjtelan and unassigned sehz Aug 30, 2021
tjtelan modified the milestones: 0.9.5, 0.9.6 Sep 2, 2021
tjtelan (Contributor) commented Sep 4, 2021

As stated in #1560, the current theory is that this error is caused by running out of TCP connections, because the test repeatedly calls Fluvio.connect().

Eventually the script accumulates a large number of TCP connections and hits a system limit, resulting in the error when Fluvio.connect() is unable to open a new TCP connection.


(The following is a summarized dev conversation)

https://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux

Our producer longevity test was modeled after this script (a new connection per message produced), and here's what the TCP state looks like.

Before test:

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -r
     94 ESTABLISHED
     18 LISTEN
      2 TIME_WAIT
      1 established)
      1 Foreign

At test start
(Note TIME_WAIT)

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -r
    207 TIME_WAIT
     94 ESTABLISHED
     18 LISTEN
      1 established)
      1 Foreign

At peak, we see 2244 TCP connections in TIME_WAIT:

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -r
   2244 TIME_WAIT
     90 ESTABLISHED
     18 LISTEN
      1 established)
      1 Foreign
      1 FIN_WAIT2
      1 FIN_WAIT1

Info about TIME_WAIT:

TIME_WAIT is normal. It's a state a socket enters after it has closed, used by the kernel to keep track of packets which may have got lost and turned up late to the party. A high number of TIME_WAIT connections is a symptom of lots of short-lived connections, not something to worry about.
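
For reference, here is a minimal sketch of watching these per-state counts from Python instead of shelling out to netstat. It assumes the psutil package is installed; the function name count_tcp_states is purely illustrative.

# Sketch: count TCP connections per state, roughly equivalent to the
# `netstat -nat | awk '{print $6}' | sort | uniq -c` pipeline above.
# Assumes psutil is installed; may need elevated privileges to see
# sockets owned by other processes.
from collections import Counter

import psutil

def count_tcp_states():
    states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
    for state, count in states.most_common():
        print("%7d %s" % (count, state))

if __name__ == "__main__":
    count_tcp_states()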

tjtelan (Contributor) commented Sep 4, 2021

Workaround / suggestion:
Rewrite the Python driver code to re-use the Fluvio connection. This is what I'm testing with:

#!/usr/bin/env python3

import json
import datetime
import schedule
import requests
import time
from fluvio import Fluvio

class FluvioStarWatcher:
    def __init__(self):
        # Connect once and re-use the same connection and producer for every job run
        self.fluvio = Fluvio.connect()
        self.producer = self.fluvio.topic_producer('fluvio-stars')

    def send(self, key, value):
        self.producer.send(key, value)

    def get_stars(self):
        response = requests.get("https://api.github.com/repos/infinyon/fluvio")
        result = response.json()
        timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

        return timestamp, result['stargazers_count']

    def job(self):
        timestamp, stars = self.get_stars()
        print("[%s] %d" % (timestamp, stars))
        self.send(timestamp.encode(), ("%s" % stars).encode())


def main():
    star_watcher = FluvioStarWatcher()
    star_watcher.job()

    schedule.every(2).minutes.do(star_watcher.job)
    while True:
        schedule.run_pending()
        time.sleep(1)

if __name__ == "__main__":
    main()

tjtelan (Contributor) commented Sep 7, 2021

I've been running the code from #1519 (comment) at the same time as the code from #1526. Both have been running for more than 48 hours, and the TIME_WAIT count stays low.

      5 TIME_WAIT
     42 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

I think this problem could happen again, simply because the fix relies on the user handling connections in a specific way. There isn't an obvious way to explicitly close a connection.
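
In the meantime, one way for a user to at least make the connection's lifetime explicit in their own code is to scope it with a small context manager. This is only a sketch: the client exposes no close() today, so it assumes the underlying socket is released once the Python objects go out of scope and are garbage-collected.

# Sketch only: scope the Fluvio connection so its lifetime is obvious in user
# code. Assumes (not guaranteed by the current client API) that the underlying
# socket is released once these objects are garbage-collected; there is no
# close() to call explicitly as of this issue.
from contextlib import contextmanager

from fluvio import Fluvio

@contextmanager
def fluvio_producer(topic):
    fluvio = Fluvio.connect()
    yield fluvio.topic_producer(topic)

# Usage sketch:
# with fluvio_producer("fluvio-stars") as producer:
#     producer.send(b"2021-09-07 00:00:00", b"640")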

tjtelan (Contributor) commented Sep 7, 2021

Another theory is that this is a simple network connection issue. I re-ran the original code from #1519 (comment), and after 10 minutes or so I don't see a steady rise in TIME_WAIT that would account for a failure at 48 hours.

Baseline:

      8 TIME_WAIT
     39 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

After ~10 min

      5 TIME_WAIT
     39 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

After ~6 hrs

      4 TIME_WAIT
     45 ESTABLISHED
      2 CLOSE_WAIT
      1 Foreign
      1 established)
     19 LISTEN

It looks like connections are actually closing, and we don't seem to be approaching a crash even without re-using connections.

tjtelan modified the milestones: 0.9.6, 0.9.7 Sep 7, 2021
tjtelan (Contributor) commented Sep 10, 2021

I ran the original code for just shy of 24 hours before the producer crashed. The error is different this time, though.

/tmp/fluvio-stars$ ./stars-orig.py
[2021-09-07 18:31:42] 635
<trimmed output>
[2021-09-08 18:00:48] 636 
Traceback (most recent call last):
  File "/tmp/fluvio-stars/./stars-orig.py", line 37, in <module>
    main()
  File "/tmp/fluvio-stars/./stars-orig.py", line 34, in main
    schedule.run_pending()
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 780, in run_pending
    default_scheduler.run_pending()
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 100, in run_pending
    self._run_job(job)
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 172, in _run_job
    ret = job.run()
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/schedule/__init__.py", line 661, in run
    ret = self.job_func()
  File "/tmp/fluvio-stars/./stars-orig.py", line 26, in job
    produce_to_fluvio(timestamp, stars)
  File "/tmp/fluvio-stars/./stars-orig.py", line 19, in produce_to_fluvio
    producer = fluvio.topic_producer("fluvio-stars")
  File "/tmp/fluvio-stars/stars/lib/python3.9/site-packages/fluvio/__init__.py", line 222, in topic_producer
    return TopicProducer(self._inner.topic_producer(topic))
_fluvio_python.Error: timed out searching metadata Partition failed due to timeout: 60000 ms

For what it's worth, I am testing against a cluster running in k3d on the same host where I'm running this script.

ajhunyady (Contributor, Author) commented:

Looks different from my issue, but still relevant.

tjtelan modified the milestones: 0.9.7, 0.9.6 Sep 10, 2021
sehz modified the milestones: 0.9.6, 0.9.7 Sep 10, 2021
ajhunyady (Contributor, Author) commented Sep 15, 2021

@tjtelan - it looks like there may be different problems lurking here. But regarding your first observation about the many open sockets: why are we not implementing a close API that explicitly closes the connection to the server?

TCP is tunable in various ways, so there are probably other alternatives for dealing with this issue.
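
For context, here is a minimal sketch of inspecting a few of the relevant kernel tunables on a Linux host (standard /proc/sys paths; whether and how to change them depends on the environment, and nothing here modifies them):

# Sketch: print Linux TCP tunables that influence how quickly TIME_WAIT
# sockets are recycled and how many ephemeral ports are available.
# Assumes a Linux host with /proc/sys mounted; read-only.
from pathlib import Path

TUNABLES = [
    "net/ipv4/tcp_fin_timeout",      # seconds a socket lingers in FIN-WAIT-2
    "net/ipv4/tcp_tw_reuse",         # allow reusing TIME_WAIT sockets for new outbound connections
    "net/ipv4/ip_local_port_range",  # ephemeral port range available to clients
]

for name in TUNABLES:
    path = Path("/proc/sys") / name
    value = path.read_text().strip() if path.exists() else "unavailable"
    print("%s = %s" % (name.replace("/", "."), value))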

sehz (Contributor) commented Sep 15, 2021

The TCP connection count is not the issue here, since a connection is only opened every 2 minutes. We just need to re-test with the new version to see if the same problem occurs.

ajhunyady (Contributor, Author) commented Sep 15, 2021

@sehz happy to get the script going again once we have a fix. I'm going to run the old script, even though I like @tjtelan's version better; we need to implement reconnect before I can use it.
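
(For illustration, this is roughly the kind of reconnect wrapper meant here, layered on top of the existing client calls in user code. The retry count and delay are arbitrary, and this is a stopgap rather than a substitute for reconnect support in the client itself.)

# Sketch: re-establish the Fluvio connection and retry a send a few times
# before giving up. Retry parameters are arbitrary; this is user-side code,
# not a replacement for reconnect support inside the client.
import time

from fluvio import Fluvio

def send_with_reconnect(topic, key, value, attempts=3, delay_seconds=5):
    last_error = None
    for attempt in range(attempts):
        try:
            fluvio = Fluvio.connect()          # fresh connection per attempt
            producer = fluvio.topic_producer(topic)
            producer.send(key, value)
            return
        except Exception as error:            # the client raises _fluvio_python.Error
            last_error = error
            print("send attempt %d failed: %s" % (attempt + 1, error))
            time.sleep(delay_seconds)
    raise last_error

# Usage sketch:
# send_with_reconnect("fluvio-stars", b"2021-09-15 00:00:00", b"650")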

sehz (Contributor) commented Sep 15, 2021

Try again; this may be fixed in the latest version. Make sure to get the latest Python client version from @simlay.

ajhunyady (Contributor, Author) commented:

Happy to use whatever is published. I thought I might need to wait for 0.9.7.

tjtelan modified the milestones: 0.9.7, 0.9.8 Sep 15, 2021
tjtelan (Contributor) commented Sep 23, 2021

Closing this as a duplicate. The root cause of this issue is tracked by #770 and is a high priority: the client should try to re-connect when the socket closes (e.g., for network-stability reasons).

Let's create a new issue if we see this behavior after #770 is fixed.

tjtelan closed this as completed Sep 23, 2021