[Bug]: Producer failed after running for 48 hours #1519
Comments
As stated in #1560, the current theory is that this error is caused by running out of TCP connections: the test repeatedly calls `Fluvio.connect()`, so the script eventually accumulates a lot of TCP connections and hits a system limit, at which point the error occurs. (The following is a summarized dev conversation.) Our producer longevity test was modeled after this test script (new connection per message produced), and here's what the TCP state looks like. Before test:
At test start
At peak, we see 2244 TCP connections
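For reference, here is how counts like these can be gathered (a minimal sketch, assuming a Linux host with the `ss` utility available; run it alongside the producer to watch the totals over time):

```python
#!/usr/bin/env python3
# Sketch: tally TCP socket states (ESTAB, TIME-WAIT, ...) on a Linux host.
import collections
import subprocess

def tcp_state_counts():
    # `ss -tan` prints one line per TCP socket; the first column is its state.
    out = subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout
    states = [line.split()[0] for line in out.splitlines()[1:] if line.strip()]
    return collections.Counter(states)

if __name__ == "__main__":
    counts = tcp_state_counts()
    for state, n in counts.most_common():
        print(f"{state:12} {n}")
    print(f"{'TOTAL':12} {sum(counts.values())}")
```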
Some info about TIME_WAIT: it's normal. It's a state after a socket has closed, used by the kernel to keep track of packets which may have got lost and turned up late to the party. A high number of TIME_WAIT connections is a symptom of getting lots of short-lived connections, and nothing to worry about.
Workaround / suggestion:

```python
#!/usr/bin/env python3
# Reuse a single Fluvio connection/producer for the lifetime of the process
# instead of reconnecting on every scheduled run.
import datetime
import time

import requests
import schedule
from fluvio import Fluvio

class FluvioStarWatcher:
    def __init__(self):
        # Connect once; every job() call reuses the same producer.
        self.fluvio = Fluvio.connect()
        self.producer = self.fluvio.topic_producer('fluvio-stars')

    def send(self, key, value):
        self.producer.send(key, value)

    def get_stars(self):
        response = requests.get("https://api.github.com/repos/infinyon/fluvio")
        result = response.json()
        timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
        return timestamp, result['stargazers_count']

    def job(self):
        timestamp, stars = self.get_stars()
        print("[%s] %d" % (timestamp, stars))
        self.send(timestamp.encode(), ("%s" % stars).encode())

def main():
    star_watcher = FluvioStarWatcher()
    star_watcher.job()
    schedule.every(2).minutes.do(star_watcher.job)
    while True:
        schedule.run_pending()
        time.sleep(1)

if __name__ == "__main__":
    main()
```
I've been running the code from #1519 (comment) at the same time as the code from #1526. Both have been running for more than 48 hours without hitting the error.
I think this problem could happen again, simply because the solution requires the user to handle connections in a specific way. There isn't an obvious way to explicitly close a connection.
Another theory is that this is a simple network connection issue. I re-ran the original code from #1519 (comment), and after 10 min or so I don't seem to see a steady rise in TCP connections. Baseline:
After ~10 min
After ~6 hrs
This looks like connections are actually closing, and we don't seem to be approaching a crash even without re-using connections.
I ran the original code for just shy of 24hr before the producer crashed. Error is different though.
For what it's worth, I am testing against a cluster running in k3d on the same host where I'm running this script.
Looks different from my issue, but still relevant.
@tjtelan - it looks like there may be different problems lurking here. But regarding your first observation with many open sockets: why are we not implementing some form of connection re-use? TCP is tunable in various ways, so there are probably other alternatives for dealing with this issue.
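As one concrete example of such tuning (a sketch, assuming a Linux host; `tcp_tw_reuse` is just one illustrative knob, not a recommended fix for this issue):

```python
#!/usr/bin/env python3
# Sketch: inspect a kernel TCP tunable relevant to hosts churning through
# many short-lived connections. On Linux, tcp_tw_reuse controls whether
# sockets in TIME_WAIT may be reused for new outgoing connections.
from pathlib import Path

knob = Path("/proc/sys/net/ipv4/tcp_tw_reuse")
print(f"net.ipv4.tcp_tw_reuse = {knob.read_text().strip()}")
# Changing it requires root, e.g. knob.write_text("1"),
# equivalent to `sysctl -w net.ipv4.tcp_tw_reuse=1`.
```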
The TCP connection is not the issue here, since a connection only happens every 2 min. We just need to re-test with the new version to see if the same problem occurs.
Try again; this may be fixed in the latest version. Make sure to get the latest Python package from @simlay.
Happy to use whatever is published. I thought I might need to wait for 0.9.7.
I have a python script that periodically pulls github stars and pushes to a fluvio topic.
The script ran for 48 hours before it returned an error.
The code connects to Fluvio every 2 minutes (and I assume the connection is automatically closed when the routine completes); there doesn't seem to be a way to close the connection explicitly. A rough sketch of this pattern is below.
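For context, the pattern being described looks roughly like this (a hedged sketch with hypothetical key/value payloads; the actual script is attached below under "Script:"):

```python
# Sketch of the connect-per-job pattern described above -- NOT the fix.
# A fresh Fluvio.connect() runs on every invocation, and the API exposes
# no explicit close(), so teardown is left to object destruction.
from fluvio import Fluvio

def job():
    fluvio = Fluvio.connect()                        # new TCP connection each run
    producer = fluvio.topic_producer('fluvio-stars')
    producer.send(b'key', b'value')                  # hypothetical payload
    # no explicit close; the socket lingers (e.g. in TIME_WAIT) after drop
```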
The interesting part is that I have a consumer listening to the same topic on the same machine. The consumer runs in a different process, and that one did not get disconnected.
So the server is fine.
Script:
Error: