[BUG]: Intermittent Dask errors running ransomware_detection pipeline #1990

Open
2 tasks done
dagardner-nv opened this issue Oct 24, 2024 · 0 comments
Labels
bug Something isn't working

Version

24.10

Which installation method(s) does this occur on?

No response

Describe the bug.

The errors below appear to occur at pipeline shutdown; Morpheus itself still exits with status 0.

FromFile rate[Complete]: 1294592 messages [00:25, 51376.75 m
ures rate: 1553 messages [01:12, 35.42 messages/s]
ToFile rate: 171 messages [01:12,  3.08 messages/s]
2024-10-24 16:00:06,977 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 461, in retry_operation
    return await retry(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 440, in retry
    return await coro()
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1256, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58848 remote=tcp://127.0.0.1:44393>: Stream is closed
2024-10-24 16:00:06,978 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 461, in retry_operation
    return await retry(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 440, in retry
    return await coro()
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1256, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58862 remote=tcp://127.0.0.1:44393>: Stream is closed
2024-10-24 16:00:06,978 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 461, in retry_operation
    return await retry(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 440, in retry
    return await coro()
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1256, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58892 remote=tcp://127.0.0.1:44393>: Stream is closed
2024-10-24 16:00:06,980 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 461, in retry_operation
    return await retry(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 440, in retry
    return await coro()
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1256, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58870 remote=tcp://127.0.0.1:44393>: Stream is closed
2024-10-24 16:00:06,979 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 461, in retry_operation
    return await retry(
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/utils_comm.py", line 440, in retry
    return await coro()
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1256, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/home/dagardner/work/conda/envs/morpheus/envs/morpheus-2410reltest/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58884 remote=tcp://127.0.0.1:44393>: Stream is closed
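
To make the repeated tracebacks easier to compare, here is a minimal sketch that groups the final `CommClosedError` lines by remote address. The function name `summarize_comm_closed` and the regular expression are illustrative assumptions, not part of Morpheus or Dask; the sample text is copied from two of the tracebacks above. In this log every failing connection points at the same remote (`127.0.0.1:44393`), presumably the scheduler, which would be consistent with the scheduler closing before the workers finished their heartbeats.

```python
import re

# Matches the last line of each traceback in the log above.
COMM_CLOSED = re.compile(
    r"CommClosedError: in <TCP \(closed\) (?P<pool>\S+) "
    r"local=tcp://(?P<local>[\d.]+:\d+) remote=tcp://(?P<remote>[\d.]+:\d+)>"
)


def summarize_comm_closed(log_text: str) -> dict[str, list[str]]:
    """Map each remote address to the local addresses that lost their connection to it."""
    locals_by_remote: dict[str, list[str]] = {}
    for match in COMM_CLOSED.finditer(log_text):
        locals_by_remote.setdefault(match["remote"], []).append(match["local"])
    return locals_by_remote


# Two lines copied verbatim from the log output in this report.
sample = (
    "distributed.comm.core.CommClosedError: in <TCP (closed) "
    "ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58848 "
    "remote=tcp://127.0.0.1:44393>: Stream is closed\n"
    "distributed.comm.core.CommClosedError: in <TCP (closed) "
    "ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:58862 "
    "remote=tcp://127.0.0.1:44393>: Stream is closed\n"
)

print(summarize_comm_closed(sample))
```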

Minimum reproducible example

Run the ransomware_detection example a few times; the errors do not appear on every run.
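
Because the process exits 0 even when the errors are logged, the exit code alone cannot tell a clean run from a noisy one. The following is a rough sketch of a helper that runs a command several times and counts the runs whose output contains the heartbeat error message. The helper name `count_noisy_runs` is hypothetical, and the actual command line for the ransomware_detection example is not given in this report, so it has to be supplied by whoever runs it.

```python
import subprocess
import sys

# Message text taken from the distributed.worker log lines in this report.
HEARTBEAT_ERROR = "Failed to communicate with scheduler during heartbeat"


def count_noisy_runs(command: list[str], runs: int = 5) -> int:
    """Run `command` `runs` times and count the runs that log the heartbeat error.

    The return code is deliberately ignored, since the report states that
    the process exits 0 even when the error is logged.
    """
    noisy = 0
    for _ in range(runs):
        proc = subprocess.run(command, capture_output=True, text=True, check=False)
        if HEARTBEAT_ERROR in proc.stdout + proc.stderr:
            noisy += 1
    return noisy


# Stand-in command that only imitates the log line; replace it with the real
# command used to launch the ransomware_detection example.
stand_in = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('ERROR - Failed to communicate with scheduler during heartbeat.')",
]
print(count_noisy_runs(stand_in, runs=2))
```

Capturing both stdout and stderr is a precaution, since the report does not say which stream the Dask log lines were written to.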

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report