Experts fail to initialize > 50% of the time #2

Vectorrent · 2024-09-04T11:40:39Z

I have no idea why this happens. Even when bootstrapping from a local DHT node, initialization may fail with all kinds of errors:

Sep 04 06:37:01.905 [INFO] Server started with 3 modules:
Sep 04 06:37:01.905 [INFO] expert.0: PraxisMLP, 525568 parameters
Sep 04 06:37:01.905 [INFO] expert.1: PraxisMLP, 525568 parameters
Sep 04 06:37:01.905 [INFO] expert.2: PraxisMLP, 525568 parameters
Sep 04 06:37:01.936 [ERROR] [hivemind.moe.server.connection_handler._run:63] ConnectionHandler failed to start:
Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 86, in bytes_iter
    proto = protocol_with_code(code)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/protocols.py", line 290, in protocol_with_code
    return REGISTRY.find_by_code(code)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/protocols.py", line 260, in find_by_code
    raise exceptions.ProtocolNotFoundError(code, "code")
multiaddr.exceptions.ProtocolNotFoundError: No protocol with code 465 found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/server/connection_handler.py", line 59, in _run
    self._p2p = await self.dht.replicate_p2p()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/dht/dht.py", line 327, in replicate_p2p
    self._p2p_replica = await P2P.replicate(daemon_listen_maddr)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/p2p/p2p_daemon.py", line 312, in replicate
    await self._ping_daemon()
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/p2p/p2p_daemon.py", line 317, in _ping_daemon
    logger.debug(f"Launched p2pd with peer id = {self.peer_id}, host multiaddrs = {self._visible_maddrs}")
                                                                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/multiaddr.py", line 147, in __repr__
    return "<Multiaddr %s>" % str(self)
                              ^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/multiaddr.py", line 135, in __str__
    return bytes_to_string(self._bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 30, in bytes_to_string
    for _, proto, codec, part in bytes_iter(buf):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 89, in bytes_iter
    raise exceptions.BinaryParseError(
multiaddr.exceptions.BinaryParseError: Invalid binary MultiAddr protocol 465: Unknown Protocol
Sep 04 06:37:01.940 [ERROR] [hivemind.utils.mpfuture._process_updates_in_background:198] Could not retrieve update: caught TypeError("BinaryParseError.__init__() missing 2 required positional arguments: 'binary' and 'protocol'") (pid=242958)
Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/utils/mpfuture.py", line 177, in _process_updates_in_background
    uid, update_type, payload = receiver_pipe.recv()
                                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BinaryParseError.__init__() missing 2 required positional arguments: 'binary' and 'protocol'
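A note on the second traceback: the TypeError looks like a secondary symptom rather than a separate bug. Python rebuilds a pickled exception by calling its class with `self.args`, so an exception whose `__init__` requires more arguments than it forwards to `super().__init__()` cannot be reconstructed on the receiving side of a pipe. The sketch below reproduces that mechanism with a hypothetical stand-in class (`ParseErrorSketch` is not the real multiaddr class, and its signature is only inferred from the error message above):

```python
class ParseErrorSketch(Exception):
    """Hypothetical stand-in: __init__ needs three arguments but forwards one."""

    def __init__(self, message, binary, protocol):
        self.message = message
        self.binary = binary
        self.protocol = protocol
        # Only the formatted string reaches BaseException, so self.args has one item.
        super().__init__(f"Invalid binary MultiAddr protocol {protocol}: {message}")


class PicklableParseErrorSketch(ParseErrorSketch):
    """Same exception, but __reduce__ supplies every constructor argument."""

    def __reduce__(self):
        return (type(self), (self.message, self.binary, self.protocol))


def rebuild(exc):
    """Recreate exc the way unpickling does: call the class with the reduced args."""
    cls, args = exc.__reduce__()[:2]
    try:
        return cls(*args)
    except TypeError as err:
        return err


broken = rebuild(ParseErrorSketch("Unknown Protocol", b"\x00", 465))
fixed = rebuild(PicklableParseErrorSketch("Unknown Protocol", b"\x00", 465))
print(type(broken).__name__)  # the rebuild itself fails
print(type(fixed).__name__)   # the original exception type survives
```

If this is what happens here, the error worth chasing is the first one (the unknown protocol code 465 while formatting the multiaddr); the TypeError only hides it when the exception is passed between processes.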

Adding a delay to startup seemed to make the problem somewhat less frequent, but honestly it doesn't work very well, if at all.

Could use some help with this one. I've been running into issues like this in Hivemind for years.


Vectorrent · 2024-09-09T23:41:04Z

Example (attached video): error.mp4

Vectorrent · 2024-10-23T07:35:39Z

I found a workaround for this problem. Long story short: if you call dht.get_visible_maddrs() before attempting to start the server, it never hangs. Clearly this is not intended behavior, and this method should have no bearing on server bootstrapping... but it does. So we've worked around it with a hack until upstream fixes it.
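For anyone landing here, a minimal sketch of the ordering described above. The helper name `start_after_maddr_probe` and the `start_server` callback are hypothetical, made up for illustration; the only call taken from this thread is `dht.get_visible_maddrs()`. The hivemind lines in the trailing comment are an assumption about typical usage, not the exact code in this repo:

```python
from typing import Any, Callable


def start_after_maddr_probe(dht: Any, start_server: Callable[[Any], Any]) -> Any:
    """Query the DHT's visible multiaddrs once, then start the server.

    The result of get_visible_maddrs() is deliberately discarded and never
    formatted: in the traceback above, the crash happened while converting a
    multiaddr to a string. Only the call itself matters for the workaround.
    """
    dht.get_visible_maddrs()
    return start_server(dht)


# Assumed usage with hivemind (not executed here; `backends` is a placeholder
# for the dict of expert backends built elsewhere):
#
#   dht = hivemind.DHT(initial_peers=initial_peers, start=True)
#   server = start_after_maddr_probe(
#       dht, lambda d: hivemind.moe.Server(d, backends, start=True)
#   )
```

Wrapping the two steps in one function just makes it harder to start the server without the probe by accident; calling `dht.get_visible_maddrs()` inline right before the server is created should be equivalent.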

Vectorrent added the bug (Something isn't working) and help wanted (Extra attention is needed) labels Sep 4, 2024

Vectorrent closed this as completed Oct 23, 2024
