Experts fail to initialize > 50% of the time #2

Closed
Vectorrent opened this issue Sep 4, 2024 · 2 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@Vectorrent
Contributor

image

I have no idea why this happens. Even when bootstrapping from a local DHT node, initialization may fail with all kinds of errors:

Sep 04 06:37:01.905 [INFO] Server started with 3 modules:
Sep 04 06:37:01.905 [INFO] expert.0: PraxisMLP, 525568 parameters
Sep 04 06:37:01.905 [INFO] expert.1: PraxisMLP, 525568 parameters
Sep 04 06:37:01.905 [INFO] expert.2: PraxisMLP, 525568 parameters
Sep 04 06:37:01.936 [ERROR] [hivemind.moe.server.connection_handler._run:63] ConnectionHandler failed to start:
Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 86, in bytes_iter
    proto = protocol_with_code(code)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/protocols.py", line 290, in protocol_with_code
    return REGISTRY.find_by_code(code)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/protocols.py", line 260, in find_by_code
    raise exceptions.ProtocolNotFoundError(code, "code")
multiaddr.exceptions.ProtocolNotFoundError: No protocol with code 465 found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/moe/server/connection_handler.py", line 59, in _run
    self._p2p = await self.dht.replicate_p2p()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/dht/dht.py", line 327, in replicate_p2p
    self._p2p_replica = await P2P.replicate(daemon_listen_maddr)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/p2p/p2p_daemon.py", line 312, in replicate
    await self._ping_daemon()
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/p2p/p2p_daemon.py", line 317, in _ping_daemon
    logger.debug(f"Launched p2pd with peer id = {self.peer_id}, host multiaddrs = {self._visible_maddrs}")
                                                                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/multiaddr.py", line 147, in __repr__
    return "<Multiaddr %s>" % str(self)
                              ^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/multiaddr.py", line 135, in __str__
    return bytes_to_string(self._bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 30, in bytes_to_string
    for _, proto, codec, part in bytes_iter(buf):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/multiaddr/transforms.py", line 89, in bytes_iter
    raise exceptions.BinaryParseError(
multiaddr.exceptions.BinaryParseError: Invalid binary MultiAddr protocol 465: Unknown Protocol
Sep 04 06:37:01.940 [ERROR] [hivemind.utils.mpfuture._process_updates_in_background:198] Could not retrieve update: caught TypeError("BinaryParseError.__init__() missing 2 required positional arguments: 'binary' and 'protocol'") (pid=242958)
Traceback (most recent call last):
  File "/home/crow/repos/praxis/venv/lib/python3.12/site-packages/hivemind/utils/mpfuture.py", line 177, in _process_updates_in_background
    uid, update_type, payload = receiver_pipe.recv()
                                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BinaryParseError.__init__() missing 2 required positional arguments: 'binary' and 'protocol'
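
The second traceback (the TypeError raised inside `receiver_pipe.recv()`) looks like a side effect of the first rather than a separate failure. Python rebuilds a pickled exception by calling its class with `exc.args`, so an exception whose `__init__` requires more arguments than it forwards to `Exception.__init__` cannot be reconstructed on the receiving side. Below is a minimal sketch of that mechanism. `ParseErrorStandIn` is a hypothetical class shaped after the signature implied by the log; it is not the multiaddr source, and `rebuild_like_pickle` is an illustrative helper that mimics the reconstruction step without using a real pipe.

```python
class ParseErrorStandIn(Exception):
    """Hypothetical stand-in with the argument shape implied by the log."""

    def __init__(self, message, binary, protocol):
        self.binary = binary
        self.protocol = protocol
        # Only the formatted string is stored in Exception.args.
        super().__init__(f"Invalid binary MultiAddr protocol {protocol}: {message}")


def rebuild_like_pickle(exc):
    """Rebuild an exception the way unpickling does: call cls(*args)."""
    cls, args = exc.__reduce__()[:2]
    return cls(*args)


original = ParseErrorStandIn("Unknown Protocol", b"\x00", 465)

try:
    rebuild_like_pickle(original)
    outcome = "rebuilt"
except TypeError as err:
    # args holds one string, but __init__ needs three positional arguments.
    outcome = str(err)

print(outcome)
```

If this reading is right, the error worth chasing is the original `ProtocolNotFoundError` for code 465; the TypeError only hides it when the exception crosses the process boundary.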

Adding a delay to startup seemed to make the problem somewhat less frequent, but it honestly doesn't work very well, if at all.

Could use some help with this one. I've been running into issues like this in Hivemind for years.

Vectorrent added the bug and help wanted labels on Sep 4, 2024
@Vectorrent
Contributor Author

error.mp4

Example

@Vectorrent
Contributor Author

I found a workaround for this problem. Long story short, if you call dht.get_visible_maddrs() before attempting to start the server, it no longer hangs. Clearly, this is not intended behavior: this method should have no bearing on server bootstrapping, but it does. So we are working around it with a hack until this is fixed upstream.
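
A minimal sketch of the ordering described above, assuming the server is created through some callable that takes the DHT. `start_server_after_maddr_lookup`, `RecordingDHT`, and `fake_start_server` are hypothetical names used for illustration; only `get_visible_maddrs()` comes from this thread. The stub stands in for a real `hivemind.DHT` so the call order can be checked without a network.

```python
def start_server_after_maddr_lookup(dht, start_server):
    """Query the DHT's visible multiaddrs first, then start the server."""
    # The workaround from this thread: call get_visible_maddrs() up front.
    visible_maddrs = dht.get_visible_maddrs()
    server = start_server(dht)
    return server, visible_maddrs


class RecordingDHT:
    """Hypothetical stub that records calls, used only to show the ordering."""

    def __init__(self):
        self.calls = []

    def get_visible_maddrs(self):
        self.calls.append("get_visible_maddrs")
        return ["/ip4/127.0.0.1/tcp/0"]


def fake_start_server(dht):
    # Placeholder for whatever actually builds and starts the expert server.
    dht.calls.append("start_server")
    return "server"


stub = RecordingDHT()
server, maddrs = start_server_after_maddr_lookup(stub, fake_start_server)
print(stub.calls)
```

With a real DHT, the same wrapper would be called with the running DHT instance and the project's own server factory in place of the stubs. Why the lookup changes startup behavior is not established in this thread.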
