Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Avoid high concurrency issue with control master #4455

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Dec 10, 2024

When there are high concurrent ssh connection to the same node with control master, it seems any new connection with the control master will hang. We now add a detection to avoid using control master when the total connections are higher than a threshold.

To reproduce:

sky launch -c test-high --cpus 8 --cloud aws
for i in `seq 1 50`; do sky exec test-high echo hi & done
sky queue test-high

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll Michaelvll requested a review from cblmemo December 10, 2024 21:54
@Michaelvll Michaelvll marked this pull request as draft December 10, 2024 22:13
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @Michaelvll ! Left some discussions ;)


Args:
ip: The IP address to check
threshold: Maximum number of allowed concurrent SSH connections
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stale? seems now the threshold is a constant (but not an arg)

Comment on lines +80 to +81
# Use pgrep to efficiently find ssh processes and pipe to grep for IP
cmd = f'pgrep -f ssh | xargs -r ps -p | grep -c {ip}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that only tracks the SSH command from local machine? How about the connections from other laptops

Comment on lines +11 to +15
# The maximum number of concurrent ssh connections to a same node. This is a
# heuristic value, based on the observation that when the number of concurrent
# ssh connections to a node with control master is high, new connections through
# control master will hang.
_MAX_CONCURRENT_SSH_CONNECTIONS = 32
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you test if this is related to number of cpu?

Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you experiment with setting MaxSessions on the node sshd instead?

@Michaelvll
Copy link
Collaborator Author

Actually, this cannot fix the issue with the high concurrency. We may need a better solution. : )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants