-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add VM Scale Management Steps #861
Conversation
:return: | ||
""" | ||
try: | ||
vm_names = self._oc._get_all_vm_names() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to compare this against a list of expected VM names, if that's practical here. And you certainly want to make sure that there is at least one, or you'll simply be "verifying" a null case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You right, add MissingVMs error when no running VMs at all
if self._oc.wait_for_vm_ssh(vm_name=vm_name, node_ip=node_ip, vm_node_port=vm_node_port): | ||
logger.info(f"Successfully ssh into VM: '{vm_name}' in Node: '{vm_node}' ") | ||
return vm_node | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure you capture any stderr from the ssh command, so that when it fails you can at least have some hope of figuring out what went wrong (it might simply be a network problem or a credential problem).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I handled it in self._oc.wait_for_vm_ssh by raising VMStateTimeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that actually capture the stderr from the ssh command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see the full code here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not appear to me that the ssh output gets reported if the ssh fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a custom error for it:
raise VMStateTimeout(vm_name=vm_name, state='ssh')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But does it report the actual error reported by ssh? I want that to be reported to make it easier to debug and to distinguish glitches from real problems. It doesn't look like VMStateTimeout captures the reason for the error (what ssh reports on stderr).
self._data_dict.update({'total_run_time': total_run_time}) | ||
if not self._verification_only: | ||
total_run_time = self._get_bootstorm_vm_total_run_time() | ||
self._data_dict.update({'total_run_time': total_run_time}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wouldn't be a bad idea to stick some distinctive value in there even in verification mode, such as -1, so that the schema is the same both ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_run_time will be empty when no input data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, but that's about testing that total_run_time is handled correctly. Not that big of a deal, but think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer that it will be empty in ElasticSearch instead of -1, more simple to handle it.
@RobertKrawitz, any more comments ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The big thing I want is to capture stderr if ssh (or any other command) fails.
if self._oc.wait_for_vm_ssh(vm_name=vm_name, node_ip=node_ip, vm_node_port=vm_node_port): | ||
logger.info(f"Successfully ssh into VM: '{vm_name}' in Node: '{vm_node}' ") | ||
return vm_node | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not appear to me that the ssh output gets reported if the ssh fails.
@RobertKrawitz, we dont need to get ssh output, we need to know if vm is accessible for ssh call or not |
If the ssh fails, we want to know why. Connection timed out, connection reset by peer, authentication are different failures that ssh can report and I believe we do want to see what happened. |
If something will fail, I will raise it here |
As long as stderr from ssh gets saved, I don't care how you do it. I do want to be sure it's possible to see after the fact what happened. |
Sorry I didnt get your point. |
I want any stderr output from ssh to be saved, so that if there is a failure, we can look at it to try to figure out what happened. ssh can fail for any number of reasons which may have nothing to do with whether the VM is running: there might be an authentication failure, there might be a transient network problem, there might be any number of things. Simply "ssh to node failed" doesn't help us figure out what happened. |
I am focusing on the main flow right now and I dont want to focus in log collection in this pr only on verification steps. |
OK, but I think that it's important to collect information if verification fails. I will accept opening a PR for that, but I'd like you to open an issue against it prior to my approving this PR. |
Open issue regarding it: #862 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with issue now opened to log ssh failures in detail.
Type of change
Note: Fill x in []
Description
[Chaos Testing] Adding VM Scale Management Steps:
For security reasons, all pull requests need to be approved first before running any automated CI