Add VM Scale Management Steps #861

ebattat · 2024-08-05T12:35:47Z

Type of change

Note: Fill x in []

bug
enhancement
documentation
dependencies

Description

[Chaos Testing] Adding VM Scale Management Steps:

Run VMs without deleting them before cluster upgrade (Add DELETE_ALL variable)
Verify VMs only after cluster upgrade (Add VERIFICATION_ONLY variable)
Add vm_ssh field to verify that SSH into VM is working properly
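A minimal sketch of how the two variables in the description could gate a run. DELETE_ALL and VERIFICATION_ONLY come from the description. The env_flag helper, the default values, the step names, and the choice not to delete VMs in verification-only mode are assumptions for illustration, not the PR's implementation.

```python
import os


def env_flag(name, default, environ=None):
    """Interpret an environment variable as a boolean flag (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes", "y")


def plan_steps(delete_all, verification_only):
    """Return the ordered steps for one run (step names are illustrative)."""
    if verification_only:
        # After the cluster upgrade: only check the VMs that already exist.
        return ["verify_vms"]
    steps = ["create_vms", "verify_vms"]
    if delete_all:
        # Before the upgrade DELETE_ALL would be false, so the VMs survive.
        steps.append("delete_vms")
    return steps


if __name__ == "__main__":
    delete_all = env_flag("DELETE_ALL", default=True)
    verification_only = env_flag("VERIFICATION_ONLY", default=False)
    print(plan_steps(delete_all, verification_only))
```

Under these assumptions, a pre-upgrade run would set DELETE_ALL to false, and the post-upgrade run would set VERIFICATION_ONLY to true.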

For security reasons, all pull requests need to be approved first before running any automated CI

RobertKrawitz · 2024-08-05T14:47:19Z

benchmark_runner/workloads/bootstorm_vm.py

+        :return:
+        """
+        try:
+            vm_names = self._oc._get_all_vm_names()


It might make sense to compare this against a list of expected VM names, if that's practical here. And you certainly want to make sure that there is at least one, or you'll simply be "verifying" a null case.

You're right. I added a MissingVMs error for the case where there are no running VMs at all.
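A minimal sketch of the check discussed in this thread: fail when no VMs are found, and optionally compare against an expected list. The MissingVMs class here is a stand-in with the same name as the error mentioned in the reply; its constructor, the check_vm_names function, and the expected parameter are assumptions, not the PR's code.

```python
class MissingVMs(Exception):
    """Stand-in for the PR's MissingVMs error (signature assumed)."""


def check_vm_names(found, expected=None):
    """Return the sorted VM names, or raise if the set is empty or incomplete."""
    found = set(found)
    if not found:
        # Without this guard, verification would silently pass on zero VMs.
        raise MissingVMs("no running VMs found; nothing to verify")
    if expected is not None:
        missing = sorted(set(expected) - found)
        if missing:
            raise MissingVMs("expected VMs not found: " + ", ".join(missing))
    return sorted(found)
```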

RobertKrawitz · 2024-08-05T14:49:33Z

benchmark_runner/workloads/bootstorm_vm.py

+                    if self._oc.wait_for_vm_ssh(vm_name=vm_name, node_ip=node_ip, vm_node_port=vm_node_port):
+                        logger.info(f"Successfully ssh into VM: '{vm_name}' in Node: '{vm_node}' ")
+                    return vm_node
+            return False


Make sure you capture any stderr from the ssh command, so that when it fails you can at least have some hope of figuring out what went wrong (it might simply be a network problem or a credential problem).

I handled it in self._oc.wait_for_vm_ssh by raising VMStateTimeout

Does that actually capture the stderr from the ssh command?

You can see the full code here

It does not appear to me that the ssh output gets reported if the ssh fails.

We have a custom error for it:
raise VMStateTimeout(vm_name=vm_name, state='ssh')

But does it report the actual error reported by ssh? I want that to be reported to make it easier to debug and to distinguish glitches from real problems. It doesn't look like VMStateTimeout captures the reason for the error (what ssh reports on stderr).
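A minimal sketch of what the reviewer is asking for: run the command, and if it fails, carry its stderr in the raised error. This is not the project's wait_for_vm_ssh or VMStateTimeout; SSHCheckFailed and run_ssh_check are hypothetical names, and the command is passed in as a list so the sketch makes no assumption about how the project builds its ssh call.

```python
import subprocess


class SSHCheckFailed(Exception):
    """Hypothetical error that keeps what the command wrote to stderr."""

    def __init__(self, vm_name, returncode, stderr):
        self.vm_name = vm_name
        self.returncode = returncode
        self.stderr = stderr
        detail = stderr.strip() or "<no stderr output>"
        super().__init__(
            f"ssh check for VM '{vm_name}' failed (exit code {returncode}): {detail}"
        )


def run_ssh_check(vm_name, command, timeout=30.0):
    """Run the command; on failure raise an error that includes its stderr."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as err:
        # Partial stderr may be None or bytes here, so normalise it.
        stderr = err.stderr or ""
        if isinstance(stderr, bytes):
            stderr = stderr.decode(errors="replace")
        stderr = stderr or f"command timed out after {timeout} seconds"
        raise SSHCheckFailed(vm_name, None, stderr) from err
    if result.returncode != 0:
        raise SSHCheckFailed(vm_name, result.returncode, result.stderr)
    return result.stdout
```

A caller might pass something like ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", "-p", str(vm_node_port), f"{user}@{node_ip}", "true"], where user is a placeholder; the point is only that the failure message then contains the reason ssh gave.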

RobertKrawitz · 2024-08-05T14:51:01Z

benchmark_runner/workloads/bootstorm_vm.py

-        self._data_dict.update({'total_run_time': total_run_time})
+        if not self._verification_only:
+            total_run_time = self._get_bootstorm_vm_total_run_time()
+            self._data_dict.update({'total_run_time': total_run_time})


It wouldn't be a bad idea to stick some distinctive value in there even in verification mode, such as -1, so that the schema is the same both ways.

total_run_time will be empty when there is no input data.

Understood, but that's about testing that total_run_time is handled correctly. Not that big of a deal, but think about it.

I prefer that it be empty in Elasticsearch rather than -1; that is simpler to handle.
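The two options in this thread can be sketched side by side. The run_time_fields function and its sentinel parameter are hypothetical; only the total_run_time key and the verification-only mode come from the diff above.

```python
def run_time_fields(verification_only, total_run_time=None, sentinel=None):
    """Return the fields to merge into the result document.

    With sentinel=None the key is omitted in verification-only mode
    (the author's preference). With a sentinel such as -1 the key is
    always present (the reviewer's suggestion), so both modes share
    one schema.
    """
    if not verification_only:
        return {"total_run_time": total_run_time}
    if sentinel is not None:
        return {"total_run_time": sentinel}
    return {}


data = {"vm_name": "vm-1"}
data.update(run_time_fields(verification_only=True))
# The key is absent here, so a consumer has to handle a missing field.
```

The trade-off is where the special case lives: omitting the key means consumers must tolerate a missing field, while a sentinel means they must filter out -1 before aggregating run times.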

ebattat · 2024-08-08T12:49:59Z

@RobertKrawitz, any more comments?

RobertKrawitz

The big thing I want is to capture stderr if ssh (or any other command) fails.

RobertKrawitz · 2024-08-06T15:42:04Z

benchmark_runner/workloads/bootstorm_vm.py

+                    if self._oc.wait_for_vm_ssh(vm_name=vm_name, node_ip=node_ip, vm_node_port=vm_node_port):
+                        logger.info(f"Successfully ssh into VM: '{vm_name}' in Node: '{vm_node}' ")
+                    return vm_node
+            return False


It does not appear to me that the ssh output gets reported if the ssh fails.

ebattat · 2024-08-11T08:25:39Z

@RobertKrawitz, we don't need the ssh output; we only need to know whether the VM is reachable over ssh.

RobertKrawitz · 2024-08-11T17:50:45Z

> @RobertKrawitz, we don't need the ssh output; we only need to know whether the VM is reachable over ssh.

If the ssh fails, we want to know why. Connection timed out, connection reset by peer, and authentication failure are different failures that ssh can report, and I believe we do want to see what happened.

ebattat · 2024-08-12T06:03:49Z

If something fails, I will raise it here.

RobertKrawitz · 2024-08-12T13:24:23Z

> If something fails, I will raise it here.

As long as stderr from ssh gets saved, I don't care how you do it. I do want to be sure it's possible to see after the fact what happened.

ebattat · 2024-08-12T14:00:10Z

Sorry, I didn't get your point.

RobertKrawitz · 2024-08-12T14:30:50Z

> Sorry, I didn't get your point.

I want any stderr output from ssh to be saved, so that if there is a failure, we can look at it to try to figure out what happened. ssh can fail for any number of reasons which may have nothing to do with whether the VM is running: there might be an authentication failure, there might be a transient network problem, there might be any number of things. Simply "ssh to node failed" doesn't help us figure out what happened.
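Once stderr is saved, a rough triage becomes possible. The sketch below maps a few messages that OpenSSH commonly prints to coarse categories; the function name, the categories, and the pattern list are assumptions, and matching on message text is inherently brittle, so the raw stderr should still be kept alongside the category.

```python
# Substrings of common OpenSSH error messages, checked in order.
SSH_FAILURE_PATTERNS = [
    ("permission denied", "authentication"),
    ("host key verification failed", "host-key"),
    ("connection timed out", "network"),
    ("connection refused", "network"),
    ("connection reset", "network"),
    ("no route to host", "network"),
]


def classify_ssh_failure(stderr):
    """Return a coarse category for an ssh failure based on its stderr text."""
    text = stderr.lower()
    for needle, category in SSH_FAILURE_PATTERNS:
        if needle in text:
            return category
    return "unknown"
```

A "network" result on a single attempt may be a transient glitch worth retrying, whereas "authentication" points at credentials rather than at the VM's state.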

ebattat · 2024-08-12T14:53:40Z

I am focusing on the main flow right now. In this PR I want to focus only on the verification steps, not on log collection.
I will open a separate PR for VM log collection.

RobertKrawitz · 2024-08-12T14:57:22Z

> I am focusing on the main flow right now. In this PR I want to focus only on the verification steps, not on log collection. I will open a separate PR for VM log collection.

OK, but I think that it's important to collect information if verification fails. I will accept opening a PR for that, but I'd like you to open an issue against it prior to my approving this PR.

ebattat · 2024-08-12T15:06:06Z

I opened an issue for it: #862

RobertKrawitz

Approved with issue now opened to log ssh failures in detail.

ebattat added the enhancement label (New feature or request) Aug 5, 2024

ebattat requested a review from RobertKrawitz August 5, 2024 12:35

ebattat self-assigned this Aug 5, 2024

ebattat added the ok-to-test label (PR ok to test) Aug 5, 2024

RobertKrawitz requested changes Aug 5, 2024

ebattat force-pushed the vm_scale_steps branch from f8f0b42 to e9aed95 August 6, 2024 09:19

RobertKrawitz requested changes Aug 8, 2024

Add VM Scale Management Steps

7371a41

ebattat force-pushed the vm_scale_steps branch from e9aed95 to 7371a41 August 9, 2024 09:39

ebattat mentioned this pull request Aug 12, 2024

Collect information for VM verification fails #862

Open

RobertKrawitz approved these changes Aug 12, 2024

ebattat merged commit 022dc27 into redhat-performance:main Aug 14, 2024
