cnf-tests: Add debug information for flake tests #2109

Open
wants to merge 2 commits into
base: master

zeeke
Member

@zeeke zeeke commented Nov 20, 2024

Test case

[sriov] NUMA node alignment [BeforeAll] Validate the creation of a pod with excludeTopology set to False and an SRIOV interface in a different NUMA node than the pod

flakes with the following error:

Can't find a suitable node for testing: node [cnfdu3] has no NUMA0 devices, node [cnfdu4] has no NUMA0 devices,

Dump NUMA nodes for each network device when failing.
Use Gomega instead of ginkgo.Fail to trigger a k8sreporter archive.

Sample failure

Contributor

openshift-ci bot commented Nov 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zeeke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 20, 2024
// [0200] is the class for Ethernet controllers
// https://admin.pci-ids.ucw.cz/read/PD/
out, err := testnode.ExecCommandOnNodeViaSriovDaemon(client.Client, node, []string{
"lspci", "-vv", "-nn", "-d", "::0200",
Member

and so we introduce a hard dependency on both lspci and the pcidb. Is this a concern?

Member Author

good point. I didn't see it from that perspective.

What about using a command like:

sh-5.1# sh -c 'for d in /sys/class/net/*/device/numa_node; do echo -n "$d "; cat ${d}; done'
/sys/class/net/eno12399/device/numa_node 0
/sys/class/net/eno12409/device/numa_node 0
/sys/class/net/eno12409v0/device/numa_node 0
/sys/class/net/eno12409v1/device/numa_node 0
/sys/class/net/eno12409v3/device/numa_node 0
/sys/class/net/eno8303/device/numa_node 0
/sys/class/net/eno8403/device/numa_node 0
/sys/class/net/ens2f0np0/device/numa_node 1
/sys/class/net/ens2f1np1/device/numa_node 1

?
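For illustration only, here is a minimal Go sketch of how output in that `<path> <numa_node>` format could be turned into a per-interface map before it is dumped in the failure message. The helper name `parseNumaNodes` is hypothetical and is not part of this PR; the sketch assumes one `path value` pair per line, exactly as printed by the shell loop above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseNumaNodes is a hypothetical helper (not part of the PR). It parses
// lines of the form "/sys/class/net/<iface>/device/numa_node <n>" and
// returns a map from interface name to NUMA node.
func parseNumaNodes(out string) (map[string]int, error) {
	nodes := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected line %q", sc.Text())
		}
		// Expect ["", "sys", "class", "net", "<iface>", "device", "numa_node"].
		parts := strings.Split(fields[0], "/")
		if len(parts) != 7 {
			return nil, fmt.Errorf("unexpected path %q", fields[0])
		}
		n, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, fmt.Errorf("bad NUMA node in %q: %w", sc.Text(), err)
		}
		nodes[parts[4]] = n
	}
	return nodes, sc.Err()
}

func main() {
	out := "/sys/class/net/eno12399/device/numa_node 0\n" +
		"/sys/class/net/ens2f0np0/device/numa_node 1\n"
	nodes, err := parseNumaNodes(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(nodes["eno12399"], nodes["ens2f0np0"])
}
```

Returning an error on any malformed line, rather than skipping it, keeps unexpected output visible in the test log.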

Member

that's better in the sense we don't add the implicit dep, and WORKSFORME

Member

As a side note, we keep reinventing custom solutions for the hardware discovery/inventory problem over and over, in both tests and application code.

@zeeke zeeke force-pushed the numa-flake-cant-find-suitable-node branch 2 times, most recently from 31517f3 to aa314d7 Compare November 20, 2024 13:31
Member Author

zeeke commented Nov 20, 2024

/test ?

Contributor

openshift-ci bot commented Nov 20, 2024

@zeeke: The following commands are available to trigger required jobs:

  • /test ci
  • /test e2e-aws-ci-tests
  • /test images
  • /test ztp-ci

The following commands are available to trigger optional jobs:

  • /test e2e-aws-ran-profile
  • /test e2e-telco5g-cnftests
  • /test e2e-telco5g-hcp-cnftests
  • /test e2e-telco5g-sno-cnftests
  • /test security

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-kni-cnf-features-deploy-master-ci
  • pull-ci-openshift-kni-cnf-features-deploy-master-e2e-aws-ci-tests
  • pull-ci-openshift-kni-cnf-features-deploy-master-e2e-aws-ran-profile
  • pull-ci-openshift-kni-cnf-features-deploy-master-images
  • pull-ci-openshift-kni-cnf-features-deploy-master-security

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Member Author

zeeke commented Nov 20, 2024

/test e2e-telco5g-cnftests

Test case
```
[sriov] NUMA node alignment [BeforeAll] Validate the creation of a pod with excludeTopology set to False and an SRIOV interface in a different NUMA node than the pod
```

flakes with the following error:
```
Can't find a suitable node for testing: node [cnfdu3] has no NUMA0 devices, node [cnfdu4] has no NUMA0 devices,
```

Dump NUMA nodes for each network device when failing.
Use Gomega instead of `ginkgo.Fail` to trigger a k8sreporter archive

Signed-off-by: Andrea Panattoni <[email protected]>
@zeeke zeeke force-pushed the numa-flake-cant-find-suitable-node branch from aa314d7 to 2167379 Compare November 20, 2024 13:39
Member Author

zeeke commented Nov 20, 2024

/test e2e-telco5g-cnftests

Member Author

zeeke commented Nov 20, 2024

/test e2e-telco5g-cnftests

Member Author

zeeke commented Nov 21, 2024

/test e2e-telco5g-cnftests

Member Author

zeeke commented Nov 21, 2024

These changes spotted a problem in the ExecCommand API:

> Enter [BeforeAll] [sriov] NUMA node alignment - /tmp/cnf-vJw3e/cnf-features-deploy/cnf-tests/testsuites/e2esuite/dpdk/numa_node_sriov.go:47 @ 11/21/24 09:47:06.43
STEP: Clean SRIOV policies and networks - /tmp/cnf-vJw3e/cnf-features-deploy/cnf-tests/testsuites/e2esuite/dpdk/numa_node_sriov.go:61 @ 11/21/24 09:47:06.582
STEP: Discover SRIOV devices - /tmp/cnf-vJw3e/cnf-features-deploy/cnf-tests/testsuites/e2esuite/dpdk/numa_node_sriov.go:64 @ 11/21/24 09:47:22.141
[FAILED] Failure recorded during attempt 1:
Can't find a suitable node for testing: node [cnfdu11] has no NUMA0 devices
�[1;31m2024-11-21T09:47:28.530180Z: /sys/class/net/eno12399/device/numa_node 0
/sys/class/net/eno12409/device/numa_node 0
/sys/class/net/eno8303/device/numa_node 0
/sys/class/net/eno8403/device/numa_node 0
/sys/class/net/ens2f0/device/numa_node 0
/sys/class/net/ens2f1/device/numa_node �[0m
�[1;31m2024-11-21T09:47:28.530224Z: 0
/sys/class/net/ens3f0np0/device/numa_node 0
/sys/class/net/ens3f1np1/device/numa_node 0
/sys/class/net/ens7f0/device/numa_node 1
/sys/class/net/ens7f1/device/numa_node 1
�[0m


Expected
    <bool>: false
to be true
In [BeforeAll] at: /tmp/cnf-vJw3e/cnf-features-deploy/cnf-tests/testsuites/e2esuite/dpdk/numa_node_sriov.go:467 @ 11/21/24 09:47:38.754
< Exit [BeforeAll] [sriov] NUMA node alignment - /tmp/cnf-vJw3e/cnf-features-deploy/cnf-tests/testsuites/e2esuite/dpdk/numa_node_sriov.go:47 @ 11/21/24 09:47:38.754 (32.324s)

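as the command returns some artifacts in the standard output (e.g. an ANSI color code followed by a timestamp prefix, �[1;31m2024-11-21T09:47:28.530180Z: ), which here split the ens2f1 entry across two lines.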
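As a rough sketch of one possible workaround, the output could be cleaned before it is parsed or printed. This is not the fix adopted in this PR. The function `cleanOutput` is hypothetical, and the patterns are assumptions inferred from the log above: each chunk starts with a color code plus an RFC 3339-style timestamp and `: `, each chunk ends with a reset code, and a newline is injected between chunks.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed chunk boundary: reset code, newline, color code, timestamp prefix.
	chunkBoundary = regexp.MustCompile(`\x1b\[0m\r?\n\x1b\[[0-9;]*m\d{4}-\d{2}-\d{2}T[0-9:.]+Z: `)
	// Assumed prefix at the start of a chunk: color code and timestamp.
	logPrefix = regexp.MustCompile(`\x1b\[[0-9;]*m\d{4}-\d{2}-\d{2}T[0-9:.]+Z: `)
	// Any remaining ANSI SGR (color) sequence.
	ansiCode = regexp.MustCompile(`\x1b\[[0-9;]*m`)
)

// cleanOutput is a hypothetical helper (not part of the PR). It first joins
// chunks that were split by the logger, then drops leftover prefixes and
// color codes.
func cleanOutput(s string) string {
	s = chunkBoundary.ReplaceAllString(s, "")
	s = logPrefix.ReplaceAllString(s, "")
	return ansiCode.ReplaceAllString(s, "")
}

func main() {
	raw := "\x1b[1;31m2024-11-21T09:47:28.530180Z: /sys/class/net/ens2f1/device/numa_node \x1b[0m\n" +
		"\x1b[1;31m2024-11-21T09:47:28.530224Z: 0\n\x1b[0m"
	fmt.Printf("%q\n", cleanOutput(raw))
}
```

The chunk-boundary pattern is applied first so that the injected newline is removed together with the codes around it. Stripping only the color codes would leave the interface path and its value on separate lines.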

Member Author

zeeke commented Nov 21, 2024

/test e2e-telco5g-cnftests

Member Author

zeeke commented Nov 22, 2024

/test e2e-telco5g-cnftests

Member Author

zeeke commented Nov 22, 2024

/test e2e-telco5g-cnftests

Contributor

openshift-ci bot commented Nov 22, 2024

@zeeke: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ci-tests | 36e3f03 | link | true | /test e2e-aws-ci-tests |
| ci/prow/e2e-telco5g-cnftests | 36e3f03 | link | false | /test e2e-telco5g-cnftests |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
