vc-scheduler NPE when GPU-node not reporting devices properly #3923

Open
archlitchi opened this issue Dec 25, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

archlitchi commented Dec 25, 2024

Description

This issue occurs when volcano-vgpu is enabled and a GPU node does not report its devices to the node annotations properly.

Steps to reproduce the issue

  1. Install Volcano.
  2. Enable volcano-vgpu.
  3. Do not install volcano-vgpu-device-plugin.
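
To confirm that a node is in this state, one option is to dump it with `kubectl get node <name> -o json` and check whether the device annotation is present. The following is a minimal sketch of that check, not part of Volcano: `registerKey` is a placeholder annotation key (substitute the key your device plugin actually writes), and `nodeMeta` and `hasDeviceAnnotation` are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// registerKey is a placeholder annotation key for illustration only;
// it is an assumption, not necessarily the key Volcano reads.
const registerKey = "example.com/vgpu-register"

// nodeMeta mirrors only the part of a Node object that this check needs.
type nodeMeta struct {
	Metadata struct {
		Name        string            `json:"name"`
		Annotations map[string]string `json:"annotations"`
	} `json:"metadata"`
}

// hasDeviceAnnotation reports whether the node JSON carries a non-empty
// value under key.
func hasDeviceAnnotation(nodeJSON []byte, key string) (bool, error) {
	var n nodeMeta
	if err := json.Unmarshal(nodeJSON, &n); err != nil {
		return false, err
	}
	// Indexing a nil or empty map yields "", so a missing annotation is safe.
	return n.Metadata.Annotations[key] != "", nil
}

func main() {
	// A node on which no device plugin has registered anything.
	bare := []byte(`{"metadata":{"name":"vm-node245-vgpu","annotations":{}}}`)
	ok, err := hasDeviceAnnotation(bare, registerKey)
	fmt.Println(ok, err) // false <nil>
}
```

A node for which this returns `false` matches the reproduction conditions above.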

Describe the results you received and expected

vc-scheduler panics with a nil pointer dereference (NPE). Related logs:

E1225 02:12:26.904890     112 node_info.go:389] "Idle resources turn into negative after allocated" nodeName="vm-node245-vgpu" task="rise-vast-system/yunji-deployment-1-5878fc9b8b-fbvm8" resources=["nvidia.com/gpu"] idle="cpu 39225.00, memory 192033869059.00, ephemeral-storage 1097689833472000.00, pods 105.00, nvidia.com/gpu -1000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, volcano.sh/vgpu-number 20000.00" req="cpu 0.00, memory 0.00, nvidia.com/gpu 1000.00, pods 1.00"
E1225 02:12:26.906417     112 node_info.go:298] "Node out of sync" name="vm-node245-vgpu" resources=["nvidia.com/gpu"]
E1225 02:12:26.926206     112 node_info.go:389] "Idle resources turn into negative after allocated" nodeName="vm-node245-vgpu" task="rise-vast-system/yunji-deployment-1-5878fc9b8b-fbvm8" resources=["nvidia.com/gpu"] idle="cpu 39350.00, memory 192139269379.00, nvidia.com/gpu -1000.00, hugepages-1Gi 0.00, volcano.sh/vgpu-number 20000.00, ephemeral-storage 1097689833472000.00, pods 107.00, hugepages-2Mi 0.00" req="cpu 0.00, memory 0.00, nvidia.com/gpu 1000.00, pods 1.00"
W1225 02:12:26.926590     112 node_info.go:336] received argument of nil node, no need to set other resources for 
W1225 02:12:26.926703     112 node_info.go:231] the argument node is null.
W1225 02:12:26.973399     112 node_info.go:336] received argument of nil node, no need to set other resources for 
W1225 02:12:26.973484     112 node_info.go:231] the argument node is null.
W1225 02:12:26.974207     112 node_info.go:336] received argument of nil node, no need to set other resources for 
W1225 02:12:26.974251     112 node_info.go:231] the argument node is null.
E1225 02:12:26.974949     112 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
	goroutine 521 [running]:
	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x28f6f98, 0x3ef8500}, {0x21f1ea0, 0x3e5f270})
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0xbc
	k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x28f6f98, 0x3ef8500}, {0x21f1ea0, 0x3e5f270}, {0x3ef8500, 0x0, 0x10000c0003345d0?})
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:82 +0x5e
	k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0003345d0?})
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:59 +0x108
	panic({0x21f1ea0?, 0x3e5f270?})
		/root/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:770 +0x132
	volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/vgpu.NewGPUDevices({0xc000012d10, 0xa}, 0xc001010308)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/api/devices/nvidia/vgpu/device_info.go:99 +0xe6
	volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).setNodeOthersResource(0xc00094a0c0, 0xc001010308)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/api/node_info.go:341 +0xcc
	volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).setNode(0xc00094a0c0, 0xc001010308)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/api/node_info.go:355 +0x93
	volcano.sh/volcano/pkg/scheduler/api.(*NodeInfo).SetNode(0xc000e280c0, 0xc001010308)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/api/node_info.go:320 +0x51
	volcano.sh/volcano/pkg/scheduler/cache.(*SchedulerCache).AddOrUpdateNode(0xc000428dc8, 0xc001010308)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/cache/event_handlers.go:497 +0x10a
	volcano.sh/volcano/pkg/scheduler/cache.(*SchedulerCache).SyncNode(0xc000428dc8, {0xc000808410, 0xa})
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/cache/event_handlers.go:618 +0x4cc
	volcano.sh/volcano/pkg/scheduler/cache.(*SchedulerCache).processSyncNode(0xc000428dc8)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/cache/cache.go:1218 +0x1dc
	volcano.sh/volcano/pkg/scheduler/cache.(*SchedulerCache).runNodeWorker(...)
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/cache/cache.go:1200
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000318ed0, {0x28d50c0, 0xc00100e000}, 0x1, 0xc000163c20)
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
	k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000318ed0, 0x0, 0x0, 0x1, 0xc000163c20)
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
	k8s.io/apimachinery/pkg/util/wait.Until(...)
		/root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
	created by volcano.sh/volcano/pkg/scheduler/cache.(*SchedulerCache).Run in goroutine 1
		/root/volcano/volcano_metrics/volcano/pkg/scheduler/cache/cache.go:806 +0x8f

What version of Volcano are you using?

latest

Any other relevant information

No response

@archlitchi archlitchi added the kind/bug Categorizes issue or PR as related to a bug. label Dec 25, 2024
archlitchi commented

Related PR: #3924
