kubeflow-profiles service doesn't start but charm is active #164

Open
orfeas-k opened this issue Apr 11, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

How I observed this behavior

  • Deployed CKF from latest/edge on a large EC2 instance (m5.4xlarge)
  • The CKF dashboard takes me to the profile creation page
  • Clicking Finish (to create the profile) doesn't move forward, and clicking it again returns an error:
     profiles.kubeflow.org "orfeas" already exists
    

Environment

MicroK8s 1.26, Juju 3.1, CKF latest/edge

Debug

Charm logs

The charm logs show that the replan failed with an uncaught exception, but a later reconcile completed successfully and the charm went to active.

2024-04-04T10:55:44.911Z [container-agent] 2024-04-04 10:55:44 ERROR juju-log Uncaught exception while in charm code:
2024-04-04T10:55:44.912Z [container-agent] Traceback (most recent call last):
2024-04-04T10:55:44.912Z [container-agent]   File "./src/charm.py", line 248, in _update_profiles_layer
2024-04-04T10:55:44.912Z [container-agent]     self.profiles_container.replan()
2024-04-04T10:55:44.912Z [container-agent]   File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/model.py", line 1984, in replan
2024-04-04T10:55:44.912Z [container-agent]     self._pebble.replan_services()
2024-04-04T10:55:44.912Z [container-agent]   File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/pebble.py", line 1686, in replan_services
2024-04-04T10:55:44.912Z [container-agent]     return self._services_action('replan', [], timeout, delay)
2024-04-04T10:55:44.912Z [container-agent]   File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/pebble.py", line 1767, in _services_action
2024-04-04T10:55:44.912Z [container-agent]     raise ChangeError(change.err, change)
2024-04-04T10:55:44.912Z [container-agent] ops.pebble.ChangeError: cannot perform the following tasks:
2024-04-04T10:55:44.912Z [container-agent] - Start service "kubeflow-profiles" (cannot start service: exited quickly with code 1)
2024-04-04T10:55:44.912Z [container-agent] ----- Logs from task 0 -----
2024-04-04T10:55:44.912Z [container-agent] 2024-04-04T10:55:44Z INFO Most recent service output:
2024-04-04T10:55:44.912Z [container-agent]     1.7122281448805752e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ":8080"}
2024-04-04T10:55:44.912Z [container-agent]     1.7122281448810372e+09	ERROR	setup	unable to create controller	{"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:373\nmain.main\n\t/workspace/main.go:107\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
2024-04-04T10:55:44.912Z [container-agent]     main.main
2024-04-04T10:55:44.912Z [container-agent]     	/workspace/main.go:108
2024-04-04T10:55:44.912Z [container-agent]     runtime.main
2024-04-04T10:55:44.912Z [container-agent]     	/usr/local/go/src/runtime/proc.go:255
2024-04-04T10:55:44.912Z [container-agent] 2024-04-04T10:55:44Z ERROR cannot start service: exited quickly with code 1

Despite this failure, the reconcile completed successfully afterwards.
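
One way to keep the charm from reporting active while the workload is down is to check the service state after every replan. The following is only a minimal sketch, not the charm's actual code: `UnitStatus` and `status_after_replan` are hypothetical names, and the `container` argument is assumed to behave like an `ops` container (with `replan()` and `get_service(name).is_running()`). The broad `except Exception` is a stand-in; in a real charm it would be `ops.pebble.ChangeError`, the exception shown in the traceback above.

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    """Hypothetical stand-in for the charm's active/blocked status objects."""
    name: str
    message: str = ""


def status_after_replan(container, service_name: str) -> UnitStatus:
    """Replan, then report active only if the service is really running."""
    try:
        container.replan()
    except Exception as err:  # assumption: ops.pebble.ChangeError in a real charm
        return UnitStatus("blocked", f"replan failed: {err}")

    # A replan that did not raise is not proof that the service is up,
    # so confirm the service state before claiming active.
    if not container.get_service(service_name).is_running():
        return UnitStatus("blocked", f"service {service_name} is not running")
    return UnitStatus("active")
```

With a check like this, a reconcile that runs after the failed replan would still see that `kubeflow-profiles` is not running and would not set the unit to active.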

kubeflow-profiles logs

The kubeflow-profiles container logs show the same error:

2024-04-04T10:55:44.881Z [kubeflow-profiles] 1.7122281448810372e+09	ERROR	setup	unable to create controller	{"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:373\nmain.main\n\t/workspace/main.go:107\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
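
A file watcher failing with "too many open files" is often a sign that the node's inotify limits are exhausted. That is an assumption here, since the issue does not confirm the cause. The sketch below reads the current limits from the standard Linux sysctl files under `/proc/sys/fs/inotify`; the helper name `read_inotify_limits` is hypothetical.

```python
from pathlib import Path

# Standard Linux inotify sysctls exposed under /proc/sys/fs/inotify.
INOTIFY_KEYS = ("max_user_instances", "max_user_watches", "max_queued_events")


def read_inotify_limits(base: str = "/proc/sys/fs/inotify") -> dict:
    """Return the inotify limits found under `base`; unreadable entries are skipped."""
    limits = {}
    for key in INOTIFY_KEYS:
        path = Path(base) / key
        try:
            limits[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            # The file is missing (non-Linux host) or does not hold an integer.
            continue
    return limits


if __name__ == "__main__":
    for key, value in read_inotify_limits().items():
        print(f"{key} = {value}")
```

Running this on the host that backs the cluster would show whether `max_user_instances` or `max_user_watches` is low enough to explain the failure.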

kfam logs

kfam works fine

Issue

The issue described above hides two issues:

@orfeas-k orfeas-k added the bug Something isn't working label Apr 11, 2024
Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5569.

This message was autogenerated

@orfeas-k orfeas-k changed the title kubeflow-profiles workload doesn't start but charm is active kubeflow-profiles service doesn't start but charm is active Apr 11, 2024