How I observed this behavior
Deployed CKF from latest/edge on a large EC2 instance (m5.4xlarge).
The CKF dashboard takes me to the profile creation page.
Clicking Finish (to create the profile) doesn't move forward, and if I click it again it outputs an error:
profiles.kubeflow.org "orfeas" already exists
Environment
Microk8s 1.26, juju 3.1, CKF latest/edge
Debug
Charm logs
The charm logs show that the replan failed, but the subsequent reconcile completed successfully and the charm went to active.
2024-04-04T10:55:44.911Z [container-agent] 2024-04-04 10:55:44 ERROR juju-log Uncaught exception while in charm code:
2024-04-04T10:55:44.912Z [container-agent] Traceback (most recent call last):
2024-04-04T10:55:44.912Z [container-agent] File "./src/charm.py", line 248, in _update_profiles_layer
2024-04-04T10:55:44.912Z [container-agent] self.profiles_container.replan()
2024-04-04T10:55:44.912Z [container-agent] File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/model.py", line 1984, in replan
2024-04-04T10:55:44.912Z [container-agent] self._pebble.replan_services()
2024-04-04T10:55:44.912Z [container-agent] File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/pebble.py", line 1686, in replan_services
2024-04-04T10:55:44.912Z [container-agent] return self._services_action('replan', [], timeout, delay)
2024-04-04T10:55:44.912Z [container-agent] File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/pebble.py", line 1767, in _services_action
2024-04-04T10:55:44.912Z [container-agent] raise ChangeError(change.err, change)
2024-04-04T10:55:44.912Z [container-agent] ops.pebble.ChangeError: cannot perform the following tasks:
2024-04-04T10:55:44.912Z [container-agent] - Start service "kubeflow-profiles" (cannot start service: exited quickly with code 1)
2024-04-04T10:55:44.912Z [container-agent] ----- Logs from task 0 -----
2024-04-04T10:55:44.912Z [container-agent] 2024-04-04T10:55:44Z INFO Most recent service output:
2024-04-04T10:55:44.912Z [container-agent] 1.7122281448805752e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
2024-04-04T10:55:44.912Z [container-agent] 1.7122281448810372e+09 ERROR setup unable to create controller {"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:373\nmain.main\n\t/workspace/main.go:107\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
2024-04-04T10:55:44.912Z [container-agent] main.main
2024-04-04T10:55:44.912Z [container-agent] /workspace/main.go:108
2024-04-04T10:55:44.912Z [container-agent] runtime.main
2024-04-04T10:55:44.912Z [container-agent] /usr/local/go/src/runtime/proc.go:255
2024-04-04T10:55:44.912Z [container-agent] 2024-04-04T10:55:44Z ERROR cannot start service: exited quickly with code 1
The reconcile, though, completed successfully afterwards.
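Because the unit ends up active despite the failed replan, the failure is easy to miss in `juju status`. As a rough aid, the sketch below scans charm log lines for Pebble's "cannot start service" task error and extracts the service name and reason. The function name `failed_service_starts` and the regular expression are illustrative assumptions based only on the log format shown above; they are not part of the charm or of Juju.

```python
import re

# Matches the Pebble task line seen in the traceback above, e.g.:
#   - Start service "kubeflow-profiles" (cannot start service: exited quickly with code 1)
# The pattern is an assumption derived from that single log sample.
START_FAILURE = re.compile(
    r'Start service "(?P<service>[^"]+)" '
    r"\(cannot start service: (?P<reason>[^)]*)\)"
)


def failed_service_starts(log_lines):
    """Return (service, reason) tuples for every failed service start found."""
    failures = []
    for line in log_lines:
        match = START_FAILURE.search(line)
        if match:
            failures.append((match.group("service"), match.group("reason")))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: pipe unit logs into the script, for example the output of `juju debug-log`.
    for service, reason in failed_service_starts(sys.stdin):
        print(f"{service}: {reason}")
```

A non-empty result for a unit that reports active is the mismatch described in this issue.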
kubeflow-profiles logs
The kubeflow-profiles container logs show the same error:
2024-04-04T10:55:44.881Z [kubeflow-profiles] 1.7122281448810372e+09 ERROR setup unable to create controller {"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:373\nmain.main\n\t/workspace/main.go:107\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"}
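A file watcher failing with "too many open files" can mean that the host's per-user inotify limits have been reached, although the per-process file descriptor limit is another possible cause; this report does not confirm either. As a minimal diagnostic sketch, the snippet below reads the current inotify limits from `/proc/sys/fs/inotify` on the node. The helper name `read_inotify_limits` is hypothetical, and the script only reports values without changing them.

```python
from pathlib import Path

# Standard Linux procfs location for the inotify sysctls.
INOTIFY_DIR = Path("/proc/sys/fs/inotify")
LIMIT_NAMES = ("max_user_instances", "max_user_watches", "max_queued_events")


def read_inotify_limits(base=INOTIFY_DIR):
    """Return a dict mapping each limit name to its integer value.

    A value of None means the file was missing or unreadable
    (for example, when not running on Linux).
    """
    limits = {}
    for name in LIMIT_NAMES:
        try:
            limits[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"fs.inotify.{name} = {value}")
```

Running this on the MicroK8s host shows the limits that apply to the node's workloads. Whether raising them resolves the error would need to be verified on the affected deployment.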
orfeas-k changed the title from "kubeflow-profiles workload doesn't start but charm is active" to "kubeflow-profiles service doesn't start but charm is active" on Apr 11, 2024
kfam logs
kfam works fine
Issue
The issue described above hides two issues: