
Task stays Pending when running a test job in K8s point-to-point (RunP) mode #371

Open
PlanetAndMars opened this issue Jul 9, 2024 · 13 comments


PlanetAndMars commented Jul 9, 2024

Issue Type

Others

Search for existing issues similar to yours

No

Kuscia Version

latest

Link to Relevant Documentation

No response

Question Details

The pods are brought up with the deployment.yaml below, using the latest image tag:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuscia-autonomy-alice
  namespace: autonomy-alice
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kuscia-autonomy-alice
  template:
    metadata:
      labels:
        app: kuscia-autonomy-alice
    spec:
      containers:
        - command:
            - tini
            - --
            - kuscia
            - start
            - -c
            - etc/conf/kuscia.yaml
          env:
            - name: REGISTRY_ENDPOINT
              value: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow
            - name: NAMESPACE
              value: alice
            - name: TZ
              value: Asia/Shanghai
          image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-secretflow:latest
          imagePullPolicy: Always
          name: alice
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /home/kuscia/var/tmp
              name: kuscia-var-tmp
            - mountPath: /home/kuscia/etc/conf/kuscia.yaml
              name: kuscia-config
              subPath: kuscia.yaml
          workingDir: /home/kuscia
      automountServiceAccountToken: false
      volumes:
        - emptyDir: {}
          name: kuscia-var-tmp
        - configMap:
            defaultMode: 420
            name: kuscia-autonomy-alice-cm
          name: kuscia-config
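
Before creating the AppImage, it is worth confirming that the autonomy pods came up; a minimal check, assuming the namespace and label from the deployment.yaml above:

# list the kuscia autonomy pods
kubectl -n autonomy-alice get pods -l app=kuscia-autonomy-alice
# exec into one replica to continue the setup inside the container
kubectl -n autonomy-alice exec -it <pod-name> -- bash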

After exec-ing into the pod, the AppImage is created with the AppImage.yaml below, using image tag 1.7.0b0:

apiVersion: kuscia.secretflow/v1alpha1
kind: AppImage
metadata:
  name: secretflow-image
spec:
  configTemplates:
    task-config.conf: |
      {
        "task_id": "{{.TASK_ID}}",
        "task_input_config": "{{.TASK_INPUT_CONFIG}}",
        "task_cluster_def": "{{.TASK_CLUSTER_DEFINE}}",
        "allocated_ports": "{{.ALLOCATED_PORTS}}"
      }
  deployTemplates:
  - name: secretflow
    replicas: 1
    spec:
      containers:
      - args:
        - -c
        - python -m secretflow.kuscia.entry ./kuscia/task-config.conf
        command:
        - sh
        configVolumeMounts:
        - mountPath: /root/kuscia/task-config.conf
          subPath: task-config.conf
        name: secretflow
        ports:
        - name: spu
          port: 20000
          protocol: GRPC
          scope: Cluster
        - name: fed
          port: 20001
          protocol: GRPC
          scope: Cluster
        - name: global
          port: 20002
          protocol: GRPC
          scope: Domain
        - name: node-manager
          port: 20003
          protocol: GRPC
          scope: Local
        - name: object-manager
          port: 20004
          protocol: GRPC
          scope: Local
        - name: client-server
          port: 20005
          protocol: GRPC
          scope: Local
        workingDir: /root
      restartPolicy: Never
  image:
    id: abc
    name: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8
    sign: abc
    tag: 1.7.0b0
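
For the record, the AppImage was registered from inside the kuscia pod with a plain apply (as confirmed further down in this thread); a sketch, assuming kubectl inside the container talks to Kuscia's embedded apiserver:

# register the AppImage against the in-pod apiserver
kubectl apply -f AppImage.yaml
# verify the resource exists
kubectl get appimage secretflow-image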

I then ran scripts/user/create_example_job.sh to launch the test job, but the task stays Pending.
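
The status below was obtained by dumping the KusciaTask object; a hedged sketch of the query (cross-domain is assumed here as the namespace Kuscia places job and task resources in; adjust if your deployment differs):

# list jobs, then dump the task that is stuck in Pending
kubectl get kj -n cross-domain
kubectl get kt secretflow-task-20240709141636-single-psi -n cross-domain -o yaml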

Task details:

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  creationTimestamp: "2024-07-09T06:16:37Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/interconn-protocol-type: kuscia
    kuscia.secretflow/job-id: secretflow-task-20240709141636
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: single-psi
  name: secretflow-task-20240709141636-single-psi
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: secretflow-task-20240709141636
    uid: aaa2b5b8-c4c4-4020-88c7-22223ec8df4f
  resourceVersion: "5547"
  uid: 6078ddb9-96bb-48f4-bee9-c4ec0316ce49
spec:
  initiator: alice
  parties:
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\":
    \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"preprocessing","name":"psi","version":"0.0.1","attr_paths":["input/receiver_input/key","input/sender_input/key","protocol","precheck_input","bucket_size","curve_type"],"attrs":[{"ss":["id1"]},{"ss":["id2"]},{"s":"ECDH_PSI_2PC"},{"b":true},{"i64":"1048576"},{"s":"CURVE_FOURQ"}]},"sf_input_ids":["alice-table","bob-table"],"sf_output_ids":["psi-output"],"sf_output_uris":["psi-output.csv"]}'
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      secretflow-task-20240709141636-single-psi-0/client-server: 24276
      secretflow-task-20240709141636-single-psi-0/fed: 24272
      secretflow-task-20240709141636-single-psi-0/global: 24273
      secretflow-task-20240709141636-single-psi-0/node-manager: 24274
      secretflow-task-20240709141636-single-psi-0/object-manager: 24275
      secretflow-task-20240709141636-single-psi-0/spu: 24277
  - domainID: bob
    namedPort:
      secretflow-task-20240709141636-single-psi-0/client-server: 31964
      secretflow-task-20240709141636-single-psi-0/fed: 31966
      secretflow-task-20240709141636-single-psi-0/global: 31967
      secretflow-task-20240709141636-single-psi-0/node-manager: 31968
      secretflow-task-20240709141636-single-psi-0/object-manager: 31963
      secretflow-task-20240709141636-single-psi-0/spu: 31965
  conditions:
  - lastTransitionTime: "2024-07-09T06:16:37Z"
    status: "True"
    type: ResourceCreated
  lastReconcileTime: "2024-07-09T06:26:41Z"
  phase: Pending
  podStatuses:
    alice/secretflow-task-20240709141636-single-psi-0:
      createTime: "2024-07-09T06:16:37Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: alice
      nodeName: kuscia-autonomy-alice-66cfbb85b-65kdf
      podName: secretflow-task-20240709141636-single-psi-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-07-09T06:16:40Z"
    bob/secretflow-task-20240709141636-single-psi-0:
      createTime: "2024-07-09T06:16:38Z"
      message: 'container[secretflow] waiting state reason: "ImageInspectError", message:
        "Failed to inspect image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\":
        failed to get image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        manifest, detail-> image \"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0\"
        not exist in local repository"'
      namespace: bob
      podName: secretflow-task-20240709141636-single-psi-0
      podPhase: Pending
      reason: ImageInspectError
      startTime: "2024-07-09T06:16:40Z"
  serviceStatuses:
    alice/secretflow-task-20240709141636-single-psi-0-fed:
      createTime: "2024-07-09T06:16:38Z"
      namespace: alice
      portName: fed
      portNumber: 24272
      readyTime: "2024-07-09T06:16:41Z"
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-fed
    alice/secretflow-task-20240709141636-single-psi-0-global:
      createTime: "2024-07-09T06:16:38Z"
      namespace: alice
      portName: global
      portNumber: 24273
      readyTime: "2024-07-09T06:16:41Z"
      scope: Domain
      serviceName: secretflow-task-20240709141636-single-psi-0-global
    alice/secretflow-task-20240709141636-single-psi-0-spu:
      createTime: "2024-07-09T06:16:37Z"
      namespace: alice
      portName: spu
      portNumber: 24277
      readyTime: "2024-07-09T06:16:41Z"
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-spu
    bob/secretflow-task-20240709141636-single-psi-0-fed:
      createTime: "2024-07-09T06:16:38Z"
      namespace: bob
      portName: fed
      portNumber: 31966
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-fed
    bob/secretflow-task-20240709141636-single-psi-0-global:
      createTime: "2024-07-09T06:16:38Z"
      namespace: bob
      portName: global
      portNumber: 31967
      scope: Domain
      serviceName: secretflow-task-20240709141636-single-psi-0-global
    bob/secretflow-task-20240709141636-single-psi-0-spu:
      createTime: "2024-07-09T06:16:38Z"
      namespace: bob
      portName: spu
      portNumber: 31965
      scope: Cluster
      serviceName: secretflow-task-20240709141636-single-psi-0-spu
  startTime: "2024-07-09T06:16:37Z"

The error says the image cannot be found. Switching the tag to 1.6.0b0 produces the same error, yet the 1.7.0b0 image does exist on the host machine.
[screenshots: host image list showing secretflow-lite-anolis8:1.7.0b0 present]

Any advice would be appreciated.


aokaokd commented Jul 9, 2024

Could you retry with a newer version and see?

@PlanetAndMars (Author)

> Could you retry with a newer version and see?

I am already using the latest version; I'm not sure what newer version you mean.


aokaokd commented Jul 9, 2024

OK. This looks like a problem with Kuscia pulling the secretflow image; we'll confirm on our side.


aokaokd commented Jul 9, 2024

The Aliyun image registry had a brief outage just now. This error was caused by the image failing to be pulled from the remote registry (secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow) to local storage. It should have recovered by now; please try again.


PlanetAndMars commented Jul 9, 2024

> The Aliyun image registry had a brief outage just now. This error was caused by the image failing to be pulled from the remote registry (secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow) to local storage. It should have recovered by now; please try again.

I just tried it again; still the same error.
[screenshot: the same ImageInspectError message]

Using kuscia 0.8.0b0 with secretflow 1.6.0b0 works fine.
With the latest kuscia image, neither 1.7.0b0 nor 1.6.0b0 works.


aokaokd commented Jul 10, 2024

After deploying, could you exec into the pod and manually run the k8s apply to try it?

@PlanetAndMars (Author)

> After deploying, could you exec into the pod and manually run the k8s apply to try it?

You mean kubectl apply -f AppImage.yaml? Yes, I ran that.

@magic-hya (Contributor)

I ran into this problem too. I configured a private registry, but it still tries to pull the image from the local store and fails with the same error. How was this resolved?

@zimu-yuxi

1. The private registry setting is not usable yet.
2. In K8s RunP mode, Kuscia does not pull remote images. Please confirm that your deployment uses the kuscia-secretflow image.

@magic-hya (Contributor)

It is RunP. If remote images cannot be pulled, how is the image referenced by the AppImage supposed to be brought in?

@zimu-yuxi

> It is RunP. If remote images cannot be pulled, how is the image referenced by the AppImage supposed to be brought in?

The secretflow (sf) image is already baked into the kuscia-secretflow image at build time, so please confirm that your deployment uses the kuscia-secretflow image.
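
A quick way to check which image the deployment actually runs (a sketch; the deployment and namespace names come from the deployment.yaml at the top of this issue):

# print the container image of the alice autonomy deployment
kubectl -n autonomy-alice get deployment kuscia-autonomy-alice \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# expected: .../secretflow/kuscia-secretflow:<tag>, not the plain kuscia image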

@magic-hya (Contributor)

[screenshot: deployment spec showing the kuscia-secretflow image]
The image is indeed kuscia-secretflow, but doesn't a separate AppImage image have to be specified to run tasks later? Right now, when the task runs, it goes off to pull the image referenced inside the AppImage.


Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.
