Kuscia node reports a global datasource mismatch when a training-flow component pulls OSS data #461

Closed
libluecat opened this issue Nov 26, 2024 · 34 comments

Comments

@libluecat

Issue Type

Feature

Search for existing issues similar to yours

Yes

Kuscia Version

kuscia 0.10.0b0

Link to Relevant Documentation

No response

Question Details

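I registered an OSS datasource, uploaded the corresponding file to OSS, and registered the file information on the alice node. When this table is used in a training flow, the task fails with a global datasource ID mismatch. The full error is below: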
Seek to current after exception; nested exception is org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition ioc_event_info-0 at offset 11997. If needed, please seek past the record to continue consumption.2024-11-26 15:28:16,167|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='ygih-ktidpmea-node-3-0-global.alice.svc', ray_node_manager_port=29109, ray_object_manager_port=29110, ray_client_server_port=29111, ray_worker_ports=[], ray_gcs_port=29114)\n2024-11-26 15:28:16,176|alice|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at ygih-ktidpmea-node-3-0-global.alice.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=ygih-ktidpmea-node-3-0-global.alice.svc --port=29114 --node-manager-port=29109 --object-manager-port=29110 --ray-client-server-port=29111\n2024-11-26 15:28:18,957|alice|INFO|secretflow|entry.py:start_ray:80| 2024-11-26 15:28:16,807\tINFO usage_lib.py:423 -- Usage stats collection is disabled.\n2024-11-26 15:28:16,807\tINFO scripts.py:744 -- Local node IP: ygih-ktidpmea-node-3-0-global.alice.svc\n2024-11-26 15:28:18,817\tSUCC scripts.py:781 -- --------------------\n2024-11-26 15:28:18,817\tSUCC scripts.py:782 -- Ray runtime started.\n2024-11-26 15:28:18,817\tSUCC scripts.py:783 -- --------------------\n2024-11-26 15:28:18,817\tINFO scripts.py:785 -- Next steps\n2024-11-26 15:28:18,817\tINFO scripts.py:788 -- To add another node to this Ray cluster, run\n2024-11-26 15:28:18,817\tINFO scripts.py:791 -- ray start --address='ygih-ktidpmea-node-3-0-global.alice.svc:29114'\n2024-11-26 15:28:18,817\tINFO scripts.py:800 -- To connect to this Ray cluster:\n2024-11-26 15:28:18,817\tINFO scripts.py:802 -- import ray\n2024-11-26 15:28:18,817\tINFO scripts.py:803 -- ray.init(_node_ip_address='ygih-ktidpmea-node-3-0-global.alice.svc')\n2024-11-26 15:28:18,818\tINFO scripts.py:834 -- To terminate the Ray 
runtime, run\n2024-11-26 15:28:18,818\tINFO scripts.py:835 -- ray stop\n2024-11-26 15:28:18,818\tINFO scripts.py:838 -- To view the status of the cluster, use\n2024-11-26 15:28:18,818\tINFO scripts.py:839 -- ray status\n\n2024-11-26 15:28:18,958|alice|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at ygih-ktidpmea-node-3-0-global.alice.svc.\n2024-11-26 15:28:18,959|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly True\nsf_node_eval_param {\n \"domain\": \"data_prep\",\n \"name\": \"psi\",\n \"version\": \"0.0.5\",\n \"attrPaths\": [\n \"input/receiver_input/key\",\n \"input/sender_input/key\",\n \"protocol\",\n \"sort_result\",\n \"allow_duplicate_keys\",\n \"allow_duplicate_keys/no/skip_duplicates_check\",\n \"fill_value_int\",\n \"ecdh_curve\"\n ],\n \"attrs\": [\n {\n \"ss\": [\n \"id1\"\n ]\n },\n {\n \"ss\": [\n \"id2\"\n ]\n },\n {\n \"s\": \"PROTOCOL_KKRT\"\n },\n {\n \"b\": true\n },\n {\n \"s\": \"no\"\n },\n {\n \"isNa\": true\n },\n {\n \"isNa\": true\n },\n {\n \"s\": \"CURVE_FOURQ\"\n }\n ],\n \"inputs\": [\n {\n \"type\": \"sf.table.individual\",\n \"meta\": {\n \"@type\": \"type.googleapis.com/secretflow.spec.v1.IndividualTable\",\n \"lineCount\": \"-1\"\n },\n \"dataRefs\": [\n {\n \"uri\": \"csv/alice.csv\",\n \"party\": \"alice\",\n \"format\": \"csv\"\n }\n ]\n },\n {\n \"type\": \"sf.table.individual\",\n \"meta\": {\n \"@type\": \"type.googleapis.com/secretflow.spec.v1.IndividualTable\",\n \"lineCount\": \"-1\"\n },\n \"dataRefs\": [\n {\n \"uri\": \"csv/bob.csv\",\n \"party\": \"bob\",\n \"format\": \"csv\"\n }\n ]\n }\n ],\n \"checkpointUri\": \"ckygih-ktidpmea-node-3-output-0\"\n} \nTraceback (most recent call last):\n File \"/usr/local/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/local/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File 
\"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 547, in \u003cmodule\u003e\n main()\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 1078, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 514, in main\n sf_node_eval_param = preprocess_sf_node_eval_param(\n File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 290, in preprocess_sf_node_eval_param\n domaindata_id_to_dist_data(\n File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\", line 149, in domaindata_id_to_dist_data\n raise RuntimeError(\nRuntimeError: datasource_id of domain_data [isrdvqid] is oss-306740c5753cc10a6b951ddca3a7f788, which doesn't match global datasource_id default-data-source
@wangzul
Contributor

wangzul commented Nov 26, 2024

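Was the datasource registered through the all-in-one SecretPad deployment, or is Kuscia deployed on its own?
If it is all-in-one, please tell us the version.

Also, please resend the log file; the log in the issue details has formatting problems and is hard to read.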

@libluecat
Author

OK, I will send the log file.
The datasource was registered through SecretPad; the SecretPad version is 0.9.0b0.
alice.log

@wangzul
Contributor

wangzul commented Nov 26, 2024

You can follow this document to collect the pod log files from both parties' nodes.
https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.10.0b0/reference/troubleshoot/runjobfailed#id5

@libluecat
Author

I checked the details of the pod on the alice node; the error is the same as in the log file:
[root@root-kuscia-master-localhost-localdomain kuscia]# kubectl get kt ufwr-ohpnayoy-node-34 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
annotations:
kuscia.secretflow/job-id: ufwr
kuscia.secretflow/task-alias: ufwr-ohpnayoy-node-34
creationTimestamp: "2024-11-27T01:18:17Z"
generation: 1
labels:
kuscia.secretflow/controller: kuscia-job
kuscia.secretflow/job-uid: 1f023b45-7171-4279-b6ff-5f233086704c
name: ufwr-ohpnayoy-node-34
namespace: cross-domain
ownerReferences:

apiVersion: kuscia.secretflow/v1alpha1
blockOwnerDeletion: true
controller: true
kind: KusciaJob
name: ufwr
uid: 1f023b45-7171-4279-b6ff-5f233086704c
resourceVersion: "309495"
uid: 2241c42e-758c-432f-8b71-a8db67f40fd6
spec:
initiator: alice
parties:
appImageRef: secretflow-image
domainID: alice
template:
spec: {}
scheduleConfig: {}
taskInputConfig: |-
{
"sf_datasource_config": {
"alice": {
"id": "default-data-source"
}
},
"sf_cluster_desc": {
"parties": ["alice"],
"devices": [{
"name": "spu",
"type": "spu",
"parties": ["alice"],
"config": "{"runtime_config":{"protocol":"SEMI2K","field":"FM128"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
}, {
"name": "heu",
"type": "heu",
"parties": ["alice"],
"config": "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
}],
"ray_fed_config": {
"cross_silo_comm_backend": "brpc_link"
}
},
"sf_node_eval_param": {
"domain": "data_filter",
"name": "sample",
"version": "0.0.1",
"attr_paths": ["sample_algorithm", "sample_algorithm/random/frac", "sample_algorithm/random/random_state", "sample_algorithm/random/replacement"],
"attrs": [{
"is_na": false,
"s": "random"
}, {
"f": 0.8,
"is_na": false
}, {
"i64": 1024.0,
"is_na": false
}, {
"is_na": true
}],
"inputs": [{
"type": "sf.table.individual",
"meta": {
"@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
"line_count": "-1"
},
"data_refs": [{
"uri": "csv/alice.csv",
"party": "alice",
"format": "csv"
}]
}],
"checkpoint_uri": "ckufwr-ohpnayoy-node-34-output-0"
},
"sf_output_uris": ["ufwr-ohpnayoy-node-34-output-0", "ufwr-ohpnayoy-node-34-output-1"],
"sf_input_ids": ["cywrlufq"],
"sf_output_ids": ["ufwr-ohpnayoy-node-34-output-0", "ufwr-ohpnayoy-node-34-output-1"]
}
status:
allocatedPorts:
domainID: alice
namedPort:
ufwr-ohpnayoy-node-34-0/client-server: 21054
ufwr-ohpnayoy-node-34-0/fed: 21056
ufwr-ohpnayoy-node-34-0/global: 21057
ufwr-ohpnayoy-node-34-0/node-manager: 21058
ufwr-ohpnayoy-node-34-0/object-manager: 21059
ufwr-ohpnayoy-node-34-0/spu: 21055
completionTime: "2024-11-27T01:18:26Z"
conditions:
lastTransitionTime: "2024-11-27T01:18:17Z"
status: "True"
type: ResourceCreated
lastTransitionTime: "2024-11-27T01:18:20Z"
status: "True"
type: Running
lastTransitionTime: "2024-11-27T01:18:26Z"
status: "False"
type: Success
lastReconcileTime: "2024-11-27T01:18:26Z"
message: The remaining no-failed party task counts 0 are less than the threshold
1 that meets the conditions for task success. pending party[], running party[],
successful party[], failed party[alice]
partyTaskStatus:
domainID: alice
phase: Failed
phase: Failed
podStatuses:
alice/ufwr-ohpnayoy-node-34-0:
createTime: "2024-11-27T01:18:17Z"
namespace: alice
nodeName: root-kuscia-lite-alice-localhost-localdomain
podName: ufwr-ohpnayoy-node-34-0
podPhase: Failed
readyTime: "2024-11-27T01:18:20Z"
reason: Error
startTime: "2024-11-27T01:18:20Z"
terminationLog: 'container[secretflow] terminated state reason "Error", message:
"WARNING:root:Since the GPL-licensed package unidecode is not installed,
using Python''s unicodedata package which yields worse results.\n2024-11-27
01:18:22,927|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address=''ufwr-ohpnayoy-node-34-0-global.alice.svc'',
ray_node_manager_port=21058, ray_object_manager_port=21059, ray_client_server_port=21054,
ray_worker_ports=[], ray_gcs_port=21057)\n2024-11-27 01:18:22,927|alice|INFO|secretflow|entry.py:start_ray:67|
Trying to start ray head node at ufwr-ohpnayoy-node-34-0-global.alice.svc,
start command: ray start --head --include-dashboard=false --disable-usage-stats
--num-cpus=32 --node-ip-address=ufwr-ohpnayoy-node-34-0-global.alice.svc --port=21057
--node-manager-port=21058 --object-manager-port=21059 --ray-client-server-port=21054\n2024-11-27
01:18:25,524|alice|INFO|secretflow|entry.py:start_ray:80| 2024-11-27 01:18:23,477\tINFO
usage_lib.py:423 -- Usage stats collection is disabled.\n2024-11-27 01:18:23,477\tINFO
scripts.py:744 -- Local node IP: ufwr-ohpnayoy-node-34-0-global.alice.svc\n2024-11-27
01:18:25,378\tSUCC scripts.py:781 -- --------------------\n2024-11-27 01:18:25,380\tSUCC
scripts.py:782 -- Ray runtime started.\n2024-11-27 01:18:25,380\tSUCC scripts.py:783
-- --------------------\n2024-11-27 01:18:25,380\tINFO scripts.py:785 -- Next
steps\n2024-11-27 01:18:25,380\tINFO scripts.py:788 -- To add another node
to this Ray cluster, run\n2024-11-27 01:18:25,380\tINFO scripts.py:791 -- ray
start --address=''ufwr-ohpnayoy-node-34-0-global.alice.svc:21057''\n2024-11-27
01:18:25,380\tINFO scripts.py:800 -- To connect to this Ray cluster:\n2024-11-27
01:18:25,380\tINFO scripts.py:802 -- import ray\n2024-11-27 01:18:25,380\tINFO
scripts.py:803 -- ray.init(_node_ip_address=''ufwr-ohpnayoy-node-34-0-global.alice.svc'')\n2024-11-27
01:18:25,380\tINFO scripts.py:834 -- To terminate the Ray runtime, run\n2024-11-27
01:18:25,380\tINFO scripts.py:835 -- ray stop\n2024-11-27 01:18:25,380\tINFO
scripts.py:838 -- To view the status of the cluster, use\n2024-11-27 01:18:25,381\tINFO
scripts.py:839 -- ray status\n\n2024-11-27 01:18:25,524|alice|INFO|secretflow|entry.py:start_ray:81|
Succeeded to start ray head node at ufwr-ohpnayoy-node-34-0-global.alice.svc.\n2024-11-27
01:18:25,524|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly
True\nsf_node_eval_param {\n "domain": "data_filter",\n "name": "sample",\n "version":
"0.0.1",\n "attrPaths": [\n "sample_algorithm",\n "sample_algorithm/random/frac",\n "sample_algorithm/random/random_state",\n "sample_algorithm/random/replacement"\n ],\n "attrs":
[\n {\n "s": "random"\n },\n {\n "f": 0.8\n },\n {\n "i64":
"1024"\n },\n {\n "isNa": true\n }\n ],\n "inputs":
[\n {\n "type": "sf.table.individual",\n "meta": {\n "@type":
"type.googleapis.com/secretflow.spec.v1.IndividualTable",\n "lineCount":
"-1"\n },\n "dataRefs": [\n {\n "uri": "csv/alice.csv",\n "party":
"alice",\n "format": "csv"\n }\n ]\n }\n ],\n "checkpointUri":
"ckufwr-ohpnayoy-node-34-output-0"\n} \nTraceback (most recent call last):\n File
"/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main\n return
_run_code(code, main_globals, None,\n File "/usr/local/lib/python3.10/runpy.py",
line 86, in _run_code\n exec(code, run_globals)\n File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py",
line 547, in \n main()\n File "/usr/local/lib/python3.10/site-packages/click/core.py",
line 1157, in call\n return self.main(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/click/core.py",
line 1078, in main\n rv = self.invoke(ctx)\n File "/usr/local/lib/python3.10/site-packages/click/core.py",
line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File
"/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke\n return
__callback(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py",
line 514, in main\n sf_node_eval_param = preprocess_sf_node_eval_param(\n File
"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line
290, in preprocess_sf_node_eval_param\n domaindata_id_to_dist_data(\n File
"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line
149, in domaindata_id_to_dist_data\n raise RuntimeError(\nRuntimeError:
datasource_id of domain_data [cywrlufq] is oss-9e32b5fcc9a93d46bc72f444761d42b6,
which doesn''t match global datasource_id default-data-source\n"'
serviceStatuses:
alice/ufwr-ohpnayoy-node-34-0-fed:
createTime: "2024-11-27T01:18:17Z"
namespace: alice
portName: fed
portNumber: 21056
readyTime: "2024-11-27T01:18:20Z"
scope: Cluster
serviceName: ufwr-ohpnayoy-node-34-0-fed
alice/ufwr-ohpnayoy-node-34-0-global:
createTime: "2024-11-27T01:18:17Z"
namespace: alice
portName: global
portNumber: 21057
readyTime: "2024-11-27T01:18:20Z"
scope: Domain
serviceName: ufwr-ohpnayoy-node-34-0-global
alice/ufwr-ohpnayoy-node-34-0-spu:
createTime: "2024-11-27T01:18:17Z"
namespace: alice
portName: spu
portNumber: 21055
readyTime: "2024-11-27T01:18:20Z"
scope: Cluster
serviceName: ufwr-ohpnayoy-node-34-0-spu
startTime: "2024-11-27T01:18:17Z"

@wangzul
Contributor

wangzul commented Nov 27, 2024

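Does the other party's node log show the same error?
Please try this: on the SecretPad task execution page there is a settings button in the upper-right corner. Check whether the default datasource configured there is the OSS datasource you created. If it is not, select the OSS datasource and rerun the training flow.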

@wangzul
Contributor

wangzul commented Nov 27, 2024

Has this problem been resolved?

@libluecat
Author

The problem is resolved. The cause was that the default datasource had not been selected when the task was run.
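The mismatch can be checked by comparing the datasource ID recorded on the input table with the one in the task's `sf_datasource_config`. The sketch below only illustrates that comparison, assuming the `taskInputConfig` JSON shape shown in the KusciaTask output earlier in this thread; `check_datasource_match` and its arguments are hypothetical names, not the SecretFlow or Kuscia API.

```python
import json


def check_datasource_match(task_input_config: str, party: str,
                           domain_data_id: str,
                           domain_data_datasource_id: str) -> None:
    """Raise if the table's datasource differs from the task-level one.

    Hypothetical helper: it mirrors the comparison described by the error
    message in this thread, not the actual SecretFlow implementation.
    """
    config = json.loads(task_input_config)
    # "sf_datasource_config" maps each party to the datasource chosen for the run.
    global_id = config["sf_datasource_config"][party]["id"]
    if domain_data_datasource_id != global_id:
        raise RuntimeError(
            f"datasource_id of domain_data [{domain_data_id}] is "
            f"{domain_data_datasource_id}, which doesn't match "
            f"global datasource_id {global_id}"
        )


# IDs copied from the failing task in this thread.
failing_config = '{"sf_datasource_config": {"alice": {"id": "default-data-source"}}}'
try:
    check_datasource_match(failing_config, "alice", "cywrlufq",
                           "oss-9e32b5fcc9a93d46bc72f444761d42b6")
except RuntimeError as err:
    print(err)
```

In this thread the reported fix was to select the OSS datasource in the SecretPad task settings, so that the task-level ID presumably matches the one on the table.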

@libluecat
Author

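There is one more problem, though. In the joint audience-selection (联合圈人) training flow, after the PSI component succeeds, the full-table statistics component fails. The alice node reports no error, but the bob node does. The datasource in use is the OSS datasource (running the same flow with the alice.csv and bob.csv that already exist on the nodes succeeds). The error says the column names of bob.csv do not match, but I have checked and the bob.csv input is correct: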
message: "... Ignore 32995 characters at the beginning ...\nfunction_manager.py",
line 726, in actor_method_executor\n return method(__ray_actor, *args,
**kwargs)\n File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py",
line 76, in wrapper\n return method(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py",
line 63, in append_data\n source = source(**kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py",
line 64, in read_csv_wrapper\n df = _read_csv(filepath, read_backend,
**kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py",
line 43, in _read_csv\n return read_pandas_csv(_filepath, *_args, **_kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py",
line 38, in read_pandas_csv\n df = pd.read_csv(filepath, *args, **kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py",
line 211, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py",
line 331, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 950, in read_csv\n return _read(filepath_or_buffer, kwds)\n File
"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 605, in _read\n parser = TextFileReader(filepath_or_buffer, **kwds)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 1442, in init\n self._engine = self._make_engine(f, self.engine)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 1753, in _make_engine\n return mapping[engine](f, **self.options)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py",
line 135, in init\n self._validate_usecols_names(usecols, self.orig_names)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py",
line 917, in _validate_usecols_names\n raise ValueError(\nValueError:
Usecols do not match columns, columns expected but not found: ['id2', 'poutcome_success',
'contact_cellular', 'month_jul', 'month_mar', 'contact_telephone', 'month_feb',
'month_oct', 'month_sep', 'month_jun', 'poutcome_unknown', 'month_jan',
'month_may', 'month_apr', 'month_aug', 'month_nov', 'poutcome_other', 'contact_unknown',
'y', 'month_dec', 'poutcome_failure']\n\nDuring handling of the above exception,
another exception occurred:\n\nTraceback (most recent call last):\n File
"python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler\n
\ File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler\n
\ File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task\n
\ File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task\n
\ File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task\n
\ File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors\n
\ File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py",
line 76, in wrapper\n return method(*args, **kwargs)\nTypeError: PartitionAgent.len()
missing 1 required positional argument: 'idx'\nAn unexpected internal error
occurred while the worker was executing a task.\n\e[33m(raylet)\e[0m A worker
died or was killed while executing a task by an unexpected system error.
To troubleshoot the problem, check the logs for the dead worker. RayTask
ID: ffffffffffffffffe3e0adf5b9f760ef784a412901000000 Worker ID: b934ae8491d71f2f464479014f7f1bee9e814a399ecd79b7d395e7d9
Node ID: 5f40fea3cd4c6f60615226f3002bc4deb4a646670173085c955c360d Worker
IP address: hxck-ogsjogsg-node-4-0-global.bob.svc Worker port: 10037 Worker
PID: 181051 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits
unexpectedly. Worker exits with an exit code None. The worker may have exceeded
K8s pod memory limits. Traceback (most recent call last):\n File "python/ray/_raylet.pyx",
line 1807, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
line 1908, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
line 1813, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
line 1754, in ray._raylet.execute_task.function_executor\n File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py",
line 726, in actor_method_executor\n return method(__ray_actor, *args,
**kwargs)\n File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py",
line 76, in wrapper\n return method(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py",
line 63, in append_data\n source = source(**kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py",
line 64, in read_csv_wrapper\n df = _read_csv(filepath, read_backend,
**kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py",
line 43, in _read_csv\n return read_pandas_csv(_filepath, *_args, **_kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py",
line 38, in read_pandas_csv\n df = pd.read_csv(filepath, *args, **kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py",
line 211, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py",
line 331, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 950, in read_csv\n return _read(filepath_or_buffer, kwds)\n File
"/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 605, in _read\n parser = TextFileReader(filepath_or_buffer, **kwds)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 1442, in init\n self._engine = self._make_engine(f, self.engine)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
line 1753, in _make_engine\n return mapping[engine](f, **self.options)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py",
line 135, in init\n self._validate_usecols_names(usecols, self.orig_names)\n
\ File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py",
line 917, in _validate_usecols_names\n raise ValueError(\nValueError:
Usecols do not match columns, columns expected but not found: ['id2', 'poutcome_success',
'contact_cellular', 'month_jul', 'month_mar', 'contact_telephone', 'month_feb',
'month_oct', 'month_sep', 'month_jun', 'poutcome_unknown', 'month_jan',
'month_may', 'month_apr', 'month_aug', 'month_nov', 'poutcome_other', 'contact_unknown',
'y', 'month_dec', 'poutcome_failure']\n\nDuring handling of the above exception,
another exception occurred:\n\nTraceback (most recent call last):\n File
"python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler\n
\ File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler\n
\ File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task\n
\ File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task\n
\ File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task\n
\ File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors\n
\ File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n
\ File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py",
line 76, in wrapper\n return method(*args, **kwargs)\nTypeError: PartitionAgent.len()
missing 1 required positional argument: 'idx'\nAn unexpected internal error
occurred while the worker was executing a task.\n"

@wangzul
Contributor

wangzul commented Nov 28, 2024

The log points to these columns of bob:
['id2', 'poutcome_success',
'contact_cellular', 'month_jul', 'month_mar', 'contact_telephone', 'month_feb',
'month_oct', 'month_sep', 'month_jun', 'poutcome_unknown', 'month_jan',
'month_may', 'month_apr', 'month_aug', 'month_nov', 'poutcome_other', 'contact_unknown',
'y', 'month_dec', 'poutcome_failure']

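Please collect the node-4 log from the /home/kuscia/var/stdout/pods directory, located by task ID (shown under "Records and results" in the upper-right corner of SecretPad), so we can take a look.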
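pandas raises `Usecols do not match columns` when names passed through `usecols` are not found in the header it parsed. A quick local check, sketched here with only the standard library, is to read the first row of the file and list which expected names are absent. `missing_columns` is a hypothetical helper, and the sample data is made up for illustration.

```python
import csv
import io
from typing import Iterable, List, TextIO


def missing_columns(csv_file: TextIO, expected: Iterable[str]) -> List[str]:
    """Return the expected column names that are absent from the CSV header."""
    # The first row is treated as the header; an empty file yields no columns.
    header = next(csv.reader(csv_file), [])
    present = set(header)
    return [name for name in expected if name not in present]


# Tiny in-memory stand-in for bob.csv; a real file would be opened
# with open(path, newline="") instead.
sample = io.StringIO("id2,y,month_jan\n1,0,1\n")
print(missing_columns(sample, ["id2", "y", "month_feb"]))  # month_feb is not in the header
```

Because the error lists every requested column, including `id2`, printing the header that is actually read may help show whether the file being opened is the expected one.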

@libluecat
Author

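I downloaded the table that had been uploaded to OSS, re-registered it in SecretPad, and ran the joint audience-selection template again. It still fails with the column name mismatch (the output table of the PSI component does contain the corresponding columns). Below is part of the node-4 log: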
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
TypeError: PartitionAgent.len() missing 1 required positional argument: 'idx'
An unexpected internal error occurred while the worker was executing a task.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffbd7da54608022a4fc35f582c01000000 Worker ID: d19bc8bc8a314dbe97e15b45a625a18d7b2cad6c62003921f131ec07 Node ID: 8e5ba021b8a903d756b07b3075399cb668510129bd2afd3c198df1bb Worker IP address: umfz-dmtjvuud-node-4-0-global.bob.svc Worker port: 10037 Worker PID: 213437 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits. Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1807, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1908, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py", line 63, in append_data
source = source(**kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 64, in read_csv_wrapper
df = _read_csv(filepath, read_backend, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 43, in _read_csv
return read_pandas_csv(_filepath, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py", line 38, in read_pandas_csv
df = pd.read_csv(filepath, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in init
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 135, in init
self._validate_usecols_names(usecols, self.orig_names)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 917, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['contact_unknown', 'id2', 'month_jun', 'month_aug', 'month_dec', 'contact_telephone', 'month_apr', 'contact_cellular', 'month_jul', 'y', 'poutcome_other', 'month_may', 'poutcome_unknown', 'month_nov', 'month_sep', 'month_mar', 'poutcome_failure', 'month_oct', 'month_jan', 'month_feb', 'poutcome_success']

@wangzul
Contributor

wangzul commented Nov 28, 2024

把上传到oss的表下载下来重新注册到secretpad中,再次运行了联合圈人模板,报的还是列名不匹配问题(查看隐私求交的输出表,对应列是有的),以下是node-4中的部分日志: During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span return method(self, *_args, **_kwargs) File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper return method(*args, **kwargs) TypeError: PartitionAgent.len() missing 1 required positional argument: 'idx' An unexpected internal error occurred while the worker was executing a task. (raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffbd7da54608022a4fc35f582c01000000 Worker ID: d19bc8bc8a314dbe97e15b45a625a18d7b2cad6c62003921f131ec07 Node ID: 8e5ba021b8a903d756b07b3075399cb668510129bd2afd3c198df1bb Worker IP address: umfz-dmtjvuud-node-4-0-global.bob.svc Worker port: 10037 Worker PID: 213437 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits. 
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1807, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1908, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py", line 63, in append_data
source = source(**kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 64, in read_csv_wrapper
df = _read_csv(filepath, read_backend, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 43, in _read_csv
return read_pandas_csv(_filepath, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py", line 38, in read_pandas_csv
df = pd.read_csv(filepath, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in init
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 135, in init
self._validate_usecols_names(usecols, self.orig_names)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 917, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['contact_unknown', 'id2', 'month_jun', 'month_aug', 'month_dec', 'contact_telephone', 'month_apr', 'contact_cellular', 'month_jul', 'y', 'poutcome_other', 'month_may', 'poutcome_unknown', 'month_nov', 'month_sep', 'month_mar', 'poutcome_failure', 'month_oct', 'month_jan', 'month_feb', 'poutcome_success']
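
The `ValueError` that ends this traceback is raised by pandas when the names passed as `usecols` are not all present in the CSV header, which suggests the file this party read does not contain the columns it was expected to own. A minimal sketch of a header check follows. It uses only the standard-library `csv` module rather than the SecretFlow or pandas code path, and the sample data, column subset, and helper name `missing_columns` are hypothetical:

```python
import csv
import io


def missing_columns(csv_text, wanted):
    """Return the wanted column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    present = set(header)
    return [name for name in wanted if name not in present]


# Hypothetical stand-in for the file one party actually reads: it holds only
# columns from the other party's half of the vertical table.
sample_csv = "id1,age,education\n1,30,2\n2,41,3\n"

# Hypothetical subset of the columns this party is expected to own.
wanted_columns = ["id2", "month_apr", "y"]

missing = missing_columns(sample_csv, wanted_columns)
print(missing)
```

Running the same kind of check against the first line of the object actually stored under each party's URI would show whether the expected columns are there before the component tries to load them.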

Please provide the complete log.

@libluecat
Author

sh-5.2# cat bob_umfz-dmtjvuud-node-4-0_b3129c51-9b78-41ea-94ff-69215f2de14a/secretflow/0.log
WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-11-28 17:56:26,182|bob|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='umfz-dmtjvuud-node-4-0-global.bob.svc', ray_node_manager_port=25934, ray_object_manager_port=25935, ray_client_server_port=25930, ray_worker_ports=[], ray_gcs_port=25933)
2024-11-28 17:56:26,190|bob|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at umfz-dmtjvuud-node-4-0-global.bob.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=umfz-dmtjvuud-node-4-0-global.bob.svc --port=25933 --node-manager-port=25934 --object-manager-port=25935 --ray-client-server-port=25930
2024-11-28 17:56:29,989|bob|INFO|secretflow|entry.py:start_ray:80| 2024-11-28 17:56:26,818 INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-11-28 17:56:26,818 INFO scripts.py:744 -- Local node IP: umfz-dmtjvuud-node-4-0-global.bob.svc
2024-11-28 17:56:29,844 SUCC scripts.py:781 -- --------------------
2024-11-28 17:56:29,844 SUCC scripts.py:782 -- Ray runtime started.
2024-11-28 17:56:29,844 SUCC scripts.py:783 -- --------------------
2024-11-28 17:56:29,844 INFO scripts.py:785 -- Next steps
2024-11-28 17:56:29,845 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-11-28 17:56:29,845 INFO scripts.py:791 -- ray start --address='umfz-dmtjvuud-node-4-0-global.bob.svc:25933'
2024-11-28 17:56:29,845 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-11-28 17:56:29,845 INFO scripts.py:802 -- import ray
2024-11-28 17:56:29,845 INFO scripts.py:803 -- ray.init(_node_ip_address='umfz-dmtjvuud-node-4-0-global.bob.svc')
2024-11-28 17:56:29,845 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-11-28 17:56:29,845 INFO scripts.py:835 -- ray stop
2024-11-28 17:56:29,845 INFO scripts.py:838 -- To view the status of the cluster, use
2024-11-28 17:56:29,845 INFO scripts.py:839 -- ray status

2024-11-28 17:56:29,990|bob|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at umfz-dmtjvuud-node-4-0-global.bob.svc.
2024-11-28 17:56:29,994|bob|INFO|secretflow|entry.py:main:510| datasource.access_directly True
sf_node_eval_param {
"domain": "stats",
"name": "table_statistics",
"version": "0.0.2",
"attrPaths": [
"input/input_data/features"
],
"attrs": [
{
"ss": [
"id1",
"id2",
"age",
"education",
"default",
"balance",
"housing",
"loan",
"day",
"duration",
"campaign",
"pdays",
"previous",
"job_blue-collar",
"job_entrepreneur",
"job_housemaid",
"job_management",
"job_retired",
"job_self-employed",
"job_services",
"job_student",
"job_technician",
"job_unemployed",
"marital_divorced",
"marital_married",
"marital_single",
"contact_cellular",
"contact_telephone",
"contact_unknown",
"month_apr",
"month_aug",
"month_dec",
"month_feb",
"month_jan",
"month_jul",
"month_jun",
"month_mar",
"month_may",
"month_nov",
"month_oct",
"month_sep",
"poutcome_failure",
"poutcome_other",
"poutcome_success",
"poutcome_unknown",
"y"
]
}
],
"checkpointUri": "ckumfz-dmtjvuud-node-4-output-0"
}
2024-11-28 17:56:30,011|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id umfz-dmtjvuud-node-3-output-0 to
...........
name: "umfz-dmtjvuud-node-3-output-0"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003int*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003int*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "umfz-dmtjvuud-node-3-output-0"
party: "alice"
format: "csv"
}
data_refs {
uri: "umfz-dmtjvuud-node-3-output-0"
party: "bob"
format: "csv"
}

....
2024-11-28 17:56:30,012|bob|WARNING|secretflow|entry.py:comp_eval:169|

Secretflow 1.7.0b0
Build time (Jun 25 2024, 11:25:31) with commit id: d08547cb86d07d5515e8b997236fad81972cdef7

2024-11-28 17:56:30,012|bob|WARNING|secretflow|entry.py:comp_eval:170|

param

domain: "stats"
name: "table_statistics"
version: "0.0.2"
attr_paths: "input/input_data/features"
attrs {
ss: "id1"
ss: "id2"
ss: "age"
ss: "education"
ss: "default"
ss: "balance"
ss: "housing"
ss: "loan"
ss: "day"
ss: "duration"
ss: "campaign"
ss: "pdays"
ss: "previous"
ss: "job_blue-collar"
ss: "job_entrepreneur"
ss: "job_housemaid"
ss: "job_management"
ss: "job_retired"
ss: "job_self-employed"
ss: "job_services"
ss: "job_student"
ss: "job_technician"
ss: "job_unemployed"
ss: "marital_divorced"
ss: "marital_married"
ss: "marital_single"
ss: "contact_cellular"
ss: "contact_telephone"
ss: "contact_unknown"
ss: "month_apr"
ss: "month_aug"
ss: "month_dec"
ss: "month_feb"
ss: "month_jan"
ss: "month_jul"
ss: "month_jun"
ss: "month_mar"
ss: "month_may"
ss: "month_nov"
ss: "month_oct"
ss: "month_sep"
ss: "poutcome_failure"
ss: "poutcome_other"
ss: "poutcome_success"
ss: "poutcome_unknown"
ss: "y"
}
inputs {
name: "umfz-dmtjvuud-node-3-output-0"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003int*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003int*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "umfz-dmtjvuud-node-3-output-0"
party: "alice"
format: "csv"
}
data_refs {
uri: "umfz-dmtjvuud-node-3-output-0"
party: "bob"
format: "csv"
}
}
output_uris: "umfz-dmtjvuud-node-4-output-0"
checkpoint_uri: "ckumfz-dmtjvuud-node-4-output-0"

--

2024-11-28 17:56:30,012|bob|WARNING|secretflow|entry.py:comp_eval:171|

storage_config

type: "s3"
s3 {
endpoint: "http://oss-cn-beijing.aliyuncs.com"
bucket: "secretpad"
access_key_id: "<redacted>"
access_key_secret: "<redacted>"
virtual_host: true
}

--

2024-11-28 17:56:30,013|bob|WARNING|secretflow|entry.py:comp_eval:172|

cluster_config

desc {
parties: "bob"
parties: "alice"
devices {
name: "spu"
type: "spu"
parties: "bob"
parties: "alice"
config: "{"runtime_config":{"protocol":"SEMI2K","field":"FM128"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
}
devices {
name: "heu"
type: "heu"
parties: "bob"
parties: "alice"
config: "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
}
ray_fed_config {
cross_silo_comm_backend: "brpc_link"
}
}
public_config {
ray_fed_config {
parties: "bob"
parties: "alice"
addresses: "0.0.0.0:25932"
addresses: "umfz-dmtjvuud-node-4-0-fed.alice.svc:80"
}
spu_configs {
name: "spu"
parties: "bob"
parties: "alice"
addresses: "0.0.0.0:25931"
addresses: "http://umfz-dmtjvuud-node-4-0-spu.alice.svc:80"
}
}
private_config {
self_party: "bob"
ray_head_addr: "umfz-dmtjvuud-node-4-0-global.bob.svc:25933"
}

--

2024-11-28 17:56:30,014|bob|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-11-28 17:56:30,015 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: umfz-dmtjvuud-node-4-0-global.bob.svc:25933...
2024-11-28 17:56:30,028|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140700038920224 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/node_ip_address.json.lock
2024-11-28 17:56:30,029|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140700038920224 acquired on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/node_ip_address.json.lock
2024-11-28 17:56:30,029|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140700038920224 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/node_ip_address.json.lock
2024-11-28 17:56:30,029|bob|DEBUG|secretflow|_api.py:release:367| Lock 140700038920224 released on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/node_ip_address.json.lock
2024-11-28 17:56:30,034|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140700038920320 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,034|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140700038920320 acquired on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,035|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140700038920320 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,035|bob|DEBUG|secretflow|_api.py:release:367| Lock 140700038920320 released on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,035|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140700038920080 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,035|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140700038920080 acquired on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,036|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140700038920080 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,036|bob|DEBUG|secretflow|_api.py:release:367| Lock 140700038920080 released on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,036|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140700038920320 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,036|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140700038920320 acquired on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,037|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140700038920320 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,037|bob|DEBUG|secretflow|_api.py:release:367| Lock 140700038920320 released on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,037|bob|DEBUG|secretflow|_api.py:acquire:331| Attempting to acquire lock 140700038920080 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,037|bob|DEBUG|secretflow|_api.py:acquire:334| Lock 140700038920080 acquired on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,038|bob|DEBUG|secretflow|_api.py:release:364| Attempting to release lock 140700038920080 on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,038|bob|DEBUG|secretflow|_api.py:release:367| Lock 140700038920080 released on /tmp/ray/session_2024-11-28_17-56-26_818722_210987/ports_by_node.json.lock
2024-11-28 17:56:30,038 INFO worker.py:1724 -- Connected to Ray cluster.
2024-11-28 17:56:30.999 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'bob': '0.0.0.0:25932', 'alice':'http://umfz-dmtjvuud-node-4-0-fed.alice.svc:80'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(raylet) [2024-11-28 17:56:31,493 I 213025 213025] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=213025) 2024-11-28 17:56:32.063 INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=213025) I1128 17:56:32.089287 213025 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=25932.
(SenderReceiverProxyActor pid=213025) W1128 17:56:32.089336 213025 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=213025) I1128 17:56:32.723458 213247 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20241128.175632.213025/id.db and ./rpc_data/rpcz/20241128.175632.213025/time.db
2024-11-28 17:56:33.142 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-11-28 17:56:33.143 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
(_run pid=211429) WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
2024-11-28 17:56:35.647 INFO data_utils.py:391 [bob] -- [Anonymous_job] try load VDataFrame, file uri {'alice': DistdataInfo(uri='umfz-dmtjvuud-node-3-output-0', format='csv'), 'bob': DistdataInfo(uri='umfz-dmtjvuud-node-3-output-0', format='csv')}, file meta {PYURuntime(alice): {'LastModified': datetime.datetime(2024, 11, 28, 9, 56, 17, tzinfo=tzutc()), 'size': 4717320, 'ETag': '"FCA27E85FF5F4CCD75F114CFAFAE479B"'}, PYURuntime(bob): {'LastModified': datetime.datetime(2024, 11, 28, 9, 56, 17, tzinfo=tzutc()), 'size': 4717320, 'ETag': '"FCA27E85FF5F4CCD75F114CFAFAE479B"'}}
2024-11-28 17:56:35.647 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.data.core.agent.PartitionAgent'> with party alice.
2024-11-28 17:56:35.648 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.data.core.agent.PartitionAgent'> with party bob.
(raylet) [2024-11-28 17:56:36,242 I 213437 213437] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
2024-11-28 17:56:39.045 ERROR component.py:1130 [bob] -- [Anonymous_job] eval on domain: "stats"
name: "table_statistics"
version: "0.0.2"
attr_paths: "input/input_data/features"
attrs {
ss: "id1"
ss: "id2"
ss: "age"
ss: "education"
ss: "default"
ss: "balance"
ss: "housing"
ss: "loan"
ss: "day"
ss: "duration"
ss: "campaign"
ss: "pdays"
ss: "previous"
ss: "job_blue-collar"
ss: "job_entrepreneur"
ss: "job_housemaid"
ss: "job_management"
ss: "job_retired"
ss: "job_self-employed"
ss: "job_services"
ss: "job_student"
ss: "job_technician"
ss: "job_unemployed"
ss: "marital_divorced"
ss: "marital_married"
ss: "marital_single"
ss: "contact_cellular"
ss: "contact_telephone"
ss: "contact_unknown"
ss: "month_apr"
ss: "month_aug"
ss: "month_dec"
ss: "month_feb"
ss: "month_jan"
ss: "month_jul"
ss: "month_jun"
ss: "month_mar"
ss: "month_may"
ss: "month_nov"
ss: "month_oct"
ss: "month_sep"
ss: "poutcome_failure"
ss: "poutcome_other"
ss: "poutcome_success"
ss: "poutcome_unknown"
ss: "y"
}
inputs {
name: "umfz-dmtjvuud-node-3-output-0"
type: "sf.table.vertical_table"
system_info {
}
meta {
type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable"
value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single"\003int*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y"\003int*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M"
}
data_refs {
uri: "umfz-dmtjvuud-node-3-output-0"
party: "alice"
format: "csv"
}
data_refs {
uri: "umfz-dmtjvuud-node-3-output-0"
party: "bob"
format: "csv"
}
}
output_uris: "umfz-dmtjvuud-node-4-output-0"
checkpoint_uri: "ckumfz-dmtjvuud-node-4-output-0"
failed, error <The actor died unexpectedly before finishing this task.
class_name: PartitionAgent
actor_id: bd7da54608022a4fc35f582c01000000
pid: 213437
namespace: 70ff39a7-1fe7-4de6-a349-8c12bcc41a90
ip: umfz-dmtjvuud-node-4-0-global.bob.svc
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits. Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1807, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1908, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py", line 63, in append_data
source = source(**kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 64, in read_csv_wrapper
df = _read_csv(filepath, read_backend, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 43, in _read_csv
return read_pandas_csv(_filepath, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py", line 38, in read_pandas_csv
df = pd.read_csv(filepath, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in init
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 135, in init
self._validate_usecols_names(usecols, self.orig_names)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 917, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['contact_unknown', 'id2', 'month_jun', 'month_aug', 'month_dec', 'contact_telephone', 'month_apr', 'contact_cellular', 'month_jul', 'y', 'poutcome_other', 'month_may', 'poutcome_unknown', 'month_nov', 'month_sep', 'month_mar', 'poutcome_failure', 'month_oct', 'month_jan', 'month_feb', 'poutcome_success']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
TypeError: PartitionAgent.len() missing 1 required positional argument: 'idx'
An unexpected internal error occurred while the worker was executing a task.>
2024-11-28 17:56:39.046 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...
2024-11-28 17:56:39.046 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.
2024-11-28 17:56:39.049 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-11-28 17:56:39.049 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-11-28 17:56:39.049 INFO api.py:384 [bob] -- [Anonymous_job] Shutdowned rayfed.
(PartitionAgent pid=213437) WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in
main()
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 176, in comp_eval
res = comp.eval(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1132, in eval
raise e from None
File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1127, in eval
ret = self.__eval_callback(ctx=ctx, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/component/stats/table_statistics.py", line 135, in table_statistics_eval_fn
input_df = load_table(
File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 394, in load_table
vdf = read_csv(filepaths, dtypes=dtypes, nrows=nrows, converters=converters)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/vertical/io.py", line 191, in read_csv
parties_length[device.party] = len(part)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/partition.py", line 152, in len
return reveal(self.part_agent.len(self.agent_idx))
File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
all_object = sfd.get(all_object_refs)
File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
return fed.get(object_refs)
File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
values = ray.get(ray_refs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2626, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PartitionAgent
actor_id: bd7da54608022a4fc35f582c01000000
pid: 213437
namespace: 70ff39a7-1fe7-4de6-a349-8c12bcc41a90
ip: umfz-dmtjvuud-node-4-0-global.bob.svc
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits. Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1807, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1908, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py", line 63, in append_data
source = source(**kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 64, in read_csv_wrapper
df = _read_csv(filepath, read_backend, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 43, in _read_csv
return read_pandas_csv(_filepath, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py", line 38, in read_pandas_csv
df = pd.read_csv(filepath, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in init
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 135, in init
self._validate_usecols_names(usecols, self.orig_names)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 917, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['contact_unknown', 'id2', 'month_jun', 'month_aug', 'month_dec', 'contact_telephone', 'month_apr', 'contact_cellular', 'month_jul', 'y', 'poutcome_other', 'month_may', 'poutcome_unknown', 'month_nov', 'month_sep', 'month_mar', 'poutcome_failure', 'month_oct', 'month_jan', 'month_feb', 'poutcome_success']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
TypeError: PartitionAgent.len() missing 1 required positional argument: 'idx'
An unexpected internal error occurred while the worker was executing a task.
(raylet) Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1807, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1908, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py", line 63, in append_data
source = source(**kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 64, in read_csv_wrapper
df = _read_csv(filepath, read_backend, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 43, in _read_csv
return read_pandas_csv(_filepath, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py", line 38, in read_pandas_csv
df = pd.read_csv(filepath, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in init
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 135, in init
self._validate_usecols_names(usecols, self.orig_names)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 917, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['contact_unknown', 'id2', 'month_jun', 'month_aug', 'month_dec', 'contact_telephone', 'month_apr', 'contact_cellular', 'month_jul', 'y', 'poutcome_other', 'month_may', 'poutcome_unknown', 'month_nov', 'month_sep', 'month_mar', 'poutcome_failure', 'month_oct', 'month_jan', 'month_feb', 'poutcome_success']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
TypeError: PartitionAgent.len() missing 1 required positional argument: 'idx'
An unexpected internal error occurred while the worker was executing a task.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffbd7da54608022a4fc35f582c01000000 Worker ID: d19bc8bc8a314dbe97e15b45a625a18d7b2cad6c62003921f131ec07 Node ID: 8e5ba021b8a903d756b07b3075399cb668510129bd2afd3c198df1bb Worker IP address: umfz-dmtjvuud-node-4-0-global.bob.svc Worker port: 10037 Worker PID: 213437 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits. Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1807, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1908, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1813, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py", line 63, in append_data
source = source(**kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 64, in read_csv_wrapper
df = _read_csv(filepath, read_backend, **kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line 43, in _read_csv
return read_pandas_csv(_filepath, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py", line 38, in read_pandas_csv
df = pd.read_csv(filepath, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in init
self._engine = self._make_engine(f, self.engine)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 135, in init
self._validate_usecols_names(usecols, self.orig_names)
File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 917, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['contact_unknown', 'id2', 'month_jun', 'month_aug', 'month_dec', 'contact_telephone', 'month_apr', 'contact_cellular', 'month_jul', 'y', 'poutcome_other', 'month_may', 'poutcome_unknown', 'month_nov', 'month_sep', 'month_mar', 'poutcome_failure', 'month_oct', 'month_jan', 'month_feb', 'poutcome_success']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2102, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1756, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1757, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1995, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1055, in ray._raylet.store_task_errors
File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line 76, in wrapper
return method(*args, **kwargs)
TypeError: PartitionAgent.len() missing 1 required positional argument: 'idx'
An unexpected internal error occurred while the worker was executing a task.

@wangzul
Contributor

wangzul commented Nov 28, 2024

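Please post a screenshot of the parameter configuration of the PSI (private set intersection) component in secretpad so I can take a look.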

@libluecat
Author

1732793470818

@wangzul
Contributor

wangzul commented Nov 29, 2024

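I need to confirm your engine versions.

  1. kuscia version = 0.10
  2. SecretPad version = 0.9
    According to the SecretFlow version mapping, secretflow should be 1.8.0b0.
    I would like to confirm your secretflow version [check it by running this command in the kuscia container (center/autonomy): kubectl get app image secret flow-image -oyaml]

Please also confirm the state of the training flow: does it use a template, or did you assemble it yourself by drag and drop? Please share the training flow's connection configuration.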

@libluecat
Author

The deployment followed the kuscia/secretflow version mapping, so it is 1.8.0b0.

Running the command in the container produced the following log:
sh-5.2# kubectl get app image secret flow-image -oyaml
E1129 18:02:32.857267 236674 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1129 18:02:32.858767 236674 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1129 18:02:32.859674 236674 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1129 18:02:32.862079 236674 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1129 18:02:32.862979 236674 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?

@wangzul
Contributor

wangzul commented Nov 29, 2024

In centralized mode, run it on the master node; in P2P mode, run it on the autonomy node.

@libluecat
Author

The training flow uses the joint audience selection (联合圈人) template. The configuration is as follows:
image
image
image
image
image
image

@libluecat
Author

I also reran it today: when the selected fields do not come from the PSI sender's table, the run succeeds.

@wangzul
Contributor

wangzul commented Nov 29, 2024

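I also reran it today: when the selected fields do not come from the PSI sender's table, the run succeeds.

The log you provided shows secretflow version 1.7.0b0, so I need to determine whether the log output is wrong or you are actually running the wrong sf version.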

@wangzul
Contributor

wangzul commented Nov 29, 2024

Try removing the two parties' table options and selecting only the PSI output dataset. Remember to save the component configuration.

@libluecat
Author

2. kubectl get app image secret flow-image -oyaml

It was indeed version 1.7.0b0. Running kubectl get app image secret flow-image -oyaml on the master node reports:
sh-5.2# kubectl get app image secret flow-image -oyaml
error: the server doesn't have a resource type "app"

I have now rebuilt the image with sf switched to 1.8.0b0, and the joint audience selection training flow still fails.
image

The log is as follows:
sh-5.2# kubectl get kt sivj-xcpvzkqj-node-4 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
annotations:
kuscia.secretflow/job-id: sivj
kuscia.secretflow/self-cluster-as-participant: "true"
kuscia.secretflow/task-alias: sivj-xcpvzkqj-node-4
creationTimestamp: "2024-12-02T08:40:31Z"
generation: 1
labels:
kuscia.secretflow/controller: kuscia-job
kuscia.secretflow/job-uid: f21b2311-0e3e-426c-8d27-58d8f4eef9f1
name: sivj-xcpvzkqj-node-4
namespace: cross-domain
ownerReferences:

  • apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: sivj
    uid: f21b2311-0e3e-426c-8d27-58d8f4eef9f1
    resourceVersion: "4058"
    uid: d7127327-3044-45ce-b003-92709a3edc2b
    spec:
    initiator: bob
    parties:
  • appImageRef: secretflow-image
    domainID: bob
    template:
    spec: {}
  • appImageRef: secretflow-image
    domainID: alice
    template:
    spec: {}
    scheduleConfig: {}
    taskInputConfig: |-
    {
    "sf_datasource_config": {
    "bob": {
    "id": "oss-63da092c528337b49753439025f46a97"
    },
    "alice": {
    "id": "oss-501411ada08312dc6dec836eca05aea8"
    }
    },
    "sf_cluster_desc": {
    "parties": ["bob", "alice"],
    "devices": [{
    "name": "spu",
    "type": "spu",
    "parties": ["bob", "alice"],
    "config": "{"runtime_config":{"protocol":"SEMI2K","field":"FM128"},"link_desc":{"connect_retry_times":60,"connect_retry_interval_ms":1000,"brpc_channel_protocol":"http","brpc_channel_connection_type":"pooled","recv_timeout_ms":1200000,"http_timeout_ms":1200000}}"
    }, {
    "name": "heu",
    "type": "heu",
    "parties": ["bob", "alice"],
    "config": "{"mode": "PHEU", "schema": "paillier", "key_size": 2048}"
    }],
    "ray_fed_config": {
    "cross_silo_comm_backend": "brpc_link"
    }
    },
    "sf_node_eval_param": {
    "domain": "stats",
    "name": "table_statistics",
    "version": "0.0.2",
    "attr_paths": ["input/input_data/features"],
    "attrs": [{
    "is_na": false,
    "ss": ["id1", "id2", "y", "age", "education", "default", "balance", "housing", "loan", "day", "duration", "campaign", "pdays", "previous", "job_blue-collar", "job_entrepreneur", "job_housemaid", "job_management", "job_retired", "job_self-employed", "job_services", "job_student", "job_technician", "job_unemployed", "marital_divorced", "marital_married", "marital_single", "contact_cellular", "contact_telephone", "contact_unknown", "month_apr", "month_aug", "month_dec", "month_feb", "month_jan", "month_jul", "month_jun", "month_mar", "month_may", "month_nov", "month_oct", "month_sep", "poutcome_failure", "poutcome_other", "poutcome_success", "poutcome_unknown"]
    }],
    "checkpoint_uri": "cksivj-xcpvzkqj-node-4-output-0"
    },
    "sf_output_uris": ["sivj-xcpvzkqj-node-4-output-0"],
    "sf_input_ids": ["sivj-xcpvzkqj-node-3-output-0"],
    "sf_output_ids": ["sivj-xcpvzkqj-node-4-output-0"]
    }
    status:
    allocatedPorts:
  • domainID: bob
    namedPort:
    sivj-xcpvzkqj-node-4-0/client-server: 29827
    sivj-xcpvzkqj-node-4-0/fed: 29823
    sivj-xcpvzkqj-node-4-0/global: 29824
    sivj-xcpvzkqj-node-4-0/node-manager: 29825
    sivj-xcpvzkqj-node-4-0/object-manager: 29826
    sivj-xcpvzkqj-node-4-0/spu: 29828
  • domainID: alice
    namedPort:
    sivj-xcpvzkqj-node-4-0/client-server: 20555
    sivj-xcpvzkqj-node-4-0/fed: 20557
    sivj-xcpvzkqj-node-4-0/global: 20558
    sivj-xcpvzkqj-node-4-0/node-manager: 20553
    sivj-xcpvzkqj-node-4-0/object-manager: 20554
    sivj-xcpvzkqj-node-4-0/spu: 20556
    completionTime: "2024-12-02T08:40:52Z"
    conditions:
  • lastTransitionTime: "2024-12-02T08:40:31Z"
    status: "True"
    type: ResourceCreated
  • lastTransitionTime: "2024-12-02T08:40:34Z"
    status: "True"
    type: Running
  • lastTransitionTime: "2024-12-02T08:40:52Z"
    status: "False"
    type: Success
    lastReconcileTime: "2024-12-02T08:40:52Z"
    message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice],
    successful party[], failed party[bob]
    partyTaskStatus:
  • domainID: bob
    phase: Failed
  • domainID: alice
    phase: Failed
    phase: Failed
    podStatuses:
    alice/sivj-xcpvzkqj-node-4-0:
    createTime: "2024-12-02T08:40:31Z"
    namespace: alice
    nodeName: kuscia-lite-alice-84cb955d8-b9mgh
    podName: sivj-xcpvzkqj-node-4-0
    podPhase: Failed
    readyTime: "2024-12-02T08:40:34Z"
    startTime: "2024-12-02T08:40:33Z"
    bob/sivj-xcpvzkqj-node-4-0:
    createTime: "2024-12-02T08:40:31Z"
    namespace: bob
    nodeName: kuscia-lite-bob-98957db7c-l47bf
    podName: sivj-xcpvzkqj-node-4-0
    podPhase: Failed
    readyTime: "2024-12-02T08:40:34Z"
    reason: Error
    startTime: "2024-12-02T08:40:33Z"
    terminationLog: 'container[secretflow] terminated state reason "Error", message:
    "... Ignore 46351 characters at the beginning ...\n/function_manager.py",
    line 726, in actor_method_executor\n return method(__ray_actor, *args,
    **kwargs)\n File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
    line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line
    77, in wrapper\n return method(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py",
    line 63, in append_data\n source = source(**kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py",
    line 64, in read_csv_wrapper\n df = _read_csv(filepath, read_backend, **kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line
    43, in _read_csv\n return read_pandas_csv(_filepath, *_args, **_kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py",
    line 38, in read_pandas_csv\n df = pd.read_csv(filepath, *args, **kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line
    211, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py",
    line 331, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 950, in read_csv\n return _read(filepath_or_buffer, kwds)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 605, in _read\n parser = TextFileReader(filepath_or_buffer, **kwds)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 1442, in init\n self._engine = self._make_engine(f, self.engine)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 1753, in _make_engine\n return mapping[engine](f, **self.options)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py",
    line 135, in init\n self._validate_usecols_names(usecols, self.orig_names)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py",
    line 917, in _validate_usecols_names\n raise ValueError(\nValueError: Usecols
    do not match columns, columns expected but not found: [''month_feb'', ''month_may'',
    ''contact_cellular'', ''month_jan'', ''poutcome_failure'', ''contact_unknown'',
    ''poutcome_success'', ''month_aug'', ''contact_telephone'', ''month_dec'',
    ''poutcome_other'', ''month_apr'', ''month_oct'', ''month_jun'', ''month_sep'',
    ''month_nov'', ''y'', ''month_jul'', ''month_mar'', ''poutcome_unknown'',
    ''id2'']\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback
    (most recent call last):\n File "python/ray/_raylet.pyx", line 2206, in
    ray._raylet.task_execution_handler\n File "python/ray/_raylet.pyx", line
    2102, in ray._raylet.execute_task_with_cancellation_handler\n File "python/ray/_raylet.pyx",
    line 1756, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1757, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1995, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1055, in ray._raylet.store_task_errors\n File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
    line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line
    77, in wrapper\n return method(*args, **kwargs)\nTypeError: PartitionAgent.len()
    missing 1 required positional argument: ''idx''\nAn unexpected internal error
    occurred while the worker was executing a task.\n\x1b[33m(raylet)\x1b[0m A
    worker died or was killed while executing a task by an unexpected system error.
    To troubleshoot the problem, check the logs for the dead worker. RayTask ID:
    ffffffffffffffff766c1a1f5805e69e3255e56601000000 Worker ID: 457ba897b4d9df1241d8c4067fe9d3898b5dcad09f42303787d886ee
    Node ID: df8607116fc85993540c7376a9a79f233b80a3c01c6f1fc461f10c39 Worker IP
    address: sivj-xcpvzkqj-node-4-0-global.bob.svc Worker port: 10037 Worker PID:
    11878 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly.
    Worker exits with an exit code None. The worker may have exceeded K8s pod
    memory limits. Traceback (most recent call last):\n File "python/ray/_raylet.pyx",
    line 1807, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1908, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1813, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1754, in ray._raylet.execute_task.function_executor\n File "/usr/local/lib/python3.10/site-packages/ray/_private/function_manager.py",
    line 726, in actor_method_executor\n return method(__ray_actor, *args,
    **kwargs)\n File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
    line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line
    77, in wrapper\n return method(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/agent.py",
    line 63, in append_data\n source = source(**kwargs)\n File "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py",
    line 64, in read_csv_wrapper\n df = _read_csv(filepath, read_backend, **kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/data/core/io.py", line
    43, in _read_csv\n return read_pandas_csv(_filepath, *_args, **_kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/data/core/pandas/util.py",
    line 38, in read_pandas_csv\n df = pd.read_csv(filepath, *args, **kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line
    211, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py",
    line 331, in wrapper\n return func(*args, **kwargs)\n File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 950, in read_csv\n return _read(filepath_or_buffer, kwds)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 605, in _read\n parser = TextFileReader(filepath_or_buffer, **kwds)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 1442, in init\n self._engine = self._make_engine(f, self.engine)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py",
    line 1753, in _make_engine\n return mapping[engine](f, **self.options)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py",
    line 135, in init\n self._validate_usecols_names(usecols, self.orig_names)\n File
    "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py",
    line 917, in _validate_usecols_names\n raise ValueError(\nValueError: Usecols
    do not match columns, columns expected but not found: [''month_feb'', ''month_may'',
    ''contact_cellular'', ''month_jan'', ''poutcome_failure'', ''contact_unknown'',
    ''poutcome_success'', ''month_aug'', ''contact_telephone'', ''month_dec'',
    ''poutcome_other'', ''month_apr'', ''month_oct'', ''month_jun'', ''month_sep'',
    ''month_nov'', ''y'', ''month_jul'', ''month_mar'', ''poutcome_unknown'',
    ''id2'']\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback
    (most recent call last):\n File "python/ray/_raylet.pyx", line 2206, in
    ray._raylet.task_execution_handler\n File "python/ray/_raylet.pyx", line
    2102, in ray._raylet.execute_task_with_cancellation_handler\n File "python/ray/_raylet.pyx",
    line 1756, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1757, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1995, in ray._raylet.execute_task\n File "python/ray/_raylet.pyx",
    line 1055, in ray._raylet.store_task_errors\n File "/usr/local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py",
    line 467, in _resume_span\n return method(self, *_args, **_kwargs)\n File
    "/usr/local/lib/python3.10/site-packages/secretflow/device/proxy.py", line
    77, in wrapper\n return method(*args, **kwargs)\nTypeError: PartitionAgent.len()
    missing 1 required positional argument: ''idx''\nAn unexpected internal error
    occurred while the worker was executing a task.\n"'
    serviceStatuses:
    alice/sivj-xcpvzkqj-node-4-0-fed:
    createTime: "2024-12-02T08:40:31Z"
    namespace: alice
    portName: fed
    portNumber: 20557
    readyTime: "2024-12-02T08:40:34Z"
    scope: Cluster
    serviceName: sivj-xcpvzkqj-node-4-0-fed
    alice/sivj-xcpvzkqj-node-4-0-global:
    createTime: "2024-12-02T08:40:31Z"
    namespace: alice
    portName: global
    portNumber: 20558
    readyTime: "2024-12-02T08:40:34Z"
    scope: Domain
    serviceName: sivj-xcpvzkqj-node-4-0-global
    alice/sivj-xcpvzkqj-node-4-0-spu:
    createTime: "2024-12-02T08:40:31Z"
    namespace: alice
    portName: spu
    portNumber: 20556
    readyTime: "2024-12-02T08:40:34Z"
    scope: Cluster
    serviceName: sivj-xcpvzkqj-node-4-0-spu
    bob/sivj-xcpvzkqj-node-4-0-fed:
    createTime: "2024-12-02T08:40:31Z"
    namespace: bob
    portName: fed
    portNumber: 29823
    readyTime: "2024-12-02T08:40:34Z"
    scope: Cluster
    serviceName: sivj-xcpvzkqj-node-4-0-fed
    bob/sivj-xcpvzkqj-node-4-0-global:
    createTime: "2024-12-02T08:40:31Z"
    namespace: bob
    portName: global
    portNumber: 29824
    readyTime: "2024-12-02T08:40:34Z"
    scope: Domain
    serviceName: sivj-xcpvzkqj-node-4-0-global
    bob/sivj-xcpvzkqj-node-4-0-spu:
    createTime: "2024-12-02T08:40:31Z"
    namespace: bob
    portName: spu
    portNumber: 29828
    readyTime: "2024-12-02T08:40:34Z"
    scope: Cluster
    serviceName: sivj-xcpvzkqj-node-4-0-spu
    startTime: "2024-12-02T08:40:31Z"

@libluecat
Author

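Try removing the two parties' table options and selecting only the PSI output dataset. Remember to save the component configuration.

I tried it, and it reports the same log as above.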

@wangzul
Contributor

wangzul commented Dec 2, 2024

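Try removing the two parties' table options and selecting only the PSI output dataset. Remember to save the component configuration.

Please try it this way:

  1. Create a new training flow and run it up to the PSI step.
  2. In full-table statistics, select only the PSI output data and deselect (❌) the sample tables, then run it and see.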

@wangzul
Contributor

wangzul commented Dec 3, 2024

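Try removing the two parties' table options and selecting only the PSI output dataset. Remember to save the component configuration.

Please try it this way:

  1. Create a new training flow and run it up to the PSI step.
  2. In full-table statistics, select only the PSI output data and deselect (❌) the sample tables, then run it and see.

Is your deployment a custom modification of the allinone scripts, or was it done some other way?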

@libluecat
Author

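Try removing the two parties' table options and selecting only the PSI output dataset. Remember to save the component configuration.

Please try it this way:

  1. Create a new training flow and run it up to the PSI step.
  2. In full-table statistics, select only the PSI output data and deselect (❌) the sample tables, then run it and see.

Is your deployment a custom modification of the allinone scripts, or was it done some other way?

It is not deployed from the allinone scripts. kuscia was built locally from https://github.com/secretflow/kuscia/blob/release/0.10.x/build/dockerfile/kuscia-secretflow.Dockerfile#L15 and then uploaded to the environment for deployment. secretpad was deployed the same way.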

@libluecat
Author

I downloaded the PSI result set from OSS and found that it contains none of the columns from bob.csv.
1733275784742
1733275794308
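
The same comparison can be done without opening the file by hand. The following is a minimal sketch, not part of the original thread: it reads only the header row of a CSV and lists the expected columns that are absent, which is roughly the condition behind the `Usecols do not match columns` error in the traceback. The sample CSV text and the shortened column list are hypothetical stand-ins for the downloaded result set and for the column list in the task config.

```python
import csv
import io


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    # Only the first row (the header) is read; an empty file yields [].
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = set(header)
    return [name for name in expected if name not in present]


# Hypothetical stand-in for the downloaded PSI output: alice-side columns only.
psi_output = "id1,age,education\n1,30,2\n"
# Shortened, illustrative subset of the columns the statistics step asks for.
expected = ["id1", "age", "id2", "y", "month_jun"]

print(missing_columns(psi_output, expected))
```

A non-empty result for the file that one party reads suggests that the file at that path does not hold that party's data.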

@wangzul
Contributor

wangzul commented Dec 4, 2024

Is kuscia deployed in runp mode on k8s? Could you share the secretpad deployment command?

@libluecat
Author

Is kuscia deployed in runp mode on k8s? Could you share the secretpad deployment command?

Yes, it is runp mode deployed on k8s.

secretpad deployment command:
ARG BASE_IMAGE=secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad-base-lite:0.3
FROM ${BASE_IMAGE}

ARG TARGETPLATFORM

ENV TZ=Asia/Shanghai
ENV LANG=C.UTF-8
WORKDIR /app

RUN mkdir -p /var/log/secretpad && mkdir -p /app/db && mkdir -p /app/config/certs && yum install -y sqlite

COPY config /app/config
COPY scripts /app/scripts
COPY demo/data /app/data
COPY target/secretpad.jar secretpad.jar
ENV JAVA_OPTS="-server -Xmx3100m -Xms3100m -XX:+UseZGC" SPRING_PROFILES_ACTIVE="dev"
EXPOSE 80
EXPOSE 8080
EXPOSE 9001
ENTRYPOINT java ${JAVA_OPTS} -Dsun.net.http.allowRestrictedHeaders=true -Dspring.profiles.active=${SPRING_PROFILES_ACTIVE} -Dproject.name=secretpad -jar /app/secretpad.jar --spring.config.location=/app/config/application-dev.yaml

@wangzul
Contributor

wangzul commented Dec 4, 2024

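I downloaded the PSI result set from OSS and found that it contains none of the columns from bob.csv. 1733275784742 1733275794308

Do both parties' OSS data sources point to the same address? Is the bucket the same?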

@libluecat
Author

libluecat commented Dec 4, 2024

Same address, and the bucket is the same too.
image
image
image
image

@wangzul
Contributor

wangzul commented Dec 4, 2024

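That explains it. Both parties' output result files have the same name, so the later write overwrites the earlier one, and full-table statistics cannot find the data it expects.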
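
The overwrite can be illustrated with a toy model. This is only a sketch under stated assumptions: the in-memory `store` dict stands in for an object store, `shared-bucket` is a hypothetical bucket name, the CSV contents are made up, and using the PSI output ID from the task config as the object key is an assumption about how the file is named.

```python
# Toy object store: bucket name -> {object key -> content}.
store = {}


def put_object(bucket, key, content):
    """Write content under bucket/key, silently replacing any existing object."""
    store.setdefault(bucket, {})[key] = content


# Assumption: both parties name their PSI result after the same output ID.
output_key = "sivj-xcpvzkqj-node-3-output-0"

# bob writes his side of the intersection first.
put_object("shared-bucket", output_key, "id2,y,month_jun\n7,1,0\n")
# alice then writes hers to the same bucket and key.
put_object("shared-bucket", output_key, "id1,age,education\n7,30,2\n")

# Only the last write survives, so bob's columns are gone.
header = store["shared-bucket"][output_key].splitlines()[0].split(",")
print(header)
```

When bob's task later reads that object and asks for `id2`, `y`, and the other bob-side columns, none of them are present, which matches the column list in the `ValueError` above.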

@libluecat
Author

Thanks! I gave the two nodes separate buckets, and the run succeeded.
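
A pre-flight check along these lines could flag the problem before a job runs. This is a hypothetical sketch, not a kuscia or secretpad feature: the `endpoint` and `bucket` fields, the endpoint host, and the bucket names are all illustrative assumptions about how a data source might be described.

```python
from collections import defaultdict


def colliding_parties(datasources, key):
    """Return groups of parties that would write `key` to the same location."""
    writers = defaultdict(list)
    for party, ds in datasources.items():
        # Two writes collide when endpoint, bucket, and object key all match.
        writers[(ds["endpoint"], ds["bucket"], key)].append(party)
    return [sorted(group) for group in writers.values() if len(group) > 1]


# Hypothetical data source settings before and after splitting the buckets.
before = {
    "bob": {"endpoint": "oss.example.internal", "bucket": "shared"},
    "alice": {"endpoint": "oss.example.internal", "bucket": "shared"},
}
after = {
    "bob": {"endpoint": "oss.example.internal", "bucket": "bob-bucket"},
    "alice": {"endpoint": "oss.example.internal", "bucket": "alice-bucket"},
}

print(colliding_parties(before, "psi-output"))
print(colliding_parties(after, "psi-output"))
```

With one bucket per node the output object keys can stay identical, because the bucket is part of the location being written.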


github-actions bot commented Jan 3, 2025

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.

@github-actions github-actions bot closed this as not planned Jan 11, 2025