Fail to build the docker with rootless user #1844
Comments
Are you behind some proxy? |
I'm not using a proxy myself, but I'm on the company's internal network, which I think is quite common, right? Could you give some suggestions for this situation? Thanks. |
Are you able to do
on your shell? |
I'm not next to the machine and will try it later...... |
(cm) tomcat@tomcat-Dove-Product:~$ wget https://www.dropbox.com/s/92n2fyej3lzy3s3/caffe_ilsvrc12.tar.gz
--2024-09-11 08:35:34-- https://www.dropbox.com/s/92n2fyej3lzy3s3/caffe_ilsvrc12.tar.gz
Resolving www.dropbox.com (www.dropbox.com)... 31.13.94.37, 2a03:2880:f11f:83:face:b00c:0:25de
Connecting to www.dropbox.com (www.dropbox.com)|31.13.94.37|:443... failed: Connection timed out.
Connecting to www.dropbox.com (www.dropbox.com)|2a03:2880:f11f:83:face:b00c:0:25de|:443... failed: Network is unreachable.
(cm) tomcat@tomcat-Lenovo-Product:~$ |
@arjunsuresh |
@Bob123Yang yes, we can find a way. But since this is not the only download in the workflow, it would be good to know what is happening. Are Dropbox URLs blocked in your network? Are all other URLs expected to work? |
Yeah, it looks like Dropbox URLs are blocked here and the others seem fine. So what can I do to bypass this problem? I really don't want to be stopped by a download... |
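To confirm which hosts are blocked before re-running the workflow, a quick reachability probe can help. The following is a minimal sketch, not part of CM: the function names `tcp_reachable` and `check_hosts` are hypothetical, and the host list is only an assumption based on the services mentioned in this thread.

```python
import socket
from typing import Callable, Dict, Iterable


def tcp_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers timeouts, refused connections and DNS failures.
        return False


def check_hosts(hosts: Iterable[str],
                probe: Callable[[str], bool] = tcp_reachable) -> Dict[str, bool]:
    """Map each host name to whether the probe could reach it."""
    return {host: probe(host) for host in hosts}


if __name__ == "__main__":
    # Assumed host list; extend it with any other download source you rely on.
    for host, ok in check_hosts(["www.dropbox.com", "github.com", "zenodo.org"]).items():
        print(f"{host}: {'reachable' if ok else 'blocked or unreachable'}")
```

A successful TCP connection on port 443 only shows that the host is not blocked outright; it says nothing about available bandwidth, so slow or stalling downloads can still occur.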
That's great. We have now added backup URL support in CM. Can you please do |
@arjunsuresh
(cm) tomcat@tomcat-Dove-Product:~$ cm pull repo
Alias: mlcommons@cm4mlops
Local path: /home/tomcat/CM/repos/mlcommons@cm4mlops
git pull
remote: Enumerating objects: 161, done.
CM alias for this repository: mlcommons@cm4mlops
Reindexing all CM artifacts. Can take some time ...
(cm) tomcat@tomcat-Dove-Product:~$ cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
|
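The general idea behind a backup URL is to try each source in order and stop at the first one that works. The sketch below illustrates that idea only; it is not CM's actual implementation, and `fetch_to_file` and `download_with_fallback` are hypothetical names introduced for illustration.

```python
import urllib.request
from typing import Callable, Sequence


def fetch_to_file(url: str, dest: str, timeout: float = 30.0) -> None:
    """Download url to dest; network failures surface as OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        while chunk := resp.read(1 << 20):  # read in 1 MiB chunks
            out.write(chunk)


def download_with_fallback(urls: Sequence[str], dest: str,
                           fetch: Callable[[str, str], None] = fetch_to_file) -> str:
    """Try each URL in order and return the first one that downloads successfully."""
    errors = []
    for url in urls:
        try:
            fetch(url, dest)
            return url
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all URLs failed:\n" + "\n".join(errors))
```

Passing the fetch function as a parameter keeps the fallback logic independent of the transport, so the same loop could wrap `wget`, `rclone`, or any other downloader.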
Sorry, I should have said that the docker build stalls at the git clone step (14/14) for a long time, over 12 hours, so I have to stop the command by pressing "Ctrl + C". |
Sorry, I'm unable to see this part in the shared output. Can you please share the number of cores and the RAM of the system? The Nvidia 4.0 code needs a PyTorch build from source, which typically takes around 2 hours on a 24-core, 64G system. If this is a problem, the best option is to use the Nvidia 4.1 code, which we are currently working on. We hope to make this available within a week. |
tomcat@tomcat-Dove-Product:~$ lscpu | grep "socket\|Socket"
Core(s) per socket: 56
Socket(s): 2
tomcat@tomcat-Dove-Product:~$ free -h
        total   used    free    shared  buff/cache  available
Mem:    125Gi   4.8Gi   119Gi   41Mi    1.4Gi       119Gi
Swap:   49Gi    0B      49Gi
tomcat@tomcat-Dove-Product:~$ |
Total 112 physical cores and 64G*2 memory. |
@arjunsuresh Please see the run log below (I tried again today): it stopped at 21% of downloading resnet50_v1.onnx during the docker build and sat for 21336.8s without any download progress.
(cm) tomcat@tomcat-Dove-Product:~$ cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev |
I believe it could be a network issue, so it is best to restart the command if a download hangs like this. The Zenodo download is slow, but it works 99% of the time, as we have this resnet50 download in most of our GitHub Actions. Ideally this download should finish within a couple of minutes. |
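Restarting by hand after a hang can be automated once the hang is turned into an error, for example by setting a socket-level timeout on the download so that a stalled read raises an exception. The following is a minimal sketch under that assumption; `retry` is a hypothetical helper, not something CM provides.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay: float = 5.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action until it succeeds or the attempts are used up.

    Only OSError (which includes socket timeouts) triggers another attempt;
    the last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            sleep(delay)  # brief pause before the next attempt
    raise AssertionError("unreachable")
```

For instance, `retry(lambda: urllib.request.urlopen(url, timeout=60).read())` would give up on a read that stalls for more than 60 seconds and try again, instead of waiting for hours with no progress.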
Yes, after several retries, the resnet50 and other downloads passed, but the build still stopped at "Cloning into 'repo' ..." as before (see the 1st picture). I then tried the command "git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo" outside of docker: the download started normally, reached about 20% progress, and then failed with an error (see the 2nd log).
The 2nd log:
tomcat@tomcat-Dove-Product:~/bobtry$ git clone https://github.com/GATEOverflow/inference_results_v4.0.git --depth 5 repo
Cloning into 'repo'...
remote: Enumerating objects: 71874, done.
remote: Counting objects: 100% (71874/71874), done.
remote: Compressing objects: 100% (33638/33638), done.
error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
error: 1705 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
tomcat@tomcat-Dove-Product:~/bobtry$ |
I think we should fix the download issue before proceeding with MLPerf runs, as many more downloads are needed. Since the clone from GitHub is failing, it may be best to contact your system admin. |
@Bob123Yang while testing across multiple systems, we have pinpointed this error to cases where the available network bandwidth is very low. One such case we have seen is an rclone download that saturates the network bandwidth, which affects git clone of large repositories for any system on the same network. |
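On a low-bandwidth link, one way to reduce the amount of data a clone has to transfer is a shallower clone. The sketch below builds such a command; `clone_command` and `clone` are hypothetical helpers. The optional `http.version=HTTP/1.1` setting is a real git configuration key that is sometimes suggested for `curl 92 HTTP/2` errors, but whether it helps in this case is an assumption and is not confirmed anywhere in this thread.

```python
import subprocess
from typing import List


def clone_command(url: str, dest: str, depth: int = 1,
                  force_http1: bool = False) -> List[str]:
    """Build a shallow `git clone` command line as an argument list."""
    cmd = ["git"]
    if force_http1:
        # Assumption: forcing HTTP/1.1 may avoid the HTTP/2 stream error.
        cmd += ["-c", "http.version=HTTP/1.1"]
    cmd += ["clone", "--depth", str(depth), url, dest]
    return cmd


def clone(url: str, dest: str, **kwargs) -> bool:
    """Run the clone and report whether git exited successfully."""
    return subprocess.run(clone_command(url, dest, **kwargs)).returncode == 0
```

Calling `clone("https://github.com/GATEOverflow/inference_results_v4.0.git", "repo", depth=1)` fetches less history than the `--depth 5` used above. If other transfers on the same network are consuming the bandwidth, pausing them first is likely to matter more than any git setting.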
I ran the cm commands below several times, and they always failed at the same place: