[BUG] why the training time of warm-up dataset is longer than that of local hostpath in the same k8s/fluid environment and training job #4231

liumiaomiaoIntel · 2024-07-24T04:57:15Z

What is your environment(Kubernetes version, Fluid version, etc.)
kubernetes version: v1.28
Fluid version: 1.0.1-14eda3b
kubernetes cluster: 1 control plane + 1 node + 1 remote storage server with an S3 bucket
cache runtime: Alluxio, 20G MEM, replica 1
training job: pretrained ResNet50 model with 10G of raw data in small files (.jpg).

Describe the bug
Test scenarios:

1. The 10G of raw data is stored in the S3 bucket on the remote storage server, and Fluid abstracts this bucket into a PV in the k8s cluster. The PV is mounted in the training job pod as the raw training data directory. Before training starts, the dataset is preloaded into the Fluid cache system.
2. Without Fluid, the 10G of raw data is stored in a local directory on the k8s node, and hostPath mounts this directory into the training job pod as the raw training data directory.

Results of the above 2 scenarios show:
training time of scenario 1: 155.51s (training time without preloading the dataset is 950.9s)
training time of scenario 2: 141.7s
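
To put the reported numbers side by side, here is a small sketch that computes the gap between the two scenarios and the speedup that preloading gives over the non-preloaded run. It uses only the three timings quoted above; the variable and function names are illustrative.

```python
# Timings reported in this issue, in seconds.
fluid_warm = 155.51   # scenario 1: Fluid dataset, preloaded
hostpath = 141.7      # scenario 2: hostPath on the node's local directory
fluid_cold = 950.9    # scenario 1 without preloading


def compare(a, b):
    """Return (a - b, percentage difference of a relative to b)."""
    diff = a - b
    return diff, 100.0 * diff / b


gap_s, gap_pct = compare(fluid_warm, hostpath)
speedup = fluid_cold / fluid_warm

print(f"warm Fluid vs hostPath: +{gap_s:.2f} s ({gap_pct:.1f}% slower)")
print(f"preloaded vs non-preloaded Fluid: {speedup:.2f}x faster")
```

With these inputs the warm Fluid run is 13.81 s (about 9.7%) slower than hostPath, and preloading makes the Fluid run about 6.11x faster than the non-preloaded one.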

What you expect to happen:
As far as I understand, in scenario 2 the raw data is stored in a local directory but not in memory, while in scenario 1 the raw data is already cached in the Fluid cache system (that is, in the k8s node's memory). So the training time of scenario 1 should be shorter than that of scenario 2, because memory access is much faster than access to a local directory.
But the training time of scenario 1 is always about 10s longer than that of scenario 2.

How to reproduce it

Additional Information

cheyang · 2024-07-29T10:10:23Z

@liumiaomiaoIntel Thank you for opening the issue. I think we need more details. For example, if the cache is read remotely over the network while the local path is actually served from the page cache, the test result may be different. Please reach me by email: [email protected].
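
One way to gather such details is to time raw reads of the dataset directory on each mount, independent of the training job. The following is a minimal sketch, not a Fluid tool: `measure_read` is a hypothetical helper, and the mount paths in the usage comment are placeholders for wherever the Fluid PV and the hostPath volume are mounted in the pod.

```python
import os
import time


def measure_read(root, chunk_size=1 << 20):
    """Read every file under root once; return (files, bytes, seconds)."""
    n_files = 0
    n_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                # Read in chunks so large files do not have to fit in memory.
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    n_bytes += len(chunk)
            n_files += 1
    return n_files, n_bytes, time.perf_counter() - start


def report(label, root):
    """Print file rate and throughput for one mount point."""
    files, size, secs = measure_read(root)
    secs = max(secs, 1e-9)  # avoid division by zero on tiny directories
    print(f"{label}: {files} files, {size / 2**20:.1f} MiB, "
          f"{files / secs:.0f} files/s, {size / 2**20 / secs:.1f} MiB/s")


# Example usage (placeholder paths, adjust to the actual mounts):
# report("fluid", "/data/fluid")
# report("hostpath", "/data/local")
```

Running it twice on the hostPath directory shows whether the second pass is served from the page cache. On Linux the page cache can be dropped between runs by writing `3` to `/proc/sys/vm/drop_caches` as root, which gives a cold-read baseline for the local directory.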

liumiaomiaoIntel added the bug label · Jul 24, 2024
