[BUG] why the training time of warm-up dataset is longer than that of local hostpath in the same k8s/fluid environment and training job #4231

Open
liumiaomiaoIntel opened this issue Jul 24, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@liumiaomiaoIntel

What is your environment (Kubernetes version, Fluid version, etc.)
Kubernetes version: v1.28
Fluid version: 1.0.1-14eda3b
Kubernetes cluster: 1 control plane + 1 node + 1 remote storage server with an S3 bucket
Cache runtime: Alluxio, 20G MEM, replica 1
Training job: pretrained ResNet50 model with 10G of raw data in small files (.jpg).

Describe the bug
Test scenarios:

  1. The 10G of raw data is stored in the S3 bucket on the remote storage server, and Fluid is used to expose this S3 bucket as a PV in the k8s cluster. The PV is mounted in the training job pod as the raw training data directory. Before training starts, the dataset is preloaded into the Fluid cache system.
  2. Without Fluid, the 10G of raw data is stored in a local directory on the k8s node, and hostPath is used to mount this directory into the training job pod as the raw training data directory.

Results of the two scenarios:
Training time of scenario 1: 155.51s (training time without preloading the dataset is 950.9s)
Training time of scenario 2: 141.7s
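
For reference, the gap between the runs can be worked out from the reported timings. This is only arithmetic on the three numbers above; the variable names are illustrative.

```python
# Reported training times in seconds, taken from the results above.
preloaded_s = 155.51      # scenario 1, dataset preloaded into the Fluid cache
not_preloaded_s = 950.9   # scenario 1 without preloading
hostpath_s = 141.7        # scenario 2, hostPath on the node's local directory

gap_s = preloaded_s - hostpath_s                 # absolute slowdown vs hostPath
gap_pct = 100 * gap_s / hostpath_s               # relative slowdown vs hostPath
preload_speedup = not_preloaded_s / preloaded_s  # benefit of preloading

print(f"gap vs hostPath: {gap_s:.2f}s ({gap_pct:.1f}%)")
print(f"preload speedup: {preload_speedup:.1f}x")
```

On these single-run numbers, the preloaded dataset is about 13.8s (roughly 10%) slower than hostPath, and about 6x faster than the run without preloading.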

What you expect to happen:
As far as I understand, in scenario 2 the raw data is stored in a local directory, not in memory; in scenario 1 the raw data has already been cached in the Fluid cache system (which is also in the k8s node's memory). So the training time of scenario 1 should be shorter than that of scenario 2, because memory access is much faster than access to a local directory.
But the training time of scenario 1 is always about 10s longer than that of scenario 2.

How to reproduce it

Additional Information

liumiaomiaoIntel added the bug label Jul 24, 2024
@cheyang
Collaborator

cheyang commented Jul 29, 2024

@liumiaomiaoIntel Thank you for opening the issue. I think we need more details. For example, if the cache is accessed remotely over the network while the local path may even be served from the page cache, the test results may differ. Please reach me by email: [email protected].
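
One way to gather such details is to time the raw file reads on each mount separately from the training loop. The following is a minimal sketch, not part of Fluid: the helper name `time_directory_read` is hypothetical, and it only measures sequential reads of every file under a directory. Running it twice on the same path may help show a page cache effect, since a second pass over a local directory can be served from memory.

```python
import os
import time


def time_directory_read(root, chunk_size=1 << 20):
    """Read every file under root once; return (file_count, total_bytes, seconds)."""
    file_count = 0
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Read in chunks so large files do not need to fit in memory at once.
            with open(os.path.join(dirpath, name), "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total_bytes += len(chunk)
            file_count += 1
    elapsed = time.perf_counter() - start
    return file_count, total_bytes, elapsed


# Example call; replace the placeholder with the data directory mounted in the pod:
# n, size, secs = time_directory_read("/path/to/mounted/dataset")
# print(f"{n} files, {size / 2**20:.1f} MiB in {secs:.2f}s")
```

Comparing the per-mount read times for the Fluid PV and the hostPath directory, inside the same pod, would indicate whether the difference comes from data access or from elsewhere in the training job.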
