Pipeline to iterate over a single NumPy file #5782
Comments
Hi @ziw-liu, Thank you for reaching out.
I agree that opening many files is inefficient, both from the file system's point of view and for GDS, which works best with large files rather than small ones, where the cost of GDS initialization may outweigh the benefits. Can you tell us more about your use case? Do you see an I/O bottleneck when using plain I/O without GDS? Are you saturating the storage I/O, or is the CPU busy enough to prevent this?
Hi @JanuszL and thanks for the quick answer!
I was trying that, and because my files were larger than VRAM (<1% chunks of a 10^1 TB dataset), it would OOM before getting to the slicing step.
I was just starting to explore DALI. I used to have an I/O bottleneck when reading with Python code while the data was on NFS (VAST), and had to pre-cache on the compute nodes (DGX H100/H200), which imposes a size limit. Reading this article, I thought GDS would be a good way to avoid this step, but NFS suffers a lot from the metadata overhead of opening many files. If I use an external source for DALI, I won't be able to use DALI's thread pool and will have to use multiprocessing, which should then be similar to running them in a multi-worker PyTorch dataloader?
Hi @ziw-liu, Thank you for providing the details of your use case.
I would first confirm that GDS is the solution. You can check whether CPU utilization is high and is the limiting factor, or whether the storage just cannot feed the data faster no matter what.
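One rough way to run that check is to compare wall-clock time with CPU time while reading a file. The sketch below uses only the Python standard library; `profile_read` and the temporary file are hypothetical names for illustration, not part of DALI. If CPU time is close to wall time, the reader itself is CPU-bound; if wall time is much larger, the process is mostly waiting on storage.

```python
import os
import tempfile
import time


def profile_read(path, block_size=1 << 20):
    """Read a file sequentially; return (bytes_read, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    total = 0
    # buffering=0 avoids Python-level buffering so each read maps to one syscall.
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return total, wall, cpu


# Demo on a small temporary file (hypothetical stand-in for a real sample file).
size = 8 * (1 << 20)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(size))
    demo_path = tmp.name

total, wall, cpu = profile_read(demo_path)
os.remove(demo_path)

throughput_mb_s = total / max(wall, 1e-9) / 1e6
print(f"read {total} bytes, wall={wall:.4f}s, cpu={cpu:.4f}s, {throughput_mb_s:.1f} MB/s")
```

Note that a freshly written file is usually served from the page cache, so the demo numbers say nothing about the storage itself. For a meaningful measurement on NFS, point `profile_read` at real sample files that have not been read recently.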
Thanks. For now I can still afford to pre-cache. Another major reason to try DALI is that we are doing some computation-heavy augmentations, which creates compute contention on the CPU.
I was also looking at kvikio. I guess if I use GDS via that, I would use an external source with the CuPy interface to feed the data into a DALI pipeline?
I think that should work. Please give it a go and let us know how that works for you.
Describe the question.
I'm trying to use GPUDirect Storage (GDS) via DALI's numpy reader for a dataset of many (10^4) 3D volumes (each volume is one training sample). However, the API seems to require that one file contains only one sample, so each sample has to be in a different file, leading to tens of thousands of files. Opening this many files each training epoch could have significant overhead on certain file systems. Is there a way to use larger files instead (for example, stacking volumes into chunks) and iterate over a dimension? #4140 suggests using an external source for this, but that would not support GDS.
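For illustration, the "stack volumes into chunks and iterate over a dimension" idea can be sketched as a per-sample callable over memory-mapped `.npy` chunks. This is a minimal sketch under stated assumptions, not DALI's numpy reader: `ChunkedVolumeSource` and the chunk file names are hypothetical, the read goes through the host (no GDS), and the only thing it expects from its argument is an `idx_in_epoch` attribute, as on DALI's `SampleInfo`.

```python
import os
import tempfile
from types import SimpleNamespace

import numpy as np


class ChunkedVolumeSource:
    """Per-sample callable that slices volumes out of stacked .npy chunks."""

    def __init__(self, chunk_paths):
        # Memory-map each chunk: only the header is parsed here, not the data.
        self.chunks = [np.load(p, mmap_mode="r") for p in chunk_paths]
        # Cumulative sample counts map a global index to (chunk, offset).
        self.ends = np.cumsum([c.shape[0] for c in self.chunks])

    def __len__(self):
        return int(self.ends[-1])

    def __call__(self, sample_info):
        idx = sample_info.idx_in_epoch
        if idx >= len(self):
            raise StopIteration  # signals the end of the epoch
        chunk_id = int(np.searchsorted(self.ends, idx, side="right"))
        start = 0 if chunk_id == 0 else int(self.ends[chunk_id - 1])
        # Copy only the requested volume from the memory map into host memory.
        return np.array(self.chunks[chunk_id][idx - start])


# Demo with two tiny hypothetical chunks holding 3 and 2 volumes of shape (4, 4, 4).
tmpdir = tempfile.mkdtemp()
paths = []
for i, n in enumerate([3, 2]):
    path = os.path.join(tmpdir, f"chunk_{i}.npy")
    np.save(path, np.full((n, 4, 4, 4), i, dtype=np.float32))
    paths.append(path)

source = ChunkedVolumeSource(paths)
# SimpleNamespace stands in for the sample-info object a pipeline would pass.
sample = source(SimpleNamespace(idx_in_epoch=3))
print(len(source), sample.shape)
```

A callable of this shape could be passed to `fn.external_source(source=..., batch=False)`. It opens each chunk file once rather than once per sample, which addresses the metadata overhead, but it does not give GDS transfers.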