Python Dataset Reader #2414

fiedorowicz1 · 2024-01-08T19:59:09Z

Implements a new Python data reader that is cleaner and more general, and that supports DistConv. It also supports parallel pre-fetching of multiple mini-batches.
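
As a rough illustration of the pre-fetching idea only (this is not the code in this PR), the sketch below keeps a bounded number of mini-batch loads in flight and yields them in order. `ToyDataset`, `load_batch`, `prefetch_batches`, `num_workers`, and `prefetch_factor` are hypothetical names chosen for the example, and `multiprocessing.pool.ThreadPool` is used as a stand-in for a process pool so the snippet stays self-contained.

```python
from collections import deque
from multiprocessing.pool import ThreadPool


class ToyDataset:
    """Hypothetical stand-in for a user dataset: indexable, with a length."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        # A fake two-element sample derived from the index.
        return [float(index), float(index) ** 2]


def load_batch(dataset, indices):
    """Load one mini-batch by indexing the dataset sample by sample."""
    return [dataset[i] for i in indices]


def prefetch_batches(dataset, batches, num_workers=2, prefetch_factor=2):
    """Yield mini-batches in order while later ones load in the background.

    At most num_workers * prefetch_factor loads are in flight at a time
    (an assumption of this sketch, not a documented LBANN setting).
    """
    max_in_flight = num_workers * prefetch_factor
    batch_iter = iter(batches)
    pending = deque()
    with ThreadPool(num_workers) as pool:
        # Fill the pipeline with the first few mini-batches.
        for indices in batch_iter:
            pending.append(pool.apply_async(load_batch, (dataset, indices)))
            if len(pending) >= max_in_flight:
                break
        # Hand out the oldest batch and immediately queue a replacement.
        while pending:
            batch = pending.popleft().get()
            upcoming = next(batch_iter, None)
            if upcoming is not None:
                pending.append(pool.apply_async(load_batch, (dataset, upcoming)))
            yield batch


batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
loaded = list(prefetch_batches(ToyDataset(), batches))
```

Because results are consumed from the front of the queue, batches come back in submission order even when a later load finishes first.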

tbennun

overall looks good, some comments

src/data_ingestion/readers/data_reader_python_v2.cpp

tbennun · 2024-01-22T18:30:07Z

ci_test/unit_tests/test_unit_datareader_python_dataset.py

@@ -0,0 +1,137 @@
+import os


Why not use the pytest version?
It would be nice to extend the test infrastructure (maybe in a later PR?)

fiedorowicz1 · 2024-01-24

Can you clarify what you mean here?

tbennun · 2024-01-24

Certainly. What I meant is that our pytest version of the tests (i.e., @test_util.lbann_test) might have a nicer syntax here. However, it is designed around the existing Python data reader; it could be extended to use the dataset reader (for example, with a flag such as @test_util.lbann_test(dataset=MyObject()) or similar).

fiedorowicz1 · 2024-03-07

I agree we should change the test framework to use the new Python dataset reader, but I think that should be a separate PR.

ci_test/unit_tests/test_unit_datareader_python_dataset.py

ci_test/unit_tests/test_unit_datareader_python_dataset_distconv.py

include/lbann/data_ingestion/readers/data_reader_python_dataset.hpp

src/data_ingestion/readers/data_reader_python_dataset.cpp

applications/physics/cosmology/cosmoflow/cosmoflow_dataset.py

tbennun

Looks great!

ndryden

Nice! :)

tbennun

just noticed this

python/lbann/util/data.py

tbennun · 2024-03-25T17:25:50Z

src/data_ingestion/readers/data_reader_python_dataset.cpp

+  send_rank = recv_rank = send_rank_count = recv_rank_count = 0;
+  uint64_t send_rank_max_count =
+    local_distconv_mb_size + (distconv_extra_samples > 0);
+  uint64_t recv_rank_max_count = local_mb_size + (extra_samples > 0);


@fiedorowicz1 this looks like an Alltoall collective. Do you think using it instead of the loop would be more efficient?
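
To make the question concrete, here is a small pure-Python sketch (no MPI, and not the PR's code) of the per-rank count matrix that a variable-count all-to-all such as MPI_Alltoallv takes as input. It assumes the samples are block-partitioned on both sides, with the first ranks receiving one extra sample when the split is uneven, which is how the `+ (extra_samples > 0)` terms in the snippet above are read here. `block_offsets` and `alltoall_counts` are hypothetical helper names.

```python
from itertools import accumulate


def block_offsets(num_samples, num_parts):
    """Start offsets of a block partition; the first ranks get one extra sample."""
    base, extra = divmod(num_samples, num_parts)
    sizes = [base + (1 if part < extra else 0) for part in range(num_parts)]
    return [0] + list(accumulate(sizes))


def alltoall_counts(num_samples, num_senders, num_receivers):
    """counts[i][j] = number of samples sender i passes to receiver j.

    Row i would serve as sender i's send counts and column j as
    receiver j's receive counts in a variable-count all-to-all.
    """
    send_off = block_offsets(num_samples, num_senders)
    recv_off = block_offsets(num_samples, num_receivers)
    counts = []
    for i in range(num_senders):
        row = []
        for j in range(num_receivers):
            # Overlap between sender i's and receiver j's sample ranges.
            lo = max(send_off[i], recv_off[j])
            hi = min(send_off[i + 1], recv_off[j + 1])
            row.append(max(0, hi - lo))
        counts.append(row)
    return counts


# 10 samples held by 2 ranks, redistributed across 4 ranks.
counts = alltoall_counts(10, 2, 4)
```

Whether a single collective would actually beat the explicit loop depends on the message sizes and the MPI implementation, so this only shows what the inputs to such a call would look like.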

fiedorowicz1 requested review from ndryden, bvanessen and tbennun January 8, 2024 19:59

fiedorowicz1 marked this pull request as draft January 8, 2024 23:10

fiedorowicz1 force-pushed the distconv-python-data-reader branch from eee429e to 40c516b Compare January 17, 2024 18:48

fiedorowicz1 marked this pull request as ready for review January 19, 2024 01:12

tbennun requested changes Jan 22, 2024

fiedorowicz1 force-pushed the distconv-python-data-reader branch from 40c516b to 861931c Compare February 27, 2024 06:32

bvanessen reviewed Feb 29, 2024

applications/physics/cosmology/cosmoflow/cosmoflow_dataset.py

fiedorowicz1 changed the title from Python Data Reader 2.0 to Python Dataset Reader Mar 7, 2024

fiedorowicz1 requested review from tbennun and bvanessen March 7, 2024 05:43

tbennun approved these changes Mar 8, 2024

ndryden approved these changes Mar 12, 2024

tbennun requested changes Mar 21, 2024

python/lbann/util/data.py (outdated)

tbennun self-requested a review March 21, 2024 00:29

tbennun approved these changes Mar 21, 2024

tbennun reviewed Mar 25, 2024

bvanessen approved these changes Apr 3, 2024

fiedorowicz1 added 11 commits April 3, 2024 17:50

Add skeleton for new python data reader

746650a

Implement basic functionality

64a76e6

Fix initialization for distconv

7eb92a6

Add support for labels

9a7a150

Add python library supporting classes

8fc35e7

clang format

e6465c5

Raise exception if rank/io parts not set

c20d4b4

Rename to python dataset

88c52bd

Add optional module dir argument to add to path

6027f19

Add unit tests

05d1395

Simplify naming

faee18a

fiedorowicz1 and others added 14 commits April 3, 2024 17:51

Add cosmoflow example and reader helper

4702069

Update release notes

9e4cba8

Save dataset pickle in work dir

44eda6d

Overhaul new data reader to support prefetching multiple samples/batches

a2f9a15

Fix worker index calculation

ac2d403

clang-format

5cf06ab

Clarify proto comments

753a52f

Throw error if file fails to open

944b90f

Add docstrings and type hints

de0e5fa

Update CosmoFlow example and enable parallel IO

0af89f8

Add basic sample size checking, remove label reconstruction, general clean up

fa668c9

Switch to multiprocessing pool

6c33696

Implement response shuffling for distconv

63288ac

fix typo

75aac4e

Co-authored-by: Tal Ben-Nun

fiedorowicz1 force-pushed the distconv-python-data-reader branch from e0cf5f2 to 75aac4e Compare April 4, 2024 00:51

fiedorowicz1 merged commit 1db91a2 into LLNL:develop Apr 4, 2024
1 check passed
