-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parallel merge index #590
base: main
Are you sure you want to change the base?
parallel merge index #590
Conversation
with open(partition_index, 'r') as f: | ||
obj = json.load(f) | ||
for shard in obj['shards']: | ||
for key in ('raw_data', 'zip_data', 'raw_meta', 'zip_meta'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we really ought to make this a Shard method, which is subject to inheritance and so on
this code won't work for parquet shards :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific suggestion how to deal with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work for json/xsv or just for mds index files? Could you test that as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do json/xsv index files have the same file format? @knighton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left comments!
shard = obj['shards'][i] | ||
for key in ('raw_data', 'zip_data', 'raw_meta', 'zip_meta'): | ||
if shard.get(key): | ||
basename = shard[key]['basename'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait why are you just taking the basename of the child file here? and to be clear, why the basename of the parent as well, what if the dir to merge is >1 hops deep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if anyone takes basename
in the format literally, that would be a mistake lol, those are actually always relative paths, was originally named wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XiaohanZhangCMU was this resolved?
@knighton I updated the description to include some of the profiling results. PTA~ |
@XiaohanZhangCMU is this ready for another round of reviewing? would be good to get it in |
49f5fae
to
18a2f97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments!
|
||
Args: | ||
index_file_urls (List[Union[str, Tuple[str,str]]]): index.json from all the partitions. | ||
index_file_urls (List[Union[str, Tuple[str,str]]]): index.json from all the streams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function had typed arguments before, could you add that back instead of using generic *args, **kwargs
? Would help clarify what the function is expecting, since otherwise, people have to go looking for the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an entry function and it branches to either merge_index from list or merge_index from a root folder. Are you recommending to not remove this entry function, and just keep the two actual implementations? Can you elaborate a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I mean that the signature of the merge_index
function isn't typed -- it just has *args, **kwargs
, but the docstring has types for the arguments, for example, index_file_urls
has type (List[Union[str, Tuple[str,str]]])
. I'm suggesting we add types to the merge_index
function so that users will easily be able to see what the merge_index
function takes in (for example, IDEs all support this, would be better for our docs, etc). Otherwise they have to reference the docstring, which doesn't match up with the function signature. lmk if i should elaborate more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping this entry function is fine
with open(partition_index, 'r') as f: | ||
obj = json.load(f) | ||
for shard in obj['shards']: | ||
for key in ('raw_data', 'zip_data', 'raw_meta', 'zip_meta'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work for json/xsv or just for mds index files? Could you test that as well?
56a2e1b
to
1a0c458
Compare
@XiaohanZhangCMU what remaining changes do we need here? |
Description of changes:
Make merge_index utility run in parallel with multiprocessing. Note the normal use case for merge index happens after mds shards are written to a number of partition folders (remote or local), and one wants to merge the index files of those folders into one merged index file. Depending on the number of cores available at the driver node, downloading and merging use all the available cores.
profiling with 40 cores. Explanation of the profile table:
Downloading
Merging
Issue #, if available:
Merge Checklist:
Put an
x
without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
pre-commit
on my change. (check out thepre-commit
section of prerequisites)