-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for shape touching in Dask #966
Conversation
src/uproot/_dask.py
Outdated
def build_form_key_to_key(form: Form) -> dict[str, str | None]: | ||
form_key_to_path: dict[str, str | None] = {} | ||
|
||
def impl(form, column_path): | ||
# Store columnar path | ||
form_key_to_path[form.form_key] = column_path[-1] if column_path else None | ||
|
||
if form.is_union: | ||
for _i, entry in enumerate(form.contents): | ||
impl(entry, column_path) | ||
elif form.is_indexed: | ||
impl(form.content, column_path) | ||
elif form.is_list: | ||
impl(form.content, column_path) | ||
elif form.is_option: | ||
impl(form.content, column_path) | ||
elif form.is_record: | ||
for field in form.fields: | ||
impl(form.content(field), (*column_path, field)) | ||
elif form.is_unknown or form.is_numpy: | ||
pass | ||
else: | ||
actual_form = self.rendered_form | ||
raise AssertionError(form) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This simple recursion builds a mapping form form_key
to the key
(depth-0 field) in the TTree
src/uproot/_dask.py
Outdated
buffer_key: Final[str] = "{form_key}-{attribute}" | ||
|
||
def parse_buffer_key(self, buffer_key: str) -> tuple[str, str]: | ||
form_key, attribute = buffer_key.rsplit("-", maxsplit=1) | ||
return form_key, attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we decide upon a trivial form_key ←→ buffer_key
encoding.
src/uproot/_dask.py
Outdated
def keys_for_buffer_keys(self, buffer_keys: set[str]) -> set[str]: | ||
keys: set[str] = set() | ||
for buffer_key in buffer_keys: | ||
# Identify form key | ||
form_key, attribute = buffer_key.rsplit("-", maxsplit=1) | ||
# Identify key from form_key | ||
keys.add(self._form_key_to_key[form_key]) | ||
return keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the new dask-awkward buffer projection, we implement a function that maps buffer keys (of our own choosing!) to the TTree keys.
src/uproot/_dask.py
Outdated
|
||
def load_buffers( | ||
self, tree: HasBranches, keys: set[str], start: int, stop: int, options: Any | ||
) -> Mapping[str, AwkArray]: | ||
# First, let's read the arrays as a tuple (to associate with each key) | ||
arrays = tree.arrays( | ||
keys, | ||
entry_start=start, | ||
entry_stop=stop, | ||
ak_add_doc=self.interp_options["ak_add_doc"], | ||
ak_add_doc=options["ak_add_doc"], | ||
how=tuple, | ||
) | ||
|
||
if self.original_form is not None: | ||
awkward = uproot.extras.awkward() | ||
dask_awkward = uproot.extras.dask_awkward() | ||
awkward = uproot.extras.awkward() | ||
|
||
# The subform generated by awkward.to_buffers() has different form keys | ||
# from those used to perform buffer projection. However, the subform | ||
# structure should be identical to the projection optimisation | ||
# subform, as they're derived from `branch.interpretation.awkward_form` | ||
# Therefore, we can correlate the subform keys using `expected_from_buffers` | ||
container = {} | ||
for key, array in zip(keys, arrays): | ||
# First, convert the sub-array into buffers | ||
ttree_subform, length, ttree_container = awkward.to_buffers(array) | ||
|
||
# Load the associated projection subform | ||
projection_subform = self._form.content(key) | ||
|
||
# Correlate each TTree form key with the projection form key | ||
for (src, src_dtype), (dst, dst_dtype) in zip( | ||
ttree_subform.expected_from_buffers().items(), | ||
projection_subform.expected_from_buffers(self.buffer_key).items(), | ||
): | ||
assert src_dtype == dst_dtype # Sanity check! | ||
container[dst] = ttree_container[src] | ||
|
||
return container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the dask-awkward buffer projection interface, this function populates a dictionary of buffer keys (that match our remapped form, and hence, our buffer-key scheme) with the corresponding buffers loaded from the TTree.
The caller will fill in the blanks for non-read buffers.
src/uproot/_dask.py
Outdated
class TrivialFormMapping(ImplementsFormMapping): | ||
def __call__(self, form: Form) -> tuple[Form, TrivialFormMappingInfo]: | ||
new_form = form_with_unique_keys(form, "<root>") | ||
return new_form, TrivialFormMappingInfo(new_form) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ImplementsFormMapping
interface just provides us with a way to couple some state with a remapped form, hence it is very thin.
src/uproot/_dask.py
Outdated
@property | ||
def meta(self) -> AwkArray: | ||
awkward = uproot.extras.awkward() | ||
high_level_form, form_info = self.form_mapping(self.base_form) | ||
return awkward.typetracer.typetracer_from_form( | ||
high_level_form, | ||
highlevel=True, | ||
behavior=form_info.behavior, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meta
is separate to the column projection, but enables the from_map
function to derive the meta from a single source of truth.
Hey team (@jpivarski and @ioanaif), would you mind casting your eyes over this PR and see whether anything stands out. Particular with respect to the abstractions being built here. I chose to avoid branching, and instead have form remapping be the base case, and treat non-remapping as a trivial remapper. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you've completely refactored the uproot.dask
implementation. That's good—it needed it—but there isn't an easy way to review this from looking at the code. We'll have to do it by extensive tests.
Getting #968 and #969 into main and then merging them into this PR will be a first step, so that we have a good baseline to start with. Then we can get all of the required pieces into the same environment.
I take it there isn't an order in which we can apply these changes in successive releases of awkward, dask-awkward, uproot, and coffea to not break things while releasing one new version of each at a time? Does it need to be all at once? If so, then we should plan the new version numbers simultaneously, so that we can make the new version of each package require (at minimum) the new version of each of the other packages, so that either the packaging system or the runtime-ImportError message specifies the versions needed.
Here's a first attempt at that:
- awkward (root dependency): 2.4.5
- dask-awkward (depends on awkward): 2023.10.0
- uproot (runtime dependency on dask-awkward): 5.1.0 (a new minor version because it will be easier to keep track of than a large number in the third digit, like 5.0.13)
- coffea (depends on awkward, uproot, and dask-awkward): 2023.10.0.rc0
When everything tests well in a common environment, we should release the new, mutually compatible versions of all four packages in a short window of time, and all of their minimum requirements on the other packages in the group should be these numbers at the time of release.
During the window, it will be possible for someone to get a bad combination because the currently latest versions of the dependent packages don't put an upper cap on their requirements for the others. I think the best way to deal with that is just to make that window short (pull off the band-aid quickly).
We can
That should minimise the pain, I hope, as |
This reverts commit 4a69863.
3860ffb
to
f63ca09
Compare
…o agoose77/wip-shape-tracking
…o agoose77/wip-shape-tracking
@jpivarski and @ioanaif I've bumped the awkward version to 2.4.5+, which is strict but necessary to ensure that uproot.dask works. My view is awkward 2 users should not have any upper caps at this stage, do you feel OK with this change? If so, let's merge! |
@jpivarski could we make another uproot pre-release at some point? |
5.1.0rc2 is another pre-release: https://pypi.org/project/uproot/#history |
This PR adds support for the
shape_touched
tracking performed by Awkward typetracers in order to avoid reading all data buffers.High level overview of this PR:
_UprootRead
et al. implement the new buffer projection interfacefrom_buffers