-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support shape_touched
from Dask
#900
feat: support shape_touched
from Dask
#900
Conversation
for more information, see https://pre-commit.ci
…uched' into agoose77/feat-support-shape-touched
@lgray this PR implements the protocols that uproot expects for form remapping in Can I hand this over to you? |
…uched' into agoose77/feat-support-shape-touched
Yes I can handle the parquet reader stuff - no big deal there. I'll give this a try today. |
for more information, see https://pre-commit.ci
…uched' into agoose77/feat-support-shape-touched
@agoose77 we're just missing a release of dask-histogram and then I can mess with pins here. I'll let you handle the pre-release vs. full release state of things. Once we get everything to a full release we can merge this one. |
Co-authored-by: Angus Hollands <[email protected]>
For the two new protocols defined we should update dask-awkward since that's where the present protocol |
Could you clarify what you mean here @lgray ? :) |
I've made a PR to remove it, as it's a better fit for uproot! :) |
src/coffea/nanoevents/factory.py
Outdated
buffer_key(form_key=form_key, attribute=attribute, form=None) | ||
] | ||
|
||
return TranslateBufferKeys() | ||
|
||
def create_column_mapping_and_key(self, tree, start, stop, interp_options): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgray can we remove this from the base definition, or does someone out there likely use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interface is completely private and rather obscure to boot, so I am fine to remove it.
@@ -837,9 +837,9 @@ def test_corrected_jets_factory(optimization_enabled): | |||
**{name: evaluator[name] for name in jec_stack_names[5:6]} | |||
) | |||
|
|||
print(dak.necessary_columns(jets.eta)) | |||
print(dak.report_necessary_columns(jets.eta)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgray note that report_necessary_columns
reports whatever the IO layer thinks a "column" is. It doesn't necessarily mean "dotted field" right now, rather I envisage it talking about the concept of a "key" or "column" that can be read from the IO source.
@douglasdavis this is something I didn't fully think about when we made necessary_columns
shim report_necessary_columns
.
The difference will be that for remapped forms, the report_necessary_columns
(and by proxy, necessary_columns
) will report the underlying ROOT file keys, not the remapped keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It therefore might mean that we should remove that alias for necessary_columns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually prefer the new behavior since users will probably expect to get a list of columns in the file they're reading! But seeing the remapped keys is nice for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it so that we get a version or two of dask-awkward before dak.necessary_columns
is gone and give migration instructions? I know a few end-users that use that function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgray I'm not 100% clear if we're on the same page.
report_necessary_columns
is a new function, so I'm comfortable with its semantics deviating from ak.necessary_columns
.
It would be possible to restore the old behavior, but it will require a reasonable amount of work on the uproot side (I am guessing).
ak.necessary_columns
is currently an alias for this new function, but for uproot this wouldn't be a non-breaking change; if previously it would report ROOT_KEY.record.foo.bar
, now it would report ROOT_KEY
. Is this OK? I don't see it being that useful if it breaks existing users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I suppose for my most typical use case I never see keys with ROOT_KEY.record.foo.bar
formatting, it's really only ever ROOT_KEY
and any structure is up in the form being remapped to. So they appear the same to me.
@nsmith- I think this one is ready for review now. If you could give it a bit of time before the end of the week it would be appreciated. |
src/coffea/nanoevents/factory.py
Outdated
def keys_for_buffer_keys(self, buffer_keys): | ||
base_columns = set() | ||
for buffer_key in buffer_keys: | ||
form_key, attribute = self.parse_buffer_key(buffer_key) | ||
operands = urllib.parse.unquote(form_key).split(",") | ||
|
||
it_operands = iter(operands) | ||
next(it_operands) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function behaves like extract_form_key_base_columns
, but supports operands e.g. attribute lookup.
@@ -147,8 +175,17 @@ def create_column_mapping_and_key(self, tree, start, stop, interp_options): | |||
use_ak_forth=True, | |||
) | |||
mapping.preload_column_source(partition_key[0], partition_key[1], tree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've deliberately tried not to touch this logic at all. Hopefully, by presenting the correct interface to uproot, internal refactoring can happen here in future without any side effects.
suspicious new failure... |
I'd like to report that I've been using this all day on LPC running stuff with the newest awkward, uproot, dask-awkward and dask-histogram and have not ran into any errors. |
Todo: bump the uproot and dask-awkward versions |
nb: safer to repin in coffea for users because of numba's sliding window, very easy to get a mismatch
@nsmith- a gentle reminder for a review, please! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to do some cleanup, but it can happen later I think.
@agoose77 does the protocol you defined in the initial comment have some explanatory docstrings available somewhere?
@@ -1734,11 +1734,11 @@ def _work_function( | |||
metrics = {} | |||
if isinstance(file, uproot.ReadOnlyDirectory): | |||
metrics["bytesread"] = file.file.source.num_requested_bytes | |||
# metrics["data_and_shape_buffers"] = set(materialized) | |||
# metrics["shape_only_buffers"] = set(materialized) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these placeholders for something in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just to get tests to run - executor is going to be significantly changed in a subsequent PR.
src/coffea/nanoevents/factory.py
Outdated
def parse_buffer_key(self, buffer_key): | ||
prefix, attribute, form_key = buffer_key.rsplit("/", maxsplit=2) | ||
if attribute == "offsets": | ||
return (form_key[: -len("%2C%21offsets")], attribute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to also be in the previous code as well, but some of these string splitting and urllib name mangling things were put into https://github.com/CoffeaTeam/coffea/blob/master/src/coffea/nanoevents/util.py and could be used here
@@ -68,7 +67,7 @@ def _key_formatter(prefix, form_key, form, attribute): | |||
return prefix + f"/{attribute}/{form_key}" | |||
|
|||
|
|||
class _map_schema_base(ImplementsFormTransformation): | |||
class _map_schema_base: # ImplementsFormMapping, ImplementsFormMappingInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to inherit the protocol(s)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They weren't exported yet! I think a new release includes them. Let me check.
This PR at present represents a hack of the existing uproot remapping loader to support dask-awkward's new buffer projection interface.
@lgray the Coffea
form_mapping
object that is passed touproot.dask
needs to satisfy theImplementsFormMapping
interface:The idea is that the form mapping generates both the remapped form and any auxiliary state required for projection. The return tuple's second item,
ImplementsFormMappingInfo
has the following protocol:This PR hacks the
_map_schema_uproot
object to provide this interface, but it could be cleaner; one could remove any unused code here.