Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support shape_touched from Dask #900

Merged
merged 33 commits into from
Oct 17, 2023

Conversation

agoose77
Copy link
Contributor

@agoose77 agoose77 commented Oct 3, 2023

This PR at present represents a hack of the existing uproot remapping loader to support dask-awkward's new buffer projection interface.

@lgray the Coffea form_mapping object that is passed to uproot.dask needs to satisfy the ImplementsFormMapping interface:

class ImplementsFormMapping(Protocol):
    def __call__(self, form: Form) -> tuple[Form, ImplementsFormMappingInfo]:
        ...

The idea is that the form mapping generates both the remapped form and any auxiliary state required for projection. The return tuple's second item, ImplementsFormMappingInfo has the following protocol:

class ImplementsFormMappingInfo(Protocol):
    @property
    def behavior(self) -> dict | None:
        ...

    buffer_key: str | Callable

    def parse_buffer_key(self, buffer_key: str) -> tuple[str, str]:
        ...

    def keys_for_buffer_keys(self, buffer_keys: frozenset[str]) -> frozenset[str]:
        ...

    def load_buffers(
        self,
        tree: HasBranches,
        keys: frozenset[str],
        start: int,
        stop: int,
        options: Any,
    ) -> Mapping[str, AwkArray]:
        ...

This PR hacks the _map_schema_uproot object to provide this interface, but it could be cleaner; one could remove any unused code here.

@agoose77
Copy link
Contributor Author

agoose77 commented Oct 4, 2023

@lgray this PR implements the protocols that uproot expects for form remapping in factory.py for the Uproot source. I'm not yet au-fait with how the parquet equivalent is handled here, but you'd probably need to replicate the logic in uproot._dask.UprootReadMixin for the parquet IO function.

Can I hand this over to you?

@lgray
Copy link
Collaborator

lgray commented Oct 4, 2023

Yes I can handle the parquet reader stuff - no big deal there.

I'll give this a try today.

@lgray
Copy link
Collaborator

lgray commented Oct 7, 2023

@agoose77 we're just missing a release of dask-histogram and then I can mess with pins here. I'll let you handle the pre-release vs. full release state of things.

Once we get everything to a full release we can merge this one.

pyproject.toml Outdated Show resolved Hide resolved
@lgray lgray marked this pull request as ready for review October 7, 2023 15:41
@lgray
Copy link
Collaborator

lgray commented Oct 7, 2023

For the two new protocols defined we should update dask-awkward since that's where the present protocol ImplementsFormTransformation is kept and I have no problem with changing that.

@agoose77
Copy link
Contributor Author

agoose77 commented Oct 7, 2023

For the two new protocols defined we should update dask-awkward since that's where the present protocol ImplementsFormTransformation is kept and I have no problem with changing that.

Could you clarify what you mean here @lgray ? :)

@lgray
Copy link
Collaborator

lgray commented Oct 8, 2023

@agoose77

This:
https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/io.py#L45-L61

@agoose77
Copy link
Contributor Author

agoose77 commented Oct 8, 2023

I've made a PR to remove it, as it's a better fit for uproot! :)

buffer_key(form_key=form_key, attribute=attribute, form=None)
]

return TranslateBufferKeys()

def create_column_mapping_and_key(self, tree, start, stop, interp_options):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgray can we remove this from the base definition, or does someone out there likely use it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interface is completely private and rather obscure to boot, so I am fine to remove it.

tests/test_jetmet_tools.py Outdated Show resolved Hide resolved
tests/test_jetmet_tools.py Outdated Show resolved Hide resolved
@@ -837,9 +837,9 @@ def test_corrected_jets_factory(optimization_enabled):
**{name: evaluator[name] for name in jec_stack_names[5:6]}
)

print(dak.necessary_columns(jets.eta))
print(dak.report_necessary_columns(jets.eta))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgray note that report_necessary_columns reports whatever the IO layer thinks a "column" is. It doesn't necessarily mean "dotted field" right now, rather I envisage it talking about the concept of a "key" or "column" that can be read from the IO source.

@douglasdavis this is something I didn't fully think about when we made necessary_columns shim report_necessary_columns.

The difference will be that for remapped forms, the report_necessary_columns (and by proxy, necessary_columns) will report the underlying ROOT file keys, not the remapped keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It therefore might mean that we should remove that alias for necessary_columns.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer the new behavior since users will probably expect to get a list of columns in the file they're reading! But seeing the remapped keys is nice for debugging.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it so that we get a version or two of dask-awkward before dak.necessary_columns is gone and give migration instructions? I know a few end-users that use that function.

Copy link
Contributor Author

@agoose77 agoose77 Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgray I'm not 100% clear if we're on the same page.

report_necessary_columns is a new function, so I'm comfortable with its semantics deviating from ak.necessary_columns.

It would be possible to restore the old behavior, but it will require a reasonable amount of work on the uproot side (I am guessing).

ak.necessary_columns is currently an alias for this new function, but for uproot this wouldn't be a non-breaking change; if previously it would report ROOT_KEY.record.foo.bar, now it would report ROOT_KEY. Is this OK? I don't see it being that useful if it breaks existing users.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I suppose for my most typical use case I never see keys with ROOT_KEY.record.foo.bar formatting, it's really only ever ROOT_KEY and any structure is up in the form being remapped to. So they appear the same to me.

@lgray
Copy link
Collaborator

lgray commented Oct 11, 2023

@nsmith- I think this one is ready for review now. If you could give it a bit of time before the end of the week it would be appreciated.

Comment on lines 79 to 86
def keys_for_buffer_keys(self, buffer_keys):
base_columns = set()
for buffer_key in buffer_keys:
form_key, attribute = self.parse_buffer_key(buffer_key)
operands = urllib.parse.unquote(form_key).split(",")

it_operands = iter(operands)
next(it_operands)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function behaves like extract_form_key_base_columns, but supports operands e.g. attribute lookup.

@@ -147,8 +175,17 @@ def create_column_mapping_and_key(self, tree, start, stop, interp_options):
use_ak_forth=True,
)
mapping.preload_column_source(partition_key[0], partition_key[1], tree)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've deliberately tried not to touch this logic at all. Hopefully, by presenting the correct interface to uproot, internal refactoring can happen here in future without any side effects.

@lgray
Copy link
Collaborator

lgray commented Oct 12, 2023

suspicious new failure...

@ikrommyd
Copy link
Collaborator

I'd like to report that I've been using this all day on LPC running stuff with the newest awkward, uproot, dask-awkward and dask-histogram and have not ran into any errors.

@agoose77
Copy link
Contributor Author

Todo: bump the uproot and dask-awkward versions

@lgray
Copy link
Collaborator

lgray commented Oct 16, 2023

@nsmith- a gentle reminder for a review, please!

Copy link
Member

@nsmith- nsmith- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do some cleanup, but it can happen later I think.
@agoose77 does the protocol you defined in the initial comment have some explanatory docstrings available somewhere?

@@ -1734,11 +1734,11 @@ def _work_function(
metrics = {}
if isinstance(file, uproot.ReadOnlyDirectory):
metrics["bytesread"] = file.file.source.num_requested_bytes
# metrics["data_and_shape_buffers"] = set(materialized)
# metrics["shape_only_buffers"] = set(materialized)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these placeholders for something in the future?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to get tests to run - executor is going to be significantly changed in a subsequent PR.

def parse_buffer_key(self, buffer_key):
prefix, attribute, form_key = buffer_key.rsplit("/", maxsplit=2)
if attribute == "offsets":
return (form_key[: -len("%2C%21offsets")], attribute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to also be in the previous code as well, but some of these string splitting and urllib name mangling things were put into https://github.com/CoffeaTeam/coffea/blob/master/src/coffea/nanoevents/util.py and could be used here

@@ -68,7 +67,7 @@ def _key_formatter(prefix, form_key, form, attribute):
return prefix + f"/{attribute}/{form_key}"


class _map_schema_base(ImplementsFormTransformation):
class _map_schema_base: # ImplementsFormMapping, ImplementsFormMappingInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to inherit the protocol(s)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They weren't exported yet! I think a new release includes them. Let me check.

@lgray lgray enabled auto-merge October 17, 2023 15:29
@lgray lgray merged commit 8662c9f into scikit-hep:master Oct 17, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants