Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: added hasattr check to NanoAODEvents __repr__ #909

Merged
merged 20 commits into from
Oct 18, 2023

Conversation

pviscone
Copy link
Contributor

Hi everyone,
I don't know how common it is, but sometimes I use skimmed NanoAOD files that contain only the Events tree.

In this case, the repr function of NanoAODEvents raises an error because the attribute in the repr string does not exist. I just added a check using hasattr.

@NJManganelli
Copy link
Collaborator

Hi, the fields referenced here should be members of the NanoAOD Events tree. The NanoAODSchema does not, as I recall, connect anything with the auxiliary Runs or LuminosityBlock trees (where available). When you say you make a skimmed tree, do you mean that the Events tree itself has been stripped of the run, luminosityBlock, or event field? If not, would you have a small (few events) file to test against?

@pviscone
Copy link
Contributor Author

Yes, I'm sorry, I got confused.
You are right, I'm talking about the Events tree that is stripped of the run, luminosityBlock, or event field.

@NJManganelli
Copy link
Collaborator

On the one hand, I'd strongly discourage anyone from dropping these fields. The savings are very small, but the costs of having to regenerate ntuples when one needs to do event overlap checks later on, or analysis cross-checks (against partner frameworks for example, or against older versions of the analysis if things have changed), or someone requests an event display, can mean weeks or months of delay. The run field can be needed to distinguish problematic data collection sub-eras, such as the beamspot problems in 2017 ("Early" and "Late" Run C), or to do MET x-y corrections.

On the other hand, this can be a nuisance or worse, and sometimes one is stuck with a cut-down format. I'm leaning towards being opinionated enough to warn people when these fields are missing (ala missing cross-refs between collections, but with enough info to ensure it's being done with cognizance of the considerable drawbacks alluded to above), but not getting in the way of analysis being done. @lgray @nsmith- @andrzejnovak thoughts?

@lgray
Copy link
Collaborator

lgray commented Oct 15, 2023

Yeah that's certainly an odd choice to leave out of a tree, especially when cross-validation between analysis groups is so common (and as @NJManganelli noted it just doesn't save much disk space).

In any case it exists so we must deal with it.

I think the opinionated warning from above makes sense. I would even go so far as to raise an error if this happens since these columns are so commonly useful, and have an option in the schema to demote the error to a warning (with also the suggested patch in this PR).

@pviscone @NJManganelli @andrzejnovak @nsmith- make sense to you?

@andrzejnovak
Copy link
Member

Agreed, a warning that forces one to "opt-in" sounds like a reasonable compromise. I don't think it's a good idea to "won't fix" just because users should not typically remove it. One might easily find themselves in a situation where they need to use a skimmed nano someone else made and it doesn't make sense to disallow it on principle.

@NJManganelli
Copy link
Collaborator

@pviscone would you be willing to implement these features? We can offer guidance.

Some elements would include:

  1. a variable like error_missing_event_ids = True below https://github.com/CoffeaTeam/coffea/blob/master/src/coffea/nanoevents/schemas/nanoaod.py#L43 .

  2. perhaps right below that, a list of the event_ids : ["run", "luminosityBlock", "event"]

  3. the docstrings should be updated to inform users of the above variable, it's usage, and to note the event_ids purpose, like the other elements in the form

  4. inside https://github.com/CoffeaTeam/coffea/blob/master/src/coffea/nanoevents/schemas/nanoaod.py#L193

if <any of the event_ids are missing>:
    if error_missing_event_ids:
        <raise an error>
    else:
        <raise just a warning>

Ideally, the printed message should specify exactly which field(s) are missing from the event_ids (both error and warning). Message should be opinionated about removing the event_ids, and can note some of the quite common usecases we've brought up in the conversation here.

@pviscone
Copy link
Contributor Author

Thank You for the hints.

I implemented the features, but maybe You want to improve the error message, I have just reported what was discussed in this PR

@@ -206,6 +212,24 @@ def _build_collections(self, field_names, input_contents):
branch_forms["n" + name]
)

# Check the presence of the event_ids
missing_event_ids = [
event_id for event_id in self.event_ids if not hasattr(self, event_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
event_id for event_id in self.event_ids if not hasattr(self, event_id)
event_id for event_id in self.event_ids if event_id not in branch_forms

f"event_ids {missing_event_ids} is/are missing. "
"These fields can be needed to distinguish problematic "
"data collection sub-eras or to do an analysis cross-check.\n"
"To prevent this error turn error_missing_event_ids to False"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these 3 lines can read as follows (@andrzejnovak @lgray @nsmith- crosscheck me on verbosity):
"The event ID fields {self.event_ids} are necessary to perform sub-run identification (e.g. for corrections and sub-dividing data during different detector conditions), to cross-validate MC and Data (i.e. matching events for comparison), and to generate event displays. It's advised to never drop these branches from the dataformat.

This error can be demoted to a warning by setting the class level variable error_missing_event_ids to False."

hasattr(self, "run")
and hasattr(self, "luminosityBlock")
and hasattr(self, "event")
):
Copy link
Collaborator

@NJManganelli NJManganelli Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option may be to use getattr(self, field, "??") such that if 1 or 2 of the fields are there, they'll still print. (Or something similar)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the attributes are missing I don't know if it is better just to print an empty string or again to warn the user that the attributes are missing with something like

"<event (missing run):(missing luminosityBlock):(missing Event)>"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeated warnings tend to be annoying to users, for what it's worth.

@lgray
Copy link
Collaborator

lgray commented Oct 17, 2023

@pviscone This is shaping up well. We'll merge it after #900 so we can run tests, thanks!

My one review comment is that we include an example file and a corresponding test which triggers this error or warning depending on setting. Given this is related to the TBranch metadata alone it can be a one-event skim or something trivial like that.

@pviscone
Copy link
Contributor Author

@lgray I committed all the comments. Here I have attached a NanoAOD file that contains just 2 events stripped from the luminosityBlock field. (The file is zipped because GitHub does not allow to upload directly root files)

missing_luminosityBlock.zip

@lgray
Copy link
Collaborator

lgray commented Oct 17, 2023

oh just add the .root file to the repo via a git commit in the tests/samples directory and make a corresponding test.

def test_missing_eventIds_warning():
path = os.path.abspath("tests/samples/missing_luminosityBlock.root")
with pytest.raises(RuntimeWarning, match="Missing event_ids : ['luminosityBlock']"):
NanoAODSchema.error_missing_event_ids = False
Copy link
Contributor Author

@pviscone pviscone Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a safer way to turn the error_missing_events_ids flag to False?
In this way, I think that it will disable it globally

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the intended pattern, it's fine!

@lgray
Copy link
Collaborator

lgray commented Oct 18, 2023

ok this will go in the next pre-release of coffea :-)

@lgray lgray merged commit 2cf1196 into scikit-hep:master Oct 18, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants