feat(wip): edm4hep development #822
base: master
Conversation
@jbrewster7 don't worry about the failures on arm, they're intermittent (probably depending on the host machine the CI job lands on).
coffea/nanoevents/schemas/edm4hep.py
> - Extended quantities of physics objects are stored in the format
>   `<Object>_<variable>`, such as "Jets_jecFactor". Such variables will be
>   merged into the collection `<Object>`, so the branch "Jets_jecFactor" will be
>   accessible in the array format as "Jets.jecFactor". An exception to the
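The `<Object>_<variable>` merging rule quoted above can be sketched in a few lines. This is a hypothetical illustration of the convention, not coffea's actual schema code; the function name and return shape are made up for the example.

```python
from collections import defaultdict

def group_branches(branch_names):
    # Hypothetical sketch: merge flat "<Object>_<variable>" branch names
    # into per-collection mappings, so "Jets_jecFactor" becomes
    # accessible as "Jets.jecFactor". Not coffea's real implementation.
    collections = defaultdict(dict)
    for name in branch_names:
        obj, sep, var = name.partition("_")
        if sep:
            collections[obj][var] = name
    return dict(collections)

grouped = group_branches(["Jets_jecFactor", "Jets_pt", "Muons_pt"])
# grouped["Jets"] maps variable names to the original flat branch names
```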
In "podio speak" these are `VectorMembers`, and they are used in several of the datatypes of edm4hep. Note that there are also `OneToOneRelations` and `OneToManyRelations`, which are currently still stored using the convention `<collection-name>#<index>`. These are the branches you would need to use to resolve e.g. mother/daughter relations for MCParticles. The `<index>` is currently somewhat "magic", but this is going to change with AIDASoft/podio#405, where the new schema of the branch names will be `_<collection-name>_<name>` for both `VectorMembers` and the relations. Let me know if you want to wait for that or whether you want some information about the magic.
I am not sure how general you want to make this in the end, but we could potentially help with some code generation on the podio side if you think that would be in any way useful.
@tmadlener Thanks for the info. Sorry, the documentation is for another data format type - I'll clean it up next week. In any case I can always find the constituents of a given physics object and wrap them together. It was pretty clear what to do in edm4hep!
As far as code generation goes: once we work it out here it could be nice to move it properly to podio (though that's not a python package but rather a generator for one)... We could envision some CI to put something on PyPI. Though that may not be well advised, since what's being generated here is the schema as opposed to the implementation (which in our case is handled by more general functions plus an embedded domain-specific language that does know how to handle cross references if you give it the source/mapping/target). It may be best that it stays here; it only needs to be done once anyway, since it gets distributed with coffea. We can work it out in time.
I think in the meantime, if you could help us with the "magic" cross references so we know what we're supposed to be referencing, that would be really useful! We can add in the better-named xref branches when they're there and files with them are available.
I agree on the code generation. Let's see how this works out and how frequent changes would need to be.
So the "magic" is effectively that the indices get generated sequentially in the order the relations appear in the yaml definition file, starting with `OneToManyRelations` and then continuing with `OneToOneRelations`. So as an example the `edm4hep::ReconstructedParticle` has

```yaml
OneToOneRelations:
  - edm4hep::Vertex startVertex                 //start vertex associated to this particle
  - edm4hep::ParticleID particleIDUsed          //particle Id used for the kinematics of this particle
OneToManyRelations:
  - edm4hep::Cluster clusters                   //clusters that have been used for this particle.
  - edm4hep::Track tracks                       //tracks that have been used for this particle.
  - edm4hep::ReconstructedParticle particles    //reconstructed particles that have been combined to this particle.
  - edm4hep::ParticleID particleIDs             //particle Ids (not sorted by their likelihood)
```

and on file the branches will be (assuming the collection name is "reco"):

- `reco#0` for the related Clusters
- `reco#1` for the related Tracks
- ...
- `reco#4` for the (single) startVertex
- `reco#5` for the (single) particleIDUsed
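That numbering rule can be written down in a few lines. This is a sketch of the rule as described here (sequential indices, `OneToManyRelations` first, then `OneToOneRelations`); the function name is made up and the real branch layout is of course defined by podio itself.

```python
def relation_branches(collection, one_to_many, one_to_one):
    # Sketch of the "magic" numbering: indices are assigned sequentially
    # in yaml order, OneToManyRelations before OneToOneRelations.
    ordered = list(one_to_many) + list(one_to_one)
    return {rel: f"{collection}#{i}" for i, rel in enumerate(ordered)}

branches = relation_branches(
    "reco",
    one_to_many=["clusters", "tracks", "particles", "particleIDs"],
    one_to_one=["startVertex", "particleIDUsed"],
)
# branches["clusters"] is "reco#0", branches["startVertex"] is "reco#4"
```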
Indeed, that's a little bit hacky to keep working, since there's no clear way to keep track of the schema over time.
The next version sounds much more tractable in the long run!
@jbrewster7 how are things going w.r.t. cross references?

@lgray Hi! The cross references seem to all be working well, and I'm trying to set them up in a way that makes them easy enough to change when they go from magic numbers to names.
Hi @jbrewster7, if you want to make this transparent, it should be enough to use some information that is available in the file: `name_to_id = {n: i for (n, i) in zip(names, collectionIDs)}` should effectively give you a map of names to IDs. This will also hold after we have switched to something more robust than we currently have.
@tmadlener I'll poke around with that table - I remember uproot complaining about it. If it won't read, I'll make a PR to uproot to get it happening.

@jbrewster7 Hooray! That's awesome!
@tmadlener ah, that was something else, the idTable is easily accessible:

```python
>>> x["podio_metadata"].keys()
['events___idTable', 'events___idTable/m_collectionIDs', 'events___idTable/m_names', 'events___CollectionTypeInfo', 'events___CollectionTypeInfo/events___CollectionTypeInfo._3', 'events___CollectionTypeInfo/events___CollectionTypeInfo._2', 'events___CollectionTypeInfo/events___CollectionTypeInfo._1', 'events___CollectionTypeInfo/events___CollectionTypeInfo._0', 'PodioBuildVersion', 'PodioBuildVersion/major', 'PodioBuildVersion/minor', 'PodioBuildVersion/patch', 'EDMDefinitions', 'EDMDefinitions/EDMDefinitions._1', 'EDMDefinitions/EDMDefinitions._0']
>>> x["podio_metadata"]["events___idTable"].array()[0].show()
{m_collectionIDs: [1, 2, 3, 4, 5, 6, 7, 8, ..., 40, 41, 42, 43, 44, 45, 46, 47],
 m_names: ['AllCaloHitContributionsCombined', ..., 'RecoMCTruthLink']}
```

A bit annoying we can't just do this with the awkward form alone, though. I'll need to add in a hook to nanoevents to let the schema get organizational data that's available in the file! @nsmith- thoughts?
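Given the two parallel arrays in `events___idTable`, the name-to-ID map suggested earlier is a one-liner. The lists below are stand-in values for illustration; in practice `m_names` and `m_collectionIDs` would come from reading `podio_metadata` with uproot as shown above.

```python
# Stand-in values; in a real file these come from
# x["podio_metadata"]["events___idTable"].array()[0].
names = ["AllCaloHitContributionsCombined", "reco", "RecoMCTruthLink"]
collection_ids = [1, 2, 47]

# Map collection names to their numeric IDs, and the inverse.
name_to_id = {n: i for (n, i) in zip(names, collection_ids)}
id_to_name = {i: n for (n, i) in name_to_id.items()}
```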
@jeyserma FYI
coffea/nanoevents/methods/vector.py

```python
@awkward.mixin_class(behavior)
class LorentzVectorM(ThreeVector):
```
You should double-check, but I think all that's needed is to subclass `LorentzVector` and implement two overrides:

```python
@property
def t(self):
    return numpy.sqrt(self["mass"] * self["mass"] + self.rho2)

@property
def mass(self):
    return self["mass"]
```

Basically the magic is that `__getitem__` always gets what's actually in the awkward array, and all `__getattr__` is for derived or almost-derived quantities. That way all the other methods should pick up the right override.
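The override idea can be checked with a plain-Python stand-in, without the awkward mixin machinery. The class and attribute names here are illustrative, not coffea's actual `LorentzVectorM`; the point is just that energy (`t`) is derived from the stored mass and 3-momentum.

```python
import math

class MassVector:
    # Minimal stand-in for a mass-stored Lorentz vector: mass is the
    # stored quantity, and t (energy) is derived from mass and rho2,
    # mirroring the two property overrides suggested above.
    def __init__(self, px, py, pz, mass):
        self.px, self.py, self.pz = px, py, pz
        self._mass = mass

    @property
    def rho2(self):
        # squared magnitude of the 3-momentum
        return self.px**2 + self.py**2 + self.pz**2

    @property
    def mass(self):
        return self._mass

    @property
    def t(self):
        return math.sqrt(self._mass**2 + self.rho2)

v = MassVector(3.0, 0.0, 4.0, 0.0)  # massless: t equals |p|
```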
coffea/nanoevents/factory.py

```python
behavior = {}
behavior.update(base.behavior)
behavior.update(vector.behavior)
elif schemaclass is EDM4HEPSchema:
```
This is following the existing pattern, so it's a question for @lgray: why not use the class property `EDM4HEPSchema.behavior` here?
This needs cleanup in a separate PR to harmonize the interface with the eager one. You're right, but I'm marking it for later.
On the issue of needing additional metadata to properly reconstruct the NanoEvents schema (i.e. the awkward form) from the list of branches: the only issue I see is that, in the case of ROOT files, there is no standard way of storing it, so each file type will need its own. Perhaps what can be done is to add a new hook in the schema class interface: a classmethod that takes in the TFile and does what it needs to find the rest of the metadata. But this of course is only for ROOT, since parquet can mostly capture awkward data types and there is a standard spot (iirc) to put the mixin behavior names.
@jpivarski given @nsmith-'s comment (which is where I had more or less landed as well): is there a reasonable way to get additional serialized metadata that's within the (first) root file (even if it is globbed) as a tree or class? I suppose I have the file string, but it might be globbed, which I can't use directly. That should be sufficiently quick.
Indeed it's pretty easy:

```python
import uproot

x = uproot.dask({"~/Downloads/rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402004.Pe2e2h.eR.pL.n000.d_dstm_15090_*.slcio.edm4hep.root": "podio_metadata/events___idTable"})
```
Hey, @jbrewster7, just to check: how are things coming along with the cross references? I think whatever you have now is fine, and then I'd be happy to take that and mutate it to use this stuff I was talking about with @nsmith- and @jpivarski. However, I need to see what you've got first to act on that! There's no need to hone it to perfection, just good enough; going further is really more a collective process.
So yeah @nsmith- it looks like I can instead just take the idTable. Should be enough, and there's only one additional file open on the user side, for a small amount of data, to get the necessary bits of metadata.
…o the procedure. Added the class LorentzVectorM to vector.py so that vectors can be defined with mass instead of energy.
for more information, see https://pre-commit.ci
…particles and mc particles. This involved the creation of new classes in nanoevents/methods/edm4hep.py. This linking still isn't running error-free. There is a TypeError when the matched_gen function is called from the edm4hep methods.
Pushing changes made to add particle linking.
…ting still needs to be updated for proper construction of 4-vectors.
@jbrewster7 could you fix up the flake8 errors? (click the red x next to the failing check)
@lgray Yes, committing the changes now!
@lgray apologies for the slight delay. New files (and one with 40 events) are on cernbox: https://cernbox.cern.ch/s/eYmuXTRimfgmdZg

The 40-event one is produced from one of them using

```shell
source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh
```

and the following lines of python:

```python
from podio.root_io import Reader, Writer

reader = Reader("rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402004.Pe2e2h.eR.pL.n000.d_dstm_15090_0.edm4hep.root")
events = reader.get("events")
writer = Writer("output.edm4hep.root")
for i in range(40):
    writer.write_frame(events[i], "events")
```

(There will be a few warnings about some missing schema evolution, which can be safely ignored.)
Awesome, thanks!
@tmadlener FYI https://pypi.org/project/podio/ is not taken. It would be nice if it were pip installable.
TODO: update for interface changes in dask-awkward; build tests.
hi @lgray, @jbrewster7 - @prayagyadav was going to investigate an edm4hep coffea schema and stumbled on this PR. It looks like an open item for someone to pick up at this point, correct?
@davidlange6 this PR is still usable, but I know edm4hep has developed since we last touched it. I was working on finishing this up and then they changed everything. It remains a good starting point!
A schema for edm4hep files in nanoevents.
It's very similar to Delphes / Treemaker FWIW.
@jbrewster7 we should keep a list here of the things we need to do, until we think we've found a good feature set!