At present I'm loading arrays using pandas so that I can drop duplicates that occur in certain rows, e.g. `events.arrays(GROUP, library="pd").drop_duplicates(subset=['trackID','eventID','flagParticle'])`. Is there a way to do this using expressions/cuts instead? Pandas is pretty memory hungry, so it'd be very useful if I could avoid it. Thank you in advance.
Dropping duplicates is the sort of thing that Pandas does better than NumPy or Awkward Array. (The expressions/cuts interface just applies NumPy or Awkward Array cuts; it doesn't do anything special internally.) There's a variety of ways you can cobble together a "drop duplicates" operation in NumPy (see this StackOverflow question), and `ak.run_lengths` is helpful for doing that in Awkward Array if you're looking for duplicates within each variable-length list, but I assume that `"trackID"`, `"eventID"`, and `"flagParticle"` are not within nested lists.

Probably the best way to go about it would be to leverage Pandas's functionality (use its strengths) while minimizing the memory it holds. Nobody says you need to read all of your data in one command, rather than two or three, and you can pull different columns in each file-read command.

To minimize Pandas memory, read only the three columns you're trying to find the duplicates of into Pandas. By default, the DataFrame will be made with an index of increasing integers, starting at zero, and after you eliminate duplicates, the index will be missing the integers that correspond to all but one instance of each duplicated combination. Schematically:

    import numpy as np

    THREE_VARIABLES = ["trackID", "eventID", "flagParticle"]
    # Read only the three key columns into Pandas and drop the duplicate rows.
    triple = events.arrays(THREE_VARIABLES, library="pd").drop_duplicates(subset=THREE_VARIABLES)
    # Use the surviving index values to select the same rows from the full read.
    what_you_want = events.arrays(GROUP, library="ak")[np.asarray(triple.index)]

But if your data are purely numeric, without any variable-length lists, use … (I say it's "schematically" like the above because you'll probably want to remove the …
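To make the index trick concrete without needing a ROOT file, here is a minimal sketch that uses small in-memory NumPy arrays as a stand-in for what `events.arrays(...)` would return. The column values, the extra `energy` column, and the names `columns`, `KEYS`, `keep`, and `deduplicated` are all made up for illustration. It also shows one possible NumPy-only variant based on `np.unique`, which is just one of the several ways to cobble this together, not necessarily the one in the linked question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the columns read from the file; values are invented.
columns = {
    "trackID": np.array([1, 1, 2, 2, 3]),
    "eventID": np.array([7, 7, 7, 8, 8]),
    "flagParticle": np.array([0, 0, 1, 1, 1]),
    "energy": np.array([10.0, 11.0, 12.0, 13.0, 14.0]),
}
KEYS = ["trackID", "eventID", "flagParticle"]

# Only the three key columns ever go into Pandas.
triple = pd.DataFrame({name: columns[name] for name in KEYS}).drop_duplicates(subset=KEYS)

# With the default integer index, the surviving labels are the row positions to keep.
keep = np.asarray(triple.index)

# Apply the same row selection to every column, outside of Pandas.
deduplicated = {name: values[keep] for name, values in columns.items()}

# NumPy-only variant: np.unique over rows reports the first occurrence of each unique row.
stacked = np.stack([columns[name] for name in KEYS], axis=1)
_, first_seen = np.unique(stacked, axis=0, return_index=True)
keep_numpy = np.sort(first_seen)  # restore the original row order
```

In this toy input only the second row repeats an earlier (trackID, eventID, flagParticle) combination, so both `keep` and `keep_numpy` select every row except that one. The Pandas route keeps only three columns in a DataFrame; the NumPy route avoids Pandas entirely but requires the key columns to share a dtype that can be stacked.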