Dropping duplicates without using Pandas #1353

blairium · 2024-12-23T03:34:50Z

blairium
Dec 23, 2024

At present I'm loading arrays using pandas so that I can drop duplicates that occur in certain rows, e.g.:

events.arrays(GROUP, library="pd").drop_duplicates(subset=['trackID','eventID','flagParticle'])

Is there a way to do this using expressions/cuts instead? Pandas is pretty memory hungry so it'd be very useful if I could avoid it.

Thank you in advance.

jpivarski · 2024-12-23T16:24:31Z

jpivarski
Dec 23, 2024
Maintainer

Dropping duplicates is the sort of thing that Pandas does better than NumPy or Awkward Array. (The expressions/cuts interface just applies NumPy or Awkward Array cuts—it doesn't do anything special internally.) There's a variety of ways you can cobble together a "drop duplicates" operation in NumPy (see this StackOverflow question), and ak.run_lengths is helpful for doing that in Awkward Array if you're looking for duplicates within each variable-length list, but I assume that "trackID", "eventID", "flagParticle" are not within nested lists.
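For reference, one of the NumPy-only approaches can be sketched as follows. This is a minimal illustration, assuming the three branches are flat integer arrays of equal length; the sample values are made up, and `np.unique` with `axis=0` is only one of several ways to do it.

```python
import numpy as np

# Hypothetical flat columns standing in for the three branches.
track_id = np.array([1, 1, 2, 2, 3, 1])
event_id = np.array([10, 10, 10, 11, 12, 10])
flag_particle = np.array([0, 0, 1, 1, 0, 0])

# Stack the columns so that each row is one (trackID, eventID, flagParticle) triple.
triples = np.stack([track_id, event_id, flag_particle], axis=1)

# return_index gives the position of the first occurrence of each unique row.
_, first_positions = np.unique(triples, axis=0, return_index=True)

# np.unique orders its output by value, so sort the positions to restore row order.
keep = np.sort(first_positions)

print(keep)  # positions of the rows to keep
```

The `keep` array can then index any other column of the same length, for example `energy[keep]` for a hypothetical `energy` array.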

Probably the best way to go about it would be to leverage Pandas's functionality (use its strengths) while minimizing the memory it holds. Nobody says you need to read all of your data in one command, rather than two or three, and you can pull different columns in each file-read command. To minimize Pandas memory, read only the three columns you're checking for duplicates into Pandas. By default, the DataFrame will be made with an index of increasing integers, starting at zero, and after you eliminate duplicates, the index will be missing the integers that correspond to all but one instance of each "trackID", "eventID", "flagParticle" triple. The surviving index values can be used as an integer-array index that selects the same rows of the other variables in a NumPy or Awkward Array. Schematically, like this:

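import numpy as np

# Read only the three key columns into Pandas and drop duplicate triples.
THREE_VARIABLES = ["trackID", "eventID", "flagParticle"]
triple = events.arrays(THREE_VARIABLES, library="pd").drop_duplicates(subset=THREE_VARIABLES)

# The surviving index labels select the same rows from the other columns.
what_you_want = events.arrays(GROUP, library="ak")[np.asarray(triple.index)]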

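But if your data are purely numeric, without any variable-length lists, use library="np". In that case the result is a dict of NumPy arrays, so apply the index to each array in the dict rather than to the dict itself.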

(I say it's "schematically" like the above because you'll probably want to remove the THREE_VARIABLES from your GROUP, and maybe make other minor adjustments before it works. But this is a strategy that you can use.)
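To make the strategy concrete without a ROOT file, here is a small self-contained sketch on synthetic data. The DataFrame and the dict stand in for what `events.arrays(...)` would return with `library="pd"` and `library="np"`; the `energy` and `time` columns and all values are hypothetical, and the data are assumed to be flat and numeric.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the three key columns read with library="pd".
key_columns = pd.DataFrame({
    "trackID": [1, 1, 2, 2, 3, 1],
    "eventID": [10, 10, 10, 11, 12, 10],
    "flagParticle": [0, 0, 1, 1, 0, 0],
})

# Synthetic stand-in for the remaining columns, shaped like a library="np" result.
other_columns = {
    "energy": np.array([5.0, 5.0, 7.5, 8.0, 1.2, 5.0]),
    "time": np.array([0.1, 0.1, 0.4, 0.9, 1.3, 0.1]),
}

# drop_duplicates keeps the first row of each triple and preserves its index label.
deduplicated = key_columns.drop_duplicates(subset=["trackID", "eventID", "flagParticle"])
keep = np.asarray(deduplicated.index)

# Apply the same integer index to every remaining column.
selected = {name: values[keep] for name, values in other_columns.items()}

print(keep)
print(selected["energy"])
```

Only the three key columns ever live in a DataFrame here; everything else stays in plain NumPy arrays.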

1 reply

blairium Dec 24, 2024
Author

Thank you for your incredibly detailed answer. This was very helpful and reduced my memory use by several orders of magnitude.
