PercentileCluster #616
It makes me a little uneasy that in the forward call the function variable `node_feature_names` might differ from the class-instantiated `self._node_feature_names` after the `_node_definition` call on line 147. While I do believe this is as intended, it might be quite confusing when revisiting the code later, so maybe consider a renaming.
update branch
@Aske-Rosted I've gone through the code and renamed the variables. Now, the input to |
Looks good to me. I do not have any other comments, although I am curious about the benefits of using numpy `lexsort` over other methods.
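(Not part of the PR itself, but for readers with the same question: a toy illustration of what an `np.lexsort`-based grouping buys. Sorting rows lexicographically in one vectorised call makes duplicate rows contiguous, so cluster boundaries fall out of a single adjacent-row comparison, with no Python-level hashing or loops. The data below is made up.)

```python
import numpy as np

# Toy pulse array: columns are x, y, z (the values we would cluster on).
pulses = np.array([
    [0.0, 1.0, 2.0],
    [5.0, 5.0, 5.0],
    [0.0, 1.0, 2.0],
    [5.0, 5.0, 5.0],
])

# np.lexsort sorts by the LAST key first, so passing the columns in
# reverse order sorts rows lexicographically by (x, y, z); duplicate
# rows become adjacent and each cluster is a contiguous slice.
order = np.lexsort(pulses.T[::-1])
sorted_pulses = pulses[order]

# A single adjacent-row comparison then finds cluster boundaries.
is_new_cluster = np.any(np.diff(sorted_pulses, axis=0) != 0, axis=1)
n_clusters = 1 + int(is_new_cluster.sum())
print(n_clusters)  # 2
```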
PercentileCluster
This PR contains two related changes: a refactor of how `NodeDefinition` and `GraphDefinition` interact, and a new `NodeDefinition` called `PercentileCluster` that allows us to summarize pulses into clusters using percentiles.

About the refactor:
Following this change, `NodeDefinition` now also has to define the column names of the output it produces. This is done by implementing the `_define_output_feature_names` function, which receives the input column names as a list of strings and must return a list of strings representing the column names of the output of the `NodeDefinition`. When `NodeDefinition` is used together with `GraphDefinition`, which is the intended usage, the input feature names that `GraphDefinition` is instantiated with are passed to `NodeDefinition` via the new public method `set_output_feature_names(self, input_feature_names: List[str]) -> None`. This method calls `_define_output_feature_names` and stores the result as a member variable of the `NodeDefinition`. As a result, we don't have to pass the input feature names to multiple sub-modules of `GraphDefinition`. If one wants to use `NodeDefinition` outside of `GraphDefinition`, one must pass the input column names as an argument, and I've added checks and (hopefully) informative error messages to this end.

About `PercentileCluster`:
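The refactored interface described above can be sketched roughly as follows. Only the two method names `set_output_feature_names` and `_define_output_feature_names` come from this PR; the default implementation and everything else here is hypothetical:

```python
from typing import List


class NodeDefinition:
    """Hypothetical skeleton of the refactored class (not the PR's code)."""

    def set_output_feature_names(self, input_feature_names: List[str]) -> None:
        # Public entry point: called by GraphDefinition with the input
        # feature names it was instantiated with, or manually by the user
        # when the NodeDefinition is used stand-alone.
        self._output_feature_names = self._define_output_feature_names(
            input_feature_names
        )

    def _define_output_feature_names(
        self, input_feature_names: List[str]
    ) -> List[str]:
        # Subclasses override this to declare the columns they output.
        # Identity mapping used here purely for illustration.
        return input_feature_names


node = NodeDefinition()
node.set_output_feature_names(["x", "y", "z", "t"])
print(node._output_feature_names)  # ['x', 'y', 'z', 't']
```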
This new `NodeDefinition` is used to identify duplicate rows in the raw input pulses and summarize the duplicates using percentiles. One use case is to summarize pulses to DOM level, i.e. defining the clusters on unique values of XYZ, which would mean that pulse information such as time and charge is summarized with percentiles. However, the code is much more generic than this, allowing for different definitions of what a cluster is, e.g. strings.
I spent quite some time making this `PercentileCluster` fast, and in my experiments with different methodologies I found the following procedure to be the fastest: Given an event with `n` pulses, the unique combinations of the column values specified by `cluster_on` are found (`cluster_on=['x', 'y', 'z']` would mean each cluster is a DOM). The maximum number of duplicates is also found (again, if each cluster is a DOM, a duplicate corresponds to a same-DOM pulse). Suppose the number of clusters is `c` and the maximum number of duplicates found for a single cluster is `d`. Then, for each variable that must be summarized with percentiles (e.g. time, charge), an array of shape `[c, d]` is constructed, where sequences shorter than `d` are padded with `np.nan`. The corresponding percentiles of this variable, for each cluster, can then be calculated with a single numpy call: `np.nanpercentile(array, percentiles, axis=1)`.

The final result output by `PercentileCluster` is `[c, len(cluster_on) + n_features_for_summary*len(percentiles) + 1]`-dimensional, where the `+1` corresponds to the log10 of the number of duplicates found for each cluster; it can be turned on/off using the `add_counts` argument to `PercentileClusters`. A summary of execution time per event vs. event length is shown below.

The values shown above are averages across 5 repetitions for each event. As you can see, even for rather large events the execution time stays well under 1 s.
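The padding-plus-`nanpercentile` step described above can be illustrated with a toy example. This is a sketch of the technique, not the PR's implementation; the data and variable names are made up:

```python
import numpy as np

# Two clusters (c=2): cluster 0 has 3 pulses, cluster 1 has 1, so d=3.
charge_per_cluster = [np.array([1.0, 2.0, 3.0]), np.array([5.0])]
percentiles = [0, 50, 100]

c = len(charge_per_cluster)
d = max(len(values) for values in charge_per_cluster)

# Build the [c, d] array, padding short clusters with NaN ...
padded = np.full((c, d), np.nan)
for i, values in enumerate(charge_per_cluster):
    padded[i, : len(values)] = values

# ... so percentiles for every cluster come from one vectorised call.
# Transposed to shape [c, len(percentiles)]: one row per cluster.
summary = np.nanpercentile(padded, percentiles, axis=1).T
print(summary)  # rows: [1., 2., 3.] for cluster 0, [5., 5., 5.] for cluster 1
```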
Here's an example of syntax:
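The original snippet did not survive extraction here; the following is a plausible reconstruction based on the arguments described in this PR. The import path and exact constructor signature are assumptions, so check them against the merged code:

```python
# Hypothetical usage sketch; import path and signature may differ.
from graphnet.models.graphs.nodes import PercentileClusters

node_definition = PercentileClusters(
    cluster_on=["x", "y", "z"],        # each unique XYZ position is one cluster
    percentiles=[0, 10, 50, 90, 100],  # summarize remaining features at these percentiles
    add_counts=True,                   # append log10(pulse count) per cluster
)
```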
This configuration will cluster the pulses on xyz and calculate the 0th, 10th, 50th, 90th and 100th percentiles of each variable not specified in `cluster_on` but available as an input feature. In the case of the `Prometheus` test set, this would be just the arrival time `t`.

I have also added a unit test that checks that the percentiles output by this new definition correspond to what one would expect from a more traditional method.