PercentileCluster #616

Merged
RasmusOrsoe merged 22 commits into graphnet-team:main on Oct 19, 2023

Conversation

RasmusOrsoe
Collaborator

This PR contains two related changes:

  1. A slight refactor of the way in which NodeDefinition and GraphDefinition interact.
  2. Introduction of a new NodeDefinition called PercentileCluster that allows us to summarize pulses into clusters using percentiles.

About the refactor:

Following this change, a NodeDefinition now also has to define the column names of the output it produces. This is done by implementing the _define_output_feature_names function, which receives the input column names as a list of strings and must return a list of strings representing the column names of the NodeDefinition's output. When NodeDefinition is used together with GraphDefinition, which is the intended usage, the input feature names that GraphDefinition is instantiated with are passed to NodeDefinition via the new public method set_output_feature_names(self, input_feature_names: List[str]) -> None. This method calls _define_output_feature_names and stores the result as a member variable of the NodeDefinition, which means we don't have to pass the input feature names to multiple sub-modules of GraphDefinition. If one wants to use NodeDefinition outside of GraphDefinition, one must pass the input column names as an argument; I've added checks and (hopefully) informative error messages to this end.
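
To make this contract concrete, here is a minimal, self-contained sketch. The NodeDefinition class below is a stand-in rather than the actual graphnet base class, and the IdentityNodes subclass is purely hypothetical; only the two method names, _define_output_feature_names and set_output_feature_names, follow this PR.

from typing import List, Optional


class NodeDefinition:
    # Stand-in base class illustrating the output-feature-name contract from this PR.

    def __init__(self, input_feature_names: Optional[List[str]] = None) -> None:
        self._output_feature_names: Optional[List[str]] = None
        if input_feature_names is not None:
            # Standalone use: the caller supplies the input column names directly.
            self.set_output_feature_names(input_feature_names)

    def set_output_feature_names(self, input_feature_names: List[str]) -> None:
        # GraphDefinition calls this with the input feature names it was built with.
        self._output_feature_names = self._define_output_feature_names(
            input_feature_names
        )

    def _define_output_feature_names(
        self, input_feature_names: List[str]
    ) -> List[str]:
        # Subclasses declare the column names of the node features they produce.
        raise NotImplementedError


class IdentityNodes(NodeDefinition):
    # Hypothetical subclass whose output columns equal its input columns.

    def _define_output_feature_names(
        self, input_feature_names: List[str]
    ) -> List[str]:
        return list(input_feature_names)


nodes = IdentityNodes(input_feature_names=["dom_x", "dom_y", "dom_z", "dom_time"])
print(nodes._output_feature_names)  # ['dom_x', 'dom_y', 'dom_z', 'dom_time']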

About PercentileCluster
This new NodeDefinition is used to identify duplicate rows in the raw input pulses and summarize the duplicates using percentiles. One use-case is to summarize pulses to DOM level, i.e. defining a cluster as a unique combination of xyz values, which means that pulse information such as time and charge is summarized with percentiles. However, the code is much more generic than this, allowing for different definitions of "what a cluster is", e.g. strings.

I spent quite some time making PercentileCluster fast, and in my experiments with different methodologies I found the following procedure to be the fastest:

Given an event with n pulses, the unique combinations of the column values specified by cluster_on are found (cluster_on=['x', 'y', 'z'] would mean each cluster is a DOM). The maximum number of duplicates is also found (again, if each cluster is a DOM, a duplicate corresponds to a same-DOM pulse). Suppose the number of clusters is c and the maximum number of duplicates found for a single cluster is d. Then, for each variable that must be summarized with percentiles (e.g. time, charge), an array of shape [c, d] is constructed, where sequences shorter than d are padded with np.nan. The corresponding percentiles of this variable, for each cluster, can then be calculated with a single numpy call, np.nanpercentile(array, percentiles, axis=1).
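
For illustration, here is a rough NumPy sketch of this procedure for a single summarized variable. It is not the PercentileCluster implementation, just the padded-array idea described above, and the function name and signature are made up.

import numpy as np


def summarize_with_percentiles(
    pulses: np.ndarray,   # shape [n, n_features]
    cluster_cols: list,   # column indices that define a cluster (cluster_on)
    value_col: int,       # column to summarize, e.g. time or charge
    percentiles: list,    # e.g. [0, 10, 50, 90, 100]
) -> np.ndarray:
    # 1. Unique cluster keys and an inverse index mapping each pulse to its cluster.
    keys, inverse = np.unique(pulses[:, cluster_cols], axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across numpy versions
    c = len(keys)

    # 2. Maximum number of duplicates d across all clusters.
    counts = np.bincount(inverse, minlength=c)
    d = counts.max()

    # 3. Build a [c, d] array, padding clusters with fewer than d pulses with np.nan.
    padded = np.full((c, d), np.nan)
    order = np.argsort(inverse, kind="stable")               # group pulses by cluster
    within = np.concatenate([np.arange(n) for n in counts])  # position within each cluster
    padded[inverse[order], within] = pulses[order, value_col]

    # 4. A single nanpercentile call over the duplicate axis -> shape [c, len(percentiles)].
    return np.nanpercentile(padded, percentiles, axis=1).T

In PercentileCluster, this kind of summary would be produced for each input feature not listed in cluster_on and concatenated with the cluster keys (plus, optionally, the log10 counts), as described below.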

The final result output by PercentileCluster is [c, len(cluster_on) + n_features_for_summary*len(percentiles) + 1]-dimensional, where the +1 corresponds to the log10 of the number of duplicates found for each cluster; this count column can be turned on/off using the add_counts argument to PercentileClusters. A summary of execution time per event vs. event length is shown below.

[Figure: execution time per event vs. event length]

The values shown above are averages across 5 repetitions for each event. As you can see, even for rather large events the execution time stays well under 1 s.

Here's an example of the syntax:

# Import paths below are assumed from the graphnet package layout around this PR.
from graphnet.models.detector.prometheus import Prometheus
from graphnet.models.graphs import GraphDefinition
from graphnet.models.graphs.nodes import PercentileClusters

node_definition = PercentileClusters(
    cluster_on=['sensor_pos_x', 'sensor_pos_y', 'sensor_pos_z'],
    percentiles=[0, 10, 50, 90, 100],
)
graph_definition = GraphDefinition(
    detector=Prometheus(),
    node_definition=node_definition,
)

This configuration will cluster the pulses on xyz and calculate the 0th, 10th, 50th, 90th and 100th percentiles of each variable that is not specified in cluster_on but is available as an input feature. In the case of the Prometheus test set, this would be just the arrival time t.
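
Concretely, with the three cluster_on columns above, a single summarized feature (t), five percentiles and add_counts enabled, each event is reduced to a [c, 3 + 1*5 + 1] = [c, 9] array, where c is the number of unique xyz positions (DOMs) hit in that event.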

I have also added a unit test that checks that the percentiles outputted by this new definition correspond to what one would expect from a more traditional method.
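
For illustration only (this is not the actual unit test added in the PR), a cross-check of that kind could look as follows, using a pandas groupby as the "traditional" reference; the column names and numbers are made up.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"dom": rng.integers(0, 5, 200), "t": rng.random(200)})
q = [0.0, 0.1, 0.5, 0.9, 1.0]

# "Traditional" reference: groupby + per-group quantiles.
reference = df.groupby("dom")["t"].quantile(q).unstack().to_numpy()

# Padded-array approach: one [c, d] array with NaN padding, one nanpercentile call.
keys, inverse = np.unique(df["dom"].to_numpy(), return_inverse=True)
counts = np.bincount(inverse)
padded = np.full((len(keys), counts.max()), np.nan)
order = np.argsort(inverse, kind="stable")
within = np.concatenate([np.arange(n) for n in counts])
padded[inverse[order], within] = df["t"].to_numpy()[order]
fast = np.nanpercentile(padded, [100 * p for p in q], axis=1).T

assert np.allclose(reference, fast)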

Aske-Rosted
Collaborator
It makes me a little uneasy that, in the forward call, the function variable node_feature_names might differ from the instance attribute self._node_feature_names after the _node_definition call on line 147. While I do believe this is intended, it might be quite confusing when revisiting the code later; maybe consider renaming.

@RasmusOrsoe
Collaborator Author

@Aske-Rosted I've gone through the code and renamed the variables. Now, the input to GraphDefinition is referred to as input_features and the output of NodeDefinition is referred to as node_features.

Collaborator

@Aske-Rosted Aske-Rosted left a comment


Looks good to me. I do not have any other comments, although I am curious about the benefits of using numpy lexsort over other methods.

@RasmusOrsoe RasmusOrsoe merged commit 8b9c353 into graphnet-team:main Oct 19, 2023
12 checks passed
RasmusOrsoe added a commit to RasmusOrsoe/graphnet that referenced this pull request Oct 25, 2023
@RasmusOrsoe RasmusOrsoe mentioned this pull request Oct 25, 2023