PercentileCluster #616
It makes me a little uneasy that in the forward call the function variable `node_feature_names` might differ from the class-instantiated `self._node_feature_names` after the `_node_definition` call on line 147. While I do believe this is as intended, it might be quite confusing when revisiting the code later, so maybe consider a renaming.
update branch
@Aske-Rosted I've gone through the code and renamed the variables. Now, the input to |
Looks good to me. I do not have any other comments, although I am curious about the benefits of using numpy `lexsort` over other methods.
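(Not part of the PR itself, but for readers with the same question: a toy illustration of what an `np.lexsort`-based grouping buys. Sorting rows lexicographically in one vectorised call makes duplicate rows contiguous, so cluster boundaries fall out of a single adjacent-row comparison, with no Python-level hashing or loops. The data below is made up.)

```python
import numpy as np

# Toy pulse array: columns are x, y, z (the values we would cluster on).
pulses = np.array([
    [0.0, 1.0, 2.0],
    [5.0, 5.0, 5.0],
    [0.0, 1.0, 2.0],
    [5.0, 5.0, 5.0],
])

# np.lexsort sorts by the LAST key first, so passing the columns in
# reverse order sorts rows lexicographically by (x, y, z); duplicate
# rows become adjacent and each cluster is a contiguous slice.
order = np.lexsort(pulses.T[::-1])
sorted_pulses = pulses[order]

# A single adjacent-row comparison then finds cluster boundaries.
is_new_cluster = np.any(np.diff(sorted_pulses, axis=0) != 0, axis=1)
n_clusters = 1 + int(is_new_cluster.sum())
print(n_clusters)  # 2
```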
PercentileCluster
This PR contains two related changes: a refactor of how `NodeDefinition` and `GraphDefinition` interact, and a new `NodeDefinition` called `PercentileCluster` that allows us to summarize pulses into clusters using percentiles.

About the refactor:
Following this change, `NodeDefinition` now also has to define the column names of the output it produces. This is done by implementing the `_define_output_feature_names` function, which receives the input column names as a list of strings and must return a list of strings representing the column names of the output of the `NodeDefinition`. When `NodeDefinition` is used together with `GraphDefinition`, which is the intended usage, the input feature names that `GraphDefinition` is instantiated with are passed to `NodeDefinition` via the new public method `set_output_feature_names(self, input_feature_names: List[str]) -> None`. This method calls `_define_output_feature_names` and stores the result as a member variable of the `NodeDefinition`. As a result, we don't have to pass the input feature names to multiple sub-modules of `GraphDefinition`. If one wants to use `NodeDefinition` outside of `GraphDefinition`, one must pass the input column names as an argument, and I've added checks and (hopefully) informative error messages to this end.

About `PercentileCluster`:
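The refactored interface described above can be sketched roughly as follows. Only the two method names `set_output_feature_names` and `_define_output_feature_names` come from this PR; the default implementation and everything else here is hypothetical:

```python
from typing import List


class NodeDefinition:
    """Hypothetical skeleton of the refactored class (not the PR's code)."""

    def set_output_feature_names(self, input_feature_names: List[str]) -> None:
        # Public entry point: called by GraphDefinition with the input
        # feature names it was instantiated with, or manually by the user
        # when the NodeDefinition is used stand-alone.
        self._output_feature_names = self._define_output_feature_names(
            input_feature_names
        )

    def _define_output_feature_names(
        self, input_feature_names: List[str]
    ) -> List[str]:
        # Subclasses override this to declare the columns they output.
        # Identity mapping used here purely for illustration.
        return input_feature_names


node = NodeDefinition()
node.set_output_feature_names(["x", "y", "z", "t"])
print(node._output_feature_names)  # ['x', 'y', 'z', 't']
```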
This new `NodeDefinition` is used to identify duplicate rows in the raw input pulses and summarize the duplicates using percentiles. One use case is to summarize pulses to DOM level, i.e. defining the clusters on unique values of XYZ, which would mean that pulse information such as time and charge is summarized with percentiles. However, the code is much more generic than this, allowing for different definitions of what a cluster is, e.g. strings.
I spent quite some time making this `PercentileCluster` fast, and in my experiments with different methodologies I found the following procedure to be the fastest: Given an event with `n` pulses, the unique combinations of the column values specified by `cluster_on` are found (`cluster_on=['x', 'y', 'z']` would mean each cluster is a DOM). The maximum number of duplicates is also found (again, if each cluster is a DOM, a duplicate corresponds to a same-DOM pulse). Suppose the number of clusters is `c` and the maximum number of duplicates found for a single cluster is `d`. Then, for each variable that must be summarized with percentiles (e.g. time, charge), an array of shape `[c, d]` is constructed, where sequences shorter than `d` are padded with `np.nan`. The corresponding percentiles of this variable, for each cluster, can then be calculated with a single numpy call: `np.nanpercentile(array, percentiles, axis=1)`.

The final result output by `PercentileCluster` is `[c, len(cluster_on) + n_features_for_summary*len(percentiles) + 1]`-dimensional, where the `+1` corresponds to the log10 of the number of duplicates found for each cluster; it can be turned on/off using the `add_counts` argument to `PercentileClusters`. A summary of execution time per event vs. event length is shown below.

The values shown above are averages across 5 repetitions for each event. As you can see, even for rather large events the execution time stays well under 1 s.
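The padding-plus-`nanpercentile` step described above can be illustrated with a toy example. This is a sketch of the technique, not the PR's implementation; the data and variable names are made up:

```python
import numpy as np

# Two clusters (c=2): cluster 0 has 3 pulses, cluster 1 has 1, so d=3.
charge_per_cluster = [np.array([1.0, 2.0, 3.0]), np.array([5.0])]
percentiles = [0, 50, 100]

c = len(charge_per_cluster)
d = max(len(values) for values in charge_per_cluster)

# Build the [c, d] array, padding short clusters with NaN ...
padded = np.full((c, d), np.nan)
for i, values in enumerate(charge_per_cluster):
    padded[i, : len(values)] = values

# ... so percentiles for every cluster come from one vectorised call.
# Transposed to shape [c, len(percentiles)]: one row per cluster.
summary = np.nanpercentile(padded, percentiles, axis=1).T
print(summary)  # rows: [1., 2., 3.] for cluster 0, [5., 5., 5.] for cluster 1
```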
Here's an example of syntax:
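The original snippet did not survive extraction here; the following is a plausible reconstruction based on the arguments described in this PR. The import path and exact constructor signature are assumptions, so check them against the merged code:

```python
# Hypothetical usage sketch; import path and signature may differ.
from graphnet.models.graphs.nodes import PercentileClusters

node_definition = PercentileClusters(
    cluster_on=["x", "y", "z"],        # each unique XYZ position is one cluster
    percentiles=[0, 10, 50, 90, 100],  # summarize remaining features at these percentiles
    add_counts=True,                   # append log10(pulse count) per cluster
)
```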
This configuration will cluster the pulses on xyz and calculate the 0th, 10th, 50th, 90th and 100th percentiles of each variable not specified in `cluster_on` but available as an input feature. In the case of the `Prometheus` test set, this would be just the arrival time `t`.

I have also added a unit test that checks that the percentiles output by this new definition correspond to what one would expect from a more traditional method.