-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clustering utilities #770
Clustering utilities #770
Conversation
…t into clustering_utilities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very elegant solution @Aske-Rosted! I really like the reusability of this method, and I think the code is quite readable - well done!
Also great to see the speed benchmarks. It looks like the new clustering logic is quite a bit faster than the current version. This is great!
I added a few minor comments
) | ||
array = cluster_class.add_counts() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the resulting node definition will always have counts added as a node feature - but in the arguments to PercentileCluster we allow the user to specify if they want this. The new implementation should respect this.
Would recommend a simple change such as:
if self._add_counts:
array = cluster_class.add_counts()
src/graphnet/models/graphs/utils.py
Outdated
@@ -172,6 +179,251 @@ def cluster_summarize_with_percentiles( | |||
return array | |||
|
|||
|
|||
class cluster_and_pad: | |||
"""Cluster and pad the data for further summarization.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets expand the doc string here to make it clear that the class stores a clustered tensor that is manipulated when the public methods are called. Here's a suggestion:
"""
Cluster the input data according to a set of specified columns and
compute aggregate statistics of cluster attributes.
The method will cluster the input data according to unique combinations of values in
the specified `cluster_columns` and store the clustered tensor. The clustered tensor can
be manipulated further by calling public methods that add statistical quantities to the
tensor of the values in each cluster. The order in which the public methods are called
determines the order of manipulations.
Example:
# Build cluster of single event on first three columns
cluster_class = cluster_and_pad(x = single_event_as_array, cluster_columns = [0,1,2])
# Summarize columns 3 and 4 from original array on each cluster with 10th, 50th and 90th percentiles
clustered_array_with_percentile_summary = cluster_class.percentile_summarize(summarization_indices =[3,4], percentiles = [10, 50, 90])
# Add the std of the 5th column to the array above
clustered_array_with_percentile_summary_and_std = cluster_class.add_std(column = 5)
"""
""" | ||
if location is None: | ||
self.clustered_x = np.column_stack([self.clustered_x, column]) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The location parameter could potentially overwrite information, right? E.g. if column 6 already contains some aggregate statistics and the user specifies location=6
. Maybe a warning/error here would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The location feature is added using insert which does not remove information but instead expands the vector. The only possible confusion that could come from this is that the final ordering of the features might not be as the user expects as since every feature after the insert will have their location incremented by 1.
src/graphnet/models/graphs/utils.py
Outdated
"This function is deprecated and will be removed,", | ||
"use the class cluster_and_pad with add_percentile_summary", | ||
"instead for the same functionality", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's good to warn the user that this function is deprecated, this will print for every call (lots of printing). I am not sure what a good solution is here. Maybe it can just be removed immediately since the nodes no longer use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! I missed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the warning but kept the TODO and function for now. If we all agree to remove the function then I am not against it but just prefer to air on the side of caution.
Idea for improving Handling of Summarization Features in the Clustering nodesI’ve been thinking about how we can make the clustering (especially the upcoming new class ClusterFeature(Model):
"""Base class for summarizing feature series per cluster."""
@abstractmethod
def summarisation_feature_label(self):
"""Returns the label for the summarization feature."""
@abstractmethod
def add_feature(self):
"""Calculates the summarization feature.""" For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, .... Then, the new features = [
Percentiles(percentiles=[10, 30, 50, 70]),
Totals(),
CumulativeAfterT(times=[10, 50, 40]),
TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
Std(),
TimeOfFirstPulse(),
AddCounts(),
] Why This Might Be Useful:
Potential Concerns:
Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way. I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through. |
Thanks for the input. After having thought about the suggestions for a while I believe that the implementation with the cluster and pad class and then functions in order to add extra information is the way to go, my main arguments for this is the following
Hope this is somewhat agreeable to you, and again thanks for your input. 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice @Aske-Rosted! Thank you for the great discussion @sevmag @pweigel
This PR add utility functions which will enable DOM-level clustering as described in #752. This PR does not include a NodesASDOM's definition but enables the creation of one. The new utilities completely contain the functionality of
cluster_summarize_with_percentiles
and as such a warning of this function being deprecated and a "TODO" of the removal of said function has been added so it can be removed some time in the future.The
PercentileClusters
node defintion has been changed to use the new utility class/function, and this node definition is tested in thegraphnet/tests/models/test_node_definition.py
test_percentile_cluster()
test.I plan to soon follow up with a new node definition but thought it would be better to not grow the PR too large.
I created some test to ensure the performance of the new utilities and gathered clustering times as seen below
With the tensor being some pseudo pulse data. The plots below show that the new utilities are about a factor 2 faster than the old implementation.