Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clustering utilities #770

Merged
merged 19 commits into from
Dec 11, 2024
Merged

Conversation

Aske-Rosted
Copy link
Collaborator

This PR add utility functions which will enable DOM-level clustering as described in #752. This PR does not include a NodesASDOM's definition but enables the creation of one. The new utilities completely contain the functionality of cluster_summarize_with_percentiles and as such a warning of this function being deprecated and a "TODO" of the removal of said function has been added so it can be removed some time in the future.

The PercentileClusters node defintion has been changed to use the new utility class/function, and this node definition is tested in the graphnet/tests/models/test_node_definition.py test_percentile_cluster() test.

I plan to soon follow up with a new node definition but thought it would be better to not grow the PR too large.

I created some test to ensure the performance of the new utilities and gathered clustering times as seen below

    ts = time()
    test = cluster_summarize_with_percentiles(tensors, [3,4], [0,1,2], percentiles, add_counts=True)
    old_times.append(time()-ts)

    ts = time()
    cluster_class = cluster_and_pad(tensors,[0,1,2])
    cluster_class.percentile_summarize([3,4],percentiles)
    cluster_class.add_counts(cluster_class.clustered_x.shape[1])
    new_times.append(time()-ts)

    ts = time()
    cluster_class_charge = cluster_and_pad(tensors,[0,1,2])
    cluster_class_charge.add_charge_threshold_summary([3,4],percentiles,4)
    cluster_class_charge.add_counts(cluster_class_charge.clustered_x.shape[1])
    charge_cluster_times.append(time()-ts)

With the tensor being some pseudo pulse data. The plots below show that the new utilities are about a factor 2 faster than the old implementation.
cluster_time_ratio
cluster_time_scatter
cluster_time_hist

Copy link
Collaborator

@RasmusOrsoe RasmusOrsoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very elegant solution @Aske-Rosted! I really like the reusability of this method, and I think the code is quite readable - well done!

Also great to see the speed benchmarks. It looks like the new clustering logic is quite a bit faster than the current version. This is great!

I added a few minor comments

)
array = cluster_class.add_counts()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the resulting node definition will always have counts added as a node feature - but in the arguments to PercentileCluster we allow the user to specify if they want this. The new implementation should respect this.

Would recommend a simple change such as:

if self._add_counts:
    array = cluster_class.add_counts()

@@ -172,6 +179,251 @@ def cluster_summarize_with_percentiles(
return array


class cluster_and_pad:
"""Cluster and pad the data for further summarization."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets expand the doc string here to make it clear that the class stores a clustered tensor that is manipulated when the public methods are called. Here's a suggestion:

"""
Cluster the input data according to a set of specified columns and 
compute aggregate statistics of cluster attributes.

The method will cluster the input data according to unique combinations of values in 
the specified `cluster_columns` and store the clustered tensor. The clustered tensor can
be manipulated further by calling public methods that add statistical quantities to the 
tensor of the values in each cluster. The order in which the public methods are called
determines the order of manipulations. 

Example:
# Build cluster of single event on first three columns
cluster_class = cluster_and_pad(x = single_event_as_array, cluster_columns = [0,1,2])

# Summarize columns 3 and 4 from original array on each cluster with 10th, 50th and 90th percentiles
clustered_array_with_percentile_summary = cluster_class.percentile_summarize(summarization_indices =[3,4], percentiles = [10, 50, 90])
# Add the std of the 5th column to the array above
clustered_array_with_percentile_summary_and_std = cluster_class.add_std(column = 5)
"""

"""
if location is None:
self.clustered_x = np.column_stack([self.clustered_x, column])
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location parameter could potentially overwrite information, right? E.g. if column 6 already contains some aggregate statistics and the user specifies location=6. Maybe a warning/error here would be useful?

Copy link
Collaborator Author

@Aske-Rosted Aske-Rosted Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location feature is added using insert which does not remove information but instead expands the vector. The only possible confusion that could come from this is that the final ordering of the features might not be as the user expects as since every feature after the insert will have their location incremented by 1.

"This function is deprecated and will be removed,",
"use the class cluster_and_pad with add_percentile_summary",
"instead for the same functionality",
)
Copy link
Collaborator

@pweigel pweigel Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's good to warn the user that this function is deprecated, this will print for every call (lots of printing). I am not sure what a good solution is here. Maybe it can just be removed immediately since the nodes no longer use it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! I missed that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the warning but kept the TODO and function for now. If we all agree to remove the function then I am not against it but just prefer to air on the side of caution.

@sevmag
Copy link

sevmag commented Nov 25, 2024

Idea for improving Handling of Summarization Features in the Clustering nodes

I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefiniton class lets call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:

class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""

For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, ....

Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:

features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]

Why This Might Be Useful:

  1. Cleaner Constructor: Instead of passing a ton of individual arguments to __init__(), we’d just pass a list of ClusterFeature instances. This will make the Cluster class constructor much cleaner and easier to work with.

  2. Modular and Flexible: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the Cluster class.

  3. Easier Subsetting: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.

  4. Consistency: By looping through the ClusterFeature list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes with mismatched ordering in the _get_indices_and_feature_names and _construct_nodes.

Potential Concerns:

  • Speed: There might be a performance hit because of the extra abstraction but I don't have a feeling for that
  • Configuration: We’ll want to make sure this change doesn’t mess with any of the config file stuff or other initialization logic. (I am not too familiar of how the yaml files are being used)

Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.

I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through.

@Aske-Rosted
Copy link
Collaborator Author

Idea for improving Handling of Summarization Features in the Clustering nodes

I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefiniton class lets call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:

class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""

For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, ....

Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:

features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]

Why This Might Be Useful:

1. **Cleaner Constructor**: Instead of passing a ton of individual arguments to `__init__()`, we’d just pass a list of `ClusterFeature` instances. This will make the `Cluster` class constructor much cleaner and easier to work with.

2. **Modular and Flexible**: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the `Cluster` class.

3. **Easier Subsetting**: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.

4. **Consistency**: By looping through the `ClusterFeature` list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes with mismatched ordering in the `_get_indices_and_feature_names` and `_construct_nodes`.

Potential Concerns:

* **Speed**: There might be a performance hit because of the extra abstraction but I don't have a feeling for that

* **Configuration**: We’ll want to make sure this change doesn’t mess with any of the config file stuff or other initialization logic. (I am not too familiar of how the yaml files are being used)

Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.

I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through.

Thanks for the input.

After having thought about the suggestions for a while I believe that the implementation with the cluster and pad class and then functions in order to add extra information is the way to go, my main arguments for this is the following

  1. Feeding instances of classes to the node definition would mean that we would have to add some translation from/to the config files in order to accommodate this.
  2. I do not see a good way to share the clustering and padding information between the now separate classes, recreating this clustering and padding would basically linearly increase the over head of each of the added summarization features, which is part of what the cluster_and_pad class implementation tries to solve.
  3. While this way definitely makes it easier for people make their own clustering classes which they can use in their private repos, I would worry that this might lead people to not uploading their contributions to the clustering features as these classes can be completely separate from the rest of the library.
  4. for your point regarding consistency I believe that an automatic calculation of the final output features can also be implemented in the current solution, and I will try to add this.

Hope this is somewhat agreeable to you, and again thanks for your input. 😄

@RasmusOrsoe RasmusOrsoe self-requested a review December 10, 2024 08:24
Copy link
Collaborator

@RasmusOrsoe RasmusOrsoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice @Aske-Rosted! Thank you for the great discussion @sevmag @pweigel

@Aske-Rosted Aske-Rosted merged commit 92b150a into graphnet-team:main Dec 11, 2024
13 of 14 checks passed
@Aske-Rosted Aske-Rosted deleted the clustering_utilities branch December 11, 2024 01:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants