
treeCoreset

An implementation in modern C++ of the StreamKM++ algorithm (Ackermann, Marcel R., et al. "StreamKM++: A clustering algorithm for data streams." Journal of Experimental Algorithmics (JEA) 17 (2012): 2.1). The main program launches a server, to which points can be sent arbitrarily from other processes. Representatives and centroids can be queried from the server via simple commands. Communication relies on the ØMQ library for minimal overhead. The server estimates the overall RAM footprint of the user's input and, if the specified RAM constraint cannot be satisfied, exits and suggests modifications to the input parameters. Note that the constraint does not consider virtual memory and is therefore much stricter than what may effectively be available on your machine.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project


(back to top)

Built With

(back to top)

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Eigen, Boost and ØMQ are required for this project to work. Eigen is already included as a submodule.

Installation

  1. Install the prerequisite libraries:
     sudo apt-get install -y build-essential g++ python-dev autotools-dev libicu-dev libbz2-dev lcov libcppunit-dev software-properties-common
  2. Clone the repo:
     git clone https://github.com/Shrecki/treeCoreset.git
  3. Initialize the submodules:
     git submodule init
     git submodule update
  4. If your CMake version is outdated, install a newer version, e.g.:
     wget https://github.com/Kitware/CMake/releases/download/v3.22.5/cmake-3.22.5.tar.gz
     tar -zxvf cmake-3.22.5.tar.gz
     cd cmake-3.22.5
     sudo ./bootstrap
     sudo make
     sudo make install
     cmake --version
  5. Install ØMQ, either from source or through apt:
     sudo apt-get install -y libzmq3-dev
  6. Install Boost, either through apt:
     sudo apt-get install -y libboost-all-dev
     or from source:
     wget -O boost_1_79_0.tar.gz https://boostorg.jfrog.io/artifactory/main/release/1.79.0/source/boost_1_79_0.tar.gz
     tar xzvf boost_1_79_0.tar.gz
     cd boost_1_79_0/
     ./bootstrap.sh --prefix=/usr/
     ./b2
     sudo ./b2 install
  7. Finally, build the project:
     mkdir cmake-build-debug-coverage
     cd cmake-build-debug-coverage
     cmake ../ -DCODE_COVERAGE=ON
     cmake --build . --target unit_test --config Release -- -j

Note that in CMakeLists.txt, -march=native is used. If you're on an unsupported platform, you can remove this flag or, better yet, replace it with the flag corresponding to your CPU architecture so that Eigen can deliver maximum performance.

(back to top)

Usage

Direct usage method

There are two main ways to use this program. The first is to directly initialize the coreset object and send it points. This is the recommended way, as it is faster. There are three main steps:

  • Initialize the coreset object by specifying the number of buckets and the number of representatives. The number of buckets is given by ceil(log2(n/n_representatives)), where n is the total number of points you expect to pass to the algorithm (i.e. the number of samples in your original dataset). For decent performance, the number of representatives should be at least 200*k, where k is the number of centroids to find later.
  • Send the points to the coreset object
  • Recover the representatives

These three steps can be done as follows:

# Makes the client_coreset.so shared library visible to Python by adding its build directory to the path
import sys
sys.path.insert(1, 'treeCoreset/cmake-build-debug-coverage')

import numpy as np

# Import the shared library
import client_coreset

k = 3           # For the sake of the example, we let k, the number of clusters, be 3
n_reps = 200*k  # At least 200*k representatives for decent performance
n = X.shape[0]  # Total number of points you expect to pass, X being your (n_samples, n_features) data array
n_buckets = int(np.ceil(np.log2(n / n_reps)))

# Initialize the coreset object
cluster_rep = client_coreset.ClusteredPoints(n_buckets, n_reps)

Assuming your data is a numpy array, you can pass it directly to the cluster object with a single function call. You should also specify how many points are being passed, so that the function can figure out which axis holds the features and which holds the samples:

cluster_rep.insertVectors(X.astype(np.double, order='C').T, X.shape[0])

Lastly, representatives can be recovered as simply as:

reps = cluster_rep.getRepresentatives()

A complete Python example

To make the purpose of this library clear, let's walk through a minimal example. Assume the following code and data:

import numpy as np
import matplotlib.pyplot as plt

n_per_cluster = 10000
X1 = np.random.multivariate_normal([0, 0],[[1, 0], [0, 10]], n_per_cluster)
X2 = np.random.multivariate_normal([10, 0],[[1, 0], [0, 10]], n_per_cluster)
X3 = np.random.multivariate_normal([2, 50],[[1, 0], [0, 10]], n_per_cluster)
X4 = np.random.multivariate_normal([10, 30],[[1, 0], [0, 10]], n_per_cluster)


plt.scatter(X1[:,0], X1[:,1], label='X1')
plt.scatter(X2[:,0], X2[:,1], label='X2')
plt.scatter(X3[:,0], X3[:,1], label= 'X3')
plt.scatter(X4[:,0], X4[:,1], label= 'X4')

plt.legend()

X = np.concatenate((X1, X2, X3, X4))

In effect, we've created four clusters here, so we will perform clustering with k=4. Let's now extract the representatives:

k = 4
n_reps = 200*k
n_buckets = int(np.ceil(np.log2(n_per_cluster*4 / n_reps)))  # n = 4*n_per_cluster points in total

# Initialize and insert all points in the clusterer
cluster_rep = client_coreset.ClusteredPoints(n_buckets, n_reps)
cluster_rep.insertVectors(X.astype(np.double, order='C'), X.shape[0])

# Recover the representatives
reps = cluster_rep.getRepresentatives()

# Let's visualize the representatives
plt.scatter(X1[:,0], X1[:,1])
plt.scatter(X2[:,0], X2[:,1])
plt.scatter(X3[:,0], X3[:,1])
plt.scatter(X4[:,0], X4[:,1])

plt.scatter(reps[:,0], reps[:, 1], label='Extracted representatives')
plt.legend()

Now, we will apply KMeans and compare the results when computing the centroids on the entire dataset and on the representatives only:

from sklearn.cluster import KMeans

def perform_clustering_and_plot(data, reps, use_reps, title, k):
    kmeans = KMeans(n_clusters=k)
    if use_reps:
        # Fit on the representatives only, then assign every original point to the nearest centroid
        kmeans.fit(reps)
        assignments = kmeans.predict(data)
    else:
        # Fit on the full dataset
        kmeans.fit(data)
        assignments = kmeans.labels_
    clusters = [data[assignments == label, :] for label in np.unique(assignments)]

    plt.figure()  # One figure per comparison
    for c in clusters:
        plt.scatter(c[:, 0], c[:, 1])

    plt.title(title)
    return assignments

_ = perform_clustering_and_plot(X, reps, False, 'Ground truth clustering using all points', 4)
_ = perform_clustering_and_plot(X, reps, True, 'Clustering using {} representatives ({} % of points)'.format(n_reps, 100*n_reps/(n_per_cluster*4)), 4)

One can see that the extracted clusters are indeed well matched. This illustrates the advantage of the method: k-means is applied to the representatives only, which greatly reduces the number of points to consider (here, 800 representatives stand in for 40,000 points, i.e. 2% of the data). Notice that the number of representatives is independent of the number of samples in the original data: it depends only on the number of centroids to be found.
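As a rough way to see the speed-up for yourself, you can time KMeans on the full dataset versus on the representatives. This is a minimal sketch reusing X and reps from the example above; actual timings depend on your machine and data:

import time
from sklearn.cluster import KMeans

def time_kmeans(points, k):
    # Fit KMeans once and return the elapsed wall-clock time in seconds
    start = time.perf_counter()
    KMeans(n_clusters=k).fit(points)
    return time.perf_counter() - start

print("KMeans on all points:      {:.3f} s".format(time_kmeans(X, 4)))
print("KMeans on representatives: {:.3f} s".format(time_kmeans(reps, 4)))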

Remote server method

A MATLAB binding is also available to pass data between the coreset object and MATLAB. In such a case, the MATLAB program acts as a client and the coreset object as a server.

The server expecting the stream can be started by running the main program:

./treeCoreset

For the complete set of instructions, type

./treeCoreset --help

Once the server is launched, any other process can send points to it as part of a stream, request representatives or centroids, or even stop the server. For convenience, client-side MATLAB functions are already provided.

Should you wish to implement communication in an unsupported language, here are the relevant details to consider (a minimal client sketch follows the list below). For any transmission to happen between client and server, the client must fulfill the following conditions:

  1. Be the only client connected to the server (any new connection will be refused by the server while a client is still attached to the socket).
  2. Connect to the ipc socket specified by the user, using ØMQ's API. By default, this socket is ipc:///tmp/sock-0. This connection can only be of type ZMQ_PAIR.
  3. Send an appropriate request to the server.
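As an illustration, here is a minimal client sketch in Python using pyzmq and the default socket path. The request payload below is a hypothetical placeholder: the actual command strings accepted by the server are those listed by ./treeCoreset --help.

import zmq

# Connect to the server's ipc socket as its single ZMQ_PAIR peer
context = zmq.Context()
socket = context.socket(zmq.PAIR)
socket.connect("ipc:///tmp/sock-0")  # default socket; adjust if the server was started with another one

# Hypothetical placeholder request: use the actual commands documented by ./treeCoreset --help
socket.send(b"<request>")
reply = socket.recv()
print(reply)

socket.close()
context.term()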

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the GNU License. See LICENSE.txt for more information.

(back to top)

Contact

Fabrice Guibert - @twitter_handle - [email protected]

Project Link: https://github.com/Shrecki/treeCoreset

(back to top)

Acknowledgments

(back to top)
