- Problem statement
- Prepared outputs
- Usage
2.0. Prerequisites
2.1. Python Script
2.2. Jupyter Notebook - Analysis
3.1. General Approach
3.2. Point Coverage Efficacy Evaluation
3.3. Computational Performance
3.4. Conclusion
Given a set A of points in a 2D space (input.csv - provided), select N Representative Points from this set such that they best "cover" the shape of the point set (as opposed to its density).
Output should be the source code and 5 separate output.csv files, where N = 10, 25, 63, 159, 380.
👁️ See the outputs folder.
🥇 hierarchical_clustering contains results generated using the hierarchical clustering approach. It has the best rep-rep/average-distance ratio and performs more consistently across different N values. This is the preferred set of outputs.
🥈 graph_disconnect contains results generated using the graph-disconnect approach. It has the best rep-rep/max-distance ratio on small N values only.
See the Analysis section for more details.
Create a virtual environment & install the required packages
This program requires Python 3.10 or above. Please install the necessary packages listed in requirements.txt:
pip install -r requirements.txt
jupyter==1.0.0
matplotlib==3.6.2
matplotlib-inline==0.1.6
networkx==2.8.8
numpy==1.23.4
scikit-learn==1.1.3
The Python program computes the required points for one N value.
usage: point_coverage.py [-h] [-v] -n NUM_REPS -a ALGORITHM [-r RADIUS] [-b BRANCHING_FACTOR] -o OUTPUT input_file
positional arguments:
input_file
options:
-h, --help show this help message and exit
-v, --verbose When set, print analytics.
-n NUM_REPS, --num-reps NUM_REPS
Number of representative points to pick from the input data. Positive integer.
-a ALGORITHM, --algorithm ALGORITHM
One of 'clustering', 'graph'.
-r RADIUS, --radius RADIUS
If using 'graph' algorithm, select the max radius between connected points.
-b BRANCHING_FACTOR, --branching-factor BRANCHING_FACTOR
If using 'graph' algorithm, select the branching factor for each point in the graph.
-o OUTPUT, --output OUTPUT
Path to the output csv file. (e.g. output.csv)
➡️ Example with clustering algorithm:
python3.10 ./point_coverage.py -v -n 10 -a clustering -o out_c.csv input.csv
➡️ Example with graph algorithm:
python3.10 ./point_coverage.py -v -n 10 -a graph -r 6000 -b 180 -o out_g.csv input.csv
We can use the provided run.sh to produce output files for all N values specified in the handout:
bash ./run.sh <algorithm> <input-file>
The script will also time the execution for each N value round.
Output files will be generated in the current directory, named output_<algorithm>_<N>.csv.
➡️ Example with clustering algorithm:
./run.sh clustering input.csv
Computing for N = 10
Took 0.39 seconds.
Computing for N = 25
Took 0.41 seconds.
Computing for N = 63
Took 0.39 seconds.
Computing for N = 159
Took 0.39 seconds.
Computing for N = 380
Took 0.39 seconds.
➡️ Example with graph algorithm:
./run.sh graph input.csv
Computing for N = 10
Took 49.95 seconds.
Computing for N = 25
Took 91.06 seconds.
Computing for N = 63
Took 96.84 seconds.
Computing for N = 159
Took 99.31 seconds.
Computing for N = 380
Took 98.65 seconds.
The Jupyter Notebook contains step-by-step computation and analysis, as well as scatter plots to visualize results. Useful if you are interested in the details. The same required packages apply.
You can view and edit Jupyter Notebooks on your own computer using Visual Studio Code. For online editing, feel free to check out Google Colaboratory.
The following three approaches to sampling N representatives from a collection of input points for optimal geometric coverage are evaluated:
- k-Means Clustering
- Hierarchical Clustering
- Graph-Disconnect
Out of the three, only "Hierarchical Clustering" and "Graph-Disconnect" are deemed acceptable, and only "Hierarchical Clustering" performs consistently across many N values.
We will look at the three approaches in detail below.
If you would like to follow along with the analysis, please feel free to check out the Jupyter Notebook, as it provides step-by-step computation. You can even tweak parameters if you download the notebook to your local computer.
Initially, the given problem appeared to me as a clustering problem. In this type of approach, we group the input points into N clusters, find the geometric centroid of each cluster, and pick the point closest to each centroid as a representative.
I have evaluated various clustering algorithms provided by the Scikit-Learn library. We must exclude the algorithms that do not allow specifying the number of clusters in the output, because they are not suitable for the given problem.
This algorithm is perhaps the most well-known unsupervised clustering algorithm, so naturally, it became my first candidate.
It starts by picking N random points as centroids, assigns each input point to its nearest centroid, and recomputes the centroids. This process repeats until the centroids no longer move significantly across iterations.
The KMeans function provided by Scikit-Learn returns the cluster centroids at the end of the algorithm, which are ready to be converted to input data points.
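For illustration, here is a minimal sketch of this step (not the exact point_coverage.py code; the assumption that input.csv holds one "x,y" pair per line is mine):

```python
# Minimal sketch of the k-Means step; assumes input.csv holds one "x,y" pair per line.
import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("input.csv", delimiter=",")   # shape: (num_points, 2)
n_reps = 10

kmeans = KMeans(n_clusters=n_reps, n_init=10, random_state=0).fit(points)
centroids = kmeans.cluster_centers_               # floating-point centroids, one per cluster
```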
This class of clustering algorithms either starts with one cluster containing all input points and divides it repeatedly until N clusters are formed, or starts with each point in its own cluster and merges clusters repeatedly until N final clusters remain. The bottom-up merging variant is called agglomerative clustering.
I found the most success with agglomerative clustering using maximum/complete linkage. Linkage is the criterion by which the algorithm decides which clusters are most "linked" to each other, in order to merge them. Maximum linkage takes the maximum distance between any two points across a pair of clusters, and the pair of clusters that is closest by this maximum distance is merged.
The AgglomerativeClustering function provided by Scikit-Learn does not return a collection of cluster centroids. It instead returns a label for each input data point describing which of the N clusters the point belongs to. As a result, I grouped the input data by cluster label and computed a centroid for each cluster, so that they are ready to be converted to input data points.
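A corresponding minimal sketch of this step (again not the exact script; the input loading is an assumption as before):

```python
# Minimal sketch of the hierarchical (agglomerative) clustering step.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.loadtxt("input.csv", delimiter=",")
n_reps = 10

# Complete/maximum linkage, as described above; returns a cluster label per input point.
labels = AgglomerativeClustering(n_clusters=n_reps, linkage="complete").fit_predict(points)

# No centroids are returned, so compute one centroid per cluster label.
centroids = np.array([points[labels == k].mean(axis=0) for k in range(n_reps)])
```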
I have also evaluated Spectral clustering and BIRCH for the purpose of the given problem, but they have produced underwhelming results, therefore I will not discuss them in more detail.
Rather than using existing clustering algorithms, another approach to this problem can be described in terms of graph splitting, which I will refer to as "Graph-Disconnect".
Consider the following:
- Create a graph containing all input data points as vertices
- Connect all vertices to each other, with the edge weight being their Euclidean distance.
- Repeatedly remove the edge with the largest weight from the graph, until N connected components are left
- Compute the mean centroids for all vertices in each connected component
In order to build and operate on such a graph, I used the NetworkX library.
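A minimal, naive sketch of these steps with NetworkX is shown below (not the actual implementation; as discussed next, this complete-graph version is far too slow for the real input):

```python
# Naive Graph-Disconnect sketch on a complete graph; only practical for small inputs.
import numpy as np
import networkx as nx

points = np.loadtxt("input.csv", delimiter=",")   # assumes one "x,y" pair per line
n_reps = 10

G = nx.Graph()
G.add_nodes_from(range(len(points)))
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        G.add_edge(i, j, weight=float(np.linalg.norm(points[i] - points[j])))

# Remove the heaviest edge until the graph splits into N connected components.
while nx.number_connected_components(G) < n_reps:
    u, v, _ = max(G.edges(data="weight"), key=lambda edge: edge[2])
    G.remove_edge(u, v)

# Mean centroid of the points in each connected component.
centroids = np.array([points[list(comp)].mean(axis=0) for comp in nx.connected_components(G)])
```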
The initial attempt following the steps described above suffered from major performance issues. The number of edges grows quadratically with the number of input data points, and the process of removing edges until N components formed was taking in excess of 10 minutes, with no results.
To optimize performance, I first tried the geometric_edges function provided by the NetworkX library. It connects vertices that are within a certain radius of each other, which reduces the number of edges in the final graph. However, the resulting graph still contained far too many edges, since the input data has dense regions with many points close to one another.
To further optimize performance, I modified the source code of the geometric_edges function to also include a branching-factor constraint. The branching factor is the maximum number of edges incident to a vertex (i.e. the size of its neighborhood). This allows the graph to remain reasonably sized even in dense regions of the input data.
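The sketch below approximates this idea using Scikit-Learn's NearestNeighbors rather than reproducing the modified geometric_edges source; the helper name build_constrained_graph is mine:

```python
# Approximate sketch of building a graph limited by both a radius and a branching factor.
import networkx as nx
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_constrained_graph(points, radius, branching_factor):
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    nn = NearestNeighbors(radius=radius).fit(points)
    distances, indices = nn.radius_neighbors(points, sort_results=True)
    for i, (dists, neighbors) in enumerate(zip(distances, indices)):
        # Drop the point itself, keep only the closest `branching_factor` neighbours in range.
        keep = [(d, int(j)) for d, j in zip(dists, neighbors) if int(j) != i][:branching_factor]
        for dist, j in keep:
            G.add_edge(i, j, weight=float(dist))
    return G
```

The edge-removal loop from the previous sketch then runs on this much smaller graph.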
The final optimization allowed the Graph-Disconnect algorithm to complete successfully within a few minutes. For the given input data, I found that a radius constraint of 6000 and a branching-factor constraint of 180 produce optimal results. However, these constraints will differ depending on the input data, and there is no trivial way to compute them automatically.
The centroids returned from clustering are not part of the input dataset. To form the final output, we must find the input data points closest to these centroids and use them as the points that represent their region ("reps").
Nearest-neighbors algorithms are well suited for this task, as they find the k input points closest to a given set of reference points. The Scikit-Learn library provides such a function. For our purpose, we find the single nearest neighbor in the input data to each computed centroid and return it as a representative.
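A minimal sketch of this step (assuming the points and centroids arrays from the sketches above):

```python
# Map each floating-point centroid back to its nearest input point ("rep").
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=1).fit(points)
_, nearest = nn.kneighbors(centroids)   # index of the closest input point per centroid
reps = points[nearest.ravel()]          # representative points written to the output csv
```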
In the following plots:
- dark gray points are input data
- blue points are floating-point centroids
- red points are input points chosen as representatives, which are closest to the blue centroids
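A minimal sketch of producing such a plot with Matplotlib (assuming the points, centroids, and reps arrays from the sketches above; the notebook's actual plotting code may differ):

```python
import matplotlib.pyplot as plt

plt.scatter(points[:, 0], points[:, 1], s=4, color="dimgray", label="input data")
plt.scatter(centroids[:, 0], centroids[:, 1], s=30, color="blue", label="centroids")
plt.scatter(reps[:, 0], reps[:, 1], s=30, color="red", label="representatives")
plt.legend()
plt.show()
```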
For the Graph-Disconnect algorithm, the radius is chosen to be 6000, and the branching factor is chosen to be 180.
I will discuss 4 ways to evaluate the point coverage for each of the three approaches described above.
- Take all points in the input data, compute their distance to every selected rep point, and take the total sum.
- For optimal point coverage, unselected points should be as close to rep points as possible. The total of such distances is therefore a general measure of how close points are to reps throughout the dataset.
- Small values in this metric are desirable.
- However, this metric favors dense input point distributions, since for dense inputs, more points will be closer to reps compared to sparse inputs.
- For each point not selected as a rep, find its closest rep and compute the distance. Over all these closest-rep distances, take the max, min, and average.
- For optimal point coverage, unselected points should be as close to their rep as possible. This improves upon the naive total distance measure, by focusing on reps that are supposed to cover points in their region.
- Small values in this metric are desirable.
- Take all points chosen as reps, find the minimum distance between any 2 reps.
- For optimal point coverage, the selected reps should be as far from each other as possible, so that each rep can cover as many unselected points as possible.
- Large values in this metric are desirable.
- The ratio of the previous two measures ("3.2.2.3. closest distance between reps" and "3.2.2.2. max|average distance from a point to a rep")
- The greater this ratio, the farther apart the reps are from each other, and the closer the unselected points are to a rep.
- Large values in this ratio are desirable.
- The ratio of min distance between reps to max distance from point to rep (rr/max) is a summary measure of point coverage in the worst case, and favors outlier input points.
- The ratio of min distance between reps to average distance from point to rep (rr/avg) is a summary measure of point coverage in the average case, and should be considered the optimal measure.
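For concreteness, a minimal NumPy sketch of these metrics (assuming the points and reps arrays from the sketches above; not the exact notebook code):

```python
import numpy as np

# Distance from every input point to every rep.
dists = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=-1)

total_distance = dists.sum()        # total distance: smaller is better, but favors dense inputs

closest = dists.min(axis=1)         # distance from each point to its closest rep
closest = closest[closest > 0]      # crude filter: drop the reps themselves
dist_rep_min, dist_rep_max, dist_rep_avg = closest.min(), closest.max(), closest.mean()

rep_dists = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
rep_rep = rep_dists[rep_dists > 0].min()   # closest distance between any two reps: larger is better

rr_max = rep_rep / dist_rep_max     # worst-case coverage ratio
rr_avg = rep_rep / dist_rep_avg     # average-case coverage ratio
```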
| Algorithm | k-Means | Hierarchical | Graph-Disconnect |
|---|---|---|---|
| Total Distance | 29272141.3949 | 34850046.2278 | 37936438.9315 |
| Dist-Rep (Min / Max / Avg) | 7.2111 / 3246.6056 / 525.7065 | 43.8406 / 1976.1905 / 639.4779 | 59.4811 / 2120.6284 / 895.2549 |
| Rep-Rep | 997.0245 | 1411.0978 | 1880.5130 |
| rr/max | 0.3071 | 0.7140 | 0.8868 |
| rr/avg | 1.8965 | 2.2066 | 2.1005 |
| Verdict | Worst | Best Average | Best Max |
As we can see, the k-Means approach does not live up to the standard, either visually or quantitatively: it is too biased toward point density and produces suboptimal rep-rep/dist-rep ratios.
We will exclude k-Means from now on and focus on the other two methods.
| N | Hierarchical rr/avg | Hierarchical rr/max | Graph-Disconnect rr/avg | Graph-Disconnect rr/max |
|---|---|---|---|---|
10 | 2.2066 | 0.7140 | 2.1005 | 0.8868 |
25 | 1.5561 | 0.6281 | 0.2790 | 0.1034 |
63 | 1.1474 | 0.3402 | 0.1688 | 0.0426 |
159 | 0.9536 | 0.3582 | 0.2131 | 0.0779 |
380 | 1.1536 | 0.4710 | 0.1439 | 0.0456 |
We can see that, for both Hierarchical Clustering and Graph-Disconnect, their efficacy ratios seem to decrease with larger N values.
For N=10, Graph-Disconnect has superior worst-case coverage, while Hierarchical Clustering has superior average-case coverage.
For N>10, Graph-Disconnect suffers from a sharp drop in efficacy ratios, while Hierarchical Clustering continues to perform consistently well, and beats Graph-Disconnect on all measures.
It is worth noting that, in my trials, the coverage efficacy of Graph-Disconnect is highly influenced by its performance constraint parameters (radius and branching factor). Increasing these parameters tends to improve coverage ratios, but severely degrades performance. For large N, I have not found parameter values that allow Graph-Disconnect to outperform Hierarchical Clustering in coverage while completing in a practical amount of time.
In summary, Hierarchical Clustering produces better point coverage in nearly all scenarios.
The following are recorded execution times for both algorithms on the same computer, measured in seconds.
For the Graph-Disconnect algorithm, the radius is chosen to be 6000, and the branching factor is chosen to be 180. These parameters have a significant impact on the execution time of the algorithm.
N | Hierarchical | Graph-Disconnect |
---|---|---|
10 | 0.41 | 49.89 |
25 | 0.44 | 86.59 |
63 | 0.49 | 92.02 |
159 | 0.60 | 96.90 |
380 | 0.82 | 96.15 |
As seen here, Hierarchical Clustering is far superior to Graph-Disconnect when it comes to computational performance.
Across different approaches to selecting representative points for optimal point coverage given a collection of input points, Hierarchical Clustering is overwhelmingly successful, both in terms of coverage metrics, and in computational performance.
The only case favoring Graph-Disconnect is very small N values, and only if long execution times are acceptable.