# How to achieve sample weighting? #3
You can simply multiply sample weights into your distance matrix here, such that `X = X * sample_weight * sample_weight[:, np.newaxis]` to multiply sample weights on both rows and columns of the distance matrix. I am leaving this open here because we probably should add this to the documentation. Feel free to make a pull request for the Python wrapper in https://github.com/kno10/python-kmedoids/ if you would like this to be in the API.
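For illustration, a minimal Rust sketch of this two-sided weighting (the toy dissimilarity matrix and weights below are made-up values, and it assumes the crate's `ndarray` support):

```rust
use ndarray::prelude::*;
use kmedoids::{fasterpam, random_initialization};

fn main() {
    // Toy symmetric dissimilarity matrix for four samples (made-up values).
    let diss = array![
        [0.0, 1.0, 4.0, 5.0],
        [1.0, 0.0, 3.0, 4.0],
        [4.0, 3.0, 0.0, 2.0],
        [5.0, 4.0, 2.0, 0.0],
    ];
    // One weight per sample (made-up values).
    let weight = array![1.0, 2.0, 1.0, 0.5];
    // Scale columns by w_j and rows by w_i: d'(i, j) = w_i * w_j * d(i, j).
    let wdiss = &diss * &weight * weight.slice(s![.., NewAxis]);
    let mut rng = rand::thread_rng();
    let mut medoids = random_initialization(4, 2, &mut rng);
    let (loss, labels, _, _): (f64, _, _, _) = fasterpam(&wdiss, &mut medoids, 100);
    println!("loss = {loss}, labels = {labels:?}");
}
```

Since both sides are scaled, the weighted matrix stays symmetric, so no source changes are needed to use it.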
@kno10 Thanks for your suggestion! I tried it out, but it does not quite work as expected. The k-means objective function with sample weights could be expressed as

$$J_{\text{means}} = \sum_{i=1}^{N} w_i \,\lVert x_i - \mu_{c(i)} \rVert^2, \quad \text{where} \quad \mu_k = \frac{\sum_{i \in C_k} w_i\, x_i}{\sum_{i \in C_k} w_i},$$

with $c(i)$ the cluster assigned to sample $i$ and $C_k$ the set of samples in cluster $k$. If I'm not mistaken, the k-medoids objective function works similarly:

$$J_{\text{medoids}} = \sum_{i=1}^{N} w_i \, d\bigl(x_i, m_{c(i)}\bigr), \quad \text{where} \quad m_k = \arg\min_{x_j,\; j \in C_k} \sum_{i \in C_k} w_i\, d(x_i, x_j).$$

I understand that, to keep the weighted loss on the same scale as the unweighted one, the weights should satisfy $w_i \ge 0$ and $\sum_i w_i = N$ (as in the test code below, where I normalize them accordingly). Below is an example. _[Figure: example clusterings; the red dots are centroids or medoids.]_ Considering weights in k-means improves the final metric for my specific task (statistically over a dataset), but considering weights this way in k-medoids significantly worsens the performance (by favoring outliers).
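Presumably the cause is that, with the weights multiplied into both rows and columns, i.e. $\tilde d(i,j) = w_i\, w_j\, d(x_i, x_j)$, the solver effectively minimizes

$$\sum_{i=1}^{N} w_i \, w_{s(i)} \, d\bigl(x_i, m_{c(i)}\bigr),$$

where $s(i)$ is the index of the sample serving as medoid $m_{c(i)}$: every term is additionally scaled by the weight of the chosen medoid, which does not match the intended objective above.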
Ok, I see. Then what would be the desired weighting scheme? You could encode the weights on one side of the matrix only, i.e., $\tilde d(i,j) = w_i\, d(x_i, x_j)$, so that the loss becomes exactly $\sum_i w_i\, d\bigl(x_i, m_{c(i)}\bigr)$; the matrix then becomes asymmetric, though.
@kno10 Thanks for the suggestion. This indeed works, but only after small tweaks to the source code. When the dissimilarity matrix is symmetric, everything works fine; but when it turns asymmetric, the order of the indices passed to `mat.get` matters. The patch:

```diff
diff --git a/src/fasterpam.rs b/src/fasterpam.rs
index 3749a06..1ea4f97 100644
--- a/src/fasterpam.rs
+++ b/src/fasterpam.rs
@@ -232,7 +232,7 @@ where
// Improvement from the journal version:
let mut acc = L::zero();
for (o, reco) in data.iter().enumerate() {
- let djo = mat.get(j, o);
+ let djo = mat.get(o, j);
// New medoid is closest:
if djo < reco.near.d {
acc += L::from(djo) - L::from(reco.near.d);
@@ -316,7 +316,7 @@ where
reco.near = DistancePair::new(b as u32, N::zero());
return L::zero();
}
- let djo = mat.get(j, o);
+ let djo = mat.get(o, j);
// Nearest medoid is gone:
if reco.near.i == b as u32 {
if djo < reco.seco.d {
```

Test code:

```rust
use rand::prelude::*;
use ndarray::prelude::*;
use ndarray_rand::RandomExt;
use kmedoids::{fasterpam, random_initialization};
fn euclidean_distances(a: &ArrayView2<f64>) -> Array2<f64> {
let n = a.nrows();
Array2::from_shape_fn((n, n), |(i, j)| {
let x = a.index_axis(Axis(0), i);
let y = a.index_axis(Axis(0), j);
distance(&x, &y)
})
}
fn distance(x: &ArrayView1<f64>, y: &ArrayView1<f64>) -> f64 {
let v = x - y;
v.dot(&v).sqrt()
}
fn main() {
const N: usize = 64;
const D: usize = 2;
const K: usize = 3;
let mut rng = rand::rngs::StdRng::seed_from_u64(0);
// Test data generation
let (sample, weight) = {
let uniform = rand::distributions::Uniform::new(0., 1.);
let sample = Array2::<f64>::random_using((N, D), uniform, &mut rng);
let mut weight = Array1::<f64>::random_using(N, uniform, &mut rng);
weight /= weight.sum();
weight *= N as f64;
(sample, weight)
};
// K-medoids
let diss = euclidean_distances(&sample.view());
let diss = diss * weight.slice(s![.., NewAxis]); // Apply weights (sample, medoid)
let mut medoids = random_initialization(N, K, &mut rng);
let (loss, labels, _, _): (f64, _, _, _) = fasterpam(&diss, &mut medoids, 100);
dbg!(loss);
// Check loss, before the fix, loss will be inconsistent with loss2 and loss3
let loss2: f64 = itertools::izip!(sample.outer_iter(), labels.iter(), weight.iter())
.map(|(x, &l, &w)| {
let medoid = medoids[l];
let y = sample.index_axis(Axis(0), medoid);
distance(&x, &y) * w
})
.sum();
let loss3: f64 = labels.iter().enumerate()
.map(|(i, &l)| {
let medoid: usize = medoids[l];
diss[[i, medoid]]
})
.sum();
assert_eq!(loss, loss2);
assert_eq!(loss, loss3);
}
```
TODO: to merge this, all the variants need to be checked to use consistent access to the matrix. When doing this, we can also allow asymmetric matrices, although this has more potential for misuse (by passing a data matrix instead of a distance matrix; very few data matrices will be square).
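If that misuse is a concern, a cheap plausibility check could be added at the API boundary. A sketch (the helper below is hypothetical, not part of the crate):

```rust
use ndarray::prelude::*;

/// Hypothetical sanity check for user-supplied dissimilarity matrices:
/// must be square, and every self-dissimilarity must be zero.
/// Asymmetry is permitted, since weighted matrices may be asymmetric.
fn looks_like_dissimilarity(mat: &ArrayView2<f64>) -> bool {
    let (rows, cols) = mat.dim();
    // Data matrices are rarely square, so the shape test alone catches
    // most "passed the data instead of the distances" mistakes.
    rows == cols && (0..rows).all(|i| mat[[i, i]] == 0.0)
}

fn main() {
    let diss = array![[0.0, 2.0], [1.0, 0.0]]; // asymmetric but plausible
    assert!(looks_like_dissimilarity(&diss.view()));

    let data = array![[0.5, 1.5], [2.5, 3.5]]; // square by accident, nonzero diagonal
    assert!(!looks_like_dissimilarity(&data.view()));
}
```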
I made some of the changes in 6a82530 but have not added any tests for asymmetric use yet. Some methods will be able to support this, others will not (e.g., Silhouette itself is defined based on pairwise distances of samples, and will not support asymmetric use, but the medoid Silhouette can). |
Thanks for this awesome library. It's fast, powerful, and has already helped me a lot.
The sklearn KMeans algorithm has an optional `sample_weight` option, specifying weights for each observation. I'm wondering if this functionality is available in your library as well, e.g., by passing some sort of weighted dissimilarities? Or should I fork this crate and extend an algorithm?

P.S.: Currently, I'm not involved in clustering research; I'm simply using it for my purposes. Hence, I'm uncertain whether weighting is even well-defined for k-medoids algorithms. So, apologies for any disturbance.