New LMNN version #309
Conversation
Thanks for the PR @johny-c ! 👍 The failures seem to happen because […]. Putting the line […] That said, I'm wondering if we shouldn't also allow the singleton classes to be taken into account in LMNN? We can't form similar pairs from those samples (since there's just one), but we could maybe use them in dissimilar pairs, no? (If we don't take them into account, nothing will prevent those singletons from ending up inside a cluster of points from a different class.) PS: the CI tests are failing because we need to update the repo to match the new scikit-learn API (see #311), we'll fix that soon.
Hi @wdevazelhes, thanks for spotting the issue. I think the first solution would be best, namely initializing the components with all samples and then removing the singleton-class samples. I think this can only negatively affect the algorithm if the initialization uses class labels, which currently means only LDA. Regarding taking singleton classes into account in LMNN as negative samples: I guess this could be done, but I think it would make the algorithm and the code more complex, and I'm uncertain about any potential gains. Arguably, if one intends to use an algorithm that works with triplets, they would make sure that each class has at least 2 samples. I would even argue that throwing an exception if this is not satisfied would make sense. In the PR in […]
@johny-c, thanks for updating the PR. I should have a bit more time by the middle of next week to fix #311, so that we can see the checks here go green. About taking singleton classes into account as negative examples, I'm bothered for instance by use cases like person re-identification, which is a use case for which many users are using […]
@wdevazelhes, that's a good point. In practice, this can be done by considering […]
@johny-c good question, I guess it would indeed be problematic if LMNN takes too long to run. But I think that dropping only the singleton classes would modify the training data in a biased way (i.e. the modified training data would not really be a true iid "subsample" of the original data anymore), so maybe instead we would rather want to train on a random subsample of the original data? However, this is maybe not ideal either, because if there were a lot of classes with just two elements, they might become singleton classes, and we might then lose many points for which we can form target neighbors... In the end, in order to keep the code simple and keep the data as much as possible as is, I'm thinking maybe we could raise a warning if there are a lot of singleton classes, something like […] But I'm not really sure... @bellet, @perimosocordiae, @terrytangyuan, @nvauquie, what do you think about this?
Yes, I think a warning is fine in this case, to keep the code less complex. We should probably not modify the original data and leave it up to the user to clean the data and decide what strategy to use.
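As a rough illustration of the warning idea (this sketch is mine, not code from the PR; the function name and the 10% threshold are arbitrary), something along these lines could count singleton classes and warn without touching the data:

import warnings
import numpy as np

def warn_if_singleton_classes(y, max_fraction=0.1):
    # Count how many classes have only a single sample
    classes, counts = np.unique(y, return_counts=True)
    n_singletons = int(np.sum(counts == 1))
    if n_singletons > max_fraction * len(classes):
        warnings.warn(f"{n_singletons} of {len(classes)} classes have a single sample; "
                      "these samples have no possible target neighbors and will not "
                      "contribute pull-loss terms.", UserWarning)

For example, warn_if_singleton_classes(np.array([0, 0, 1, 2, 3])) would warn, since three of the four classes are singletons.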
Hello! When trying your PR, I get the following error. Ran from Google Colab.

Code:

from sklearn.datasets import make_friedman1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install git+https://github.com/johny-c/metric-learn.git
def friedman_np_to_df(X,y):
return pd.DataFrame(X,columns=['x0','x1', 'x2', 'x3', 'x4']), pd.Series(y)
# Make training set. We don't care about Y so call it NA.
X_train, NA = make_friedman1(n_samples=1000, n_features=5, random_state = 1)
X_train, NA = friedman_np_to_df(X_train,NA)
#categorize training set based off of x0
domain_list = []
for i in range(len(X_train)):
if X_train.iloc[i]['x0'] < 0.6 :
domain_list.append(1)
else:
domain_list.append(0)
X_train['domain'] = domain_list
# Set training set to where domain == 1 (x0 < 0.6), but also add ~60 out-of-domain samples (X_train['domain'] == 0)
out_of_domain = X_train[X_train['domain'] == 0][:60]
X_train = X_train[X_train['domain']==1]
X_train = pd.concat([out_of_domain, X_train])
y_train = X_train.copy()
X_train = X_train.drop(columns = ['domain'])
y_train = y_train['domain']
# Make testing set with a different random_state
X_test, NA2 = make_friedman1(n_samples=1000, n_features=5, random_state = 3)
X_test, NA2 = friedman_np_to_df(X_test,NA2)
#categorize testing set based off of x0
domain_list = []
for i in range(len(X_test)):
if X_test.iloc[i]['x0'] < 0.6:
domain_list.append(1)
else:
domain_list.append(0)
X_test['domain'] = domain_list
y_test = X_test['domain'].copy()
X_test = X_test.drop(columns = ['domain'])
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from metric_learn import LMNN
lmnn_knn = Pipeline(steps=[('lmnn', LMNN()), ('knn', KNeighborsClassifier())])
parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
'lmnn__k':[2,3],
'knn__n_neighbors':[2,3],
'knn__weights': ['uniform','distance'],
'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': [28,30,32],
'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
grid_lmnn_knn.score(np.array(X_test), np.array(y_test))

Output: […]
Hi @angelotc , thanks for trying this out. One issue that I can see is that the number of neighbors parameter is now called n_neighbors in this PR, not k.
Did not realize that the parameters were renamed in this PR! Now I am getting the following, though init should be a parameter, no?

New param grid:

parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
'lmnn__n_neighbors':[2,3],
'knn__n_neighbors':[2,3],
'knn__weights': ['uniform','distance'],
'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': [28,30,32],
'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}

Output: […]
Yes, it should. I'm not sure what's going wrong here. Have you tried isolating the problem by removing all the other parts, like […] ?
Tried that!

!pip install git+https://github.com/johny-c/metric-learn.git
from metric_learn import LMNN
inits = ['pca', 'lda', 'identity', 'random']
for init_param in inits:
    lmnn = LMNN(init=init_param)

Same error.
Also, not sure if it's worth noting, but before that I was getting errors from the SDML import in __init__.py, so I just commented out the SDML import.
@johny-c #313 was merged, so if you merge with master the tests should go green now. Actually, I'm thinking we could maybe merge the PR as is in the end (if everyone agrees), and deal with the singleton classes and the warning message mentioned above later, since the current LMNN also doesn't support singleton classes (see comment #313 (comment)).
@angelotc thanks for investigating. I tried to run your snippet just above and realized the pip install actually didn't install this PR but the master from @johny-c, which is an old […]
@wdevazelhes , that's great! I need to make a few changes before pushing - the […]
@wdevazelhes Wow, thank you so much! Got around .93 AUC on my synthetic dataset example. Someone in my group also created a Mahalanobis-distance-based classifier that has similar performance. Wanted to compare this to it, and it looks promising. Hopefully this will work well on our materials datasets.
You make a good point that there is a minimum number of samples needed in each class. In my binary classification example, I initially tried training with 600 samples of class 1 and testing on 600 samples of class 1 and 400 samples of class 0. This doesn't work due to the nature of the algorithm, and thus I received an error. So I tried training on 600 samples of class 0 and ran additional simulations where I progressively add more samples of the other class; these were my results: […] So for x = 0.01, this means that if there were 600 samples of class 0 in the training set, 1% of the samples from the other class were also added to the training set (6 samples of class 1 in this case were added to the training set).
If you want to review again, it's ready from my side.
@angelotc I'm glad it worked quite well, and indeed, I guess negative pairs are very important to learn the decision frontier - interesting curve! It's nice to see you can get a good AUC with so few samples from the other class, although I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced with 600 ones and 400 zeros, contrary to the train set - is that due to experimental constraints, for instance?)
Thanks a lot @johny-c ! I'll try to do a review this week.
Hello, my academic research group is interested in the field of uncertainty quantification in materials science, but the materials field has small sample sizes; the dataset that I am working with only has around 400 in-domain samples and fewer than 10 out-of-domain samples. My predecessors have actually treated this as a regression problem, and that work has been published here. We sort of treat it as an unsupervised learning problem, but my PI told me it would be fun to think of it as a supervised classification problem (which I am exploring with metric-learn). We actually had more success with just using the raw Mahalanobis distance of each test sample to the centroid of the training set's in-domain samples, but still, this was a fun journey.
Coincidentally, this paper basically models what we are attempting to explore, except that our setting is in the materials field rather than dialog systems (though I do work with dialog systems). Pretty funny that it was written by Huawei people as well, seeing as that is where you work.
@angelotc I see, thanks for the reference, this out-of-domain detection use case of metric learning sounds interesting indeed!
Hello, I am trying this out on another dataset now, for superconductor cuprate classification. I'm running into an issue where the ind_a and ind_b arrays are getting corrupted. I believe this is happening in lines 525-526 of the lmnn.py file. Here is my output from the error. I can share my notebook and dataset file via email, since I don't know if the prof wants me to share either publicly.

[…]

The arrays look fine when they are output from _find_impostors, but not from _find_impostors_blockwise, leading me to think they are being corrupted around lines 525/526:

[…]
Hi @angelotc , thanks for catching this. I think I know what's wrong. I changed something in the last commit, but it's easy to fix.
Hi @johny-c ! Sorry for the delay, here is a review. Overall it looks good to me. I didn't manage to review all the details of the computations, but I think that if the tests are sufficient, i.e. they ensure on a large enough set of test cases that the result is the same as a naive but easier-to-verify version of LMNN, it should be enough.
Also, I'm thinking it could be good to run the new LMNN with this benchmark code https://github.com/wdevazelhes/metric-bench, to see how much we gain in performance/accuracy on standard datasets, so we can advertise it in the release notes etc. when doing the new release (since LMNN is one of the most used algorithms from metric-learn, it could encourage people to give the new version a try).
I'll try to do it if I have time, but it may be faster if you try on your side; I guess the benchmark code should be easy to reuse.
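To make the "naive but easier-to-verify version" idea concrete, here is a rough sketch of what such a reference could look like (written for illustration, not taken from the PR or its test suite; the function name, the margin of 1, and the convention of fixing target neighbors by Euclidean distance in the input space are assumptions): the LMNN objective computed with explicit loops, which a test could compare against the optimized implementation on small random datasets.

import numpy as np

def naive_lmnn_loss(X, y, L, n_neighbors=3, push_loss_weight=0.5):
    # Target neighbors: nearest same-class points in the *input* space (fixed up front)
    d2_in = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Squared distances in the transformed space
    Z = X @ L.T
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    pull, push = 0.0, 0.0
    for i in range(len(X)):
        same = np.flatnonzero((y == y[i]) & (np.arange(len(X)) != i))
        others = np.flatnonzero(y != y[i])
        targets = same[np.argsort(d2_in[i, same])[:n_neighbors]]
        for j in targets:
            pull += d2[i, j]
            # Hinge on impostors: differently labeled points within the margin of (i, j)
            push += np.maximum(0, 1 + d2[i, j] - d2[i, others]).sum()
    return (1 - push_loss_weight) * pull + push_loss_weight * push

A test could then draw a few random X, y and transformations L and assert that this reference value matches the loss returned by the new implementation.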
pull_loss_grad_m = _sum_weighted_outer_products(X, tn_graph)

# Compute the pull loss weight such that the push loss weight becomes 1
pull_loss_weight = (1 - self.push_loss_weight) / self.push_loss_weight
I see the idea, but why not keep the cost function of the form reg * push_loss + (1 - reg) * pull_loss, with reg between 0 and 1 as given by the user? I agree it would be the same up to a scaling factor, but I think the scaling factor can have a small impact on the training algorithm, no? (I guess it would be related to the magnitude of the gradient steps?)

If so, I'd think it's better that the scaling of the problem doesn't depend on push_loss_weight (which is an argument of LMNN), so that if people change this parameter, there is not an additional scaling effect that adds to the effect of the regularization itself.

Also, it will be better for users when debugging or monitoring the training for different push_loss_weight values if the printed cost function is of the same order of magnitude (because if, say, self.push_loss_weight=1e-5, then pull_loss_weight=99999, which will give a much higher cost function than if, say, push_loss_weight=0.5).
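For reference, the point being debated here is just a rescaling of the objective: with a user-supplied weight $c \in (0, 1)$, the two parameterizations differ by a positive constant factor, so they share the same minimizer but their objective (and gradient) values differ in scale:

$$\min_L\;(1 - c)\,\ell_{\mathrm{pull}}(L) + c\,\ell_{\mathrm{push}}(L)
\;\iff\;
\min_L\;\frac{1 - c}{c}\,\ell_{\mathrm{pull}}(L) + \ell_{\mathrm{push}}(L),$$

since dividing the objective by the constant $c > 0$ multiplies its value and its gradient by $1/c$ without changing the argmin.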
The benefit is that you only need to multiply the pull loss by the rescaled weight once, and the push loss never, because its rescaled weight is one. I don't think the scaling has any negative effect in practice. If a user sets push_loss_weight=1e-5, then they really intend for the pull loss to have that much higher a weight, relatively speaking.
I will check but I think this is a bit complicated to change in the current code.
I guess it is fine like this, but I think it would be good to at least advertise in the docstring of the push_loss_weight parameter that such a rescaling will happen.

Another possibility that may be good: I looked at our User Guide for LMNN, and in fact, as on the Wikipedia page for LMNN, the weight on the pull loss is 1 and the weight on the push loss is set by the user (whereas in the original paper, they use weights that sum to 1). So there is a bit of freedom in how people parameterize LMNN, and I'm thinking about this solution: instead of the push loss weight, you could allow the user to set the pull loss weight directly (i.e. yet another formulation than Wikipedia or the original paper, but they're all fine). Then in the docstring we would just say that pull_loss_weight is the coefficient in front of the pull loss term (so not a relative weight), which implicitly means the coefficient in front of the push loss term is 1. Doing that, you would just have to replace the line of code above (i.e. pull_loss_weight = (1 - self.push_loss_weight) / self.push_loss_weight) by pull_loss_weight = self.pull_loss_weight. The last thing would be to update the User Guide by putting the coefficient c in front of the pull loss instead of the push loss.

I think that would preserve your code as much as possible while being very transparent and understandable (in the sense that the value of the cost function and gradient is exactly the one advertised in the User Guide, and there's no hidden rescaling).

What do you think?
I think this would be more confusing. The paper you cite is the journal paper, which was published a few years after the conference paper. The original conference paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf) contains a formula for the cost function (Equation 2) with a push loss weight (c), and the formula in the User Guide is taken from there. So I think keeping push_loss_weight as a user parameter would be the most consistent option.
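For context (written from memory, so the exact notation may differ slightly from the paper), the conference-paper objective with the push loss weight $c$ that the User Guide follows is of the form

$$\varepsilon(L) \;=\; \sum_{i,\;j \rightarrow i} \lVert L(x_i - x_j)\rVert^2
\;+\; c \sum_{i,\;j \rightarrow i,\;l} (1 - y_{il})
\Big[\,1 + \lVert L(x_i - x_j)\rVert^2 - \lVert L(x_i - x_l)\rVert^2\,\Big]_+,$$

where $j \rightarrow i$ means $x_j$ is a target neighbor of $x_i$, $y_{il} = 1$ iff $y_i = y_l$, and $[\cdot]_+ = \max(\cdot, 0)$.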
@pytest.mark.parametrize('X, y, loss', [(np.array([[0], [1], [2], [3]]),
-                                        [1, 1, 0, 0], 3.0),
+                                        [1, 1, 0, 0], 6.0),
Any idea why the loss value is not the same here? Maybe the difference comes from the scaling of the problem, due to the new way the push_loss_weight term in the cost function is computed? (cf. my comment above)
Yes, exactly. The pull loss and the push loss are equally weighted by 1 in this case.
OK, let's see what we decide for the way to express the cost function and the push/pull loss weights (see comment #309 (comment)), and if it changes we can recompute that cost function by hand and put it in the test
test/metric_learn_test.py
np.testing.assert_almost_equal(rel_diff, 0., decimal=5)


def test_loss_func(capsys):
Why did you remove this test? I think we should keep it, even if we drop some checks that cannot be performed anymore (like the checks on the gradients at each iteration, which we cannot do because the optimization now uses L-BFGS-B); at least it's a good check of the cost function at the first iteration, for instance, no?
I think I removed it because the checks that can remain are covered by other tests, such as test_toy_ex_lmnn(). Sure, if you think there's still something that's not covered already, I'll add it back.
Yes, I think test_toy_ex_lmnn is good because we can compute the result by hand, but it's maybe too light and simple; I think it would be good to have a more "real" example, with many random points (even more than the 10 in the original example, maybe around 40).

For that, I think it would be quite simple to adapt the previous example, just checking the loss and gradient at the first iteration: the whole last part of the original test with callbacks etc. (not applicable here because of L-BFGS) could be removed. Let me know what you think.
Sure, sounds good.
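As an illustration of what checking the loss and gradient at the first iteration on random points can look like (a standalone sketch, not the PR's test; it uses a simplified pull-loss-only objective, fixed same-class pairs in place of target neighbors, and the identity as the starting point), a finite-difference check along these lines could be adapted:

import numpy as np
from scipy.optimize import approx_fprime

rng = np.random.RandomState(42)
X = rng.randn(40, 3)
y = rng.randint(0, 2, 40)

# Same-class pairs play the role of (fixed) target-neighbor pairs here
pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X)) if y[i] == y[j]]

def pull_loss(L_flat):
    # Sum of squared distances between the pairs under the linear map L
    L = L_flat.reshape(3, 3)
    Z = X @ L.T
    return sum(((Z[i] - Z[j]) ** 2).sum() for i, j in pairs)

L0 = np.eye(3).ravel()                                 # "first iteration": identity init
numeric = approx_fprime(L0, pull_loss, 1e-7).reshape(3, 3)

# Analytic gradient: d/dL ||L d||^2 = 2 L d d^T, summed over the pairs (L = I here)
analytic = np.zeros((3, 3))
for i, j in pairs:
    d = (X[i] - X[j])[:, None]
    analytic += 2 * (d @ d.T)

np.testing.assert_allclose(numeric, analytic, rtol=1e-4, atol=1e-2)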
Yes, the algorithm performs well, with .98 AUC for distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it perform at detecting out-of-domain samples? Out-of-domain samples can be of different classes. Through my research, Isolation Forest does not work well, since it's highly correlated with the Euclidean distance from the centroid. So we have started revisiting the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.
@angelotc Sorry for the very late reply. Indeed, the Covariance method can be interesting, since it returns a raw Mahalanobis metric computed from the training samples, so if it works well as a metric to the centroid in some cases, maybe it would even work well as a general metric (for instance for computing the distance of each test sample to the closest sample from the training set, not necessarily the centroid, to see whether that test sample is an outlier or not). Also, Covariance is currently the only truly unsupervised metric learning algorithm in metric-learn.
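A minimal sketch of that nearest-training-sample idea (for illustration only; the toy data, the 95th-percentile threshold, and the outlier criterion are arbitrary choices): fit Covariance on in-domain samples only, then score test samples by their Mahalanobis distance to the closest training sample.

import numpy as np
from sklearn.datasets import make_blobs
from metric_learn import Covariance

# Toy stand-ins for an in-domain training set and a mixed test set
X_train, _ = make_blobs(n_samples=200, centers=1, random_state=0)
X_test, _ = make_blobs(n_samples=50, centers=3, random_state=1)

cov = Covariance()
cov.fit(X_train)                  # unsupervised: uses only the (in-domain) samples

# In the transformed space, Euclidean distance equals the learned Mahalanobis distance
Z_train = cov.transform(X_train)
Z_test = cov.transform(X_test)

# Distance of each test sample to its nearest training sample
d_nearest = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=-1).min(axis=1)
is_outlier = d_nearest > np.percentile(d_nearest, 95)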
@johny-c thanks a lot for the commits
Hi @wdevazelhes , I added back the test with more samples, removing the callbacks part. Regarding the push loss weight, see my reply above; I guess I can state more clearly in the docstring what happens. Other than that, is it good to go?
Dear @johny-c, thanks a lot for the great work on this. Sorry, I haven't had time yet to take a close look, but I will try to do so soon. I don't remember if you provided somewhere an empirical comparison to the current implementation, to show for instance that it is faster in most cases while achieving similar accuracy? Of course, it is quite clear that your implementation is cleaner and easier to maintain :-)
Also, if I'm not mistaken, this implementation solves a slightly different optimization problem (the hinge loss is squared); can you confirm, @johny-c? If so, we should probably add a note about this in the documentation.
Hi @bellet , thanks! Yes, the objective here uses squared distances, as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons to the previous version. Do you have any specific datasets in mind? I can do this, but I'm not sure when I will have free time.
Hi @johny-c, I was referring to the use of the squared hinge loss (that is, a squared max instead of a max in the constraint/loss), which is needed to make the loss differentiable.
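To spell out the difference being pointed out (a paraphrase of the comment, not text from the docs): with $d_L^2(x, x') = \lVert L(x - x')\rVert^2$, the standard LMNN push term penalizes each impostor through a hinge, while the squared-hinge variant described here squares it, which makes the term differentiable and thus suitable for L-BFGS:

$$\big[\,1 + d_L^2(x_i, x_j) - d_L^2(x_i, x_l)\,\big]_+
\qquad\text{vs.}\qquad
\big[\,1 + d_L^2(x_i, x_j) - d_L^2(x_i, x_l)\,\big]_+^{\,2}.$$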
@johny-c for a quick benchmark, I think a few standard classification datasets (like the ones in sklearn) would be sufficient, just to show that the solutions found by this implementation lead to roughly the same accuracy in KNN.
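Something like the following quick comparison could serve as that check (a sketch assuming the new estimator keeps the n_neighbors name discussed earlier; the datasets and cross-validation settings are arbitrary), run once against the released metric-learn and once with this PR installed:

from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from metric_learn import LMNN

for name, loader in [('iris', load_iris), ('wine', load_wine),
                     ('breast_cancer', load_breast_cancer)]:
    X, y = loader(return_X_y=True)
    # Learn the metric with LMNN, then classify with KNN in the transformed space
    pipe = make_pipeline(StandardScaler(), LMNN(n_neighbors=3),
                         KNeighborsClassifier(n_neighbors=3))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: KNN accuracy after LMNN = {scores.mean():.3f} +/- {scores.std():.3f}")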
Hi all,
this is a PR to replace LMNN. I submitted this version to scikit-learn some time ago as well (see scikit-learn/scikit-learn#8602), and @bellet suggested submitting it here too. I removed some of the bells and whistles submitted to scikit-learn to make the code more readable and more easily maintainable, while following the conventions of metric-learn.

There are currently 9 tests failing, all on the same test function, namely the components initialization via PCA or LDA. I'm not sure what the appropriate way to handle these failures in metric-learn is. I'm guessing other algorithms have faced the same failures, so I'm starting this PR as is and await comments.