New LMNN version #309 (Open)
johny-c wants to merge 11 commits into master
Conversation

@johny-c commented Mar 20, 2021

Hi all,

this is a PR to replace LMNN. I also submitted this version to scikit-learn some time ago (see scikit-learn/scikit-learn#8602) and @bellet suggested submitting it here too. I removed some of the bells and whistles from the scikit-learn submission to make the code more readable and more easily maintainable, while following the conventions of metric-learn.

There are currently 9 tests failing, all on the same test function, namely the components initialization via PCA or LDA. I'm not sure what the appropriate way to handle these failures in metric-learn is. I'm guessing other algorithms have faced the same failures, so I'm opening this PR as is and awaiting comments.

FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-3-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-5-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-7-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-3-5-11-7] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-5-7-7] - ValueError: n_components=5 must be between 0 and min(n_samples, n_features)=4 with svd_solver='full'
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-5-11-7] - ValueError: n_components=5 must be between 0 and min(n_samples, n_features)=4 with svd_solver='full'
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-7-5-11] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-7-7-11] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).
FAILED test/test_mahalanobis_mixin.py::test_auto_init_transformation[LMNN-5-7-11-11] - ValueError: n_components cannot be larger than min(n_features, n_classes - 1).

@wdevazelhes (Member) commented Mar 22, 2021

Thanks for the PR @johny-c ! 👍

The failures seem to happen because test_auto_init_transformation creates toy examples with very few samples, and some of them contain singleton classes. Those are removed by this PR's _validate_params method, so the data no longer has the right format (e.g. enough classes) for initializations like LDA.

Putting the line X, y, classes = self._validate_params(X, y) just after self.components_ = _initialize_components... instead of after X, y = self._prepare_inputs(X... seems to solve the problem. That is, initializing components_ with the original data and pruning it only for the processing done by LMNN itself. However, this would slightly change the behaviour of the initialization (which, after this modification, would take the singleton classes into account). But I think it makes sense to fit such an init with singleton classes if we are able to, no?
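A minimal, self-contained sketch of that ordering (hypothetical helper name, with a PCA init used purely for illustration; not the PR's actual fit()):

import numpy as np
from sklearn.decomposition import PCA

def init_then_prune(X, y, n_components=2):
    # 1) Initialize the components from the original data, singleton classes included.
    components = PCA(n_components=n_components).fit(X).components_
    # 2) Prune singleton classes afterwards, only for the LMNN optimization itself.
    classes, counts = np.unique(y, return_counts=True)
    keep = np.isin(y, classes[counts > 1])
    return components, X[keep], y[keep]

X = np.random.RandomState(0).rand(10, 3)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3])  # class 3 is a singleton
components, X_pruned, y_pruned = init_then_prune(X, y)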

That said, I'm wondering if we shouldn't also allow singleton classes to be taken into account in LMNN? We can't form similar pairs from those samples (since there's just one), but we could maybe use them in dissimilar pairs, no? (If we don't take them into account, nothing prevents those singletons from sitting inside a cluster of points from a different class.)

PS: the CI tests are failing because we need to update the repo to match the new scikit-learn API (see #311); we'll fix that soon

@johny-c (Author) commented Mar 23, 2021

Hi @wdevazelhes,

thanks for spotting the issue. I think the first solution would be best, namely initializing the components with all samples and then removing the singleton-class samples. This can only negatively affect the algorithm if the initialization uses class labels, which so far is only the case for LDA.

Regarding taking singleton classes into account in LMNN as negative samples: I guess this could be done, but I think it would make the algorithm and code more complex, and I'm uncertain about any potential gains. Arguably, if one intends to use an algorithm that works with triplets, they would make sure that each class has at least 2 samples. I would even argue that throwing an exception if this is not satisfied would make sense. In the scikit-learn PR, the solution suggested and implemented was to simply remove the singleton samples from the beginning.

@wdevazelhes (Member) commented Mar 26, 2021

@johny-c, thanks for updating the PR. I should have a bit more time by the middle of next week to fix #311, so that the checks here can go green.

About taking singleton classes into account as negative examples: I'm bothered, for instance, by use cases like person re-identification, for which many users are using metric-learn (https://github.com/scikit-learn-contrib/metric-learn/network/dependents?package_id=UGFja2FnZS01MjI5NjIxOA%3D%3D). In those cases, I think it would still be useful to take singleton classes into account, because there can be a lot of them: for instance, the "labeled faces in the wild" dataset contains 5749 people in total but only 1680 people with two or more images (see http://vis-www.cs.umass.edu/lfw/#information). In this case, dropping singleton classes would mean dropping 4069 samples... I fear that dropping them could cause a significant drop in performance...

@johny-c (Author) commented Mar 28, 2021

@wdevazelhes, that's a good point. In practice, this can be done by taking classes to be the non-singleton classes when iterating over them in find_impostors, and considering impostor candidates with y != class_id instead of y > class_id. The issue with this change is efficiency: I tried it in an early version of the scikit-learn PR and it makes the code about half as fast (which makes sense). Maybe this should be controlled by another user parameter (e.g. use_singletons)? I'm not sure.
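A toy sketch of the candidate-selection difference (hypothetical and simplified; not the PR's code):

import numpy as np

y = np.array([0, 0, 1, 1, 2])  # class 2 is a singleton
classes, counts = np.unique(y, return_counts=True)

for class_id in classes[counts > 1]:  # iterate over non-singleton classes only
    # Current scheme: candidates only from classes with a larger label, so each
    # class pair is visited once (faster, but dropped singletons never contribute).
    candidates_current = np.where(y > class_id)[0]
    # Discussed alternative: any sample with a different label is a candidate,
    # so kept singleton-class samples can also act as impostors (roughly 2x work).
    candidates_proposed = np.where(y != class_id)[0]
    print(class_id, candidates_current, candidates_proposed)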

@wdevazelhes (Member) commented:

@johny-c good question, I guess it would indeed be problematic if LMNN takes too long to run.

But I think that dropping only singleton classes would modify the training data in a biased way (i.e. the modified training data would no longer really be a true i.i.d. "subsample" of the original data), so maybe instead we would rather want to train on a random subsample of the original data?
If so, we could leave LMNN as is (the version that keeps singleton classes) and let users subsample their data themselves.

However, this is maybe not ideal, because if there were a lot of classes with just two elements, subsampling could turn them into singleton classes, and we might then lose many points for which we can form target neighbors...

In the end, to keep the code simple and keep the data as much as possible as is, I'm thinking maybe we could raise a warning if there are a lot of singleton classes, something like: "The dataset contains many singleton classes (XX, or YY% of the dataset), which will only be used when computing impostors. Training may be longer because of them; you may want to drop some of them to make training faster." Then it's up to the users to apply their own strategy for dropping singleton classes?
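A minimal sketch of such a warning (hypothetical helper name and threshold, not part of the PR):

import warnings
import numpy as np

def warn_if_many_singletons(y, threshold=0.2):
    classes, counts = np.unique(y, return_counts=True)
    n_singletons = int(np.sum(counts == 1))
    frac = n_singletons / len(y)
    if frac > threshold:
        warnings.warn(
            f"The dataset contains many singleton classes ({n_singletons}, or "
            f"{100 * frac:.1f}% of the dataset), which will only be used when "
            "computing impostors. Training may be longer because of them; you may "
            "want to drop some of them to make training faster.")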

But I'm not really sure...

@bellet, @perimosocordiae , @terrytangyuan , @nvauquie , what do you think about this ?

@terrytangyuan (Member) commented:

Yes, I think a warning is fine in this case to keep the code less complex. We should probably not modify the original data and instead leave it up to the user to clean the data and decide what strategy to use.

@angelotc commented Apr 5, 2021

Hello! When trying your PR, I get the following error. Ran from Google Colab:

Code:

from sklearn.datasets import make_friedman1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install git+https://github.com/johny-c/metric-learn.git

def friedman_np_to_df(X,y):
  return pd.DataFrame(X,columns=['x0','x1', 'x2', 'x3', 'x4']), pd.Series(y)

# Make training set. We don't care about Y so call it NA.
X_train, NA = make_friedman1(n_samples=1000, n_features=5, random_state = 1) 
X_train, NA = friedman_np_to_df(X_train,NA)

#categorize training set based off of x0
domain_list = []
for i in range(len(X_train)):
  if X_train.iloc[i]['x0'] < 0.6 :
    domain_list.append(1)
  else:
    domain_list.append(0)


X_train['domain'] = domain_list

# Set training set to where domain == 1 (x0 < 0.6), but also add ~60 out-of-domain samples (X_train['domain'] == 0)

out_of_domain = X_train[X_train['domain'] == 0][:60]
X_train =  X_train[X_train['domain']==1]
X_train = pd.concat([out_of_domain, X_train])
y_train = X_train.copy()
X_train = X_train.drop(columns = ['domain'])
y_train = y_train['domain']


# Make testing set with a different random_state
X_test, NA2 = make_friedman1(n_samples=1000, n_features=5, random_state = 3)
X_test, NA2 = friedman_np_to_df(X_test,NA2)


#categorize testing set based off of x0
domain_list = []
for i in range(len(X_test)):
  if X_test.iloc[i]['x0'] < 0.6:
    domain_list.append(1)
  else:
    domain_list.append(0)
X_test['domain'] = domain_list

y_test = X_test['domain'].copy()
X_test = X_test.drop(columns = ['domain'])


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from metric_learn import LMNN
lmnn_knn = Pipeline(steps=[('lmnn', LMNN()), ('knn', KNeighborsClassifier())])
parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
              'lmnn__k':[2,3],
              'knn__n_neighbors':[2,3],
              'knn__weights': ['uniform','distance'],
              'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'knn__leaf_size': [28,30,32],
              'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
grid_lmnn_knn.score(np.array(X_test), np.array(y_test))

Output:

Fitting 2 folds for each of 1152 candidates, totalling 2304 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 17.6min
[Parallel(n_jobs=-1)]: Done 1246 tasks      | elapsed: 27.5min
[Parallel(n_jobs=-1)]: Done 1796 tasks      | elapsed: 39.5min
[Parallel(n_jobs=-1)]: Done 2304 out of 2304 | elapsed: 50.6min finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-0d2c55d1bedc> in <module>()
     63               'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
     64 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
---> 65 grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
     66 grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
     67 

6 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    734             # of the params are estimators as well.
    735             self.best_estimator_ = clone(clone(base_estimator).set_params(
--> 736                 **self.best_params_))
    737             refit_start_time = time.time()
    738             if y is not None:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     69     new_object_params = estimator.get_params(deep=False)
     70     for name, param in new_object_params.items():
---> 71         new_object_params[name] = clone(param, safe=False)
     72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     70     for name, param in new_object_params.items():
     71         new_object_params[name] = clone(param, safe=False)
---> 72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)
     74 

TypeError: __init__() got an unexpected keyword argument 'init'

@johny-c (Author) commented Apr 5, 2021

Hi @angelotc, thanks for trying this out. One issue I can see is that the number-of-neighbors parameter is now n_neighbors instead of k, for consistency with scikit-learn naming (lmnn__k --> lmnn__n_neighbors).

@angelotc commented Apr 5, 2021

Did not realize that the parameters were renamed in this PR! Now I am getting the following, though init should be a parameter, no?:

New param grid:

parameters = {'lmnn__init': ['pca', 'lda', 'identity', 'random'],
              'lmnn__n_neighbors':[2,3],
              'knn__n_neighbors':[2,3],
              'knn__weights': ['uniform','distance'],
              'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'knn__leaf_size': [28,30,32],
              'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}

Output:


Collecting git+https://github.com/johny-c/metric-learn.git
  Cloning https://github.com/johny-c/metric-learn.git to /tmp/pip-req-build-fitppg1y
  Running command git clone -q https://github.com/johny-c/metric-learn.git /tmp/pip-req-build-fitppg1y
Requirement already satisfied (use --upgrade to upgrade): metric-learn==0.3.0 from git+https://github.com/johny-c/metric-learn.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (0.22.2.post1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->metric-learn==0.3.0) (1.0.1)
Building wheels for collected packages: metric-learn
  Building wheel for metric-learn (setup.py) ... done
  Created wheel for metric-learn: filename=metric_learn-0.3.0-py2.py3-none-any.whl size=20418 sha256=5f4304f1854d9baf2addf29205be74f8606b616a37bb4654e9785382741e3806
  Stored in directory: /tmp/pip-ephem-wheel-cache-l_uxezae/wheels/37/bb/37/d1f7cdc04867e7f3238ae031563cf2d4d54fa038dff40b9cd7
Successfully built metric-learn
Fitting 2 folds for each of 1152 candidates, totalling 2304 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed: 10.6min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 18.8min
[Parallel(n_jobs=-1)]: Done 1246 tasks      | elapsed: 29.4min
[Parallel(n_jobs=-1)]: Done 1796 tasks      | elapsed: 42.3min
[Parallel(n_jobs=-1)]: Done 2304 out of 2304 | elapsed: 54.2min finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-6e7010734268> in <module>()
     65               'knn__metric': [ 'manhattan', 'mahalanobis', 'minkowski']}
     66 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='f1')
---> 67 grid_lmnn_knn.fit(np.array(X_train),np.array(y_train))
     68 grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
     69 

6 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    734             # of the params are estimators as well.
    735             self.best_estimator_ = clone(clone(base_estimator).set_params(
--> 736                 **self.best_params_))
    737             refit_start_time = time.time()
    738             if y is not None:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     69     new_object_params = estimator.get_params(deep=False)
     70     for name, param in new_object_params.items():
---> 71         new_object_params[name] = clone(param, safe=False)
     72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in <listcomp>(.0)
     57     # XXX: not handling dictionaries
     58     if estimator_type in (list, tuple, set, frozenset):
---> 59         return estimator_type([clone(e, safe=safe) for e in estimator])
     60     elif not hasattr(estimator, 'get_params') or isinstance(estimator, type):
     61         if not safe:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in clone(estimator, safe)
     70     for name, param in new_object_params.items():
     71         new_object_params[name] = clone(param, safe=False)
---> 72     new_object = klass(**new_object_params)
     73     params_set = new_object.get_params(deep=False)
     74 

TypeError: __init__() got an unexpected keyword argument 'init'

@johny-c (Author) commented Apr 6, 2021

Yes, it should. I'm not sure what's going wrong here. Have you tried isolating the problem by removing all the other parts, like GridSearchCV and Pipeline?

@angelotc commented Apr 6, 2021

Tried that!

!pip install git+https://github.com/johny-c/metric-learn.git
from metric_learn import LMNN
inits = ['pca', 'lda', 'identity', 'random']
for init_param in inits:
  lmnn = LMNN(init=init_param)

Same error.

Collecting git+https://github.com/johny-c/metric-learn.git
  Cloning https://github.com/johny-c/metric-learn.git to /tmp/pip-req-build-g4w95cib
  Running command git clone -q https://github.com/johny-c/metric-learn.git /tmp/pip-req-build-g4w95cib
Requirement already satisfied (use --upgrade to upgrade): metric-learn==0.3.0 from git+https://github.com/johny-c/metric-learn.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (0.22.2.post1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from metric-learn==0.3.0) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->metric-learn==0.3.0) (1.0.1)
Building wheels for collected packages: metric-learn
  Building wheel for metric-learn (setup.py) ... done
  Created wheel for metric-learn: filename=metric_learn-0.3.0-py2.py3-none-any.whl size=20418 sha256=3a82cbd80fba81b5798d874527b3586c34e09488a2d11d28cafe649cb036c416
  Stored in directory: /tmp/pip-ephem-wheel-cache-bgggttl6/wheels/37/bb/37/d1f7cdc04867e7f3238ae031563cf2d4d54fa038dff40b9cd7
Successfully built metric-learn
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-b9501fab5074> in <module>()
      4 inits = ['pca', 'lda', 'identity', 'random']
      5 for init_param in inits:
----> 6   lmnn = LMNN(init=init_param)
      7 
      8 

TypeError: __init__() got an unexpected keyword argument 'init'

Also, not sure if it's worth noting, but before that I was getting errors from the SDML import in __init__.py, so I just commented out the SDML import.

@wdevazelhes (Member) commented:

@johny-c #313 was merged, so if you merge with master the tests should go green now. Actually, I'm thinking we could merge this PR as is in the end (if everyone agrees) and deal with the singleton classes and the warning message mentioned above later, since the current LMNN also doesn't support singleton classes (see comment #313 (comment)):

the current LMNN is even more strict than the newer PR (it needs at least n_neighbors + 1 elements in each class, not just 2), so the new PR would actually allow dealing with more cases, and we could deal with singleton classes in another PR.

@angelotc thanks for investigating. I tried to run your snippet just above and realized the pip install actually didn't install this PR but the master branch of @johny-c's fork, which is an old metric-learn version (hence the bugs). After replacing your first line with !pip install git+https://github.com/johny-c/metric-learn.git@lmnn, it worked; maybe you can try it again on your full example?

@johny-c (Author) commented Apr 16, 2021

@wdevazelhes, that's great! I need to make a few changes before pushing: the max_impostors param is crucial for memory and should be included. I'll push within the next 2-3 days.

@angelotc commented:

@wdevazelhes Wow, thank you so much! I got around .93 AUC on my synthetic dataset example. Someone in my group also created a Mahalanobis-distance-based classifier with similar performance. I wanted to compare this to it, and it looks promising. Hopefully this will work well on our materials datasets.

@angelotc commented Apr 16, 2021

the current LMNN is even more strict than the newer PR (it needs at least n_neighbors + 1 elements in each class, not just 2), so the new PR would actually allow dealing with more cases, and we could deal with singleton classes in another PR.

You make a good point here: there is a minimum number of samples needed in each class. In my binary classification example, I initially tried training with 600 samples of class 1 and testing on 600 samples of class 1 and 400 samples of class 0. This doesn't work due to the nature of the algorithm, so I received an error. I then trained on 600 samples of class 0 and ran additional simulations where I gradually add more samples of the other class; these were my results:

[figure: results for increasing fractions x of added class-1 training samples]

So x = 0.01 means that, with 600 samples of class 0 in the training set, 1% of the samples from the other class were also added to the training set (6 samples of class 1 in this case).

@johny-c (Author) commented Apr 17, 2021

If you want to review again, it's ready from my side.

@wdevazelhes (Member) commented:

@angelotc I'm glad it worked quite well, and indeed, I guess negative pairs are very important for learning the decision boundary; interesting curve! It's nice to see you can get a good AUC with so few samples from the other class, although I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced, with 600 ones and 400 zeros, contrary to the training set; is that due to experimental constraints, for instance?)

@wdevazelhes (Member) commented:

If you want to review again, it's ready from my side.

Thanks a lot @johny-c ! I'll try to do a review this week

@angelotc commented Apr 21, 2021

@angelotc I'm glad it worked quite well, and indeed, I guess negative pairs are very important for learning the decision boundary; interesting curve! It's nice to see you can get a good AUC with so few samples from the other class, although I'm curious why you don't have the same class balance between your train and test sets? (Your test set seems quite balanced, with 600 ones and 400 zeros, contrary to the training set; is that due to experimental constraints, for instance?)

Hello, my academic research group is interested in uncertainty quantification in materials science, but the materials field has small sample sizes; the dataset I am working with only has around 400 in-domain samples and fewer than 10 out-of-domain samples. My predecessors have actually treated this as a regression problem, and that work has been published here. We sort of treat it as an unsupervised learning problem, but my PI told me it would be fun to think of it as a supervised classification problem (which I am exploring with metric-learn). We actually had more success just using the raw Mahalanobis distance of each test-set sample to the centroid of the training set (in-domain samples), but this was still a fun journey.

[figure: results plot]

@angelotc commented:

Coincidentally, this paper basically models what we are attempting to explore, except that ours is in the materials field rather than dialog systems (though I do work with dialog systems). Pretty funny that it was written by Huawei people as well, seeing that's where you work.

@wdevazelhes (Member) commented:

@angelotc I see, thanks for the reference, this out-of-domain detection use case of metric learning sounds interesting indeed!
Btw, we just opened the GitHub Discussions section of metric-learn, https://github.com/scikit-learn-contrib/metric-learn/discussions, so don't hesitate to use it if you have any remarks to discuss regarding metric-learn ;)

@angelotc commented:

Hello, I am trying this out on another dataset now, for cuprate classification of superconductors. I'm running into an issue where the ind_a and ind_b arrays are getting corrupted. I believe this is happening at lines 525-526 of lmnn.py.

Here is the output from the error. I can share my notebook and dataset file via email, since I don't know if the prof wants me to share either publicly.

Fitting 2 folds for each of 288 candidates, totalling 576 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 368 tasks      | elapsed:   16.9s
[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed:   25.8s finished
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-21-6c1595135885> in <module>()
     16 
     17 grid_lmnn_knn = GridSearchCV(lmnn_knn, parameters,cv = 2, n_jobs=-1, verbose=True, scoring='roc_auc')
---> 18 grid_lmnn_knn = grid_lmnn_knn.fit(np.array(X_train), np.array(y_train))
     19 roc_auc_score = grid_lmnn_knn.score(np.array(X_test), np.array(y_test))
     20 results.append(roc_auc_score)

14 frames
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    737             refit_start_time = time.time()
    738             if y is not None:
--> 739                 self.best_estimator_.fit(X, y, **fit_params)
    740             else:
    741                 self.best_estimator_.fit(X, **fit_params)

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    348             This estimator
    349         """
--> 350         Xt, fit_params = self._fit(X, y, **fit_params)
    351         with _print_elapsed_time('Pipeline',
    352                                  self._log_message(len(self.steps) - 1)):

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    313                 message_clsname='Pipeline',
    314                 message=self._log_message(step_idx),
--> 315                 **fit_params_steps[name])
    316             # Replace the transformer of the step with the fitted
    317             # transformer. This is necessary when loading the transformer

/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    726     with _print_elapsed_time(message_clsname, message):
    727         if hasattr(transformer, 'fit_transform'):
--> 728             res = transformer.fit_transform(X, y, **fit_params)
    729         else:
    730             res = transformer.fit(X, y, **fit_params).transform(X)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in fit(self, X, y)
    220     # Call the optimizer
    221     self.n_iter_ = 0
--> 222     opt_result = minimize(**optimizer_params)
    223 
    224     # Reshape the solution found by the optimizer

/usr/local/lib/python3.7/dist-packages/scipy/optimize/_minimize.py in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
    608     elif meth == 'l-bfgs-b':
    609         return _minimize_lbfgsb(fun, x0, args, jac, bounds,
--> 610                                 callback=callback, **options)
    611     elif meth == 'tnc':
    612         return _minimize_tnc(fun, x0, args, jac, bounds, callback=callback,

/usr/local/lib/python3.7/dist-packages/scipy/optimize/lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
    343             # until the completion of the current minimization iteration.
    344             # Overwrite f and g:
--> 345             f, g = func_and_grad(x)
    346         elif task_str.startswith(b'NEW_X'):
    347             # new iteration

/usr/local/lib/python3.7/dist-packages/scipy/optimize/lbfgsb.py in func_and_grad(x)
    293     else:
    294         def func_and_grad(x):
--> 295             f = fun(x, *args)
    296             g = jac(x, *args)
    297             return f, g

/usr/local/lib/python3.7/dist-packages/scipy/optimize/optimize.py in function_wrapper(*wrapper_args)
    325     def function_wrapper(*wrapper_args):
    326         ncalls[0] += 1
--> 327         return function(*(wrapper_args + args))
    328 
    329     return ncalls, function_wrapper

/usr/local/lib/python3.7/dist-packages/scipy/optimize/optimize.py in __call__(self, x, *args)
     63     def __call__(self, x, *args):
     64         self.x = numpy.asarray(x).copy()
---> 65         fg = self.fun(x, *args)
     66         self.jac = fg[1]
     67         return fg[0]

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _loss_grad_lbfgs(self, L, X, y, classes, target_neighbors, pull_loss_grad_m)
    340     # Find impostors
    341     impostors_graph = self._find_impostors(X_embedded, y, classes,
--> 342                                            dist_tn[:, -1])
    343 
    344     # Compute the push loss and its gradient

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _find_impostors(self, X_embedded, y, classes, margin_radii)
    428 
    429       ind_impostors = _find_impostors_blockwise(X_embedded, margin_radii,
--> 430                                                 ind_b, ind_a)
    431 
    432       n_impostors = len(ind_impostors)

/usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py in _find_impostors_blockwise(X, radii, ind_a, ind_b, return_distance, block_size)
    520   for chunk in gen_batches(n_samples_a, block_n_rows):
    521 
--> 522     dist = euclidean_distances(X[ind_a[chunk]], X_b, squared=True,
    523                                Y_norm_squared=X_b_norm_squared)
    524 

IndexError: index 947380 is out of bounds for axis 0 with size 5314

The array looks fine when output by _find_impostors, but not in _find_impostors_blockwise, leading me to think it is being corrupted around lines 525/526:

> /usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py(522)_find_impostors_blockwise()
    520   for chunk in gen_batches(n_samples_a, block_n_rows):
    521 
--> 522     dist = euclidean_distances(X[ind_a[chunk]], X_b, squared=True,
    523                                Y_norm_squared=X_b_norm_squared)
    524 

ipdb> ind_a
array([   2669,    3386,    5540,    8190,   21013,   22068,   23031,
         23576,   27355,   28453,   28894,   32398,   34464,   34579,
         35427,   35972,   41765,   43127,   43962,   46115,   46596,
         47108,   47729,   48406,   59039,   61196,   64114,   77970,
         78425,   82708,   86698,   90830,   92560,   93573,   94290,
         99473,  100018,  109465,  109803,  110348,  112846,  115199,
        117224,  117679,  121861,  121941,  130463,  131008,  135818,
        136273,  145117,  151123,  151668,  157679,  158160,  161074,
        165218,  166808,  167263,  169382,  171404,  171975,  172669,
        173073,  173514,  179810,  184376,  190675,  191392,  195356,
        197899,  201930,  202385,  202536,  206062,  206517,  207223,
        207934,  208636,  209163,  209289,  209857,  210000,  210702,
        214834,  217141,  217818,  224280,  224685,  227186,  231318,
        233428,  233955,  234081,  234649,  234792,  235053,  235494,
        236187,  236668,  238213,  238924,  239626,  240040,  241327,
        243714,  247382,  247837,  250483,  251581,  251759,  254088,
        255400,  257336,  257741,  261945,  263951,  265051,  265768,
        268042,  268497,  268885,  269430,  270951,  271496,  279215,
        279760,  288326,  292458,  295364,  297857,  298127,  299540,
        305110,  305225,  310476,  312271,  312816,  314079,  315132,
        316166,  317250,  318609,  319971,  328420,  330600,  332367,
        336220,  338794,  340958,  343003,  344056,  345625,  346342,
        348496,  358511,  358738,  361476,  361934,  361995,  363221,
        363357,  365144,  365599,  368371,  369784,  377785,  382148,
        386268,  395103,  395797,  396201,  402938,  404022,  405599,
        406080,  406592,  409956,  411974,  417829,  418927,  419466,
        422992,  423447,  427632,  429654,  430880,  431384,  433786,
        435388,  435843,  442050,  445759,  455408,  458619,  459275,
        460688,  463089,  463634,  469479,  470577,  471018,  475248,
        481304,  491050,  491165,  493744,  494079,  494624,  497832,
        497974,  504409,  504954,  507859,  507970,  508541,  509086,
        512673,  513218,  519229,  519710,  525786,  526012,  528238,
        528353,  529519,  530932,  531625,  532106,  534556,  535011,
        535539,  536901,  538211,  538253,  538480,  538967,  545350,
        546952,  547407,  549482,  549861,  550406,  552774,  553179,
        553756,  556002,  561414,  563143,  564641,  565209,  565352,
        566054,  567236,  568773,  569341,  569484,  570186,  571333,
        572267,  574701,  574971,  575539,  576384,  578785,  579330,
        581169,  581737,  581880,  582582,  582680,  585301,  586714,
        588736,  589163,  590218,  592284,  595505,  597379,  597924,
        601829,  603242,  605835,  606919,  606933,  607374,  609854,
        610823,  611141,  614225,  614793,  614936,  615638,  616820,
        618231,  618357,  618925,  619068,  619329,  619770,  622529,
        623010,  626303,  626848,  628111,  638364,  638839,  639445,
        640201,  645195,  645912,  649347,  649915,  650760,  651883,
        652110,  660462,  665178,  674924,  675039,  675887,  676432,
        679684,  680317,  681034,  684963,  685897,  686357,  688601,
        689169,  689312,  690014,  690112,  693262,  694481,  695026,
        697770,  698225,  702487,  703540,  705003,  705697,  706101,
        711327,  711895,  712038,  712740,  714762,  718430,  718885,
        723026,  723597,  723723,  724291,  724434,  724695,  725136,
        725226,  726002,  726283,  727217,  727835,  730826,  731281,
        732516,  735993,  736119,  736687,  736830,  737091,  737532,
        738165,  738655,  738882,  739218,  743222,  743677,  748495,
        749212,  752687,  753168,  754693,  755410,  757564,  757679,
        760255,  762851,  762977,  763545,  763688,  763949,  764390,
        764725,  765270,  768067,  772146,  772601,  777957,  779235,
        780248,  780290,  781253,  781798,  783367,  783637,  785050,
        787499,  788554,  789517,  790062,  791662,  791723,  792949,
        793085,  797183,  799847,  800392,  802211,  802928,  809214,
        809329,  813974,  814449,  815055,  815324,  816375,  816920,
        820249,  821771,  822122,  824304,  824779,  826141,  826705,
        827250,  828819,  829832,  829874,  830885,  831940,  834590,
        834969,  835514,  837175,  838537,  840832,  841359,  842443,
        842457,  845439,  846801,  847683,  849096,  850492,  850534,
        851497,  852042,  854410,  857174,  861448,  864639,  864681,
        864908,  866494,  867310,  867648,  869248,  869703,  872349,
        872475,  873043,  873186,  873447,  873888,  875446,  875901,
        878355,  878900,  882535,  882805,  883373,  884218,  886252,
        890153,  897247,  897964,  898300,  900233,  900725,  903339,
        904437,  910062,  910288,  914060,  915841,  916078,  917609,
        918154,  920487,  921421,  922270,  932957,  937264,  937306,
        944765,  945002,  945818,  947380,  949875,  952779,  953049,
        954462,  959227,  959464,  960280,  961187,  961313,  961881,
        962024,  962285,  962726,  963253,  963379,  963947,  964090,
        964351,  964792,  966188,  966230,  968411,  970990,  973012,
        973689,  974406,  975199,  983386,  984939,  990111,  990237,
        990805,  990948,  991650,  993208,  993663,  996309,  997407,
        997585,  998995,  999406,  999861,  999929, 1005604, 1006059,
       1006745, 1007462, 1011426, 1011831, 1012785, 1017995, 1026144,
       1026259, 1030020, 1034474, 1039124, 1039245, 1043714, 1045001,
       1046762, 1046804])
ipdb> ind_b
array([ 80584,  80713,  80727,  80830,  80989,  81022,  81155,  81156,
        81386,  81417,  81542,  81544,  81589,  81654,  81655,  81717,
        81731,  81740,  81744,  81770,  81875,  81893,  81935,  81995,
        82005,  82009,  82034,  82073,  82122,  82237,  82256,  82260,
        82290,  82307,  82310,  82318,  82429,  82563, 261040, 261945,
       447398, 448136, 455408, 506794, 525096, 525106, 525786, 525858,
       526012, 526196, 526639, 526686, 698398, 698430, 698545, 698596,
       698608, 698667, 698687, 698771, 698851, 698884, 699040, 699103,
       699122, 699132, 699153, 699199, 699350, 699399, 699437, 699715,
       699737, 699976, 699981, 700117, 700150, 700313, 822122, 909129,
       909443, 909907, 910062, 910127, 910288, 910806, 911031, 966188])
ipdb> up
> /usr/local/lib/python3.7/dist-packages/metric_learn/lmnn.py(430)_find_impostors()
    428 
    429       ind_impostors = _find_impostors_blockwise(X_embedded, margin_radii,
--> 430                                                 ind_b, ind_a)
    431 
    432       n_impostors = len(ind_impostors)

ipdb> ind_a
array([   1,    3,    5, ..., 5298, 5306, 5313])
ipdb> ind_b
array([   0,    2,    4, ..., 5310, 5311, 5312])
ipdb> 

@johny-c (Author) commented Apr 28, 2021

Hi @angelotc , thanks for catching this. I think I know what's wrong. I changed something in the last commit, but it's easy to fix.

@wdevazelhes (Member) left a review:

Hi @johny-c! Sorry for the delay, here is a review. Overall it looks good to me. I didn't manage to review all the details of the computations, but I think that if the tests are sufficient, i.e. they ensure on a large enough set of test cases that the result is the same as a naive but easier-to-verify version of LMNN, it should be enough.

Also, I'm thinking it could be good to run the new LMNN with this benchmark code, https://github.com/wdevazelhes/metric-bench, to see how much we gain in performance/accuracy on standard datasets, so we can advertise it in the release notes etc. when doing the new release (since LMNN is one of the most used algorithms from metric-learn, it could encourage people to give the new version a try).
I'll try to do it if I have time, but it may be faster if you try on your side; I guess the benchmark code should be easy to reuse.

(Resolved review threads on test/metric_learn_test.py and metric_learn/lmnn.py)
pull_loss_grad_m = _sum_weighted_outer_products(X, tn_graph)

# Compute the pull loss weight such that the push loss weight becomes 1
pull_loss_weight = (1 - self.push_loss_weight) / self.push_loss_weight
@wdevazelhes (Member) commented:

I see the idea, but why not keep the cost function of the form reg * push_loss + (1 - reg) * pull_loss, with reg between 0 and 1 as given by the user? I agree it would be the same up to a scaling factor, but I think the scaling factor can have a small impact on the training algorithm, no? (I guess it would be related to the magnitude of the gradient steps?)

If so, I'd think it's better that the scaling of the problem doesn't depend on push_loss_weight (which is an argument of LMNN), so that when people change this parameter there is no additional scaling effect adding to the effect of the regularization itself.

Also, it will be better for users when debugging or monitoring the training for different push_loss_weight values if the printed cost function is of the same order (because if, say, self.push_loss_weight=1e-5, then pull_loss_weight=99999, which gives a much higher cost function than, say, push_loss_weight=0.5).
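For concreteness, a minimal hedged sketch (hypothetical values and variable names, not the PR's code) of how the two parameterizations relate; they differ only by a factor of 1 / push_loss_weight:

pull_loss, push_loss = 2.0, 3.0  # assume these were computed from the data
w = 0.5                          # user-set push_loss_weight

loss_convex = (1 - w) * pull_loss + w * push_loss         # reg-style formulation
pull_loss_weight = (1 - w) / w                            # rescaling used in the PR
loss_rescaled = pull_loss_weight * pull_loss + push_loss  # push term has weight 1

assert abs(loss_rescaled - loss_convex / w) < 1e-12       # same objective up to 1/w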

@johny-c (Author) commented May 23, 2021:

The benefit is that you only need to multiply the pull loss by the rescaled weight once, and never the push loss, because its rescaled weight is one. I don't think the scaling has any negative effect in practice. If a user sets push_loss_weight=1e-5, then they really intend for the pull loss to have such a high relative weight.

@johny-c (Author) commented:

I will check but I think this is a bit complicated to change in the current code.

@wdevazelhes (Member) commented:

I guess it is fine like this, but I think it would be good to at least mention in the docstring of the push_loss_weight parameter that such a rescaling will happen?

Another possibility that may be good: I looked at our User Guide for LMNN, and in fact, as on the Wikipedia page for LMNN, the weight on the pull loss is 1 and the weight on the push loss is set by the user (whereas in the original paper they use weights that sum to 1).
So there's a bit of freedom in how people parameterize LMNN, and I'm thinking about this solution:
instead of push_loss_weight, you could allow the user to set a pull_loss_weight (i.e. yet another formulation than Wikipedia or the original paper, but they're all fine). Then in the docstring we would just say that pull_loss_weight is the coefficient in front of the pull loss term (so not a relative weight), which implicitly means the coefficient in front of the push loss term is 1. Doing that, you would just have to replace the line of code above (i.e. pull_loss_weight = (1 - self.push_loss_weight) / self.push_loss_weight) by pull_loss_weight = self.pull_loss_weight. The last thing would be to update the User Guide by putting the coefficient c in front of the pull loss instead of the push loss.

I think that would preserve your code as much as possible while being very transparent and understandable (in the sense that the value of the cost function and gradient is exactly the one advertised in the User Guide, and there's no hidden rescaling).

What do you think ?

@johny-c (Author) commented:

I think this would be more confusing. The paper you cite is the journal paper, which was published a few years after the conference paper. The original conference paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf) contains a formula for the cost function (Equation 2) with a push loss weight (c), and the formula in the User Guide is taken from there. So, I think keeping the push_loss_weight as a user parameter would be the most consistent option.
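For reference, the cost function in that conference paper (its Eq. 2, as recalled here, so worth double-checking against the PDF) puts the weight c on the push term only, roughly:

\varepsilon(L) = \sum_{j \rightsquigarrow i} \|L(x_i - x_j)\|^2 + c \sum_{j \rightsquigarrow i,\, l} (1 - y_{il}) \left[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \right]_+

where j \rightsquigarrow i means x_j is a target neighbor of x_i, y_{il} = 1 iff x_i and x_l share the same label, and [\cdot]_+ is the hinge.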

(Resolved review threads on test/metric_learn_test.py and metric_learn/lmnn.py)


@pytest.mark.parametrize('X, y, loss', [(np.array([[0], [1], [2], [3]]),
-                                         [1, 1, 0, 0], 3.0),
+                                         [1, 1, 0, 0], 6.0),
@wdevazelhes (Member) commented:

Any idea why the loss value is not the same here? Maybe some of the difference comes from the scaling of the problem, due to the new way the push_loss_weight term in the cost function is computed? (cf. my comment above)

@johny-c (Author) commented:

Yes, exactly. The pull loss and the push loss are equally weighted by 1 in this case.

@wdevazelhes (Member) commented:

OK, let's see what we decide for the way to express the cost function and the push/pull loss weights (see comment #309 (comment)), and if it changes we can recompute that cost function by hand and put it in the test

np.testing.assert_almost_equal(rel_diff, 0., decimal=5)


def test_loss_func(capsys):
@wdevazelhes (Member) commented:

Why did you remove this test? I think we should keep it, even if we remove some checks that cannot be performed anymore (like the checks of the gradients at each iteration, which we cannot do because the optimization now uses L-BFGS-B); at least it's a good check of the cost function at the first iteration, for instance, no?

@johny-c (Author) commented:

I think I removed it because the checks that can remain are covered by other tests, such as test_toy_ex_lmnn(). Sure, if you think there's still something that's not covered already, I'll add it back.

@wdevazelhes (Member) commented:

Yes, I think test_toy_ex_lmnn is good because we can compute the result by hand, but it's maybe too light and simple. I think it would be good to have a more "real" example, with many random points (more than the 10 in the original example, maybe around 40).
For that, I think it would be quite simple to adapt the previous example and just check the first iteration of the loss and gradient: the whole last part of the original test with callbacks etc. (not applicable here because of L-BFGS) could be removed. Let me know what you think.

@johny-c (Author) commented:

Sure, sounds good.

@wdevazelhes (Member) commented:

@angelotc, did you try the new commit from @johny-c on your example again ? It would be interesting to know if it works now

@angelotc commented:

@angelotc, did you try the new commit from @johny-c on your example again ? It would be interesting to know if it works now

Yes, the algorithm performs well, with .98 AUC at distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it perform at detecting the risk of out-of-domain samples? Out-of-domain samples can be of different classes. In my research, Isolation Forest does not work well, since it's highly correlated with the Euclidean distance from the centroid. So we have started looking at the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.

@wdevazelhes (Member) commented Aug 3, 2021

Yes, the algorithm performs well, with .98 AUC at distinguishing cuprates from non-cuprates on our superconductors dataset. However, my research group is more interested in unsupervised methods; e.g. if we train the model on in-domain samples, how well does it perform at detecting the risk of out-of-domain samples? Out-of-domain samples can be of different classes. In my research, Isolation Forest does not work well, since it's highly correlated with the Euclidean distance from the centroid. So we have started looking at the raw Mahalanobis distance to the centroid of the training set, which works well for the diffusion dataset but not for our superconductors datasets. I do see that there is the Covariance method in the metric-learn module, which I will explore in the next few weeks.

@angelotc Sorry for the very late reply. Indeed, the Covariance method can be interesting, since it returns a raw Mahalanobis metric computed from the training samples; so if it works well as a metric to the centroid in some cases, maybe it would even work well as a general metric (e.g. for computing distances of test samples to the closest sample from the training set, not necessarily the centroid, to see whether a test sample is an outlier or not).
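A small sketch of that idea (assuming the released metric_learn.Covariance API with fit/transform; random data used only as a stand-in):

import numpy as np
from metric_learn import Covariance
from sklearn.metrics import pairwise_distances

X_train = np.random.RandomState(0).rand(100, 5)   # in-domain training samples
X_test = np.random.RandomState(1).rand(20, 5)     # samples to score

cov = Covariance().fit(X_train)
# Euclidean distances in the transformed space equal Mahalanobis distances.
D = pairwise_distances(cov.transform(X_test), cov.transform(X_train))
outlier_score = D.min(axis=1)  # distance to the closest training sample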

Also, Covariance is currently the only truly unsupervised metric learning algorithm in metric-learn

@wdevazelhes (Member) commented:

@johny-c thanks a lot for the commits!
Sorry for my very late reply; everything looks good to me. There's just this tiny point about the weights in the cost function (see comment #309 (comment)), let me know what you think, as well as the extra test (#309 (comment)), and it should be good!

@johny-c (Author) commented Aug 29, 2021

Hi @wdevazelhes, I added back the test with more samples and removed the callbacks part. Regarding the push loss weight, see my reply above; I guess I can state more clearly in the docstring what happens. Other than that, is it good to go?

@bellet (Member) commented Sep 27, 2021

Dear @johny-c, thanks a lot for the great work on this. Sorry I haven't had time yet to take a close look at this but I will try to do this soon.

I don't remember if you provided somewhere an empirical comparison to the current implementation, to show for instance that it is faster in most cases while achieving similar accuracy? Of course, it is quite clear that your implementation is cleaner and easier to maintain :-)

@bellet (Member) commented Sep 27, 2021

Also, if I'm not mistaken, this implementation solves a slightly different optimization problem (the hinge loss is squared). Can you confirm, @johny-c? If so, we should probably add a note about this in the documentation.

@johny-c (Author) commented Oct 17, 2021

Hi @bellet , thanks! Yes, the objective here uses squared distances as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons to the previous version. Do you have any specific data sets in mind? I can do this, but I'm not sure when I will have free time.

@bellet (Member) commented Oct 18, 2021

Hi @bellet , thanks! Yes, the objective here uses squared distances as in the paper (https://proceedings.neurips.cc/paper/2005/file/a7f592cef8b130a6967a90617db5681b-Paper.pdf). I haven't done any comparisons to the previous version. Do you have any specific data sets in mind? I can do this, but I'm not sure when I will have free time.

Hi @johny-c, I was referring to the use of the squared hinge loss (that is, a max squared instead of a max in the constraint/loss), which is needed to make the loss differentiable.
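A minimal generic illustration of the distinction (not taken from the PR's code): the squared hinge is differentiable at the margin boundary, while the plain hinge is not.

import numpy as np

def hinge(m):
    # plain hinge [m]_+ : kink at m = 0, not differentiable there
    return np.maximum(m, 0.0)

def squared_hinge(m):
    # squared hinge [m]_+^2 : smooth at m = 0
    return np.maximum(m, 0.0) ** 2

m = np.linspace(-1.0, 1.0, 5)
print(hinge(m), squared_hinge(m))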

@bellet (Member) commented Oct 18, 2021

@johny-c for a quick benchmark, I think a few standard classification datasets (like the ones in sklearn) would be sufficient, just to show that the solutions found by this implementation lead to roughly the same KNN accuracy.
