diff --git a/README.rst b/README.rst index b2f6e6d4..106851e4 100644 --- a/README.rst +++ b/README.rst @@ -17,6 +17,7 @@ metric-learn contains efficient Python implementations of several popular superv - Relative Components Analysis (RCA) - Metric Learning for Kernel Regression (MLKR) - Mahalanobis Metric for Clustering (MMC) +- Online Algorithm for Scalable Image Similarity (OASIS) **Dependencies** diff --git a/doc/introduction.rst b/doc/introduction.rst index e9ff0015..a2a6032b 100644 --- a/doc/introduction.rst +++ b/doc/introduction.rst @@ -4,17 +4,16 @@ What is Metric Learning? ======================== -Many approaches in machine learning require a measure of distance between data -points. Traditionally, practitioners would choose a standard distance metric +Many approaches in machine learning require a measure of distance (or similarity) +between data points. Traditionally, practitioners would choose a standard metric (Euclidean, City-Block, Cosine, etc.) using a priori knowledge of the domain. However, it is often difficult to design metrics that are well-suited to the particular data and task of interest. -Distance metric learning (or simply, metric learning) aims at -automatically constructing task-specific distance metrics from (weakly) -supervised data, in a machine learning manner. The learned distance metric can -then be used to perform various tasks (e.g., k-NN classification, clustering, -information retrieval). +Metric learning aims at automatically constructing +task-specific metrics from (weakly) supervised data, in a machine learning manner. +The learned metric can then be used to perform various tasks (e.g., +k-NN classification, clustering, information retrieval). 
Problem Setting =============== @@ -25,19 +24,19 @@ of supervision available about the training data: - :doc:`Supervised learning `: the algorithm has access to a set of data points, each of them belonging to a class (label) as in a standard classification problem. - Broadly speaking, the goal in this setting is to learn a distance metric + Broadly speaking, the goal in this setting is to learn a metric that puts points with the same label close together while pushing away points with different labels. - :doc:`Weakly supervised learning `: the algorithm has access to a set of data points with supervision only at the tuple level (typically pairs, triplets, or quadruplets of data points). A classic example of such weaker supervision is a set of - positive and negative pairs: in this case, the goal is to learn a distance + positive and negative pairs: in this case, the goal is to learn a metric that puts positive pairs close together and negative pairs far away. Based on the above (weakly) supervised data, the metric learning problem is generally formulated as an optimization problem where one seeks to find the -parameters of a distance function that optimize some objective function +parameters of a function that optimize some objective function measuring the agreement with the training data. .. _mahalanobis_distances: @@ -45,7 +44,7 @@ measuring the agreement with the training data. Mahalanobis Distances ===================== -In the metric-learn package, all algorithms currently implemented learn +In the metric-learn package, most algorithms currently implemented learn so-called Mahalanobis distances. Given a real-valued parameter matrix :math:`L` of shape ``(num_dims, n_features)`` where ``n_features`` is the number features describing the data, the Mahalanobis distance associated with @@ -79,6 +78,35 @@ necessarily the identity of indiscernibles. parameterizations are equivalent. 
In practice, an algorithm may thus solve the metric learning problem with respect to either :math:`M` or :math:`L`. +.. _bilinear_similarity: + +Bilinear Similarities +===================== + +Some algorithms in the package learn bilinear similarity functions. These +similarity functions are not pseudo-distances: they simply output real values +such that the larger the similarity value, the more similar the two examples. +Given a real-valued parameter matrix :math:`W` of shape +``(n_features, n_features)`` where ``n_features`` is the number of features +describing the data, the bilinear similarity associated with :math:`W` is +defined as follows: + +.. math:: S_W(x, x') = x^\top W x' + +The matrix :math:`W` is not required to be positive semi-definite (PSD) or +even symmetric, so the distance properties (nonnegativity, identity of +indiscernibles, symmetry and triangle inequality) do not hold in general. + +This allows some algorithms to optimize :math:`S_W` in an online manner using a +simple and efficient procedure, so they can be applied to problems with +millions of training instances and achieve state-of-the-art performance +on an image search task using :math:`k`-NN. + +The absence of a PSD constraint can enable the design of more efficient +algorithms. It is also relevant in applications where the underlying notion +of similarity does not satisfy the triangle inequality, as is known to be the +case for visual judgments. + .. _use_cases: Use-cases @@ -99,9 +127,9 @@ examples (for code illustrating some of these use-cases, see the elements of a database that are semantically closest to a query element. - Dimensionality reduction: metric learning may be seen as a way to reduce the data dimension in a (weakly) supervised setting. -- More generally, the learned transformation :math:`L` can be used to project - the data into a new embedding space before feeding it into another machine - learning algorithm. 
+- More generally with Mahalanobis distances, the learned transformation :math:`L` + can be used to project the data into a new embedding space before feeding it + into another machine learning algorithm. The API of metric-learn is compatible with `scikit-learn `_, the leading library for machine diff --git a/doc/metric_learn.rst b/doc/metric_learn.rst index 8f91d91c..1398b05a 100644 --- a/doc/metric_learn.rst +++ b/doc/metric_learn.rst @@ -34,6 +34,7 @@ Supervised Learning Algorithms metric_learn.SDML_Supervised metric_learn.RCA_Supervised metric_learn.SCML_Supervised + metric_learn.OASIS_Supervised Weakly Supervised Learning Algorithms ------------------------------------- @@ -47,6 +48,7 @@ Weakly Supervised Learning Algorithms metric_learn.MMC metric_learn.SDML metric_learn.SCML + metric_learn.OASIS Unsupervised Learning Algorithms -------------------------------- diff --git a/doc/supervised.rst b/doc/supervised.rst index e27b58ec..04a3233b 100644 --- a/doc/supervised.rst +++ b/doc/supervised.rst @@ -41,11 +41,13 @@ two numbers. Fit, transform, and so on ------------------------- -The goal of supervised metric-learning algorithms is to transform -points in a new space, in which the distance between two points from the -same class will be small, and the distance between two points from different -classes will be large. To do so, we fit the metric learner (example: -`NCA`). +The goal of supervised metric learning algorithms is to learn a (distance or +similarity) metric such that two points from the same class will be similar +(e.g., have small distance) and points from different classes will be dissimilar +(e.g., have large distance). + +To do so, we first need to fit the supervised metric learner on a labeled dataset, +as in the example below with ``NCA``. >>> from metric_learn import NCA >>> nca = NCA(random_state=42) @@ -53,58 +55,79 @@ classes will be large. 
To do so, we fit the metric learner (example: NCA(init='auto', max_iter=100, n_components=None, preprocessor=None, random_state=42, tol=None, verbose=False) - Now that the estimator is fitted, you can use it on new data for several purposes. -First, you can transform the data in the learned space, using `transform`: -Here we transform two points in the new embedding space. +We can now use the learned metric to **score** new pairs of points with ``pair_score`` +(the larger the score, the more similar the pair). For Mahalanobis learners, +it is equal to the opposite of the distance. ->>> X_new = np.array([[9.4, 4.1], [2.1, 4.4]]) ->>> nca.transform(X_new) -array([[ 5.91884732, 10.25406973], - [ 3.1545886 , 6.80350083]]) +>>> score = nca.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]]) +>>> score +array([-0.49627072, -3.65287282, -6.06079877]) -Also, as explained before, our metric learners has learn a distance between -points. You can use this distance in two main ways: +This is useful because ``pair_score`` matches the **score** semantic of +scikit-learn's `Classification metrics +`_. -- You can either return the distance between pairs of points using the - `pair_distance` function: +For metric learners that learn a distance metric, there is also the ``pair_distance`` +method. >>> nca.pair_distance([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]]) array([0.49627072, 3.65287282, 6.06079877]) -- Or you can return a function that will return the distance (in the new - space) between two 1D arrays (the coordinates of the points in the original - space), similarly to distance functions in `scipy.spatial.distance`. +.. warning:: + + If you try to use ``pair_distance`` with a bilinear similarity learner, an error + will be thrown, as it does not learn a distance. + +You can also return a function that will return the metric learned. 
It can +compute the metric between two 1D arrays, similarly to distance functions in +`scipy.spatial.distance`. To do that, use the ``get_metric`` method. >>> metric_fun = nca.get_metric() >>> metric_fun([3.5, 3.6], [5.6, 2.4]) 0.4962707194621285 -- Alternatively, you can use `pair_score` to return the **score** between - pairs of points (the larger the score, the more similar the pair). - For Mahalanobis learners, it is equal to the opposite of the distance. +You can also call ``get_metric`` with bilinear similarity learners, and you will get +a function that will return the similarity between 1D arrays. ->>> score = nca.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]]) ->>> score -array([-0.49627072, -3.65287282, -6.06079877]) +>>> similarity_fun = algorithm.get_metric() +>>> similarity_fun([3.5, 3.6], [5.6, 2.4]) +-0.04752 -This is useful because `pair_score` matches the **score** semantic of -scikit-learn's `Classification metrics -`_. +Finally, as explained in :ref:`mahalanobis_distances`, these are equivalent to the Euclidean +distance in a transformed space, and can thus be used to transform data points in +a new embedding space. You can use ``transform`` to do so. + +>>> X_new = np.array([[9.4, 4.1], [2.1, 4.4]]) +>>> nca.transform(X_new) +array([[ 5.91884732, 10.25406973], + [ 3.1545886 , 6.80350083]]) + +.. warning:: + + If you try to use ``transform`` with a bilinear similarity learner, an error will + be thrown, as you cannot transform the data using them. .. note:: If the metric learner that you use learns a :ref:`Mahalanobis distance - ` (like it is the case for all algorithms - currently in metric-learn), you can get the plain learned Mahalanobis - matrix using `get_mahalanobis_matrix`. + `, you can get the learned Mahalanobis + matrix :math:`M` using `get_mahalanobis_matrix`. 
>>> nca.get_mahalanobis_matrix() array([[0.43680409, 0.89169412], [0.89169412, 1.9542479 ]]) + If the metric learner that you use learns a :ref:`bilinear similarity + `, you can get the plain learned Bilinear + matrix :math:`W` using `get_bilinear_matrix`. + + >>> algorithm.get_bilinear_matrix() + array([[-0.72680409, -0.153213], + [1.45542269, 7.8135546 ]]) + Scikit-learn compatibility -------------------------- @@ -116,7 +139,7 @@ All supervised algorithms are scikit-learn estimators scikit-learn model selection routines (`sklearn.model_selection.cross_val_score`, `sklearn.model_selection.GridSearchCV`, etc). -You can also use some of the scoring functions from `sklearn.metrics`. +You can also use some scoring functions from `sklearn.metrics`. Algorithms ========== @@ -248,12 +271,12 @@ the sum of probability of being correctly classified: Local Fisher Discriminant Analysis (:py:class:`LFDA `) `LFDA` is a linear supervised dimensionality reduction method which effectively combines the ideas of `Linear Discriminant Analysis ` and Locality-Preserving Projection . It is -particularly useful when dealing with multi-modality, where one ore more classes +particularly useful when dealing with multi-modality, where one or more classes consist of separate clusters in input space. The core optimization problem of LFDA is solved as a generalized eigenvalue problem. -The algorithm define the Fisher local within-/between-class scatter matrix +The algorithm defines the Fisher local within-/between-class scatter matrix :math:`\mathbf{S}^{(w)}/ \mathbf{S}^{(b)}` in a pairwise fashion: .. math:: @@ -408,7 +431,7 @@ method will look at all the samples from a different class and sample randomly a pair among them. The method will try to build `num_constraints` positive pairs and `num_constraints` negative pairs, but sometimes it cannot find enough of one of those, so forcing `same_length=True` will return both times the -minimum of the two lenghts. +minimum of the two lengths. 
For using quadruplets learners (see :ref:`learning_on_quadruplets`) in a supervised way, positive and negative pairs are sampled as above and diff --git a/doc/weakly_supervised.rst b/doc/weakly_supervised.rst index 02ea4ef6..c376bada 100644 --- a/doc/weakly_supervised.rst +++ b/doc/weakly_supervised.rst @@ -79,11 +79,13 @@ the number of features of each point. >>> [-2.16, +0.11, -0.02]]]) # same as tuples[1, 0, :] >>> y = np.array([-1, 1, 1, -1]) -.. warning:: This way of specifying pairs is not recommended for a large number - of tuples, as it is redundant (see the comments in the example) and hence - takes a lot of memory. Indeed each feature vector of a point will be - replicated as many times as a point is involved in a tuple. The second way - to specify pairs is more efficient +.. warning:: + + This way of specifying pairs is not recommended for a large number + of tuples, as it is redundant (see the comments in the example) and hence + takes a lot of memory. Indeed, each feature vector of a point will be + replicated as many times as a point is involved in a tuple. The second way + to specify pairs is more efficient 2D array of indicators + preprocessor @@ -127,9 +129,12 @@ through the argument `preprocessor` (see below :ref:`fit_ws`) Fit, transform, and so on ------------------------- -The goal of weakly-supervised metric-learning algorithms is to transform -points in a new space, in which the tuple-wise constraints between points -are respected. +The goal of weakly supervised metric learning algorithms is to learn a (distance +or similarity) metric such that the tuple-wise constraints between points are +respected. + +To do so, we first need to fit the weakly supervised metric learner on a dataset +of tuples, as in the example below with ``MMC``. 
>>> from metric_learn import MMC >>> mmc = MMC(random_state=42) @@ -142,62 +147,82 @@ Or alternatively (using a preprocessor): >>> from metric_learn import MMC >>> mmc = MMC(preprocessor=X, random_state=42) ->>> mmc.fit(pairs_indice, y) - +>>> mmc.fit(pairs_indices, y) Now that the estimator is fitted, you can use it on new data for several purposes. -First, you can transform the data in the learned space, using `transform`: -Here we transform two points in the new embedding space. +We can now use the learned metric to **score** new pairs of points with ``pair_score`` +(the larger the score, the more similar the pair). For Mahalanobis learners, +it is equal to the opposite of the distance. ->>> X_new = np.array([[9.4, 4.1, 4.2], [2.1, 4.4, 2.3]]) ->>> mmc.transform(X_new) -array([[-3.24667162e+01, 4.62622348e-07, 3.88325421e-08], - [-3.61531114e+01, 4.86778289e-07, 2.12654397e-08]]) +>>> score = mmc.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]]) +>>> score +array([-0.49627072, -3.65287282, -6.06079877]) -Also, as explained before, our metric learner has learned a distance between -points. You can use this distance in two main ways: +This is useful because ``pair_score`` matches the **score** semantic of +scikit-learn's `Classification metrics +`_. -- You can either return the distance between pairs of points using the - `pair_distance` function: +For metric learners that learn a distance metric, there is also the ``pair_distance`` +method. >>> mmc.pair_distance([[[3.5, 3.6, 5.2], [5.6, 2.4, 6.7]], ... [[1.2, 4.2, 7.7], [2.1, 6.4, 0.9]]]) array([7.27607365, 0.88853014]) -- Or you can return a function that will return the distance - (in the new space) between two 1D arrays (the coordinates of the points in - the original space), similarly to distance functions in - `scipy.spatial.distance`. To do that, use the `get_metric` method. +.. 
warning:: + + If you try to use ``pair_distance`` with a bilinear similarity learner, an error + will be thrown, as it does not learn a distance. + +You can also get a function that computes the learned metric. It can +compute the metric between two 1D arrays, similarly to distance functions in +`scipy.spatial.distance`. To do that, use the ``get_metric`` method. >>> metric_fun = mmc.get_metric() >>> metric_fun([3.5, 3.6, 5.2], [5.6, 2.4, 6.7]) 7.276073646278203 -- Alternatively, you can use `pair_score` to return the **score** between - pairs of points (the larger the score, the more similar the pair). - For Mahalanobis learners, it is equal to the opposite of the distance. +You can also call ``get_metric`` with bilinear similarity learners, and you will get +a function that will return the similarity between 1D arrays. ->>> score = mmc.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]]) ->>> score -array([-0.49627072, -3.65287282, -6.06079877]) +>>> similarity_fun = algorithm.get_metric() +>>> similarity_fun([3.5, 3.6], [5.6, 2.4]) +-0.04752 - This is useful because `pair_score` matches the **score** semantic of - scikit-learn's `Classification metrics - `_. +Finally, as explained in :ref:`mahalanobis_distances`, Mahalanobis distances are equivalent to the Euclidean +distance in a transformed space, and can thus be used to transform data points into +a new embedding space. You can use ``transform`` to do so. + +>>> X_new = np.array([[9.4, 4.1, 4.2], [2.1, 4.4, 2.3]]) +>>> mmc.transform(X_new) +array([[-3.24667162e+01, 4.62622348e-07, 3.88325421e-08], + [-3.61531114e+01, 4.86778289e-07, 2.12654397e-08]]) + +.. warning:: + + If you try to use ``transform`` with a bilinear similarity learner, an error will + be thrown, as you cannot transform the data using them. .. 
note:: If the metric learner that you use learns a :ref:`Mahalanobis distance - ` (like it is the case for all algorithms - currently in metric-learn), you can get the plain Mahalanobis matrix using - `get_mahalanobis_matrix`. + `, you can get the plain learned Mahalanobis + matrix :math:`M` using `get_mahalanobis_matrix`. + + >>> mmc.get_mahalanobis_matrix() + array([[ 0.58603894, -5.69883982, -1.66614919], + [-5.69883982, 55.41743549, 16.20219519], + [-1.66614919, 16.20219519, 4.73697721]]) + + If the metric learner that you use learns a :ref:`bilinear similarity + `, you can get the learned Bilinear + matrix :math:`W` using `get_bilinear_matrix`. ->>> mmc.get_mahalanobis_matrix() -array([[ 0.58603894, -5.69883982, -1.66614919], - [-5.69883982, 55.41743549, 16.20219519], - [-1.66614919, 16.20219519, 4.73697721]]) + >>> algorithm.get_bilinear_matrix() + array([[-0.72680409, -0.153213], + [1.45542269, 7.8135546 ]]) .. _sklearn_compat_ws: @@ -451,7 +476,7 @@ Mahalanobis matrix :math:`\mathbf{M}`, and a log-determinant divergence between or :math:`\mathbf{\Omega}^{-1}`, where :math:`\mathbf{\Omega}` is the covariance matrix). -The formulated optimization on the semidefinite matrix :math:`\mathbf{M}` +The formulated optimization on the semi-definite matrix :math:`\mathbf{M}` is convex: .. math:: @@ -612,6 +637,15 @@ one should provide the algorithm with `n_samples` triplets of points. The semantic of each triplet is that the first point should be closer to the second point than to the third one. +If :math:`P` is the set of points, and :math:`p_i, p_i^{+}, p_i^{-} \in P` +are arbitrary points in :math:`P`, then a triplet of the form +:math:`(p_i, p_i^{+}, p_i^{-})` suggests that +:math:`S(p_i, p_i^{+}) > S(p_i, p_i^{-})` for a similarity function :math:`S`, +or equivalently :math:`d(p_i, p_i^{+}) < d(p_i, p_i^{-})` for a pseudo-distance +function :math:`d`. + +Some algorithms will learn :math:`S`, while others will learn :math:`d`. 
+ Fitting ------- Here is an example for fitting on triplets (see :ref:`fit_ws` for more @@ -768,6 +802,86 @@ where :math:`[\cdot]_+` is the hinge loss. `Matlab implementation.`_. +.. _oasis: + +:py:class:`OASIS ` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Online Algorithm for Scalable Image Similarity +(:py:class:`OASIS `) + +`OASIS` learns a bilinear similarity from triplet constraints with an online +Passive-Aggressive (PA) algorithm approach. The bilinear similarity +between :math:`p_1` and :math:`p_2` is defined as :math:`p_{1}^{T} W p_2` +where :math:`W` is the matrix learned by OASIS. This particular algorithm +is fast as it scales linearly with the number of samples. + +The aim is to find a parametric similarity function :math:`S_W` such that all +triplets of the form :math:`(p_i, p_{i}^{+}, p_{i}^{-})` obey +:math:`S_W (p_i, p_{i}^{+}) > S_W (p_i, p_{i}^{-}) + 1`. This means there +must be a margin of :math:`1` when satisfying the triplet constraint. + +Given the loss function: + +.. math:: + + l_W (p_i, p_{i}^{+}, p_{i}^{-}) = \max\{0, 1 - S_W (p_i, p_{i}^{+}) + S_W (p_i, p_{i}^{-})\} + +The goal is to minimize a global loss :math:`L_W` that accumulates hinge +losses over all possible triplets: + +.. math:: + + L_W = \sum_{(p_i, p_{i}^{+}, p_{i}^{-}) \in P} l_W (p_i, p_{i}^{+}, p_{i}^{-}) + +In order to minimize this loss, a Passive-Aggressive algorithm is applied +iteratively over triplets to optimize :math:`W`. First :math:`W` is initialized to +some value :math:`W^0`. Then, at each training iteration :math:`i`, a random triplet +:math:`(p_i, p_{i}^{+}, p_{i}^{-})` is selected, and the following convex +problem with soft margin is solved: + +.. math:: + + W^i = \arg\min_{W} \frac{1}{2} {\lVert W - W^{i-1} \rVert}_{Fro}^{2} + C\xi\\ + \text{s.t.} \quad l_W (p_i, p_{i}^{+}, p_{i}^{-}) \leq \xi \quad \text{and} \quad \xi \geq 0 + +where :math:`{\lVert \cdot \rVert}_{Fro}` is the Frobenius norm +(point-wise :math:`L_2` norm). 
Therefore, at each iteration :math:`i`, :math:`W^i` +is selected to optimize a trade-off between remaining close to the previous +parameters :math:`W^{i-1}` and minimizing the loss on the current triplet +:math:`l_W (p_i, p_{i}^{+}, p_{i}^{-})`. The aggressiveness parameter :math:`C` +controls this trade-off. + +As this algorithm learns a bilinear similarity, the learned matrix :math:`W` +is not guaranteed to be symmetric or positive semi-definite (PSD). So it may +happen that, for a pair of points :math:`x, y \in P` with :math:`x \ne y`, +:math:`S(x, y) \ne S(y,x)` and :math:`S(x,x) \ne 0`. Also notice that :math:`S(x, y) \in \mathbb{R}` +for all :math:`x, y \in P`. + +.. topic:: Example Code: + +:: + + from metric_learn import OASIS + + triplets = [[[1.2, 7.5], [1.3, 1.5], [6.2, 9.7]], + [[1.3, 4.5], [3.2, 4.6], [5.4, 5.4]], + [[3.2, 7.5], [3.3, 1.5], [8.2, 9.7]], + [[3.3, 4.5], [5.2, 4.6], [7.4, 5.4]]] + + oasis = OASIS() + oasis.fit(triplets) + +.. topic:: References: + + .. [1] Chechik, Gal and Sharma, Varun and Shalit, Uri and Bengio, Samy + `Large Scale Online Learning of Image Similarity Through Ranking. + `_. \ + , JMLR 2010. + + .. [2] Adapted from original \ + `Matlab implementation.`_. + .. 
_learning_on_quadruplets: Learning on quadruplets diff --git a/examples/oasis_example.py b/examples/oasis_example.py new file mode 100644 index 00000000..d255071c --- /dev/null +++ b/examples/oasis_example.py @@ -0,0 +1,198 @@ +""" +Bilinear similarity example +============= + +Bilinear similarity example using OASIS algorithm +""" + +from metric_learn import SCML, LMNN, NCA, OASIS, LFDA, MLKR, MMC +from sklearn.datasets import load_iris +from sklearn.utils import check_random_state +from sklearn.model_selection import cross_val_score, train_test_split +import numpy as np +from metric_learn.constraints import Constraints, wrap_pairs +import matplotlib.pyplot as plt +from sklearn.datasets import fetch_lfw_people +from time import time +from sklearn.decomposition import PCA +from sklearn.svm import SVC +from sklearn.model_selection import GridSearchCV +from sklearn.neighbors import KNeighborsClassifier +from sklearn.metrics import classification_report +from sklearn.metrics import confusion_matrix + +SEED = 33 +RNG = check_random_state(SEED) + + +lfw_people = fetch_lfw_people(min_faces_per_person=100, resize=0.5) + +# introspect the images arrays to find the shapes (for plotting) +n_samples, h, w = lfw_people.images.shape + +# for machine learning we use the 2 data directly (as relative pixel +# positions info is ignored by this model) +X = lfw_people.data +n_features = X.shape[1] + +# the label to predict is the id of the person +y = lfw_people.target +target_names = lfw_people.target_names +n_classes = target_names.shape[0] + +print("Total dataset size:") +print("n_samples: %d" % n_samples) +print("n_features: %d" % n_features) +print("n_classes: %d" % n_classes) + +# ############################################################################# +# Split into a training set and a test set using a stratified k fold + +# split into a training and testing set +X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=0.25, random_state=12 +) + +# 
############################################################################# +# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled +# dataset): unsupervised feature extraction / dimensionality reduction +n_components = 50 + +print( + "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0]) +) +t0 = time() +pca = PCA(n_components=n_components, svd_solver="randomized", whiten=True).fit(X_train) +print("done in %0.3fs" % (time() - t0)) + +eigenfaces = pca.components_.reshape((n_components, h, w)) + +print("Projecting the input data on the eigenfaces orthonormal basis") +t0 = time() +X_train_pca = pca.transform(X_train) +X_test_pca = pca.transform(X_test) +print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +# Make triplets +print("Building triplets from supervised dataset") +t0 = time() +constraints = Constraints(y_train) +k_geniuine = 3 +k_impostor = 4 +triplets = constraints.generate_knntriplets(X_train_pca, k_geniuine, k_impostor) +print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +# OASIS: Values to test for c, folds, and estimator +if False: + print("Training OASIS model") + t0 = time() + oasis = OASIS(random_state=33, preprocessor=X_train_pca, c=0.00162) + oasis.fit(triplets) + custom_metric = lambda a, b : - + 1.0 / oasis.get_metric()(a, b) + print(oasis.score(triplets)) + + constraints = Constraints(y_test) + k_geniuine = 10 + k_impostor = 10 + triplets_test = constraints.generate_knntriplets(X_test_pca, k_geniuine, k_impostor) + + print(oasis.score(triplets_test)) + # print(custom_metric(X_train_pca[0], X_train_pca[0])) + # print(oasis.get_metric()(X_train_pca[0], X_train_pca[0])) + # # print(oasis.get_bilinear_matrix().min()) + # print(custom_metric) + + # Tunning OASIS + + # Cs = np.logspace(-8, 1, 20) + # folds = 4 # Cross-validations folds + # clf = 
GridSearchCV(estimator=oasis, + # param_grid=dict(c=Cs), n_jobs=-1, cv=folds, + # verbose=True) + # clf.fit(triplets) + # print(f"Best c: {clf.best_estimator_.c}") + # print(f"Best score: {clf.best_score_}") + # custom_metric = clf.best_estimator_.get_metric() + # print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +if False: + print("Training SCML model") + scml = SCML(random_state=33, preprocessor=X_train_pca) + scml.fit(triplets) + custom_metric = scml.get_metric() + print("done in %0.3fs" % (time() - t0)) + print(scml.score(triplets)) + + constraints = Constraints(y_test) + k_geniuine = 3 + k_impostor = 4 + triplets_test = constraints.generate_knntriplets(X_test_pca, k_geniuine, k_impostor) + print(scml.score(triplets_test)) + +if True: + c = Constraints(y_train) + p = c.positive_negative_pairs(1000) + pairs, label = wrap_pairs(X_train_pca, p) + + mmc = MMC(random_state=22) + mmc.fit(pairs, label) + print(mmc.score(pairs, label)) + + c1 = Constraints(y_test) + p1 = c1.positive_negative_pairs(1000) + pairs1, label1 = wrap_pairs(X_test_pca, p1) + print(mmc.score(pairs1, label1)) + +# ############################################################################# +if False: + print("Training LMNN model") + lmnn = LMNN(random_state=33, preprocessor=X_train_pca) + lmnn.fit(X_train_pca, y_train) + custom_metric = lmnn.get_metric() + print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +if False: + print("Training NCA model") + nca = NCA(random_state=33, preprocessor=X_train_pca, max_iter=1000) + nca.fit(X_train_pca, y_train) + custom_metric = nca.get_metric() + print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +if False: + print("Training MLKR model") + mlkr = MLKR(preprocessor=X_train_pca, random_state=33) + mlkr.fit(X_train_pca, y_train) + custom_metric = 
mlkr.get_metric() + print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +if False: + print("Training LFDA model") + lfda = LFDA(preprocessor=X_train_pca) + lfda.fit(X_train_pca, y_train) + custom_metric = lfda.get_metric() + print("done in %0.3fs" % (time() - t0)) + +# ############################################################################# +# KNN Classifier +# print("Fitting Classifier") +# t0 = time() +# neigh = KNeighborsClassifier(n_neighbors=5, metric=custom_metric, algorithm='brute') +# neigh.fit(X_train_pca, y_train) +# print("done in %0.3fs" % (time() - t0)) + +# # ############################################################################# +# # Quantitative evaluation of the model quality on the test set + +# print("Predicting people's names on the test set") +# t0 = time() +# y_pred = neigh.predict(X_test_pca) +# print("done in %0.3fs" % (time() - t0)) + +# print(classification_report(y_test, y_pred, target_names=target_names)) +# #print(confusion_matrix(y_test, y_pred, labels=range(n_classes))) \ No newline at end of file diff --git a/examples/plot_diff_random_state_oasis.py b/examples/plot_diff_random_state_oasis.py new file mode 100644 index 00000000..e8146c9e --- /dev/null +++ b/examples/plot_diff_random_state_oasis.py @@ -0,0 +1,90 @@ +""" +Importance of random state +============= + +This example shows how important the random state is for +some algorithms such as the online algorithm OASIS. The random +state has a direct impact on the order in which the triplets +are seen by the algorithm. 
+""" + +from metric_learn.oasis import OASIS +from sklearn.datasets import load_iris +from sklearn.utils import check_random_state +from sklearn.model_selection import cross_val_score +import numpy as np +from metric_learn.constraints import Constraints +import matplotlib.pyplot as plt + +SEED = 33 +RNG = check_random_state(SEED) + +# Load Iris +X, y = load_iris(return_X_y=True) + +# Generate triplets +constraints = Constraints(y) +k_geniuine = 3 +k_impostor = 10 +triplets = constraints.generate_knntriplets(X, k_geniuine, k_impostor) +triplets = X[triplets] + +# Values to test for c, folds, and estimator +rs = np.arange(30) +folds = 6 # Cross-validations folds +c = 0.006203576 +oasis = OASIS(c=c, init="random") # M init random + + +def random_theory(plot=True, verbose=True, cv=5): + # Save the cross val results of each c + scores = list() + scores_std = list() + rs_l = list() + i = 0 + for r in rs: + oasis.random_state = check_random_state(r) # Change rs each time + this_scores = cross_val_score(oasis, triplets, n_jobs=-1, cv=cv) + scores.append(np.mean(this_scores)) + scores_std.append(np.std(this_scores)) + rs_l.append(r) + if verbose: + print(f"""Evaluating param # {i} | random_state={r} \ +|score: {np.mean(this_scores)}""") + i = i + 1 + + # Plot the cross_val_scores + if plot: + plt.figure() + plt.plot(rs, scores) + plt.plot(rs, np.array(scores) + np.array(scores_std), 'b--') + plt.plot(rs, np.array(scores) - np.array(scores_std), 'b--') + locs, labels = plt.yticks() + plt.yticks(locs, list(map(lambda x: "%g" % x, locs))) + plt.ylabel(f'OASIS score with c={c}') + plt.xlabel('Random State (For shuffling and M init)') + plt.ylim(0, 1.1) + plt.show() + + max_i = np.argmax(scores) + min_i = np.argmin(scores) + avg = np.average(scores) + avgstd = np.average(scores_std) + return scores[max_i], rs_l[max_i], scores[min_i], rs_l[min_i], avg, avgstd + + +maxs, maxrs, mins, minrs, avg, avgstd = random_theory(cv=folds, + plot=True, + verbose=True) + +msg = f""" +Max Score 
: {maxs}
+Max Score Seed: {maxrs}
+---------------
+Min Score : {mins}
+Min Score Seed: {minrs}
+--------------
+Average Score : {avg}
+Average Std. : {avgstd}
+"""
+print(msg)
diff --git a/examples/plot_grid_serach_oasis.py b/examples/plot_grid_serach_oasis.py
new file mode 100644
index 00000000..80b5155f
--- /dev/null
+++ b/examples/plot_grid_serach_oasis.py
@@ -0,0 +1,111 @@
+"""
+Grid search use case
+====================
+
+Grid search for parameter C in OASIS algorithm
+"""
+
+from metric_learn.oasis import OASIS
+from sklearn.datasets import load_iris
+from sklearn.utils import check_random_state
+from sklearn.model_selection import cross_val_score
+import numpy as np
+from metric_learn.constraints import Constraints
+import matplotlib.pyplot as plt
+from sklearn.model_selection import GridSearchCV
+
+SEED = 33
+RNG = check_random_state(SEED)
+
+# Load Iris
+X, y = load_iris(return_X_y=True)
+
+# Generate triplets
+constraints = Constraints(y)
+k_geniuine = 3
+k_impostor = 10
+triplets = constraints.generate_knntriplets(X, k_geniuine, k_impostor)
+triplets = X[triplets]
+
+# Values to test for c, folds, and estimator
+Cs = np.logspace(-8, 1, 20)
+folds = 6  # Cross-validation folds
+oasis = OASIS(random_state=RNG)
+
+
+def find_best_and_plot(plot=True, verbose=True, cv=5):
+  """
+  Performs a manual grid search of parameter c, then plots
+  the cross validation score for each value of c.
+
+  Returns the best score, and the value of c for that score.
+
+  plot: If True, plots a chart of score vs. value of c.
+  verbose: If True, prints which iteration is being evaluated.
+  cv: Number of cross-validation folds.
+ """ + # Save the cross val results of each c + scores = list() + scores_std = list() + c_list = list() + i = 0 + for c in Cs: + if verbose: + print(f'Evaluating param # {i} | c={c}') + oasis.c = c # Change c each time + this_scores = cross_val_score(oasis, triplets, n_jobs=-1, cv=cv) + scores.append(np.mean(this_scores)) + scores_std.append(np.std(this_scores)) + c_list.append(c) + i = i + 1 + + # Plot the cross_val_scores + if plot: + plt.figure() + plt.semilogx(Cs, scores) + plt.semilogx(Cs, np.array(scores) + np.array(scores_std), 'b--') + plt.semilogx(Cs, np.array(scores) - np.array(scores_std), 'b--') + locs, labels = plt.yticks() + plt.yticks(locs, list(map(lambda x: "%g" % x, locs))) + plt.ylabel('OASIS score') + plt.xlabel('Parameter C') + plt.ylim(0, 1.1) + plt.show() + + return scores[np.argmax(scores)], c_list[np.argmax(scores)] + + +def grid_serach(cv=5, verbose=1): + """ + Performs grid serach using sklearn's GridSearchCV. + verbose: If True will tell in wich iteration it goes. + + Returns the best score, and the value of c for that score. + + cv: Number of cross-validation folds. 
+  verbose: Controls the prints of GridSearchCV
+  """
+  clf = GridSearchCV(estimator=oasis,
+                     param_grid=dict(c=Cs), n_jobs=-1, cv=cv,
+                     verbose=verbose)
+  clf.fit(triplets)
+  return clf.best_score_, clf.best_estimator_.c
+
+
+# Both manual search and GridSearchCV should output the same value
+s1, c1 = find_best_and_plot(plot=True, verbose=True, cv=folds)
+s2, c2 = grid_serach(cv=folds, verbose=1)
+
+results = f"""
+Manual search
+-------------
+Best score: {s1}
+Best c: {c1}
+
+
+GridSearchCV
+------------
+Best score: {s2}
+Best c: {c2}"""
+
+print(results)
diff --git a/metric_learn/__init__.py b/metric_learn/__init__.py
index 92823fb1..6be45f2a 100644
--- a/metric_learn/__init__.py
+++ b/metric_learn/__init__.py
@@ -10,6 +10,7 @@
 from .mlkr import MLKR
 from .mmc import MMC, MMC_Supervised
 from .scml import SCML, SCML_Supervised
+from .oasis import OASIS, OASIS_Supervised
 from ._version import __version__
 
 
@@ -17,4 +18,5 @@
            'LMNN', 'LSML', 'LSML_Supervised', 'SDML', 'SDML_Supervised', 'NCA',
            'LFDA', 'RCA', 'RCA_Supervised', 'MLKR', 'MMC', 'MMC_Supervised', 'SCML',
-           'SCML_Supervised', '__version__']
+           'SCML_Supervised', 'OASIS', 'OASIS_Supervised',
+           '__version__']
diff --git a/metric_learn/_util.py b/metric_learn/_util.py
index 868ececa..cad68f73 100644
--- a/metric_learn/_util.py
+++ b/metric_learn/_util.py
@@ -785,3 +785,153 @@ def _pseudo_inverse_from_eig(w, V, tol=None):
   w[~large] = 0
 
   return np.dot(V * w, np.conjugate(V).T)
+
+
+def _to_index_points(o_triplets):
+  """
+  Takes the original triplets, and returns a mapping of the triplets
+  to an X array that has all unique point values.
+
+  Returns: (mapping_tr, X)
+
+  X: Unique points across all triplets.
+
+  mapping_tr: Output: indices_to_X, X = unique(triplets)
+
+  Triplets-shaped values that represent the indices of X.
+  It is guaranteed that shape(mapping_tr) = shape(o_triplets)[:-1].
+
+  For instance the first element of mapping_tr could be [0, 43, 1].
+  That means the first original triplet is [X[0], X[43], X[1]].
+
+  X[mapping_tr] restores the original input.
+
+  The algorithms are built to work with indices, but to be
+  compliant with the current handling of inputs, the points are
+  converted back to indices by this function. This should be improved
+  in the future.
+  """
+  shape = o_triplets.shape  # (n_triplets, 3, n_features)
+  X, mapping_tr = np.unique(np.vstack(o_triplets), return_inverse=True,
+                            axis=0)
+  mapping_tr = mapping_tr.reshape(shape[:2])  # (n_triplets, 3)
+  return mapping_tr, X
+
+
+def _get_random_indices(n_triplets, n_iter, shuffle=True,
+                        random=False, random_state=None):
+  """
+  Generates n_iter indices in [0, n_triplets).
+
+  If not random:
+
+    If n_iter = n_triplets, then the resulting array will include
+    all values in range(0, n_triplets). If shuffle=True, then this
+    array is shuffled.
+
+    If n_iter > n_triplets, all values in range(0, n_triplets)
+    will be included at least ceil(n_iter / n_triplets) - 1 times.
+    The rest is filled with non-repeated values. If shuffle=True,
+    then the final array is shuffled, otherwise the indices cycle
+    through range(0, n_triplets) in order.
+
+    If n_iter < n_triplets, then a random sampling takes place.
+    The final array does not contain duplicates. If shuffle=True
+    the resulting array is not sorted, but shuffled.
+
+  If random:
+
+    A random sampling is made in any case, generating n_iter values
+    that may include duplicates. The shuffle param has no effect.
+ """ + rng = check_random_state(random_state) + + if n_triplets == 0: + raise ValueError("n_triplets cannot be 0") + if n_iter == 0: + raise ValueError("n_iter cannot be 0") + + if random: + return rng.randint(low=0, high=n_triplets, size=n_iter) + else: + if n_iter < n_triplets: + sample = rng.choice(n_triplets, n_iter, replace=False) + return sample if shuffle else np.sort(sample) + else: + array = np.arange(n_triplets) # Unique triplets included + + if n_iter == n_triplets: + if shuffle: + rng.shuffle(array) + return array + + elif n_iter > n_triplets: + final = np.array([], dtype=int) # Base + for _ in range(int(np.ceil(n_iter / n_triplets))): + if shuffle: + rng.shuffle(array) + final = np.concatenate([final, np.copy(array)]) + final = final[:n_iter] # Get only whats necessary + if shuffle: # An additional shuffle at the end + rng.shuffle(final) + return final + + +def _initialize_similarity_bilinear(input, init='identity', + random_state=None, + strict_pd=False, + matrix_name='matrix'): + n_features = input.shape[-1] + if isinstance(init, np.ndarray): + # we copy the array, so that if we update the metric, we don't want to + # update the init + init = check_array(init, copy=True) + + # Assert that init.shape[1] = (n_features, n_features) + if init.shape != (n_features,) * 2: + raise ValueError('The input dimensionality {} of the given ' + 'similarity matrix `{}` must match the ' + 'dimensionality of the given inputs ({}).' 
+ .format(init.shape, matrix_name, n_features)) + elif init not in ['identity', 'random_spd', 'random', 'covariance']: + raise ValueError( + f"`{matrix_name}` must be 'identity', 'random_spd', 'random', \ + covariance or a numpy array of shape (n_features, n_features).\ + Not `{init}`.") + + rng = check_random_state(random_state) + M = init + if isinstance(M, np.ndarray): + return M + elif init == "identity": + return np.identity(n_features) + elif init == "random": + return rng.rand(n_features, n_features) + elif init == "random_spd": + return make_spd_matrix(n_features, random_state=rng) + elif init == 'covariance': + if input.ndim == 3: + # if the input are tuples, we need to form an X by deduplication + X = np.unique(np.vstack(input), axis=0) + else: + X = input + # atleast2d is necessary to deal with scalar covariance matrices + M_inv = np.atleast_2d(np.cov(X, rowvar=False)) + w, V = eigh(M_inv, check_finite=False) + cov_is_definite = _check_sdp_from_eigen(w) + if strict_pd and not cov_is_definite: + raise LinAlgError("Unable to get a true inverse of the covariance " + "matrix since it is not definite. Try another " + "`{}`, or an algorithm that does not " + "require the `{}` to be strictly positive definite." + .format(*((matrix_name,) * 2))) + elif not cov_is_definite: + warnings.warn('The covariance matrix is not invertible: ' + 'using the pseudo-inverse instead.' 
+                    ' To make the covariance matrix invertible'
+                    ' you can remove any linearly dependent features and/or '
+                    'reduce the dimensionality of your input, '
+                    'for instance using `sklearn.decomposition.PCA` as a '
+                    'preprocessing step.')
+    M = _pseudo_inverse_from_eig(w, V)
+    return M
diff --git a/metric_learn/base_metric.py b/metric_learn/base_metric.py
index 9064c100..3f0fa3ae 100644
--- a/metric_learn/base_metric.py
+++ b/metric_learn/base_metric.py
@@ -242,6 +242,118 @@ def transform(self, X):
     """
 
 
+class BilinearMixin(BaseMetricLearner, metaclass=ABCMeta):
+  r"""Bilinear similarity learning algorithms.
+
+  Algorithm that learns a Bilinear (pseudo) similarity :math:`s_M(x, x')`,
+  defined between two column vectors :math:`x` and :math:`x'` by:
+  :math:`s_M(x, x') = x^T M x'`, where :math:`M` is a learned matrix. This
+  matrix is not guaranteed to be symmetric nor positive semi-definite (PSD).
+  Thus it cannot be seen as learning a linear transformation of the original
+  space like Mahalanobis learning algorithms.
+
+  Attributes
+  ----------
+  components_ : `numpy.ndarray`, shape=(n_features, n_features)
+      The learned bilinear matrix ``M``.
+  """
+
+  def score_pairs(self, pairs):
+    dpr_msg = ("score_pairs will be deprecated in release 0.7.0. "
+               "Use pair_score to compute similarity scores, or "
+               "pair_distances to compute distances.")
+    warnings.warn(dpr_msg, category=FutureWarning)
+    return self.pair_score(pairs)
+
+  def pair_distance(self, pairs):
+    """
+    Raises an error, as bilinear similarity learners don't learn a
+    pseudo-distance nor a distance. In consequence, the additive inverse
+    of the bilinear similarity cannot be used as distance by construction.
+    """
+    # Adjacent string literals are concatenated into one message; a comma
+    # between them would build a tuple instead of a string.
+    msg = ("This learner doesn't learn a distance, thus "
+           "this method is not implemented. Use pair_score instead")
+    raise Exception(msg)
+
+  def pair_score(self, pairs):
+    r"""Returns the learned Bilinear similarity between pairs.
+ + This similarity is defined as: :math:`s_M(x, x') = x^T M x'` + where ``M`` is the learned Bilinear matrix, for every pair of points + ``x`` and ``x'``. + + Parameters + ---------- + pairs : array-like, shape=(n_pairs, 2, n_features) or (n_pairs, 2) + 3D Array of pairs to score, with each row corresponding to two points, + for 2D array of indices of pairs if the similarity learner uses a + preprocessor. + + Returns + ------- + scores : `numpy.ndarray` of shape=(n_pairs,) + The learned Bilinear similarity for every pair. + + See Also + -------- + get_metric : a method that returns a function to compute the similarity + between two points. The difference with `pair_score` is that it + works on two 1D arrays and cannot use a preprocessor. Besides, the + returned function is independent of the similarity learner and hence + is not modified if the similarity learner is. + + :ref:`Bilinear_similarity` : The section of the project documentation + that describes Bilinear similarity. + """ + check_is_fitted(self, ['preprocessor_']) + pairs = check_input(pairs, type_of_inputs='tuples', + preprocessor=self.preprocessor_, + estimator=self, tuple_size=2) + # Note: For bilinear order matters, dist(a,b) != dist(b,a) + # We always choose first pair first, then second pair + # (In contrast with Mahalanobis implementation) + return np.sum(np.dot(pairs[:, 0, :], self.components_) * pairs[:, 1, :], + axis=-1) + + def get_metric(self): + check_is_fitted(self, 'components_') + components = self.components_.copy() + + def similarity_fun(u, v): + """This function computes the similarity between u and v, according to the + previously learned similarity. + + Parameters + ---------- + u : array-like, shape=(n_features,) + The first point involved in the similarity computation. + + v : array-like, shape=(n_features,) + The second point involved in the similarity computation. + + Returns + ------- + similarity : float + The similarity between u and v according to the new similarity. 
+ """ + u = validate_vector(u) + v = validate_vector(v) + return np.dot(np.dot(u.T, components), v) + + return similarity_fun + + def get_bilinear_matrix(self): + """Returns a copy of the Bilinear matrix learned by the similarity learner. + + Returns + ------- + M : `numpy.ndarray`, shape=(n_features, n_features) + The copy of the learned Bilinear matrix. + """ + check_is_fitted(self, 'components_') + return self.components_ + + class MahalanobisMixin(BaseMetricLearner, MetricTransformer, metaclass=ABCMeta): r"""Mahalanobis metric learning algorithms. diff --git a/metric_learn/oasis.py b/metric_learn/oasis.py new file mode 100644 index 00000000..0378aa54 --- /dev/null +++ b/metric_learn/oasis.py @@ -0,0 +1,339 @@ +""" +Online Algorithm for Scalable Image Similarity (OASIS) +""" + +from .base_metric import BilinearMixin, _TripletsClassifierMixin +import numpy as np +from sklearn.utils import check_random_state +from .constraints import Constraints +from ._util import _to_index_points, _get_random_indices, \ + _initialize_similarity_bilinear + + +class _BaseOASIS(BilinearMixin): + def __init__( + self, + preprocessor=None, + n_iter=None, + c=0.0001, + random_state=None, + shuffle=True, + random_sampling=False, + init="identity" + ): + super().__init__(preprocessor=preprocessor) + self.n_iter = n_iter # Max iterations + self.c = c # Trade-off param + self.random_state = random_state + self.shuffle = shuffle # Shuffle the trilplets + self.random_sampling = random_sampling + self.init = init + + def _fit(self, triplets): + """ + Fit OASIS model + + Parameters + ---------- + triplets : (n x 3 x d) array of samples + """ + # Currently prepare_inputs makes triplets contain points and not indices + triplets = self._prepare_inputs(triplets, type_of_inputs='tuples') + triplets, X = _to_index_points(triplets) # Work with indices + + n_triplets = triplets.shape[0] # (n_triplets, 3) + n_iter = n_triplets if self.n_iter is None else self.n_iter + + rng = 
check_random_state(self.random_state) + + M = _initialize_similarity_bilinear(X[triplets], + init=self.init, + strict_pd=False, + random_state=rng) + self.components_ = M + + self.indices_ = _get_random_indices(n_triplets, + n_iter, + shuffle=self.shuffle, + random=self.random_sampling, + random_state=rng) + i = 0 + while i < n_iter: + t = X[triplets[self.indices_[i]]] # t = Current triplet + delta = t[1] - t[2] + loss = 1 - np.dot(np.dot(t[0], self.components_), delta) + if loss > 0: + vi = np.outer(t[0], delta) # V_i matrix + fs = np.linalg.norm(vi, ord='fro') ** 2 # Frobenius norm ** 2 + tau_i = np.minimum(self.c, loss / fs) # Global GD or fit tuple + self.components_ = np.add(self.components_, tau_i * vi) # Update + i = i + 1 + + return self + + def partial_fit(self, new_triplets, n_iter, shuffle=True, + random_sampling=False): + """ + Reuse previous fit, and feed the algorithm with new triplets. + A new n_iter can be set for these new triplets. + + Parameters + ---------- + new_ triplets : (n x 3 x d) array of samples + + n_iter: int (default = n_triplets) + Number of iterations. When n_iter < n_triplets, a random sampling + takes place without repetition, but preserving the original order. + When n_iter = n_triplets, all triplets are included with the + original order. When n_iter > n_triplets, each triplet is included + at least floor(n_iter/n_triplets) times, while some may have one + more apparition at most. The order is preserved as well. + + shuffle: bool (default = True) + Whether the triplets should be shuffled after the sampling process. + If n_iter > n_triplets, then the suffle happends during the sampling + and at the end. + + random_sampling: bool (default = False) + If enabled, the algorithm will sample n_iter triplets from + the input. This sample can contain duplicates. It does not + matter if n_iter is lower, equal or greater than the number + of triplets. The sampling uses uniform distribution. 
+ """ + self.n_iter = n_iter + self.shuffle = shuffle # Shuffle the trilplets + self.random_sampling = random_sampling + self.fit(new_triplets) + + +class OASIS(_BaseOASIS, _TripletsClassifierMixin): + """Online Algorithm for Scalable Image Similarity (OASIS) + + `OASIS` learns a bilinear similarity from triplet constraints with an online + Passive-Agressive (PA) algorithm approach. The bilinear similarity + between :math:`p_1` and :math:`p_2` is defined as :math:`p_{1}^{T} W p_2` + where :math:`W` is the learned matrix by OASIS. This particular algorithm + is fast as it scales linearly with the number of samples. + + Read more in the :ref:`User Guide `. + + .. warning:: + OASIS is still a bit experimental, don't hesitate to report if + something fails/doesn't work as expected. + + Parameters + ---------- + n_iter: int (default = n_triplets) + Number of iterations. When n_iter < n_triplets, a random sampling + takes place without repetition, but preserving the original order. + When n_iter = n_triplets, all triplets are included with the + original order. When n_iter > n_triplets, each triplet is included + at least floor(n_iter/n_triplets) times, while some may have one + more apparition at most. The order is preserved as well. + + shuffle: bool (default = True) + Whether the triplets should be shuffled after the sampling process. + If n_iter > n_triplets, then the suffle happends during the sampling + and at the end. + + random_sampling: bool (default = False) + If enabled, the algorithm will sample n_iter triplets from + the input. This sample can contain duplicates. It does not + matter if n_iter is lower, equal or greater than the number + of triplets. The sampling uses uniform distribution. + + c: float (default = 1e-4) + Passive-agressive param. Controls trade-off bewteen remaining + close to previous W_i-1 or minimizing loss of the current triplet. 
+ + preprocessor : array-like, shape=(n_samples, n_features) or callable + The preprocessor to call to get triplets from indices. If array-like, + triplets will be formed like this: X[indices]. + + random_state : int or numpy.RandomState or None, optional (default=None) + A pseudo random number generator object or a seed for it if int. + + Attributes + ---------- + components_ : `numpy.ndarray`, shape=(n_features, n_features) + The matrix W learned for the bilinear similarity. + + indices : `numpy.ndarray`, shape=(n_iter) + The final order in which the triplets fed the algorithm. It's the list + of indices in respect to the original triplet list given as input. + + Examples + -------- + >>> from metric_learn import OASIS + >>> triplets = [[[1.2, 7.5], [1.3, 1.5], [6.2, 9.7]], + >>> [[1.3, 4.5], [3.2, 4.6], [5.4, 5.4]], + >>> [[3.2, 7.5], [3.3, 1.5], [8.2, 9.7]], + >>> [[3.3, 4.5], [5.2, 4.6], [7.4, 5.4]]] + >>> oasis = OASIS() + >>> oasis.fit(triplets) + + References + ---------- + .. [1] Chechik, Gal and Sharma, Varun and Shalit, Uri and Bengio, Samy + `Large Scale Online Learning of Image Similarity Through Ranking. + `_. \ + , JMLR 2010. + + .. [2] Adapted from original \ + `Matlab implementation.\ + `_. + + See Also + -------- + metric_learn.OASIS_Supervised : The supervised version of the algorithm. + + :ref:`supervised_version` : The section of the project documentation + that describes the supervised version of weakly supervised estimators. + """ + + def __init__(self, preprocessor=None, n_iter=None, c=0.0001, + random_state=None, shuffle=True, random_sampling=False, + init="identity"): + super().__init__(preprocessor=preprocessor, n_iter=n_iter, c=c, + random_state=random_state, shuffle=shuffle, + random_sampling=random_sampling, + init=init) + + def fit(self, triplets): + """Learn the OASIS model. 
+ + Parameters + ---------- + triplets : array-like, shape=(n_constraints, 3, n_features) or \ + (n_constraints, 3) + 3D array-like of triplets of points or 2D array of triplets of + indicators. Triplets are assumed to be ordered such that: + d(triplets[i, 0],triplets[i, 1]) < d(triplets[i, 0], triplets[i, 2]). + + Returns + ------- + self : object + Returns the instance. + """ + return self._fit(triplets) + + +class OASIS_Supervised(_BaseOASIS): + """Online Algorithm for Scalable Image Similarity (OASIS) + + `OASIS_Supervised` creates triplets by taking `k_genuine` neighbours + of the same class and `k_impostor` neighbours from different classes for each + point and then runs the OASIS algorithm on these triplets. + + Read more in the :ref:`User Guide `. + + .. warning:: + OASIS is still a bit experimental, don't hesitate to report if + something fails/doesn't work as expected. + + Parameters + ---------- + n_iter: int (default = n_triplets) + Number of iterations. When n_iter < n_triplets, a random sampling + takes place without repetition, but preserving the original order. + When n_iter = n_triplets, all triplets are included with the + original order. When n_iter > n_triplets, each triplet is included + at least floor(n_iter/n_triplets) times, while some may have one + more apparition at most. The order is preserved as well. + + shuffle: bool (default = True) + Whether the triplets should be shuffled after the sampling process. + If n_iter > n_triplets, then the suffle happends during the sampling + and at the end. + + random_sampling: bool (default = False) + If enabled, the algorithm will sample n_iter triplets from + the input. This sample can contain duplicates. It does not + matter if n_iter is lower, equal or greater than the number + of triplets. The sampling uses uniform distribution. + + c: float (default = 1e-4) + Passive-agressive param. Controls trade-off bewteen remaining + close to previous W_i-1 or minimizing loss of the current triplet. 
+ + preprocessor : array-like, shape=(n_samples, n_features) or callable + The preprocessor to call to get triplets from indices. If array-like, + triplets will be formed like this: X[indices]. + + random_state : int or numpy.RandomState or None, optional (default=None) + A pseudo random number generator object or a seed for it if int. + + Attributes + ---------- + components_ : `numpy.ndarray`, shape=(n_features, n_features) + The matrix W learned for the bilinear similarity. + + indices : `numpy.ndarray`, shape=(n_iter) + The final order in which the triplets fed the algorithm. It's the list + of indices in respect to the original triplet list given as input. + + Examples + -------- + >>> from metric_learn import OASIS_Supervised + >>> from sklearn.datasets import load_iris + >>> iris_data = load_iris() + >>> X = iris_data['data'] + >>> Y = iris_data['target'] + >>> oasis = OASIS_Supervised() + >>> oasis.fit(X, Y) + OASIS_Supervised(n_iter=4500, + random_state=RandomState(MT19937) at 0x7FE1B598FA40) + >>> oasis.pair_score([[X[0], X[1]]]) + array([-21.14242072]) + + References + ---------- + .. [1] Chechik, Gal and Sharma, Varun and Shalit, Uri and Bengio, Samy + `Large Scale Online Learning of Image Similarity Through Ranking. + `_. \ + , JMLR 2010. + + .. [2] Adapted from original \ + `Matlab implementation.\ + `_. + + See Also + -------- + metric_learn.OASIS : The weakly supervised version of this + algorithm. + """ + + def __init__(self, k_genuine=3, k_impostor=10, + preprocessor=None, n_iter=None, c=0.0001, + random_state=None, shuffle=True, random_sampling=False, + init="identity"): + self.k_genuine = k_genuine + self.k_impostor = k_impostor + super().__init__(preprocessor=preprocessor, n_iter=n_iter, c=c, + random_state=random_state, shuffle=shuffle, + random_sampling=random_sampling, + init=init) + + def fit(self, X, y): + """Create constraints from labels and learn the OASIS model. 
+ + Parameters + ---------- + X : (n x d) matrix + Input data, where each row corresponds to a single instance. + + y : (n) array-like + Data labels. + + Returns + ------- + self : object + Returns the instance. + """ + X, y = self._prepare_inputs(X, y, ensure_min_samples=2) + constraints = Constraints(y) + triplets = constraints.generate_knntriplets(X, self.k_genuine, + self.k_impostor) + triplets = X[triplets] + + return self._fit(triplets) diff --git a/metric_learn/scml.py b/metric_learn/scml.py index b86c6fe1..c0afd285 100644 --- a/metric_learn/scml.py +++ b/metric_learn/scml.py @@ -14,6 +14,7 @@ from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.utils import check_array, check_random_state import warnings +from ._util import _to_index_points class _BaseSCML(MahalanobisMixin): @@ -65,7 +66,7 @@ def _fit(self, triplets, basis=None, n_basis=None): # compliant with the current handling of inputs it is converted # back to indices by the following function. This should be improved # in the future. - triplets, X = self._to_index_points(triplets) + triplets, X = _to_index_points(triplets) if basis is None: basis, n_basis = self._initialize_basis(triplets, X) @@ -187,12 +188,6 @@ def _components_from_basis_weights(self, basis, w): else: # if metric is full rank return components_from_metric(np.matmul(basis.T, w.T*basis)) - def _to_index_points(self, triplets): - shape = triplets.shape - X, triplets = np.unique(np.vstack(triplets), return_inverse=True, axis=0) - triplets = triplets.reshape(shape[:2]) - return triplets, X - def _initialize_basis(self, triplets, X): """ Checks if the basis array is well constructed or constructs it based on one of the available options. diff --git a/test/metric_learn_test.py b/test/metric_learn_test.py index fe1560c2..a96d4c79 100644 --- a/test/metric_learn_test.py +++ b/test/metric_learn_test.py @@ -1,3 +1,6 @@ +""" +Tests that are specific for each learner. 
+""" import unittest import re import pytest @@ -12,6 +15,8 @@ from sklearn.exceptions import ConvergenceWarning from sklearn.utils.validation import check_X_y from sklearn.preprocessing import StandardScaler +from sklearn.utils import check_random_state +from test.test_utils import build_triplets try: from inverse_covariance import quic assert(quic) @@ -22,20 +27,26 @@ from metric_learn import (LMNN, NCA, LFDA, Covariance, MLKR, MMC, SCML_Supervised, LSML_Supervised, ITML_Supervised, SDML_Supervised, RCA_Supervised, - MMC_Supervised, SDML, RCA, ITML, SCML) + MMC_Supervised, SDML, RCA, ITML, SCML, + OASIS, OASIS_Supervised) # Import this specially for testing. from metric_learn.constraints import wrap_pairs, Constraints from metric_learn.lmnn import _sum_outer_products -def class_separation(X, labels): - unique_labels, label_inds = np.unique(labels, return_inverse=True) - ratio = 0 - for li in range(len(unique_labels)): - Xc = X[label_inds == li] - Xnc = X[label_inds != li] - ratio += pairwise_distances(Xc).mean() / pairwise_distances(Xc, Xnc).mean() - return ratio / len(unique_labels) +SEED = 33 +RNG = check_random_state(SEED) + + +def class_separation(X, labels, callable_metric='euclidean'): + unique_labels, label_inds = np.unique(labels, return_inverse=True) + ratio = 0 + for li in range(len(unique_labels)): + Xc = X[label_inds == li] + Xnc = X[label_inds != li] + aux = pairwise_distances(Xc, metric=callable_metric).mean() + ratio += aux / pairwise_distances(Xc, Xnc, metric=callable_metric).mean() + return ratio / len(unique_labels) class MetricTestCase(unittest.TestCase): @@ -75,6 +86,167 @@ def test_singular_returns_pseudo_inverse(self): pseudo_inverse) +class TestOASIS(object): + def test_sanity_check(self): + """ + With M=I init. As the algorithm sees more triplets, + the score(triplet) should increse or maintain. + + A warning might show up regarding division by 0. See + test_divide_zero for further research. 
+ """ + triplets = np.array([[[0, 1], [2, 1], [0, 0]], + [[2, 1], [0, 1], [2, 0]], + [[0, 0], [2, 0], [0, 1]], + [[2, 0], [0, 0], [2, 1]], + [[2, 1], [-1, -1], [33, 21]]]) + + # Baseline, no M = Identity + oasis = OASIS(n_iter=1, c=0.24, random_state=RNG, init='identity') + # See 1/5 triplets + oasis.fit(triplets[:1]) + a1 = oasis.score(triplets) + + msg = "divide by zero encountered in double_scalars" + with pytest.warns(RuntimeWarning) as raised_warning: + # See 2/5 triplets + oasis.partial_fit(triplets[1:2], n_iter=2) + a2 = oasis.score(triplets) + + # See 4/5 triplets + oasis.partial_fit(triplets[2:4], n_iter=3) + a3 = oasis.score(triplets) + + # See 5/5 triplets, one is seen again + oasis.partial_fit(triplets[4:5], n_iter=1) + a4 = oasis.score(triplets) + + assert a2 >= a1 + assert a3 >= a2 + assert a4 >= a3 + assert msg == raised_warning[0].message.args[0] + + def test_score_zero(self): + """ + The third triplet will give similarity 0, then the prediction + will be 0. But predict() must give results in {+1, -1}. This + tests forcing prediction 0 to be -1. + """ + triplets = np.array([[[0, 1], [2, 1], [0, 0]], + [[2, 1], [0, 1], [2, 0]], + [[0, 0], [2, 0], [0, 1]], + [[2, 0], [0, 0], [2, 1]]]) + + # Baseline, no M = Identity + with pytest.raises(ValueError): + oasis1 = OASIS(n_iter=0, c=0.24, random_state=RNG) + oasis1.fit(triplets) + predictions = oasis1.predict(triplets) + not_valid = [e for e in predictions if e not in [-1, 1]] + assert len(not_valid) == 0 + + def test_divide_zero(self): + """ + The thrid triplet willl force norm(V_i) to be zero, and + force a division by 0 when calculating tau = loss / norm(V_i). + No error should be experienced. A warning should show up. 
+ """ + triplets = np.array([[[0, 1], [2, 1], [0, 0]], + [[2, 1], [0, 1], [2, 0]], + [[0, 0], [2, 0], [0, 1]], + [[2, 0], [0, 0], [2, 1]]]) + + # Baseline, no M = Identity + oasis1 = OASIS(n_iter=20, c=0.24, random_state=RNG) + msg = "divide by zero encountered in double_scalars" + with pytest.warns(RuntimeWarning) as raised_warning: + oasis1.fit(triplets) + assert msg == raised_warning[0].message.args[0] + + def test_iris_supervised(self): + """ + Test a real use case: Using class separation as evaluation metric, + and the Iris dataset, this tests verifies that points of the same + class are closer now, using the learnt bilinear similarity at OASIS. + + In contrast with Mahalanobis tests, we cant use transform(X) and + then use euclidean metric. Instead, we need to pass pairwise_distances + method from sklearn an explicit callable metric. Then we use + get_metric() for that purpose. + """ + + # Default bilinear similarity uses M = Identity + def bilinear_identity(u, v): + return - np.dot(np.dot(u.T, np.identity(np.shape(u)[0])), v) + + X, y = load_iris(return_X_y=True) + prev = class_separation(X, y, bilinear_identity) + + oasis = OASIS_Supervised(random_state=33, c=0.38) + oasis.fit(X, y) + now = class_separation(X, y, oasis.get_metric()) + assert now < prev # -0.0407866 vs 1.08 ! + + @pytest.mark.parametrize('init', ['random', 'random_spd', + 'covariance', 'identity']) + @pytest.mark.parametrize('random_state', [33, 69, 112]) + def test_random_state_in_suffling(self, init, random_state): + """ + Tests that many instances of OASIS, with the same random_state, + produce the same shuffling on the triplets given. + + Test that many instances of OASIS, with different random_state, + produce different shuffling on the trilpets given. + + The triplets are produced with the Iris dataset. + + Tested with all possible init. 
+ """ + triplets, _, _, _ = build_triplets() + + # Test same random_state, then same shuffling + oasis_a = OASIS(random_state=random_state, init=init) + oasis_a.fit(triplets) + shuffle_a = oasis_a.indices_ + + oasis_b = OASIS(random_state=random_state, init=init) + oasis_b.fit(triplets) + shuffle_b = oasis_b.indices_ + + assert_array_equal(shuffle_a, shuffle_b) + + # Test different random states + last_suffle = shuffle_b + for i in range(3, 5): + oasis_a = OASIS(random_state=random_state+i, init=init) + oasis_a.fit(triplets) + shuffle_a = oasis_a.indices_ + + with pytest.raises(AssertionError): + assert_array_equal(last_suffle, shuffle_a) + + last_suffle = shuffle_a + + @pytest.mark.parametrize('init', ['random', 'random_spd', + 'covariance', 'identity']) + @pytest.mark.parametrize('random_state', [33, 69, 112]) + def test_general_results_random_state(self, init, random_state): + """ + With fixed triplets and random_state, two instances of OASIS + should produce the same output (matrix W) + """ + triplets, _, _, _ = build_triplets() + oasis_a = OASIS(random_state=random_state, init=init) + oasis_a.fit(triplets) + matrix_a = oasis_a.get_bilinear_matrix() + + oasis_b = OASIS(random_state=random_state, init=init) + oasis_b.fit(triplets) + matrix_b = oasis_b.get_bilinear_matrix() + + assert_array_equal(matrix_a, matrix_b) + + class TestSCML(object): @pytest.mark.parametrize('basis', ('lda', 'triplet_diffs')) def test_iris(self, basis): diff --git a/test/test_base_metric.py b/test/test_base_metric.py index baa585b9..d98c3c45 100644 --- a/test/test_base_metric.py +++ b/test/test_base_metric.py @@ -1,12 +1,19 @@ -from numpy.core.numeric import array_equal +""" +Tests general things from the API: String parsing, methods like get_metric, +and deprecation warnings. 
+""" import pytest import re import unittest import metric_learn import numpy as np +from numpy.testing import assert_array_equal +from itertools import product from sklearn import clone from test.test_utils import ids_metric_learners, metric_learners, remove_y from metric_learn.sklearn_shims import set_random_state, SKLEARN_AT_LEAST_0_22 +from metric_learn._util import make_context +from metric_learn.base_metric import MahalanobisMixin, BilinearMixin def remove_spaces(s): @@ -278,25 +285,71 @@ def test_n_components(estimator, build_dataset): @pytest.mark.parametrize('estimator, build_dataset', metric_learners, ids=ids_metric_learners) def test_score_pairs_warning(estimator, build_dataset): - """Tests that score_pairs returns a FutureWarning regarding deprecation. - Also that score_pairs and pair_distance have the same behaviour""" + """Tests that score_pairs returns a FutureWarning regarding + deprecation for all learners""" input_data, labels, _, X = build_dataset() model = clone(estimator) set_random_state(model) - - # We fit the metric learner on it and then we call score_pairs on some - # points model.fit(*remove_y(model, input_data, labels)) msg = ("score_pairs will be deprecated in release 0.7.0. 
" "Use pair_score to compute similarity scores, or " "pair_distances to compute distances.") with pytest.warns(FutureWarning) as raised_warning: - score = model.score_pairs([[X[0], X[1]], ]) - dist = model.pair_distance([[X[0], X[1]], ]) - assert array_equal(score, dist) + _ = model.score_pairs([[X[0], X[1]], ]) assert any([str(warning.message) == msg for warning in raised_warning]) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners, + ids=ids_metric_learners) +def test_pair_score_dim(estimator, build_dataset): + """ + Scoring of 3D arrays should return 1D array (several tuples), + and scoring of 2D arrays (one tuple) should return an error (like + scikit-learn's error when scoring 1D arrays) + """ + input_data, labels, _, X = build_dataset() + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(estimator, input_data, labels)) + tuples = np.array(list(product(X, X))) + assert model.pair_score(tuples).shape == (tuples.shape[0],) + context = make_context(model) + msg = ("3D array of formed tuples expected{}. Found 2D array " + "instead:\ninput={}. Reshape your data and/or use a preprocessor.\n" + .format(context, tuples[1])) + with pytest.raises(ValueError) as raised_error: + model.pair_score(tuples[1]) + assert str(raised_error.value) == msg + + +@pytest.mark.parametrize('estimator, build_dataset', metric_learners, + ids=ids_metric_learners) +def test_deprecated_score_pairs_same_result(estimator, build_dataset): + """ + Test that `pair_distance` gives the same result as `score_pairs` for + Mahalanobis learners, and the same for `pair_score` and `score_pairs` + for Bilinear learners. It also checks that the deprecation warning of + `score_pairs` is being shown. + """ + input_data, labels, _, X = build_dataset() + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(model, input_data, labels)) + random_pairs = np.array(list(product(X, X))) + + msg = ("score_pairs will be deprecated in release 0.7.0. 
" + "Use pair_score to compute similarity scores, or " + "pair_distances to compute distances.") + with pytest.warns(FutureWarning) as raised_warnings: + s1 = model.score_pairs(random_pairs) + if isinstance(model, BilinearMixin): + s2 = model.pair_score(random_pairs) + elif isinstance(model, MahalanobisMixin): + s2 = model.pair_distance(random_pairs) + assert_array_equal(s1, s2) + assert any(str(w.message) == msg for w in raised_warnings) + + if __name__ == '__main__': unittest.main() diff --git a/test/test_bilinear_mixin.py b/test/test_bilinear_mixin.py new file mode 100644 index 00000000..db9b937f --- /dev/null +++ b/test/test_bilinear_mixin.py @@ -0,0 +1,256 @@ +""" +Tests all functionality for Bilinear learners. Correctness, use cases, +warnings, etc. +""" +from itertools import product +from scipy.linalg import eigh +import numpy as np +from numpy.testing import assert_array_almost_equal, assert_array_equal +from numpy.linalg import LinAlgError +import pytest +from metric_learn._util import (_initialize_similarity_bilinear, + _check_sdp_from_eigen) +from sklearn import clone +from sklearn.datasets import make_spd_matrix +from sklearn.utils import check_random_state +from metric_learn.sklearn_shims import set_random_state +from test.test_utils import metric_learners_b, ids_metric_learners_b, \ + remove_y, IdentityBilinearLearner, build_classification, build_triplets + +RNG = check_random_state(0) + + +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_b, + ids=ids_metric_learners_b) +def test_same_similarity_with_two_methods(estimator, build_dataset): + """ + Tests that pair_score() and get_metric() give consistent results. + In both cases, the results must match for the same input. + Tests it for 'n_pairs' sampled from 'n' d-dimensional arrays. 
+ """ + input_data, labels, _, X = build_dataset() + n_samples = 20 + X = X[:n_samples] + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(estimator, input_data, labels)) + random_pairs = np.array(list(product(X, X))) + + dist1 = model.pair_score(random_pairs) + dist2 = [model.get_metric()(p[0], p[1]) for p in random_pairs] + + assert_array_almost_equal(dist1, dist2) + + +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_b, + ids=ids_metric_learners_b) +def test_check_correctness_similarity(estimator, build_dataset): + """ + Tests the correctness of the results of pair_score(), + get_metric() and get_bilinear_matrix(). Results are compared with + the real bilinear similarity calculated in-place. + """ + input_data, labels, _, X = build_dataset() + n_samples = 20 + X = X[:n_samples] + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(estimator, input_data, labels)) + random_pairs = np.array(list(product(X, X))) + + dist1 = model.pair_score(random_pairs) + dist2 = [model.get_metric()(p[0], p[1]) for p in random_pairs] + dist3 = [np.dot(np.dot(p[0].T, model.get_bilinear_matrix()), p[1]) + for p in random_pairs] + desired = [np.dot(np.dot(p[0].T, model.components_), p[1]) + for p in random_pairs] + + assert_array_almost_equal(dist1, desired) # pair_score + assert_array_almost_equal(dist2, desired) # get_metric + assert_array_almost_equal(dist3, desired) # get_bilinear_matrix + + +# This is a `hardcoded` handmade test, to make sure the computation +# made at BilinearMixin is correct. +def test_check_handmade_example(): + """ + Checks that pair_score() result is correct by comparing it with a + handmade example. 
+ """ + u = np.array([0, 1, 2]) + v = np.array([3, 4, 5]) + mixin = IdentityBilinearLearner() + mixin.fit([u, v], [0, 0]) # Identity fit + c = np.array([[2, 4, 6], [6, 4, 2], [1, 2, 3]]) + mixin.components_ = c # Force components_ + dists = mixin.pair_score([[u, v], [v, u]]) + assert_array_almost_equal(dists, [96, 120]) + + +# Note: This test needs to be `hardcoded` as the similarity matrix must +# be symmetric. Running on all Bilinear learners will throw an error as +# the matrix can be non-symmetric. +def test_check_handmade_symmetric_example(): + """ + When the bilinear matrix is the identity, the similarity + between two arrays must be symmetric: S(u,v) = S(v,u). Also + checks the random case: when the matrix is SPD and symmetric. + """ + input_data, labels, _, X = build_classification() + n_samples = 20 + X = X[:n_samples] + model = clone(IdentityBilinearLearner()) # Identity matrix + set_random_state(model) + model.fit(*remove_y(IdentityBilinearLearner(), input_data, labels)) + random_pairs = np.array(list(product(X, X))) + + pairs_reverse = [[p[1], p[0]] for p in random_pairs] + dist1 = model.pair_score(random_pairs) + dist2 = model.pair_score(pairs_reverse) + assert_array_almost_equal(dist1, dist2) + + # Random pairs for M = spd Matrix + spd_matrix = make_spd_matrix(X[0].shape[-1], random_state=RNG) + model.components_ = spd_matrix + dist1 = model.pair_score(random_pairs) + dist2 = model.pair_score(pairs_reverse) + assert_array_almost_equal(dist1, dist2) + + +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_b, + ids=ids_metric_learners_b) +def test_pair_score_finite(estimator, build_dataset): + """ + Checks for 'n' pair_score() of 'd' dimensions, that all + similarities are finite numbers: not NaN, +inf or -inf. + Considers a random M for bilinear similarity. 
+ """ + input_data, labels, _, X = build_dataset() + n_samples = 20 + X = X[:n_samples] + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(estimator, input_data, labels)) + random_pairs = np.array(list(product(X, X))) + dist1 = model.pair_score(random_pairs) + assert np.isfinite(dist1).all() + + +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_b, + ids=ids_metric_learners_b) +def test_check_error_with_pair_distance(estimator, build_dataset): + """ + Check that calling `pair_distance` is not possible with a Bilinear learner. + An Exception must be shown instead. + """ + input_data, labels, _, X = build_dataset() + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(model, input_data, labels)) + random_pairs = np.array(list(product(X, X))) + + msg = ("This learner doesn't learn a distance, thus ", + "this method is not implemented. Use pair_score instead") + with pytest.raises(Exception) as e: + _ = model.pair_distance(random_pairs) + assert e.value.args[0] == msg + + +@pytest.mark.parametrize('init', ['random', 'random_spd', + 'covariance', 'identity']) +@pytest.mark.parametrize('random_state', [6, 42]) +def test_random_state_random_base_M(init, random_state): + """ + Tests that the function _initialize_similarity_bilinear + outputs the same matrix, given the same tuples and random_state + """ + triplets, _, _, _ = build_triplets() + matrix_a = _initialize_similarity_bilinear(triplets, init=init, + random_state=random_state) + matrix_b = _initialize_similarity_bilinear(triplets, init=init, + random_state=random_state) + + assert_array_equal(matrix_a, matrix_b) + + +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_b, + ids=ids_metric_learners_b) +def test_bilinear_init(estimator, build_dataset): + """ + Test the general functionality of _initialize_similarity_bilinear + """ + input_data, labels, _, X = build_dataset() + model = clone(estimator) + set_random_state(model) + d = 
input_data.shape[-1] + + # Test that a custom matrix is accepted as init + my_M = RNG.rand(d, d) + M = _initialize_similarity_bilinear(X, init=my_M, + random_state=RNG) + assert_array_equal(my_M, M) + + # Test that an error is raised if the init is not allowed + msg = "`matrix` must be 'identity', 'random_spd', 'random', \ + covariance or a numpy array of shape (n_features, n_features).\ + Not `random_string`." + with pytest.raises(ValueError) as e: + M = _initialize_similarity_bilinear(X, init="random_string", + random_state=RNG) + assert str(e.value) == msg + + # Test identity init + expected = np.identity(d) + M = _initialize_similarity_bilinear(X, init="identity", + random_state=RNG) + assert_array_equal(M, expected) + + # Test random init + M = _initialize_similarity_bilinear(X, init="random", + random_state=RNG) + assert np.isfinite(M).all() # Check that all values are finite + + # Test random spd init + M = _initialize_similarity_bilinear(X, init="random_spd", + random_state=RNG) + w, V = eigh(M, check_finite=False) + assert _check_sdp_from_eigen(w) # Check strictly positive definite + assert np.isfinite(M).all() + + # Test that (X * Cov^-1).T * X == (X*L).T * (X*L) where Cov^-1 = L.T * L + C_m = np.linalg.inv(np.cov(X, rowvar=False)) + L = np.linalg.cholesky(C_m) + X1 = X[0, :].dot(C_m).T.dot(X[7, :]) # Take 2 points to test + X2 = X[0, :].dot(L).T.dot(X[7, :].dot(L)) + assert_array_almost_equal(X1, X2) + + # Test covariance warning when its not invertible: + # We create a feature that is a linear combination of the first two + # features: + input_data = np.concatenate([input_data, input_data[:, ..., :2].dot([[2], + [3]])], + axis=-1) + model.set_params(init='covariance') + msg = ('The covariance matrix is not invertible: ' + 'using the pseudo-inverse instead.' 
+ 'To make the covariance matrix invertible' + ' you can remove any linearly dependent features and/or ' + 'reduce the dimensionality of your input, ' + 'for instance using `sklearn.decomposition.PCA` as a ' + 'preprocessing step.') + with pytest.warns(UserWarning) as raised_warning: + model.fit(*remove_y(model, input_data, labels)) + assert any([str(warning.message) == msg for warning in raised_warning]) + assert np.isfinite(M).all() + + # Test error raised when strict_pd=True + msg = ("Unable to get a true inverse of the covariance " + "matrix since it is not definite. Try another " + "`matrix`, or an algorithm that does not " + "require the `matrix` to be strictly positive definite.") + with pytest.raises(LinAlgError) as raised_err: + M = _initialize_similarity_bilinear(input_data, init="covariance", + strict_pd=True, + random_state=RNG) + assert str(raised_err.value) == msg + assert np.isfinite(M).all() diff --git a/test/test_components_metric_conversion.py b/test/test_components_metric_conversion.py index 5502ad90..368a2fa2 100644 --- a/test/test_components_metric_conversion.py +++ b/test/test_components_metric_conversion.py @@ -1,3 +1,7 @@ +""" +Tests for Mahalanobis learners, that the transformation matrix (L) squared +is equivalent to the Mahalanobis matrix, even in edge cases. +""" import unittest import numpy as np import pytest diff --git a/test/test_constraints.py b/test/test_constraints.py index def228d4..d9e567aa 100644 --- a/test/test_constraints.py +++ b/test/test_constraints.py @@ -1,3 +1,7 @@ +""" +Tests constraint generation for positive_negative_pairs and knn_triplets. +Also tests warnings. 
+""" import pytest import numpy as np from sklearn.utils import shuffle diff --git a/test/test_fit_transform.py b/test/test_fit_transform.py index d4d4bfe0..63ca421c 100644 --- a/test/test_fit_transform.py +++ b/test/test_fit_transform.py @@ -1,3 +1,7 @@ +""" +For each learner that has `fit` and `transform`, checks that calling them +sequentially is the same as calling fit_transform from scikit-learn. +""" import unittest import numpy as np from sklearn.datasets import load_iris diff --git a/test/test_mahalanobis_mixin.py b/test/test_mahalanobis_mixin.py index e69aa032..f5a38b25 100644 --- a/test/test_mahalanobis_mixin.py +++ b/test/test_mahalanobis_mixin.py @@ -1,3 +1,7 @@ +""" +Tests all functionality for Mahalanobis Learners. Correctness, use cases, +warnings, distance properties, transform, dimensions, init, etc. +""" from itertools import product import pytest @@ -8,7 +12,6 @@ from scipy.spatial.distance import pdist, squareform, mahalanobis from scipy.stats import ortho_group from sklearn import clone -from sklearn.cluster import DBSCAN from sklearn.datasets import make_spd_matrix, make_blobs from sklearn.utils import check_random_state, shuffle from sklearn.utils.multiclass import type_of_target @@ -20,14 +23,16 @@ _PairsClassifierMixin) from metric_learn.exceptions import NonPSDError -from test.test_utils import (ids_metric_learners, metric_learners, - remove_y, ids_classifiers) +from test.test_utils import (ids_metric_learners_m, metric_learners_m, + remove_y, ids_classifiers_m, + pairs_learners_m, ids_pairs_learners_m) +from sklearn.exceptions import NotFittedError RNG = check_random_state(0) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_pair_distance_pair_score_equivalent(estimator, build_dataset): """ For Mahalanobis learners, pair_score should be equivalent to the @@ -46,10 +51,11 @@ def 
test_pair_distance_pair_score_equivalent(estimator, build_dataset): assert_array_equal(distances, -1 * scores) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_pair_distance_pairwise(estimator, build_dataset): - # Computing pairwise scores should return a euclidean distance matrix. + """Computing pairwise scores should return a euclidean distance + matrix.""" input_data, labels, _, X = build_dataset() n_samples = 20 X = X[:n_samples] @@ -70,10 +76,10 @@ def test_pair_distance_pairwise(estimator, build_dataset): assert_array_almost_equal(squareform(pairwise), pdist(model.transform(X))) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_pair_distance_toy_example(estimator, build_dataset): - # Checks that pair_distance works on a toy example + """Checks that `pair_distance` works on a toy example.""" input_data, labels, _, X = build_dataset() n_samples = 20 X = X[:n_samples] @@ -88,10 +94,10 @@ def test_pair_distance_toy_example(estimator, build_dataset): assert_array_almost_equal(model.pair_distance(pairs), distances) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_pair_distance_finite(estimator, build_dataset): - # tests that the score is finite + """Tests that the distance from `pair_distance` is finite""" input_data, labels, _, X = build_dataset() model = clone(estimator) set_random_state(model) @@ -100,28 +106,9 @@ def test_pair_distance_finite(estimator, build_dataset): assert np.isfinite(model.pair_distance(pairs)).all() -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, 
- ids=ids_metric_learners) -def test_pair_distance_dim(estimator, build_dataset): - # scoring of 3D arrays should return 1D array (several tuples), - # and scoring of 2D arrays (one tuple) should return an error (like - # scikit-learn's error when scoring 1D arrays) - input_data, labels, _, X = build_dataset() - model = clone(estimator) - set_random_state(model) - model.fit(*remove_y(estimator, input_data, labels)) - tuples = np.array(list(product(X, X))) - assert model.pair_distance(tuples).shape == (tuples.shape[0],) - context = make_context(estimator) - msg = ("3D array of formed tuples expected{}. Found 2D array " - "instead:\ninput={}. Reshape your data and/or use a preprocessor.\n" - .format(context, tuples[1])) - with pytest.raises(ValueError) as raised_error: - model.pair_distance(tuples[1]) - assert str(raised_error.value) == msg - - def check_is_distance_matrix(pairwise): + """Asserts that the matrix is non-negative, symmetric, has a zero diagonal, + and fulfills the triangle inequality for all pairs""" assert (pairwise >= 0).all() # positivity assert np.array_equal(pairwise, pairwise.T) # symmetry assert (pairwise.diagonal() == 0).all() # identity @@ -131,10 +118,11 @@ def check_is_distance_matrix(pairwise): pairwise[:, np.newaxis, :] + tol).all() -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_embed_toy_example(estimator, build_dataset): - # Checks that embed works on a toy example + """Checks that embed works on a toy example. 
That using `transform` + is equivalent to manually multiplying Lx""" input_data, labels, _, X = build_dataset() n_samples = 20 X = X[:n_samples] @@ -145,10 +133,10 @@ def test_embed_toy_example(estimator, build_dataset): assert_array_almost_equal(model.transform(X), embedded_points) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_embed_dim(estimator, build_dataset): - # Checks that the the dimension of the output space is as expected + """Checks that the dimension of the output space is as expected""" input_data, labels, _, X = build_dataset() model = clone(estimator) set_random_state(model) @@ -174,10 +162,10 @@ def test_embed_dim(estimator, build_dataset): assert str(raised_error.value) == err_msg -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_embed_finite(estimator, build_dataset): - # Checks that embed returns vectors with finite values + """Checks that embed (transform) returns vectors with finite values""" input_data, labels, _, X = build_dataset() model = clone(estimator) set_random_state(model) @@ -185,10 +173,11 @@ def test_embed_finite(estimator, build_dataset): assert np.isfinite(model.transform(X)).all() -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_embed_is_linear(estimator, build_dataset): - # Checks that the embedding is linear + """Checks that the embedding is linear, i.e. 
linear properties of + using `transform`""" input_data, labels, _, X = build_dataset() model = clone(estimator) set_random_state(model) @@ -200,8 +189,8 @@ def test_embed_is_linear(estimator, build_dataset): 5 * model.transform(X[:10])) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_get_metric_equivalent_to_explicit_mahalanobis(estimator, build_dataset): """Tests that using the get_metric method of mahalanobis metric learners is @@ -220,8 +209,8 @@ def test_get_metric_equivalent_to_explicit_mahalanobis(estimator, assert_allclose(metric(a, b), expected_dist, rtol=1e-13) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_get_metric_is_pseudo_metric(estimator, build_dataset): """Tests that the get_metric method of mahalanobis metric learners returns a pseudo-metric (metric but without one side of the equivalence of @@ -247,21 +236,8 @@ def test_get_metric_is_pseudo_metric(estimator, build_dataset): np.isclose(metric(a, c), metric(a, b) + metric(b, c), rtol=1e-20)) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) -def test_get_metric_compatible_with_scikit_learn(estimator, build_dataset): - """Check that the metric returned by get_metric is compatible with - scikit-learn's algorithms using a custom metric, DBSCAN for instance""" - input_data, labels, _, X = build_dataset() - model = clone(estimator) - set_random_state(model) - model.fit(*remove_y(estimator, input_data, labels)) - clustering = DBSCAN(metric=model.get_metric()) - clustering.fit(X) - - -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', 
metric_learners_m, + ids=ids_metric_learners_m) def test_get_squared_metric(estimator, build_dataset): """Test that the squared metric returned is indeed the square of the metric""" @@ -280,8 +256,8 @@ def test_get_squared_metric(estimator, build_dataset): rtol=1e-15) -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_components_is_2D(estimator, build_dataset): """Tests that the transformation matrix of metric learners is 2D""" input_data, labels, _, X = build_dataset() @@ -318,13 +294,13 @@ def test_components_is_2D(estimator, build_dataset): @pytest.mark.parametrize('estimator, build_dataset', [(ml, bd) for idml, (ml, bd) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if hasattr(ml, 'n_components') and hasattr(ml, 'init')], ids=[idml for idml, (ml, _) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if hasattr(ml, 'n_components') and hasattr(ml, 'init')]) def test_init_transformation(estimator, build_dataset): @@ -411,13 +387,13 @@ def test_init_transformation(estimator, build_dataset): @pytest.mark.parametrize('n_components', [3, 5, 7, 11]) @pytest.mark.parametrize('estimator, build_dataset', [(ml, bd) for idml, (ml, bd) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if hasattr(ml, 'n_components') and hasattr(ml, 'init')], ids=[idml for idml, (ml, _) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if hasattr(ml, 'n_components') and hasattr(ml, 'init')]) def test_auto_init_transformation(n_samples, n_features, n_classes, @@ -460,7 +436,7 @@ def test_auto_init_transformation(n_samples, n_features, n_classes, input_data = input_data[:n_samples, ..., :n_features] assert input_data.shape[0] == 
n_samples assert input_data.shape[-1] == n_features - has_classes = model_base.__class__.__name__ in ids_classifiers + has_classes = model_base.__class__.__name__ in ids_classifiers_m if has_classes: labels = np.tile(range(n_classes), n_samples // n_classes + 1)[:n_samples] @@ -481,13 +457,13 @@ def test_auto_init_transformation(n_samples, n_features, n_classes, @pytest.mark.parametrize('estimator, build_dataset', [(ml, bd) for idml, (ml, bd) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if not hasattr(ml, 'n_components') and hasattr(ml, 'init')], ids=[idml for idml, (ml, _) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if not hasattr(ml, 'n_components') and hasattr(ml, 'init')]) def test_init_mahalanobis(estimator, build_dataset): @@ -571,12 +547,12 @@ def test_init_mahalanobis(estimator, build_dataset): @pytest.mark.parametrize('estimator, build_dataset', [(ml, bd) for idml, (ml, bd) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if idml[:4] in ['ITML', 'SDML', 'LSML']], ids=[idml for idml, (ml, _) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if idml[:4] in ['ITML', 'SDML', 'LSML']]) def test_singular_covariance_init_or_prior_strictpd(estimator, build_dataset): """Tests that when using the 'covariance' init or prior, it returns the @@ -615,12 +591,12 @@ def test_singular_covariance_init_or_prior_strictpd(estimator, build_dataset): @pytest.mark.integration @pytest.mark.parametrize('estimator, build_dataset', [(ml, bd) for idml, (ml, bd) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if idml[:3] in ['MMC']], ids=[idml for idml, (ml, _) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if idml[:3] in ['MMC']]) def 
test_singular_covariance_init_of_non_strict_pd(estimator, build_dataset): """Tests that when using the 'covariance' init or prior, it returns the @@ -657,12 +633,12 @@ def test_singular_covariance_init_of_non_strict_pd(estimator, build_dataset): @pytest.mark.integration @pytest.mark.parametrize('estimator, build_dataset', [(ml, bd) for idml, (ml, bd) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if idml[:4] in ['ITML', 'SDML', 'LSML']], ids=[idml for idml, (ml, _) - in zip(ids_metric_learners, - metric_learners) + in zip(ids_metric_learners_m, + metric_learners_m) if idml[:4] in ['ITML', 'SDML', 'LSML']]) @pytest.mark.parametrize('w0', [1e-20, 0., -1e-20]) def test_singular_array_init_or_prior_strictpd(estimator, build_dataset, w0): @@ -731,8 +707,8 @@ def test_singular_array_init_of_non_strict_pd(w0): @pytest.mark.integration -@pytest.mark.parametrize('estimator, build_dataset', metric_learners, - ids=ids_metric_learners) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) def test_deterministic_initialization(estimator, build_dataset): """Test that estimators that have a prior or an init are deterministic when it is set to to random and when the random_state is fixed.""" @@ -750,3 +726,35 @@ def test_deterministic_initialization(estimator, build_dataset): model2 = model2.fit(*remove_y(model, input_data, labels)) np.testing.assert_allclose(model1.get_mahalanobis_matrix(), model2.get_mahalanobis_matrix()) + + +@pytest.mark.parametrize('with_preprocessor', [True, False]) +@pytest.mark.parametrize('estimator, build_dataset', pairs_learners_m, + ids=ids_pairs_learners_m) +def test_raise_not_fitted_error_if_not_fitted(estimator, build_dataset, + with_preprocessor): + """Test that a NotFittedError is raised if someone tries to use + pair_score, pair_distance, score_pairs, get_metric, transform or + get_mahalanobis_matrix on input data and the metric learner + has 
not been fitted.""" + input_data, _, preprocessor, _ = build_dataset(with_preprocessor) + estimator = clone(estimator) + estimator.set_params(preprocessor=preprocessor) + set_random_state(estimator) + with pytest.raises(NotFittedError): # TODO: Remove in 0.8.0 + msg = ("score_pairs will be deprecated in release 0.7.0. " + "Use pair_score to compute similarity scores, or " + "pair_distances to compute distances.") + with pytest.warns(FutureWarning) as raised_warning: + estimator.score_pairs(input_data) + assert any([str(warning.message) == msg for warning in raised_warning]) + with pytest.raises(NotFittedError): + estimator.pair_score(input_data) + with pytest.raises(NotFittedError): + estimator.pair_distance(input_data) + with pytest.raises(NotFittedError): + estimator.get_metric() + with pytest.raises(NotFittedError): + estimator.get_mahalanobis_matrix() + with pytest.raises(NotFittedError): + estimator.transform(input_data) diff --git a/test/test_pairs_classifiers.py b/test/test_pairs_classifiers.py index 6a725f23..2aac2b3d 100644 --- a/test/test_pairs_classifiers.py +++ b/test/test_pairs_classifiers.py @@ -1,3 +1,7 @@ +""" +Tests all functionality for PairClassifiers. Methods, threshold, calibration, +warnings, correctness, use cases, etc. 
+""" from functools import partial import pytest @@ -10,7 +14,8 @@ precision_score) from sklearn.model_selection import train_test_split -from test.test_utils import pairs_learners, ids_pairs_learners +from test.test_utils import pairs_learners, ids_pairs_learners, \ + pairs_learners_m, ids_pairs_learners_m from metric_learn.sklearn_shims import set_random_state from sklearn import clone import numpy as np @@ -40,7 +45,7 @@ def test_predict_only_one_or_minus_one(estimator, build_dataset, ids=ids_pairs_learners) def test_predict_monotonous(estimator, build_dataset, with_preprocessor): - """Test that there is a threshold distance separating points labeled as + """Test that there is a threshold value separating points labeled as similar and points labeled as dissimilar """ input_data, labels, preprocessor, _ = build_dataset(with_preprocessor) estimator = clone(estimator) @@ -65,32 +70,22 @@ def test_predict_monotonous(estimator, build_dataset, def test_raise_not_fitted_error_if_not_fitted(estimator, build_dataset, with_preprocessor): """Test that a NotFittedError is raised if someone tries to use - pair_score, score_pairs, decision_function, get_metric, transform or - get_mahalanobis_matrix on input data and the metric learner - has not been fitted.""" + decision_function, calibrate_threshold, set_threshold, predict + on input data and the metric learner has not been fitted.""" input_data, labels, preprocessor, _ = build_dataset(with_preprocessor) estimator = clone(estimator) estimator.set_params(preprocessor=preprocessor) set_random_state(estimator) - with pytest.raises(NotFittedError): # Remove in 0.8.0 - estimator.score_pairs(input_data) with pytest.raises(NotFittedError): - estimator.pair_score(input_data) + estimator.predict(input_data) with pytest.raises(NotFittedError): estimator.decision_function(input_data) with pytest.raises(NotFittedError): - estimator.get_metric() - with pytest.raises(NotFittedError): - estimator.transform(input_data) - with 
pytest.raises(NotFittedError): - estimator.get_mahalanobis_matrix() - with pytest.raises(NotFittedError): - estimator.calibrate_threshold(input_data, labels) - + estimator.score(input_data, labels) with pytest.raises(NotFittedError): estimator.set_threshold(0.5) with pytest.raises(NotFittedError): - estimator.predict(input_data) + estimator.calibrate_threshold(input_data, labels) @pytest.mark.parametrize('calibration_params', @@ -130,7 +125,7 @@ def test_fit_with_valid_threshold_params(estimator, build_dataset, ids=ids_pairs_learners) def test_threshold_different_scores_is_finite(estimator, build_dataset, with_preprocessor, kwargs): - # test that calibrating the threshold works for every metric learner + """Test that calibrating the threshold works for every metric learner""" input_data, labels, preprocessor, _ = build_dataset(with_preprocessor) estimator = clone(estimator) estimator.set_params(preprocessor=preprocessor) @@ -171,7 +166,7 @@ def test_unset_threshold(): def test_set_threshold(): - # test that set_threshold indeed sets the threshold + """Test that set_threshold indeed sets the threshold""" identity_pairs_classifier = IdentityPairsClassifier() pairs = np.array([[[0.], [1.]], [[1.], [3.]], [[2.], [5.]], [[3.], [7.]]]) y = np.array([1, 1, -1, -1]) @@ -200,8 +195,8 @@ def test_set_wrong_type_threshold(value): def test_f_beta_1_is_f_1(): - # test that putting beta to 1 indeed finds the best threshold to optimize - # the f1_score + """Test that putting beta to 1 indeed finds the best threshold to optimize + the f1_score""" rng = np.random.RandomState(42) n_samples = 100 pairs, y = rng.randn(n_samples, 2, 5), rng.choice([-1, 1], size=n_samples) @@ -266,8 +261,8 @@ def tnr_threshold(y_true, y_pred, tpr_threshold=0.): for t in [0., 0.1, 0.5, 0.8, 1.]], ) def test_found_score_is_best_score(kwargs, scoring): - # test that when we use calibrate threshold, it will indeed be the - # threshold that have the best score + """Test that when we use calibrate threshold, 
it will indeed be the + threshold that have the best score""" rng = np.random.RandomState(42) n_samples = 50 pairs, y = rng.randn(n_samples, 2, 5), rng.choice([-1, 1], size=n_samples) @@ -305,11 +300,11 @@ def test_found_score_is_best_score(kwargs, scoring): for t in [0., 0.1, 0.5, 0.8, 1.]] ) def test_found_score_is_best_score_duplicates(kwargs, scoring): - # test that when we use calibrate threshold, it will indeed be the - # threshold that have the best score. It's the same as the previous test - # except this time we test that the scores are coherent even if there are - # duplicates (i.e. points that have the same score returned by - # `decision_function`). + """Test that when we use calibrate threshold, it will indeed be the + threshold that have the best score. It's the same as the previous test + except this time we test that the scores are coherent even if there are + duplicates (i.e. points that have the same score returned by + `decision_function`).""" rng = np.random.RandomState(42) n_samples = 50 pairs, y = rng.randn(n_samples, 2, 5), rng.choice([-1, 1], size=n_samples) @@ -353,8 +348,8 @@ def test_found_score_is_best_score_duplicates(kwargs, scoring): ) def test_calibrate_threshold_invalid_parameters_right_error(invalid_args, expected_msg): - # test that the right error message is returned if invalid arguments are - # given to calibrate_threshold + """Test that the right error message is returned if invalid arguments are + given to `calibrate_threshold`""" rng = np.random.RandomState(42) pairs, y = rng.randn(20, 2, 5), rng.choice([-1, 1], size=20) pairs_learner = IdentityPairsClassifier() @@ -377,8 +372,8 @@ def test_calibrate_threshold_invalid_parameters_right_error(invalid_args, # to do that) ) def test_calibrate_threshold_valid_parameters(valid_args): - # test that no warning message is returned if valid arguments are given to - # calibrate threshold + """Test that no warning message is returned if valid arguments are given to + `calibrate 
threshold`""" rng = np.random.RandomState(42) pairs, y = rng.randn(20, 2, 5), rng.choice([-1, 1], size=20) pairs_learner = IdentityPairsClassifier() @@ -390,8 +385,7 @@ def test_calibrate_threshold_valid_parameters(valid_args): def test_calibrate_threshold_extreme(): """Test that in the (rare) case where we should accept all points or - reject all points, this is effectively what - is done""" + reject all points, this is effectively what is done""" class MockBadPairsClassifier(MahalanobisMixin, _PairsClassifierMixin): """A pairs classifier that returns bad scores (i.e. in the inverse order @@ -489,9 +483,9 @@ def decision_function(self, pairs): ) def test_validate_calibration_params_invalid_parameters_right_error( estimator, _, invalid_args, expected_msg): - # test that the right error message is returned if invalid arguments are - # given to _validate_calibration_params, for all pairs metric learners as - # well as a mocking general identity pairs classifier and the class itself + """Test that the right error message is returned if invalid arguments are + given to `_validate_calibration_params`, for all pairs metric learners as + well as a mocking general identity pairs classifier and the class itself""" with pytest.raises(ValueError) as raised_error: estimator._validate_calibration_params(**invalid_args) assert str(raised_error.value) == expected_msg @@ -515,9 +509,9 @@ def test_validate_calibration_params_invalid_parameters_right_error( ) def test_validate_calibration_params_valid_parameters( estimator, _, valid_args): - # test that no warning message is returned if valid arguments are given to - # _validate_calibration_params for all pairs metric learners, as well as - # a mocking example, and the class itself + """Test that no warning message is returned if valid arguments are given to + `_validate_calibration_params` for all pairs metric learners, as well as + a mocking example, and the class itself""" with pytest.warns(None) as record: 
estimator._validate_calibration_params(**valid_args) assert len(record) == 0 @@ -528,7 +522,7 @@ def test_validate_calibration_params_valid_parameters( ids=ids_pairs_learners) def test_validate_calibration_params_invalid_parameters_error_before__fit( estimator, build_dataset): - """For all pairs metric learners (which currently all have a _fit method), + """For all pairs metric learners (which currently all have a `_fit` method), make sure that calibration parameters are validated before fitting""" estimator = clone(estimator) input_data, labels, _, _ = build_dataset() @@ -545,11 +539,12 @@ def breaking_fun(**args): # a function that fails so that we will miss assert str(raised_error.value) == expected_msg -@pytest.mark.parametrize('estimator, build_dataset', pairs_learners, - ids=ids_pairs_learners) +@pytest.mark.parametrize('estimator, build_dataset', pairs_learners_m, + ids=ids_pairs_learners_m) def test_accuracy_toy_example(estimator, build_dataset): """Test that the accuracy works on some toy example (hence that the - prediction is OK)""" + prediction is OK). This test is designed for Mahalanobis learners only, + as the toy example uses the notion of distance.""" input_data, labels, preprocessor, X = build_dataset(with_preprocessor=False) estimator = clone(estimator) estimator.set_params(preprocessor=preprocessor) diff --git a/test/test_quadruplets_classifiers.py b/test/test_quadruplets_classifiers.py index a8319961..65aa9538 100644 --- a/test/test_quadruplets_classifiers.py +++ b/test/test_quadruplets_classifiers.py @@ -1,8 +1,13 @@ +""" +Tests all functionality for QuadrupletsClassifiers. Methods, warnings, +correctness, use cases, etc. 
+""" import pytest from sklearn.exceptions import NotFittedError from sklearn.model_selection import train_test_split -from test.test_utils import quadruplets_learners, ids_quadruplets_learners +from test.test_utils import quadruplets_learners, ids_quadruplets_learners, \ + quadruplets_learners_m, ids_quadruplets_learners_m from metric_learn.sklearn_shims import set_random_state from sklearn import clone import numpy as np @@ -31,21 +36,27 @@ def test_predict_only_one_or_minus_one(estimator, build_dataset, ids=ids_quadruplets_learners) def test_raise_not_fitted_error_if_not_fitted(estimator, build_dataset, with_preprocessor): - """Test that a NotFittedError is raised if someone tries to predict and - the metric learner has not been fitted.""" + """Test that a NotFittedError is raised if someone tries to use the + methods: predict, decision_function and score when the metric learner + has not been fitted.""" input_data, labels, preprocessor, _ = build_dataset(with_preprocessor) estimator = clone(estimator) estimator.set_params(preprocessor=preprocessor) set_random_state(estimator) with pytest.raises(NotFittedError): estimator.predict(input_data) + with pytest.raises(NotFittedError): + estimator.decision_function(input_data) + with pytest.raises(NotFittedError): + estimator.score(input_data) -@pytest.mark.parametrize('estimator, build_dataset', quadruplets_learners, - ids=ids_quadruplets_learners) +@pytest.mark.parametrize('estimator, build_dataset', quadruplets_learners_m, + ids=ids_quadruplets_learners_m) def test_accuracy_toy_example(estimator, build_dataset): """Test that the default scoring for quadruplets (accuracy) works on some - toy example""" + toy example. 
This test is designed for Mahalanobis learners only, + as the toy example uses the notion of distance.""" input_data, labels, preprocessor, X = build_dataset(with_preprocessor=False) estimator = clone(estimator) estimator.set_params(preprocessor=preprocessor) diff --git a/test/test_sklearn_compat.py b/test/test_sklearn_compat.py index d2369b1c..a8cf8a67 100644 --- a/test/test_sklearn_compat.py +++ b/test/test_sklearn_compat.py @@ -12,10 +12,12 @@ MMC_Supervised, RCA_Supervised, SDML_Supervised, SCML_Supervised) from sklearn import clone +from sklearn.cluster import DBSCAN import numpy as np from sklearn.model_selection import (cross_val_score, cross_val_predict, train_test_split, KFold) from test.test_utils import (metric_learners, ids_metric_learners, + metric_learners_m, ids_metric_learners_m, mock_preprocessor, tuples_learners, ids_tuples_learners, pairs_learners, ids_pairs_learners, remove_y, @@ -110,6 +112,54 @@ def generate_array_like(input_data, labels=None): return input_data_changed, labels_changed +# TODO: Find a better way to run this test and the next one, to avoid +# duplicated code. +@pytest.mark.parametrize('with_preprocessor', [True, False]) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners_m, + ids=ids_metric_learners_m) +def test_array_like_inputs_mahalanobis(estimator, build_dataset, + with_preprocessor): + """Test that metric-learners can have as input any array-like object. 
+ This in particular tests `transform` and `pair_distance` for Mahalanobis + learners.""" + input_data, labels, preprocessor, X = build_dataset(with_preprocessor) + # we subsample the data for the test to be more efficient + input_data, _, labels, _ = train_test_split(input_data, labels, + train_size=40, + random_state=42) + X = X[:10] + + estimator = clone(estimator) + estimator.set_params(preprocessor=preprocessor) + set_random_state(estimator) + input_variants, label_variants = generate_array_like(input_data, labels) + for input_variant in input_variants: + for label_variant in label_variants: + estimator.fit(*remove_y(estimator, input_variant, label_variant)) + if hasattr(estimator, "predict"): + estimator.predict(input_variant) + if hasattr(estimator, "predict_proba"): + estimator.predict_proba(input_variant) # anticipation in case some + # time we have that, or if ppl want to contribute with new algorithms + # it will be checked automatically + if hasattr(estimator, "decision_function"): + estimator.decision_function(input_variant) + if hasattr(estimator, "score"): + for label_variant in label_variants: + estimator.score(*remove_y(estimator, input_variant, label_variant)) + + # Transform + X_variants, _ = generate_array_like(X) + for X_variant in X_variants: + estimator.transform(X_variant) + + # Pair distance + pairs = np.array([[X[0], X[1]], [X[0], X[2]]]) + pairs_variants, _ = generate_array_like(pairs) + for pairs_variant in pairs_variants: + estimator.pair_distance(pairs_variant) + + @pytest.mark.integration @pytest.mark.parametrize('with_preprocessor', [True, False]) @pytest.mark.parametrize('estimator, build_dataset', metric_learners, @@ -144,25 +194,12 @@ def test_array_like_inputs(estimator, build_dataset, with_preprocessor): for label_variant in label_variants: estimator.score(*remove_y(estimator, input_variant, label_variant)) - X_variants, _ = generate_array_like(X) - for X_variant in X_variants: - estimator.transform(X_variant) - pairs = 
np.array([[X[0], X[1]], [X[0], X[2]]]) pairs_variants, _ = generate_array_like(pairs) - not_implemented_msg = "" - # Todo in 0.7.0: Change 'not_implemented_msg' for the message that says - # "This learner does not have pair_distance" - + # Pair score for pairs_variant in pairs_variants: - estimator.pair_score(pairs_variant) # All learners have pair_score - - # But not all of them will have pair_distance - try: - estimator.pair_distance(pairs_variant) - except Exception as raised_exception: - assert raised_exception.value.args[0] == not_implemented_msg + estimator.pair_score(pairs_variant) @pytest.mark.parametrize('with_preprocessor', [True, False]) @@ -461,5 +498,18 @@ def test_dont_overwrite_parameters(estimator, build_dataset, " %s changed" % ', '.join(attrs_changed_by_fit)) +@pytest.mark.parametrize('estimator, build_dataset', metric_learners, + ids=ids_metric_learners) +def test_get_metric_compatible_with_scikit_learn(estimator, build_dataset): + """Check that the metric returned by get_metric is compatible with + scikit-learn's algorithms using a custom metric, DBSCAN for instance""" + input_data, labels, _, X = build_dataset() + model = clone(estimator) + set_random_state(model) + model.fit(*remove_y(estimator, input_data, labels)) + clustering = DBSCAN(metric=model.get_metric()) + clustering.fit(X) + + if __name__ == '__main__': unittest.main() diff --git a/test/test_triplets_classifiers.py b/test/test_triplets_classifiers.py index f2d5c015..ce9bee68 100644 --- a/test/test_triplets_classifiers.py +++ b/test/test_triplets_classifiers.py @@ -1,8 +1,13 @@ +""" +Tests all functionality for TripletsClassifiers. Methods, warnings, +correctness, use cases, etc. 
+""" import pytest from sklearn.exceptions import NotFittedError from sklearn.model_selection import train_test_split -from test.test_utils import triplets_learners, ids_triplets_learners +from test.test_utils import triplets_learners, ids_triplets_learners, \ + triplets_learners_m, ids_triplets_learners_m from metric_learn.sklearn_shims import set_random_state from sklearn import clone import numpy as np @@ -27,13 +32,14 @@ def test_predict_only_one_or_minus_one(estimator, build_dataset, assert len(not_valid) == 0 -@pytest.mark.parametrize('estimator, build_dataset', triplets_learners, - ids=ids_triplets_learners) +@pytest.mark.parametrize('estimator, build_dataset', triplets_learners_m, + ids=ids_triplets_learners_m) def test_no_zero_prediction(estimator, build_dataset): """ Test that all predicted values are not zero, even when the distance d(x,y) and d(x,z) is the same for a triplet of the - form (x, y, z). i.e border cases. + form (x, y, z). i.e border cases for Mahalanobis distance + learners. """ triplets, _, _, X = build_dataset(with_preprocessor=False) # Force 3 dimentions only, to use cross product and get easy orthogonal vec. @@ -56,7 +62,7 @@ def test_no_zero_prediction(estimator, build_dataset): assert_array_equal(X[1], x) with pytest.raises(AssertionError): assert_array_equal(X[1], y) - # Assert the distance is the same for both + # Assert the distance is the same for both -> Wont work for b. 
similarity assert estimator.get_metric()(X[1], x) == estimator.get_metric()(X[1], y) # Form the three scenarios where predict() gives 0 with numpy.sign @@ -75,21 +81,27 @@ def test_no_zero_prediction(estimator, build_dataset): ids=ids_triplets_learners) def test_raise_not_fitted_error_if_not_fitted(estimator, build_dataset, with_preprocessor): - """Test that a NotFittedError is raised if someone tries to predict and - the metric learner has not been fitted.""" + """Test that a NotFittedError is raised if someone tries to use the + methods: predict, decision_function and score when the metric learner + has not been fitted.""" input_data, _, preprocessor, _ = build_dataset(with_preprocessor) estimator = clone(estimator) estimator.set_params(preprocessor=preprocessor) set_random_state(estimator) with pytest.raises(NotFittedError): estimator.predict(input_data) + with pytest.raises(NotFittedError): + estimator.decision_function(input_data) + with pytest.raises(NotFittedError): + estimator.score(input_data) -@pytest.mark.parametrize('estimator, build_dataset', triplets_learners, - ids=ids_triplets_learners) +@pytest.mark.parametrize('estimator, build_dataset', triplets_learners_m, + ids=ids_triplets_learners_m) def test_accuracy_toy_example(estimator, build_dataset): """Test that the default scoring for triplets (accuracy) works on some - toy example""" + toy example. This test is designed for Mahalanobis learners only, + as the toy example uses the notion of distance.""" triplets, _, _, X = build_dataset(with_preprocessor=False) estimator = clone(estimator) set_random_state(estimator) diff --git a/test/test_utils.py b/test/test_utils.py index f3000344..536b3bf4 100644 --- a/test/test_utils.py +++ b/test/test_utils.py @@ -1,8 +1,13 @@ +""" +Tests preprocessor, warnings, errors. Also has util functions to build datasets +in a general way for each learner. Here is also the list of learners of each +kind that are used as parameters in tests in other files. 
Util functions. +""" import pytest from scipy.linalg import eigh, pinvh from collections import namedtuple import numpy as np -from numpy.testing import assert_array_equal, assert_equal +from numpy.testing import assert_array_equal, assert_equal, assert_raises from sklearn.model_selection import train_test_split from sklearn.utils import check_random_state, shuffle from metric_learn.sklearn_shims import set_random_state @@ -12,12 +17,16 @@ check_collapsed_pairs, validate_vector, _check_sdp_from_eigen, _check_n_components, check_y_valid_values_for_pairs, - _auto_select_init, _pseudo_inverse_from_eig) + _auto_select_init, _pseudo_inverse_from_eig, + _get_random_indices, + _initialize_similarity_bilinear) from metric_learn import (ITML, LSML, MMC, RCA, SDML, Covariance, LFDA, LMNN, MLKR, NCA, ITML_Supervised, LSML_Supervised, MMC_Supervised, RCA_Supervised, SDML_Supervised, - SCML, SCML_Supervised, Constraints) + SCML, SCML_Supervised, OASIS, OASIS_Supervised, + Constraints) from metric_learn.base_metric import (ArrayIndexer, MahalanobisMixin, + BilinearMixin, _PairsClassifierMixin, _TripletsClassifierMixin, _QuadrupletsClassifierMixin) @@ -28,6 +37,120 @@ SEED = 42 RNG = check_random_state(SEED) + +# -------------------- Mock classes for testing ------------------------ + + +class RandomBilinearLearner(BilinearMixin): + """A simple Random bilinear mixin that returns a random matrix + M as learned. Class for testing purposes. + """ + def __init__(self, init='random', preprocessor=None, random_state=33): + super().__init__(preprocessor=preprocessor) + self.init = init + self.random_state = random_state + + def fit(self, X, y): + """ + Checks input's format. A random (d,d) matrix is set. 
+ """ + X, y = self._prepare_inputs(X, y, ensure_min_samples=2) + self.d_ = np.shape(X[0])[-1] + M = _initialize_similarity_bilinear(X, + init=self.init, + strict_pd=False, + random_state=self.random_state) + self.components_ = M + return self + + +class IdentityBilinearLearner(BilinearMixin): + """A simple Identity bilinear mixin that returns an identity matrix + M as learned. Class for testing purposes. + """ + def __init__(self, init='identity', preprocessor=None, random_state=33): + super().__init__(preprocessor=preprocessor) + self.init = init + self.random_state = random_state + + def fit(self, X, y): + """ + Checks input's format. Sets M matrix to identity of shape (d,d) + where d is the dimension of the input. + """ + X, y = self._prepare_inputs(X, y, ensure_min_samples=2) + self.d_ = np.shape(X[0])[-1] + M = _initialize_similarity_bilinear(X, + init=self.init, + strict_pd=False, + random_state=self.random_state) + self.components_ = M + return self + + +class MockPairIdentityBilinearLearner(BilinearMixin, + _PairsClassifierMixin): + + def __init__(self, init='identity', preprocessor=None, random_state=33): + super().__init__(preprocessor=preprocessor) + self.init = init + self.random_state = random_state + + def fit(self, pairs, y, calibration_params=None): + calibration_params = (calibration_params if calibration_params is not + None else dict()) + self._validate_calibration_params(**calibration_params) + pairs = self._prepare_inputs(pairs, type_of_inputs='tuples') + self.d_ = np.shape(pairs[0][0])[-1] + M = _initialize_similarity_bilinear(pairs, + init=self.init, + strict_pd=False, + random_state=self.random_state) + self.components_ = M + self.calibrate_threshold(pairs, y, **calibration_params) + return self + + +class MockTripletsIdentityBilinearLearner(BilinearMixin, + _TripletsClassifierMixin): + + def __init__(self, init='identity', preprocessor=None, random_state=33): + super().__init__(preprocessor=preprocessor) + self.init = init + 
self.random_state = random_state + + def fit(self, triplets): + triplets = self._prepare_inputs(triplets, type_of_inputs='tuples') + self.d_ = np.shape(triplets[0][0])[-1] + M = _initialize_similarity_bilinear(triplets, + init=self.init, + strict_pd=False, + random_state=self.random_state) + self.components_ = M + return self + + +class MockQuadrpletsIdentityBilinearLearner(BilinearMixin, + _QuadrupletsClassifierMixin): + + def __init__(self, init='identity', preprocessor=None, random_state=33): + super().__init__(preprocessor=preprocessor) + self.init = init + self.random_state = random_state + + def fit(self, quadruplets): + quadruplets = self._prepare_inputs(quadruplets, type_of_inputs='tuples') + self.d_ = np.shape(quadruplets[0][0])[-1] + M = _initialize_similarity_bilinear(quadruplets, + init=self.init, + strict_pd=False, + random_state=self.random_state) + self.components_ = M + return self + + +# ------------------ Building dummy data for learners ------------------ + Dataset = namedtuple('Dataset', ('data target preprocessor to_transform')) # Data and target are what we will fit on. Preprocessor is the additional # data if we use a preprocessor (which should be the default ArrayIndexer), @@ -35,7 +158,16 @@ def build_classification(with_preprocessor=False): - """Basic array for testing when using a preprocessor""" + """ + Basic array 'X, y' for testing when using a preprocessor, for instance, + for clustering. For supervised learners. + + If no preprocessor: 'data' are raw points, 'target' are dummy labels, + 'preprocessor' is None, and 'to_transform' are points. + + If preprocessor: 'data' are point indices, 'target' are dummy labels, + 'preprocessor' are unique points, 'to_transform' are points. 
+ """ X, y = shuffle(*make_blobs(random_state=SEED), random_state=SEED) indices = shuffle(np.arange(X.shape[0]), random_state=SEED).astype(int) @@ -46,7 +178,16 @@ def build_classification(with_preprocessor=False): def build_regression(with_preprocessor=False): - """Basic array for testing when using a preprocessor""" + """ + Basic array 'X, y' for testing when using a preprocessor, for regression. + For supervised learners. + + If no preprocessor: 'data' are raw points, 'target' are dummy labels, + 'preprocessor' is None, and 'to_transform' are points. + + If preprocessor: 'data' are point indices, 'target' are dummy labels, + 'preprocessor' are unique points, 'to_transform' are points. + """ X, y = shuffle(*make_regression(n_samples=100, n_features=5, random_state=SEED), random_state=SEED) @@ -58,6 +199,8 @@ def build_regression(with_preprocessor=False): def build_data(): + """Aux function: Returns 'X, pairs' taken from the iris dataset, where + pairs are positive and negative pairs for PairClassifiers.""" input_data, labels = load_iris(return_X_y=True) X, y = shuffle(input_data, labels, random_state=SEED) num_constraints = 50 @@ -70,7 +213,17 @@ def build_data(): def build_pairs(with_preprocessor=False): - # builds a toy pairs problem + """ + For all pair weakly-supervised learners. + + Returns: data, target, preprocessor, to_transform. + + If no preprocessor: 'data' are raw pairs, 'target' are dummy labels, + 'preprocessor' is None, and 'to_transform' are points. + + If preprocessor: 'data' are pair indices, 'target' are dummy labels, + 'preprocessor' are unique points, 'to_transform' are points. + """ X, indices = build_data() c = np.vstack([np.column_stack(indices[:2]), np.column_stack(indices[2:])]) target = np.concatenate([np.ones(indices[0].shape[0]), @@ -85,6 +238,17 @@ def build_pairs(with_preprocessor=False): def build_triplets(with_preprocessor=False): + """ + For all triplet weakly-supervised learners. 
+ + Returns: data, target, preprocessor, to_transform. + + If no preprocessor: 'data' are raw triplets, 'target' are dummy labels, + 'preprocessor' is None, and 'to_transform' are points. + + If preprocessor: 'data' are triplets indices, 'target' are dummy labels, + 'preprocessor' are unique points, 'to_transform' are points. + """ input_data, labels = load_iris(return_X_y=True) X, y = shuffle(input_data, labels, random_state=SEED) constraints = Constraints(y) @@ -98,7 +262,17 @@ def build_triplets(with_preprocessor=False): def build_quadruplets(with_preprocessor=False): - # builds a toy quadruplets problem + """ + For all Quadruplets weakly-supervised learners. + + Returns: data, target, preprocessor, to_transform. + + If no preprocessor: 'data' are raw quadruplets, 'target' are dummy labels, + 'preprocessor' is None, and 'to_transform' are points. + + If preprocessor: 'data' are quadruplets indices, 'target' are dummy labels, + 'preprocessor' are unique points, 'to_transform' are points. + """ X, indices = build_data() c = np.column_stack(indices) target = np.ones(c.shape[0]) # quadruplets targets are not used @@ -112,59 +286,132 @@ def build_quadruplets(with_preprocessor=False): return Dataset(X[c], target, None, X[c[:, 0]]) -quadruplets_learners = [(LSML(), build_quadruplets)] -ids_quadruplets_learners = list(map(lambda x: x.__class__.__name__, - [learner for (learner, _) in - quadruplets_learners])) +# ------------- List of learners, separating them by kind ------------- + +# Mahalanobis learners +# -- Weakly Supervised +quadruplets_learners_m = [(LSML(), build_quadruplets)] +ids_quadruplets_learners_m = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in + quadruplets_learners_m])) -triplets_learners = [(SCML(n_basis=320), build_triplets)] -ids_triplets_learners = list(map(lambda x: x.__class__.__name__, +triplets_learners_m = [(SCML(n_basis=320), build_triplets)] +ids_triplets_learners_m = list(map(lambda x: x.__class__.__name__, + [learner 
for (learner, _) in + triplets_learners_m])) + +pairs_learners_m = [(ITML(max_iter=2), build_pairs), # max_iter=2 to be faster + (MMC(max_iter=2), build_pairs), # max_iter=2 to be faster + (SDML(prior='identity', balance_param=1e-5), build_pairs)] +ids_pairs_learners_m = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in + pairs_learners_m])) + +# -- Supervised +classifiers_m = [(Covariance(), build_classification), + (LFDA(), build_classification), + (LMNN(), build_classification), + (NCA(), build_classification), + (RCA(), build_classification), + (ITML_Supervised(max_iter=5), build_classification), + (LSML_Supervised(), build_classification), + (MMC_Supervised(max_iter=5), build_classification), + (RCA_Supervised(num_chunks=5), build_classification), + (SDML_Supervised(prior='identity', balance_param=1e-5), + build_classification), + (SCML_Supervised(n_basis=80), build_classification)] +ids_classifiers_m = list(map(lambda x: x.__class__.__name__, [learner for (learner, _) in - triplets_learners])) - -pairs_learners = [(ITML(max_iter=2), build_pairs), # max_iter=2 to be faster - (MMC(max_iter=2), build_pairs), # max_iter=2 to be faster - (SDML(prior='identity', balance_param=1e-5), build_pairs)] -ids_pairs_learners = list(map(lambda x: x.__class__.__name__, - [learner for (learner, _) in - pairs_learners])) - -classifiers = [(Covariance(), build_classification), - (LFDA(), build_classification), - (LMNN(), build_classification), - (NCA(), build_classification), - (RCA(), build_classification), - (ITML_Supervised(max_iter=5), build_classification), - (LSML_Supervised(), build_classification), - (MMC_Supervised(max_iter=5), build_classification), - (RCA_Supervised(num_chunks=5), build_classification), - (SDML_Supervised(prior='identity', balance_param=1e-5), - build_classification), - (SCML_Supervised(n_basis=80), build_classification)] -ids_classifiers = list(map(lambda x: x.__class__.__name__, - [learner for (learner, _) in - classifiers])) - 
-regressors = [(MLKR(init='pca'), build_regression)] -ids_regressors = list(map(lambda x: x.__class__.__name__, - [learner for (learner, _) in regressors])) + classifiers_m])) + +regressors_m = [(MLKR(init='pca'), build_regression)] +ids_regressors_m = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in regressors_m])) +# -- Mahalanobis sets +tuples_learners_m = pairs_learners_m + triplets_learners_m + \ + quadruplets_learners_m +ids_tuples_learners_m = ids_pairs_learners_m + ids_triplets_learners_m \ + + ids_quadruplets_learners_m + +supervised_learners_m = classifiers_m + regressors_m +ids_supervised_learners_m = ids_classifiers_m + ids_regressors_m + +metric_learners_m = tuples_learners_m + supervised_learners_m +ids_metric_learners_m = ids_tuples_learners_m + ids_supervised_learners_m + +# Bilinear learners +# -- Weakly Supervised +quadruplets_learners_b = [(MockQuadrpletsIdentityBilinearLearner(), + build_quadruplets)] +ids_quadruplets_learners_b = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in + quadruplets_learners_b])) + +triplets_learners_b = [(MockTripletsIdentityBilinearLearner(), build_triplets), + (OASIS(), build_triplets)] +ids_triplets_learners_b = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in + triplets_learners_b])) + +pairs_learners_b = [(MockPairIdentityBilinearLearner(), build_pairs)] +ids_pairs_learners_b = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in + pairs_learners_b])) +# -- Supervised +classifiers_b = [(RandomBilinearLearner(), build_classification), + (IdentityBilinearLearner(), build_classification), + (OASIS_Supervised(), build_classification)] +ids_classifiers_b = list(map(lambda x: x.__class__.__name__, + [learner for (learner, _) in + classifiers_b])) +# -- Bilinear sets +tuples_learners_b = pairs_learners_b + triplets_learners_b + \ + quadruplets_learners_b +ids_tuples_learners_b = ids_pairs_learners_b + ids_triplets_learners_b \ + + 
ids_quadruplets_learners_b + +supervised_learners_b = classifiers_b +ids_supervised_learners_b = ids_classifiers_b + +metric_learners_b = tuples_learners_b + supervised_learners_b +ids_metric_learners_b = ids_tuples_learners_b + ids_supervised_learners_b + +# General sets (Mahalanobis + Bilinear) +# -- Weakly Supervised learners individually +pairs_learners = pairs_learners_m + pairs_learners_b +ids_pairs_learners = ids_pairs_learners_m + ids_pairs_learners_b +triplets_learners = triplets_learners_m + triplets_learners_b +ids_triplets_learners = ids_triplets_learners_m + ids_triplets_learners_b +quadruplets_learners = quadruplets_learners_m + quadruplets_learners_b +ids_quadruplets_learners = ids_quadruplets_learners_m + \ + ids_quadruplets_learners_b + +# -- All weakly supervised learners +tuples_learners = tuples_learners_m + tuples_learners_b +ids_tuples_learners = ids_tuples_learners_m + ids_tuples_learners_b + +# -- Supervised learners +supervised_learners = supervised_learners_m + supervised_learners_b +ids_supervised_learners = ids_supervised_learners_m + ids_supervised_learners_b + +# -- Weakly Supervised + Supervised learners +metric_learners = metric_learners_m + metric_learners_b +ids_metric_learners = ids_metric_learners_m + ids_metric_learners_b + +# -- For sklearn pipeline: Pair + Supervised learners +metric_learners_pipeline = pairs_learners_m + pairs_learners_b + \ + supervised_learners_m + supervised_learners_b +ids_metric_learners_pipeline = ids_pairs_learners_m + ids_pairs_learners_b +\ + ids_supervised_learners_m + \ + ids_supervised_learners_b + +# Not used WeaklySupervisedClasses = (_PairsClassifierMixin, _TripletsClassifierMixin, _QuadrupletsClassifierMixin) -tuples_learners = pairs_learners + triplets_learners + quadruplets_learners -ids_tuples_learners = ids_pairs_learners + ids_triplets_learners \ - + ids_quadruplets_learners - -supervised_learners = classifiers + regressors -ids_supervised_learners = ids_classifiers + ids_regressors - 
-metric_learners = tuples_learners + supervised_learners -ids_metric_learners = ids_tuples_learners + ids_supervised_learners - -metric_learners_pipeline = pairs_learners + supervised_learners -ids_metric_learners_pipeline = ids_pairs_learners + ids_supervised_learners +# ------------- Useful methods ------------- def remove_y(estimator, X, y): @@ -850,15 +1097,14 @@ def test_error_message_t_pair_distance_or_score(estimator, _): .format(make_context(estimator), triplets)) assert str(raised_err.value) == expected_msg - not_implemented_msg = "" - # Todo in 0.7.0: Change 'not_implemented_msg' for the message that says - # "This learner does not have pair_distance" + msg = ("This learner doesn't learn a distance, thus ", + "this method is not implemented. Use pair_score instead") # One exception will trigger for sure with pytest.raises(Exception) as raised_exception: estimator.pair_distance(triplets) err_value = raised_exception.value.args[0] - assert err_value == expected_msg or err_value == not_implemented_msg + assert err_value == expected_msg or err_value == msg def test_preprocess_tuples_simple_example(): @@ -897,7 +1143,8 @@ def fun(row): ids=ids_metric_learners) def test_same_with_or_without_preprocessor(estimator, build_dataset): """Test that algorithms using a preprocessor behave consistently -# with their no-preprocessor equivalent + with their no-preprocessor equivalent. Methods `pair_score`, + `score_pairs` (deprecated), `predict` and `decision_function`. """ dataset_indices = build_dataset(with_preprocessor=True) dataset_formed = build_dataset(with_preprocessor=False) @@ -926,7 +1173,7 @@ def test_same_with_or_without_preprocessor(estimator, build_dataset): estimator_with_prep_formed.set_params(preprocessor=X) estimator_with_prep_formed.fit(*remove_y(estimator, indices_train, y_train)) - # test prediction methods + # Test prediction methods for Weakly supervised algorithms. 
for method in ["predict", "decision_function"]: if hasattr(estimator, method): output_with_prep = getattr(estimator_with_preprocessor, @@ -940,8 +1187,9 @@ def test_same_with_or_without_preprocessor(estimator, build_dataset): method)(formed_test) assert np.array(output_with_prep == output_with_prep_formed).all() - # Test pair_score, all learners have it. - idx1 = np.array([[0, 2], [5, 3]], dtype=int) + idx1 = np.array([[0, 2], [5, 3]], dtype=int) # Sample + + # Pair score output_with_prep = estimator_with_preprocessor.pair_score( indicators_to_transform[idx1]) output_without_prep = estimator_without_preprocessor.pair_score( @@ -954,11 +1202,26 @@ def test_same_with_or_without_preprocessor(estimator, build_dataset): formed_points_to_transform[idx1]) assert np.array(output_with_prep == output_without_prep).all() - # Test pair_distance - not_implemented_msg = "" - # Todo in 0.7.0: Change 'not_implemented_msg' for the message that says - # "This learner does not have pair_distance" - try: + # Score pairs. TODO: Delete in 0.8.0 + msg = ("score_pairs will be deprecated in release 0.7.0. 
" + "Use pair_score to compute similarity scores, or " + "pair_distances to compute distances.") + with pytest.warns(FutureWarning) as raised_warning: + output_with_prep = estimator_with_preprocessor.score_pairs( + indicators_to_transform[idx1]) + output_without_prep = estimator_without_preprocessor.score_pairs( + formed_points_to_transform[idx1]) + assert np.array(output_with_prep == output_without_prep).all() + + output_with_prep = estimator_with_preprocessor.score_pairs( + indicators_to_transform[idx1]) + output_without_prep = estimator_with_prep_formed.score_pairs( + formed_points_to_transform[idx1]) + assert np.array(output_with_prep == output_without_prep).all() + assert any([str(warning.message) == msg for warning in raised_warning]) + + if isinstance(estimator, MahalanobisMixin): + # Pair distance output_with_prep = estimator_with_preprocessor.pair_distance( indicators_to_transform[idx1]) output_without_prep = estimator_without_preprocessor.pair_distance( @@ -971,14 +1234,7 @@ def test_same_with_or_without_preprocessor(estimator, build_dataset): formed_points_to_transform[idx1]) assert np.array(output_with_prep == output_without_prep).all() - except Exception as raised_exception: - assert raised_exception.value.args[0] == not_implemented_msg - - # Test transform - not_implemented_msg = "" - # Todo in 0.7.0: Change 'not_implemented_msg' for the message that says - # "This learner does not have transform" - try: + # Transform output_with_prep = estimator_with_preprocessor.transform( indicators_to_transform) output_without_prep = estimator_without_preprocessor.transform( @@ -991,9 +1247,6 @@ def test_same_with_or_without_preprocessor(estimator, build_dataset): formed_points_to_transform) assert np.array(output_with_prep == output_without_prep).all() - except Exception as raised_exception: - assert raised_exception.value.args[0] == not_implemented_msg - def test_check_collapsed_pairs_raises_no_error(): """Checks that check_collapsed_pairs raises no error if no 
collapsed pairs @@ -1270,3 +1523,134 @@ def test_pseudo_inverse_from_eig_and_pinvh_nonsingular(): A = A + A.T w, V = eigh(A, check_finite=False) np.testing.assert_allclose(_pseudo_inverse_from_eig(w, V), pinvh(A)) + + +@pytest.mark.parametrize(('n_triplets', 'n_iter'), + [(10, 10), (33, 70), (100, 67), + (10000, 20000)]) +def test_indices_funct(n_triplets, n_iter): + """ + This test verifies the behaviour of _get_random_indices, the + method used inside OASIS that defines the order in which the + triplets are given to the algorithm, in an online manner. + """ + # Not random cases + base = np.arange(n_triplets) + + # n_iter = n_triplets + if n_iter == n_triplets: + r = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=False, random=False, + random_state=RNG) + assert_array_equal(r, base) # No shuffle + assert len(r) == len(base) # Same length + + # Shuffle + r = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=True, random=False, + random_state=RNG) + with assert_raises(AssertionError): # Should be different + assert_array_equal(r, base) + # But contain the same elements + assert_array_equal(np.unique(r), np.unique(base)) + assert len(r) == len(base) # Same length + + # n_iter > n_triplets + if n_iter > n_triplets: + r = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=False, random=False, + random_state=RNG) + assert_array_equal(r[:n_triplets], base) # First n_triplets must match + assert len(r) == n_iter # Expected length + + # Next n_iter-n_triplets must be in range(n_triplets) + sample = r[n_triplets:] + for i in range(n_iter - n_triplets): + if sample[i] not in base: + raise AssertionError("Sampling has values out of range") + + # Shuffle + r = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=True, random=False, + random_state=RNG) + assert len(r) == n_iter # Expected length + + # Each triplet must appear at least once + assert_array_equal(np.unique(r), np.unique(base)) + with
assert_raises(AssertionError): # First n_triplets should be different + assert_array_equal(r[:n_triplets], base) + + # Each index should appear at least ceil(n_iter/n_triplets) - 1 times + # But no more than ceil(n_iter/n_triplets) + min_times = int(np.ceil(n_iter / n_triplets)) - 1 + _, counts = np.unique(r, return_counts=True) + a = len(counts[counts >= min_times]) + b = len(counts[counts <= min_times + 1]) + assert len(np.unique(r)) == a + assert n_triplets == b + + # n_iter < n_triplets + if n_iter < n_triplets: + r = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=False, random=False, + random_state=RNG) + assert len(r) == n_iter # Expected length + u = np.unique(r) + assert len(u) == len(r) # No duplicates + # Final array must contain only elements in range(n_triplets) + for i in range(n_iter): + if r[i] not in base: + raise AssertionError("Sampling has values out of range") + + # Shuffle must only reorder elements + # It takes two instances with same random_state, to show that only + # the final order is mixed + def is_sorted(a): + return np.all(a[:-1] <= a[1:]) + + r_a = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=False, random=False, + random_state=SEED) + assert is_sorted(r_a) # It is not shuffled + values_r_a, counts_r_a = np.unique(r_a, return_counts=True) + + r_b = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=True, random=False, + random_state=SEED) + + with assert_raises(AssertionError): + assert is_sorted(r_b) # This one should not be sorted, but shuffled + values_r_b, counts_r_b = np.unique(r_b, return_counts=True) + + assert_array_equal(values_r_a, values_r_b) # Same elements + assert_array_equal(counts_r_a, counts_r_b) # Same counts + with assert_raises(AssertionError): + assert_array_equal(r_a, r_b) # Different order + + # Random case + r = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + random=True, random_state=RNG) + assert len(r) == n_iter # Expected length + for i in
range(n_iter): + if r[i] not in base: + raise AssertionError("Sampling has values out of range") + # Shuffle has no effect + r_a = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=False, random=True, + random_state=SEED) + + r_b = _get_random_indices(n_triplets=n_triplets, n_iter=n_iter, + shuffle=True, random=True, + random_state=SEED) + assert_array_equal(r_a, r_b) + + # n_triplets and n_iter cannot be 0 + msg = ("n_triplets cannot be 0") + with pytest.raises(ValueError) as raised_error: + _get_random_indices(n_triplets=0, n_iter=n_iter, random=True) + assert msg == raised_error.value.args[0] + + msg = ("n_iter cannot be 0") + with pytest.raises(ValueError) as raised_error: + _get_random_indices(n_triplets=n_triplets, n_iter=0, random=True) + assert msg == raised_error.value.args[0]
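
Reviewer note: the properties that `test_indices_funct` checks can be summarised with a small stand-in sampler. The sketch below is hypothetical and is not metric-learn's `_get_random_indices`; the name `sample_indices`, the `seed` parameter, and the internal strategy (full passes plus a duplicate-free remainder) are assumptions chosen only to satisfy the behaviour the test describes.

```python
import numpy as np


def sample_indices(n_triplets, n_iter, shuffle=True, random=False, seed=None):
    """Hypothetical stand-in, NOT metric-learn's _get_random_indices."""
    if n_triplets == 0:
        raise ValueError("n_triplets cannot be 0")
    if n_iter == 0:
        raise ValueError("n_iter cannot be 0")
    rng = np.random.RandomState(seed)

    if random:
        # Uniform sampling with replacement; `shuffle` is ignored here.
        return rng.randint(0, n_triplets, size=n_iter)

    base = np.arange(n_triplets)
    if n_iter < n_triplets:
        # A sorted, duplicate-free subset of the available indices.
        out = np.sort(rng.choice(n_triplets, size=n_iter, replace=False))
    else:
        # Full passes over every index, then a duplicate-free remainder,
        # so each index appears n_full or n_full + 1 times.
        n_full, rest = divmod(n_iter, n_triplets)
        extra = np.sort(rng.choice(n_triplets, size=rest, replace=False))
        out = np.concatenate([np.tile(base, n_full), extra])

    if shuffle:
        # Only the order changes; the multiset of indices is preserved.
        rng.shuffle(out)
    return out
```

Because the subset is drawn before any shuffling, two calls with the same `seed` and different `shuffle` values return the same indices in a different order, which is the invariant the `n_iter < n_triplets` branch of the test relies on.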