Merge pull request #72 from vc1492a/fix/dataframe_distance_matrix_inconsistency

Fix/dataframe distance matrix inconsistency
vc1492a authored Nov 3, 2024
2 parents f78929a + f89a99c commit 06541c7
Showing 5 changed files with 65 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -47,7 +47,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12", "3.13"]

steps:
- uses: actions/checkout@v4
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ PyNomaly/loop_dev.py
*.pyc
*.coverage.*
.coveragerc
.pypirc

# Byte-compiled / optimized / DLL files
__pycache__/
4 changes: 4 additions & 0 deletions changelog.md
@@ -4,6 +4,10 @@ All notable changes to PyNomaly will be documented in this Changelog.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## 0.3.4
### Changed
- Changed source code as necessary to address a [user-reported issue](https://github.com/vc1492a/PyNomaly/issues/49), corrected in [this commit](https://github.com/vc1492a/PyNomaly/commit/bbdd12a318316ca9c7e0272a5b06909f3fc4f9b0)

## 0.3.3
### Changed
- The implementation of the progress bar to support use when the number of
18 changes: 15 additions & 3 deletions readme.md
@@ -38,7 +38,7 @@ This Python 3 implementation uses Numpy and the formulas outlined in
to calculate the Local Outlier Probability of each sample.

## Dependencies
- Python 3.6 - 3.12
- Python 3.6 - 3.13
- numpy >= 1.16.3
- python-utils >= 2.3.0
- (optional) numba >= 0.45.1
@@ -281,7 +281,12 @@ PyNomaly provides the ability to specify a distance matrix so that any
distance metric can be used (a neighbor index matrix must also be provided).
This is useful when a distance metric other than Euclidean is needed.

Note that in order to maintain alignment with the LoOP definition of closest neighbors,
an additional neighbor is requested when using [scikit-learn's NearestNeighbors](https://scikit-learn.org/1.5/modules/neighbors.html), since `NearestNeighbors`
includes the point itself when calculating the closest neighbors (whereas the LoOP method does not include distances to the point itself).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.array([
@@ -293,11 +298,18 @@ data = np.array([
[421.5, 90.3, 50.0]
])

neigh = NearestNeighbors(n_neighbors=3, metric='hamming')
# Generate distance and neighbor matrices
n_neighbors = 3 # the number of neighbors according to the LoOP definition
neigh = NearestNeighbors(n_neighbors=n_neighbors+1, metric='hamming')
neigh.fit(data)
d, idx = neigh.kneighbors(data, return_distance=True)

m = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=3).fit()
# Remove self-distances - you MUST do this to preserve the same results as intended by the definition of LoOP
# (the arrays returned by kneighbors above are named d and idx)
idx = np.delete(idx, 0, 1)
d = np.delete(d, 0, 1)

# Fit and return scores
m = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=n_neighbors+1).fit()
scores = m.local_outlier_probabilities
```
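
The self-neighbor behavior described in the note can be seen without scikit-learn. The following is a minimal sketch, not part of the README: it uses a brute-force NumPy neighbor search on made-up points (the names `points`, `k`, `order`, and `neighbor_idx` are hypothetical) to show that a self-inclusive query for `k + 1` neighbors puts each point in its own first column at distance zero, and that `np.delete(..., 0, axis=1)` leaves the `k` neighbors LoOP expects.

```python
import numpy as np

# Hypothetical toy data: five distinct 2-D points (not taken from the README).
points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [5.0, 5.0],
    [6.0, 5.0],
])

k = 2  # neighbors per point in the LoOP sense (the point itself excluded)

# Brute-force pairwise Euclidean distances, shape (5, 5).
diff = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=-1))

# Take the k + 1 closest entries per row, as a self-inclusive query would.
order = np.argsort(pairwise, axis=1, kind="stable")[:, : k + 1]
dist = np.take_along_axis(pairwise, order, axis=1)

# With distinct points, column 0 is each point matched with itself.
self_is_first = bool(np.all(order[:, 0] == np.arange(len(points))))

# Drop column 0 (axis=1) so that only the k true neighbors remain.
neighbor_idx = np.delete(order, 0, axis=1)
neighbor_dist = np.delete(dist, 0, axis=1)
```

This sketch assumes all points are distinct. If the data contains exact duplicates, the zero-distance entry in column 0 is not guaranteed to be the point itself, so it may be worth checking the index column before deleting it.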

44 changes: 44 additions & 0 deletions tests/test_loop.py
@@ -790,3 +790,47 @@ def test_data_flipping() -> None:
fit2.norm_prob_local_outlier_factor,
decimal=6,
)


def test_distance_matrix_consistency(X_n120) -> None:
    """
    Test to ensure that the distance matrix is consistent with the neighbor
    matrix and that the software is able to handle self-distances.
    :return: None
    """

    neigh = NearestNeighbors(metric='euclidean')
    neigh.fit(X_n120)
    distances, indices = neigh.kneighbors(X_n120, n_neighbors=11, return_distance=True)

    # remove the closest neighbor (it's the point itself) from each row in the indices matrix and distances matrix
    indices = np.delete(indices, 0, 1)
    distances = np.delete(distances, 0, 1)

    # Fit LoOP with and without distance matrix
    clf_data = loop.LocalOutlierProbability(X_n120, n_neighbors=10)
    clf_dist = loop.LocalOutlierProbability(distance_matrix=distances, neighbor_matrix=indices, n_neighbors=11)

    # Attempt to retrieve scores and check types
    scores_data = clf_data.fit().local_outlier_probabilities
    scores_dist = clf_dist.fit().local_outlier_probabilities

    # Debugging prints to investigate types and contents
    print("Type of scores_data:", type(scores_data))
    print("Type of scores_dist:", type(scores_dist))
    print("Value of scores_data:", scores_data)
    print("Value of scores_dist:", scores_dist)
    print("Shape of scores_data:", scores_data.shape)
    print("Shape of scores_dist:", scores_dist.shape)

    # Convert to arrays if they aren't already
    scores_data = np.array(scores_data) if not isinstance(scores_data, np.ndarray) else scores_data
    scores_dist = np.array(scores_dist) if not isinstance(scores_dist, np.ndarray) else scores_dist

    # Check shapes and types before assertion
    assert scores_data.shape == scores_dist.shape, "Score shapes mismatch"
    assert isinstance(scores_data, np.ndarray), "Expected scores_data to be a numpy array"
    assert isinstance(scores_dist, np.ndarray), "Expected scores_dist to be a numpy array"

    # Compare scores allowing for minor floating-point differences
    assert_array_almost_equal(scores_data, scores_dist, decimal=10, err_msg="Inconsistent LoOP scores due to self-distances")
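
For readers unfamiliar with the `decimal=10` tolerance used in the final assertion, the sketch below shows how `numpy.testing.assert_array_almost_equal` behaves. Per the NumPy documentation, the check passes when the absolute difference is below `1.5 * 10**(-decimal)`. The arrays here are made up for illustration and are not LoOP scores.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# Hypothetical values standing in for two sets of scores.
reference = np.array([0.1, 0.2, 0.3])
tiny_drift = reference + 1e-12     # well inside the decimal=10 tolerance
visible_drift = reference + 1e-6   # far outside the decimal=10 tolerance

# Passes silently: the difference is below 1.5e-10.
assert_array_almost_equal(reference, tiny_drift, decimal=10)

# Raises AssertionError: the difference exceeds 1.5e-10.
try:
    assert_array_almost_equal(reference, visible_drift, decimal=10)
    detected = False
except AssertionError:
    detected = True
```

A tolerance this tight tolerates only floating-point noise, so the test would fail if an extra or missing neighbor column changed the scores.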
