Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] fused_l2_nn_argmin produces unexpected results #1036

Closed
benfred opened this issue Nov 18, 2022 · 7 comments · Fixed by #1040
Closed

[BUG] fused_l2_nn_argmin produces unexpected results #1036

benfred opened this issue Nov 18, 2022 · 7 comments · Fixed by #1040
Labels
bug Something isn't working

Comments

@benfred
Copy link
Member

benfred commented Nov 18, 2022

Describe the bug
Running fused_l2_nn_argmin sometimes produces results that appear incorrect.

As a trivial test case, passing the same matrix for both inputs - sometimes returns results where each row isn't
it's own nearest neighbour.

Steps/Code to reproduce bug

import cupy as cp
from numpy.testing import assert_array_equal
from pylibraft.distance import fused_l2_nn_argmin


X = cp.array([[0.15804534, 0.70286853, 0.1001605, 0.30935687, 0.44620642],
 [0.40115238, 0.36828412, 0.58650847, 0.26410076, 0.9525248 ],
 [0.11997995, 0.85193706, 0.27353832, 0.13150834, 0.98335868],
 [0.21238154, 0.34168486, 0.94976721, 0.33517375, 0.41023301]])

ids = fused_l2_nn_argmin(X, X, sqrt=True).copy_to_host()

# ids are [0, 2, 2, 3] on my system - but I would expect each row to be its own nearest neighbour [0, 1, 2, 3] 
assert_array_equal(ids, np.arange(4))

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of RAFT install: source
  • Cuda 11.4 / A6000 GPU
@benfred benfred added the bug Something isn't working label Nov 18, 2022
@cjnolet
Copy link
Member

cjnolet commented Nov 18, 2022

cc @Nyrio have you encountered this behavior in your dealings with the fused l2 primitive?

@cjnolet
Copy link
Member

cjnolet commented Nov 18, 2022

@benfred it looks like it might be related to the sqrt here:

>>> X = cp.random.random((100, 5)).astype(cp.float32)
>>> ids = fused_l2_nn_argmin(X, X, sqrt=False).copy_to_host()
>>> ids
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
      dtype=int32)
>>> ids = fused_l2_nn_argmin(X, X, sqrt=True).copy_to_host()
>>> ids
array([ 0, 74,  2,  3,  4,  5,  6, 79,  8,  9, 10, 11, 12, 13, 14, 63, 16,
       17, 18, 19, 54, 95, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 34, 36, 37, 48, 39, 40, 41, 56, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 26, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73,  1, 75, 76, 57, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 22, 92, 93, 94, 95, 96, 97, 98, 99],
      dtype=int32)
>>> 

@Nyrio
Copy link
Contributor

Nyrio commented Nov 21, 2022

@benfred Thanks for reporting this bug!
@cjnolet I haven't noticed that before, but can have a closer look.

rapids-bot bot pushed a commit that referenced this issue Nov 23, 2022
… x and y (#1040)

Solves #1036 

Even when computing a sum of squares, the distance from a point to itself can apparently be `-0.0` in which case the square root is `nan` and comparisons are broken.

Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1040
@cjnolet cjnolet reopened this Oct 17, 2023
@Nyrio
Copy link
Contributor

Nyrio commented Oct 17, 2023

@cjnolet What's the reason for reopening?

@cjnolet
Copy link
Member

cjnolet commented Oct 17, 2023

@Nyrio we noticed yesterday that our Google tests haven't been running in CI since July because a change was made to the CI script causing them the runner to fail silently.

Unfortunately, there's now some failures in the fused l2 nn tests and the ivf pq tests. We think the fused l2 nn is likely causing the ivf pq errors (through the balanced kmeans).

I'm not sure yet if it's the round off errors from this issue that are the culprit. Our python fused p2 nn tests are passing but the Google tests are not and it looks like it's more issues where a neigbor's distance is supposed to be 0 but it's nonzero instead.

What I found last night is that ether approximation in the polynomial expansion can cause this.

For example, let's say we have a self looped distance, which should be 0. Let's say the norm for X and Y vectors are 2.54321. Those will match because we compute the norms using the same function. The dot product, however, could be 2.54322 because of round off errors. What I notice happening is that the difference between the norms and the dot product will leave something like 0.00001 instead of true 0 and once we take the sqrt of that, we get something like 0.00316 instead of 0.

@Nyrio
Copy link
Contributor

Nyrio commented Oct 17, 2023

Ok, thanks for explaining.

@cjnolet
Copy link
Member

cjnolet commented Oct 18, 2023

This should be fixed once and for all in this PR: #1904. I think the changes in that PR also beg for potential future optimizations. That round-off error is not easy to deal with.

@cjnolet cjnolet closed this as completed Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Development

Successfully merging a pull request may close this issue.

3 participants