Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InnerProduct Distance Metric for CAGRA search #2260

Merged
merged 34 commits into from
Apr 30, 2024

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Apr 10, 2024

InnerProduct Distance Metric for CAGRA search. InnerProduct in graph building is supported using IVF-PQ for building the graph. NNDescent does not currently support any other metric except L2Expanded.

@github-actions github-actions bot added the cpp label Apr 10, 2024
@tarang-jain tarang-jain added 2 - In Progress Currenty a work in progress non-breaking Non-breaking change feature request New feature or request labels Apr 10, 2024
@tarang-jain tarang-jain marked this pull request as ready for review April 11, 2024 21:08
@tarang-jain tarang-jain requested a review from a team as a code owner April 11, 2024 21:08
@tarang-jain
Copy link
Contributor Author

tarang-jain commented Apr 11, 2024

/rerun tests

Copy link
Member

@enp1s0 enp1s0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tarang-jain for adding support for innerproduct similarity! Most of the changes look good to me. There is one performance concern and some minor suggestions, so can you modify them?

@lowener
Copy link
Contributor

lowener commented Apr 12, 2024

Can you add support for InnerProduct on Python side in this PR as well?

@tarang-jain
Copy link
Contributor Author

/ok to test

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tarang-jain for the update. LGTM!

Copy link
Member

@enp1s0 enp1s0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tarang-jain for the changes. The code looks good to me.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good Tarang. Mostly small items on my side, but want to make sure we're providing a consistent API experience across all of our ANN APIs (this will become particularly important when these APIs are migrated to cuVS).

cpp/include/raft/neighbors/detail/cagra/cagra_build.cuh Outdated Show resolved Hide resolved
cpp/include/raft/neighbors/detail/cagra/cagra_build.cuh Outdated Show resolved Hide resolved
cpp/include/raft/neighbors/detail/cagra/cagra_build.cuh Outdated Show resolved Hide resolved
cpp/test/neighbors/ann_cagra.cuh Outdated Show resolved Hide resolved
cpp/include/raft/neighbors/cagra.cuh Outdated Show resolved Hide resolved
*
* Usage example:
* @code{.cpp}
* using namespace raft::neighbors;
* // use default index parameters
* ivf_pq::index_params build_params;
* ivf_pq::index_params = cagra::get_default_ivf_pq_build_params(dataset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like this approach as it doesn't follow the standard API design flow that we've been using in hte other algos. I wonder if instead of adding this additional helper function if we could instead just add an overloaded constructor to the index_params object that accepts an mdsan with the dataset and sets the defaults automatically. This would then follow the standard build flow but add an option for the user to establish the ressonable defaultsby passing in the dataset.

Copy link
Contributor Author

@tarang-jain tarang-jain Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't add constructors to the ivf_pq::index_params struct because the aggregate assertion here. A helper is needed. One thing that can be done is that instead of a constructor within the index_params struct, we can have a helper get_default_index_params outside of the struct, but in ivf_pq_types.hpp

Copy link
Member

@cjnolet cjnolet Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'd just remove the static assertion there. It's not necessary. My problem with exposing helper functions like this is that they make the APIs harder to use as they aren't immediatley obvious and don't provide for a consistent experience. Keeping everything together within the index_params object itself allows there to be a single entrypoint (e.g. index_params) for construction. The other option would be to have factory functions which would always be used for construction, however I think that will also make things even more confusing unless we used that pattern across all the ANN APis.

Copy link
Contributor

@tfeher tfeher Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are constructing CAGRA specific ivf_pq::index_params. I do not think the constructor of ivf_pq::index_params shall have the job to define parameters according to CAGRA's requirements.

For consistent experience, the user just need to fill the metric field of cagra::index_params, which is inherited from ann::index_params, and call cagra::build as usual.

Users of build_knn_graph are advanced users who are using build_knn_graph & optimize in a separate step. Other algos don't have these steps, this is custom for CAGRA. Before this PR, the user either passes a manually filled ivf_pq::index_params, or just uses a default. I think a helper function is an improvement.

@tarang-jain
Copy link
Contributor Author

/ok to test

1 similar comment
@tarang-jain
Copy link
Contributor Author

/ok to test

@tarang-jain
Copy link
Contributor Author

/ok to test

Comment on lines 108 to 125
/**
* Helper that sets values according to the extents of the dataset mdspan.
*/
template <typename DataT, typename Accessor>
static index_params from_dataset(
mdspan<const DataT, matrix_extent<int64_t>, row_major, Accessor> dataset,
raft::distance::DistanceType metric = raft::distance::L2Expanded)
{
index_params params;
params.n_lists =
dataset.extent(0) < 4 * 2500 ? 4 : static_cast<uint32_t>(std::sqrt(dataset.extent(0)));
params.pq_dim =
round_up_safe(static_cast<uint32_t>(dataset.extent(1) / 4), static_cast<uint32_t>(8));
params.pq_bits = 8;
params.kmeans_trainset_fraction = dataset.extent(0) < 10000 ? 1 : 0.1;
params.metric = metric;
return params;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are specific default params that are useful for constructing a CAGRA index, but possibly not ideal as default params for an ivf_pq index. The documentation has to make this clear. I think it would be also preferred to keep this helper in the CAGRA namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't know that I agree with this. The sqrt(n) and compression ratio are reasonable defaults for ivfpq in general, irrespective of CAGRA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe you are right. It is specific for CAGRA only in the sense that it leads to good enough recall for a building a CAGRA graph. But we do not give any recall guarantees (we cannot since it is dataset dependent) so we can treat this general defaults.

@tarang-jain
Copy link
Contributor Author

/ok to test

Comment on lines 108 to 125
/**
* Helper that sets values according to the extents of the dataset mdspan.
*/
template <typename DataT, typename Accessor>
static index_params from_dataset(
mdspan<const DataT, matrix_extent<int64_t>, row_major, Accessor> dataset,
raft::distance::DistanceType metric = raft::distance::L2Expanded)
{
index_params params;
params.n_lists =
dataset.extent(0) < 4 * 2500 ? 4 : static_cast<uint32_t>(std::sqrt(dataset.extent(0)));
params.pq_dim =
round_up_safe(static_cast<uint32_t>(dataset.extent(1) / 4), static_cast<uint32_t>(8));
params.pq_bits = 8;
params.kmeans_trainset_fraction = dataset.extent(0) < 10000 ? 1 : 0.1;
params.metric = metric;
return params;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe you are right. It is specific for CAGRA only in the sense that it leads to good enough recall for a building a CAGRA graph. But we do not give any recall guarantees (we cannot since it is dataset dependent) so we can treat this general defaults.

@tarang-jain
Copy link
Contributor Author

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Apr 24, 2024

Ok, maybe you are right. It is specific for CAGRA only in the sense that it leads to good enough recall for a building a CAGRA graph. But we do not give any recall guarantees (we cannot since it is dataset dependent) so we can treat this general defaults.

I went back and forth about this a couple times too, and I ended up at the conclusion that taking the dataset's shape into account should add tiny bit more generalization to the default params, even if they aren't guaranteed to yield the highest recall. IMO, in the worst case, a user doesn't use the factory method and they get the same default params we've always provided and in the best case, they are able to use the factory method and they are happy w/ the resultling recall.

@cjnolet
Copy link
Member

cjnolet commented Apr 26, 2024

/merge

@tarang-jain
Copy link
Contributor Author

/ok to test

@tarang-jain
Copy link
Contributor Author

/ok to test

@rapids-bot rapids-bot bot merged commit e720de7 into rapidsai:branch-24.06 Apr 30, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currenty a work in progress cpp feature request New feature or request non-breaking Non-breaking change python
Projects
Development

Successfully merging this pull request may close these issues.

5 participants