-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
InnerProduct Distance Metric for CAGRA search #2260
Conversation
/rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tarang-jain for adding support for innerproduct similarity! Most of the changes look good to me. There is one performance concern and some minor suggestions, so can you modify them?
cpp/include/raft/neighbors/detail/cagra/compute_distance_vpq.cuh
Outdated
Show resolved
Hide resolved
cpp/include/raft/neighbors/detail/cagra/search_multi_kernel.cuh
Outdated
Show resolved
Hide resolved
cpp/include/raft/neighbors/detail/cagra/search_multi_kernel.cuh
Outdated
Show resolved
Hide resolved
cpp/include/raft/neighbors/detail/cagra/search_multi_kernel.cuh
Outdated
Show resolved
Hide resolved
cpp/include/raft/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
Show resolved
Hide resolved
cpp/include/raft/neighbors/detail/cagra/search_multi_kernel.cuh
Outdated
Show resolved
Hide resolved
Co-authored-by: tsuki <[email protected]>
cpp/include/raft/neighbors/detail/cagra/compute_distance_vpq.cuh
Outdated
Show resolved
Hide resolved
cpp/include/raft/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
Outdated
Show resolved
Hide resolved
Can you add support for InnerProduct on Python side in this PR as well? |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tarang-jain for the update. LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tarang-jain for the changes. The code looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good Tarang. Mostly small items on my side, but want to make sure we're providing a consistent API experience across all of our ANN APIs (this will become particularly important when these APIs are migrated to cuVS).
cpp/include/raft/neighbors/cagra.cuh
Outdated
* | ||
* Usage example: | ||
* @code{.cpp} | ||
* using namespace raft::neighbors; | ||
* // use default index parameters | ||
* ivf_pq::index_params build_params; | ||
* ivf_pq::index_params = cagra::get_default_ivf_pq_build_params(dataset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like this approach as it doesn't follow the standard API design flow that we've been using in hte other algos. I wonder if instead of adding this additional helper function if we could instead just add an overloaded constructor to the index_params
object that accepts an mdsan with the dataset and sets the defaults automatically. This would then follow the standard build flow but add an option for the user to establish the ressonable defaultsby passing in the dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't add constructors to the ivf_pq::index_params
struct because the aggregate assertion here. A helper is needed. One thing that can be done is that instead of a constructor within the index_params
struct, we can have a helper get_default_index_params
outside of the struct, but in ivf_pq_types.hpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I'd just remove the static assertion there. It's not necessary. My problem with exposing helper functions like this is that they make the APIs harder to use as they aren't immediatley obvious and don't provide for a consistent experience. Keeping everything together within the index_params
object itself allows there to be a single entrypoint (e.g. index_params) for construction. The other option would be to have factory functions which would always be used for construction, however I think that will also make things even more confusing unless we used that pattern across all the ANN APis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are constructing CAGRA specific ivf_pq::index_params
. I do not think the constructor of ivf_pq::index_params
shall have the job to define parameters according to CAGRA's requirements.
For consistent experience, the user just need to fill the metric
field of cagra::index_params
, which is inherited from ann::index_params, and call cagra::build
as usual.
Users of build_knn_graph
are advanced users who are using build_knn_graph
& optimize
in a separate step. Other algos don't have these steps, this is custom for CAGRA. Before this PR, the user either passes a manually filled ivf_pq::index_params
, or just uses a default. I think a helper function is an improvement.
/ok to test |
1 similar comment
/ok to test |
0e498a8
to
dda1810
Compare
/ok to test |
/** | ||
* Helper that sets values according to the extents of the dataset mdspan. | ||
*/ | ||
template <typename DataT, typename Accessor> | ||
static index_params from_dataset( | ||
mdspan<const DataT, matrix_extent<int64_t>, row_major, Accessor> dataset, | ||
raft::distance::DistanceType metric = raft::distance::L2Expanded) | ||
{ | ||
index_params params; | ||
params.n_lists = | ||
dataset.extent(0) < 4 * 2500 ? 4 : static_cast<uint32_t>(std::sqrt(dataset.extent(0))); | ||
params.pq_dim = | ||
round_up_safe(static_cast<uint32_t>(dataset.extent(1) / 4), static_cast<uint32_t>(8)); | ||
params.pq_bits = 8; | ||
params.kmeans_trainset_fraction = dataset.extent(0) < 10000 ? 1 : 0.1; | ||
params.metric = metric; | ||
return params; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are specific default params that are useful for constructing a CAGRA index, but possibly not ideal as default params for an ivf_pq
index. The documentation has to make this clear. I think it would be also preferred to keep this helper in the CAGRA namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't know that I agree with this. The sqrt(n) and compression ratio are reasonable defaults for ivfpq in general, irrespective of CAGRA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, maybe you are right. It is specific for CAGRA only in the sense that it leads to good enough recall for a building a CAGRA graph. But we do not give any recall guarantees (we cannot since it is dataset dependent) so we can treat this general defaults.
/ok to test |
/** | ||
* Helper that sets values according to the extents of the dataset mdspan. | ||
*/ | ||
template <typename DataT, typename Accessor> | ||
static index_params from_dataset( | ||
mdspan<const DataT, matrix_extent<int64_t>, row_major, Accessor> dataset, | ||
raft::distance::DistanceType metric = raft::distance::L2Expanded) | ||
{ | ||
index_params params; | ||
params.n_lists = | ||
dataset.extent(0) < 4 * 2500 ? 4 : static_cast<uint32_t>(std::sqrt(dataset.extent(0))); | ||
params.pq_dim = | ||
round_up_safe(static_cast<uint32_t>(dataset.extent(1) / 4), static_cast<uint32_t>(8)); | ||
params.pq_bits = 8; | ||
params.kmeans_trainset_fraction = dataset.extent(0) < 10000 ? 1 : 0.1; | ||
params.metric = metric; | ||
return params; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, maybe you are right. It is specific for CAGRA only in the sense that it leads to good enough recall for a building a CAGRA graph. But we do not give any recall guarantees (we cannot since it is dataset dependent) so we can treat this general defaults.
893e09d
to
5791743
Compare
/ok to test |
I went back and forth about this a couple times too, and I ended up at the conclusion that taking the dataset's shape into account should add tiny bit more generalization to the default params, even if they aren't guaranteed to yield the highest recall. IMO, in the worst case, a user doesn't use the factory method and they get the same default params we've always provided and in the best case, they are able to use the factory method and they are happy w/ the resultling recall. |
/merge |
/ok to test |
/ok to test |
InnerProduct
Distance Metric for CAGRA search. InnerProduct in graph building is supported using IVF-PQ for building the graph. NNDescent does not currently support any other metric except L2Expanded.