feat: support vacuum inverted index #17291

Merged
merged 5 commits into from
Jan 20, 2025

Conversation

SkyFan2002
Member

@SkyFan2002 SkyFan2002 commented Jan 15, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR implements a new table function, fuse_vacuum_drop_inverted_index(), which cleans up the data of inverted indexes that have been dropped and are outside the retention period.

Implementation

A new key-value pair is added to the meta-service:

__fd_marked_deleted_table_index/table_id/index_name/index_version -> marked_deleted_index_meta

When an inverted index is dropped or replaced, a __fd_marked_deleted_table_index key-value pair is added for it.

When a vacuum is triggered, the meta-service checks the __fd_marked_deleted_table_index keys and filters out the indexes that are still within the retention period, based on MarkedDeletedIndexMeta.dropped_on.

The vacuum deletes the index data that is outside the retention period, identifying the index files by index_name and index_version. After that, the meta-service removes the corresponding __fd_marked_deleted_table_index/table_id/index_name/index_version key.
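
The key layout and the retention check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the struct here is a simplified stand-in for MarkedDeletedIndexMeta (only dropped_on is taken from the description), the function names marked_deleted_key and is_vacuumable are hypothetical, and the real key may carry additional components that this sketch ignores.

```rust
use std::time::{Duration, SystemTime};

/// Key prefix as given in the PR description.
const PREFIX: &str = "__fd_marked_deleted_table_index";

/// Simplified stand-in for the value stored under the marked-deleted key.
/// Only `dropped_on` comes from the description; the real type may differ.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MarkedDeletedIndexMeta {
    /// When the index was dropped or replaced.
    pub dropped_on: SystemTime,
}

/// Hypothetical helper: builds `<prefix>/<table_id>/<index_name>/<index_version>`.
pub fn marked_deleted_key(table_id: u64, index_name: &str, index_version: &str) -> String {
    format!("{PREFIX}/{table_id}/{index_name}/{index_version}")
}

/// Hypothetical helper: true once the retention period has elapsed since
/// `dropped_on`. Treating "exactly at the boundary" as vacuumable is an
/// assumption of this sketch.
pub fn is_vacuumable(meta: &MarkedDeletedIndexMeta, now: SystemTime, retention: Duration) -> bool {
    match now.duration_since(meta.dropped_on) {
        Ok(age) => age >= retention,
        // `dropped_on` lies in the future relative to `now`: keep the data.
        Err(_) => false,
    }
}

/// Returns the keys whose index data may be deleted, skipping entries that
/// are still inside the retention period.
pub fn select_vacuumable(
    entries: &[(String, MarkedDeletedIndexMeta)],
    now: SystemTime,
    retention: Duration,
) -> Vec<String> {
    entries
        .iter()
        .filter(|(_, meta)| is_vacuumable(meta, now, retention))
        .map(|(key, _)| key.clone())
        .collect()
}
```

In this sketch, the keys returned by select_vacuumable are the ones whose index files would be deleted first and whose marked-deleted keys would be removed afterwards.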

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@SkyFan2002 SkyFan2002 changed the title from "Vacuum inverted" to "feat: support vacuum inverted index" Jan 15, 2025
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jan 15, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jan 15, 2025
Member

@drmingdrmer drmingdrmer left a comment

@b41sh mentioned that Index and Table may not have a one-to-one mapping, as an aggregate index could be built from multiple tables. At the time I suggested adding table_id to the index key, I wasn’t aware of this. Should we reconsider the design of the marked-deleted key in light of this information?

Reviewed 13 of 21 files at r1, all commit messages.
Reviewable status: 13 of 21 files reviewed, 2 unresolved discussions (waiting on @SkyFan2002)


src/meta/app/src/schema/table.rs line 889 at r1 (raw file):

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct GetMarkedDeletedTableIndexesReply {
    pub table_indexes: HashMap<u64, Vec<(String, String, MarkedDeletedIndexMeta)>>,

What about introducing two type aliases, IndexName = String and IndexVersion = String, to improve readability?

Or make it stricter by defining an explicit type for the index version: struct IndexVersion(String);
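
For illustration, the two options could look roughly like this. It is only a sketch: MarkedDeletedIndexMeta is reduced to an empty placeholder, and the new and as_str helpers on the newtype are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Option 1: a plain alias. It documents intent, but any `String` still
/// type-checks in its place.
pub type IndexName = String;

/// Option 2: a newtype. A bare `String` can no longer be passed where an
/// index version is expected.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct IndexVersion(String);

impl IndexVersion {
    /// Hypothetical constructor for this sketch.
    pub fn new(version: impl Into<String>) -> Self {
        Self(version.into())
    }

    /// Hypothetical accessor for this sketch.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Placeholder for the real metadata type; its fields are omitted here.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MarkedDeletedIndexMeta;

/// The reply type from the review, rewritten with the named types.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct GetMarkedDeletedTableIndexesReply {
    /// Map from table id to its marked-deleted indexes.
    pub table_indexes: HashMap<u64, Vec<(IndexName, IndexVersion, MarkedDeletedIndexMeta)>>,
}
```

With the newtype, accidentally swapping the name and the version in a tuple becomes a compile error, whereas with two aliases of String it would go unnoticed.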


src/meta/api/src/schema_api_impl.rs line 2655 at r1 (raw file):

    #[logcall::logcall]
    #[fastrace::trace]
    async fn get_marked_deleted_table_indexes(

The implementation of this method is actually a list operation, so it should be named list_marked_deleted_table_indexes().

@SkyFan2002
Member Author

@drmingdrmer adding table_id to the index key makes vacuuming the indexes of a specific table more convenient and efficient.

Currently, we don't have indexes associated with multiple tables. If we add such indexes in the future, they would be fundamentally different from our current indexes. I propose that if we add such indexes in the future, we should use a separate key to store these dropped indexes.
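
To illustrate the efficiency point, here is a minimal sketch that uses a sorted in-memory map as a stand-in for the meta-service key space. The function list_for_table and the string key encoding are assumptions made for this example, not the meta-service API. Because table_id comes directly after the prefix, all marked-deleted indexes of one table are contiguous and can be listed with a single prefix scan.

```rust
use std::collections::BTreeMap;

/// Key prefix as given in the PR description.
const PREFIX: &str = "__fd_marked_deleted_table_index";

/// Hypothetical helper: lists the marked-deleted index keys of one table by
/// scanning only the keys that start with `<prefix>/<table_id>/`.
pub fn list_for_table<'a>(kv: &'a BTreeMap<String, String>, table_id: u64) -> Vec<&'a str> {
    // The trailing slash keeps table 4 from matching keys of table 42.
    let prefix = format!("{PREFIX}/{table_id}/");
    kv.range(prefix.clone()..)
        // Keys are sorted, so the scan can stop at the first non-matching key.
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, _)| key.as_str())
        .collect()
}
```

Without table_id in the key, finding the dropped indexes of a single table would require listing every marked-deleted entry and filtering afterwards.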

Member

@drmingdrmer drmingdrmer left a comment

Reviewed 5 of 21 files at r1, 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @SkyFan2002)


src/meta/app/src/schema/table.rs line 893 at r2 (raw file):

pub type IndexName = String;
pub type IndexVersion = String;

Since these aliases already exist, use them throughout the codebase to make the code more self-explanatory and improve readability.


src/meta/api/src/schema_api_impl.rs line 2679 at r2 (raw file):

                DirName::new_with_level(ident, 3)
            }
        };

Suggestion:

        let dir = {
            let table_id = table_id.unwrap_or_default();
            let ident = MarkedDeletedTableIndexIdIdent::new_generic(
                tenant,
                MarkedDeletedTableIndexId::new(
                    table_id,
                    "dummy".to_string(),
                    "dummy".to_string(),
                ),
            );
            DirName::new_with_level(ident, 2)
        };

src/meta/api/src/schema_api_impl.rs line 3249 at r2 (raw file):

        tenant: &Tenant,
        table_id: u64,
        indexes: &[(String, String)],

Is there a need to remove a collection of indexes in a single transaction? AFAIK, they are independent and can be removed one by one.

Code quote:

    async fn remove_marked_deleted_table_indexes(
        &self,
        tenant: &Tenant,
        table_id: u64,
        indexes: &[(String, String)],

@SkyFan2002 SkyFan2002 added this pull request to the merge queue Jan 17, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 17, 2025
@SkyFan2002 SkyFan2002 added this pull request to the merge queue Jan 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 20, 2025
@SkyFan2002 SkyFan2002 added this pull request to the merge queue Jan 20, 2025
Merged via the queue into databendlabs:main with commit 4db02ec Jan 20, 2025
70 of 71 checks passed
@SkyFan2002 SkyFan2002 deleted the vacuum_inverted branch January 20, 2025 06:11