Skip to content
This repository has been archived by the owner on Sep 16, 2024. It is now read-only.

Refactor the API for ResourceIndex #91

Closed
tareknaser opened this issue Feb 28, 2024 · 4 comments
Closed

Refactor the API for ResourceIndex #91

tareknaser opened this issue Feb 28, 2024 · 4 comments
Assignees

Comments

@tareknaser
Copy link
Collaborator

This issue is an open discussion to changes proposed to the API in src/index.rs.

Current situation

pub struct ResourceIndex {
    /// A mapping of resource IDs to their corresponding file paths
    id2path: HashMap<ResourceId, PathBuf>,
    /// A mapping of file paths to their corresponding index entries
    path2id: HashMap<PathBuf, IndexEntry>,
    /// A mapping of resource IDs to the number of collisions they have
    pub collisions: HashMap<ResourceId, usize>,
    /// The root path of the index
    root: PathBuf,
}


impl ResourceIndex {
/// Returns the number of entries in the index
///
/// Note that the amount of resource can be lower in presence of collisions
pub fn count_files(&self) -> usize;

/// Returns the number of resources in the index
pub fn count_resources(&self) -> usize;

/// Builds a new resource index from scratch using the root path
///
/// This function recursively scans the directory structure starting from
/// the root path, constructs index entries for each resource found, and
/// populates the resource index
pub fn build<P: AsRef<Path>>(root_path: P) -> Self;

/// Loads a previously stored resource index from the root path
///
/// This function reads the index from the file system and returns a new
/// [`ResourceIndex`] instance. It looks for the index file in
/// `$root_path/.ark/index`.
///
/// Note that the loaded index can be outdated and `update_all` needs to
/// be called explicitly by the end-user. For automated updating and
/// persisting the new index version, use [`ResourceIndex::provide()`] method.
pub fn load<P: AsRef<Path>>(root_path: P) -> Result<Self>;    

/// Stores the resource index to the file system
///
/// This function writes the index to the file system. It writes the index
/// to `$root_path/.ark/index` and creates the directory if it's absent.
pub fn store(&self) -> Result<()>;

/// Provides the resource index, loading it if available or building it from
/// scratch if not
///
/// If the index exists at the provided `root_path`, it will be loaded,
/// updated, and stored. If it doesn't exist, a new index will be built
/// from scratch
pub fn provide<P: AsRef<Path>>(root_path: P) -> Result<Self>;

/// Updates the index based on the current state of the file system
///
/// Returns an [`IndexUpdate`] object containing the paths of deleted and
/// added resources
pub fn update_all(&mut self) -> Result<IndexUpdate>;

/// Indexes a new entry identified by the provided path, updating the index
/// accordingly.
///
/// The caller must ensure that:
/// - The index is up-to-date except for this single path
/// - The path hasn't been indexed before
///
/// Returns an error if:
/// - The path does not exist
/// - Metadata retrieval fails
pub fn index_new(&mut self, path: &dyn AsRef<Path>) -> Result<IndexUpdate>;


/// Updates a single entry in the index with a new resource located at the
/// specified path, replacing the old resource associated with the given
/// ID.
///
/// # Restrictions
///
/// The caller must ensure that:
/// * the index is up-to-date except for this single path
/// * the path has been indexed before
/// * the path maps into `old_id`
/// * the content by the path has been modified
///
/// # Errors
///
/// Returns an error if the path does not exist, if the path is a directory
/// or an empty file, if the index cannot find the specified path, or if
/// the content of the path has not been modified.
pub fn update_one(
	&mut self,
	path: &dyn AsRef<Path>,
	old_id: ResourceId,
) -> Result<IndexUpdate>;

/// Inserts an entry into the index, updating associated data structures
///
/// If the entry ID already exists in the index, it handles collisions
/// appropriately
fn insert_entry(&mut self, path: PathBuf, entry: IndexEntry);

/// Removes the given resource ID from the index and returns an update
/// containing the deleted entries
pub fn forget_id(&mut self, old_id: ResourceId) -> Result<IndexUpdate>;

/// Removes an entry with the specified path and updates the collision
/// information accordingly
///
/// Returns an update containing the deleted entries
fn forget_path(
	&mut self,
	path: &Path,
	old_id: ResourceId,
) -> Result<IndexUpdate>;

Some of the issues with the current structure are:

  • We have 2 mappings in ResoureIndex which prevents us from simply serializing the index to a file
  • Methods such as provide may be confusing when compared to other methods

Proposal

  • We could consider this structure for ResourceIndex
#[derive(PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct IndexedResource {
    pub id: ResourceId,
    pub path: PathBuf,
    pub last_modified: SystemTime,
}

#[derive(PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct ResourceIndex {
    pub resources: Vec<IndexedResource>,
    pub root: PathBuf,
}
  • To port the conversation from Refactor src/index.rs and Enhance Documentation #87 (comment) : We should consider refactoring the methods so that the API has 2 halves:
    • "Snapshot API" allowing to query particular paths or ids, separate functions for relative and absolute paths. We could also implement something like Iter interface.
    • "Reactive API" allowing updates handling without explicit queries. It can be done in pull-based manner like now (update_all), push-based using Tokio streams and/or registering handlers which would be called automatically (variant of push-based)
@tareknaser tareknaser self-assigned this Feb 28, 2024
@tareknaser
Copy link
Collaborator Author

Related discussion: #87 (comment)

@tareknaser
Copy link
Collaborator Author

Just to port our discussion from the call earlier today:

Proposed Resource Index API

Reactive API

  • update: Method to update the index by rescanning files and returning changes (additions/deletions/updates).

Snapshot API

  • get_resource_by_id: Query a resource from the index by its ID.
  • get_resource_by_path: Query a resource from the index by its path.

Track API

  • track_addition: Track a newly added file (checks if the file exists in the file system).
  • track_deletion: Track the deletion of a file (checks if it was actually deleted).
  • track_update: Track an update on a single file

File System Watcher

  • A method to watch a specific folder for updates, automatically updating the resource index and storing changes if any modifications are detected. (using notify)

Additionally, ResourceIndex should maintain the double hashmaps for storage

@kirillt
Copy link
Member

kirillt commented Mar 16, 2024

Thanks @tareknaser for putting this all together. Looks correct.

Auxiliary comments about these APIs:

  1. Reactive API is the main API, it detects updates which can be pulled by the user/app.

  2. Snapshot API is a simpler to use API for experimental apps or for apps without performance requirements.

  3. Track API is an additional API for advanced developers. Developers must understand what they do when they use this API. Use-cases include saving a call to Reactive API when the user/app precisely knows what entry has been changed. Another use-case is implementing a foreground service in Android which would push updates using intents to subscribed apps. In this case, the subscriber app must update local index correspondingly and this code would be done by us as part of Android API.

  4. File System Watcher can be seen as push-based version of Reactive API.

Additionally, ResourceIndex should maintain the double hashmaps for storage.

That's only for the sake of performance. But if benchmarks show that it isn't the best thing, or that simpler structure is enough, then we can use something else.

@kirillt kirillt moved this to In Progress in Development Mar 16, 2024
@tareknaser
Copy link
Collaborator Author

That's only for the sake of performance. But if benchmarks show that it isn't the best thing, or that simpler structure is enough, then we can use something else.

UPDATE:
I tried updating ResourceIndex to use a vector of resources instead of double mappings. This change mainly impacted the Snapshot API and led to a significant hit in performance. It might be better to stick with the current structure.

Before

pub struct ResourceIndex {
    /// The root path of the index (canonicalized)
    pub(crate) root: PathBuf,
    /// A map from resource IDs to resources
    pub(crate) id_to_resource: HashMap<HashType, IndexedResource>,
    /// A map from resource paths to resources
    pub(crate) path_to_resource: HashMap<PathBuf, IndexedResource>,
    #[cfg(feature = "non-cryptographic-hash")]
    /// A map from resource IDs to the number of collisions (i.e., resources with the same ID)
    pub(crate) collisions: HashMap<HashType, usize>,
}

After

#[derive(Clone, Debug)]
pub struct ResourceIndex {
    /// The root path of the index (canonicalized)
    pub(crate) root: PathBuf,
    // Indexed resources
    pub(crate) resources: Vec<IndexedResource>,
    #[cfg(feature = "non-cryptographic-hash")]
    /// A map from resource IDs to the number of collisions (i.e., resources with the same ID)
    pub(crate) collisions: HashMap<HashType, usize>,
}

Here is the benchmarks changes from the double mappings to the vector:

     Running benches/index_query_benchmarks.rs (target/release/deps/bench_get_resource-daa9ca3ecffcaba9)
Gnuplot not found, using plotters backend
index_query/get_resource_by_id
                        time:   [1.6808 µs 1.6840 µs 1.6884 µs]
                        change: [+2612.5% +2626.4% +2643.6%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
index_query/get_resource_by_path
                        time:   [37.287 µs 37.326 µs 37.365 µs]
                        change: [+128223% +128390% +128558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
index_query/update_index
                        time:   [16.421 ms 16.444 ms 16.473 ms]
                        change: [+3.9468% +4.8545% +5.7332%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
index_query/track_addition
                        time:   [16.414 µs 16.430 µs 16.448 µs]
                        change: [-2.3982% -2.1804% -1.9777%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
index_query/track_removal
                        time:   [47.164 µs 47.184 µs 47.211 µs]
                        change: [+2065.1% +2074.3% +2081.2%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

2 participants