-
Notifications
You must be signed in to change notification settings - Fork 10
Refactor src/index.rs
and Enhance Documentation
#87
Conversation
…Update Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Benchmark for c3d41c0Click to view benchmark
|
src/index.rs
Outdated
} | ||
let file = File::open(index_path)?; | ||
let reader = BufReader::new(file); | ||
let index: ResourceIndex = serde_json::from_reader(reader)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right that this changes the current file format to JSON?
In the main
branch, index is stored in plain-text format:
1675087824710 58922-3276384608 Info/crdt_vs_ot.jpg
1675087824610 125308-3041567246 Info/Watson-Crick & Hoogsteen.png
1675087831180 386929-63288935 Screenshots/Screenshot_20210717_120856.jpg
1675087831440 945565-3484272718 Screenshots/Screenshot_20220111_040205_com.adobe.reader.jpg
1675087832200 57186-2804307164 Screenshots/lenovo-fedora-20210427-023231-183683953.png
1675087832990 107862-4031618865 Screenshots/lenovo-fedora-20210328-050520-182040364.png
1675087833830 343648-561346264 Screenshots/Screenshot_20210804_003308.jpg
1675087834130 257481-490338418 Info/Technologies.jpg
1675087846130 415620-3764250547 Gif/сitrus.gif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have found that storing and loading the same index with the main
branch doesn't produce the same object. Not a matter of sorting but timestamps might be different as well. The code in the main
branch for storing and loading contains more logic that just writing the struct to a file.
The idea here was to make ResourceIndex
implement Serialize
. with this setup, we can serialize to various formats
but It's up to you, we can revert to the old file format if you prefer it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your idea, I just don't want to write huge JSON to disk.
Some compact format would be great: postcard, CBOR, what else could serve this purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On other hand, if we use automatic serialization, then the format will be excessive most likely: 2 mappings means creating two separate structures. Right now, we write only one table and derive 2 mappings from it during reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On other hand, if we use automatic serialization, then the format will be excessive
Good point. I noticed that as well.
I now reverted ResourceIndex::store()
and ResourceIndex::load()
to the old code. I just kept the code to make them serializable to have this possibility later.
I just wanted to add some notes about the current implementation of the 2 methods:
- we currently don't store or load information about collisions in
ResourceIndex
- I noticed some inconsistencies when trying to store an object of
ResourceIndex
and then load it. I noticed some differences not only in order but also inmodified_time
. I think this may be a problem with how we parse the time
I could move this conversation to an issue and work on a separate PR to solve it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tareknaser this is interesting
I noticed some inconsistencies when trying to store an object of ResourceIndex and then load it. I noticed some differences not only in order but also in modified_time. I think this may be a problem with how we parse the time
Could you tell more about the timestamps discrepancies? Probably steps to reproduce with ark-cli
? Yeah, we could create separate issue if this is a bug indeed.
we currently don't store or load information about collisions in ResourceIndex
Yes, but is it an issue? Collisions should be derivable from the table itself, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw test case fn index_entry_order()
could help debug lines ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you tell more about the timestamps discrepancies? Probably steps to reproduce with ark-cli? Yeah, we could create separate issue if this is a bug indeed.
I actually noticed this difference when I wrote a simple test
Here is the test to reproduce
#[test]
fn resource_index_load_store() {
let temp_dir = TempDir::new("arklib_test")
.expect("Failed to create temporary directory");
let temp_dir = temp_dir.into_path();
create_file_at(
temp_dir.to_owned(),
Some(FILE_SIZE_1),
Some(FILE_NAME_1),
);
let index = ResourceIndex::build(temp_dir.to_owned());
index
.store()
.expect("Should store index successfully");
let loaded_index = ResourceIndex::load(temp_dir.to_owned())
.expect("Should load index successfully");
// Assert that the loaded index is equal to the original index
assert_eq!(index, loaded_index);
}
When I run this test, I get this output
thread 'index::tests::resource_index_load_store' panicked at src/index.rs:824:9:
assertion `left == right` failed
left: ResourceIndex { id2path: {ResourceId { data_size: 10, crc32: 3817498742 }: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt"}, path2id: {"/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt": IndexEntry { modified: SystemTime { tv_sec: 1709340915, tv_nsec: 542881544 }, id: ResourceId { data_size: 10, crc32: 3817498742 } }}, collisions: {}, root: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX" }
right: ResourceIndex { id2path: {ResourceId { data_size: 10, crc32: 3817498742 }: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt"}, path2id: {"/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX/test1.txt": IndexEntry { modified: SystemTime { tv_sec: 1709340915, tv_nsec: 542000000 }, id: ResourceId { data_size: 10, crc32: 3817498742 } }}, collisions: {}, root: "/private/var/folders/fg/kw558kp56h574pchd9zylc500000gn/T/arklib_test.VcRhfirveNvX" }
Notice how tv_nsec
is different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like rounding.. If you could investigate why is it rounded, it would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand now
We're currently storing and loading timestamps in milliseconds, but SystemTime
relies on Timespec
under the hood which is:
struct Timespec {
tv_sec: i64,
tv_nsec: Nanoseconds,
}
I confirmed locally that if we store and parse the timestamp as nanoseconds, we don't run into this issue. I'll double-check before pushing the code.
pub fn index_new(&mut self, path: &dyn AsRef<Path>) -> Result<IndexUpdate> { | ||
log::debug!("Indexing a new path"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use name track_addition
if it's same function as here https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L359
/// Updates a single entry in the index with a new resource located at the | ||
/// specified path, replacing the old resource associated with the given | ||
/// ID. | ||
/// | ||
/// # Restrictions | ||
/// | ||
/// The caller must ensure that: | ||
/// * the index is up-to-date except for this single path | ||
/// * the path has been indexed before | ||
/// * the path maps into `old_id` | ||
/// * the content by the path has been modified | ||
/// | ||
/// # Errors | ||
/// | ||
/// Returns an error if the path does not exist, if the path is a directory | ||
/// or an empty file, if the index cannot find the specified path, or if | ||
/// the content of the path has not been modified. | ||
pub fn update_one( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me, that we should separate this method into track_deletion
and track_update
as it was done in #72
https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L393
https://github.com/ARK-Builders/arklib/blob/index/update-one/src/index.rs#L416
The idea is that track
methods serve as performance workaround when end-users know what they are doing. On other hand, update_all
should be efficient enough to be called all the time. But from API point of view, it track
methods are also simpler sometimes.
OPEN FOR DISCUSSION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so the API will look like:
track_addition()
,track_update()
andtrack_deletion()
when the user wants to be specific about the actionupdate_all()
otherwise
This looks like a clean API but I have some questions about other public methods
- what are the use cases for
ResourceIndex::provide()
? - we can now remove
forget_id()
in favor oftrack_deletion()
or they do different tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
ResourceIndex::provide()
is the starting point for any user/app. It combines index loading and updating in single step. This could be done by separate calls, but it seems to me that users will always need updated index before they do something. I can imagine rare use-cases where they need more control, so we should expose separate steps, too. -
Name
forget_id
is confusing now. In this PR, it is what is calledtrack_deletion
in the other PR. We have private functionforget_path
that is symmetric toinsert_entry
, both are called fromtrack
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResourceIndex::provide() is the starting point for any user/app. It combines index loading and updating in single step.
Got it
Thanks for clearing this out.
Name forget_id is confusing now.
The initial idea for this PR was to rewrite the file src/index.rs
as it is and then implement the changes to replace #72
Do you prefer to include all the changes in the current PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... It's difficult to keep in mind several PRs because the scope is pretty big.
Let's incorporate useful parts of #72 and unit tests from there into this PR.
// canonicalize the path to avoid duplicates | ||
match fs::canonicalize(&path) { | ||
Ok(canonical_path) => { | ||
discovered_files.insert(canonical_path, entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After canonicalization, we should trim the common prefix ("root").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's related to #86. Do you want me to implement it in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, we can do it in separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After spending some time modifying this PR based on #72, I've realized that it might not be a good idea to combine all the changes from both PRs into a single PR. #72 has significant API changes for ResourceIndex
, and I think we should discuss these changes further in #91 before proceeding.
This PR focuses only on refactoring certain parts of the code to enhance readability. It doesn't alter any logic for arklib
so it might be a good idea to keep it simple.
What do you think?
Signed-off-by: Tarek <[email protected]>
Signed-off-by: Tarek <[email protected]>
Benchmark for 1c39010Click to view benchmark
|
fn forget_path( | ||
&mut self, | ||
path: &CanonicalPath, | ||
path: &Path, | ||
old_id: ResourceId, | ||
) -> Result<IndexUpdate> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I would return just
deleted
collection, becauseadded
is always empty here. - Maybe name it
forget_entry
since path and id have same importance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would return just
deleted
collection, becauseadded
is always empty here.
That's why I suggest finalizing the API schema in #91 before implementing changes in a separate PR. #72 modifies IndexUpdate
, adds its Default
implementation, and integrates it into multiple API methods.
Trying to merge #72 changes into this PR while making further modifications along the way could become complicated.
/// Removes the given resource ID from the index and returns an update | ||
/// containing the deleted entries | ||
pub fn forget_id(&mut self, old_id: ResourceId) -> Result<IndexUpdate> { | ||
log::debug!("Forgetting a single entry in the index"); | ||
|
||
// Collect all paths associated with the old ID | ||
let mut old_paths = Vec::new(); | ||
for (path, entry) in &self.path2id { | ||
if entry.id == old_id { | ||
old_paths.push(path.clone()); | ||
} | ||
} | ||
|
||
// Remove entries from path2id and id2path | ||
for path in &old_paths { | ||
self.path2id.remove(path); | ||
} | ||
self.id2path.remove(&old_id); | ||
|
||
let mut deleted = HashSet::new(); | ||
deleted.insert(old_id); | ||
|
||
Ok(IndexUpdate { | ||
added: HashMap::new(), | ||
deleted, | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From index/update-one
branch:
/// The caller must ensure that:
/// * the index is up-to-date except this single id
/// * the resource with this id has been indexed before
/// * the resource with this id doesn't exist anymore
///
/// Should only be used if reactive updates are not possible.
pub fn track_deletion(&mut self, id: ResourceId) -> Result<IndexUpdate> {
log::debug!("Tracking a single deletion in the index");
let indexed_path = self.id2path.get(&id);
if indexed_path.is_none() {
return Err(ArklibError::Path(
"The id cannot be found in the index".into(),
));
}
let indexed_path = indexed_path.unwrap().clone();
self.forget_entry(indexed_path.as_canonical_path(), id)?;
Ok(IndexUpdate::default().deleted(id))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in forget_id in #72 don't work on their own because ResourceIndex::forget_entry()
and IndexUpdate
were also modified.
Signed-off-by: Tarek <[email protected]>
Benchmark for 3ca26c4Click to view benchmark
|
Previously, the IndexEntry::modified field was parsed and loaded from miliseconds, which caused a rounding issue when comparing the loaded index with the stored one. This was due to SystemTime relying on Timespec, which includes nanoseconds. To address this issue, this commit updates the parsing and loading of IndexEntry::modified to use nanoseconds instead of miliseconds. Additionally, a new test resource_index_load_store has been added to verify that the stored index is equal to the loaded one. Signed-off-by: Tarek <[email protected]>
…as broken invariants Signed-off-by: Tarek <[email protected]>
The function load_raw_metadata() is only called in unit tests and is not used elsewhere in the codebase. Signed-off-by: Tarek <[email protected]>
Benchmark for b6ce103Click to view benchmark
|
About change from milliseconds to nanoseconds in timestamps: we should double-check that we really can have nanoseconds on mobile platforms. I don't remember precisely, but it seems we had an issue with nanoseconds due to implicit rounding to milliseconds. Could be so even on Linux. We should investigate why this threshold was introduced: 267b844#diff-701633ed548410624cfcae90d38a986adbd60deb5aeedd92355fe2e7702a9c18L34 |
load_raw_metadata was a method defined to read the metadata file and return its content. This commit removes it and its unit test because it is not used anywhere in the code and seems to be redundant Signed-off-by: Tarek <[email protected]>
Before making any changes to If we must keep the millisecond precision, we have two options to consider:
let duration = modified
.duration_since(UNIX_EPOCH)
.expect("SystemTime before UNIX EPOCH!")
.as_millis();
let modified =
UNIX_EPOCH + std::time::Duration::from_millis(duration as u64); |
@tareknaser we could use nanoseconds, but:
I could imagine us needing microseconds to timestamp realistically filesystem events. Would we have the rounding issues in this case?
This sounds like the way to go.
Is this sill a problem if we go the way with converting to milliseconds in |
Yes I think milliseconds is enough. Only issue is #87 (comment)
No. that should solve it |
The code previously retrieved the last modified time with nanosecond precision. To ensure compatibility across different file systems, particularly on platforms like Android, where nanosecond precision might not be supported or could lead to compatibility issues, the code now adjusts the precision to milliseconds only. Signed-off-by: Tarek <[email protected]>
Description
This PR focuses on refactoring the
src/index.rs
file.Changes
The changes primarily include:
serde::{Deserialize, Serialize}
for loading and storing of the index filediscover_paths
function todiscover_files
TempDir
for testing#[cfg(target_family = "unix")]
)Edit
Fixes #89 with 633335a