[Tracking Issue] Deduplicate blob files #6265
Comments
Maybe instead of |
One of the reasons #5495 exists is that it preserves original file names, so they are displayed as expected in external programs; this avoids file copying. However, this requires that the Delta Chat blobdir be traversable by that program, which isn't true on all supported platforms.
I closed #5495 for now; we can cherry-pick or re-open as needed, but it does not make much sense to get that in beforehand without considering this issue first. I also have the gut feeling that it is better to leave things in a flat structure. #5495 would also only prevent copying if, at the same time, things are set read-only, which is not as easy as it sounds, iirc. Also, the copying is not that much of an issue, as it affects exporting files only, not showing images or playing audio/videos inside Delta Chat. Exporting is not done that often and only on direct user action, which takes a moment anyway.
When receiving messages, blobs will be deduplicated with the new function `create_and_deduplicate_from_bytes()`. For sending files, this adds a new function `set_file_and_deduplicate()` instead of deduplicating by default.

This is for #6265; read the issue description there for more details.

TODO:

- [x] Set files as read-only
- [x] Don't do a write when the file is already identical
- [x] The first 32 chars or so of the 64-character hash are enough. I calculated that if 10b people (i.e. all of humanity) use DC, and each of them has 200k distinct blob files (I have 4k in my day-to-day account), and we used 20 chars, then the expected value for the number of name collisions would be ~0.0002 (and the probability that there is at least one name collision is lower than that) [^1]. I added 12 more characters to be on the super safe side, but this wouldn't be necessary and I could also make it 20 instead of 32.
  - Not 100% sure whether that's necessary at all - it would mainly be necessary if we might hit a length limit on some file systems (the blobdir is usually sth like `accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs` (53 chars), plus 64 chars for the filename would be 117).
- [x] "touch" the files to prevent them from being deleted
- [x] TODOs in the code

For later PRs:

- Replace `BlobObject::create(…)` with `BlobObject::create_and_deduplicate(…)` in order to deduplicate every time core creates a file
- Modify JsonRPC to deduplicate blob files
- Possibly rename BlobObject.name to BlobObject.file in order to prevent confusion (because `name` usually means "user-visible-name", not "name of the file on disk").

[^1]: Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result ([1](https://github.com/user-attachments/assets/bbb62550-3781-48b5-88b1-ba0e29c28c0d), [2](https://github.com/user-attachments/assets/82171212-b797-4117-a39f-0e132eac7252))

---------

Co-authored-by: l <[email protected]>
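The collision estimate quoted in the TODO list can be roughly reproduced with a birthday-style approximation. The sketch below is only an illustration under stated assumptions: the 64-character hash is hex-encoded (4 bits per character), names are uniformly random, and a collision only matters between two files inside the same blobdir, so the per-user expectation is multiplied by the number of users. The function names are hypothetical and not part of core.

```rust
/// Expected number of colliding pairs among `n` names drawn uniformly
/// from `2^bits` possibilities: C(n, 2) / 2^bits.
fn expected_collisions(n: f64, bits: i32) -> f64 {
    n * (n - 1.0) / 2.0 / 2f64.powi(bits)
}

/// Expected collisions summed over `users` independent blobdirs.
/// Assumption: each hex character of the file name carries 4 bits.
fn expected_collisions_total(users: f64, blobs_per_user: f64, hex_chars: i32) -> f64 {
    users * expected_collisions(blobs_per_user, hex_chars * 4)
}
```

With the numbers from the comment, `expected_collisions_total(1e10, 2e5, 20)` comes out at about 1.7e-4, in line with the ~0.0002 quoted above; with 32 characters the value drops by a further factor of 2^48.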
I think this can be closed; it's implemented in core, and there are open issues for the specific things that still need to be done.
We would like to eventually deduplicate blob files.
This supersedes #5495 and #4309. We may be able to revert #5778 afterwards.
Motivation
Especially with Webxdc, there are a lot of duplicate files in the blobs directory, because when the same file is received twice, it is saved twice.
Also, it would be nice to use random filenames because it may happen that the SQL database references a file that doesn't exist anymore, and if the user sends or receives a file with this filename then this new file will accidentally be shown in the place of the removed file.
Prerequisites

- `dc_msg_get_filename()` (C-FFI) or `MessageObject.file_name` (JsonRPC) needs to be used
- `Param::Filename` is set to the actual original filename
- `set_file()`, and `set_file()` doesn't have an `original_name` parameter
- `set_file_and_deduplicate(&mut self, path: &str, original_name: &str, mime: Option<&str>)` that is similar to `set_file()` but you can specify the original file name
- `set_file()` is doing). It should be made to only work on files that already are in the blobs directory, in order to avoid accidentally moving a file that was still needed. Also, it should be allowed to immediately move the file (as opposed to `set_file()`, which will only rename the file when sending).

Current plan
TL;DR: Save all files as `<hash>.<extension>`.

When inserting a file into the blobdir:

- `blake3` and `iroh-blake3` dependencies anyway and iroh devs really like it. It is supposed to be much faster than other cryptographic hashes: https://peergos.org/posts/blake3
- `<hash>.<extension>` already exists; if yes: use the existing file (and to be safe, check that the content is still correct and overwrite it otherwise). Only if it doesn't exist yet, create it.

Existing files will be kept as they are. Also, the existing `set_file()` function still won't deduplicate, only the new `set_file_and_deduplicate()` and when receiving messages.

Alternatives
- `guess_msgtype_from_suffix()` uses the actual filename on the disk to guess the mime type; this means that we need to be careful if we deduplicate files that have different extensions.

Open questions
- Should `set_file_and_deduplicate()` rename the file immediately before returning, asynchronously in the background, or when sending out the message?
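
As a rough illustration of the insertion steps in the current plan (name the file `<hash>.<extension>`, reuse an existing file if its content still matches, overwrite it if not, create it otherwise), here is a minimal sketch using only the standard library. It is not the code in core: `store_deduplicated`, the `hash_hex` parameter and `stand_in_hash_hex` are hypothetical names, and the stand-in hasher is not cryptographic; it is only there so the example runs without extra dependencies. With the `blake3` crate one would presumably pass something like `|b| blake3::hash(b).to_hex().to_string()` instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Placeholder digest for the example only. NOT cryptographic;
/// the plan in this issue is to use BLAKE3 here.
fn stand_in_hash_hex(bytes: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Stores `bytes` in `blobdir` as `<hash>.<extension>` and returns the path.
/// `hash_hex` is a hypothetical hook that returns the hex digest of the content.
fn store_deduplicated(
    blobdir: &Path,
    bytes: &[u8],
    extension: &str,
    hash_hex: impl Fn(&[u8]) -> String,
) -> io::Result<PathBuf> {
    let digest = hash_hex(bytes);
    let name = if extension.is_empty() {
        digest
    } else {
        format!("{digest}.{extension}")
    };
    let path = blobdir.join(name);

    match fs::read(&path) {
        // The file already exists with the expected content: reuse it, no write.
        Ok(existing) if existing == bytes => return Ok(path),
        // The file exists but its content is wrong: overwrite it below.
        Ok(_) => {}
        // The file does not exist yet: create it below.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    fs::write(&path, bytes)?;
    Ok(path)
}
```

Because the extension is part of the name, identical content stored as `txt` and as `md` ends up in two separate files in this sketch, which sidesteps the `guess_msgtype_from_suffix()` concern at the cost of not deduplicating across extensions.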