Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tracking Issue] Deduplicate blob files #6265

Closed
4 of 6 tasks
Hocuri opened this issue Nov 26, 2024 · 4 comments
Closed
4 of 6 tasks

[Tracking Issue] Deduplicate blob files #6265

Hocuri opened this issue Nov 26, 2024 · 4 comments

Comments

@Hocuri
Copy link
Collaborator

Hocuri commented Nov 26, 2024

We would like to eventually deduplicate blob files.

This supersedes #5495 and #4309. We may be able to revert #5778 afterwards.

Motivation

Especially with Webxdc, there are a lot of duplicate files in the blobs directory, because when you get the same file sent twice then it will be saved twice.

Also, it would be nice to use random filenames because it may happen that the SQL database references a file that doesn't exist anymore, and if the user sends or receives a file with this filename then this new file will accidentally be shown in the place of the removed file.

Prerequisites

  • DC Desktop: Don't allow users to edit attachments Attachmentfiles on devices are rw - changes to attachements are not signaled to the user by DeltaChat deltachat-desktop#4352
  • Make sure that the actual name of the file isn't used in any UIs, instead dc_msg_get_filename() (C-FFI) or MessageObject.file_name (JsonRPC) needs to be used
  • Optional: Make sure that Param::Filename is set to the actual original filename
    • The problem here is that when sending a file, UIs rename the file before passing it to set_file(), and set_file() doesn't have an original_name parameter
    • We could simply ignore the problem because with the current plan (see below) all files will be renamed to a hash anyway, so, in most cases the original name will be free in the blobs dir. Not a super nice solution, but it would work.
    • Implemented in File deduplication #6332: Add a function set_file_and_deduplicate(&mut self, path: &str, original_name: &str, mime: Option<&str>) that is similar to set_file() but you can specify the original file name
      • @link2xt says that he thinks that we should do this, and also allow core to rename the file (as opposed to copying it, which set_file() is doing). It should be made to only work on files that already are in the blobs directory, in order to avoid accidentally moving a file that was still needed. Also, it should be allowed to immediately move the file (as opposed to set_file(), which will only rename the file when sending.
      • We need to make sure that things work fine with Preparing messages

Current plan

TL;DR: Save all files as <hash>.<extension>.

When inserting a file into the blobdir:

  • Compute the hash of the file content
    • Use Blake3 for hash because we have blake3 and iroh-blake3 dependencies anyway and iroh devs really like it. It is supposed to be much faster than other cryptographic hashes: https://peergos.org/posts/blake3
  • Check if the file <hash>.<extension> already exists; if yes: use the existing file (and to be safe, check that the content is still correct and overwrite it otherwise). Only if it doesn't exist yet, create it.
  • If possible: Set the file permissions to read-only.

Existing files will be kept as they are. Also, the existing set_file() function still won't deduplicate, only the new set_file_and_deduplicate() and when receiving messages.

Alternatives

  • Save the hash together with the filename in a SQL table in order to check if the file already exists.
    • Note that right now, guess_msgtype_from_suffix() uses the actual filename on the disk to guess the mime type; this means that we need to be careful if we deduplicate files that have different extensions.
  • Save the exact byte size together with the filename in a SQL table in order to check whether the file may already exist. If so, compare the files byte-by-byte.
    • This may be more performant, at least as long as there aren't a lot of files with the exact same size

Open questions

  • Is the performance of this solution good enough?
  • Should set_file_and_deduplicate() rename the file immediately before returning, asynchronously in the background, or when sending out the message?
    • Right now, it's renaming it immediately ebfore returning
@Septias
Copy link
Collaborator

Septias commented Nov 26, 2024

Maybe instead of [Tracking Issue] use a tag instead?

@iequidoo
Copy link
Collaborator

One of the reasons why #5495 exists is that it preserves original file names so that they are displayed as expected in external programs, this allows to avoid file copying. Though this requires the Delta Chat blobdir is traversable by that program which isn't true for all supported platforms.

@r10s
Copy link
Member

r10s commented Nov 29, 2024

i closed #5495 for now, we can cherry-pick or re-open as needed, but it does not make much sense to get that in beforehand and without considering this issue first. i also have the gut feeling that it is better to leave things in a flat structure.

#5495 would also only prevent copying if at the same time, things are set read-only, which is not that easy as it sounds iirc.

also, the copying is not that much of an issue as it affects exporting files only - not showing images or playing audio/videos inside delta chat. exporting is not done that often and only on direct user action, taking anyways a moment

@Hocuri Hocuri mentioned this issue Dec 11, 2024
5 tasks
Hocuri added a commit that referenced this issue Jan 21, 2025
When receiving messages, blobs will be deduplicated with the new
function `create_and_deduplicate_from_bytes()`. For sending files, this
adds a new function `set_file_and_deduplicate()` instead of
deduplicating by default.

This is for
#6265; read the
issue description there for more details.

TODO:
- [x] Set files as read-only
- [x] Don't do a write when the file is already identical
- [x] The first 32 chars or so of the 64-character hash are enough. I
calculated that if 10b people (i.e. all of humanity) use DC, and each of
them has 200k distinct blob files (I have 4k in my day-to-day account),
and we used 20 chars, then the expected value for the number of name
collisions would be ~0.0002 (and the probability that there is a least
one name collision is lower than that) [^1]. I added 12 more characters
to be on the super safe side, but this wouldn't be necessary and I could
also make it 20 instead of 32.
- Not 100% sure whether that's necessary at all - it would mainly be
necessary if we might hit a length limit on some file systems (the
blobdir is usually sth like
`accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs` (53 chars), plus
64 chars for the filename would be 117).
- [x] "touch" the files to prevent them from being deleted
- [x] TODOs in the code

For later PRs:
- Replace `BlobObject::create(…)` with
`BlobObject::create_and_deduplicate(…)` in order to deduplicate
everytime core creates a file
- Modify JsonRPC to deduplicate blob files
- Possibly rename BlobObject.name to BlobObject.file in order to prevent
confusion (because `name` usually means "user-visible-name", not "name
of the file on disk").

[^1]: Calculated with both https://printfn.github.io/fend/ and
https://www.geogebra.org/calculator, both of which came to the same
result
([1](https://github.com/user-attachments/assets/bbb62550-3781-48b5-88b1-ba0e29c28c0d),

[2](https://github.com/user-attachments/assets/82171212-b797-4117-a39f-0e132eac7252))

---------

Co-authored-by: l <[email protected]>
@Hocuri
Copy link
Collaborator Author

Hocuri commented Jan 24, 2025

I think this can be closed, it's implemented in core and there are open issues for the specific things that still need to be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants