Saving filenames in remote storage for data recovery without a repo #5938

adamsvystun · 2020-04-10T23:27:49Z

adamsvystun
Apr 10, 2020

Consider the following situation:
You create a git repo with DVC for data management, using S3 as remote storage. Say you then lose the repo (e.g. somebody force-pushes to master) and, with it, all the .dvc files. You want to recover the data solely from the S3 backend.

Currently, it is possible to recover the data, but without the filenames.
It would be nice if the filenames and any additional information for such situations were stored somewhere in remote storage.

karajan1001 · 2020-04-13T14:35:43Z

karajan1001
Apr 13, 2020

I think the file is still in the dvc remote cache dir. Maybe you can download the whole remote cache dir and check through all the files. The file size and creation time can give you useful information.
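
A minimal sketch of that manual inspection, assuming the remote cache has already been downloaded to a local directory. The function name `list_cache_files` is hypothetical and not part of DVC. Note that modification times are meaningful only if the download tool preserved them.

```python
import os
from datetime import datetime, timezone


def list_cache_files(cache_dir):
    """Return (path, size_bytes, mtime_utc) for every file under cache_dir, newest first."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            # Convert the raw timestamp to an aware UTC datetime for readable output.
            mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
            entries.append((path, st.st_size, mtime))
    # Newest files first, since recently added data is often what was lost.
    entries.sort(key=lambda e: e[2], reverse=True)
    return entries
```

Printing each tuple, for example with `print(mtime.isoformat(), size, path)`, gives a listing that can be scanned for files of a known size or date.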

efiop · 2020-04-13T14:42:22Z

efiop
Apr 13, 2020

@adamsvystun You could simply git push to your s3 repo; that will back it up. Also, force pushes are usually forbidden for important branches (e.g. on GitHub/GitLab/etc. it is really easy to do in the settings, and you could probably do that in plain git too), so it is worth doing that, since you might lose not only the dvc files but also your code. I don't think there is anything dvc could do except upload a copy of the git repo, which seems like a hack that goes too far. Dvc usually goes along with your git repo, so as long as you take care of the git repo, you will be fine. And in most workflows, code matters as much as, if not more than, data.

karajan1001 · 2020-04-13T15:03:20Z

karajan1001
Apr 13, 2020

@efiop

> And in most of the workflows, code matters as much, if not even more, than data.

Hmm, as a machine learning engineer, data is more important to me, especially manually labeled data.

efiop · 2020-04-13T15:33:44Z

efiop
Apr 13, 2020

@karajan1001 Touché, I went too far on that one 🙂 Thanks for the correction!

Thinking about it some more, theoretically we could store a list of filenames and their hashes in plain form, as tags of sorts. For example, for directories we store a .dir cache file, which has relpaths and their hashes in plain text form, so you would be able to find and recover a file/directory by path from our regular dvc remote pretty easily. We might consider something similar for standalone files and standalone dirs too. I guess one way to go about it is to dvc add your project root, which would save a .dir cache file with the structure of your repo. Not sure how useful that would be or how we would organize the UI for viewing that, but, nonetheless, it could be done.
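
To illustrate how a .dir cache file could be used for recovery, here is a rough sketch. It assumes the .dir file is a JSON list of objects with `md5` and `relpath` keys, and that cache objects are stored under a two-character prefix directory (`<md5[:2]>/<md5[2:]>`). Both are assumptions that should be checked against the DVC version in use. The function name is hypothetical.

```python
import json


def dir_entries_to_cache_paths(dir_file_text):
    """Map each relpath in a .dir file to the assumed location of its object in the cache."""
    entries = json.loads(dir_file_text)
    mapping = {}
    for entry in entries:
        md5 = entry["md5"]
        # Assumed cache layout: first two hex characters form the directory name.
        mapping[entry["relpath"]] = f"{md5[:2]}/{md5[2:]}"
    return mapping
```

Given such a mapping, each cache object could be copied back to its original relative path.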

efiop · 2020-04-13T15:37:00Z

efiop
Apr 13, 2020

Also, we have been thinking about creating some command for listing your raw cache to be able to find lost files by their size/mtime/etc (pretty much what @karajan1001 suggested earlier). Maybe we need something similar here, but for remotes.

Again, these are just thoughts off the top of my head. If you have any suggestions on what it could look like, please feel free to share 🙂

efiop · 2020-04-13T15:45:03Z

efiop
Apr 13, 2020

E.g. dvc add path/to/data (where the md5 of data is 12345) could create an empty file .dvc/cache/skeleton/path/to/data/12345 to mark that this path had this hash at some point. If data later changes and has an md5 of 54321, then dvc would create an empty file .dvc/cache/skeleton/path/to/data/54321. If data ever becomes a dir with the hash 33333.dir, then it would create an empty file .dvc/cache/skeleton/path/to/data/33333.dir. This way it would be simple to push/pull and would probably be a good enough solution for finding your lost files.

Maybe skeleton is not really needed and file/dirnames are enough, so we could flatten it.
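
The skeleton idea above can be sketched as follows. This is only an illustration of the proposal, not existing DVC behavior. The helper names `record_skeleton_marker` and `history_for` are hypothetical, and `data_path` is assumed to be relative to the project root.

```python
from pathlib import Path


def record_skeleton_marker(cache_dir, data_path, checksum):
    """Create an empty marker file <cache_dir>/skeleton/<data_path>/<checksum>."""
    marker = Path(cache_dir) / "skeleton" / data_path / checksum
    marker.parent.mkdir(parents=True, exist_ok=True)
    # The file stays empty: its path alone records "this path once had this hash".
    marker.touch(exist_ok=True)
    return marker


def history_for(cache_dir, data_path):
    """Return every checksum ever recorded for data_path, sorted by name."""
    folder = Path(cache_dir) / "skeleton" / data_path
    if not folder.is_dir():
        return []
    # Skip subdirectories, which belong to markers of nested tracked paths.
    return sorted(p.name for p in folder.iterdir() if p.is_file())
```

Because the markers are ordinary empty files, they could be pushed and pulled with the rest of the cache without any special handling.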

adamsvystun · 2020-04-13T16:24:28Z

adamsvystun
Apr 13, 2020
Author

@karajan1001

> Maybe you can download the whole remote cache dir and check through all the files. The file size and creation time can give you useful information.

Yes, I can do that, and it would allow me to recover the files, but it would be a little nicer if the filenames were also in the remote.

@efiop

> force pushes are usually forbidden for important branches

I would certainly know to set up the repo this way. But many people don't, and eventually somebody might lose all their data because of some intern and a bad configuration. It's unlikely, but the probability is nonzero. Maybe the probability is too low to justify making any changes :) I am not sure.

@efiop Also, more broadly, saving filenames on the remote would give the remote more 'independence' in a sense: you would be able to know what is what without the git repo. I'm not sure what your internal philosophy regarding DVC is, but I would consider that a step in a cool direction. Potentially you could have tools that browse the remote files based solely on the remote, without the need to sync up with the git repo. This might be a cool feature.

efiop · 2020-04-13T16:42:42Z

efiop
Apr 13, 2020

> @efiop Also, more broadly, saving filenames on the remote would give the remote more 'independence' in a sense: you would be able to know what is what without the git repo. I'm not sure what your internal philosophy regarding DVC is, but I would consider that a step in a cool direction. Potentially you could have tools that browse the remote files based solely on the remote, without the need to sync up with the git repo. This might be a cool feature.

@adamsvystun The reason I am admittedly hesitant is that it feels like duplicating git's functionality in a sense 🙂 But I understand your point and do see the value here.

@adamsvystun @karajan1001 What kind of functionality would you expect from such a feature/command? What if a remote is used by multiple dvc projects? Should it show files as if they are in the same dvc project?

shcheklein · 2020-04-13T16:51:40Z

shcheklein
Apr 13, 2020
Maintainer

My 2 cents on this: we can just dump any information about file names without expecting any additional functionality, just to help recover information if needed. Any mapping from hash to name is way better than having nothing. I think that alone would be a good step, at least toward making it more secure.

adamsvystun · 2020-04-13T16:58:40Z

adamsvystun
Apr 13, 2020
Author

I agree with @shcheklein. Simple, but good enough if you do lose something.

karajan1001 · 2020-04-14T07:11:03Z

karajan1001
Apr 14, 2020

@efiop In my opinion: right now the data is tracked by dvc, and the information about the data (the file name, or something else) is tracked by git. The mapping between them goes in only one direction. It's easy to find the data files from the .dvc files, but the opposite operation is much harder. In most cases, using git to track the .dvc files is enough. Making it more secure means we have to be able to recover the data using only information from the remote. As the remote now stores only data files, we would have to store more information in it. This can be done in three different ways:

1. An independent table file that maps names to data files.
2. Store the useful information at the beginning of each data file.
3. Store the information in a separate file for each data file, just like what .dvc files do now.

In solution 1, the table file is hard to maintain. We would have to pull it down and merge it before pushing it back up to the remote, or we might overwrite information from other users.
In solution 2, packing and unpacking steps are needed, and we can't use links to get better performance (a packed file differs from the original).
In solution 3, would we directly put the .dvc files on the remote and rename them?
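
The pull-and-merge step that solution 1 requires could look roughly like this. It is a sketch under the assumption that the table maps each hash to a list of names. The `merge_tables` helper is hypothetical, and the sketch does not address two users pushing at the same time, which would still need locking or conditional writes on the remote.

```python
def merge_tables(local, remote):
    """Union two hash -> [names] tables without dropping entries from either side."""
    merged = {h: set(names) for h, names in remote.items()}
    for h, names in local.items():
        # Keep every name ever seen for a hash, whichever side recorded it.
        merged.setdefault(h, set()).update(names)
    # Sort names so the written table is stable between runs.
    return {h: sorted(names) for h, names in merged.items()}
```

Only after this merge would the combined table be pushed back, so that entries written by other users are preserved.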

I don't know what the total benefits and costs are.
As one example of a benefit, the mapping information would solve the problem that we can't show useful information about the data files that are about to be deleted by dvc gc.

Mentioned in:
#1511
(I tried to solve it last weekend and found that showing the hash names of the data files gives little information.)

And finally, dvc recover would download all the files in the repository and rename them, maybe to a name like {original name}.{create time}.{git tag/git commit msg/git snapshot} or with some other useful information.
