Saving filenames in remote storage for data recovery without a repo #5938
Replies: 11 comments
-
I think the file is still in the DVC remote cache directory. You could download the whole remote cache directory and go through the files. File size and creation time can give you useful information.
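A minimal sketch of that manual search, assuming the remote has already been downloaded to a local directory and that the download tool preserved modification times (otherwise the timestamps from the storage provider's own object listing are the ones to look at). The function name `list_cache_files` is made up for illustration and is not part of DVC.

```python
import os
from datetime import datetime, timezone


def list_cache_files(cache_dir):
    """Return (relative_path, size_bytes, mtime_utc) for every file
    under cache_dir, newest first."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            st = os.stat(full)
            mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
            rel = os.path.relpath(full, cache_dir)
            entries.append((rel, st.st_size, mtime))
    # Newest files first, so recently pushed data shows up at the top.
    entries.sort(key=lambda e: e[2], reverse=True)
    return entries


# Example usage (the directory name is hypothetical):
# for rel, size, mtime in list_cache_files("downloaded-remote"):
#     print(f"{mtime:%Y-%m-%d %H:%M:%S}  {size:>12}  {rel}")
```

Sorting by time and eyeballing sizes is usually enough to narrow a large cache down to a handful of candidates.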
-
@adamsvystun You could simply git push to your s3 repo, and that will back it up. Also, force pushes are usually forbidden for important branches (e.g. on github/gitlab/etc it is really easy to set up in the settings, and you could probably do it in plain git too), so it is worth doing that, since you might lose not only the dvc files but also your code. I don't think there is anything dvc could do except upload a copy of the git repo, which seems like a hack that goes too far. Dvc usually goes along with your git repo, so as long as you take care of the git repo, you will be fine. And in most workflows, code matters as much as data, if not more.
-
Emmm, as a machine learning engineer, data is more important to me, especially manually labeled data.
-
@karajan1001 Touché, I went too far on that one 🙂 Thanks for the correction! Thinking about it some more, theoretically we could store a list of filenames and their hashes in plain form, as tags of sorts. For example, for directories, we store
-
Also, we have been thinking about creating a command for listing your raw cache, so you can find lost files by their size/mtime/etc (pretty much what @karajan1001 suggested earlier). Maybe we need something similar here, but for remotes. Again, these are thoughts off the top of my head. If you have any suggestions on what it could look like, please feel free to share 🙂
-
E.g. maybe the skeleton is not really needed and file/dir names are enough, so we could flatten it.
-
Yes, I can do that, and it would allow me to recover the files, but it would be a little nicer if the filenames were also in the remote.
I would certainly know to set up the repo properly in this way. But many people don't do that, and eventually somebody might lose all their data because of some intern and a bad configuration. It's unlikely, but the probability is nonzero. Maybe the probability is too low to justify any changes :) I am not sure. @efiop Also, more broadly, saving filenames on the remote would give the remote more 'independence' in a sense: you would be able to tell what is what without the git repo. I'm not sure what your internal philosophy regarding DVC is, but I would consider that a step in a cool direction. Potentially you could have tools that browse the remote files based solely on the remote, without the need to sync with the git repo. That might be a cool feature.
-
@adamsvystun I am admittedly hesitant because it feels like duplicating git's functionality in a sense 🙂 But I understand your point and do see the value here. @adamsvystun @karajan1001 What kind of functionality would you expect from such a feature/command? What if a remote is used by multiple dvc projects? Should it show files as if they were in the same dvc project?
-
My 2 cents on this: we can just dump the information about file names without expecting any additional functionality, just to help recover information if needed. Any mapping from hash to name is way better than having nothing. I think that alone would be a good step, at least toward making it more secure.
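To make the "dump a mapping from hash to name" idea concrete, here is a rough sketch. It assumes the remote names each object after the MD5 of the file's content; some DVC versions normalize line endings in text files before hashing, so hashes for those files may not match. The function names and the JSON layout are invented for illustration and are not an existing DVC feature.

```python
import hashlib
import json
import os


def file_md5(path, chunk_size=1024 * 1024):
    """MD5 of a file's raw bytes, read in chunks to keep memory flat."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(data_dir):
    """Map each content hash to the sorted list of relative paths that have it."""
    manifest = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, data_dir).replace(os.sep, "/")
            manifest.setdefault(file_md5(full), []).append(rel)
    return {h: sorted(paths) for h, paths in manifest.items()}


def write_manifest(data_dir, out_path):
    """Write the hash -> names table as JSON; uploading it is left to the user."""
    manifest = build_manifest(data_dir)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest
```

A file like this, uploaded next to the data, would be enough to put names back on recovered objects, with no other functionality attached.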
-
I agree with @shcheklein. Simple, but good enough if you do lose something.
-
@efiop My opinion: right now the data is tracked by dvc, and the information about the data (file name or something else) is tracked by git. The mapping between them goes in only one direction. It's easy to find data files from dvc files, but the opposite operation is much harder. In most cases using git to track the .dvc file is enough. Making it more secure means we have to be able to recover data using only information from the remote. Since the remote currently stores only the data files, we would have to store more information there. This can be done in three different ways:
In solution 1, this table file is hard to maintain. We would have to pull it down and merge it before pushing it up to the remote, or we might overwrite information from other users. I don't know what the total benefits and costs are. mentioned in: And finally
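The pull-and-merge step for such a table file could look roughly like this. The table format (a dict from content hash to a list of file names) and the function name `merge_manifests` are assumptions made for the sketch, not something DVC defines.

```python
def merge_manifests(remote, local):
    """Union two hash -> [names] tables so neither side's entries are lost."""
    merged = {h: sorted(set(paths)) for h, paths in remote.items()}
    for h, paths in local.items():
        # Keep names already known on the remote and add the local ones.
        merged[h] = sorted(set(merged.get(h, [])) | set(paths))
    return merged


# Example: pull the remote table, merge, then push the result back.
remote_table = {"a1b2": ["train.csv"]}
local_table = {"a1b2": ["data/train.csv"], "c3d4": ["labels.json"]}
merged_table = merge_manifests(remote_table, local_table)
```

Note that this only avoids overwriting entries when pushes happen one at a time. Two users pushing at the same moment could still race, which is part of why the table is hard to maintain.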
-
Consider the following situation:
You create a git repo with DVC for data management. You use S3 remote storage. Let's say you then lose the repo (e.g. somebody force pushes into master) and consequently lose all the .dvc files. You want to recover the data solely from the S3 backend. Currently, it is possible to recover the data, but without the filenames.
It would be nice if the filenames and any additional information for such situations were stored somewhere in remote storage.