DVC prevent duplicate files in repository #8617

apethree · 2022-11-24T08:37:25Z

Hello,

My use case is to store large data files (structured/unstructured) in repos as training data, but unique files only. Two employees might be working on the same file, without knowing it, in two different repos/folders.

Example: Engineer 1 is working on FileA and has its versions and clean data, and Engineer 2 has FileB. But what if they were both working on the same file? I would like DVC to compute the hash on commit, prevent any other commit if a file with the same hash already exists, and show some kind of error.

Ex:

Collection
        --- DATE 1 -- File 1 Success
                           -- File 2 Success
        --- DATE 2 -- File 2 Success
                           -- File 3 Success
                           -- File 1 Fail (I would like to be notified and have this file rejected, because it is duplicate training data)

Not sure how to achieve this workflow. I do know DVC does a good job of managing versions of the same file. What I am after is very similar to a UNIQUE constraint in SQL.

shcheklein (Maintainer) · 2022-11-24T18:40:16Z

@apethree could you give a bit more details? what is the use case to prevent multiple ML engineers using the same data for training?

What is the layout of the project - one repo, multiple repos, do you use DVC data registry?


apethree (Author) · 2022-11-25T01:12:58Z

Thanks for your response.

Typical pipeline:

Ingestion (DVC to manage these files but also avoid duplicates)
Pre Process ( If we solved this issue on ingestion layer we won't need to check duplicates here)
Extract data using NLP model
Use data for training

Problem:
Yes, multiple ML engineers (and our pipeline) unknowingly run ML models on the same files for NLP analysis. Our data ingestion process is manual: engineers upload/add training/processing files to DVC, and we store the results in a DB. We would like to prevent redundant data at the ingestion layer. Otherwise we waste compute resources in all stages of the pipeline, and we also end up adding more results to our DB thinking we processed different files when we actually processed the same files over and over again.

Solutions looking for:
Ideally we would want DVC to either notify us about or prevent any redundancy in files being added to the ingestion repo, for example. So keep only one copy of each file, and if it finds a duplicate we can just link it to the existing file or, even better, just delete/ignore it. We can keep the pre-processed files in a different repo, or we can keep all files in the same big repo and add some kind of trigger or action that compares the hash of the new files with the existing files and takes some action if there is a hash match. DVC already computes the MD5 hash of each file, so ideally we could use that info to prevent duplicates.
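As a rough illustration of the "compare the hash of new files with existing files" idea, here is a minimal sketch that keeps a registry of content hashes outside DVC and relies on a SQL UNIQUE constraint to reject repeats. The table name `ingested` and the helpers `open_registry` and `register` are hypothetical names for this example, not part of DVC; the hashes are computed directly with Python's `hashlib` rather than read from DVC metadata.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_registry(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical hash registry."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingested ("
        "md5 TEXT NOT NULL UNIQUE, path TEXT NOT NULL)"
    )
    return conn


def register(conn: sqlite3.Connection, path: Path) -> bool:
    """Return True if the content is new, False if the same hash is already registered."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ingested (md5, path) VALUES (?, ?)",
                (file_md5(path), str(path)),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint on md5 rejected a duplicate.
        return False
```

An ingestion script could call `register` before adding a file to DVC and refuse to continue when it returns False.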

Example:
10 different engineers are working on ML image analysis. 5 engineers unknowingly pick up the same training/sample dataset, and they all add that initial dataset to DVC. Now we have 10X redundancy, and so on as more pre-processing is done and models are generated. So for the ingestion repo* we would want to check for duplicates.

7 replies

shcheklein Nov 25, 2022
Maintainer

What is the structure of the repo atm?

> ended up with essentially 3X storage wasted coz of duplicate files

storage should not be taking 3x even if you have duplicates. DVC stores objects in a content-addressable way (by their md5s), so they should be deduplicated in the DVC cache and DVC remote. Make sure that links are enabled though - https://dvc.org/doc/user-guide/data-management/large-dataset-optimization
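To illustrate why content-addressable storage deduplicates on its own, here is a toy sketch of a store keyed by MD5. It is not DVC's implementation; the function `add_to_store` and the directory layout (first two hex characters as a subdirectory) are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def add_to_store(src: Path, store: Path) -> Path:
    """Copy src into a toy content-addressed store and return the stored path.

    The stored location depends only on the file's content, never on its name,
    so adding identical content twice results in a single stored object.
    """
    # read_bytes() is fine for a small demo; large files should be hashed in chunks.
    digest = hashlib.md5(src.read_bytes()).hexdigest()
    target = store / digest[:2] / digest[2:]  # assumed layout for this sketch
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, target)
    return target
```

Two files with different names but identical bytes map to the same `target`, which is why duplicates should cost almost no extra space in a store like this.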

> keep data more clean and structured

could you give a bit more details on this please?

apethree Nov 25, 2022
Author

Thanks for reminding me about the links, I had totally forgotten about those. Based on a quick check I believe we do have links enabled. I still need to investigate and figure out why there is such a huge difference in storage. I think I need to read more about how the cache is handled and processed in DVC.

Something that happened recently that made us look into restructuring DVC ( Keep data more clean and structured ):

Two engineers on my team spent a week cleaning and pre-processing the same data file, but with different filenames. They both added them to the ingestion_repo. We don't want any versioning here because we store these files from our customers for integrity, and we do not wish to save modified/duplicate versions in this repo. Let's call it "FileA".
The engineers cleaned up the data files for ML analysis, so we moved the cleaned-up data files into a different repo called "pre_processed". Data versioning is super useful here because we run ML models to get NLP results on these files, and if we think we are getting a lot of noise, we go back and clean the file more, and we can track the changes we made in the files. Let's call the cleaned-up versions of the file "FileB.1", "FileB.2", etc.
There are cases where "FileA" might be the same as "FileB.1", when no cleanup is required. Because our pipeline always looks into the "pre_processed" repo for files to run our scripts on, we are essentially keeping the same files in two different repos.
The same goes on as we move from the modelling phase to the testing phase to the storing-results phase...

Please feel free to criticize our architecture; clearly we are struggling to get it right. What changes would you recommend to solve the main issue, which is the most important one to us: how to prevent engineers from adding and working on the same files in the ingestion phase?

shcheklein Nov 25, 2022
Maintainer

> I still need to investigate and figure out why there is such a huge difference in storage.

sometimes it might be the way tools like du report linked data, btw. I would double check that.
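One way to double check this, sketched below under the assumption that the links in question are hardlinks: compare a naive per-path size total with a total that counts each underlying inode only once. The function names are made up for this example, and reflinks (copy-on-write) cannot be detected this way.

```python
from pathlib import Path


def naive_size(root: Path) -> int:
    """Sum st_size over every file path; hardlinked data is counted once per path."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def unique_size(root: Path) -> int:
    """Sum st_size once per (device, inode) pair, so hardlinks are not double counted."""
    seen = set()
    total = 0
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        st = p.stat()
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            total += st.st_size
    return total
```

If `naive_size` is much larger than `unique_size` for the directory that contains both the workspace and the cache, the apparent extra storage is probably a reporting effect rather than real disk usage.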

> Please feel free to criticize our architecture; clearly we are struggling to get it right. What changes would you recommend to solve the main issue, which is the most important one to us: how to prevent engineers from adding and working on the same files in the ingestion phase?

thanks! I'm getting more and more information and I think I understand it better now!

do you use GH, GitLab? when you update the ingestion repo, what does the operation look like - is it a new commit to it? is it a single directory with all the files inside or are they split by customer / by engineer?

I'm thinking about introducing a CI/CD check for the ingestion repo as a way to run checks on data changes and prevent duplicates (or you can run advanced stuff there). Or you can introduce a dedup stage that runs on top of the whole dataset and require running it after someone adds any files. Output of the stage is another directory w/o duplicates. Or it can fail to run if there are duplicates.
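A minimal sketch of what such a check could look like, assuming the data files are available on disk in the CI job (for example after they have been pulled) and that a non-zero exit code fails the job. The function names `find_duplicates` and `check` are hypothetical, and the script hashes files itself instead of reading hashes from DVC metadata.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by content hash and keep only groups with 2+ paths."""
    by_hash: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[file_md5(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


def check(root: Path) -> int:
    """Print duplicate groups and return an exit code: 1 if any were found, else 0."""
    duplicates = find_duplicates(root)
    for digest, paths in duplicates.items():
        print(f"duplicate content {digest}: " + ", ".join(str(p) for p in paths))
    return 1 if duplicates else 0
```

A CI step could end with `raise SystemExit(check(Path("data")))`, where `data` is a placeholder for the ingestion directory. Because the walk covers every subfolder, a file uploaded to one engineer's folder is compared against the files in everyone else's folder too.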

apethree Nov 25, 2022
Author

Yes, it's on GitHub. Files on the staging/TestSite are organized by engineer: each engineer has his/her own folder in the same repo. In the production version we have files organized by customer and business unit, with subdirectories for classification (csv files, unstructured files, etc.).

Would the CI/CD solution be to set up something with CML (https://cml.dev/doc/cml-with-dvc) that checks on import? I am thinking using CML CI/CD would be the better approach. It seems very similar to using CI/CD to validate a schema, except that we validate and compare the file hashes. Just bouncing off of your ideas.

Using "dedup" has one disadvantage: DVC would do data caching and all other work on the file before running a new pipeline.

shcheklein Nov 25, 2022
Maintainer

So, I would have a folder 'ivan' in the ingest repo, and you would have apethree. And if I upload a file into my folder that is a duplicate of a file in your folder you'd like to detect it?

> And in production version

what is the difference between production / test / staging - is it about data being pre-processed?

> CI/CD solution would be to set up something with CML: https://cml.dev/doc/cml-with-dvc

CML is just a helper. It doesn't matter whether it's CML or not. The idea is that any update to the "data" happens through a PR and some check is run automatically. It's similar to the way we would run linters in software engineering.

Yes, it's similar to validating schema.

> DVC would do data caching and all other work on the file before running a new pipeline.

yes, there will be an extra cost of reading the directory each time. But even if we do CI/CD there will be a similar cost at some point - someone needs to read all files from time to time and see if there are duplicates. At the very least, get the number of files and compare it with the number in DVC records.
