Symbolic linking within datasets #526
Comments
Symlinks and deduplication seem like problems for the filesystem or a storage system like datalad, and should not be part of the specification. Not all filesystems support symlinks, so I think it would be unwise for us to recommend or require them in the spec. Do we currently have a principle in which we say files must not be duplicated?
I haven't seen anything about duplication in the spec, but I could have missed it. Are your concerns specifically about symlinks, or about duplicate files in general? I don't think having copies of files would be a problem for Datalad, but then I think there'd be a need for unique identifiers stored in the sidecars. Perhaps this could just be reflected in the scans file, which, at least with heudiconv, generally has some random string that is unique to each file? Since symlinks are a part of BEP001, should they be replaced with file duplication?
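To make the symlink-versus-duplication trade-off concrete, here is a minimal Python sketch of what a curation script might do: try to create a relative symlink, and fall back to a plain copy where the filesystem or platform refuses. The helper name `link_or_copy` is hypothetical, and nothing here is prescribed by the spec or by BEP001; it only illustrates that a tool cannot count on symlinks being available.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(source: Path, target: Path) -> str:
    """Create `target` as a relative symlink to `source`, or copy it instead.

    Hypothetical helper for illustration only. Returns "symlink" or "copy"
    so the caller can record which strategy was actually used.
    """
    # Refuse to overwrite anything that is already there (including dead links).
    if target.exists() or target.is_symlink():
        raise FileExistsError(str(target))

    target.parent.mkdir(parents=True, exist_ok=True)

    # A relative link keeps working if the whole dataset directory is moved.
    relative_source = os.path.relpath(source, start=target.parent)
    try:
        os.symlink(relative_source, target)
        return "symlink"
    except (OSError, NotImplementedError):
        # Symlinks are unsupported or not permitted here: duplicate the file.
        shutil.copy2(source, target)
        return "copy"
```

If the result is "copy", the dataset now holds two independent files, which is exactly the case where some identifier linking the duplicates would be needed.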
Per bids-standard/bids-2-devel#43 (comment), @satra agrees that symlinking would not be compatible with common storage systems. Does anyone have any ideas for a good alternative that will work well with scanner-generated "derivatives"?
My suggestion would be to generate your dataset as a compliant derivatives dataset, and stick derivatives side-by-side with raw files. I'm not sure if this is anything like a consensus position, but given that derivatives datasets may contain raw filenames IFF they are raw files, I think it's a kind of nice way to handle the case. If it becomes common behavior, it drives us toward the end state where we acknowledge that all datasets are derivative.
To tie it back to #508, the BEP001 team has proposed the following format for a dataset with scanner-generated derivatives and sufficient provenance (with minor adjustments to add functional data):
In the case of scanner-generated derivatives without provenance, I believe that their proposal is to simply have the data in the raw data folder:
If I understand correctly, you're proposing that folks do almost the opposite: put everything in the derivatives folder? Like this:
No, I'm proposing:
With dataset_description.json:
{
  ...
  "DatasetType": "derivative",
  "GeneratedBy": [...]
}
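For anyone scripting this, a rough Python sketch of writing such a `dataset_description.json` follows. The function name, the `Name` and `BIDSVersion` values, and the contents of the `GeneratedBy` entry are placeholders I made up for illustration; as far as I know, the spec lists "raw" and "derivative" as the allowed `DatasetType` values, so the sketch uses "derivative".

```python
import json
from pathlib import Path


def write_description(dataset_root: Path, generated_by: list) -> Path:
    """Write a minimal dataset_description.json for a derivative dataset.

    Hypothetical helper: every value except the DatasetType/GeneratedBy keys
    discussed in this thread is a placeholder to be replaced by the curator.
    """
    description = {
        "Name": "Example dataset",      # placeholder
        "BIDSVersion": "1.4.0",         # placeholder; use the version you target
        "DatasetType": "derivative",    # marks the whole dataset as derivative
        "GeneratedBy": generated_by,    # provenance entries supplied by the caller
    }
    dataset_root.mkdir(parents=True, exist_ok=True)
    out_path = dataset_root / "dataset_description.json"
    out_path.write_text(json.dumps(description, indent=2) + "\n")
    return out_path


if __name__ == "__main__":
    # The entry below is an invented example of a scanner-side reconstruction.
    write_description(
        Path("example_dataset"),
        [{"Name": "scanner reconstruction", "Description": "placeholder"}],
    )
```

Raw files and scanner-generated maps would then sit side by side under the same root, with this one file declaring how the dataset should be interpreted.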
Ohhhh okay. Thanks! Now that there's a symlink-less solution on the table, I'll feed it back into the BEP001 review.
I commented on the BEP001 PR with the proposed solution, so I'm going to close this.
#508 proposes a number of new suffixes meant for qMRI workflows. These suffixes all require multiple files, and in some cases one of those files may be equivalent to a file with an existing suffix. For example, one file from a multi-parametric mapping (MPM) scheme may be the same as a T1w scan, and if the dataset curator knows this, they could identify it as such.
#508 also introduces the idea of symbolically linking dataset files to derivatives, in cases where the scanner automatically generates what would typically be considered a derivative (e.g., a T2map).
Would it be reasonable for the curator to symbolically link files within a dataset?
So, for example, we could have the two following:
Tagging @yarikoptic and @adswa to get Datalad-related thoughts, as well as @agahkarakuzu and @emdupre because they were involved in the initial conversation that spawned this issue.
This issue is related to #508 and #512.