File naming conventions #31

Closed
JoeZiminski opened this issue Nov 9, 2022 · 10 comments
Labels
documentation Improvements or additions to documentation

Comments

@JoeZiminski
Member

JoeZiminski commented Nov 9, 2022

There has been some discussion here on the best file naming convention. Should the sub/ses info be at the file level?
e.g.

└── project_name/
    └── raw_data/
        └── sub-001/
            └── ses-001_date-01012022/
                └── sub-001_ses-001_date-01012022_video.mp4
or not
.
└── project_name/
    └── raw_data/
        └── sub-001/
            └── ses-001_date-01012022/
                └── video.mp4

My preference is for the longer version. Even though it is ugly, it is completely unambiguous and protects against some possible worst-case-scenario bugs (which should absolutely never happen, but still, e.g. copying a session's data to the wrong session / subject). I think this is also what BIDS prefers in neuroimaging, but I might be wrong.

The problem is that filenames will be created by users, so this will be harder to enforce. We could at least provide some functionality to copy the correct prefix to the clipboard, based on the cwd or something.
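A minimal sketch of the prefix idea, to make it concrete. The function name `filename_prefix` is hypothetical (not an existing datashuttle function), and it assumes the folder names follow the `sub-…` / `ses-…` pattern from the trees above. Copying to the clipboard is left out.

```python
from pathlib import PurePosixPath


def filename_prefix(folder):
    """Build a 'sub-..._ses-..._' prefix from the folder names in a path.

    Hypothetical helper: it takes the first path component starting with
    'sub-' and the first starting with 'ses-' (if there is one).
    """
    parts = PurePosixPath(folder).parts
    sub = next((p for p in parts if p.startswith("sub-")), None)
    ses = next((p for p in parts if p.startswith("ses-")), None)
    if sub is None:
        raise ValueError(f"no sub- folder found in {folder!r}")
    # A subject-level folder (no session) yields just 'sub-..._'.
    return "_".join(p for p in (sub, ses) if p is not None) + "_"


folder = "project_name/raw_data/sub-001/ses-001_date-01012022"
print(filename_prefix(folder) + "video.mp4")
```

For the current working directory, something like `filename_prefix(Path.cwd().as_posix())` would give the same kind of prefix.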

@adamltyson
Member

The problem is filenames will be created by users, so will be harder to enforce

Can we enforce this at all? What if the data generated is 10k files from some third party software?

There's a discussion to be had as to where we fall on this spectrum:

standard data formats <-> any data format with some standardisation (filename, metadata etc) <-> generic "buckets" (directories) to save data into

@JoeZiminski
Member Author

Yes, this is true, we can't enforce it at all. Maybe it is best left up to the user, and we can provide a recommendation.

Yes, this is another key point: how 'far down' the tree we manage and how much we leave to the user. The currently preferred folder structure is nice for this, as we can just provide 'behav', 'ephys' at the data type level (i.e. one level below the session level) and leave it at that (rather than the ses-001/ephys/behav/camera etc. that there was before).

For now, we could leave things agnostic below the data-type level?
i.e. in the tree below, it's possible to copy everything in ephys, or behav, or imaging etc. for a selection of subjects and sessions, but there is no finer-grained control (although there is already a function for specifying a full path to a file to transfer)

.
└── project_name/
    └── raw_data/
        └── sub-001/
            ├── ses-001/
            │   ├── behav/
            │   │   ├── video.mp4
            │   │   └── responses.csv
            │   ├── ephys/
            │   │   └── recording.bin
            │   └── imaging/
            │       └── some_filetype.whatever
            └── histology/
                └── brain.tiff
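As a rough sketch of what "no finer-grained control below the data-type level" means in practice: the selection is just subjects × sessions × one data type, and everything inside the resulting folders is treated as a unit. The function name `data_type_paths` is made up for illustration and is not the datashuttle API.

```python
from pathlib import PurePosixPath


def data_type_paths(raw_data, subjects, sessions, data_type):
    """List <raw_data>/<sub>/<ses>/<data_type> for every selected sub/ses.

    Hypothetical helper: it only builds the candidate folder paths and does
    not look inside them, which is the point of stopping at this level.
    """
    root = PurePosixPath(raw_data)
    return [root / sub / ses / data_type for sub in subjects for ses in sessions]


for path in data_type_paths(
    "project_name/raw_data", ["sub-001", "sub-002"], ["ses-001"], "behav"
):
    print(path)
```

Subject-level folders such as histology sit outside the session loop and would need their own (equally simple) case.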

In future, either provide options for additional structure if it is very widely used, or alternatively support creating / searching with custom strings.

Related, is everyone happy with a single folder for histology at the subject level? I think this makes sense

@adamltyson @niksirbi @lauraporta

@adamltyson
Member

Yes this is true we can't enforce it at all, maybe it is best left up to the user and we can provide a recommendation.

I agree. Maybe some docs on good practice, and we could even print out a recommended filename string.

For now, we could leave things agnostic from below the data-type level?

I think for now this is the best idea.

In future either provide options for additional structure if it is very widely used, or alternatively support for creating / searching with custom strings.

Yep, as the tool is adopted, we could provide support for a limited set of acquisition setups. Bonsai etc.

Related, is everyone happy with a single folder for histology at the subject level? I think this makes sense

Agree. In future we could potentially provide support for whole-brain, sections, spatial-transcriptomics etc.

@niksirbi
Member

I agree with Adam.

My personal preference would be to include sub/ses in the filename, just as BIDS does, but it will be a headache to enforce for everything. Even BIDS started by supporting a few acquisition types (e.g. T1w, BOLD EPI) before expanding to others.

Context on requirements vs recommendations

When BIDS validates datasets (see bids-validator), it differentiates between REQUIRED, RECOMMENDED, and OPTIONAL. So if a REQUIRED feature is violated, you get an error, whereas if a RECOMMENDED feature is violated you get a warning. We could have a similar stratified system, and as different versions of datashuttle roll out we could promote some RECOMMENDED features to REQUIRED (while ensuring backwards compatibility).
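To illustrate the stratified idea, here is a minimal sketch in which a REQUIRED violation becomes an error and a RECOMMENDED violation becomes a warning. The two rules are placeholders chosen for this example (folder structure as REQUIRED, sub/ses filename prefix as RECOMMENDED). The names `Issue` and `check_file` are hypothetical, and this is not how bids-validator is implemented.

```python
import re
from dataclasses import dataclass


@dataclass
class Issue:
    level: str  # "error" for REQUIRED rules, "warning" for RECOMMENDED ones
    message: str


# Assumed RECOMMENDED pattern: filename starts with 'sub-<label>_ses-<label>_'.
SUB_SES_PREFIX = re.compile(r"^sub-[^_/]+_ses-[^_/]+_")


def check_file(relative_path):
    """Check one file path given relative to the rawdata folder."""
    parts = relative_path.split("/")
    issues = []
    # REQUIRED (assumed): sub-*/ses-*/<data type>/<file>
    if (
        len(parts) < 4
        or not parts[0].startswith("sub-")
        or not parts[1].startswith("ses-")
    ):
        issues.append(Issue("error", "expected sub-*/ses-*/<data type>/<file>"))
    # RECOMMENDED (assumed): the filename repeats the sub/ses information.
    if not SUB_SES_PREFIX.match(parts[-1]):
        issues.append(Issue("warning", "filename lacks a sub-*_ses-*_ prefix"))
    return issues


for issue in check_file("sub-001/ses-001/behav/video.mp4"):
    print(issue.level, issue.message)
```

Promoting a rule from RECOMMENDED to REQUIRED in a later version would then just mean changing its level from "warning" to "error".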

For now I propose the following:

  • We enforce only the folder structure up to the data type level
  • We make some recommendations regarding file naming and filetypes
  • If a specific acquisition type is used very often (e.g. videos saved by Bonsai), we can think about specific file naming schemes for that and ultimately offer functionality to 'BIDSify' (aka rename) the files.

Conclusion

The standard itself should be versioned, improved upon, and expanded by trial and error. Let's start small with minimal requirements and see where we go from there.

@niksirbi
Member

Additionally, we can offer some (non-enforced) guidelines on how to store metadata. E.g. use .csv or .tsv for tables/dataframes and .json for key-value pairs. If a specific metadata file or table pertains to a specific subject/session/acquisition, its name should reflect that.
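A small sketch of what such a guideline could look like, using only the standard library. The helper names (`metadata_name`, `table_to_tsv`) and the example labels (`trials`, `camera`, `frame_rate`) are made up for illustration and are not part of any agreed standard.

```python
import csv
import io
import json


def metadata_name(sub, ses, label, extension):
    """Name a metadata file after the subject/session it pertains to."""
    return f"{sub}_{ses}_{label}.{extension}"


def table_to_tsv(rows):
    """Serialise a list of dicts (one per row) as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Table-like metadata -> .tsv
print(metadata_name("sub-001", "ses-001", "trials", "tsv"))
print(table_to_tsv([{"trial": 1, "response": "left"}]))

# Key-value metadata -> .json
print(metadata_name("sub-001", "ses-001", "camera", "json"))
print(json.dumps({"frame_rate": 30}, indent=2))
```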

@niksirbi
Member

Also @JoeZiminski , nitpicking some things I noticed in your example directory trees above:

  1. I would use rawdata, instead of raw_data (to follow BIDS, there is no reason to differentiate ourselves here)
  2. The dates I would write in YYYYMMDD format, because it's the least ambiguous considering international standards, it naturally sorts in chronological order, and it's the BIDS and ISO recommended format. So e.g. ses-002_date-20221110.
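A quick sketch of the sorting argument, assuming session folders named as in the example above (`session_name` is a hypothetical helper):

```python
from datetime import date


def session_name(number, day):
    """Format a session folder name with a YYYYMMDD date."""
    return f"ses-{number:03d}_date-{day:%Y%m%d}"


print(session_name(2, date(2022, 11, 10)))

# Plain string sorting is chronological with YYYYMMDD...
print(sorted(["date-20220101", "date-20211231"]))
# ...but not with DDMMYYYY, where 1 Jan 2022 sorts before 31 Dec 2021.
print(sorted(["date-01012022", "date-31122021"]))
```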

@JoeZiminski
Member Author

That's really nice. How do you think it is best to manage the documentation for the standard vs. the datashuttle implementation? Shall we have a single (versioned) help page which introduces BIDS and makes recommendations, and use the current ephys BEP for the formal standard?

Cheers for those points, I will open / amend issues.

@niksirbi
Member

You are right in the sense that datashuttle (the tool) is not the same thing as the standard. It's probably more a tool that helps you implement the standard.

The standard itself (let's tentatively call it BIDS-SWC) is probably best hosted as a separate repo, which will be solely documentation, similar to bids-specification.

In the future we might also implement a tool like bids-validator to check whether a dataset is BIDS-SWC compliant.

We should of course monitor (and contribute to) BEP029 and BEP032, and strive to converge with them over time. That said, the in-house needs already go beyond these BEPs.

Those were my first thoughts on this, so fully open to counter-points.

@JoeZiminski
Member Author

I think that's a good approach, completely agree.

@adamltyson
Member

In terms of docs, we could have multiple repos containing docs, or directories containing docs. These could all be rendered with Sphinx and hosted using GitHub Pages. Something like:

  • github.com/neuroinformatics-unit/datashuttle/docs -> neuroinformatics-unit.github.io/datashuttle
  • github.com/BIDS-SWC -> neuroinformatics-unit.github.io/BIDS-SWC

and other tools e.g.

github.com/neuroinformatics-unit/behaviour-pipeline/docs -> neuroinformatics-unit.github.io/behaviour-pipeline

As adoption increases, the repos and docs could be moved to either SWC, or their own organisation.

@JoeZiminski JoeZiminski added the documentation Improvements or additions to documentation label Nov 10, 2022
@JoeZiminski JoeZiminski added this to the Beta release 0.0.0.1 milestone Nov 10, 2022
3 participants