-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mongo filetypes #966
base: master
Are you sure you want to change the base?
Mongo filetypes #966
Conversation
Codecov Report
@@ Coverage Diff @@
## master #966 +/- ##
=========================================
+ Coverage 90.61% 90.7% +0.09%
=========================================
Files 50 51 +1
Lines 6766 6843 +77
=========================================
+ Hits 6131 6207 +76
- Misses 635 636 +1 |
@@ -43,6 +44,9 @@ | |||
# Filename | |||
'fname': '[^/]+', | |||
|
|||
# File type name | |||
'ftypename': '[^/]+', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same regex as fname
above, does it need to be separate?
api/config.py
Outdated
for filetype in filetypes: | ||
try_replace_one(db, 'filetypes', {'_id': filetype['_id']}, filetype, upsert=True) | ||
|
||
log.info('Initializing database, creating indexes ....DONE') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this two log statements, one before the try_update_one
call and one here.
api/dao/dbutil.py
Outdated
@@ -4,6 +4,7 @@ | |||
from pymongo.errors import DuplicateKeyError | |||
from . import APIStorageException | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this whitespace is in error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just followed pep8. It says: Surround top-level function and class definitions with two blank lines.
Partial review. It looks like the regexes are largely all of the form "file ends with dot-patttern", so it might be nicer to have a function generate those to make it a little more readable, but up to you. I'm lukewarm on doing a JS query scan on every file upload - maybe we should just pull the table in and loop ourselves? Or use a singleton doc, but then modifying the doc is hard. I defer to you and @nagem. |
api/handlers/filetypehandler.py
Outdated
Insert or replace a file type. Required fields: '_id' and 'regex' where the '_id' is the unique name of | ||
the file type and 'regex' is a regular expression which is used to figure out the file type from the file name. | ||
""" | ||
permchecker = userauth.default(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the container types have permissions checking decorators written for them, the one you used is for the User container. For these endpoints, I would expect any site admin (a user whose user doc has root=true
) to be able to add/change/remove and any logged-in user to be able to get the list of filetypes. For these kinds of checks, there is are decorators here you can use to require_admin
and require_login
. Here is an example of one being used.
Note: the difference between superuser_request
and user_is_admin
is currently different but should be combined. The first is old behavior that called out you were making a superuser request via a query param. The second is notes the privileges you inherently get on all requests because your user doc distinguishes you as a site admin.
api/handlers/filetypehandler.py
Outdated
payload = self.request.json_body | ||
mongo_schema_uri = validators.schema_uri('mongo', 'filetype.json') | ||
mongo_validator = validators.decorator_from_schema_path(mongo_schema_uri) | ||
mongo_validator(permchecker(noop))('PUT', payload=payload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also use a mongo schema validator (although we've kind of stopped that practice as we didn't see many real world situations where it prevented dev error), but this is better labeled as an input validator as it's using a json schema to validate request bodies from clients. Rather than creating a decorator*, you can call the validator directly (here is an example).
Related, when you call the existing validator with PUT
, it drops the required
piece of the json schema to allow for patch updates. POST
does not.
* This decorator practice was created in the old rewrite of containerhandler to facilitate calling the permissions check, validators and storage function in one line. It works alright in that handler for now, but new functionality does not need to follow that convention.
bootstrap.sample.json
Outdated
@@ -25,5 +25,34 @@ | |||
"_id": "local", | |||
"type": "engine" | |||
} | |||
], | |||
"filetypes": [ | |||
{ "_id": "bval", "regex": ".*\\.(bval$|bvals$)" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation might identify x.dcm.zip
as an archive
(instead of as a dicom
) depending on the order of iteration over filetypes/regexes. This was working previously because we were using an exact match for after the first dot like so.
Let's remove .*
from the beginning of regexes (they get even simpler) and switch to re.search()
instead of re.match()
. If multiple matches are found, one could still infer that the best match is the longest one. That way we can still get the added flexibility of regexes (eg. for matching EFiles) but don't lose existing functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
I wonder if there's an algorithm that can tell us if a set of regexes are mutually exclusive 🤔
b983555
to
b1432aa
Compare
Thanks for making the requested changes, @davidfarkas! Looks good to me. I'll let @ryansanford confirm the file type loading works with what he was expecting and then it's good to merge. |
The loading scheme looks fine to me. I'd like to tweak the fly/fly branch for a proper test before this lands. Doing that now... |
@gsfr Looking for the stylized file type names |
I've added the stylized types from the classification branch and also added a few new types. Does this implementation still prefer longer matches over shorter ones (i.e., .dicom.zip over .zip)? I used square brackets in some of the regular expressions. Hope that works. This branch needs to be rebased onto master to resolve the Django related CI error. |
bootstrap.sample.json
Outdated
{ "_id": "Tabular data", "regex": "\\.([ct]sv\\.gz|[ct]sv)$" }, | ||
{ "_id": "Video", "regex": "\\.(mpeg|mpg|mov|mp4|m4v|mts)$" }, | ||
{ "_id": "XML", "regex": "\\.xml$" }, | ||
{ "_id": "YAML", "regex": "\\.(yaml|yml)$" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Megan alerted me to this change: I don't think we can do this since it would break the gear surface area. Suggest using the original names in the filetypes.json
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kofalt Thanks for the heads up. Could you please elaborate? What are all the aspects you see impacted by this change?
Anything beyond this?
- massive DB migration
- manifest updates
- all gears updated at all sites
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, my understanding because these filetypes are presented to gears, and they might rely on that field, a check for == "MATLAB data"
would fail if the string is now "MATLAB Data"
.
This is a good example of how we don't quite have the full gear "surface area" mapped out and committed to: with our quickly-built deep hierarchy integration, there's more that could make a gear fail through no fault of its own.
So our options (I think) are:
- Not change existing strings
- Change the strings, but have the old versions for when they're provided externally (?)
- Triage the full set of gears & gear authors, communicate a breaking change
…or in initialize_db method
…tch, update the integration tests and validate the regex in the file type handler
dfffaba
to
d79658a
Compare
Final review pending DB change by @nagem |
Adding breaking change label as this PR contains a DB update that retypes files that exist in the system. |
@gsfr Did you ever make a decision regarding my (collapsed) review comment above? Based on our offline discussion, I am fine with us either resolving that, or shipping this to customers after we use our TBD new customer-notification strategy. |
@ryansanford @gsfr @davidfarkas @kofalt I added the database update that will retype files (and insert the starting state of the filetypes collection for existing customers). Currently the only thing preventing this from landing is resolving any conflicts with existing gears and properly alerting existing users. |
Thanks for the ping, @kofalt. Properly defining the gear surface area seems like a fairly urgent matter. Gear authors must know what they can rely on and what not. We need to make that explicit, rather than retroactively holding ourselves to standards we never knowingly committed to. So, you certainly have good point! I would like to go ahead with the current change, which can be rolled out to most sites without much communication, as determined by ops. Any site in question, needs to be properly contacted and advised, again, as determined by ops. Cc @tcbtcb. @lmperry @jenreiter @kjamison Please advise on FW-internal gear impact. |
I'd like to communicate with every site, just to get in the habit. But this plan SGTM. |
I'm pinged on this but have no idea what this change actually means, can someone TL;DR? As far as gears, is this just a change in the meta-data for input files? |
@kjamison the stylization of a file's
Any gears that reference this key in a way that would be broken by a different string stylization (or in some situations, a different type will be applied (@gsfr knows more if you think this applies to you)) should be updated. Anyone else can chime in if they have information that better applies to gear authors, I'm responding with API developer context. |
@kjamison If you are not relying on specific file type or classification metadata, you should be unaffected. Thanks! |
Breaking change: the If you are using one of the types below that has unambiguously changed, you can prepare for the change by adding a function such as this one to your code: def updated_type(old_type):
"""
Updates an old file type to the new format, if it matches.
"""
return (
old_type
.replace('archive', 'Archive')
.replace('bval', 'BVAL')
.replace('bvec', 'BVEC')
.replace('dicom', 'DICOM')
.replace('document', 'Document')
.replace('eeg', 'EEG')
.replace('gephysio', 'GE Physio')
.replace('image', 'Image')
.replace('ismrmrd', 'HDF5')
.replace('log', 'Log')
.replace('markdown', 'Markdown')
.replace('MATLAB data', 'MATLAB Data')
.replace('MGH data', 'MGH Data')
.replace('nifti', 'NIfTI')
.replace('parrec', 'PAR/REC')
.replace('pdf', 'PDF')
.replace('pfile', 'PFile')
.replace('presentation', 'Presentation')
.replace('PsychoPy data', 'PsychoPy Data')
.replace('qa', 'QC')
.replace('spreadsheet', 'Spreadsheet')
.replace('tabular data', 'Tabular Data')
.replace('text', 'Plain Text')
.replace('video', 'Video')
) Then, change your logic as shown: # Old version
if filetype == 'nifti':
run_script()
# New version
if updated_type(filetype) == 'NIfTI':
run_script() This will make your code ready for the breaking change. A full list of the changes are as follows: Unambiguously changed:
Ambiguously changed:
New types:
|
One potential concern with using replace in the python method - if, for example, I added and used a custom type called |
Totally - I was considering using regexes but I figured this version would be easier to read. The I would accept an alternative that added |
A quick note: some gear rules may be broken by this change, if any of the types were referenced that were not merely case-changed. We should probably address that in this branch. |
Thanks, @kofalt. Not sure we should include code, especially, if it's not bullet proof. The change users need to make is likely trivial compared to the update function. |
Well, the trick is that they need to have their changes ready before this break ships, right? So there needs to be a way for a script to work with the old & new versions simultaneously. Unless I'm missing something... |
Theoretically, you are right, of course. Practically, I continue to think it won't be a problem. We may, however, want to use a forward compatible change like that in our own gears. But I still haven't heard that any of our own gears are actually affected. |
@gsfr At minimum one gear that Jen is working on is affected. I've been keeping her up to date on this change to help with that. I'm going to assume that any gear break could affect anyone going forward. |
Additional upgrade needs:
|
Resolves #955
Resolves part of #844 specifically the 'unknown' duplicated key issue in the initialize_db method
Review Checklist