Mongo filetypes #966

Open · wants to merge 16 commits into master
Conversation

davidfarkas
Contributor

Resolves #955
Resolves part of #844, specifically the 'unknown' duplicate key issue in the initialize_db method

Review Checklist

  • Tests were added to cover all code changes
  • Documentation was added / updated
  • Code and tests follow standards in CONTRIBUTING.md

@davidfarkas davidfarkas requested a review from nagem October 25, 2017 22:49
@codecov-io

codecov-io commented Oct 26, 2017

Codecov Report

Merging #966 into master will increase coverage by 0.09%.
The diff coverage is 93.47%.

@@            Coverage Diff            @@
##           master    #966      +/-   ##
=========================================
+ Coverage   90.61%   90.7%   +0.09%     
=========================================
  Files          50      51       +1     
  Lines        6766    6843      +77     
=========================================
+ Hits         6131    6207      +76     
- Misses        635     636       +1

@@ -43,6 +44,9 @@
# Filename
'fname': '[^/]+',

# File type name
'ftypename': '[^/]+',
Contributor

This is the same regex as fname above; does it need to be separate?

api/config.py Outdated
for filetype in filetypes:
try_replace_one(db, 'filetypes', {'_id': filetype['_id']}, filetype, upsert=True)

log.info('Initializing database, creating indexes ....DONE')
Contributor

Let's make this two log statements, one before the try_update_one call and one here.
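For illustration, a minimal sketch of that suggestion, reusing the try_replace_one call from the diff above (the exact wording of the first log message is an assumption):

log.info('Initializing database, inserting default filetypes ...')
for filetype in filetypes:
    try_replace_one(db, 'filetypes', {'_id': filetype['_id']}, filetype, upsert=True)
log.info('Initializing database, creating indexes ....DONE')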

@@ -4,6 +4,7 @@
from pymongo.errors import DuplicateKeyError
from . import APIStorageException


Contributor

I think this whitespace is in error

Contributor Author

I just followed PEP 8. It says: "Surround top-level function and class definitions with two blank lines."

@kofalt
Contributor

kofalt commented Nov 2, 2017

Partial review. It looks like the regexes are largely all of the form "file ends with dot-pattern", so it might be nicer to have a function generate those to make it a little more readable, but up to you.

I'm lukewarm on doing a JS query scan on every file upload - maybe we should just pull the table in and loop ourselves? Or use a singleton doc, but then modifying the doc is hard. I defer to you and @nagem.
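A quick sketch of the generator-function idea, in case it helps; the helper name ends_with is hypothetical and the two entries are only illustrative:

import re

def ends_with(*patterns):
    # Hypothetical helper: build a "filename ends with dot-pattern" regex.
    return r'.*\.(' + '|'.join(patterns) + r')$'

filetypes = [
    {'_id': 'bval',  'regex': ends_with('bval', 'bvals')},
    {'_id': 'nifti', 'regex': ends_with(r'nii\.gz', 'nii')},
]

assert re.match(filetypes[1]['regex'], 'scan.nii.gz')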

Insert or replace a file type. Required fields: '_id' and 'regex' where the '_id' is the unique name of
the file type and 'regex' is a regular expression which is used to figure out the file type from the file name.
"""
permchecker = userauth.default(self)
Contributor

Some of the container types have permission-checking decorators written for them; the one you used is for the User container. For these endpoints, I would expect any site admin (a user whose user doc has root=true) to be able to add/change/remove and any logged-in user to be able to get the list of filetypes. For these kinds of checks, there are decorators here you can use, require_admin and require_login. Here is an example of one being used.

Note: superuser_request and user_is_admin are currently different but should be combined. The first is old behavior where a query param called out that you were making a superuser request. The second denotes the privileges you inherently get on all requests because your user doc distinguishes you as a site admin.

payload = self.request.json_body
mongo_schema_uri = validators.schema_uri('mongo', 'filetype.json')
mongo_validator = validators.decorator_from_schema_path(mongo_schema_uri)
mongo_validator(permchecker(noop))('PUT', payload=payload)
Contributor

We can also use a mongo schema validator (although we've kind of stopped that practice, as we didn't see many real-world situations where it prevented dev error), but this is better labeled as an input validator, as it's using a JSON schema to validate request bodies from clients. Rather than creating a decorator*, you can call the validator directly (here is an example).

Related: when you call the existing validator with PUT, it drops the required piece of the JSON schema to allow for patch updates; POST does not.

* This decorator practice was created in the old rewrite of containerhandler to facilitate calling the permissions check, validators, and storage function in one line. It works alright in that handler for now, but new functionality does not need to follow that convention.
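For what it's worth, a rough sketch of calling the input validator directly; the validate_data name and argument order are taken from the linked example and should be treated as assumptions:

payload = self.request.json_body
# Assumed helper: validate the request body against the filetype input schema.
# With 'POST' the schema's required fields are enforced; 'PUT' drops them for patch updates.
validators.validate_data(payload, 'filetype.json', 'input', 'POST')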

@ambrussimon ambrussimon self-requested a review November 14, 2017 16:42
@@ -25,5 +25,34 @@
"_id": "local",
"type": "engine"
}
],
"filetypes": [
{ "_id": "bval", "regex": ".*\\.(bval$|bvals$)" },
Contributor
@ambrussimon ambrussimon Nov 14, 2017

This implementation might identify x.dcm.zip as an archive (instead of as a dicom) depending on the order of iteration over filetypes/regexes. This was working previously because we were using an exact match on everything after the first dot, like so.

Let's remove .* from the beginning of the regexes (they get even simpler) and switch to re.search() instead of re.match(). If multiple matches are found, one could still infer that the best match is the longest one. That way we still get the added flexibility of regexes (e.g. for matching EFiles) but don't lose existing functionality.
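A small sketch of the search-and-keep-the-longest-match approach (the two filetype entries here are only illustrative, not the real filetypes.json contents):

import re

filetypes = [
    {'_id': 'archive', 'regex': r'\.zip$'},
    {'_id': 'dicom',   'regex': r'\.(dcm|dcm\.zip|dicom\.zip)$'},
]

def guess_type(filename):
    # Use re.search() on each regex and keep the longest match,
    # so 'x.dcm.zip' resolves to dicom rather than archive.
    best, best_len = None, 0
    for filetype in filetypes:
        match = re.search(filetype['regex'], filename)
        if match and len(match.group(0)) > best_len:
            best, best_len = filetype['_id'], len(match.group(0))
    return best

assert guess_type('x.dcm.zip') == 'dicom'
assert guess_type('x.zip') == 'archive'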

Contributor

Nice catch!

I wonder if there's an algorithm that can tell us if a set of regexes are mutually exclusive 🤔

@nagem
Contributor

nagem commented Nov 27, 2017

Thanks for making the requested changes, @davidfarkas! Looks good to me. I'll let @ryansanford confirm the file type loading works with what he was expecting and then it's good to merge.

@ryansanford
Contributor

The loading scheme looks fine to me. I'd like to tweak the fly/fly branch for a proper test before this lands. Doing that now...

@ryansanford
Contributor

@gsfr Looking for the stylized file type names

@gsfr
Member

gsfr commented Dec 6, 2017

I've added the stylized types from the classification branch and also added a few new types.

Does this implementation still prefer longer matches over shorter ones (i.e., .dicom.zip over .zip)?

I used square brackets in some of the regular expressions. Hope that works.

This branch needs to be rebased onto master to resolve the Django-related CI error.

{ "_id": "Tabular data", "regex": "\\.([ct]sv\\.gz|[ct]sv)$" },
{ "_id": "Video", "regex": "\\.(mpeg|mpg|mov|mp4|m4v|mts)$" },
{ "_id": "XML", "regex": "\\.xml$" },
{ "_id": "YAML", "regex": "\\.(yaml|yml)$" }
Contributor
@kofalt kofalt Dec 6, 2017

Megan alerted me to this change: I don't think we can do this since it would break the gear surface area. Suggest using the original names in the filetypes.json file.

Member

@kofalt Thanks for the heads up. Could you please elaborate? What are all the aspects you see impacted by this change?

Anything beyond this?

  • massive DB migration
  • manifest updates
  • all gears updated at all sites

Contributor

So, my understanding is that because these filetypes are presented to gears, and they might rely on that field, a check for == "MATLAB data" would fail if the string is now "MATLAB Data".

This is a good example of how we don't quite have the full gear "surface area" mapped out and committed to: with our quickly-built deep hierarchy integration, there's more that could make a gear fail through no fault of its own.

So our options (I think) are:

  1. Not change existing strings
  2. Change the strings, but have the old versions for when they're provided externally (?)
  3. Triage the full set of gears & gear authors, communicate a breaking change

@ryansanford
Contributor

Final review pending DB change by @nagem

@nagem
Contributor

nagem commented Dec 13, 2017

Adding breaking change label as this PR contains a DB update that retypes files that exist in the system.

@kofalt
Contributor

kofalt commented Dec 13, 2017

@gsfr Did you ever make a decision regarding my (collapsed) review comment above?

Based on our offline discussion, I am fine with us either resolving that, or shipping this to customers after we use our TBD new customer-notification strategy.

@nagem
Contributor

nagem commented Dec 13, 2017

@ryansanford @gsfr @davidfarkas @kofalt

I added the database update that will retype files (and insert the starting state of the filetypes collection for existing customers). Currently the only thing preventing this from landing is resolving any conflicts with existing gears and properly alerting existing users.

@gsfr
Member

gsfr commented Dec 14, 2017

Thanks for the ping, @kofalt.

Properly defining the gear surface area seems like a fairly urgent matter. Gear authors must know what they can rely on and what they cannot. We need to make that explicit, rather than retroactively holding ourselves to standards we never knowingly committed to. So you certainly have a good point!

I would like to go ahead with the current change, which can be rolled out to most sites without much communication, as determined by ops. Any site in question needs to be properly contacted and advised, again as determined by ops. Cc @tcbtcb.

@lmperry @jenreiter @kjamison Please advise on FW-internal gear impact.

@kofalt
Contributor

kofalt commented Dec 14, 2017

I'd like to communicate with every site, just to get in the habit. But this plan SGTM.

@kjamison

I'm pinged on this but have no idea what this change actually means; can someone TL;DR? As far as gears go, is this just a change in the metadata for input files?

@nagem
Contributor

nagem commented Dec 14, 2017

@kjamison the stylization of a file's type key available on the inputs map will change. For example,

"config" : {
        "inputs" : {
            "nifti" : {
                "hierarchy" : {...},
                "base" : "file",
                "location" : {...},
                "object" : {
                    "info" : {...},
                    "mimetype" : "application/octet-stream",
                    "tags" : [],
                    "measurements" : [],
                    "type" : "nifti",  ## This key will have a different stylization, after this update it will be "NIfTI"
                    "modality" : null,
                    "size" : 389164
                }
            }
        },
        "destination" : {...},
        "config" : {...}
    }

Any gears that reference this key in a way that would be broken by a different string stylization (or, in some situations, by a different type being applied; @gsfr knows more if you think this applies to you) should be updated. Anyone else can chime in if they have information that better applies to gear authors; I'm responding with API developer context.

@gsfr
Member

gsfr commented Dec 15, 2017

@kjamison If you are not relying on specific file type or classification metadata, you should be unaffected. Thanks!

@kofalt
Contributor

kofalt commented Dec 15, 2017

Breaking change: the type key on files will have a new format. Some types will have changed, some have been broken up, and some new types have been added. In addition, it will now be possible to add custom types on a per-installation basis.

If you are using one of the types below that has unambiguously changed, you can prepare for the change by adding a function such as this one to your code:

def updated_type(old_type):
	"""
	Updates an old file type to the new format, if it matches.
	"""
	
	return (
		old_type 
		.replace('archive',       'Archive')
		.replace('bval',          'BVAL')
		.replace('bvec',          'BVEC')
		.replace('dicom',         'DICOM')
		.replace('document',      'Document')
		.replace('eeg',           'EEG')
		.replace('gephysio',      'GE Physio')
		.replace('image',         'Image')
		.replace('ismrmrd',       'HDF5')
		.replace('log',           'Log')
		.replace('markdown',      'Markdown')
		.replace('MATLAB data',   'MATLAB Data')
		.replace('MGH data',      'MGH Data')
		.replace('nifti',         'NIfTI')
		.replace('parrec',        'PAR/REC')
		.replace('pdf',           'PDF')
		.replace('pfile',         'PFile')
		.replace('presentation',  'Presentation')
		.replace('PsychoPy data', 'PsychoPy Data')
		.replace('qa',            'QC')
		.replace('spreadsheet',   'Spreadsheet')
		.replace('tabular data',  'Tabular Data')
		.replace('text',          'Plain Text')
		.replace('video',         'Video')
	)

Then, change your logic as shown:

# Old version
if filetype == 'nifti':
	run_script()

# New version
if updated_type(filetype) == 'NIfTI':
	run_script()

This will make your code ready for the breaking change.
The updated_type function can be removed later.

A full list of the changes is as follows:

Unambiguously changed:

Old             New
archive         Archive
bval            BVAL
bvec            BVEC
dicom           DICOM
document        Document
eeg             EEG
gephysio        GE Physio
image           Image
ismrmrd         HDF5
log             Log
markdown        Markdown
MATLAB data     MATLAB Data
MGH data        MGH Data
nifti           NIfTI
parrec          PAR/REC
pdf             PDF
pfile           PFile
presentation    Presentation
PsychoPy data   PsychoPy Data
qa              QC
spreadsheet     Spreadsheet
tabular data    Tabular Data
text            Plain Text
video           Video

Ambiguously changed:

Old           New
markup        HTML or XML
source code   C/C++ or CSS or Java or JSON or MATLAB or Python or PHP or JavaScript or TOML or YAML

New types:

Old   New
n/a   Audio
n/a   EFile
n/a   Jupyter
n/a   PFile Header

@ehlertjd
Contributor

One potential concern with using replace in the Python method: if, for example, I added and used a custom type called text/csv and then ran that function, wouldn't it incorrectly map to Plain Text/csv?

@kofalt
Contributor

kofalt commented Dec 15, 2017

Totally - I was considering using regexes, but I figured this version would be easier to read.

The type key is not an enum, but in practice the range of values is pretty small. I am also suggesting this in-place change for comparison only, not to send back to the DB.

I would accept an alternative that added ^$ to everything to make it more correct.
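One way to get the same effect without anchoring every pattern is an exact-match lookup; a minimal sketch (the mapping shown is partial, see the table above for the full set):

# Exact-match lookup instead of substring replacement, so custom types
# such as 'text/csv' pass through unchanged.
TYPE_MAP = {
    'archive': 'Archive',
    'nifti':   'NIfTI',
    'text':    'Plain Text',
    # ... remaining entries from the table above
}

def updated_type(old_type):
    return TYPE_MAP.get(old_type, old_type)

assert updated_type('nifti') == 'NIfTI'
assert updated_type('text/csv') == 'text/csv'   # custom type is unaffected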

@kofalt
Contributor

kofalt commented Dec 15, 2017

A quick note: some gear rules may be broken by this change if they reference any of the types that were more than merely case-changed. We should probably address that in this branch.

@gsfr
Member

gsfr commented Dec 15, 2017

Thanks, @kofalt. Not sure we should include code, especially if it's not bulletproof. The change users need to make is likely trivial compared to the update function.

@kofalt
Contributor

kofalt commented Dec 15, 2017

Well, the trick is that they need to have their changes ready before this break ships, right? So there needs to be a way for a script to work with the old & new versions simultaneously. Unless I'm missing something...

@gsfr
Member

gsfr commented Dec 15, 2017

Theoretically, you are right, of course. Practically, I continue to think it won't be a problem.

We may, however, want to use a forward-compatible change like that in our own gears. But I still haven't heard that any of our own gears are actually affected.

@kofalt
Contributor

kofalt commented Dec 19, 2017

@gsfr At minimum one gear that Jen is working on is affected. I've been keeping her up to date on this change to help with that. I'm going to assume that any gear break could affect anyone going forward.

@ryansanford
Contributor

Additional upgrade needs:

  • Update existing gear rules
  • Update existing session templates
  • Reconcile how pending jobs that are locked to gear IDs at the time of upgrade will still result in properly typed files.

@gsfr gsfr unassigned nagem May 11, 2018