Mongo filetypes #966

Open · wants to merge 16 commits into master
Conversation

davidfarkas
Contributor

Resolves #955
Resolves part of #844, specifically the 'unknown' duplicate key issue in the initialize_db method

Review Checklist

  • Tests were added to cover all code changes
  • Documentation was added / updated
  • Code and tests follow standards in CONTRIBUTING.md

@davidfarkas davidfarkas requested a review from nagem October 25, 2017 22:49
@codecov-io

codecov-io commented Oct 26, 2017

Codecov Report

Merging #966 into master will increase coverage by 0.09%.
The diff coverage is 93.47%.

@@            Coverage Diff            @@
##           master    #966      +/-   ##
=========================================
+ Coverage   90.61%   90.7%   +0.09%     
=========================================
  Files          50      51       +1     
  Lines        6766    6843      +77     
=========================================
+ Hits         6131    6207      +76     
- Misses        635     636       +1

@@ -43,6 +44,9 @@
# Filename
'fname': '[^/]+',

# File type name
'ftypename': '[^/]+',
Contributor

This is the same regex as fname above; does it need to be separate?

api/config.py Outdated
for filetype in filetypes:
try_replace_one(db, 'filetypes', {'_id': filetype['_id']}, filetype, upsert=True)

log.info('Initializing database, creating indexes ....DONE')
Contributor

Let's make this two log statements, one before the try_update_one call and one here.
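For illustration, a minimal sketch of that suggestion, reusing the try_replace_one call from the diff above (the exact wording of the first log message is an assumption):

log.info('Initializing database, inserting default filetypes ...')
for filetype in filetypes:
    try_replace_one(db, 'filetypes', {'_id': filetype['_id']}, filetype, upsert=True)
log.info('Initializing database, creating indexes ....DONE')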

@@ -4,6 +4,7 @@
from pymongo.errors import DuplicateKeyError
from . import APIStorageException


Contributor

I think this whitespace is in error

Contributor Author

I just followed PEP 8. It says: "Surround top-level function and class definitions with two blank lines."

@kofalt
Contributor

kofalt commented Nov 2, 2017

Partial review. It looks like the regexes are largely all of the form "file ends with dot-pattern", so it might be nicer to have a function generate those to make it a little more readable, but up to you.

I'm lukewarm on doing a JS query scan on every file upload - maybe we should just pull the table in and loop ourselves? Or use a singleton doc, but then modifying the doc is hard. I defer to you and @nagem.
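A quick sketch of the generator-function idea, in case it helps; the helper name ends_with is hypothetical and the two entries are only illustrative:

import re

def ends_with(*patterns):
    # Hypothetical helper: build a "filename ends with dot-pattern" regex.
    return r'.*\.(' + '|'.join(patterns) + r')$'

filetypes = [
    {'_id': 'bval',  'regex': ends_with('bval', 'bvals')},
    {'_id': 'nifti', 'regex': ends_with(r'nii\.gz', 'nii')},
]

assert re.match(filetypes[1]['regex'], 'scan.nii.gz')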

Insert or replace a file type. Required fields: '_id' and 'regex' where the '_id' is the unique name of
the file type and 'regex' is a regular expression which is used to figure out the file type from the file name.
"""
permchecker = userauth.default(self)
Contributor

Some of the container types have permission-checking decorators written for them; the one you used is for the User container. For these endpoints, I would expect any site admin (a user whose user doc has root=true) to be able to add/change/remove and any logged-in user to be able to get the list of filetypes. For these kinds of checks, there are decorators here you can use, require_admin and require_login. Here is an example of one being used.

Note: superuser_request and user_is_admin are currently different but should be combined. The first is old behavior where a query param called out that you were making a superuser request. The second denotes the privileges you inherently get on all requests because your user doc distinguishes you as a site admin.

payload = self.request.json_body
mongo_schema_uri = validators.schema_uri('mongo', 'filetype.json')
mongo_validator = validators.decorator_from_schema_path(mongo_schema_uri)
mongo_validator(permchecker(noop))('PUT', payload=payload)
Contributor

We can also use a mongo schema validator (although we've kind of stopped that practice, as we didn't see many real-world situations where it prevented dev error), but this is better labeled as an input validator, as it's using a JSON schema to validate request bodies from clients. Rather than creating a decorator*, you can call the validator directly (here is an example).

Related: when you call the existing validator with PUT, it drops the required piece of the JSON schema to allow for patch updates; POST does not.

* This decorator practice was created in the old rewrite of containerhandler to facilitate calling the permissions check, validators, and storage function in one line. It works alright in that handler for now, but new functionality does not need to follow that convention.
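For what it's worth, a rough sketch of calling the input validator directly; the validate_data name and argument order are taken from the linked example and should be treated as assumptions:

payload = self.request.json_body
# Assumed helper: validate the request body against the filetype input schema.
# With 'POST' the schema's required fields are enforced; 'PUT' drops them for patch updates.
validators.validate_data(payload, 'filetype.json', 'input', 'POST')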

@ambrussimon ambrussimon self-requested a review November 14, 2017 16:42
@@ -25,5 +25,34 @@
"_id": "local",
"type": "engine"
}
],
"filetypes": [
{ "_id": "bval", "regex": ".*\\.(bval$|bvals$)" },
Contributor
@ambrussimon ambrussimon Nov 14, 2017

This implementation might identify x.dcm.zip as an archive (instead of as a dicom) depending on the order of iteration over filetypes/regexes. This was working previously because we were using an exact match on everything after the first dot, like so.

Let's remove .* from the beginning of the regexes (they get even simpler) and switch to re.search() instead of re.match(). If multiple matches are found, one could still infer that the best match is the longest one. That way we still get the added flexibility of regexes (e.g. for matching EFiles) but don't lose existing functionality.
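A small sketch of the search-and-keep-the-longest-match approach (the two filetype entries here are only illustrative, not the real filetypes.json contents):

import re

filetypes = [
    {'_id': 'archive', 'regex': r'\.zip$'},
    {'_id': 'dicom',   'regex': r'\.(dcm|dcm\.zip|dicom\.zip)$'},
]

def guess_type(filename):
    # Use re.search() on each regex and keep the longest match,
    # so 'x.dcm.zip' resolves to dicom rather than archive.
    best, best_len = None, 0
    for filetype in filetypes:
        match = re.search(filetype['regex'], filename)
        if match and len(match.group(0)) > best_len:
            best, best_len = filetype['_id'], len(match.group(0))
    return best

assert guess_type('x.dcm.zip') == 'dicom'
assert guess_type('x.zip') == 'archive'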

Contributor

Nice catch!

I wonder if there's an algorithm that can tell us if a set of regexes are mutually exclusive 🤔

@nagem
Contributor

nagem commented Nov 27, 2017

Thanks for making the requested changes, @davidfarkas! Looks good to me. I'll let @ryansanford confirm the file type loading works with what he was expecting and then it's good to merge.

@ryansanford
Contributor

The loading scheme looks fine to me. I'd like to tweak the fly/fly branch for a proper test before this lands. Doing that now...

@ryansanford
Contributor

@gsfr Looking for the stylized file type names

@gsfr
Member

gsfr commented Dec 6, 2017

I've added the stylized types from the classification branch and also added a few new types.

Does this implementation still prefer longer matches over shorter ones (i.e., .dicom.zip over .zip)?

I used square brackets in some of the regular expressions. Hope that works.

This branch needs to be rebased onto master to resolve the Django-related CI error.

{ "_id": "Tabular data", "regex": "\\.([ct]sv\\.gz|[ct]sv)$" },
{ "_id": "Video", "regex": "\\.(mpeg|mpg|mov|mp4|m4v|mts)$" },
{ "_id": "XML", "regex": "\\.xml$" },
{ "_id": "YAML", "regex": "\\.(yaml|yml)$" }
Contributor
@kofalt kofalt Dec 6, 2017

Megan alerted me to this change: I don't think we can do this since it would break the gear surface area. Suggest using the original names in the filetypes.json file.

Member

@kofalt Thanks for the heads up. Could you please elaborate? What are all the aspects you see impacted by this change?

Anything beyond this?

  • massive DB migration
  • manifest updates
  • all gears updated at all sites

Contributor

So, my understanding is that because these filetypes are presented to gears, and they might rely on that field, a check for == "MATLAB data" would fail if the string is now "MATLAB Data".

This is a good example of how we don't quite have the full gear "surface area" mapped out and committed to: with our quickly-built deep hierarchy integration, there's more that could make a gear fail through no fault of its own.

So our options (I think) are:

  1. Not change existing strings
  2. Change the strings, but have the old versions for when they're provided externally (?)
  3. Triage the full set of gears & gear authors, communicate a breaking change

@ryansanford
Contributor

Final review pending DB change by @nagem

@nagem
Contributor

nagem commented Dec 13, 2017

Adding breaking change label as this PR contains a DB update that retypes files that exist in the system.

@kofalt
Contributor

kofalt commented Dec 13, 2017

@gsfr Did you ever make a decision regarding my (collapsed) review comment above?

Based on our offline discussion, I am fine with us either resolving that, or shipping this to customers after we use our TBD new customer-notification strategy.

@nagem
Contributor

nagem commented Dec 13, 2017

@ryansanford @gsfr @davidfarkas @kofalt

I added the database update that will retype files (and insert the starting state of the filetypes collection for existing customers). Currently the only thing preventing this from landing is resolving any conflicts with existing gears and properly alerting existing users.

@gsfr
Member

gsfr commented Dec 14, 2017

Thanks for the ping, @kofalt.

Properly defining the gear surface area seems like a fairly urgent matter. Gear authors must know what they can rely on and what they cannot. We need to make that explicit, rather than retroactively holding ourselves to standards we never knowingly committed to. So you certainly have a good point!

I would like to go ahead with the current change, which can be rolled out to most sites without much communication, as determined by ops. Any site in question needs to be properly contacted and advised, again as determined by ops. Cc @tcbtcb.

@lmperry @jenreiter @kjamison Please advise on FW-internal gear impact.

@kofalt
Contributor

kofalt commented Dec 14, 2017

I'd like to communicate with every site, just to get in the habit. But this plan SGTM.

@kjamison

I'm pinged on this but have no idea what this change actually means; can someone TL;DR? As far as gears go, is this just a change in the metadata for input files?

@nagem
Contributor

nagem commented Dec 14, 2017

@kjamison the stylization of a file's type key available on the inputs map will change. For example,

"config" : {
        "inputs" : {
            "nifti" : {
                "hierarchy" : {...},
                "base" : "file",
                "location" : {...},
                "object" : {
                    "info" : {...},
                    "mimetype" : "application/octet-stream",
                    "tags" : [],
                    "measurements" : [],
                    "type" : "nifti",  ## This key will have a different stylization, after this update it will be "NIfTI"
                    "modality" : null,
                    "size" : 389164
                }
            }
        },
        "destination" : {...},
        "config" : {...}
    }

Any gears that reference this key in a way that would be broken by a different string stylization (or, in some situations, by a different type being applied; @gsfr knows more if you think this applies to you) should be updated. Anyone else can chime in if they have information that better applies to gear authors; I'm responding with API developer context.

@gsfr
Member

gsfr commented Dec 15, 2017

@kjamison If you are not relying on specific file type or classification metadata, you should be unaffected. Thanks!

@kofalt
Contributor

kofalt commented Dec 15, 2017

Breaking change: the type key on files will have a new format. Some types will have changed, some have been broken up, and some new types have been added. In addition, it will now be possible to add custom types on a per-installation basis.

If you are using one of the types below that has unambiguously changed, you can prepare for the change by adding a function such as this one to your code:

def updated_type(old_type):
	"""
	Updates an old file type to the new format, if it matches.
	"""
	
	return (
		old_type 
		.replace('archive',       'Archive')
		.replace('bval',          'BVAL')
		.replace('bvec',          'BVEC')
		.replace('dicom',         'DICOM')
		.replace('document',      'Document')
		.replace('eeg',           'EEG')
		.replace('gephysio',      'GE Physio')
		.replace('image',         'Image')
		.replace('ismrmrd',       'HDF5')
		.replace('log',           'Log')
		.replace('markdown',      'Markdown')
		.replace('MATLAB data',   'MATLAB Data')
		.replace('MGH data',      'MGH Data')
		.replace('nifti',         'NIfTI')
		.replace('parrec',        'PAR/REC')
		.replace('pdf',           'PDF')
		.replace('pfile',         'PFile')
		.replace('presentation',  'Presentation')
		.replace('PsychoPy data', 'PsychoPy Data')
		.replace('qa',            'QC')
		.replace('spreadsheet',   'Spreadsheet')
		.replace('tabular data',  'Tabular Data')
		.replace('text',          'Plain Text')
		.replace('video',         'Video')
	)

Then, change your logic as shown:

# Old version
if filetype == 'nifti':
	run_script()

# New version
if updated_type(filetype) == 'NIfTI':
	run_script()

This will make your code ready for the breaking change.
The updated_type function can be removed later.

A full list of the changes is as follows:

Unambiguously changed:

Old             New
archive         Archive
bval            BVAL
bvec            BVEC
dicom           DICOM
document        Document
eeg             EEG
gephysio        GE Physio
image           Image
ismrmrd         HDF5
log             Log
markdown        Markdown
MATLAB data     MATLAB Data
MGH data        MGH Data
nifti           NIfTI
parrec          PAR/REC
pdf             PDF
pfile           PFile
presentation    Presentation
PsychoPy data   PsychoPy Data
qa              QC
spreadsheet     Spreadsheet
tabular data    Tabular Data
text            Plain Text
video           Video

Ambiguously changed:

Old           New
markup        HTML or XML
source code   C/C++ or CSS or Java or JSON or MATLAB or Python or PHP or JavaScript or TOML or YAML

New types:

Old   New
n/a   Audio
n/a   EFile
n/a   Jupyter
n/a   PFile Header

@ehlertjd
Contributor

One potential concern with using replace in the Python method: if, for example, I added and used a custom type called text/csv and then ran that function, wouldn't it incorrectly map to Plain Text/csv?

@kofalt
Contributor

kofalt commented Dec 15, 2017

Totally - I was considering using regexes, but I figured this version would be easier to read.

The type key is not an enum, but in practice the range of values is pretty small. I am also suggesting this in-place change for comparison only, not to send back to the DB.

I would accept an alternative that added ^$ to everything to make it more correct.
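One way to get the same effect without anchoring every pattern is an exact-match lookup; a minimal sketch (the mapping shown is partial, see the table above for the full set):

# Exact-match lookup instead of substring replacement, so custom types
# such as 'text/csv' pass through unchanged.
TYPE_MAP = {
    'archive': 'Archive',
    'nifti':   'NIfTI',
    'text':    'Plain Text',
    # ... remaining entries from the table above
}

def updated_type(old_type):
    return TYPE_MAP.get(old_type, old_type)

assert updated_type('nifti') == 'NIfTI'
assert updated_type('text/csv') == 'text/csv'   # custom type is unaffected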

@kofalt
Contributor

kofalt commented Dec 15, 2017

A quick note: some gear rules may be broken by this change if they reference any of the types that were more than merely case-changed. We should probably address that in this branch.

@gsfr
Member

gsfr commented Dec 15, 2017

Thanks, @kofalt. Not sure we should include code, especially if it's not bulletproof. The change users need to make is likely trivial compared to the update function.

@kofalt
Contributor

kofalt commented Dec 15, 2017

Well, the trick is that they need to have their changes ready before this break ships, right? So there needs to be a way for a script to work with the old & new versions simultaneously. Unless I'm missing something...

@gsfr
Member

gsfr commented Dec 15, 2017

Theoretically, you are right, of course. Practically, I continue to think it won't be a problem.

We may, however, want to use a forward-compatible change like that in our own gears. But I still haven't heard that any of our own gears are actually affected.

@kofalt
Contributor

kofalt commented Dec 19, 2017

@gsfr At minimum one gear that Jen is working on is affected. I've been keeping her up to date on this change to help with that. I'm going to assume that any gear break could affect anyone going forward.

@ryansanford
Contributor

Additional upgrade needs:

  • Update existing gear rules
  • Update existing session templates
  • Reconcile how pending jobs that are locked to gear IDs at the time of upgrade will still result in properly typed files.

@gsfr gsfr unassigned nagem May 11, 2018