Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Obfuscation feature #1

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

ahouck
Copy link

@ahouck ahouck commented Feb 21, 2019

Initial end to end implementation of workflow, basic implementation

Migrate Data -> Apply masking values -> Encrypt Data -> Overwrite masked data with decrypted restoration of migrated data

Includes base condition tests, and rudimentary AES encryption module

@ahouck ahouck requested a review from yangchristian February 21, 2019 05:09
Copy link

@dimensia dimensia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Adam!

I have a couple suggestions for work to be done later on (definitely don't need to do these as a part of this trial project!).

(1) Just use a single Tyr.Collection

  • There should just be one *Tyr.Collection instance per domain.
  • A single Tyr.Collection instance can map to multiple MongoDB collections.

For example, the migrate function could look more like:

migrate(..., from: Tyr.Collection, to: Tyr.Collection, ...)

to

migrate(..., collection: Tyr.Collection, obfuscationSuffix, ...)

Maybe the suffix could have a default value like (_obfuscated). So then if User information is
by default stored in the MongoDB collection "users", then the obfuscated data would be stored at "users_obfuscated" by default.

(2) There is an existing module for Tyranid text sanitization which could be used for generating "randomized" text located at:

https://github.com/tyranid-org/tyranid/tree/master/packages/tyranid-sanitize

(3) The Tyranid repository is set up as a Mono Repo (https://en.wikipedia.org/wiki/Monorepo) using Lerna (https://lernajs.io/). (This is mostly done to ensure that tests for all subprojects are run when any project is updated to ensure that everything plays well with each other.) Later on we can set you up as a Tyranid committer and I can walk you through setting up your stuff as another Tyranid project.

(4) I think the obfuscator and de-obfuscator needs to be able to be done both in batch and on a per-document basis. I am guessing that in the CrossLead administration area there will someday be an option to "Obfuscate" or "Unobfuscate" a single user (or set of users) without affecting the existing obfuscations.

(5) I notice that you are using the same _id for the main record in the main collection and the obfuscated record in the obfuscated collection. This is great! -- definitely keep this.

(6) It would be great to add a "_obfuscated" field or something to the original document when it is obfuscated so code operating on that data knows it's essentially working with (ideally) read-only data at this point.

(7) Tyranid is first and foremost a metadata library so it is a good use of tyranid to define more refined sub-types of existing types. So, for example, rather than modeling firstName as { is: 'string', ... } we could define a new firstName type pretty easily. Then each type could have its own "sanitized"/"obfuscated" function. We already have some existing types already, like "url", "email", and so on which can be used to generate higher quality "sample data".

Anyway, all these suggestions are for the future. I'm perfectly happy with what you have already for the purposes of this trial project.

Great work again! Thanks.

@dimensia
Copy link

(8) We also will need to obfuscate the historical data that is currently being kept. (A lot of the collections are historical in that when new values are changed we keep records of what the previous values were and who changed them for auditing and for being able to show visualizations how organizations are evolving over time.) This should not be too much work, probablyt would be easiest to just pair program for a couple hours some afternoon.

@ahouck
Copy link
Author

ahouck commented Feb 21, 2019

@dimensia

  1. Just pushed a quick update that will do this, didn't put in the default case, will do that.

  2. This would work really well with the obfuscation plugin, they seem to have a similar basic underlying goal. Would it be worth it to merge the two eventually? So for instance if you wanted to sanitize it would be the same as obfuscating with generated values.

  3. Sounds good

  4. Technically you can do a batch of one right now, as the code is, if you wanted to preserve and encrypt the original value (which is optional) it would create a standalone collection for that one user. This is mainly due to the bug I ran into with the $out function.

Going forward from this proof of concept to a production ready functionality, I would prefer to refactor it to use the in-place obfuscation approach instead of generating a new collection. Therefore I probably wouldn't alter the current code to allow for single obfuscations, but rather implement it as part of the redesign.

  1. Yes I wanted the records in the derived collections to always be able to be tied back to the original data, even if they are exported and later imported.

  2. See #4, I would like to re-implement on the in-place obfuscation idea

  3. That would be good, when I started the project I did not have a great grasp of Tyranid and approached it more from a raw data/working with mongo direction.

src/obfuscator.ts Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
package.json Outdated Show resolved Hide resolved
"@types/mongodb": "^3.1.19",
"mongodb": "^3.1.10",
"tyranid": "^0.5.66"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably convert these to peerDependencies in the future so that consuming apps don't run into version mismatches and/or multiple imported copies of these libs.

src/obfuscator.ts Outdated Show resolved Hide resolved
}

const migrateData = async (targetCollection: Tyr.CollectionInstance, sourceCollection: Tyr.CollectionInstance, query?: Tyr.MongoQuery) => {
const q = query ? query : {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: You can do default paramaters if you like using this syntax:

async (..., query: Tyr.MongoQuery = {})

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, I just assumed the function signature had to match the interface

*/
replacementValues?: object,

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking:
You could enforce either replacementValCollection or replacementValues (mutually exclusive) with a base type and type unions:

interface OptsCommon {
  query: FilterQuery;
  ...
}
interface OptsWithCol extends OptsCommon {
  replacementValCollection?: Collection;
}
interface OptsWithVals extends OptsCommon {
  replacementValues?: object,
}
export type ObfuscateBatchOpts  = OptsWithCol | OptsWithVals;


const records = await collection.findAll({ query: {} });
t.true(records.length === 10, 'All records still in collection');
t.notDeepEqual(JSON.stringify(records), ExpectedResults.CopiedObfuscateableData, 'Data not encrypted');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future: This assert seems a little prone to false positives. For example, if anyone accidentally adds to the users dataset but not the expected results, the test would always pass. Compounded slightly with the stringification: if stringify ever returns a format slightly different from your string literal expected result, same issue. What was the purpose of using stringify by the way? Were you running into issues with a regular deep comparison?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very true, I was trying to quickly replicate how DBunit asserts database result batches. Also I wasn't sure how to account for the unique values generated in a simplistic comparison.

Copy link
Member

@yangchristian yangchristian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some inline questions and minor feedback, but overall looks good.

Regarding point (2) from @dimensia about tyranid-sanitize: Indeed the ultimate goals are similar. That plugin was developed more for generating test data sets without PII from existing data. However, that plugin wasn't quite configurable enough to satisfy the test data nor GDPR use case. You couldn't, for example, punt the mask value generation to the application level, so a database with unique constraints wouldn't validate. I'd basically imagine this plugin superseding tyranid-sanitize but maybe taking bits and pieces, like the data generation via faker.js.

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants