-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Obfuscation feature #1
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Adam!
I have a couple suggestions for work to be done later on (definitely don't need to do these as a part of this trial project!).
(1) Just use a single Tyr.Collection
- There should just be one *Tyr.Collection instance per domain.
- A single Tyr.Collection instance can map to multiple MongoDB collections.
For example, the migrate function could look more like:
migrate(..., from: Tyr.Collection, to: Tyr.Collection, ...)
to
migrate(..., collection: Tyr.Collection, obfuscationSuffix, ...)
Maybe the suffix could have a default value like (_obfuscated
). So then if User information is
by default stored in the MongoDB collection "users", then the obfuscated data would be stored at "users_obfuscated" by default.
(2) There is an existing module for Tyranid text sanitization which could be used for generating "randomized" text located at:
https://github.com/tyranid-org/tyranid/tree/master/packages/tyranid-sanitize
(3) The Tyranid repository is set up as a Mono Repo (https://en.wikipedia.org/wiki/Monorepo) using Lerna (https://lernajs.io/). (This is mostly done to ensure that tests for all subprojects are run when any project is updated to ensure that everything plays well with each other.) Later on we can set you up as a Tyranid committer and I can walk you through setting up your stuff as another Tyranid project.
(4) I think the obfuscator and de-obfuscator needs to be able to be done both in batch and on a per-document basis. I am guessing that in the CrossLead administration area there will someday be an option to "Obfuscate" or "Unobfuscate" a single user (or set of users) without affecting the existing obfuscations.
(5) I notice that you are using the same _id for the main record in the main collection and the obfuscated record in the obfuscated collection. This is great! -- definitely keep this.
(6) It would be great to add a "_obfuscated" field or something to the original document when it is obfuscated so code operating on that data knows it's essentially working with (ideally) read-only data at this point.
(7) Tyranid is first and foremost a metadata library so it is a good use of tyranid to define more refined sub-types of existing types. So, for example, rather than modeling firstName
as { is: 'string', ... }
we could define a new firstName
type pretty easily. Then each type could have its own "sanitized"/"obfuscated" function. We already have some existing types already, like "url", "email", and so on which can be used to generate higher quality "sample data".
Anyway, all these suggestions are for the future. I'm perfectly happy with what you have already for the purposes of this trial project.
Great work again! Thanks.
(8) We also will need to obfuscate the historical data that is currently being kept. (A lot of the collections are historical in that when new values are changed we keep records of what the previous values were and who changed them for auditing and for being able to show visualizations how organizations are evolving over time.) This should not be too much work, probablyt would be easiest to just pair program for a couple hours some afternoon. |
Going forward from this proof of concept to a production ready functionality, I would prefer to refactor it to use the in-place obfuscation approach instead of generating a new collection. Therefore I probably wouldn't alter the current code to allow for single obfuscations, but rather implement it as part of the redesign.
|
"@types/mongodb": "^3.1.19", | ||
"mongodb": "^3.1.10", | ||
"tyranid": "^0.5.66" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably convert these to peerDependencies
in the future so that consuming apps don't run into version mismatches and/or multiple imported copies of these libs.
} | ||
|
||
const migrateData = async (targetCollection: Tyr.CollectionInstance, sourceCollection: Tyr.CollectionInstance, query?: Tyr.MongoQuery) => { | ||
const q = query ? query : {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: You can do default paramaters if you like using this syntax:
async (..., query: Tyr.MongoQuery = {})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, I just assumed the function signature had to match the interface
*/ | ||
replacementValues?: object, | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking:
You could enforce either replacementValCollection
or replacementValues
(mutually exclusive) with a base type and type unions:
interface OptsCommon {
query: FilterQuery;
...
}
interface OptsWithCol extends OptsCommon {
replacementValCollection?: Collection;
}
interface OptsWithVals extends OptsCommon {
replacementValues?: object,
}
export type ObfuscateBatchOpts = OptsWithCol | OptsWithVals;
|
||
const records = await collection.findAll({ query: {} }); | ||
t.true(records.length === 10, 'All records still in collection'); | ||
t.notDeepEqual(JSON.stringify(records), ExpectedResults.CopiedObfuscateableData, 'Data not encrypted'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future: This assert seems a little prone to false positives. For example, if anyone accidentally adds to the users dataset but not the expected results, the test would always pass. Compounded slightly with the stringification: if stringify ever returns a format slightly different from your string literal expected result, same issue. What was the purpose of using stringify by the way? Were you running into issues with a regular deep comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very true, I was trying to quickly replicate how DBunit asserts database result batches. Also I wasn't sure how to account for the unique values generated in a simplistic comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some inline questions and minor feedback, but overall looks good.
Regarding point (2) from @dimensia about tyranid-sanitize
: Indeed the ultimate goals are similar. That plugin was developed more for generating test data sets without PII from existing data. However, that plugin wasn't quite configurable enough to satisfy the test data nor GDPR use case. You couldn't, for example, punt the mask value generation to the application level, so a database with unique constraints wouldn't validate. I'd basically imagine this plugin superseding tyranid-sanitize
but maybe taking bits and pieces, like the data generation via faker.js
.
Thanks again!
Initial end to end implementation of workflow, basic implementation
Migrate Data -> Apply masking values -> Encrypt Data -> Overwrite masked data with decrypted restoration of migrated data
Includes base condition tests, and rudimentary AES encryption module