Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: scrub substring values with their hash #44

Merged
merged 4 commits into from
Jul 19, 2024
Merged

Conversation

mrnagydavid
Copy link
Contributor

@mrnagydavid mrnagydavid commented Jul 4, 2024

In this PR, I added a new scrubber named saltedHashSubstringScrubber which behaves similar to saltedHashScrubber - with the difference that the new scrubber replaces a substring with the hash rather than the entire value.

The substring can be specified as a string or a regex.
The scrubber takes the substring and hashes it the same way as saltedHashScrubber.
Finally it replaces the substring with the hash.

The underlying intention is to be able to replace multiple occurrences of substring with the same replacement value.

@flo-monin
Copy link
Contributor

One missing piece here is the scrubberSQL part :D
if you look in getScrubberSql, you will see that

_assert(scrubber, `No SQL factory for ${scrubberCurrentField.scrubber}, used for ${fieldName}`)

will throw. In order to avoid that, we want to implement a macAndIdScrubberSQL function. It feels like a lot of effort for not much to implement the same logic for the SQL scrubbing (it is used to mask the data in Snowflake), so I would suggest using the undefinedScrubber

tl;dr:

add this to defaultScrubbersSQL:

macAndIdScrubber: undefinedScrubberSQL,

@mrnagydavid
Copy link
Contributor Author

I added the requested changes. I'm afraid it will effectively make every HardwareDevice and pairedDevice undefined. Not sure if that is going to be an issue or not.

I also added a pair of tests to ensure that future devs have a failing test when they miss the SQL-scrubber.

@ryangibbsnc
Copy link

I defer to Florent here 🤞

@flo-monin
Copy link
Contributor

It might indeed, but I don't think we use hardware device in analytics anyway
But yes, that's not great 😬
Actually, I tried thinking of a way to make this work, but I'm not sure the scrubber SQL can actually work for this "object level" scrubber :o
I don't want to block you with this, so easiest is probably to still go with undefined indeed

@kirillgroshkov
Copy link
Member

I want to point out one thing.

Look at this list of scrubbers:

export const defaultScrubbers: ScrubbersMap = {
  staticScrubber,
  preserveOriginalScrubber,
  isoDateStringScrubber,
  unixTimestampScrubber,
  undefinedScrubber,
  charsFromRightScrubber,
  randomScrubber,
  randomEmailScrubber,
  randomEmailInContentScrubber,
  saltedHashScrubber,
  saltedHashEmailScrubber,
  bcryptStringScrubber,
  macAndIdScrubber,
}

Do you see an outlier here? (like a children's game - find what does not belong in a list)

The original idea for the scrubber-lib project was that it's generic/open-source, not NC specific. All the scrubbers until this PR followed that. But macAndId looks 100% NC-specific.

2 possible outcomes:

  1. We drop the original idea of the library to be generic, and go "all in NC"
  2. We preserve the generic-ness, and approach the problem differently

I suggest to start with 2 and see if there's a rather simple way to make it generic. And if we really really cannot - come back here and discuss/consider 1. My gut feeling is that 2 should be possible.

Alternative is to define NC-specific scrubbers in NCShared.

src/scrubbers.ts Outdated
Comment on lines 494 to 496
const substringToReplace = `REGEXP_SUBSTR(${sqlValueToReplace}, ${params.regex})`
const hashedValue = `SHA2(${substringToReplace} || '${params.initializationVector}', 256)`
const replacedValue = `REGEXP_REPLACE(${sqlValueToReplace}, ${params.regex}, ${hashedValue})`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flo-monin Can you help me check if this actually works?

I checked that the functions are available on all major rdbms, but I am not sure how to test this properly.

@mrnagydavid mrnagydavid changed the title feat: scrub mac address from objects feat: scrub substring values with their hash Jul 18, 2024
src/scrubbers.ts Outdated
Comment on lines 494 to 496
const substringToReplace = `REGEXP_SUBSTR(${sqlValueToReplace}, ${params.regex})`
const hashedValue = `SHA2(${substringToReplace} || '${params.initializationVector}', 256)`
const replacedValue = `REGEXP_REPLACE(${sqlValueToReplace}, ${params.regex}, ${hashedValue})`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const substringToReplace = `REGEXP_SUBSTR(${sqlValueToReplace}, ${params.regex})`
const hashedValue = `SHA2(${substringToReplace} || '${params.initializationVector}', 256)`
const replacedValue = `REGEXP_REPLACE(${sqlValueToReplace}, ${params.regex}, ${hashedValue})`
const substringToReplace = `NVL(REGEXP_SUBSTR(${sqlValueToReplace}, '${params.regex}'), '')`
const hashedValue = `SHA2(${substringToReplace} || '${params.initializationVector}', 256)`
const replacedValue = `REGEXP_REPLACE(${sqlValueToReplace}, '${params.regex}', ${hashedValue})`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long story short, you need to coalesce the null value to an empty string to avoid returning null if the field doens't match the regex
and you also need to add sing quotes around the regex itself, for the SQL to compile :D

@mrnagydavid
Copy link
Contributor Author

@kirillgroshkov I picked up my Generalizator gun, pointed at the code and pulled the trigger.

Copy link
Member

@kirillgroshkov kirillgroshkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just one comment on naming. You call the scrubber "Salted". But the parameter is called "initializationVector". Is it better/worse to make them called the same, e.g rename initializationVector to salt?

@mrnagydavid
Copy link
Contributor Author

mrnagydavid commented Jul 19, 2024

There already is a saltedHashScrubber, and the new scrubber does exactly the same hashing (initialVector + hashing).
The only difference is that the new one works on substrings, not the entire value.
Hence the name.

I think if we change the name, both should be changed.
I am not sure if it's worth it.

@kirillgroshkov
Copy link
Member

There already is a saltedHashScrubber, and the new scrubber does exactly the same hashing (initialVector + hashing). The only difference is that the new one works on substrings, not the entire value. Hence the name.

I think if we change the name, both should be changed. I am not sure if it's worth it.

No worries then, ok to keep.

How about the other part of my question - to rename initializationVector to salt (in the implementation/param name)?

@mrnagydavid
Copy link
Contributor Author

How about the other part of my question - to rename initializationVector to salt (in the implementation/param name)?

As we discussed on the meeting, this name comes from the lib as it is being used elsewhere too.
Inside the lib, it is actually only used for saltedHashScrubber (and the new scrubber).
Outside the lib, I searched for this initializationVector string, and it seems that it is not explicitly configured from outside from any active repo - so it uses the default value inside the lib -> it could be fairly easy to rename.

@mrnagydavid mrnagydavid merged commit ae87afa into master Jul 19, 2024
1 check passed
@mrnagydavid mrnagydavid deleted the DEV-18701 branch July 19, 2024 10:38
Copy link

🎉 This issue has been resolved in version 2.14.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants