Hash transformer is too slow #8

Closed
viniciuschiele opened this issue Feb 8, 2024 · 8 comments · Fixed by #11 or #14

@viniciuschiele

I'm currently using RandomUuid for most of the columns but I was asked to hash the original values to maintain the same masked value.

I've replaced RandomUuid with Hash and what used to take less than a minute to dump/transform the data now takes 30 min.

This is what the transformation config looks like (the same pattern is used for 6 tables):

    - schema: core
      name: users
      transformers:
        - name: Hash
          params:
            column: email
        - name: Hash
          params:
            column: first_name
        - name: Hash
          params:
            column: last_name
@wwoytenko
Contributor

We will review the implementation, but I suspect the slowdown is due to the hash function. I think we should offer a choice of hash function along with its parameters. We will try to resolve it in the next release.

For now, I can suggest a temporary workaround. You can use a simple shell script to implement any hashing function. For instance:

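    #!/bin/bash

    # Read one value per line and print its MD5 hex digest.
    # IFS= and -r keep leading whitespace and backslashes intact.
    while IFS= read -r line
    do
       printf "%s" "$line" | md5sum | awk '{print $1}'
    done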

The config could then look like this:

    - schema: "humanresources"
      name: "employee"
      transformers:
        - name: "Cmd"
          params:
            driver:
              name: "text"
            expected_exit_code: -1
            skip_on_null_input: true
            executable: "/var/lib/playground/test.sh"
            columns:
              - name: "jobtitle"

The result

image

Read about Cmd transformer
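
A quick way to sanity-check the workaround script before wiring it into the config is to run the same loop locally. This is only a sketch: it assumes the `text` driver hands the script one value per line on stdin, as the loop above implies, and `hash_lines` is a hypothetical name used just for this example.

```shell
#!/bin/bash

# Same loop as the workaround script, wrapped in a function for local testing.
# hash_lines is a hypothetical helper name, not part of Greenmask.
hash_lines() {
  while IFS= read -r line; do
    # md5sum prints "<digest>  -"; awk keeps only the digest column.
    printf '%s' "$line" | md5sum | awk '{print $1}'
  done
}

# Feed the same sample value twice, one per line.
printf 'hello\nhello\n' | hash_lines
```

Both input lines are identical, so both output lines carry the same digest, which is the property the original request needed. The loop starts a new `md5sum` process for every value, so it is probably best treated as a stopgap rather than a fast path for large tables.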

@wwoytenko wwoytenko self-assigned this Feb 8, 2024
@wwoytenko wwoytenko added the enhancement New feature or request label Feb 8, 2024
@viniciuschiele
Author

I see that the current Hash implementation uses an encryption algorithm; a plain hash algorithm might be faster.

I'm going to push back on this requirement for now until there is a built-in solution for it. Thanks!

@wwoytenko
Contributor

Agreed. We will try to deliver it soon, but any contribution is appreciated. Thank you!

@viniciuschiele
Author

Another finding about the Hash transformer: it seems to generate duplicate values even for a small set of values.

I have a table users with 1000 records, and the email column is unique (UNIQUE INDEX). When I use Hash, I get an error during pg_restore: ERROR: could not create unique index "users_email_ux"

@wwoytenko
Contributor

Yeah, that looks like a collision. I will rewrite the implementation so that the hash function can be chosen (md5, sha1, sha224/256/384/512). The expected release date is 14 February. Thank you so much for reporting.

@viniciuschiele
Author

Never mind, it was my fault: the UNIQUE INDEX has a filter condition that allows duplicate emails for deleted users.
Sorry for the false alarm.
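
The false alarm follows directly from hashing being deterministic: rows that share an email produce the same hash, so a duplicate in the output does not by itself indicate a collision. A rough sketch of how one might check this on a list of values is shown below; `duplicate_hashes` is a hypothetical helper, and `sha1sum` is used only as a stand-in for whatever function the transformer applies.

```shell
#!/bin/bash

# Print every digest that occurs more than once among the values read from stdin.
# duplicate_hashes is a hypothetical helper name, not part of Greenmask.
duplicate_hashes() {
  while IFS= read -r value; do
    printf '%s' "$value" | sha1sum | awk '{print $1}'
  done | sort | uniq -d
}

# Two rows share an email, as the filtered unique index permits for deleted users.
printf '%s\n' 'a@example.com' 'b@example.com' 'a@example.com' | duplicate_hashes
```

With the sample input, exactly one digest is printed, namely the one for the repeated address. Running the same check on the distinct original values should print nothing unless a genuine collision occurred.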

wwoytenko added a commit that referenced this issue Feb 10, 2024
* New `Hash` transformer uses `sha1` hash by default.
* Added a `function` parameter for choosing the hash algorithm: `md5, sha1, sha256, sha512`.
* Added a `max_length` parameter that truncates the hash to the given length by cutting off the tail. The default value is `0`, meaning "do not truncate".
* Fixed metadata enrichment for validation warnings caused by `RawValueValidator`

Additional changes:
* Added Error severity for Cmd parameter validator

Closes #8
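
The behaviour described in this commit message can be approximated in a few lines of shell. This is a sketch under stated assumptions, not the Greenmask implementation: `hash_value` is a hypothetical helper, the digest is assumed to be hex-encoded, and truncation is assumed to keep the first `max_length` characters.

```shell
#!/bin/bash

# hash_value FUNCTION MAX_LENGTH VALUE
# Hypothetical helper mirroring the `function` and `max_length` parameters.
hash_value() {
  local func="$1" max_length="$2" value="$3" digest
  case "$func" in
    md5|sha1|sha256|sha512) ;;
    *) echo "unsupported hash function: $func" >&2; return 1 ;;
  esac
  # coreutils names its tools md5sum, sha1sum, sha256sum and sha512sum.
  digest="$(printf '%s' "$value" | "${func}sum" | awk '{print $1}')"
  # A max_length of 0 means "do not truncate".
  if [ "$max_length" -gt 0 ]; then
    digest="${digest:0:$max_length}"
  fi
  printf '%s\n' "$digest"
}

hash_value sha1 0 "alice@example.com"    # full 40-character hex digest
hash_value sha1 12 "alice@example.com"   # first 12 characters of the same digest
```

Truncating shortens the output to fit narrow columns, but it also raises the chance that two different inputs end up with the same masked value, so a small `max_length` is worth avoiding on columns with unique constraints.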
@wwoytenko
Contributor

Fixed in v0.1.5

@viniciuschiele
Author

I gave it a try and the new Hash transformer is crazy fast now. Thank you!
