feat: Apply the same transformation on the references (#229)
- Added test coverage for `validateAndBuildEntriesConfig` using test
containers to validate the `apply_for_references` and
`apply_for_inherited` parameters.
- Added a check to retain manually defined transformers if an
`apply_for_references` transformer has already been set manually.
- Added documentation with examples.
- Revised transformers implementation structure by moving string
literals to constants.
- Configured built-in transformers to support `apply_for_references`.
- Revised context initialization logic.
- Added comments for clarity.
- Refined introspection queries.
- Updated logic for partition table transformation inheritance.
- Removed artifacts and unused attributes.
- Simplified table configuration initialization by decomposing it into
smaller functions.

Closes #182
wwoytenko authored Nov 3, 2024
1 parent ed0b9b0 commit 5e17392
Showing 62 changed files with 2,262 additions and 479 deletions.
6 changes: 6 additions & 0 deletions Makefile
@@ -23,3 +23,9 @@ install:
# The build flag -tags=viper_bind_struct has been added to avoid the need to bind each of the environment variables
build: $(CMD_FILES)
	CGO_ENABLED=0 go build -tags=viper_bind_struct -ldflags="$(LDFLAGS)" -v -o $(CMD_NAME) $(MAIN_PATH)

lint:
	golangci-lint run ./...

up:
	docker-compose up playground-dbs-filler
9 changes: 7 additions & 2 deletions docs/built_in_transformers/index.md
@@ -3,10 +3,15 @@
Transformers in Greenmask are methods which are applied to anonymize sensitive data. All Greenmask transformers are
split into the following groups:

- [Transformation engines](transformation_engines.md) — the type of generator used in transformers. Hash (deterministic)
and random (randomization)
- [Dynamic parameters](dynamic_parameters.md) — transformers that require an input of parameters and generate
random data based on them.
- [Parameters templating](parameters_templating.md) — generate static parameters values from templates.
- [Transformation conditions](transformation_condition.md) — conditions that can be applied to transformers. If the
condition is not met, the transformer will not be applied.
- [Transformation Inheritance](transformation_inheritance.md) — transformation inheritance for partitioned tables and
tables with foreign keys. Define once and apply to all.
- [Standard transformers](standard_transformers/index.md) — transformers that require only an input of parameters.
- [Advanced transformers](advanced_transformers/index.md) — transformers that can be modified according to user's needs
with the help of [custom functions](advanced_transformers/custom_functions/index.md).
2 changes: 1 addition & 1 deletion docs/built_in_transformers/transformation_condition.md
@@ -9,7 +9,7 @@ The condition must be defined as a boolean expression that evaluates to `true` o
`expr` library.

You can use the same functions that are described in
the [built-in transformers](./advanced_transformers/custom_functions/index.md).

The transformers are executed one by one, which helps you create complex transformation pipelines. For instance,
depending on the value chosen in the previous transformer, you can decide whether to execute the next transformer.
300 changes: 300 additions & 0 deletions docs/built_in_transformers/transformation_inheritance.md
@@ -0,0 +1,300 @@
# Transformation Inheritance

## Description

If you have partitioned tables or want to apply a transformation to a primary key and propagate it to all tables
referencing that column, you can do so with Greenmask.

## Apply for inherited

Using `apply_for_inherited`, you can apply transformations to all partitions of a partitioned table, including any
subpartitions.

### Configuration conflicts

When a partition has a transformation defined manually via config, and `apply_for_inherited` is set on the parent table,
Greenmask will merge both the inherited and manually defined configurations. The manually defined transformation will
execute last, giving it higher priority.

If this situation occurs, you will see the following information in the log:

```json
{
  "level": "info",
  "ParentTableSchema": "public",
  "ParentTableName": "sales",
  "ChildTableSchema": "public",
  "ChildTableName": "sales_2022_feb",
  "ChildTableConfig": [
    {
      "name": "RandomDate",
      "params": {
        "column": "sale_date",
        "engine": "random",
        "max": "2005-01-01",
        "min": "2001-01-01"
      }
    }
  ],
  "time": "2024-11-03T22:14:01+02:00",
  "message": "config will be merged: found manually defined transformers on the partitioned table"
}
```
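
The log entry above could result from a configuration along the following lines. This is only a sketch pieced together
from the log fields and from Example 1 below, not a configuration taken from the Greenmask test suite: the parent
`sales` table sets `apply_for_inherited`, while the `sales_2022_feb` partition also defines its own `RandomDate`
transformer.

```yaml
# Parent table: the transformer is inherited by every partition and subpartition.
- schema: public
  name: sales
  apply_for_inherited: true
  transformers:
    - name: RandomDate
      params:
        min: "2000-01-01"
        max: "2005-01-01"
        column: "sale_date"
        engine: "random"

# Partition with a manually defined transformer (values taken from the log above).
# Both configurations are merged for this partition; this transformer runs last.
- schema: public
  name: sales_2022_feb
  transformers:
    - name: RandomDate
      params:
        min: "2001-01-01"
        max: "2005-01-01"
        column: "sale_date"
        engine: "random"
```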

## Apply for references

Using `apply_for_references`, you can define a transformation on a primary key column and have it applied to every
foreign key column that references it. This simplifies the configuration: you define the transformation only once, on
the primary key column, and Greenmask applies it to all tables referencing that column.

The transformer must be deterministic or support the `hash` engine, and the `hash` engine must be set in the
configuration file.

List of transformers that support `apply_for_references`:

* Hash
* NoiseDate
* NoiseFloat
* NoiseInt
* NoiseNumeric
* RandomBool
* RandomDate
* RandomEmail
* RandomFloat
* RandomInt
* RandomIp
* RandomMac
* RandomNumeric
* RandomString
* RandomUuid
* RandomUnixTimestamp

### End-to-End Identifiers

End-to-end identifiers in databases are unique identifiers that are consistently used across multiple tables in a
relational database schema, allowing for a seamless chain of references from one table to another. These identifiers
typically serve as primary keys in one table and are propagated as foreign keys in other tables, creating a direct,
traceable link from one end of a data relationship to the other.

Greenmask can detect end-to-end identifiers and apply transformations across the entire sequence of tables. An
identifier is treated as end-to-end when the columns of a foreign key also serve as the primary key of the table that
declares the foreign key, as in Example 3 below.

### Configuration conflicts

When a transformation is manually defined via config on a referencing column (the foreign key column in the child
table) and `apply_for_references` is set on the parent table, the manually defined transformation is used and the
inherited transformation is ignored. You will receive an `INFO` message in the logs:

```json
{
  "level": "info",
  "TransformerName": "RandomInt",
  "ParentTableSchema": "public",
  "ParentTableName": "tablea",
  "ChildTableSchema": "public",
  "ChildTableName": "tablec",
  "ChildColumnName": "id2",
  "TransformerConfig": {
    "name": "RandomInt",
    "apply_for_references": true
  },
  "time": "2024-11-03T21:28:10+02:00",
  "message": "skipping apply transformer for reference: found manually configured transformer"
}
```
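
A configuration that might produce the message above is sketched below. The table and column names come from the log
entry and from Example 3; the parameters of the manually defined transformer on `tablec.id2` are assumptions chosen
purely for illustration.

```yaml
# Parent table: the transformer is propagated to columns that reference tablea.id2.
- schema: public
  name: tablea
  transformers:
    - name: RandomInt
      apply_for_references: true
      params:
        min: 0
        max: 100
        column: "id2"
        engine: "hash"

# Manually defined transformer on a referencing column (parameters are hypothetical).
# It takes precedence, so the transformer inherited from tablea is skipped for tablec.id2.
- schema: public
  name: tablec
  transformers:
    - name: RandomInt
      params:
        min: 0
        max: 100
        column: "id2"
        engine: "hash"
```

If the manually defined transformer produces different values than the inherited one would, the key values in `tablec`
may no longer match those in `tablea`, so a manual override on a referencing column deserves a careful look.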

### Limitations

- The transformation must be deterministic.
- The transformation condition will not be applied to the referenced column.
- Not all transformers support `apply_for_references`.

!!! warning

    We do not recommend using `apply_for_references` with transformation conditions, as these conditions are not
    inherited by transformers on the referenced columns. This may lead to inconsistencies in the data.

## Example 1. Partitioned tables

In this example, we have a partitioned table `sales` that is partitioned by year and then by month. Each partition
contains a subset of data based on the year and month of the sale. The `sales` table has a serial identifier column
`sale_id` and is partitioned by expressions on `sale_date`. The `sale_date` column is transformed using the
`RandomDate` transformer.

```sql
-- Step 1: Create the partitioned parent table
CREATE TABLE sales
(
    sale_id   SERIAL         NOT NULL,
    sale_date DATE           NOT NULL,
    amount    NUMERIC(10, 2) NOT NULL
) PARTITION BY RANGE (EXTRACT(YEAR FROM sale_date));

-- Step 2: Create first-level partitions by year
CREATE TABLE sales_2022 PARTITION OF sales
    FOR VALUES FROM (2022) TO (2023)
    PARTITION BY LIST (EXTRACT(MONTH FROM sale_date));

CREATE TABLE sales_2023 PARTITION OF sales
    FOR VALUES FROM (2023) TO (2024)
    PARTITION BY LIST (EXTRACT(MONTH FROM sale_date));

-- Step 3: Create second-level partitions by month for each year

-- Monthly partitions for 2022
CREATE TABLE sales_2022_jan PARTITION OF sales_2022 FOR VALUES IN (1)
    WITH (fillfactor = 70);
CREATE TABLE sales_2022_feb PARTITION OF sales_2022 FOR VALUES IN (2);
CREATE TABLE sales_2022_mar PARTITION OF sales_2022 FOR VALUES IN (3);
-- Continue adding monthly partitions for 2022...

-- Monthly partitions for 2023
CREATE TABLE sales_2023_jan PARTITION OF sales_2023 FOR VALUES IN (1);
CREATE TABLE sales_2023_feb PARTITION OF sales_2023 FOR VALUES IN (2);
CREATE TABLE sales_2023_mar PARTITION OF sales_2023 FOR VALUES IN (3);
-- Continue adding monthly partitions for 2023...

-- Step 4: Insert sample data
INSERT INTO sales (sale_date, amount)
VALUES ('2022-01-15', 100.00);
INSERT INTO sales (sale_date, amount)
VALUES ('2022-02-20', 150.00);
INSERT INTO sales (sale_date, amount)
VALUES ('2023-03-10', 200.00);
```

To transform the `sale_date` column in the `sales` table and all its partitions, you can use the following
configuration:

```yaml
- schema: public
  name: sales
  apply_for_inherited: true
  transformers:
    - name: RandomDate
      params:
        min: "2000-01-01"
        max: "2005-01-01"
        column: "sale_date"
        engine: "random"
```

## Example 2. Simple table references

This is an ordinary table reference, where the primary key of the `users` table is referenced in the `orders` table.

```sql
-- Enable the extension for UUID generation (if not enabled)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE users
(
    user_id  UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    username VARCHAR(50) NOT NULL
);
CREATE TABLE orders
(
    order_id   UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    user_id    UUID REFERENCES users (user_id),
    order_date DATE NOT NULL
);
INSERT INTO users (username)
VALUES ('john_doe');
INSERT INTO users (username)
VALUES ('jane_smith');
INSERT INTO orders (user_id, order_date)
VALUES ((SELECT user_id FROM users WHERE username = 'john_doe'), '2024-10-31'),
       ((SELECT user_id FROM users WHERE username = 'jane_smith'), '2024-10-30');
```

To transform the `user_id` column in the `users` table and propagate the transformation to the tables that reference
it, you can use the following configuration:

```yaml
- schema: public
  name: users
  apply_for_inherited: true
  transformers:
    - name: RandomUuid
      apply_for_references: true
      params:
        column: "user_id"
        engine: "hash"
```

This will apply the `RandomUuid` transformation to the `user_id` column in the `orders` table automatically.
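
For comparison, the following sketch shows roughly what you would otherwise have to write by hand: the same transformer
repeated on the referencing column. It assumes that the `hash` engine produces the same output for the same input and
parameters in both tables, which is the property `apply_for_references` relies on; treat it as an illustration rather
than an exact equivalent of what Greenmask generates internally.

```yaml
- schema: public
  name: users
  transformers:
    - name: RandomUuid
      params:
        column: "user_id"
        engine: "hash"

# Repeated manually for the referencing table (not needed with apply_for_references).
- schema: public
  name: orders
  transformers:
    - name: RandomUuid
      params:
        column: "user_id"
        engine: "hash"
```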

## Example 3. References on tables with end-to-end identifiers

In this example, we have three tables: `tablea`, `tableb`, and `tablec`. All tables have a composite primary key.
In the tables `tableb` and `tablec`, the primary key is also a foreign key that references the primary key of `tablea`.
This means that all PKs are end-to-end identifiers.

```sql
CREATE TABLE tablea
(
    id1  INT,
    id2  INT,
    data VARCHAR(50),
    PRIMARY KEY (id1, id2)
);
CREATE TABLE tableb
(
    id1    INT,
    id2    INT,
    detail VARCHAR(50),
    PRIMARY KEY (id1, id2),
    FOREIGN KEY (id1, id2) REFERENCES tablea (id1, id2) ON DELETE CASCADE
);
CREATE TABLE tablec
(
    id1         INT,
    id2         INT,
    description VARCHAR(50),
    PRIMARY KEY (id1, id2),
    FOREIGN KEY (id1, id2) REFERENCES tableb (id1, id2) ON DELETE CASCADE
);
INSERT INTO tablea (id1, id2, data)
VALUES (1, 1, 'Data A1'),
       (2, 1, 'Data A2'),
       (3, 1, 'Data A3');
INSERT INTO tableb (id1, id2, detail)
VALUES (1, 1, 'Detail B1'),
       (2, 1, 'Detail B2'),
       (3, 1, 'Detail B3');
INSERT INTO tablec (id1, id2, description)
VALUES (1, 1, 'Description C1'),
       (2, 1, 'Description C2'),
       (3, 1, 'Description C3');
```

To transform the `id1` and `id2` primary key columns in `tablea` and propagate the transformation along the chain of
references, you can use the following configuration:

```yaml
- schema: public
  name: "tablea"
  apply_for_inherited: true
  transformers:
    - name: RandomInt
      apply_for_references: true
      params:
        min: 0
        max: 100
        column: "id1"
        engine: "hash"
    - name: RandomInt
      apply_for_references: true
      params:
        min: 0
        max: 100
        column: "id2"
        engine: "hash"
```

This will apply the `RandomInt` transformation to the `id1` and `id2` columns in `tableb` and `tablec` automatically.