Merge pull request #216 from GreenmaskIO/docs/revised_readmemd
Docs: Revised README.md
wwoytenko authored Oct 13, 2024
2 parents 361af15 + afe1315 commit 4e91518
Showing 9 changed files with 83 additions and 92 deletions.
145 changes: 68 additions & 77 deletions README.md
@@ -1,47 +1,76 @@
# Greenmask - dump obfuscation tool
# [Greenmask](https://greenmask.io)

## Preface
## Dump anonymization and synthetic data generation tool

**Greenmask** is a powerful open-source utility that is designed for logical database backup dumping,
obfuscation, and restoration. It offers extensive functionality for backup, anonymization, and data masking. Greenmask
is written entirely in pure Go and includes ported PostgreSQL libraries, making it platform-independent. This tool is
stateless and does not require any changes to your database schema. It is designed to be highly customizable and
backward-compatible with existing PostgreSQL utilities.
anonymization, synthetic data generation and restoration. It has ported PostgreSQL libraries, making it reliable.
It is stateless and does not require any changes to your database schema. It is designed to be highly customizable and
backward-compatible with existing PostgreSQL utilities, fast and reliable.

![Build status](https://github.com/greenmaskio/greenmask/workflows/ci/badge.svg)
[![License](https://img.shields.io/github/license/greenmaskio/greenmask)](https://github.com/greenmaskio/greenmask/blob/main/LICENSE)
![GitHub Release](https://img.shields.io/github/v/release/greenmaskio/greenmask)
![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/greenmaskio/greenmask/total)
[![Docker pulls](https://img.shields.io/docker/pulls/greenmask/greenmask)](https://hub.docker.com/r/greenmask/greenmask)
[![Go Report Card](https://goreportcard.com/badge/github.com/greenmaskio/greenmask)](https://goreportcard.com/report/github.com/greenmaskio/greenmask)

![schema.png](docs/assets/schema.png)

# Features

* **Database subset** - Dumps only the necessary data consistently based on the subset condition, reducing the size
of the dump and speeding up the restoration process.
* **Deterministic transformers** — deterministic approach to data transformation based on the hash
* **[Deterministic transformers](https://greenmask.io/latest/built_in_transformers/transformation_engines/#hash-engine)**
— deterministic approach to data transformation based on the hash
functions. This ensures that the same input data will always produce the same output data. Almost each transformer
supports either `random` or `hash` engine making it universal for any use case.
* **Dynamic parameters** — almost each transformer supports dynamic parameters, allowing to parametrize the
* **[Dynamic parameters](https://greenmask.io/latest/built_in_transformers/dynamic_parameters/)** — almost each
transformer supports dynamic parameters, allowing to parametrize the
transformer dynamically from the table column value. This is helpful for resolving the functional dependencies
between columns and satisfying the constraints.
* **[Transformation validation and easy maintainable](https://greenmask.io/latest/commands/validate/)** - During
configuration process, Greenmask provides validation
warnings, data transformation diff and schema diff features, allowing you to monitor and maintain transformations
effectively
throughout the software lifecycle. Schema diff helps to avoid data leakage when schema changed.
* **[Partitioned tables transformation inheritance](https://greenmask.io/latest/configuration/?h=partition#dump-section)**
— Define transformation configurations once and apply them to all
partitions within partitioned tables (using `apply_for_inherited` parameter), simplifying the anonymization process.
* **Stateless** - Greenmask operates as a logical dump and does not impact your existing database schema.
* **Cross-platform** - Can be easily built and executed on any platform, thanks to its Go-based architecture,
which eliminates platform dependencies.
* **Database type safe** - Ensures data integrity by validating data and utilizing the database driver for
encoding and decoding operations. This approach guarantees the preservation of data formats.
* **Transformation validation and easy maintainable** - During obfuscation development, Greenmask provides validation
warnings and a transformation diff feature, allowing you to monitor and maintain transformations effectively
throughout the software lifecycle.
* **Partitioned tables transformation inheritance** - Define transformation configurations once and apply them to all
partitions within partitioned tables, simplifying the obfuscation process.
* **Stateless** - Greenmask operates as a logical dump and does not impact your existing database schema.
* **Backward compatible** - It fully supports the same features and protocols as existing vanilla PostgreSQL utilities.
Dumps created by Greenmask can be successfully restored using the pg_restore utility.
* **Extensible** - Users have the flexibility to implement domain-based transformations in any programming language or
use predefined templates.
* **Declarative** - Greenmask allows you to define configurations in a structured, easily parsed, and recognizable
format.
* **Integrable** - Integrate Greenmask seamlessly into your CI/CD system for automated database obfuscation and
* **Extensible** - Users have the flexibility
to [implement domain-based transformations](https://greenmask.io/latest/built_in_transformers/standard_transformers/cmd/)
in any programming language or
use [predefined templates](https://greenmask.io/latest/built_in_transformers/advanced_transformers/).
* **Integrable** - Integrate seamlessly into your CI/CD system for automated database anonymization and
restoration.
* **Parallel execution** - Take advantage of parallel dumping and restoration, significantly reducing the time required
to deliver results.
* **Provide variety of storages** - Greenmask offers a variety of storage options for local and remote data storage,
* **Provide variety of storages** - offers a variety of storage options for local and remote data storage,
including directories and S3-like storage solutions.
* **Pgzip support for faster compression** — by setting `--pgzip`, greenmask can speeds up the dump and restoration
processes through parallel compression.
* **[Pgzip support for faster compression](https://greenmask.io/latest/commands/dump/?h=pgzip#pgzip-compression)** — by
setting `--pgzip`, it can speed up the dump and restoration
processes through parallel compression.
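
The difference between the `hash` and `random` engines mentioned in the feature list can be illustrated with a minimal Python sketch. This is not Greenmask's implementation: the helper names, the use of HMAC-SHA256, and the way the salt is applied are assumptions chosen only to show why a keyed hash yields the same output for the same input.

```python
import hashlib
import hmac
import random


def hash_engine_int(value: str, salt: bytes, low: int, high: int) -> int:
    """Hypothetical helper: map `value` to an integer in [low, high] deterministically."""
    # A keyed hash of the original value replaces the random source,
    # so identical inputs (with the same salt) give identical outputs.
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).digest()
    span = high - low + 1
    return low + int.from_bytes(digest[:8], "big") % span


def random_engine_int(low: int, high: int) -> int:
    """Hypothetical helper: the non-deterministic counterpart."""
    return random.randint(low, high)
```

With the deterministic variant, a value that appears in several tables is transformed consistently everywhere, which is what keeps references between tables intact after anonymization.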

## Getting started

Greenmask has a [Playground](https://greenmask.io/latest/playground/), a sandbox environment in Docker with
sample databases included, so you can try Greenmask without any additional setup.

1. Clone the `greenmask` repository and navigate to its directory by running the following commands:

```shell
git clone [email protected]:GreenmaskIO/greenmask.git && cd greenmask
```

2. Once you have cloned the repository, start the environment by running Docker Compose:

```shell
docker-compose run greenmask
```

## Use Cases

@@ -55,13 +84,6 @@ Greenmask is ideal for various scenarios, including:
a pre-production environment with consistently anonymized data, facilitating faster time-to-market in the development
lifecycle.

## Our purpose

The Greenmask utility plays a central role in the Greenmask ecosystem. Our goal is to develop a comprehensive, UI-based
solution for managing obfuscation procedures. We recognize the challenges of maintaining obfuscation consistency
throughout the software lifecycle. Greenmask is dedicated to providing valuable tools and features that ensure the
obfuscation process remains fresh, predictable, and transparent.

### General Information

It is evident that the most appropriate approach for executing logical backup dumping and restoration is by leveraging
@@ -70,74 +92,43 @@ align with PostgreSQL's native utilities, ensuring compatibility. Greenmask prim
operations independently and delegates the responsibilities of schema dumping and restoration to pg_dump and pg_restore,
maintaining seamless integration with PostgreSQL's standard tools.

### Backup Process
#### Backup and Process

The process of backing up PostgreSQL databases is divided into three distinct sections:

* **Pre-data** - This section encompasses the raw schema of tables, excluding primary keys (PK) and foreign keys (FK).
* **Data** - The data section contains the actual table data in COPY format, including information about sequence
current
values and Large Objects data.
* **Post-data** - In this section, you'll find the definitions of indexes, triggers, rules, and constraints (such as PK
and
FK).

Greenmask focuses exclusively on the data section during runtime. It delegates the handling of the _pre-data_ and
_post-data_ sections to the core PostgreSQL utilities, _pg_dump_ and _pg_restore_.

Greenmask employs the **directory format** of _pg_dump_ and _pg_restore_. This format is particularly suitable for
Greenmask uses the **directory format** of _pg_dump_ and _pg_restore_. This format is particularly suitable for
parallel execution and partial restoration, and it includes clear metadata files that aid in determining the backup and
restoration steps. Greenmask has been optimized to work seamlessly with remote storage systems and obfuscation
restoration steps. Greenmask has been optimized to work seamlessly with remote storage systems and anonymization
procedures.

When performing data dumping, Greenmask utilizes the COPY command in TEXT format, maintaining reliability and
compatibility with the vanilla PostgreSQL utilities.

Additionally, Greenmask supports parallel execution, significantly reducing the time required for the dumping process.

## Storage Options

The core PostgreSQL utilities, _pg_dump_ and _pg_restore_, traditionally operate with files in a directory format,
offering no alternative methods. To meet **modern backup requirements** and provide flexible approaches,
Greenmask introduces the concept of **Storages**.
#### Storage Options

* **s3** - This option supports any S3-like storage system, including AWS S3, making it versatile and adaptable to
various cloud-based storage solutions.
* **directory** - This is the standard choice, representing the ordinary filesystem directory for local storage.

## Restoration Process

In the restoration process, Greenmask combines the capabilities of different tools:

* **Schema Restoration** - Greenmask utilizes _pg_restore_ to restore the database schema. This ensures that the schema
is accurately reconstructed.
* **Data Restoration** - For data restoration, Greenmask independently applies the data using the COPY protocol.
This allows Greenmask to handle the data efficiently, especially when working with various storage solutions.
Greenmask is aware of the restoration metadata, which enables it to download only the necessary data. This feature
is particularly useful for partial restoration scenarios, such as restoring a single table from a complete backup.

Greenmask also **supports parallel restoration**, which can significantly reduce the time required to complete the
restoration process. This parallel execution enhances the efficiency of restoring large datasets.

## Data Obfuscation and Validation
## Data Anonymization and Validation

Greenmask works with **COPY lines**, collects schema metadata using the Golang driver, and employs this driver in the
encoding and decoding process. The **validate command** offers a way to assess the impact on both schema
(**validation warnings**) and data (**transformation and displaying differences**). This command allows you to validate
the schema and data transformations, ensuring the desired outcomes during the obfuscation process.
the schema and data transformations, ensuring the desired outcomes during the anonymization process.

## Customization

If your table schema relies on functional dependencies between columns, you can address this challenge using the
**TemplateRecord** transformer. This transformer enables you to define transformation logic for entire tables,
offering type-safe operations when assigning new values.
[Dynamic parameters](https://greenmask.io/latest/built_in_transformers/dynamic_parameters/). By setting dynamic
parameters, you can resolve cases such as `created_at` and `updated_at`, where
`updated_at` must be greater than or equal to `created_at`.
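
The `created_at`/`updated_at` dependency can be sketched in Python as follows. The function name and parameters are hypothetical and do not reflect Greenmask's configuration syntax; the point is only that the lower bound of the generated value is taken from another column of the same record instead of being a fixed constant.

```python
import random
from datetime import datetime, timedelta


def random_updated_at(created_at: datetime, max_value: datetime, rng: random.Random) -> datetime:
    """Hypothetical helper: pick updated_at in [created_at, max_value]."""
    if max_value < created_at:
        raise ValueError("max_value must not be earlier than created_at")
    # The lower bound comes from the record itself (the "dynamic" part).
    span_seconds = int((max_value - created_at).total_seconds())
    return created_at + timedelta(seconds=rng.randint(0, span_seconds))
```

Because the bound is read per record, the constraint `updated_at >= created_at` holds for every transformed row.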

If you need to implement custom logic imperatively use
[TemplateRecord](https://greenmask.io/latest/built_in_transformers/advanced_transformers/template_record/) or
[Template](https://greenmask.io/latest/built_in_transformers/advanced_transformers/template/) transformers.

Greenmask provides a framework for creating your custom transformers, which can be reused efficiently. These
transformers can be seamlessly integrated without requiring recompilation, thanks to the PIPE (stdin/stdout)
interaction.
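
A minimal sketch of a stdin/stdout transformer, written in Python, is shown below. The wire format used here (one plain-text value per line) is an assumption made for illustration only; the format Greenmask actually exchanges with external commands is described in its documentation and may differ. `run_pipe_transformer` and `mask_email` are hypothetical names.

```python
import io
from typing import Callable, TextIO


def run_pipe_transformer(src: TextIO, dst: TextIO, transform: Callable[[str], str]) -> None:
    """Read one value per line from src, write the transformed value to dst."""
    for line in src:
        value = line.rstrip("\n")
        dst.write(transform(value) + "\n")
        dst.flush()  # the parent process waits for each answer, so do not buffer


def mask_email(value: str) -> str:
    """Replace the local part of an e-mail address with 'x' characters."""
    local, sep, domain = value.partition("@")
    if not sep:
        return value  # not an e-mail address, pass through unchanged
    return "x" * len(local) + "@" + domain


# In-memory demonstration; a real external transformer would instead call
# run_pipe_transformer(sys.stdin, sys.stdout, mask_email).
demo_out = io.StringIO()
run_pipe_transformer(io.StringIO("bob@example.com\n"), demo_out, mask_email)
print(demo_out.getvalue(), end="")  # prints: xxx@example.com
```

Since the transformer is a separate process that only talks over pipes, it can be written in any language and swapped without rebuilding the main tool.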

Furthermore, Greenmask's architecture is designed to be highly extensible, making it possible to introduce other
interaction protocols, such as HTTP or Socket, for conducting obfuscation procedures.
interaction protocols, such as HTTP or Socket, for conducting anonymization procedures.
## PostgreSQL Version Compatibility
@@ -152,7 +143,7 @@ interaction protocols, such as HTTP or Socket, for conducting obfuscation proced
## Links
* [Documentation](https://greenmask.io)
* [Documentation](https://docs.greenmask.io)
* Email: **[email protected]**
* [Twitter](https://twitter.com/GreenmaskIO)
* [Telegram](https://t.me/greenmask_community)
8 changes: 4 additions & 4 deletions docs/architecture.md
@@ -15,7 +15,7 @@ The process of backing up PostgreSQL databases is divided into three distinct se
Greenmask focuses exclusively on the data section during runtime. It delegates the handling of the `pre-data` and `post-data` sections to the core PostgreSQL utilities, `pg_dump` and `pg_restore`.

Greenmask employs the directory format of `pg_dump` and `pg_restore`. This format is particularly suitable for
parallel execution and partial restoration, and it includes clear metadata files that aid in determining the backup and restoration steps. Greenmask has been optimized to work seamlessly with remote storage systems and obfuscation procedures.
parallel execution and partial restoration, and it includes clear metadata files that aid in determining the backup and restoration steps. Greenmask has been optimized to work seamlessly with remote storage systems and anonymization procedures.

When performing data dumping, Greenmask utilizes the COPY command in TEXT format, maintaining reliability and
compatibility with the vanilla PostgreSQL utilities.
@@ -39,10 +39,10 @@ In the restoration process, Greenmask combines the capabilities of different too

Greenmask also supports **parallel restoration**, which can significantly reduce the time required to complete the restoration process. This parallel execution enhances the efficiency of restoring large datasets.

## Data obfuscation and validation
## Data anonymization and validation

Greenmask works with COPY lines, collects schema metadata using the Golang driver, and employs this driver in the encoding and decoding process. The **validate command** offers a way to assess the impact on both schema
(**validation warnings**) and data (**transformation and displaying differences**). This command allows you to validate the schema and data transformations, ensuring the desired outcomes during the obfuscation process.
(**validation warnings**) and data (**transformation and displaying differences**). This command allows you to validate the schema and data transformations, ensuring the desired outcomes during the anonymization process.

## Customization

@@ -53,7 +53,7 @@ transformers can be seamlessly integrated without requiring recompilation, thank
interaction.

!!! note
Furthermore, Greenmask's architecture is designed to be highly extensible, making it possible to introduce other interaction protocols, such as HTTP or Socket, for conducting obfuscation procedures.
Furthermore, Greenmask's architecture is designed to be highly extensible, making it possible to introduce other interaction protocols, such as HTTP or Socket, for conducting anonymization procedures.

## PostgreSQL version compatibility

Binary file added docs/assets/schema.png
2 changes: 1 addition & 1 deletion docs/built_in_transformers/advanced_transformers/index.md
@@ -1,6 +1,6 @@
# Advanced transformers

Advanced transformers are modifiable obfuscation methods that users can adjust based on their needs by using [custom functions](custom_functions/index.md).
Advanced transformers are modifiable anonymization methods that users can adjust based on their needs by using [custom functions](custom_functions/index.md).

Below you can find an index of all advanced transformers currently available in Greenmask.

2 changes: 1 addition & 1 deletion docs/built_in_transformers/index.md
@@ -1,6 +1,6 @@
# About transformers

Transformers in Greenmask are methods which are applied to obfuscate sensitive data. All Greenmask transformers are
Transformers in Greenmask are methods which are applied to anonymize sensitive data. All Greenmask transformers are
split into the following groups:

- [Transformation engines](transformation_engines.md) — the type of generator used in transformers. Hash (deterministic)
2 changes: 1 addition & 1 deletion docs/built_in_transformers/standard_transformers/hash.md
@@ -6,7 +6,7 @@ Generate a hash of the text value using the `Scrypt` hash function under the hoo
|------------|---------------------------------------------------------------------------------------------------------------------------------------|---------|----------|--------------------|
| column | The name of the column to be affected | | Yes | text, varchar |
| salt | Hex encoded salt string. This value may be provided via environment variable `GREENMASK_GLOBAL_SALT` | | Yes | text, varchar |
| function | Hash algorithm to obfuscate data. Can be any of `md5`, `sha1`, `sha256`, `sha512`, `sha3-224`, `sha3-254`, `sha3-384`, `sha3-512`. | `sha1` | No | - |
| function | Hash algorithm to anonymize data. Can be any of `md5`, `sha1`, `sha256`, `sha512`, `sha3-224`, `sha3-254`, `sha3-384`, `sha3-512`. | `sha1` | No | - |
| max_length | Indicates whether to truncate the hash tail and specifies at what length. Can be any integer number, where `0` means "no truncation". | `0` | No | - |
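
The interplay of the `salt`, `function`, and `max_length` parameters from the table can be sketched in Python. This is an illustration of the parameter semantics only, not the transformer's internal algorithm: prepending the salt to the value and truncating the hex digest are assumptions, and `hash_value` is a hypothetical name.

```python
import hashlib


def hash_value(value: str, salt_hex: str, function: str = "sha1", max_length: int = 0) -> str:
    """Hypothetical helper: salted hash of `value`, optionally truncated."""
    salt = bytes.fromhex(salt_hex)  # the salt is supplied hex encoded
    algo = function.replace("-", "_")  # hashlib spells "sha3-224" as "sha3_224"
    digest = hashlib.new(algo, salt + value.encode("utf-8")).hexdigest()
    # max_length == 0 means "no truncation"; otherwise keep only the head.
    return digest if max_length == 0 else digest[:max_length]
```

For example, `hash_value("Engineer", "a1b2", "sha256", 10)` returns the first 10 hex characters of the salted digest, and the same inputs always give the same result.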

## Example: Generate hash from job title
2 changes: 1 addition & 1 deletion docs/database_subset.md
@@ -25,7 +25,7 @@ subset_conds:
## Use cases
* Database scale down - create obfuscated dump but for the limited and consistent set of tables
* Database scale down - create anonymized dump but for the limited and consistent set of tables
* Data migration - migrate only some records from one database to another
* Data anonymization - dump and anonymize only a specific records in the database
* Database catchup - catchup your another instance of database logically by adding a new records. In this case it