From 02df98a84e6c8ff96ad0aae8ddeee859f3b93639 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Matou=C5=A1=20Bor=C3=A1k?= Date: Wed, 12 Jun 2024 09:22:37 +0200 Subject: [PATCH] Add more examples --- doc/workflow_steps.md | 185 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 179 insertions(+), 6 deletions(-) diff --git a/doc/workflow_steps.md b/doc/workflow_steps.md index c5d5489..fdad115 100644 --- a/doc/workflow_steps.md +++ b/doc/workflow_steps.md @@ -44,12 +44,76 @@ The grouping has two goals: - `under_keys`: notifies DumpCleaner that the cleanup data is not a list but actually a hash of multiple lists and that the grouping should be done only in lists under the specified keys of the data hash. This is useful in cases when the cleanup data needs to hold multiple unrelated lists of values. +#### Examples: + + + + + + + + + + + + + + + +
configuration | input data | output data
+ +```yaml +- step: GroupByBytesize +``` + + +``` +["newspaper", "show", "rest", "résumé"] +``` + + +``` +{ + "9-9" => ["newspaper"], + "4-4" => ["show", "rest"], + "6-8" => ["résumé"] +} +``` +
+ +```yaml +- step: GroupByBytesize + params: + under_keys: + - words +``` + + +``` +{ + "words" => ["newspaper", "show", "rest", "résumé"], + "domains" => ["gmail.com", "example.com"] +} + +``` + + +``` +{ + "words" => { + "9-9" => ["newspaper"], + "4-4" => ["show", "rest"], + "6-8" => ["résumé"] + }, + "domains" => ["gmail.com", "example.com"] +} +``` +
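The grouping shown in the GroupByBytesize examples can be approximated in a few lines of Ruby. This is only a sketch, not DumpCleaner's actual implementation: the method names are hypothetical, and the `"<length>-<bytesize>"` key format is inferred from the example output (`"résumé"` has 6 characters and 8 bytes in UTF-8, hence `"6-8"`).

```ruby
# Hypothetical helper, not part of DumpCleaner's API.
# Groups a list of strings under "<character length>-<byte size>" keys
# (key format inferred from the examples above).
def group_list_by_bytesize(values)
  values.group_by { |value| "#{value.length}-#{value.bytesize}" }
end

# Hypothetical stand-in for the step. Without `under_keys` the data is
# treated as a plain list; with `under_keys` only the lists stored under
# those keys of the data hash are grouped and all other entries are kept as-is.
def group_by_bytesize(data, under_keys: [])
  return group_list_by_bytesize(data) if under_keys.empty?

  grouped = data.slice(*under_keys).transform_values do |list|
    group_list_by_bytesize(list)
  end
  data.merge(grouped)
end
```

Calling `group_by_bytesize(data, under_keys: ["words"])` on the hash from the second example would leave the `"domains"` list untouched, which is the behaviour the `under_keys` parameter description asks for.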
+ ### [InspectContext](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/data_source_steps/inspect_context.rb) This is purely a debugging step that makes DumpCleaner print the current step context. The step context includes: -- `orig_value`: original value taken from the record field -- `current_value`: i.e. the running state of the result value in the current workflow - `type`: the [cleanup type](/README.md#cleanup_types) that this step is working with - `record`: the record context taken from the current record (see the `record_context_columns` option above) - `cleanup_data`: the data available for the step (only a subset of all data is shown here) @@ -66,9 +130,110 @@ Care should be taken when loading string data taken from various dictionaries. T - `file`: specifies the path to the YAML file; this is a mandatory parameter. - `under_key`: optionally makes the step put the loaded data into a hash under the specified key instead of returning the loaded data itself. This is useful for grabbing multiple value lists from different dictionary files. +#### Examples: + + + + + + + + + + + + + + + + + + + + +
configuration | input data | output data
+ +```yaml +- step: LoadYamlFile + params: + file: some_file.yml +``` + +```yaml +# some_file.yml: +- words +- to +- load +``` + + +`nil` (or just anything) + + +``` +["words", "to", "load"] +``` +
+ +```yaml +- step: LoadYamlFile + params: + file: dictionary.yml + under_key: words +``` + +```yaml +# dictionary.yml: +- words +- to +- load +``` + + +`nil` (or just anything) + + +``` +{ + "words" => ["words", "to", "load"] +} +``` +
+ +```yaml +- step: LoadYamlFile + params: + file: dictionary.yml + under_key: words +``` + +```yaml +# dictionary.yml: +- words +- to +- load +``` + + +``` +{ + "existing_key" => ["some", "other", "words"] +} +``` + + +``` +{ + "existing_key" => ["some", "other", "words"], + "words" => ["words", "to", "load"] +} +``` +
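The three LoadYamlFile examples above can be mimicked with a short Ruby sketch. It is an illustration under assumptions, not the step's real code: `load_yaml_file` is a hypothetical method name, and the handling of non-hash input when `under_key` is given (the input is simply discarded) is a guess based on the "`nil` (or just anything)" rows.

```ruby
require "yaml"

# Hypothetical stand-in for the LoadYamlFile step, not DumpCleaner's API.
# Without `under_key` the loaded data replaces the incoming data; with
# `under_key` it is added to the incoming hash under that key.
def load_yaml_file(data, file:, under_key: nil)
  loaded = YAML.load_file(file)
  return loaded unless under_key

  # Assumption: anything that is not a hash is treated as "no data yet".
  base = data.is_a?(Hash) ? data : {}
  base.merge(under_key => loaded)
end
```

With this sketch, chaining two calls with different `file` and `under_key` values would collect several dictionaries into one hash, which is the use case the `under_key` parameter is described for.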
+ + ### [RemoveAccents](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/data_source_steps/remove_accents.rb) -This step uses the [`unicode_normalize`](https://ruby-doc.org/stdlib-2.4.0/libdoc/unicode_normalize/rdoc/String.html) method to remove all accents from all values in the cleanup data, i.e., for example, ”naïve“ will be converted to ”naive“. This can be useful when we want to use the same YAML file to build a generic random words dictionary as well as a dictionary of logins or domains (which should have no accented characters in them). +This step uses the [`unicode_normalize`](https://ruby-doc.org/stdlib-2.4.0/libdoc/unicode_normalize/rdoc/String.html) method to remove all accents from all values in the cleanup data, i.e., for example, ”naïve“ will be converted to ”naive“. This can be useful when we want to use the same YAML file to build a generic random words dictionary as well as a dictionary of logins, domains or other words that should have no accented characters in them. #### Params: @@ -104,8 +269,9 @@ This step uses the [`unicode_normalize`](https://ruby-doc.org/stdlib-2.4.0/libdo ```yaml - step: RemoveAccents - under_keys: - - accounts + params: + under_keys: + - accounts ``` @@ -169,7 +335,14 @@ This step replaces the current value using a generated a random string with the ### [InspectContext](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/inspect_context.rb) -This step serves the exact same purpose as the [InspectContext](#inspectcontext) step in the ”data source“ workflow. +This is purely a debugging step that makes DumpCleaner print the current step context. The step context includes: + +- `orig_value`: original value taken from the table record field +- `current_value`: i.e. 
the running state of the result value in the current workflow +- `type`: the [cleanup type](/README.md#cleanup_types) that this step is working with +- `record`: the record context taken from the current record (see the `record_context_columns` option above) +- `cleanup_data`: the data available for the step (only a subset of all data is shown here) +- `repetition`: the current iteration in the uniqueness loop. ### [RandomizeEmail](https://github.com/NejRemeslnici/dump-cleaner/blob/main/lib/dump_cleaner/cleanup/cleaning_steps/randomize_email.rb)