diff --git a/.gitignore b/.gitignore index 5729aa4..0aa8bc2 100644 --- a/.gitignore +++ b/.gitignore @@ -32,3 +32,4 @@ _training-models/ temp-files/ vocab.txt dependencies.csv +site/ \ No newline at end of file diff --git a/README.md b/README.md index ccf7ffe..c3c742b 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,8 @@ Philter is built upon the open source PII and PHI redaction engine [Phileas](htt Philter was released as open source under the Apache License, version 2.0, in July 2024 for version 2.6.0, but Philter dates back to 2019. See the [Release Notes](https://github.com/philterd/philter/blob/main/RELEASE_NOTES.md) for a description of past versions. +For Philter's User Guide please see https://philterd.github.io/philter/. + ## Philter on the Cloud Marketplaces Philter is available on the cloud marketplaces as a turnkey redaction solution. These cloud images are pre-configured and ready to be used immediately after launch. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..cb48207 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,23 @@ +# Phileas Documentation + +The documentation files here are markdown files used by MkDocs. 
+ +To build the documentation website, first install MkDocs: + +``` +python3 -m venv venv +source ./venv/bin/activate +python3 -m pip install -r requirements.txt +``` + +Next, build the site: + +``` +mkdocs build +``` + +To serve the documentation website while editing: + +``` +mkdocs serve +``` diff --git a/docs/docs/deidentification/bucketing.md b/docs/docs/deidentification/bucketing.md new file mode 100644 index 0000000..e074e7b --- /dev/null +++ b/docs/docs/deidentification/bucketing.md @@ -0,0 +1,2 @@ +# Bucketing + diff --git a/docs/docs/deidentification/date-shifting.md b/docs/docs/deidentification/date-shifting.md new file mode 100644 index 0000000..5facb6c --- /dev/null +++ b/docs/docs/deidentification/date-shifting.md @@ -0,0 +1,2 @@ +# Date Shifting + diff --git a/docs/docs/deidentification/de-identification_README.md b/docs/docs/deidentification/de-identification_README.md new file mode 100644 index 0000000..8fe10df --- /dev/null +++ b/docs/docs/deidentification/de-identification_README.md @@ -0,0 +1,13 @@ +# De-identification Methods + +There are several ways data can be de-identified, and which you use depends on the types of data you want to de-identify and your use-case for de-identifying the data. The terminology around the different methods is often used interchangeably, but there are differences between each method. + +> In this User's Guide, we may use the terms `filter` and `redact` interchangeably. + +In Philter, de-identification methods vary for each type of sensitive information. For example, all types can be replaced or redacted, but only dates can be shifted and only zip codes can be truncated. How a de-identification method is applied by Philter is called a filter strategy. Each type of sensitive information can have one or more filter strategies, and the combination of the filter strategies you select is called a policy. A policy determines how a document will be de-identified. 
+ +The following is a list of de-identification methods that describes how each method works and its applicability to our [Philter](https://philterd.ai/philter/) software. De-identifying a document is likely to require a combination of the following methods. For instance, you may want to redact names, encrypt credit card numbers, and shift appointment dates. + +
| De-identification Method | Description |
+| ------------------------ | ----------- |
+| Replacement | Replaces sensitive information with a defined value. For example, you might want to replace a credit card number with the literal value "CREDIT_CARD_NUMBER". |
+| Redaction and Masking | Removes sensitive information. Our Philter software gives you a choice of how to remove the sensitive information, whether it is by replacing it with `*****` (masking) or by some other set of characters. |
+| Encryption | Encrypts sensitive information. |
+| Date Shifting | Shifts dates either forward or backward by some interval. |
+| Bucketing | Categorizes data into buckets based on the data. For example, Philter can bucket dates into years and zip codes by population. |
+ +> A difference between [Philter](https://philterd.ai/philter/) and other services is that Philter does not send your data to a third party for de-identification. Philter runs in your cloud and your data stays in your cloud. \ No newline at end of file diff --git a/docs/docs/deidentification/encryption.md b/docs/docs/deidentification/encryption.md new file mode 100644 index 0000000..ec5b8cf --- /dev/null +++ b/docs/docs/deidentification/encryption.md @@ -0,0 +1,2 @@ +# Encryption + diff --git a/docs/docs/deidentification/redaction-and-masking.md b/docs/docs/deidentification/redaction-and-masking.md new file mode 100644 index 0000000..8c0fd5c --- /dev/null +++ b/docs/docs/deidentification/redaction-and-masking.md @@ -0,0 +1,5 @@ +# Redaction and Masking + +Redaction and masking are two methods of de-identification whose names are often used interchangeably. The term redaction refers to removing a sensitive value from a document. When we hear the term redaction we often think of an image of a document with black bars across pieces of the text. + +Masking is similar to redaction but allows for configuring how the sensitive value is removed. The most common example is using asterisks (e.g. \*\*\*\*\*\*) in place of a sensitive value. diff --git a/docs/docs/deidentification/replacement.md b/docs/docs/deidentification/replacement.md new file mode 100644 index 0000000..55b9819 --- /dev/null +++ b/docs/docs/deidentification/replacement.md @@ -0,0 +1,5 @@ +# Replacement + +Replacement is a method of de-identification that simply replaces a sensitive value with another value. Replacement is useful when the sensitive value is not needed once the document has been de-identified. Philter can replace a sensitive value with a preset value or with a random value. + +In Philter's filter strategies, replacement is achieved by using the `REDACT`, `STATIC_REPLACE`, or `RANDOM_REPLACE` strategy.
diff --git a/docs/docs/evaluating-performance.md b/docs/docs/evaluating-performance.md new file mode 100644 index 0000000..c4566df --- /dev/null +++ b/docs/docs/evaluating-performance.md @@ -0,0 +1,160 @@ +# How to Evaluate Phileas' Performance + +A common question we receive is how well Phileas performs. Our answer to this question is probably less than satisfactory: it depends. What does it depend on? Phileas' performance is heavily dependent upon your individual data. Comparing metrics of Phileas' performance across different customer datasets is like comparing apples and oranges. + +If your data is not exactly like another customer's data then the metrics will not be applicable to your data. In terms of the classic information retrieval metrics precision and recall, comparing these values between customers can give false impressions about Phileas' performance, both good and bad. + +> This guide walks you through how to evaluate Phileas' performance. If you are just getting started with Phileas please see the Quick Starts instead. Then you can come back here to learn how to evaluate Phileas' performance. + +## Guide to Evaluating Performance + +We have created this guide to help you evaluate Phileas' performance on your data. The guide involves determining the types of sensitive information you want to redact, configuring those filters, optimizing the configuration, and then capturing the performance metrics. + +> If you are using Philter we will gladly perform these steps for you and provide you a detailed Phileas performance report generated from your data. Please contact us to start the process. + +#### What You Need + +To evaluate Phileas' performance you need: + +* An application using Phileas. +* A list of the types of sensitive information you want to redact. +* A data set representative of the text you will be redacting using Phileas. 
It is important that the data set be representative so the evaluation results carry over to the data you will actually redact. +* The same data set but with annotated sensitive information. These annotations will be used to calculate the precision and recall metrics. + +#### Configuring Phileas + +Before we can begin our evaluation we need to create a policy. A [policy](policies_README.md) is a file that defines the types of sensitive information that will be redacted and how it will be redacted. The policies are stored on the Phileas instance under `/opt/Phileas/policies`. You can edit the policies directly there using a text editor or you can use Phileas' [API](policies-api.md) to upload a policy. In this case we recommend just using a text editor on the Phileas instance to create a policy. + +When using a text editor to create and edit a policy, be sure to save the policy often. Frequent saving can make editing a policy easier. + +We also recommend placing your policy directory under source control so you have a history and change log of your policies. + +#### Creating a Policy + +Make a copy of the default policy. We will modify the copy for our needs. + +`cp /opt/Phileas/policies/default.json /opt/Phileas/policies/evaluation.json` + +Now open `/opt/Phileas/policies/evaluation.json` in a text editor. (The content of `evaluation.json` will be similar to what's shown below but may have minor differences between different versions of Phileas.) + +``` +{ + "name": "default", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "phoneNumber": { + "phoneNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +The first thing we need to do is to set the name of the policy. Replace `default` with `evaluation` and save the file. 
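
The copy-and-rename step can also be scripted, which helps when you create several evaluation policies. The following is a minimal Python sketch and not part of Phileas: the helper name `copy_policy` is hypothetical, and the demonstration runs against a temporary directory instead of `/opt/Phileas/policies` so it can be tried safely.

```python
import json
import tempfile
from pathlib import Path


def copy_policy(source: Path, target: Path, new_name: str) -> dict:
    """Copy a policy file and set the copy's top-level `name` (hypothetical helper)."""
    policy = json.loads(source.read_text(encoding="utf-8"))
    policy["name"] = new_name
    target.write_text(json.dumps(policy, indent=2) + "\n", encoding="utf-8")
    return policy


# Demonstration in a throwaway directory; point the paths at your own
# policies directory when using this for real.
with tempfile.TemporaryDirectory() as tmp:
    policies_dir = Path(tmp)
    default_path = policies_dir / "default.json"
    default_path.write_text(
        json.dumps({"name": "default", "identifiers": {}}), encoding="utf-8"
    )

    evaluation = copy_policy(default_path, policies_dir / "evaluation.json", "evaluation")
    saved = json.loads((policies_dir / "evaluation.json").read_text(encoding="utf-8"))
```

Only the `name` changes; the filters in the copy stay identical to the source policy until you edit them.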
+ +#### Identifying the Filters You Need + +The rest of the file contains the filters that are enabled in the default policy. We need to make sure that each type of sensitive information that you want to redact is represented by a filter in this file. Look through the rest of the policy and determine which filters are listed that you do not need and also which filters you do need that are not listed. + +#### Disabling Filters We Do Not Need + +If a filter is listed in the policy and you do not need the filter you have two options. You can either delete those lines from the policy and save the file, or you can set the filter's `enabled` property to false. Using the `enabled` property allows you to keep the filter configuration in the policy in case it is needed later but both options have the same effect. + +#### Enabling Filters Not in the Default Policy + +Let's say you want to redact bitcoin addresses. The bitcoin address filter is not in the default policy. To add the bitcoin address filter we will refer to Phileas' documentation on the bitcoin address filter, get the configuration, and copy it into the policy. 
+ +From the [bitcoin address filter documentation](bitcoin-addresses.md) we see the configuration for the bitcoin address filter is: + +``` + "bitcoinAddress": { + "bitcoinAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } +``` + +We can copy this configuration and paste it into our policy: + +``` +{ + "name": "evaluation", + "identifiers": { + "bitcoinAddress": { + "bitcoinAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "phoneNumber": { + "phoneNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +The order of the filters in the policy does not matter and has no impact on performance. We typically place the filters in the policy alphabetically just to improve readability. + +Repeat these steps until you have added a filter for each of the types of sensitive information you want to redact. Typically, the default redaction `strategy` and `redactionFormat` values for each filter should be fine for evaluation. + +When finished modifying the policy, save the file and close the text editor. Now restart Phileas for the policy changes to be loaded: + +``` +sudo systemctl restart Phileas +``` + +#### Submitting Text for Redaction + +With our policy in place we can now send text to Phileas for redaction using that policy: + +``` +PhileasConfiguration phileasConfiguration = ConfigFactory.create(PhileasConfiguration.class); + +FilterService filterService = new PhileasFilterService(phileasConfiguration); + +FilterResponse response = filterService.filter(policies, context, documentId, body, MimeType.TEXT_PLAIN); +``` + +The `explain` API [endpoint](filtering-api.md#explain) produces a detailed description of the redaction. 
The response will include a list of spans that contain the start and stop positions of redacted text and the type of sensitive information that was redacted. Using this information we can compare the redacted information to our annotated file to calculate precision and recall metrics. + +#### Calculating Precision and Recall + +Now we can calculate the precision and recall metrics. + +* Precision is the number of true positives divided by the number of true positives plus false positives: `TP / (TP + FP)`. +* Recall is the number of true positives divided by the number of true positives plus false negatives: `TP / (TP + FN)`. + +![Calculating the precision and recall](Images/precision.png) + +* The F-1 score is the harmonic mean of precision and recall: `2 * precision * recall / (precision + recall)`. + +![Calculating the F-1 score](Images/f1.png) diff --git a/docs/docs/filter_policies/filter_policies.md b/docs/docs/filter_policies/filter_policies.md new file mode 100644 index 0000000..1974523 --- /dev/null +++ b/docs/docs/filter_policies/filter_policies.md @@ -0,0 +1,65 @@ +# Filter Policies + +The types of sensitive information identified by Phileas and how that information is de-identified are controlled through policies. A policy is a file stored under Phileas’s `policies` directory, which by default is located at `/opt/Phileas/policies/`. You can have an unlimited number of policies. + +Each policy has a `name` that is used by Phileas to apply the appropriate de-identification methods. The `name` is passed to Phileas’s [API](filtering-api.md) along with the text to be filtered when submitting text to Phileas. This provides flexibility and allows you to de-identify different types of documents in different ways with a single instance of Phileas. For example, you may have a policy for bankruptcy documents and a separate policy for financial documents. + +> There are [sample policies](sample_filter_policies.md) available for immediate use or customization to fit your use-cases. 
+ + +### The Structure of a Policy + +A policy: + +* Must have a `name` that uniquely identifies it. +* Must have a list of `identifiers` that are filters for sensitive information. + * Each `identifier` , or filter, can have zero or more [filter strategies](filter-strategies.md). A filter strategy tells Phileas how to manipulate that type of sensitive information when it is identified. +* Can have an optional list of [terms](ignore-lists.md) or [patterns](ignoring-patterns.md). +* Can have encryption keys to support [encryption](filter-strategies.md#fpe) of sensitive information. + +### An Example Policy + +The following is an example policy. In the example below you can see the [types of sensitive information](filters_README.md) that are enabled and the strategy for manipulating each type when found. This policy identifies email addresses and phone numbers and redacts each with the format given. + +``` +{ + "name": "email-and-phone-numbers", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "phoneNumber": { + "phoneNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +When an email address is identified by this policy, the email address is replaced with the text `{{{REDACTED-email-address}}}`. The `%t` gets replaced by the type of the filter. Likewise, when a phone number is found it is replaced with the text `{{{REDACTED-phone-number}}}`. You are free to change the redaction formats to whatever fits your use-case. See [Filter Strategies](filter-strategies.md) for all replacement options. + +The name of the policy is `email-and-phone-numbers`. Policies can be named anything you like but their names must be unique from all other policies. As a best practice, the policy should be saved as `[name].json`, e.g. `email-and-phone-numbers.json`. 
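
The naming rules above (unique names, and a file called `[name].json`) are easy to check mechanically. The following is a minimal Python sketch under stated assumptions: the helper `check_policy_names` is hypothetical and not part of Phileas, and it assumes each policy is a JSON file with a top-level `name`, as in the example policy.

```python
import json
import tempfile
from pathlib import Path


def check_policy_names(policies_dir: Path) -> list[str]:
    """Return a list of naming problems found among the policy files (hypothetical helper)."""
    problems = []
    seen = {}  # policy name -> file that first used it
    for path in sorted(policies_dir.glob("*.json")):
        name = json.loads(path.read_text(encoding="utf-8")).get("name")
        if name in seen:
            problems.append(f"{path.name}: name '{name}' already used by {seen[name]}")
        else:
            seen[name] = path.name
        if path.stem != name:
            problems.append(f"{path.name}: expected file name '{name}.json'")
    return problems


# Demonstration: two files that declare the same policy name.
with tempfile.TemporaryDirectory() as tmp:
    policies_dir = Path(tmp)
    policy = json.dumps({"name": "email-and-phone-numbers", "identifiers": {}})
    (policies_dir / "email-and-phone-numbers.json").write_text(policy, encoding="utf-8")
    (policies_dir / "copy.json").write_text(policy, encoding="utf-8")
    problems = check_policy_names(policies_dir)
    for problem in problems:
        print(problem)
```

A check like this could run before policies are deployed, for example as part of a review step if the policy directory is kept under source control.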
+ +### Applying a Policy to Text + +To use this policy we will save it as `/opt/Phileas/policies/email-and-phone-numbers.json`. We must restart Phileas for the new policy to be available for use. To apply the policy we will pass the policy's name to Phileas when making a filter request, as shown in the example request below. + +``` +curl -k -X POST "https://localhost:8080/api/filter?c=context&p=email-and-phone-numbers" \ + -d @file.txt -H "Content-Type: text/plain" +``` + +In this command, we have provided the parameter `p` along with a value that is the name of the policy we want to use for this request. If we had multiple policies in Phileas we could choose a different policy for this request simply by changing the name given to the parameter `p`. For more details see Phileas’s [API](filtering-api.md). + +Phileas will process the contents of `file.txt` by applying the policy named `email-and-phone-numbers`. As we saw in the policy above, this policy redacts email addresses and phone numbers. Phileas will return the redacted text in response to the API call. + +To manipulate the sensitive information by methods other than redaction, see the [Filter Strategies](filter-strategies.md). diff --git a/docs/docs/filter_policies/filter_strategies.md b/docs/docs/filter_policies/filter_strategies.md new file mode 100644 index 0000000..69052f8 --- /dev/null +++ b/docs/docs/filter_policies/filter_strategies.md @@ -0,0 +1,300 @@ +# Filter Strategies + +A filter strategy defines how sensitive information identified by Phileas should be manipulated, whether it is redacted, replaced, encrypted, or manipulated in some other fashion. + +In a policy, you list the types of sensitive information that should be filtered. How Phileas replaces each type of sensitive information is specific to each type. For instance, zip codes can be truncated based on the leading digits or zip code population while phone numbers are redacted. These replacements are performed by "filter strategies." 
+ +> Each filter can have one or more filter strategies, and conditions can be used to determine when to apply each filter strategy. + + +A sample policy containing a filter strategy is shown below. In this example, email addresses will be redacted. + +``` +{ + "name": "email-address", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +> Most filter strategies apply to all types of data; however, some filter strategies only apply to a few types. For example, the `TRUNCATE` filter strategy only applies to a zip code filter. + + +## Filter Strategies + +The filter strategies are described below. Each filter type can specify zero or more filter strategies. When no filter strategies are given, Phileas will default to `REDACT` for that filter type. When multiple filter strategies are given for a single filter type, the filter strategies will be applied in order as they are listed in the policy, top to bottom. + +* [`REDACT`](filter-strategies.md#the-redact-filter-strategy) +* [`CRYPTO_REPLACE`](filter-strategies.md#crypto) (AES encryption) +* [`HASH_SHA256_REPLACE`](filter-strategies.md#hash) (SHA-256 hashing) +* [`FPE_ENCRYPT_REPLACE`](filter-strategies.md#fpe) (Format-preserving encryption) +* [`RANDOM_REPLACE`](filter-strategies.md#random) +* [`STATIC_REPLACE`](filter-strategies.md#static) +* [`TRUNCATE`](filter-strategies.md#truncate) +* [`ZERO_LEADING`](filter-strategies.md#zero_leading) + +### The `REDACT` Filter Strategy + +The `REDACT` filter strategy replaces sensitive information with a given redaction format. You can put variables in the redaction format that Phileas will replace when performing the redaction. 
+ +The available redaction variables are: + +| Redaction Variable | Description | +| ------------------ | ----------- | +| `%t` | Will be replaced with the type of sensitive information. This is to allow you to know the type of sensitive information that was identified and redacted. | +| `%l` | Will be replaced by the given classification for the type of sensitive information. | +| `%v` | Will be replaced by the original value of the sensitive text. With `%v` you can annotate sensitive information instead of masking or removing it. | + +To redact sensitive information by replacing it with the type of sensitive information, the redaction format would be `REDACTED-%t`. + +An example policy using the `REDACT` filter strategy: + +``` +{ + "name": "email-address", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### The `CRYPTO_REPLACE` Filter Strategy {id="crypto"} + +The `CRYPTO_REPLACE` filter strategy replaces each identified piece of sensitive information by encrypting it using the AES encryption algorithm. To use this filter strategy, the policy must include the details of the encryption key as shown below: + +``` +{ + "name":"sample-profile", + "crypto": { + "key": "....", + "iv": "...." + }, + ... +``` + +In the snippet of a policy shown above, a `crypto` element is defined with a `key` and an initialization vector (`iv`). These two items are required to encrypt the sensitive information. To generate a key, run the following command: + +``` +openssl enc -e -aes-256-cbc -a -salt -P +``` + +You will be prompted to enter an encryption password. Once entered, the values of the `key` and `iv` will be shown. Copy and paste those values into the policy. 
+ +An example policy using the `CRYPTO_REPLACE` filter strategy: + +``` +{ + "name": "email-address", + "crypto": { + "key": "....", + "iv": "...." + }, + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "CRYPTO_REPLACE" + } + ] + } + } +} +``` + +### The `HASH_SHA256_REPLACE` Filter Strategy {id="hash"} + +The `HASH_SHA256_REPLACE` filter strategy replaces sensitive information with the SHA256 hash value of the sensitive information. To append a random salt value to each value prior to hashing, set the `salt` property to `true`. The salt value used will be returned in the `explain` response from Phileas' API. + +An example policy using the `HASH_SHA256_REPLACE` filter strategy: + +``` +{ + "name": "email-address", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "HASH_SHA256_REPLACE" + } + ] + } + } +} +``` + +### The FPE\_ENCRYPT\_REPLACE Filter Strategy {id="fpe"} + +The `FPE_ENCRYPT_REPLACE` filter strategy uses format-preserving encryption (FPE) to encrypt the sensitive information. Phileas uses the FF3-1 algorithm for format-preserving encryption. The FPE\_ENCRYPT\_REPLACE filter strategy requires a `key` and a `tweak` value. These values control the format-preserving encryption. For more information on these values and format-preserving encryption, refer to the resources below: + +* [https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38Gr1-draft.pdf](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-38Gr1-draft.pdf) +* [https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-38g.pdf](https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-38g.pdf) + +An example policy using the FPE\_ENCRYPT\_REPLACE filter strategy: + +``` +{ + "name": "credit-cards", + "identifiers": { + "creditCardNumbers": { + "creditCardNumbersFilterStrategies": [ + { + "strategy": "FPE_ENCRYPT_REPLACE", + "key": "...", + "tweak": "..." 
+ } + ] + } + } +} +``` + +### The `RANDOM_REPLACE` Filter Strategy {id="random"} + +Replaces the identified text with a fake value but of the same type. For example, an SSN will be replaced by a random text having the format `###-##-####`, such as 123-45-6789. An email address will be replaced with a randomly generated email address. Available to all filter types. + +An example policy using the `RANDOM_REPLACE` filter strategy: + +``` +{ + "name": "email-address", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "RANDOM_REPLACE" + } + ] + } + } +} +``` + +### The `STATIC_REPLACE` Filter Strategy {id="static"} + +Replaces the identified text with a given static value. Available to all filter types. + +An example policy using the `STATIC_REPLACE` filter strategy: + +``` +{ + "name": "email-address", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "STATIC_REPLACE", + "staticReplacement": "some new value" + } + ] + } + } +} +``` + +### The `TRUNCATE` Filter Strategy {id="truncate"} + +Available only to zip codes, this strategy allows for truncating zip codes to only a select number of digits. Specify `truncateDigits` to set the desired number of leading digits to leave. For example, if `truncateDigits` is 2, the zip code 90210 will be truncated to `90***`. + +The TRUNCATE filter strategy is available only to the zip code filter. An example policy using the `TRUNCATE` filter strategy: + +``` +{ + "name": "zip-codes", + "identifiers": { + "zipCode": { + "zipCodeFilterStrategies": [ + { + "strategy": "TRUNCATE", + "truncateDigits": 3 + } + ] + } + } +} +``` + +### The `ZERO_LEADING` Filter Strategy {id="zero_leading"} + +Available only to zip codes, this strategy changes the first 3 digits of a zip code to be 0. For example, the zip code 90210 will be changed to 00010. + +The `ZERO_LEADING` filter strategy is only available to zip code filters. 
An example zip code filter using the `ZERO_LEADING` filter strategy: + +``` +{ + "name": "zip-codes", + "identifiers": { + "zipCodes": { + "zipCodeFilterStrategies": [ + { + "strategy": "ZERO_LEADING" + } + ] + } + } +} +``` + +## Filter Strategy Conditions + +A replacement strategy can be applied based on the sensitive information meeting one or more conditions. For example, you can create a condition such that only dates of `11/05/2010` are replaced by using the condition `token == "11/05/2010"`. The conditions that can be applied vary based on the type of sensitive information. For instance, zip codes can have conditions based on their population. Refer to each specific [filter type](filters_README.md) for the conditions available. + +The following is an example policy for credit cards that contains a condition to only redact credit card numbers that start with the digits `3000`: + +``` +{ + "name": "default", + "identifiers": { + "creditCard": { + "creditCardFilterStrategies": [ + { + "condition": "token startswith \"3000\"", + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +#### Combining Conditions + +Conditions can be joined through the use of the `and` keyword. When conditions are joined, each condition must be satisfied for the identified text to be filtered. If any of the conditions are not satisfied the identified text will not be filtered. Below is an example joined condition: + +``` +token != "123-45-6789" and context == "my-context" +``` + +This condition requires that the identified text (the token) not be equal to `123-45-6789` and the context be equal to `my-context`. Both of these conditions must be satisfied for the identified text to be filtered. + +Conversely, conditions can be `OR`'d through the use of multiple filter strategies. 
For example, if we want to `OR` a condition on the token and a condition on the context, we would use two filter strategies: + +``` +"ssnFilterStrategies": [ + { + "condition": "token != \"123-45-6789\"", + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + }, + { + "condition": "context == \"my-context\"", + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } +] +``` diff --git a/docs/docs/filter_policies/filters.md b/docs/docs/filter_policies/filters.md new file mode 100644 index 0000000..f386619 --- /dev/null +++ b/docs/docs/filter_policies/filters.md @@ -0,0 +1,67 @@ +# Filters + +A "filter" corresponds to a type of sensitive information. Phileas has filters for sensitive information such as names, addresses, ages, and many others. + +There are predefined filters that are ready to be used, as well as custom filters that let you define your own filters to identify sensitive information outside of what the predefined filters can identify. An example of a custom filter is a filter to identify your patient account numbers, where the structure of an account number is specific to your organization. + +Each filter is capable of identifying and redacting a specific type of sensitive information. For example, there is a filter for phone numbers, a filter for US social security numbers, and a filter for person's names. You can enable any combination of these filters based on the types of sensitive information you need to redact. + +This section of the documentation describes the filters available in Phileas. The configuration options for each filter can vary due to the type of the sensitive information. For instance, only the zip code filter has a configuration to truncate the zip code. + +A selection of filters and their configurations is called a [policy](policies_README.md). A policy describes how to de-identify a document. + +## Predefined Filters + +### Person's Names + +Phileas uses several methods to identify person's names. 
+ +| Type | Description | +|-------------------------------------------------------------------------|----------------------------------------------------------------------| +| [First Names](filters/persons_names/first-names.md) | Identifies common first names | +| [Surnames](filters/persons_names/surnames.md) | Identifies common surnames | +| [Person's Names (NER)](filters/persons_names/persons-names-ner.md) | Identifies full names using natural language processing analysis | +| [Physician's Names (NER)](filters/persons_names/physician-names-ner.md) | Identifies physician names using natural language processing analysis | + +### Other Filters + +| Type | Description | +|----------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------| +| [Ages](filters/common_filters/ages.md) | Identifies ages such as `3.5 years old` | +| [Bank Routing Numbers](filters/common_filters/bank-routing-numbers.md.md) | Identifies bank routing numbers | +| [Bitcoin Addresses](filters/common_filters/bitcoin-addresses.md) | Identifies Bitcoin addresses such as `127NVqnjf8gB9BFAW2dnQeM6wqmy1gbGtv` | +| [Cities](filters/common_filters/cities.md) | Identifies common cities | +| [Counties](filters/common_filters/counties.md) | Identifies common counties | +| [Credit Card Numbers](filters/common_filters/credit-cards.md) | Identifies VISA, American Express, MasterCard, and Discover credit card numbers | +| [Dates](filters/common_filters/dates.md) | Identifies dates in many formats such as May 22, 1999 | +| [Driver's License Numbers](filters/common_filters/drivers-license-numbers.md) | Identifies driver's license numbers for all 50 US states | +| [Email Addresses](filters/common_filters/email-addresses.md) | Identifies email addresses | +| [Hospitals](filters/locations/hospitals.md) | Identifies common hospital names | +| [Hospital 
Abbreviations](filters/locations/hospital-abbreviations.md) | Identifies common hospitals by their name abbreviations | +| [IBAN Codes](filters/common_filters/iban-codes.md) | Identifies international bank account numbers | +| [IP Addresses](filters/common_filters/ip-addresses.md) | Identifies IPv4 and IPv6 addresses | +| [MAC Addresses](filters/common_filters/mac-addresses.md) | Identifies network MAC addresses | +| [Passport Numbers](filters/common_filters/passport-numbers.md) | Identifies US passport numbers | +| [Phone Numbers](filters/common_filters/phone-numbers.md) | Identifies phone numbers | +| [Phone Number Extensions](filters/common_filters/phone-number-extensions.md) | Identifies phone number extensions such as `x100` | +| [Sections](filters/common_filters/sections.md) | Identifies sections in text denoted by | +| [SSNs and TINs](filters/common_filters/ssns-and-tins.md) | Identifies US SSNs and TINs | +| [States](filters/locations/states.md) | Identifies US state names | +| [State Abbreviations](filters/locations/state-abbreviations.md) | Identifies US state names by their abbreviations | +| [Tracking Numbers](filters/common_filters/tracking-numbers.md) | Identifies UPS, FedEx, and USPS tracking numbers | +| [URLs](filters/common_filters/urls.md) | Identifies URLs | +| [VINs](filters/common_filters/vins.md) | Identifies vehicle identification numbers | +| [Zip Codes](filters/common_filters/zip-codes.md) | Identifies US zip codes | + +## Custom Filter Types of Sensitive Information + +In addition to the predefined types of sensitive information listed in the tables above, you can also define your own types of sensitive information. Through custom identifiers and dictionaries, Phileas can identify many other types of information that may be sensitive in your use-case. For example, if you have patient identifiers that follow a pattern of `AA-00000`, you can define a custom identifier for this sensitive information.
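
As a rough illustration of what a pattern such as `AA-00000` describes, the following Python sketch matches two uppercase letters, a hyphen, and five digits. This is plain Python rather than Phileas configuration; reading `A` as an uppercase letter and `0` as a digit is an assumption, and the names `PATIENT_ID` and `find_patient_ids` are hypothetical.

```python
import re

# Assumed reading of `AA-00000`: two uppercase letters, a hyphen, five digits.
# The word boundaries keep longer tokens such as `ABC-123456` from matching.
PATIENT_ID = re.compile(r"\b[A-Z]{2}-\d{5}\b")


def find_patient_ids(text: str) -> list[str]:
    """Return every substring of `text` that matches the assumed pattern."""
    return PATIENT_ID.findall(text)


print(find_patient_ids("Patient AB-12345 was seen with patient XY-00042."))
```

A custom identifier in a policy would carry an equivalent pattern; see the Custom Identifiers page for the actual configuration keys.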
+ +Phileas can be configured to identify sensitive information based on custom dictionaries. When a term in the dictionary is found in the text, Phileas will treat the term as sensitive information and apply the given filter strategy. + +Custom dictionaries support fuzziness to accommodate misspellings. The replacement strategy for a custom dictionary has a `sensitivityLevel` that controls the amount of allowed fuzziness. + +| Type | Description | +|-------------------------------------------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [Custom Dictionaries](filters/custom_filters/dictionary.md) | Identifies sensitive information based on dictionary values. | +| [Custom Identifiers](filters/custom_filters/identifier.md) | Identifies custom alphanumeric identifiers that may be used for medical record numbers, patient identifiers, account numbers, or other specific identifiers. | diff --git a/docs/docs/filter_policies/filters/common_filters/ages.md b/docs/docs/filter_policies/filters/common_filters/ages.md new file mode 100644 index 0000000..91ada5d --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/ages.md @@ -0,0 +1,57 @@ +# Ages + +## Filter + +This filter identifies ages such as `3.5 years old` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| --------------------- | -------------------------------------------------------------- | ------------- | +| `ageFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies.
When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. The filter will only be applied when the condition is satisfied. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "ages-example", + "identifiers": { + "age": { + "ageFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/bank-routing-numbers.md b/docs/docs/filter_policies/filters/common_filters/bank-routing-numbers.md new file mode 100644 index 0000000..a10a7bd --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/bank-routing-numbers.md @@ -0,0 +1,58 @@ +# Bank Routing Numbers + +## Filter + +This filter identifies bank routing numbers (ABA routing transit numbers) such as `111000025` in text. Identified routing numbers must pass checksum validation. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ----------------------------------- | -------------------------------------------------------------- | ------------- | +| `bankRoutingNumberFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- |---------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. 
| +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | + +### Conditions + +Each filter strategy may have one condition. The filter will only be applied when the condition is satisfied. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "bank-routing-number-example", + "identifiers": { + "bankRoutingNumber": { + "bankRoutingNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/bitcoin-addresses.md b/docs/docs/filter_policies/filters/common_filters/bitcoin-addresses.md new file mode 100644 index 0000000..60389b3 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/bitcoin-addresses.md @@ -0,0 +1,58 @@ +# Bitcoin Addresses + +## Filter + +This filter identifies bitcoin addresses such as `1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------------- | -------------------------------------------------------------- | ------------- | +| `bitcoinAddressFilterStrategies` | A list of filter strategies. 
| None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- |---------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "bitcoin-address-example", + "identifiers": { + "bitcoinAddress": { + "bitcoinAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/creditcards.md b/docs/docs/filter_policies/filters/common_filters/creditcards.md new file mode 100644 index 0000000..4a5217c --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/creditcards.md @@ -0,0 +1,62 @@ +# Credit Cards + +## Filter + +This filter identifies credit cards such as `378282246310005` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------------------- | ------------------------------------------------------------------ | ------------- | +| `creditCardFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `onlyValidCreditCardNumbers` | When set to true, only valid credit card numbers will be filtered. | `true` | +| `ignoreWhenInUnixTimestamp` | When set to true, only credit card numbers that do not match the pattern for a Unix timestamp will be filtered. | `false` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. 
+ +| Strategy | Description | +| --------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | +| `LAST_4` | Replace the sensitive text with just the last four characters of the text. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "credit-cards-example", + "identifiers": { + "creditcard": { + "onlyValidCreditCardNumbers": false, + "creditCardFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/dates.md b/docs/docs/filter_policies/filters/common_filters/dates.md new file mode 100644 index 0000000..7811506 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/dates.md @@ -0,0 +1,116 @@ +# Dates + +## Filter + +This filter identifies dates such as `May 22, 2014` in text. The supported date formats are: + +| Format | Example | +| ------------- | ------------------------- | +| yyyy-MM-d | 2020-05-10 | +| MM-dd-yyyy | 05-10-2020 | +| M-d-y | 5-10-2020 | +| MMM dd | May 5 or May 05 | +| MMMM dd, yyyy | May 5, 2020 or May 5 2020 | + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------------- | -------------------------------------------------------------- | ------------- | +| `dateFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `onlyValidDates` | When set to true, only valid dates will be filtered. | `false` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | ------------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. 
| +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `SHIFT` | Shift the date by a number of months, days, and/or years. | +| `SHIFTRANDOM` | Shift the date by a random number of months, days, and years. | +| `RELATIVE` | Replace the date with words relative to the date. | + +### Filter Strategy Options + +The following filter strategy options are available for the `RELATIVE` filter strategy. + +| Option | Description | Default Value | +| ------------- | -------------------------------------------------------------------------------------------------- | ------------- | +| `futureDates` | When `true`, future dates are replaced by relative words. When `false`, future dates are redacted. | `false` | + +The following filter strategy options are available for the `SHIFT` filter strategy. + +| Option | Description | Default Value | +| -------------- | ----------------------------------------------------------------------------------------------------------------- | ------------- | +| `shiftDays` | The number of days to shift the date. Can be a negative or positive integer. Defaults to `0` if not specified. | `0` | +| `shiftMonths` | The number of months to shift the date. Can be a negative or positive integer. Defaults to `0` if not specified. | `0` | +| `shiftYears` | The number of years to shift the date. Can be a negative or positive integer. Defaults to `0` if not specified. | `0` | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details.
+ +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `TOKEN` | Compares the sensitive text to some category, e.g. `birthdate`. | `is` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +#### Differentiating Between Dates and Birth Dates + +In some cases it may be necessary to redact birth dates and other dates differently. Using conditions, it is possible to determine if an identified date is a birth date. The conditional `token is birthdate` will determine if the identified date (token) is a birth date by analyzing the content surrounding the date. + +## Example Policy to Redact Dates + +The following policy redacts dates. + +``` +{ + "name": "dates-example", + "identifiers": { + "date": { + "onlyValidDates": false, + "dateFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +## Example Policy to Shift Dates + +The following policy shifts dates forward by 2 days and 4 months. + +``` +{ + "name": "dates-example", + "identifiers": { + "date": { + "onlyValidDates": false, + "dateFilterStrategies": [ + { + "strategy": "SHIFT", + "shiftDays": 2, + "shiftMonths": 4, + "shiftYears": 0 + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/drivers-license-numbers.md b/docs/docs/filter_policies/filters/common_filters/drivers-license-numbers.md new file mode 100644 index 0000000..0c60ced --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/drivers-license-numbers.md @@ -0,0 +1,58 @@ +# Driver's License Numbers + +## Filter + +This filter identifies driver's license numbers such as `194784357` in text.
Driver's license number formats for all 50 US states are supported. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------------- | -------------------------------------------------------------- | ------------- | +| `driversLicenseFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. 
| `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "drivers-license-example", + "identifiers": { + "driversLicense": { + "driversLicenseFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/email-addresses.md b/docs/docs/filter_policies/filters/common_filters/email-addresses.md new file mode 100644 index 0000000..c87b3e5 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/email-addresses.md @@ -0,0 +1,61 @@ +# Email Addresses + +## Filter + +This filter identifies email addresses such as `john.fake.address@hotmail.com` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| `emailAddressFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `onlyStrictMatches` | When set to false, the pattern for identifying email addresses will be relaxed. Filtered email addresses will have a lower confidence, but filter performance will increase. | `true` | +| `onlyValidTLDs` | When set to true, only email addresses that are for a top-level domain are filtered. | `false` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is +used. 
When multiple filter strategies are given the filter strategies will be applied in order as they are listed. +See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +|-----------------------|----------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +|--------------|--------------------------------------------------------------------------|------------------------------------| +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "email-address-example", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/iban-codes.md b/docs/docs/filter_policies/filters/common_filters/iban-codes.md new file mode 100644 index 0000000..7453a4b --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/iban-codes.md @@ -0,0 +1,62 @@ +# IBAN Codes + +## Filter + +This filter identifies IBAN (International Bank Account Number) codes such as `HU4211773016111110180000000` in text.
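
For readers unfamiliar with what makes an IBAN "valid", the sketch below shows the standard mod-97 check from ISO 13616 in Python. It is an illustration only: whether Phileas applies exactly this check when `onlyValidIBANCodes` is enabled is an assumption, the function name `is_valid_iban` is hypothetical, and country-specific length rules are not checked.

```python
def is_valid_iban(iban: str) -> bool:
    """Apply the ISO 13616 mod-97 check to an IBAN (spaces are ignored)."""
    compact = iban.replace(" ", "").upper()
    # Reject anything that cannot be an IBAN before doing arithmetic on it.
    if not (5 <= len(compact) <= 34) or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and the two check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace each letter with a number: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A well-formed IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1


print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))
print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))
```

The first call uses a widely published example IBAN; the second changes its last digit, so the checksum no longer matches.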
+ +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `allowSpaces` | When `true`, IBAN codes will be allowed to contain spaces and grouped in sections of 4. Set to `false` to disallow spaces in IBAN codes. | `true` | +| `ibanCodeFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `onlyValidIBANCodes` | When set to true, only valid IBAN codes will be filtered. | `true` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | +| `LAST_4` | Replace the sensitive text with just the last four characters of the text. 
| + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "iban-example", + "identifiers": { + "ibanCode": { + "onlyValidIBANCodes": false, + "ibanCodeFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/ip-addresses.md b/docs/docs/filter_policies/filters/common_filters/ip-addresses.md new file mode 100644 index 0000000..8cf1106 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/ip-addresses.md @@ -0,0 +1,57 @@ +# IP Addresses + +## Filter + +This filter identifies IPv4 and IPv6 addresses `127.0.0.1`, `192.168.3.58`, and `2001:0db8:85a3:0000:0000:8a2e:0370:7334` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| --------------------------- | -------------------------------------------------------------- | ------------- | +| `ipAddressFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. 
When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "ip-address-example", + "identifiers": { + "ipAddress": { + "ipAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/mac-addresses.md b/docs/docs/filter_policies/filters/common_filters/mac-addresses.md new file mode 100644 index 0000000..dfde325 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/mac-addresses.md @@ -0,0 +1,57 @@ +# MAC Addresses + +## Filter + +This filter identifies MAC addresses in text. + +### Required Parameters + +This filter has no required parameters. 
+ +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------------------- | -------------------------------------------------------------- | ------------- | +| `macAddressFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "mac-address-example", + "identifiers": { + "macAddress": { + "macAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/passport-numbers.md b/docs/docs/filter_policies/filters/common_filters/passport-numbers.md new file mode 100644 index 0000000..a9eb029 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/passport-numbers.md @@ -0,0 +1,59 @@ +# Passport Numbers + +## Filter + +This filter identifies US passport numbers in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------------- | -------------------------------------------------------------- | ------------- | +| `passportNumberFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. 
| +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ---------------- | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CLASSIFICATION` | Compares the issuing country of the passport number. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "passport-number-example", + "identifiers": { + "passportNumber": { + "passportNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/phone-number-extensions.md b/docs/docs/filter_policies/filters/common_filters/phone-number-extensions.md new file mode 100644 index 0000000..b227acb --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/phone-number-extensions.md @@ -0,0 +1,57 @@ +# Phone Number Extensions + +## Filter + +This filter identifies phone number extensions such as "x100" in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------------------- | -------------------------------------------------------------- | ------------- | +| `phoneNumberExtensionFilterStrategies` | A list of filter strategies.
| None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "phone-number-ext-example", + "identifiers": { + "phoneNumberExtension": { + "phoneNumberExtensionFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/phone-numbers.md b/docs/docs/filter_policies/filters/common_filters/phone-numbers.md new file mode 100644 index 0000000..ab44ef5 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/phone-numbers.md @@ -0,0 +1,57 @@ +# Phone Numbers + +## Filter + +This filter identifies phone and fax numbers such as (304) 555-5555, 304-555-5555, and 1-800-123-4567 in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ----------------------------- | -------------------------------------------------------------- | ------------- | +| `phoneNumberFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. 
| +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "phone-number-example", + "identifiers": { + "phoneNumber": { + "phoneNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/sections.md b/docs/docs/filter_policies/filters/common_filters/sections.md new file mode 100644 index 0000000..12ddb98 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/sections.md @@ -0,0 +1,61 @@ +# Sections + +## Filter + +This filter identifies sections in text between a given start regular expression pattern and a given end regular expression pattern. + +### Required Parameters + +| Parameter | Description | Default Value | +| -------------- | ------------------------------------------------------- | ------------- | +| `startPattern` | A regular expression denoting the start of the section. | None | +| `endPattern` | A regular expression denoting the end of the section. | None | + +### Optional Parameters + +| Parameter | Description | Default Value | +| ------------------------- | -------------------------------------------------------------- | ------------- | +| `sectionFilterStrategies` | A list of filter strategies. 
| None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "sections-example", + "identifiers": { + "section": { + "startPattern": "START", + "endPattern": "END", + "sectionFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/ssns-and-tins.md b/docs/docs/filter_policies/filters/common_filters/ssns-and-tins.md new file mode 100644 index 0000000..093a8b4 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/ssns-and-tins.md @@ -0,0 +1,59 @@ +# SSNs and TINs + +## Filter + +This filter identifies US SSNs and TINs such as `123-45-6789` and `123456789` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| --------------------- | -------------------------------------------------------------- | ------------- | +| `ssnFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- |---------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value.
| +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | +| `LAST_4` | Replace the sensitive text with just the last four characters of the text. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "ssn-tin-example", + "identifiers": { + "ssn": { + "ssnFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/tracking-numbers.md b/docs/docs/filter_policies/filters/common_filters/tracking-numbers.md new file mode 100644 index 0000000..8834ab6 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/tracking-numbers.md @@ -0,0 +1,59 @@ +# Tracking Numbers + +## Filter + +This filter identifies tracking numbers in text. FedEx, UPS, and USPS tracking number formats are supported. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------------- | -------------------------------------------------------------- | ------------- | +| `trackingNumberFilterStrategies` | A list of filter strategies. 
| None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | +| `LAST_4` | Replace the sensitive text with just the last four characters of the text. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "tracking-numbers-example", + "identifiers": { + "trackingNumber": { + "trackingNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/urls.md b/docs/docs/filter_policies/filters/common_filters/urls.md new file mode 100644 index 0000000..20f58ce --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/urls.md @@ -0,0 +1,59 @@ +# URLs + +## Filter + +This filter identifies URLs such as `myhomepage.com`, `http://myhomepage.com/folder/page.html`, and `www.myhomepage.com/folder/page.html` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------------- | ----------------------------------------------------------------------------- | ------------- | +| `urlFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `requireHttpWwwPrefix` | When set to true, only URLs that begin with `http` or `www` will be filtered. | `true` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. 
| +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "urls-example", + "identifiers": { + "url": { + "requireHttpWwwPrefix": true, + "urlFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/vins.md b/docs/docs/filter_policies/filters/common_filters/vins.md new file mode 100644 index 0000000..2652ef8 --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/vins.md @@ -0,0 +1,59 @@ +# VINs + +## Filter + +This filter identifies 17-character vehicle identification numbers (VINs) such as `WBAPM7G50ANL19218` and `1GBJC34K3RE176005` in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| --------------------- | -------------------------------------------------------------- | ------------- | +| `vinFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies.
When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `FPE_ENCRYPT_REPLACE` | Replace the sensitive text with a value generated by [format-preserving encryption](filter-strategies.md#fpe) (FPE) | +| `LAST_4` | Replace the sensitive text with just the last four characters of the text. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. 
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "vins-example", + "identifiers": { + "vin": { + "vinFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/common_filters/zip-codes.md b/docs/docs/filter_policies/filters/common_filters/zip-codes.md new file mode 100644 index 0000000..6ca527a --- /dev/null +++ b/docs/docs/filter_policies/filters/common_filters/zip-codes.md @@ -0,0 +1,61 @@ +# Zip Codes + +## Filter + +This filter identifies zip codes in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------- | +| `zipCodeFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `requireDelimiter` | When set to false, the filter will not require a dash in 9 digit zip codes, e.g. 12345-6789. Setting to false may increase the number of zip code false positives. | `true` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. 
| +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `TRUNCATE` | Replace the sensitive text by removing the last `x` digits. (Set the number of digits using the `truncateDigits` parameter of the filter strategy.) | +| `ZERO_LEADING` | Replace the sensitive text by zeroing the first 3 digits. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | +| `POPULATION` | Compares the population of the zip code against the 2010 census values. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "zip-code-example", + "identifiers": { + "zipCode": { + "zipCodeFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/custom_filters/dictionary.md b/docs/docs/filter_policies/filters/custom_filters/dictionary.md new file mode 100644 index 0000000..7ea9e08 --- /dev/null +++ b/docs/docs/filter_policies/filters/custom_filters/dictionary.md @@ -0,0 +1,70 @@ +# Dictionary + +## Filter + +This filter identifies custom text based on a given dictionary. + +### Required Parameters + +At least one of `terms` or `files` must be provided. 
+ +| Parameter | Description | Default Value | +| --------- | ---------------------------------------------- | ------------- | +| `terms` | A list of terms in the dictionary. | None | +| `files` | A list of files containing terms one per line. | None | + +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `fuzzy` | When set to true, the dictionary will employ fuzzy comparisons. Use the `sensitivity` parameter to control the level of fuzziness. Setting this value to false will disable fuzziness and provide a higher level of performance. | `false` | +| `classification` | Used to apply an arbitrary label to the identifier, such as "patient-id", or "account-number." | `"custom-identifier"` | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. Only applies when `fuzzy` is set to `true`. | `medium` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](filter-strategies.md) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value.
| +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "dictionary-example", + "identifiers": { + "dictionaries": [ + { + "terms": ["john", "jane", "doe"], + "files": ["c:\\temp\\dictionary.txt"], + "fuzzy": true, + "sensitivity": "medium", + "sectionFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + ] + } +} +``` diff --git a/docs/docs/filter_policies/filters/custom_filters/identifier.md b/docs/docs/filter_policies/filters/custom_filters/identifier.md new file mode 100644 index 0000000..121ec86 --- /dev/null +++ b/docs/docs/filter_policies/filters/custom_filters/identifier.md @@ -0,0 +1,71 @@ +# Identifier + +## Filter + +This filter identifies custom text based on a given regular expression. + +The Identifier filter accepts a list of regular expression-based identifiers. See the policy at the bottom of this page for an example. + +_Note that backslashes in the regular expression will need to be escaped for the policy to be valid JSON._ + +### Required Parameters + +This filter has no required parameters.
+ +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------- | ---------------------------------------------------------------------------------------------- | --------------------- | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | +| `caseSensitive` | When set to true, the regular expression will be case sensitive. | `true` | +| `classification` | Used to apply an arbitrary label to the identifier, such as "patient-id", or "account-number." | `"custom-identifier"` | +| `pattern` | A regular expression for the identifier. _Note that backslashes will need to be escaped._ | `\b[A-Z0-9_-]{4,}\b` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given the filter strategies will be applied in order as they are listed. See [Filter Strategies](filter-strategies.md) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `LAST_4` | Replace the sensitive text with just the last four characters of the text. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details.
+ +| Conditional | Description | Operators | +| ---------------- | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | +| `CLASSIFICATION` | Compares the classification of the sensitive text. | `==` , `!=` | + +## Example Policy + +``` +{ + "name": "default", + "identifiers": { + "identifiers": [ + { + "pattern": "[A-Z]{9}", + "caseSensitive": false, + "classification": "custom-identifier", + "enabled": true, + "identifierFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + ] + } +} +``` diff --git a/docs/docs/filter_policies/filters/locations/cities.md b/docs/docs/filter_policies/filters/locations/cities.md new file mode 100644 index 0000000..b475b5a --- /dev/null +++ b/docs/docs/filter_policies/filters/locations/cities.md @@ -0,0 +1,57 @@ +# Cities + +## Filter + +This filter identifies common US cities as determined by the US census in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `cityFilterStrategies` | A list of filter strategies. | None | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. 
When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "cities-example", + "identifiers": { + "city": { + "sensitivity": "medium", + "cityFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/locations/counties.md b/docs/docs/filter_policies/filters/locations/counties.md new file mode 100644 index 0000000..1a6a37f --- /dev/null +++ b/docs/docs/filter_policies/filters/locations/counties.md @@ -0,0 +1,57 @@ +# Counties + +## Filter + +This filter identifies common US counties as determined by the US census in text. + +### Required Parameters + +This filter has no required parameters.
+ +### Optional Parameters + +| Parameter | Description | Default Value | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `countyFilterStrategies` | A list of filter strategies. | None | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value.
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "counties-example", + "identifiers": { + "county": { + "sensitivity": "medium", + "countyFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/locations/hospital-abbreviations.md b/docs/docs/filter_policies/filters/locations/hospital-abbreviations.md new file mode 100644 index 0000000..bce0213 --- /dev/null +++ b/docs/docs/filter_policies/filters/locations/hospital-abbreviations.md @@ -0,0 +1,57 @@ +# Hospital Abbreviations + +## Filter + +This filter identifies US hospital abbreviations in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `hospitalAbbreviationFilterStrategies` | A list of filter strategies. | None | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value.
| +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "hospital-abbreviations-example", + "identifiers": { + "hospitalAbbreviation": { + "sensitivity": "medium", + "hospitalAbbreviationFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/locations/hospitals.md b/docs/docs/filter_policies/filters/locations/hospitals.md new file mode 100644 index 0000000..21fe7f1 --- /dev/null +++ b/docs/docs/filter_policies/filters/locations/hospitals.md @@ -0,0 +1,57 @@ +# Hospitals + +## Filter + +This filter identifies US hospitals in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `hospitalFilterStrategies` | A list of filter strategies. | None | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. 
When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "hospitals-example", + "identifiers": { + "hospital": { + "sensitivity": "medium", + "hospitalFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/locations/state-abbreviations.md b/docs/docs/filter_policies/filters/locations/state-abbreviations.md new file mode 100644 index 0000000..fa14fea --- /dev/null +++ b/docs/docs/filter_policies/filters/locations/state-abbreviations.md @@ -0,0 +1,57 @@ +# State Abbreviations + +## Filter + +This filter identifies US state abbreviations in text. + +### Required Parameters + +This filter has no required parameters.
+ +### Optional Parameters + +| Parameter | Description | Default Value | +| ------------------------------------ | -------------------------------------------------------------- | ------------- | +| `stateAbbreviationFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied. | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value.
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "states-abbreviations-example", + "identifiers": { + "stateAbbreviation": { + "stateAbbreviationFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/locations/states.md b/docs/docs/filter_policies/filters/locations/states.md new file mode 100644 index 0000000..c177f86 --- /dev/null +++ b/docs/docs/filter_policies/filters/locations/states.md @@ -0,0 +1,57 @@ +# States + +## Filter + +This filter identifies US states in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ----------------------- | -------------------------------------------------------------- | ------------- | +| `stateFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied. | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition.
See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "states-example", + "identifiers": { + "state": { + "stateFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/persons_names/first-names.md b/docs/docs/filter_policies/filters/persons_names/first-names.md new file mode 100644 index 0000000..e1b7d37 --- /dev/null +++ b/docs/docs/filter_policies/filters/persons_names/first-names.md @@ -0,0 +1,58 @@ +# First Names + +## Filter + +This filter identifies common first names as identified by the US census in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` | +| `firstNameFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. 
When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "first-names-example", + "identifiers": { + "firstName": { + "firstNameFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/persons_names/persons-names-ner.md b/docs/docs/filter_policies/filters/persons_names/persons-names-ner.md new file mode 100644 index 0000000..171a816 --- /dev/null +++ b/docs/docs/filter_policies/filters/persons_names/persons-names-ner.md @@ -0,0 +1,59 @@ +# Person's Names (NER) + +## Filter + +This filter identifies persons' names in text using natural language processing (NLP) and named-entity recognition (NER). + +### Required Parameters + +This filter has no required parameters.
+ +### Optional Parameters + +| Parameter | Description | Default Value | +| --------------------------- | ---------------------------------------------------------------- | ------------- | +| `removePunctuation` | When set to true, punctuation will be removed prior to analysis. | `false` | +| `nerFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied. | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | --------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | +| `ABBREVIATE` | Replace the sensitive text with the initials of the text. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value.
| `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "ner-example", + "identifiers": { + "ner": { + "nerFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/persons_names/physician-names-ner.md b/docs/docs/filter_policies/filters/persons_names/physician-names-ner.md new file mode 100644 index 0000000..ba156d6 --- /dev/null +++ b/docs/docs/filter_policies/filters/persons_names/physician-names-ner.md @@ -0,0 +1,57 @@ +# Physician Names + +## Filter + +This filter identifies physician names (e.g. Dr. John Smith) in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ------------------------------- | -------------------------------------------------------------- | ------------- | +| `physicianNameFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied. | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value.
| + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "physician-names-example", + "identifiers": { + "physicianName": { + "physicianNameFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/filters/persons_names/surnames.md b/docs/docs/filter_policies/filters/persons_names/surnames.md new file mode 100644 index 0000000..f1c4629 --- /dev/null +++ b/docs/docs/filter_policies/filters/persons_names/surnames.md @@ -0,0 +1,58 @@ +# Surnames + +## Filter + +This filter identifies common surnames as identified by the US census in text. + +### Required Parameters + +This filter has no required parameters. + +### Optional Parameters + +| Parameter | Description | Default Value | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------- | +| `sensitivity` | Controls the "fuzziness" of allowed values to account for misspellings and derivations. Valid values are `low`, `medium`, and `high`. | `medium` | +| `surnameFilterStrategies` | A list of filter strategies. | None | +| `enabled` | When set to false, the filter will be disabled and not applied | `true` | +| `ignored` | A list of terms to be ignored by the filter. | None | + +### Filter Strategies + +The filter may have zero or more filter strategies. 
When no filter strategy is given the default strategy of `REDACT` is used. When multiple filter strategies are given, they are applied in the order in which they are listed. See [Filter Strategies](#filter-strategies) for details. + +| Strategy | Description | +| --------------------- | -------------------------------------------------------- | +| `REDACT` | Replace the sensitive text with a placeholder. | +| `RANDOM_REPLACE` | Replace the sensitive text with a similar, random value. | +| `STATIC_REPLACE` | Replace the sensitive text with a given value. | +| `CRYPTO_REPLACE` | Replace the sensitive text with its encrypted value. | +| `HASH_SHA256_REPLACE` | Replace the sensitive text with its SHA256 hash value. | + +### Conditions + +Each filter strategy may have one condition. See [Conditions](#conditions) for details. + +| Conditional | Description | Operators | +| ------------ | ------------------------------------------------------------------------ | ---------------------------------- | +| `TOKEN` | Compares the value of the sensitive text. | `==` , `!=` | +| `CONTEXT` | Compares the filtering context. | `==` , `!=` | +| `CONFIDENCE` | Compares the confidence in the sensitive text against a threshold value. | `<` , `<=`, `>` , `>=`, `==`, `!=` | + +## Example Policy + +``` +{ + "name": "surnames-example", + "identifiers": { + "surname": { + "surnameFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/ignoring_sensitive_information.md b/docs/docs/filter_policies/ignoring_sensitive_information.md new file mode 100644 index 0000000..fc4161e --- /dev/null +++ b/docs/docs/filter_policies/ignoring_sensitive_information.md @@ -0,0 +1,144 @@ +# Ignoring Sensitive Information + +Phileas can optionally ignore a list of terms and prevent those terms from being redacted.
For example, if the name `John Smith` is being redacted and you do not want it to be redacted, you can add `John Smith` to an ignore list. Each time Phileas identifies sensitive information it will check the ignore lists to see if the sensitive information is to be ignored. + +> Phileas can ignore terms and patterns per-policy, meaning each policy can have its own unique list of terms or patterns to ignore. + + +## Ignore Lists + +Ignore lists can be specified at the policy level and/or for each filter in the policy. When set for the policy, the list of ignored terms will be applied to _all_ filter types. When set for a filter, the list of ignored terms will be applied _only_ to that filter. + +### Ignore List for a Policy + +In the policy shown below, an ignore list is set at the level of the policy. The terms specified in the list will be ignored for _all_ filter types enabled in the policy. Only the `terms` property is required. The `name` and `caseSensitive` properties are optional. + +``` +{ + "name": "example-policy", + "ignored": [ + { + "name": "names to ignore", + "terms": ["john smith", "jane doe"], + "caseSensitive": false + } + ], + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +Terms to be ignored at the policy level can also be read from one or more files located on the local file system. Each file must contain one term per line. + +``` +{ + "name": "example-policy", + "ignored": [ + { + "name": "names to ignore", + "terms": ["john smith", "jane doe"], + "files": ["/tmp/names.txt"], + "caseSensitive": false + } + ], + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### Ignore List for a Filter + +In the policy shown below, an ignore list is set at the level of a filter.
The terms specified in the list will be ignored _only_ for that filter type. Each filter in a policy can have its own list of ignored terms. Terms in a filter's ignore list are matched case-sensitively: "John" is ignored when "John" is in the list, but not when only "john" is in the list. + +``` +{ + "name": "example-filter-profile", + "identifiers": { + "emailAddress": { + "ignored": ["john smith", "jane doe"], + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +## Ignoring Patterns + +Phileas can ignore information based on a regular expression pattern. An example use of this feature is to ignore terms that are present in your text but are dynamic, such as logged timestamps. The date filter may identify these timestamps as sensitive even though you do not want them redacted. An ignore pattern that matches the logged timestamps prevents them from being redacted. + +## Ignore Patterns + +Ignore patterns can be specified at the policy level and/or at the level of each type of filter. When set at the policy level, the list of ignored patterns will be applied to _all_ filter types. When set for an individual filter, the list of ignored patterns will be applied _only_ to that filter. + +### Ignore Patterns for a Policy + +In the policy shown below, ignore patterns are set at the level of the policy. The patterns specified in the list will be ignored for _all_ filter types enabled in the policy. + +``` +{ + "name": "example-policy", + "ignoredPatterns": [ + { + "name": "ignore-room-numbers", + "pattern": "Room [A-Z0-4]{4}" + } + ], + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### Ignore Patterns for a Filter + +In the policy shown below, ignore patterns are set at the level of a filter. The patterns specified in the list will be ignored _only_ for that filter type.
Each filter in a policy can have its own list of ignored patterns. + +``` +{ + "name": "example-policy", + "identifiers": { + "emailAddress": { + "ignoredPatterns": [ + { + "name": "ignore-room-numbers", + "pattern": "Room [A-Z0-4]{4}" + } + ], + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` diff --git a/docs/docs/filter_policies/sample_filter_policies.md b/docs/docs/filter_policies/sample_filter_policies.md new file mode 100644 index 0000000..c6561ae --- /dev/null +++ b/docs/docs/filter_policies/sample_filter_policies.md @@ -0,0 +1,216 @@ +# Sample Policies + +This page lists some sample policies. You can use these policies either as-is or as starting points for customizing them to meet your specific de-identification needs. + + + +> These policies are examples and not an exhaustive list of all the sensitive information Phileas can identify. Items from each of these policies can be combined to make policies to meet your use-cases. + + +### Email Addresses and Phone Numbers + +This policy finds email addresses and phone numbers and redacts them with `{{{REDACTED-email-address}}}` and `{{{REDACTED-phone-number}}}`, respectively. + +``` +{ + "name": "email-and-phone-numbers", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "phoneNumber": { + "phoneNumberFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### Persons Names and SSNs + +This policy finds persons names and SSNs and redacts them with `{{{REDACTED-entity}}}` and `{{{REDACTED-ssn}}}`, respectively. 
+ +``` +{ + "name": "persons-names-ssn", + "identifiers": { + "ner": { + "nerFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "ssn": { + "ssnFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### Dates, URLs, and VINs + +This policy finds dates, URLs, and VINs. Dates and URLs are redacted with `{{{REDACTED-date}}}` and `{{{REDACTED-url}}}`, respectively. Each VIN is replaced by a randomly generated VIN. + +``` +{ + "name": "dates-urls-vin", + "identifiers": { + "date": { + "dateFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "url": { + "urlFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + }, + "vin": { + "vinFilterStrategies": [ + { + "strategy": "RANDOM_REPLACE" + } + ] + } + } +} +``` + +### IP Addresses + +This policy finds IP addresses and replaces each identified IP address with the static text `IP_ADDRESS` as long as the IP address is not `127.0.0.1`. (A condition on the filter strategy sets the IP address requirement.) + +``` +{ + "name": "ip-addresses", + "identifiers": { + "ipAddress": { + "ipAddressFilterStrategies": [ + { + "strategy": "STATIC_REPLACE", + "redactionFormat": "IP_ADDRESS", + "condition": "token != \"127.0.0.1\"" + } + ] + } + } +} +``` + +### Zip Codes + +This policy finds ZIP codes starting with `90` and truncates each one to just its first two digits. + +``` +{ + "name": "zip-codes", + "identifiers": { + "zipCode": { + "zipCodeFilterStrategies": [ + { + "condition": "token startswith \"90\"", + "strategy": "TRUNCATE", + "truncateDigits": 2 + } + ] + } + } +} +``` + +### Enable Text Splitting + +This policy enables text splitting for input over 10,000 characters.
+ +``` +{ + "name": "default-split-enabled", + "config": { + "splitting": { + "enabled": true, + "threshold": 10000, + "method": "newline" + } + }, + "identifiers": { + "ssn": { + "ssnFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### Globally Ignored Terms + +This policy has a list of globally ignored terms. + +``` +{ + "name": "default-global-ignore", + "ignored": [ + { + "name": "ignored credit cards", + "terms": ["4111111111111111", "0000000000000000"] + } + ], + "identifiers": { + "creditCard": { + "creditCardFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}" + } + ] + } + } +} +``` + +### Generating Alerts + +This policy generates an alert when a matching email address is identified. + +``` +{ + "name": "email-address-alert", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}", + "condition": "token == \"test@test.com\"", + "alert": true + } + ] + } + } +} +``` \ No newline at end of file diff --git a/docs/docs/images/ambiguous-span-example-context.png b/docs/docs/images/ambiguous-span-example-context.png new file mode 100644 index 0000000..3ccd0e1 Binary files /dev/null and b/docs/docs/images/ambiguous-span-example-context.png differ diff --git a/docs/docs/images/ambiguous-span-example.png b/docs/docs/images/ambiguous-span-example.png new file mode 100644 index 0000000..9fbacba Binary files /dev/null and b/docs/docs/images/ambiguous-span-example.png differ diff --git a/docs/docs/images/f1.png b/docs/docs/images/f1.png new file mode 100644 index 0000000..9a66b6c Binary files /dev/null and b/docs/docs/images/f1.png differ diff --git a/docs/docs/images/precision.png b/docs/docs/images/precision.png new file mode 100644 index 0000000..4b830d3 Binary files /dev/null and b/docs/docs/images/precision.png differ diff --git a/docs/docs/index.md b/docs/docs/index.md new 
file mode 100644 index 0000000..b29dd93 --- /dev/null +++ b/docs/docs/index.md @@ -0,0 +1,7 @@ +# Philter + +Philter is an application that finds, identifies, and removes sensitive information, such as protected health information (PHI) and personally identifiable information (PII), and user-defined sensitive information from natural language text. Philter is ideal for usage in text processing pipelines where sensitive information needs to be removed, encrypted, or redacted from the text. + +This documentation applies to Philter 2.4.0. If you are upgrading to this version, see Upgrading Philter. + +To get going fast, jump to the Quick Starts to launch Philter on AWS, Azure, or Google Cloud. diff --git a/docs/docs/other_features/alerts.md b/docs/docs/other_features/alerts.md new file mode 100644 index 0000000..b9e0cde --- /dev/null +++ b/docs/docs/other_features/alerts.md @@ -0,0 +1,50 @@ +# Alerts + +Phileas can optionally generate alerts when a particular type of sensitive information is identified. + +### Alert Conditions + +In a policy, each type of sensitive information can have zero or more filter strategies. Each filter strategy can optionally have a condition associated with it. When a condition is present, the filter strategy will only be applied when the condition is satisfied. For example, a condition may be created to only filter phone numbers that start with the digits `123` or only filter names that start with `John`. Filter strategy conditions give you granular control over the filtering process. + +When a filter strategy condition is satisfied, Phileas can optionally generate an alert. This feature allows you to be notified when a particular type of sensitive information is identified. + +### Enabling Alerts + +Alerts are enabled on a per-condition basis. For instance, in the following policy to identify email addresses, a condition has been added to only match the email address `test@test.com`. 
Because the `alert` property is set to `true`, an alert will be generated when this condition is satisfied. By default, the `alert` property is set to `false`, disabling alerts for the condition. + +``` +{ + "name": "email-address-alert", + "identifiers": { + "emailAddress": { + "emailAddressFilterStrategies": [ + { + "id": "my-email-strategy", + "strategy": "REDACT", + "redactionFormat": "{{{REDACTED-%t}}}", + "condition": "token == \"test@test.com\"", + "alert": true + } + ] + } + } +} +``` + +### Structure of an Alert + +An alert contains the following information: + +| Property Name | Description | +| --------------- | --------------------------------------------------------------------------------------------------------------- | +| `id` | A unique ID for the alert formatted as a UUID. | +| `filterProfile` | The name of the policy triggering the alert. | +| `strategyId` | The ID of the filter strategy triggering the alert. In the example above the `id` would be `my-email-strategy`. | +| `context` | The context. | +| `documentId` | The ID of the document which triggered the alert. | +| `filterType` | The filter type ("email-address", "credit-card", etc.) triggering the alert. | +| `date` | A timestamp when the alert was generated formatted as `yyyy-MM-dd'T'HH:mm:ss.SSS'Z'`. | + +### Retrieving and Deleting Alerts + +The alerts that Phileas has generated are available through Phileas' [alerts API](alerts-api.md). This API allows for retrieving and deleting alerts. Using this API you can build sophisticated notification systems around Phileas' capabilities. diff --git a/docs/docs/other_features/anonymization.md b/docs/docs/other_features/anonymization.md new file mode 100644 index 0000000..3fef412 --- /dev/null +++ b/docs/docs/other_features/anonymization.md @@ -0,0 +1,19 @@ +# Consistent Anonymization + +Anonymization in the context of Phileas is the process of replacing certain values with random but similar values. 
For example, the identified name of “John Smith” may be replaced with “David Jones”, or an identified phone number of 123-555-9358 may be replaced by 842-436-2042. A [VIN](vins.md) number will be replaced by a 17 character randomly selected VIN number that adheres to the standard for VIN numbers. + +Anonymization is useful in instances where you want to remove sensitive information from text without changing the meaning of the text. Anonymization can be enabled for each type of sensitive information in the policy by setting the filter strategy to `RANDOM_REPLACE`. (See [Policies](policies_README.md) for more information.) + +## Consistent Anonymization + +Consistent anonymization refers to the process of always anonymizing the same sensitive information with the same replacement values. For example, if the name "John Smith" is randomly replaced with "Pete Baker", all other occurrences of "John Smith" will also be replaced by "Pete Baker." + +Consistent anonymization can be done on the document level or on the context level. When enabled on the document level, "John Smith" will only be replaced by "Pete Baker" in the same document. If "John Smith" occurs in a separate document it will be anonymized with a different random name. When enabled on the context level, "John Smith" will be replaced by "Pete Baker" whenever "John Smith" is found in all documents in the same context. + +Enabling consistent anonymization on the context level requires a cache to store the sensitive information and the corresponding replacement values. If a single instance of Phileas is running, its internal cache service (enabled by default) is the best choice and no additional configuration is required. + +If multiple instances of Phileas are deployed together, Phileas requires access to a Redis cache service as shown below. See Phileas' [Settings](settings.md) on how to configure the cache. 
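+
+As a minimal sketch of that setup, the Redis-backed cache could be enabled in `application.properties` using the cache properties listed on the Settings page. The host name `redis.example.internal` is a placeholder rather than a real endpoint, and enabling SSL is an assumption about the deployment:
+
+```
+# Use Redis as the shared cache for all Phileas instances.
+cache.redis.enabled=true
+# Placeholder host; replace with your own Redis endpoint.
+cache.redis.host=redis.example.internal
+cache.redis.port=6379
+# Assumes the Redis endpoint accepts SSL connections.
+cache.redis.ssl=true
+```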
+ +**When Phileas is deployed in a cluster, a Redis cache is required to enable consistent anonymization.** + +The anonymization cache will contain PHI. It is important that you take the necessary precautions to secure the cache and all communication to and from the cache. diff --git a/docs/docs/other_features/span_disambiguation.md b/docs/docs/other_features/span_disambiguation.md new file mode 100644 index 0000000..7578441 --- /dev/null +++ b/docs/docs/other_features/span_disambiguation.md @@ -0,0 +1,31 @@ +# Span Disambiguation + +Span disambiguation is an optional feature in Phileas that is disabled by default. Refer to Phileas' [Settings](settings.md#cache) to enable and configure span disambiguation. + +In Phileas, a _span_ is a piece of the input text that Phileas has identified as sensitive information. A span has start and end positions, a confidence, a type, and other attributes. Ideally, each piece of identified sensitive information will only have a single span associated with it. In this case, the type of sensitive information is unambiguous. The goal of span disambiguation is to provide more accurate filtering by removing the potential ambiguities in the types of sensitive information for duplicate spans. + +However, sometimes a piece of text can be identified by multiple spans, each having a different type of sensitive information. As a hypothetical example, given the input text `My SSN is 123456789.`, Phileas identifies `123456789` as both an SSN and a phone number. This type of scenario can be quite common, and its likelihood increases as the number of enabled filters in a policy increases. + +### How Phileas' Span Disambiguation Works + +When we read the sentence `My SSN is 123456789.` we can tell the span in question should be identified as an SSN because we can look at the text surrounding the span. We use the surrounding words to deduce the correct type of sensitive information for `123456789`. 
+ +That is exactly how Phileas' span disambiguation works. When presented with identical spans differing only by the type of sensitive information, Phileas looks at the text surrounding the span in question in combination with the previous spans it has seen in the same context to determine which type of sensitive information is most likely to be correct. Phileas then removes the ambiguous spans from the results and replaces them with a single span. + +### Improves Over Time + +Because Phileas is able to consider previously seen text to make its decision concerning ambiguous spans, Phileas' span disambiguation gets "smarter" as more text is filtered. This is because Phileas will have more text to consider in its calculations. + +### More Details + +#### Span Disambiguation and Confidence Values + +Span disambiguation is only invoked for spans that differ only by the type of sensitive information. This means the span's location (start and end positions), confidence, and all other values must match. If two spans have identical locations but have different confidence values, span disambiguation will not be applied and the span having the highest confidence will be used. + +#### Cache Service + +When multiple applications using Phileas are deployed alongside each other behind a load balancer, Phileas' [cache service](settings.md#cache) should be configured and enabled. Phileas will store the information needed to disambiguate spans in the cache such that the information is available to each instance of Phileas. If only a single instance of Phileas is running, the cache service is not required. However, the information needed to disambiguate spans will then be stored in memory and will be lost when Phileas is stopped or restarted. Because of this, we recommend the cache service always be used unless there is a specific reason not to. + +#### Fine-Tuning the Span Disambiguation + +There are properties available to fine-tune how the span disambiguation operates. 
These properties are not documented because improper use of the properties could have a negative impact on performance. We will be glad to walk through these properties upon request. diff --git a/docs/docs/settings.md b/docs/docs/settings.md new file mode 100644 index 0000000..26c2aa6 --- /dev/null +++ b/docs/docs/settings.md @@ -0,0 +1,70 @@ +# Settings + +Phileas has settings to control how it operates. The settings and how to configure each are described below. + +> The configuration for the types of sensitive information that Phileas identifies is defined in [filter policies](filter_policies/filter_policies.md) outside of Phileas' configuration properties described on this page. + +## Configuring Phileas + +### The Phileas Settings File + +Phileas looks for its settings in an `application.properties` file. + +### Using Environment Variables + +Properties set via environment variables take precedence over properties set in Phileas' settings file. + +All of the following properties can also be set as environment variables by prepending `PHILTER_` to the property name, converting it to uppercase, and changing periods to underscores. For example, the property `filter.profiles.directory` can be set using the environment variable `PHILTER_FILTER_PROFILES_DIRECTORY` by: + +``` +export PHILTER_FILTER_PROFILES_DIRECTORY=/profiles/ +``` + +Using environment variables to configure Phileas instead of using Phileas' settings file can allow for easier configuration management when deploying Phileas. + +## Policies + +| Setting | Description | Allowed Values | Default Value | +| --------------------------- | -------------------------------------------- | ------------------------- | ------------- | +| `filter.policies.directory` | The directory in which to look for policies. | Any valid directory path. | `./policies/` | + +## Span Disambiguation + +These values configure Phileas' span disambiguation feature to determine the most appropriate type of sensitive information when duplicate spans are identified. 
In a deployment of multiple Phileas instances, you must enable the [cache service](Settings#cache) for span disambiguation to work as expected. + +| | Description | Allowed Values | Default Value | +| ----------------------------- | --------------------------------------------- | --------------- | ------------- | +| `span.disambiguation.enabled` | Whether or not to enable span disambiguation. | `true`, `false` | `false` | + +## Cache Service + +The cache service is required to use [consistent anonymization](anonymization.md) and policies stored in Amazon S3. Phileas supports Redis as the backend cache. When Redis is not used, an in-memory cache is used instead. The in-memory cache is not recommended because all contents will be stored in memory on the local Phileas instance. + +The cache will contain sensitive information. It is important that you take the necessary precautions to secure the cache itself and all communication between Phileas and the cache. + +| Setting | Description | Allowed Values | Default Value | +| ------------------------ | ----------------------------------------------------------------- | ------------------------- | ------------- | +| `cache.redis.enabled` | Whether or not to use Redis as the cache. | `true`, `false` | `false` | +| `cache.redis.host` | The hostname or IP address of the Redis cache. | Any valid Redis endpoint. | None | +| `cache.redis.port` | The Redis cache port. | Any valid port. | `6379` | +| `cache.redis.auth.token` | The Redis auth token. | Any valid token. | None | +| `cache.redis.ssl` | Whether or not to use SSL for communication with the Redis cache. | `true`, `false` | `false` | + +The following Redis settings are only required when using a self-signed SSL certificate. + +| Setting | Description | Allowed Values | Default Value | +| --------------------------------- | ---------------------------- | -------------------- | ------------- | +| `cache.redis.truststore` | The path to the trust store. 
| Any valid file path. | None | +| `cache.redis.truststore.password` | The trust store password. | Any valid password. | None | +| `cache.redis.keystore` | The path to the keystore. | Any valid file path. | None | +| `cache.redis.keystore.password` | The keystore password. | Any valid password. | None | + +## Advanced Settings + +> In most cases the settings below do not need to be changed. Contact us for more information on any of these settings. + +| Setting | Description | Allowed Values | Default Value | +| ---------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | ------------- | +| `ner.timeout.sec` | Controls the timeout in seconds when performing named entity recognition. Longer text may require longer processing times. | An integer value. | `600` | +| `ner.max.idle.connections` | The maximum number of idle connections to maintain for the named entity recognition. More connections may improve performance in some cases. | An integer value. | `30` | +| `ner.keep.alive.duration.ms` | The amount of time in milliseconds to keep named entity recognition connections alive. Longer text may require longer processing times. | An integer value. | `60` | diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml new file mode 100644 index 0000000..7c2b59d --- /dev/null +++ b/docs/mkdocs.yml @@ -0,0 +1,28 @@ +site_name: Philter +repo_url: https://github.com/philterd/philter +site_url: https://philterd.github.io/philter/ +edit_uri: '?query=docs/docs/' +copyright: Copyright 2024 Philterd, LLC +site_author: philterd +site_description: User guide for Philter, the open source PII/PHI redaction engine. 
+theme: + name: material + locale: en + palette: + primary: red +nav: + - Home: index.md + - Deidentification: + - 'Bucketing': 'deidentification/bucketing.md' + - 'Date Shifting': 'deidentification/date-shifting.md' + - 'Filter Policies': + - 'Filter Policies': 'filter_policies/filter_policies.md' + - 'Filters': 'filter_policies/filters.md' + - 'Filter Strategies': 'filter_policies/filter_strategies.md' + - 'Ignoring Sensitive Information': 'filter_policies/ignoring_sensitive_information.md' + - 'Sample Filter Policies': 'filter_policies/sample_filter_policies.md' + - 'Other Features': + - 'Alerts': 'other_features/alerts.md' + - 'Anonymization': 'other_features/anonymization.md' + - 'Span Disambiguation': 'other_features/span_disambiguation.md' + - Settings: settings.md \ No newline at end of file diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 0000000..9a8a4ca --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,2 @@ +mkdocs +mkdocs-material