#11 Working on moving documentation.

philterd · Oct 23, 2024 · 201ce14
1 parent 21e9374
commit 201ce14

Showing 57 changed files with 3,106 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -32,3 +32,4 @@ _training-models/
 temp-files/
 vocab.txt
 dependencies.csv
+site/
diff --git a/README.md b/README.md
@@ -6,6 +6,8 @@ Philter is built upon the open source PII and PHI redaction engine [Phileas](htt
 
 Philter was released as open source under the Apache License, version 2.0, in July 2024 for version 2.6.0, but Philter dates back to 2019. See the [Release Notes](https://github.com/philterd/philter/blob/main/RELEASE_NOTES.md) for a description of past versions.
 
+For Philter's User Guide please see https://philterd.github.io/philter/.
+
 ## Philter on the Cloud Marketplaces
 
 Philter is available on the cloud marketplaces as a turnkey redaction solution. These cloud images are pre-configured and ready to be used immediately after launch.

diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,23 @@
+# Phileas Documentation
+
+The documentation files here are markdown files used by MkDocs.
+
+To build the documentation website, first install MkDocs:
+
+```
+python3 -m venv venv
+source ./venv/bin/activate
+python3 -m pip install -r requirements.txt
+```
+
+Next, build the site:
+
+```
+mkdocs build
+```
+
+To serve the documentation website while editing:
+
+```
+mkdocs serve
+```
diff --git a/docs/docs/deidentification/bucketing.md b/docs/docs/deidentification/bucketing.md
@@ -0,0 +1,2 @@
+# Bucketing
+
diff --git a/docs/docs/deidentification/date-shifting.md b/docs/docs/deidentification/date-shifting.md
@@ -0,0 +1,2 @@
+# Date Shifting
+
diff --git a/docs/docs/deidentification/de-identification_README.md b/docs/docs/deidentification/de-identification_README.md
@@ -0,0 +1,13 @@
+# De-identification Methods
+
+There are several ways data can be de-identified, and which one you use depends on the types of data you want to de-identify and your use-case for de-identifying the data. The terms for the different methods are often used interchangeably, but the methods themselves differ.
+
+> In this User's Guide, we may use the terms `filter` and `redact` interchangeably.
+
+In Philter, de-identification methods vary for each type of sensitive information. For example, all types can be replaced or redacted, but only dates can be shifted and only zip codes can be truncated. How a de-identification method is applied by Philter is called a filter strategy. Each type of sensitive information can have one or more filter strategies, and the combination of the filter strategies you select is called a policy. A policy determines how a document will be de-identified.
+
+The following is a list of de-identification methods that describes how each method works and its applicability to our [Philter](https://philterd.ai/philter/) software. De-identifying a document is likely to require a combination of the following methods. For instance, you may want to redact names, encrypt credit card numbers, and shift appointment dates.
+
+<table><thead><tr><th width="268">De-identification Method</th><th>Description</th></tr></thead><tbody><tr><td><a href="replacement.md">Replacement</a></td><td>Replaces sensitive information with a defined value. For example, you might want to replace a credit card number with the literal value "CREDIT_CARD_NUMBER".</td></tr><tr><td><a href="redaction-and-masking.md">Redaction and Masking</a></td><td>Removes sensitive information. Our Philter software gives you a choice of how to remove the sensitive information, whether it is by replacing it with ***** (masking) or by some other set of characters.</td></tr><tr><td><a href="encryption.md">Encryption</a></td><td>Encrypts sensitive information.</td></tr><tr><td><a href="date-shifting.md">Date Shifting</a></td><td>Shifts dates either forward or backward by some interval.</td></tr><tr><td><a href="bucketing.md">Bucketing</a></td><td>Categorizes data into buckets based on its value. For example, Philter can bucket dates into years and zip codes by population.</td></tr></tbody></table>
+
+> A difference between [Philter](https://philterd.ai/philter/) and other services is that Philter does not send your data to a third party for de-identification. Philter runs in your cloud and your data stays in your cloud.
diff --git a/docs/docs/deidentification/encryption.md b/docs/docs/deidentification/encryption.md
@@ -0,0 +1,2 @@
+# Encryption
+
diff --git a/docs/docs/deidentification/redaction-and-masking.md b/docs/docs/deidentification/redaction-and-masking.md
@@ -0,0 +1,5 @@
+# Redaction and Masking
+
+Redaction and masking are two methods of de-identification that are often used interchangeably. The term redaction refers to removing a sensitive value from a document. When we hear the term redaction we often think of an image of a document with black bars across pieces of the text.
+
+Masking is similar to redaction but allows for configuring how the sensitive value is removed. The most common example is using asterisks (i.e. \*\*\*\*\*\*) in place of a sensitive value.
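The difference between the two can be illustrated with a minimal sketch. The `mask` helper below is hypothetical and is not Philter's implementation; it assumes the sensitive value has already been located by its start and end offsets, and that the mask should keep the original length.

```python
def mask(text: str, start: int, end: int, mask_char: str = "*") -> str:
    """Replace text[start:end] with mask_char repeated to the same length."""
    return text[:start] + mask_char * (end - start) + text[end:]


# Hypothetical input: the phone number occupies offsets 5 through 12.
sentence = "Call 555-1234 today."
masked = mask(sentence, 5, 13)
print(masked)
```

Keeping the masked value the same length as the original is a design choice made for this sketch only; a fixed-length mask would reveal less about the original value.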
diff --git a/docs/docs/deidentification/replacement.md b/docs/docs/deidentification/replacement.md
@@ -0,0 +1,5 @@
+# Replacement
+
+Replacement is a method of de-identification that simply replaces a sensitive value with another value. Replacement is useful when the sensitive value is not needed once the document has been de-identified. Philter can replace a sensitive value with a preset value or with a random value.
+
+In Philter's filter strategies, replacement is achieved by using the `REDACT`, `STATIC_REPLACE`, or `RANDOM_REPLACE` strategy.
diff --git a/docs/docs/evaluating-performance.md b/docs/docs/evaluating-performance.md
@@ -0,0 +1,160 @@
+# How to Evaluate Phileas' Performance
+
+A common question we receive is: how well does Phileas perform? Our answer is probably less than satisfactory, because it depends. Phileas' performance is heavily dependent upon your individual data. Comparing metrics of Phileas' performance between different customer datasets is like comparing apples and oranges.
+
+If your data is not exactly like another customer's data then the metrics will not be applicable to your data. In terms of the classic information retrieval metrics precision and recall, comparing these values between customers can give false impressions about Phileas' performance, both good and bad.
+
+> This guide walks you through how to evaluate Phileas' performance. If you are just getting started with Phileas please see the Quick Starts instead. Then you can come back here to learn how to evaluate Phileas' performance.
+
+## Guide to Evaluating Performance
+
+We have created this guide to help you evaluate Phileas' performance on your data. The guide involves determining the types of sensitive information you want to redact, configuring those filters, optimizing the configuration, and then capturing the performance metrics.
+
+> If you are using Philter we will gladly perform these steps for you and provide you with a detailed Phileas performance report generated from your data. Please contact us to start the process.
+
+#### What You Need
+
+To evaluate Phileas' performance you need:
+
+* An application using Phileas.
+* A list of the types of sensitive information you want to redact.
+* A data set representative of the text you will be redacting using Phileas. It's important the data set be representative so the evaluation results will transfer to the actual data redaction.
+* The same data set but with annotated sensitive information. These annotations will be used to calculate the precision and recall metrics.
+
+#### Configuring Phileas
+
+Before we can begin our evaluation we need to create a policy. A [policy](policies_README.md) is a file that defines the types of sensitive information that will be redacted and how it will be redacted. The policies are stored on the Phileas instance under `/opt/Phileas/policies`. You can edit the policies directly there using a text editor or you can use Phileas' [API](policies-api.md) to upload a policy. In this case we recommend just using a text editor on the Phileas instance to create a policy.
+
+When using a text editor to create and edit a policy, be sure to save the policy often. Frequent saving can make editing a policy easier.
+
+We also recommend placing your policy directory under source control to keep a history and change log of your policies.
+
+#### Creating a Policy
+
+Make a copy of the default policy, and we will modify the copy for our needs.
+
+`cp /opt/Phileas/policies/default.json /opt/Phileas/policies/evaluation.json`
+
+Now open `/opt/Phileas/policies/evaluation.json` in a text editor. (The content of `evaluation.json` will be similar to what's shown below but may have minor differences between different versions of Phileas.)
+
+```
+{
+   "name": "default",
+   "identifiers": {
+      "emailAddress": {
+         "emailAddressFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      },
+      "phoneNumber": {
+         "phoneNumberFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      }
+   }
+}
+```
+
+The first thing we need to do is to set the name of the policy. Replace `default` with `evaluation` and save the file.
+
+#### Identifying the Filters You Need
+
+The rest of the file contains the filters that are enabled in the default policy. We need to make sure that each type of sensitive information that you want to redact is represented by a filter in this file. Look through the rest of the policy and determine which filters are listed that you do not need and also which filters you do need that are not listed.
+
+#### Disabling Filters We Do Not Need
+
+If a filter is listed in the policy and you do not need it, you have two options. You can either delete the filter's lines from the policy and save the file, or you can set the filter's `enabled` property to `false`. Both options have the same effect, but using the `enabled` property keeps the filter configuration in the policy in case it is needed later.
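As a minimal sketch of the second option, the hypothetical helper `disable_filter` below sets `enabled` to false on one filter of a policy loaded as a Python dictionary. It assumes that `enabled` sits directly on the filter object (next to its filter strategies); check the filter's documentation for the exact placement in your version.

```python
import copy


def disable_filter(policy: dict, filter_name: str) -> dict:
    """Return a copy of the policy with the named filter disabled."""
    updated = copy.deepcopy(policy)
    # Assumption: `enabled` is a property of the filter object itself.
    updated["identifiers"][filter_name]["enabled"] = False
    return updated


strategy = {"strategy": "REDACT", "redactionFormat": "{{{REDACTED-%t}}}"}
policy = {
    "name": "evaluation",
    "identifiers": {
        "emailAddress": {"emailAddressFilterStrategies": [strategy]},
        "phoneNumber": {"phoneNumberFilterStrategies": [strategy]},
    },
}

updated = disable_filter(policy, "phoneNumber")
```

The original dictionary is left untouched, so the change can be reviewed before it is written back to the policy file.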
+
+#### Enabling Filters Not in the Default Policy
+
+Let's say you want to redact bitcoin addresses. The bitcoin address filter is not in the default policy. To add the bitcoin address filter we will refer to Phileas' documentation on the bitcoin address filter, get the configuration, and copy it into the policy.
+
+From the [bitcoin address filter documentation](bitcoin-addresses.md) we see the configuration for the bitcoin address filter is:
+
+```
+      "bitcoinAddress": {
+         "bitcoinAddressFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      }
+```
+
+We can copy this configuration and paste it into our policy:
+
+```
+{
+   "name": "evaluation",
+   "identifiers": {
+      "bitcoinAddress": {
+         "bitcoinAddressFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      },
+      "emailAddress": {
+         "emailAddressFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      },
+      "phoneNumber": {
+         "phoneNumberFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      }
+   }
+}
+```
+
+The order of the filters in the policy does not matter and has no impact on performance. We typically place the filters in the policy alphabetically just to improve readability.
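If you prefer to script the copy-and-paste step, it could be sketched as follows. The helper name `add_filter` is hypothetical; the sketch only manipulates the policy as a dictionary and keeps the filters in alphabetical order for readability.

```python
import json


def add_filter(policy: dict, filter_name: str, filter_config: dict) -> dict:
    """Return a new policy with the filter added and filters sorted by name."""
    identifiers = dict(policy["identifiers"])
    identifiers[filter_name] = filter_config
    # Sorting is cosmetic only: filter order does not affect redaction.
    return {**policy, "identifiers": dict(sorted(identifiers.items()))}


strategy = {"strategy": "REDACT", "redactionFormat": "{{{REDACTED-%t}}}"}
policy = {
    "name": "evaluation",
    "identifiers": {
        "emailAddress": {"emailAddressFilterStrategies": [strategy]},
        "phoneNumber": {"phoneNumberFilterStrategies": [strategy]},
    },
}

updated = add_filter(
    policy, "bitcoinAddress", {"bitcoinAddressFilterStrategies": [strategy]}
)
print(json.dumps(updated, indent=3))
```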
+
+Repeat these steps until you have added a filter for each of the types of sensitive information you want to redact. Typically, the default redaction `strategy` and `redactionFormat` values for each filter should be fine for evaluation.
+
+When finished modifying the policy, save the file and close the text editor. Now restart Phileas for the policy changes to be loaded:
+
+```
+sudo systemctl restart Phileas
+```
+
+#### Submitting Text for Redaction
+
+With our policy in place we can now send text to Phileas for redaction using that policy:
+
+```
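+// Assumes policies, context, documentId, and body are defined by the calling application.
+PhileasConfiguration phileasConfiguration = ConfigFactory.create(PhileasConfiguration.class);
+
+// Create the filter service from the configuration.
+FilterService filterService = new PhileasFilterService(phileasConfiguration);
+
+// Redact the plain-text body using the given policies.
+FilterResponse response = filterService.filter(policies, context, documentId, body, MimeType.TEXT_PLAIN);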
+```
+
+The `explain` API [endpoint](filtering-api.md#explain) produces a detailed description of the redaction. The response will include a list of spans that contain the start and stop positions of redacted text and the type of sensitive information that was redacted. Using this information we can compare the redacted information to our annotated file to calculate precision and recall metrics.
+
+#### Calculating Precision and Recall
+
+Now we can calculate the precision and recall metrics.
+
+* Precision is the number of true positives divided by the number of true positives plus false positives.
+* Recall is the number of true positives divided by the number of true positives plus false negatives.
+
+![Calculating the precision and recall](Images/precision.png)
+
+* The F-1 score is the harmonic mean of precision and recall.
+
+![Calculating the F-1 score](Images/f1.png)
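The calculation can be sketched in a few lines. This is a minimal illustration, assuming each span is represented as a `(start, end, type)` tuple and that a predicted span counts as a true positive only when it exactly matches an annotated span; the `score` function and the sample spans are hypothetical, and you may prefer a more lenient matching rule such as partial overlap.

```python
def score(predicted: set, annotated: set) -> dict:
    """Compute precision, recall, and F-1 from exact span matches."""
    tp = len(predicted & annotated)   # redacted and annotated
    fp = len(predicted - annotated)   # redacted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical spans: (start, end, type of sensitive information).
annotated = {(0, 16, "email-address"), (30, 42, "phone-number"), (50, 62, "phone-number")}
predicted = {(0, 16, "email-address"), (30, 42, "phone-number"), (70, 75, "phone-number")}

metrics = score(predicted, annotated)
print(metrics)
```

In this example two spans match, one predicted span is spurious, and one annotated span is missed, so precision, recall, and the F-1 score all equal 2/3.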
diff --git a/docs/docs/filter_policies/filter_policies.md b/docs/docs/filter_policies/filter_policies.md
@@ -0,0 +1,65 @@
+# Filter Policies
+
+The types of sensitive information identified by Phileas and how that information is de-identified are controlled through policies. A policy is a file stored under Phileas’s `policies` directory, which by default is located at `/opt/Phileas/policies/`. You can have an unlimited number of policies.
+
+Each policy has a `name` that is used by Phileas to apply the appropriate de-identification methods. The `name` is passed to Phileas’s [API](filtering-api.md) along with the text to be filtered when submitting text to Phileas. This provides flexibility and allows you to de-identify different types of documents in differing manners with a single instance of Phileas. For example, you may have a policy for bankruptcy documents and a separate policy for financial documents.
+
+> There are [sample policies](sample_filter_policies.md) available for immediate use or customization to fit your use-cases.
+
+
+### The Structure of a Policy
+
+A policy:
+
+* Must have a `name` that uniquely identifies it.
+* Must have a list of `identifiers` that are filters for sensitive information.
+    * Each `identifier`, or filter, can have zero or more [filter strategies](filter-strategies.md). A filter strategy tells Phileas how to manipulate that type of sensitive information when it is identified.
+* Can have an optional list of [terms](ignore-lists.md) or [patterns](ignoring-patterns.md).
+* Can have encryption keys to support [encryption](filter-strategies.md#fpe) of sensitive information.
+
+### An Example Policy
+
+The following is an example policy. In the example below you can see the [types of sensitive information](filters_README.md) that are enabled and the strategy for manipulating each type when found. This policy identifies email addresses and phone numbers and redacts each with the format given.
+
+```
+{
+   "name": "email-and-phone-numbers",
+   "identifiers": {
+      "emailAddress": {
+         "emailAddressFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      },
+      "phoneNumber": {
+         "phoneNumberFilterStrategies": [
+            {
+               "strategy": "REDACT",
+               "redactionFormat": "{{{REDACTED-%t}}}"
+            }
+         ]
+      }
+   }
+}
+```
+
+When an email address is identified by this policy, the email address is replaced with the text `{{{REDACTED-email-address}}}`. The `%t` gets replaced by the type of the filter. Likewise, when a phone number is found it is replaced with the text `{{{REDACTED-phone-number}}}`. You are free to change the redaction formats to whatever fits your use-case. See [Filter Strategies](filter-strategies.md) for all replacement options.
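The `%t` substitution described above can be pictured with a short sketch. The function below is hypothetical and only illustrates the documented behavior; it is not how Phileas implements redaction formats, and other placeholders are not handled.

```python
def apply_redaction_format(redaction_format: str, filter_type: str) -> str:
    """Substitute the filter type for the %t placeholder."""
    return redaction_format.replace("%t", filter_type)


email_replacement = apply_redaction_format("{{{REDACTED-%t}}}", "email-address")
phone_replacement = apply_redaction_format("{{{REDACTED-%t}}}", "phone-number")
print(email_replacement)
print(phone_replacement)
```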
+
+The name of the policy is `email-and-phone-numbers`. Policies can be named anything you like, but each name must be unique among all policies. As a best practice, the policy should be saved as `[name].json`, e.g. `email-and-phone-numbers.json`.
+
+### Applying a Policy to Text
+
+To use this policy we will save it as `/opt/Phileas/policies/email-and-phone-numbers.json`. We must restart Phileas for the new policy to be available for use. To apply the policy we will pass the policy's name to Phileas when making a filter request, as shown in the example request below.
+
+```
+curl -k -X POST "https://localhost:8080/api/filter?c=context&p=email-and-phone-numbers" \
+  -d @file.txt -H "Content-Type: text/plain"
+```
+
+In this command, we have provided the parameter `p` along with a value that is the name of the policy we want to use for this request. If we had multiple policies in Phileas we could choose a different policy for this request simply by changing the name given to the parameter `p`. For more details see Phileas’s [API](filtering-api.md).
+
+Phileas will process the contents of `file.txt` by applying the policy named `email-and-phone-numbers`. As we saw in the policy above, this policy redacts email addresses and phone numbers. Phileas will return the redacted text in response to the API call.
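The same request could be prepared from Python using only the standard library. This is a sketch under stated assumptions: the helper `build_filter_request` is hypothetical, the URL and the `c` and `p` parameters are taken from the `curl` example, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_filter_request(base_url: str, policy_name: str, context: str, text: str) -> Request:
    """Build a POST request for the filter endpoint without sending it."""
    query = urlencode({"c": context, "p": policy_name})
    return Request(
        f"{base_url}/api/filter?{query}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_filter_request(
    "https://localhost:8080",
    "email-and-phone-numbers",
    "context",
    "Email me at jane@example.com.",
)
print(request.get_method(), request.full_url)
```

To send the request you would pass it to `urllib.request.urlopen`. Note that `curl -k` skips TLS certificate verification, which `urlopen` does not do by default, so a self-signed certificate would need its own handling.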
+
+To manipulate the sensitive information by methods other than redaction, see the [Filter Strategies](filter-strategies.md).