Filter Profiles

Describes filter profiles and how they work.

The types of sensitive information identified by Philter and how that information is manipulated are controlled through files called Filter Profiles. A Filter Profile is a JSON file stored under Philter’s profiles directory or in a Filter Profile Registry. (Refer to the settings for more information on how to configure the location of filter profiles.)

There are sample filter profiles available for immediate use or customization to fit your use-cases.

Each Filter Profile has a name and that name is used to tell Philter which filter profile to use during filtering. The name can be passed to Philter’s API when submitting text to Philter. This provides flexibility and allows you to process different types of documents in differing manners with a single instance of Philter.

We recommend using Philter Studio to create and modify filter profiles. Philter Studio is a Microsoft Windows application that provides a graphical interface for creating, modifying, and managing filter profiles, which is more user-friendly than editing them manually as described on this page.

Structure of a Filter Profile

A filter profile:

  • Has a name that uniquely identifies it.

  • Has a list of identifiers corresponding to sensitive information that Philter should find.

    • Each identifier has zero or more replacement strategies that tell Philter how to manipulate that type of sensitive information when found. Replacement strategies are described further down this page.

An Example Filter Profile

The following is a sample filter profile. In this sample you can see the types of sensitive information that are enabled and the strategy for manipulating each type when found. This filter profile only considers email addresses and phone numbers when looking for sensitive information.

{
  "name": "email-and-phone-numbers",
  "identifiers": {
    "emailAddress": {
      "emailAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    },
    "phoneNumber": {
      "phoneNumberFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    }
  }
}

When an email address is identified, it is replaced with the text {{{REDACTED-email-address}}}. The %t gets replaced by the type of the filter. Likewise, when a phone number is found it is replaced with the text {{{REDACTED-phone-number}}}. You are free to change the redaction formats to whatever fits your use-case.

The name of the filter profile is email-and-phone-numbers. Filter profiles can be named anything you like, but each name must be unique among all filter profiles. The filter profile should be saved in a file named after it with a .json extension, in this case email-and-phone-numbers.json.
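The naming convention above can be sketched in Python. This is only an illustration: the helper names save_profile and redact are hypothetical, and a temporary directory stands in for Philter's profiles directory. The JSON keys are the ones used in the sample profile.

```python
import json
import tempfile
from pathlib import Path


def save_profile(profile: dict, profiles_dir: Path) -> Path:
    """Write a profile to <profiles_dir>/<name>.json and return the path (hypothetical helper)."""
    path = profiles_dir / f"{profile['name']}.json"
    path.write_text(json.dumps(profile, indent=2), encoding="utf-8")
    return path


def redact(redaction_format: str = "{{{REDACTED-%t}}}") -> dict:
    """Build a single REDACT strategy entry (hypothetical helper)."""
    return {"strategy": "REDACT", "redactionFormat": redaction_format}


# Same content as the sample profile shown earlier on this page.
profile = {
    "name": "email-and-phone-numbers",
    "identifiers": {
        "emailAddress": {"emailAddressFilterStrategies": [redact()]},
        "phoneNumber": {"phoneNumberFilterStrategies": [redact()]},
    },
}

# A temporary directory stands in for Philter's profiles directory here.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_profile(profile, Path(tmp))
    saved_name = saved.name
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
```

Deriving the file name from the name field keeps the two from drifting apart when a profile is renamed.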

To use this filter profile, we will save it as /opt/philter/profiles/email-and-phone-numbers.json. We must restart Philter for the new profile to be found. Once Philter has been restarted, the filter profile can be used. To do so, we will pass the filter profile name to Philter when making a filter request, as shown below.

curl -k -X POST "https://localhost:8080/api/filter?c=context&p=email-and-phone-numbers" \
-d @file.txt -H "Content-Type: text/plain"

In this command, we have provided the parameter p along with a value that is the name of the filter profile we want to use for this request. If we had multiple filter profiles in Philter we could choose a different filter profile for this request simply by changing the name given to the parameter p. For more details see Philter’s API.
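For callers that prefer a script over curl, the same request can be assembled with Python's standard library. This is a minimal sketch, assuming the endpoint and the c and p parameters shown in the curl command; the function name build_filter_request is hypothetical, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_filter_request(text: str, profile_name: str, context: str = "context",
                         base_url: str = "https://localhost:8080") -> Request:
    """Build (but do not send) a filter request that selects a filter profile via p."""
    query = urlencode({"c": context, "p": profile_name})
    return Request(
        url=f"{base_url}/api/filter?{query}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_filter_request("Call 800-123-4567.", "email-and-phone-numbers")

# Sending would be urllib.request.urlopen(request). With a self-signed
# certificate (the case curl's -k flag covers) an ssl context would be needed.
```

Switching filter profiles for a given request only requires passing a different profile_name.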

Replacement Strategies

A replacement strategy defines how sensitive information identified by Philter should be manipulated.

In a filter profile you specify how Philter should identify and replace sensitive information. In the example above, the filter profile only identified email addresses and phone numbers. However, a filter profile can identify as many (or as few) types of sensitive information as needed. How Philter replaces each type of sensitive information is specific to that type. For instance, zip codes can be truncated based on the leading digits or zip code population.

Each type of sensitive information defined in a filter profile has at least one strategy associated with it. These strategies instruct Philter how to manipulate that type of text when it is identified. For example, in the email address example given above, the strategy is to REDACT the text per the specified redactionFormat.

Filter Strategies

The available filter strategies are described below. The first column is the name of the strategy as specified in the filter profile. The third column is the types of filters in which the strategy can be used. For example usage see the sample filter profiles.

| Strategy | Description | Filter Types |
|---|---|---|
| REDACT | Replaces the identified text with a set pattern given by redactionFormat. Redaction variables are available to customize the redaction text dynamically. See the Redaction Format Variables. | All filter types |
| RANDOM_REPLACE | Replaces the identified text with a fake value but of the same type. For example, an SSN will be replaced by a random text having the format ###-##-####, such as 123-45-6789. An email address will be replaced with a randomly generated email address. | All filter types |
| STATIC_REPLACE | Replaces the identified text with a given static value. | All filter types |
| TRUNCATE | Available only to zip codes, this strategy allows for truncating zip codes to only a select number of digits. Specify truncateDigits to set the desired number of leading digits to leave. For example, if truncateDigits is 2, the zip code 90210 will be truncated to 90***. | Only Zip Codes |
| ZERO_LEADING | Available only to zip codes, this strategy changes the first 3 digits of a zip code to be 0. For example, the zip code 90210 will be changed to 00010. | Only Zip Codes |
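The two zip code strategies can be illustrated with a short Python sketch. It mirrors only the behavior described for TRUNCATE and ZERO_LEADING and is not Philter's implementation; the function names are hypothetical.

```python
def truncate_zip(zip_code: str, truncate_digits: int) -> str:
    """Keep the leading truncate_digits digits and mask the rest with asterisks."""
    kept = zip_code[:truncate_digits]
    return kept + "*" * (len(zip_code) - len(kept))


def zero_leading(zip_code: str) -> str:
    """Replace the first three digits of the zip code with zeros."""
    return "000" + zip_code[3:]


truncated = truncate_zip("90210", 2)
zeroed = zero_leading("90210")
```

With the inputs above, truncated is 90*** and zeroed is 00010, matching the examples given for each strategy.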

Redaction Format Variables

When using the REDACT strategy, you can put variables in the redaction format that Philter will replace when performing the redaction. The available variables are:

  • %t - will be replaced with the type of sensitive information. This is to allow you to know the type of sensitive information that was identified and redacted.

  • %l - will be replaced by the given label for the type of sensitive information. This variable only applies to custom identifier types.
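The variable substitution described above amounts to a plain string replacement. The following Python sketch shows the idea; apply_redaction_format is a hypothetical name, and the code illustrates the documented behavior rather than Philter's own implementation.

```python
def apply_redaction_format(redaction_format: str, filter_type: str, label: str = "") -> str:
    """Substitute %t with the filter type and %l with the label of a custom identifier."""
    return redaction_format.replace("%t", filter_type).replace("%l", label)


email_redaction = apply_redaction_format("{{{REDACTED-%t}}}", "email-address")
phone_redaction = apply_redaction_format("{{{REDACTED-%t}}}", "phone-number")
```

Here email_redaction is {{{REDACTED-email-address}}} and phone_redaction is {{{REDACTED-phone-number}}}, the replacement texts from the example filter profile.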

Conditions

A replacement strategy can be applied only when the sensitive information meets one or more conditions. For example, the condition token == "11/05/2010" restricts replacement to the date 11/05/2010. The conditions that can be applied vary based on the type of sensitive information. For instance, zip codes can have conditions based on their population. The following is an example filter profile for credit cards that contains a condition to only redact credit card numbers that start with the digits 3000:

{
  "name": "default",
  "identifiers": {
    "creditCard": {
      "creditCardFilterStrategies": [
        {
          "condition": "token startswith \"3000\"",
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    }
  }
}
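To make the semantics of a condition concrete, here is a minimal Python sketch of how a single condition string might be evaluated. It is an assumption-laden illustration, not Philter's parser: the evaluate function is hypothetical, it handles only the three operators that appear on this page (==, startswith, and >), and it does not combine multiple conditions.

```python
import re

# field, operator, and right-hand value of a single condition
_CONDITION = re.compile(r'^\s*(\w+)\s*(==|startswith|>)\s*(.+?)\s*$')


def evaluate(condition: str, values: dict) -> bool:
    """Evaluate one condition against the values of an identified span (hypothetical helper)."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    field, operator, raw = match.groups()
    left = values[field]
    # Quoted values are strings; anything else is treated as an integer.
    if raw.startswith('"') and raw.endswith('"'):
        right = raw[1:-1]
    else:
        right = int(raw)
    if operator == "==":
        return left == right
    if operator == "startswith":
        return str(left).startswith(str(right))
    return left > right  # the only remaining operator is ">"


card_matches = evaluate('token startswith "3000"', {"token": "3000123412341234"})
small_zip = evaluate("population > 2000", {"token": "90210", "population": 1500})
```

In this sketch a strategy would be applied only when evaluate returns True, so the credit card above would be redacted while a zip code with a population of 1500 would not satisfy population > 2000.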

List of Conditions per Type of Sensitive Information

| Type of Sensitive Information | Available Conditions | Examples |
|---|---|---|
| Age | token | token == "23yrs" |
| Credit Card | token | token == "4136033768658155" |
| City | token | token == "Bridgeville" |
| County | token | token == "Aiken" |
| Date | token | token == "02/23/2003" |
| Email Address | token | token == "john.fake@hotmail.com" |
| First Name | token | token == "John" |
| Hospital | token | token == "General Hospital" |
| Hospital Abbreviation | token | token == "GH" |
| Identifier | token | token == "MT10933" |
| IP Address | token | token == "192.168.1.23" |
| Entity | token | token == "John Smith" |
| Phone Number | token | token == "800-123-4567" |
| Phone Number Extension | token | token == "x123" |
| SSN | token | token == "123-45-6789" |
| State Abbreviation | token | token == "OH" |
| State | token | token == "Ohio" |
| Surname | token | token == "Smith" |
| URL | token | token == "http://www.fakesite.com" |
| VIN | token | token == "1VWCT7A37EC163642" |
| Zip Codes | population, token | population > 2000, token == "90210" |