Custom dictionaries provide the simple but powerful ability to match a list of words or phrases. You can use a custom dictionary as a detector or as an exception list for built-in detectors. You can also use custom dictionaries to augment built-in infoType detectors to match additional findings.
This section describes how to create a regular custom dictionary detector from a list of words.
Anatomy of a dictionary custom infoType detector
As summarized in API overview, to create a dictionary custom infoType detector, you define a CustomInfoType object that contains the following:
- The name you want to give the custom infoType detector, within an InfoType object.
- An optional Likelihood value. If you omit this field, matches to the dictionary items will return a default likelihood of VERY_LIKELY.
- Optional DetectionRule objects, or hotword rules. These rules adjust the likelihood of findings within a given proximity of specified hotwords. Learn more about hotword rules in Customizing match likelihood.
- An optional SensitivityScore value. If you omit this field, matches to the dictionary items will return a default sensitivity level of HIGH. Sensitivity scores are used in data profiles. When profiling your data, Sensitive Data Protection uses the sensitivity scores of the infoTypes to calculate the sensitivity level.
- A Dictionary, as either a WordList containing a list of words to scan for or a CloudStoragePath to a single text file containing a newline-delimited list of words to scan for.
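The components listed above can also be assembled programmatically. The following Python sketch builds the request fragment as a plain dict using the JSON field names that appear in this topic. The helper name `build_dictionary_info_type` and its parameters are hypothetical and are not part of any client library.

```python
import json


def build_dictionary_info_type(name, words=None, cloud_storage_path=None,
                               likelihood=None, hotword_rules=None):
    """Return a dict shaped like one entry of "customInfoTypes".

    Exactly one of `words` (inline word list) or `cloud_storage_path`
    (a gs:// path to a newline-delimited text file) must be given.
    """
    if (words is None) == (cloud_storage_path is None):
        raise ValueError("Provide exactly one of words or cloud_storage_path")

    if words is not None:
        dictionary = {"wordList": {"words": list(words)}}
    else:
        dictionary = {"cloudStoragePath": {"path": cloud_storage_path}}

    detector = {"infoType": {"name": name}, "dictionary": dictionary}

    # Optional components are added only when the caller supplies them,
    # so the service-side defaults apply otherwise.
    if likelihood is not None:
        detector["likelihood"] = likelihood
    if hotword_rules:
        detector["detectionRules"] = [{"hotwordRule": r} for r in hotword_rules]
    return detector


room_detector = build_dictionary_info_type(
    "CUSTOM_ROOM_ID", words=["RM-GREEN", "RM-YELLOW", "RM-ORANGE"]
)
print(json.dumps({"customInfoTypes": [room_detector]}, indent=2))
```

Leaving the optional fields out of the dict, rather than sending empty values, keeps the request minimal and lets the documented defaults take effect.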
As a JSON object, a dictionary custom infoType detector that includes all optional components looks like the following. This JSON includes a path to a dictionary text file stored in Cloud Storage. To see an inline word list, see the Examples section, later in this topic.
{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"CUSTOM_INFOTYPE_NAME"
      },
      "likelihood":"LIKELIHOOD_LEVEL",
      "detectionRules":[
        {
          "hotwordRule":{
            HOTWORD_RULE
          }
        },
        ...
      ],
      "sensitivityScore":{
        "score": "SENSITIVITY_SCORE"
      },
      "dictionary": {
        "cloudStoragePath": {
          "path": "gs://PATH_TO_TXT_FILE"
        }
      }
    }
  ],
  ...
}
Dictionary matching specifics
Following is guidance about how Sensitive Data Protection matches dictionary words and phrases. These points apply to both regular and large custom dictionaries:
- Dictionary words are case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
- All characters (in dictionaries or in content to be scanned) other than letters, digits, and other alphabetic characters contained within the Unicode Basic Multilingual Plane are treated as whitespace when scanning for matches. If your dictionary scans for Abby Abernathy, it will match on "abby abernathy", "Abby, Abernathy", "Abby (ABERNATHY)", and so on.
- The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary scans for Abi, it will match the first three characters of Abi904, but not of Abigail.
- Dictionary words containing characters in the Supplementary Multilingual Plane of the Unicode standard can yield unexpected findings. Examples of such characters are emojis, scientific symbols, and historical scripts.
Letters, digits, and other alphabetic characters are defined as follows:
- Letters: characters with general categories Lu, Ll, Lt, Lm, or Lo in the Unicode specification
- Digits: characters with general category Nd in the Unicode specification
- Other alphabetic characters: characters with general category Nl in the Unicode specification or with contributory property Other_Alphabetic as defined by the Unicode Standard
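To make the matching rules above concrete, the following Python sketch approximates them locally. It is an illustration only, not the service's implementation: the function names are hypothetical, the standard library does not expose the Other_Alphabetic property (so only category Nl is handled as "other alphabetic"), and treating characters outside the Basic Multilingual Plane as separators is an assumption.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def char_class(ch):
    """Classify a character as 'letter', 'digit', 'other', or None (separator)."""
    if ord(ch) > 0xFFFF:
        return None  # assumption: non-BMP characters act as separators here
    category = unicodedata.category(ch)
    if category in LETTER_CATEGORIES:
        return "letter"
    if category == "Nd":
        return "digit"
    if category == "Nl":
        return "other"  # Other_Alphabetic is not available in unicodedata
    return None


def normalize(text):
    """Lowercase, turn every separator into a space, and collapse runs of spaces."""
    chars = [ch if char_class(ch) else " " for ch in text.lower()]
    return " ".join("".join(chars).split())


def dictionary_matches(word, text):
    """Return True if `word` matches somewhere in `text` under the rules above."""
    needle, haystack = normalize(word), normalize(text)
    if not needle:
        return False
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        before = haystack[start - 1] if start > 0 else " "
        after = haystack[end] if end < len(haystack) else " "
        # Boundary rule: neighbours must differ in type from the match edges.
        if (char_class(before) != char_class(needle[0])
                and char_class(after) != char_class(needle[-1])):
            return True
        start = haystack.find(needle, start + 1)
    return False


for word, text in [("Abby Abernathy", "Abby (ABERNATHY)"),
                   ("Abi", "Abi904"),
                   ("Abi", "Abigail")]:
    print(word, "|", text, "|", dictionary_matches(word, text))
```

In this sketch, normalizing both the dictionary entry and the scanned content with the same function is what makes case and punctuation differences irrelevant, while the boundary check is what rejects Abi inside Abigail.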
Examples
Simple word list
Suppose you have data that includes what hospital room a patient was treated in during a visit. These locations may be considered sensitive in a particular data set, but they are not something that would be picked up by Sensitive Data Protection's built-in detectors.
The rooms were listed as:
- "RM-Orange"
- "RM-Yellow"
- "RM-Green"
To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.
To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
REST
The following example JSON defines a custom dictionary that you could use to de-identify custom room numbers.
JSON input:
POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:deidentify?key={YOUR_API_KEY}

{
  "item":{
    "value":"Patient was seen in RM-YELLOW then transferred to rm green."
  },
  "deidentifyConfig":{
    "infoTypeTransformations":{
      "transformations":[
        {
          "primitiveTransformation":{
            "replaceWithInfoTypeConfig":{
            }
          }
        }
      ]
    }
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"CUSTOM_ROOM_ID"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "RM-GREEN",
              "RM-YELLOW",
              "RM-ORANGE"
            ]
          }
        }
      }
    ]
  }
}
JSON output:
When we POST the JSON input to content:deidentify, it returns the following JSON response:
{
  "item":{
    "value":"Patient was seen in [CUSTOM_ROOM_ID] then transferred to [CUSTOM_ROOM_ID]."
  },
  "overview":{
    "transformedBytes":"17",
    "transformationSummaries":[
      {
        "infoType":{
          "name":"CUSTOM_ROOM_ID"
        },
        "transformation":{
          "replaceWithInfoTypeConfig":{
          }
        },
        "results":[
          {
            "count":"2",
            "code":"SUCCESS"
          }
        ],
        "transformedBytes":"17"
      }
    ]
  }
}
Sensitive Data Protection has correctly identified the room numbers specified in the custom dictionary's WordList message. Note that items are matched even when the case differs and the hyphen (-) is missing, as in the second match, "rm green".
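As a rough illustration of sending the same request from code, the following Python sketch builds the POST request with only the standard library. The function name and its parameters are hypothetical, the URL and API-key query parameter simply mirror the REST example above, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_deidentify_request(project_id, api_key, text, info_type_name, words):
    """Build (but do not send) a content:deidentify request with an inline word list."""
    body = {
        "item": {"value": text},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    # Replace each finding with its infoType name in brackets.
                    {"primitiveTransformation": {"replaceWithInfoTypeConfig": {}}}
                ]
            }
        },
        "inspectConfig": {
            "customInfoTypes": [
                {
                    "infoType": {"name": info_type_name},
                    "dictionary": {"wordList": {"words": list(words)}},
                }
            ]
        },
    }
    url = (
        "https://dlp.googleapis.com/v2/projects/"
        f"{project_id}/content:deidentify?key={api_key}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_deidentify_request(
    "PROJECT_ID",
    "YOUR_API_KEY",
    "Patient was seen in RM-YELLOW then transferred to rm green.",
    "CUSTOM_ROOM_ID",
    ["RM-GREEN", "RM-YELLOW", "RM-ORANGE"],
)
# To actually call the service, pass `request` to urllib.request.urlopen
# after substituting a real project ID and credentials.
print(request.get_method(), request.full_url)
```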
Exception list
Suppose you have log data that includes customer identifiers such as email addresses, and you want to redact this information. However, these logs also include the email addresses of internal developers, and you don't want to redact those.
REST
The following JSON example creates a custom dictionary that lists a subset of email addresses within the WordList message (jack@example.org and jill@example.org), and assigns them the custom infoType name DEVELOPER_EMAIL. This JSON instructs Sensitive Data Protection to ignore the specified email addresses, while replacing any other email addresses it detects with a string that corresponds to its infoType (in this case, EMAIL_ADDRESS):
JSON input:
POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:deidentify?key={YOUR_API_KEY}

{
  "item":{
    "value":"jack@example.org accessed customer record of user5@example.com"
  },
  "deidentifyConfig":{
    "infoTypeTransformations":{
      "transformations":[
        {
          "primitiveTransformation":{
            "replaceWithInfoTypeConfig":{
            }
          },
          "infoTypes":[
            {
              "name":"EMAIL_ADDRESS"
            }
          ]
        }
      ]
    }
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"DEVELOPER_EMAIL"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "jack@example.org",
              "jill@example.org"
            ]
          }
        }
      }
    ],
    "infoTypes":[
      {
        "name":"EMAIL_ADDRESS"
      }
    ],
    "ruleSet": [
      {
        "infoTypes": [
          {
            "name": "EMAIL_ADDRESS"
          }
        ],
        "rules": [
          {
            "exclusionRule": {
              "excludeInfoTypes": {
                "infoTypes": [
                  {
                    "name": "DEVELOPER_EMAIL"
                  }
                ]
              },
              "matchingType": "MATCHING_TYPE_FULL_MATCH"
            }
          }
        ]
      }
    ]
  }
}
JSON output:
When we POST this JSON to content:deidentify, it returns the following JSON response:
{
  "item":{
    "value":"jack@example.org accessed customer record of [EMAIL_ADDRESS]"
  },
  "overview":{
    "transformedBytes":"17",
    "transformationSummaries":[
      {
        "infoType":{
          "name":"EMAIL_ADDRESS"
        },
        "transformation":{
          "replaceWithInfoTypeConfig":{
          }
        },
        "results":[
          {
            "count":"1",
            "code":"SUCCESS"
          }
        ],
        "transformedBytes":"17"
      }
    ]
  }
}
The output has correctly identified user5@example.com as matched by the EMAIL_ADDRESS infoType detector and jack@example.org as matched by the DEVELOPER_EMAIL custom infoType detector. Because the transformation applies only to EMAIL_ADDRESS, jack@example.org was left intact.
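The exception-list pattern generalizes to any built-in detector paired with a dictionary of values to exclude. The following Python sketch builds such an inspectConfig as a dict, reusing the field names from the REST example above. The helper name `build_exception_inspect_config` is hypothetical.

```python
import json


def build_exception_inspect_config(builtin_name, exception_name, exception_words):
    """Return an inspectConfig that excludes dictionary entries from a built-in detector."""
    return {
        # The built-in detector whose findings should be reported.
        "infoTypes": [{"name": builtin_name}],
        # The custom dictionary holding the values to leave alone.
        "customInfoTypes": [
            {
                "infoType": {"name": exception_name},
                "dictionary": {"wordList": {"words": list(exception_words)}},
            }
        ],
        # Drop built-in findings that fully match a dictionary finding.
        "ruleSet": [
            {
                "infoTypes": [{"name": builtin_name}],
                "rules": [
                    {
                        "exclusionRule": {
                            "excludeInfoTypes": {
                                "infoTypes": [{"name": exception_name}]
                            },
                            "matchingType": "MATCHING_TYPE_FULL_MATCH",
                        }
                    }
                ],
            }
        ],
    }


config = build_exception_inspect_config(
    "EMAIL_ADDRESS", "DEVELOPER_EMAIL", ["jack@example.org", "jill@example.org"]
)
print(json.dumps(config, indent=2))
```

Serializing with `json.dumps` rather than writing the JSON by hand guarantees that the separators between sibling fields such as "infoTypes" and "ruleSet" are well-formed.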
Augment a built-in infoType detector
Consider a scenario in which a built-in infoType detector isn't returning the correct values. For example, you want to return matches on person names, but Sensitive Data Protection's built-in PERSON_NAME detector is failing to return matches on some person names that are common in your dataset.
Sensitive Data Protection lets you augment a built-in infoType detector by declaring a custom infoType detector that uses the built-in detector's name, as shown in the following example. This snippet configures Sensitive Data Protection so that the PERSON_NAME built-in infoType detector additionally matches the name "Quasimodo":
REST
...
"inspectConfig":{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"PERSON_NAME"
      },
      "dictionary":{
        "wordList":{
          "words":[
            "quasimodo"
          ]
        }
      }
    }
  ]
}
...
What's next
Learn about large custom dictionaries.