Scala Common Enrich: add PII Enrichment #3472

**Disclaimer: Snowplow Analytics Ltd make no claims that the use of this enrichment will ensure or help to ensure compliance with the EU's GDPR or ePrivacy regulations, and we will not be liable for any failure to comply with GDPR. Your use of this enrichment is governed by Apache License Version 2.0, January 2004.**

The PII Enrichment lets you [pseudonymize](https://en.wikipedia.org/wiki/Pseudonymization) all fields in self-describing events and contexts that might contain PII.

The configuration JSON for this enrichment contains two sub-objects:

* `pii` specifies the datapoint(s) from the Snowplow event which may represent PII
* `strategy` defines how the enrichment should handle making the PII safe

Here is an example configuration:

``` json
{
  "enabled": true,
  "parameters": {
    "pii": [
      {
        "pojo": {
          "field": "user_id"
        }
      },
      {
        "json": {
          "field": "contexts",
          "schemaCriterion": "iglu:com.acme/email_sent/jsonschema/1-*-*",
          "jsonPath": "$.emailAddress"
        }
      }
    ],
    "strategy": {
      "pseudonymize": {
        "hashFunction": "SHA-256"
      }
    }
  }
}
```

To go through each of these sections in more detail:

### `pii`

Specify an array of `pii`, namely properties in the enriched event which could represent PII. Each property is identified by its source: either `pojo` if the datapoint comes from the Snowplow enriched event POJO, or `json` if the datapoint comes from a self-describing JSON inside one of the three JSON fields.

For `pojo`, the field name must be specified. The field name will be ignored if it is not one of the following whitelisted PII fields:

* `user_id`
* `user_ipaddress`
* `user_fingerprint`
* `domain_userid`
* `network_userid`
* `ip_organization`
* `ip_domain`
* `tr_orderid`
* `ti_orderid`
* `mkt_term`
* `mkt_content`
* `se_category`
* `se_action`
* `se_label`
* `se_property`
* `mkt_clickid`
* `refr_domain_userid`
* `domain_sessionid`
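
As a rough illustration of the whitelist behaviour described above, the following Python sketch filters the `pojo` entries of a `pii` array. The enrichment itself is part of Scala Common Enrich; the names `PII_POJO_FIELDS` and `select_pojo_fields` are hypothetical and exist only for this example.

```python
# Hypothetical sketch: the set mirrors the whitelist above.
PII_POJO_FIELDS = frozenset({
    "user_id", "user_ipaddress", "user_fingerprint", "domain_userid",
    "network_userid", "ip_organization", "ip_domain", "tr_orderid",
    "ti_orderid", "mkt_term", "mkt_content", "se_category", "se_action",
    "se_label", "se_property", "mkt_clickid", "refr_domain_userid",
    "domain_sessionid",
})


def select_pojo_fields(pii_config):
    """Return the configured pojo field names that are on the whitelist."""
    selected = []
    for entry in pii_config:
        pojo = entry.get("pojo")
        if pojo is None:
            continue  # not a pojo entry (for example a json entry)
        field = pojo.get("field")
        if field in PII_POJO_FIELDS:
            selected.append(field)  # anything else is silently ignored
    return selected


config = [
    {"pojo": {"field": "user_id"}},
    {"pojo": {"field": "page_url"}},  # not whitelisted, so ignored
    {"json": {"field": "contexts"}},  # handled separately
]
print(select_pojo_fields(config))  # ['user_id']
```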

For `json`, you must specify the field name as either `unstruct_event`, `contexts` or `derived_contexts`. You must then provide two additional fields:

* `schemaCriterion` lets you specify which self-describing JSON to look for in the given JSON field. You can specify only the SchemaVer MODEL (e.g. `1-*-*`), MODEL plus REVISION (e.g. `1-1-*`), or a full MODEL-REVISION-ADDITION version (e.g. `1-1-1`)
* `jsonPath` lets you provide the [JSON Path statement](https://github.com/gatling/jsonpath#jsonpath) to navigate to the field inside the JSON that you want to pseudonymize
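
To make the `schemaCriterion` matching more concrete, here is a minimal Python sketch. It assumes the wildcard form used in the example configuration (`1-*-*`), where `*` matches any value in that position; the real matching is done by the Iglu libraries and may differ in detail. The names `IGLU_URI` and `matches_criterion` are hypothetical.

```python
import re

# vendor / name / format / MODEL-REVISION-ADDITION, each version part a number or "*"
IGLU_URI = re.compile(
    r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+|\*)-(\d+|\*)-(\d+|\*)$"
)


def matches_criterion(criterion, schema_uri):
    """Return True if schema_uri satisfies the (possibly wildcarded) criterion."""
    c = IGLU_URI.match(criterion)
    s = IGLU_URI.match(schema_uri)
    if c is None or s is None:
        return False
    # Vendor, name and format must match exactly.
    if c.groups()[:3] != s.groups()[:3]:
        return False
    # Each version part matches if the criterion has "*" or the same number.
    return all(
        cp == "*" or cp == sp
        for cp, sp in zip(c.groups()[3:], s.groups()[3:])
    )


criterion = "iglu:com.acme/email_sent/jsonschema/1-*-*"
print(matches_criterion(criterion, "iglu:com.acme/email_sent/jsonschema/1-1-1"))
print(matches_criterion(criterion, "iglu:com.acme/email_sent/jsonschema/2-0-0"))
```

Under these assumptions the first call matches (MODEL `1`) and the second does not (MODEL `2`).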

### `strategy`

The `strategy` section lets you configure precisely how the PII is handled by the enrichment.

Currently the only supported strategy is `pseudonymize`, which has one configuration option:

`hashFunction` specifies the hash to apply to the properties identified by the `pii` array. Supported values for the `hashFunction` are:

* `MD2`, the 128-bit algorithm [MD2](https://en.wikipedia.org/wiki/MD2_(cryptography)#MD2_hashes) (not recommended due to performance; see [RFC6149](https://tools.ietf.org/html/rfc6149))
* `MD5`, the 128-bit algorithm [MD5](https://en.wikipedia.org/wiki/MD5#MD5_hashes)
* `SHA-1`, the 160-bit algorithm [SHA-1](https://en.wikipedia.org/wiki/SHA-1#Example_hashes)
* `SHA-256`, the 256-bit variant of the [SHA-2](https://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functions) algorithm
* `SHA-384`, the 384-bit variant of the [SHA-2](https://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functions) algorithm
* `SHA-512`, the 512-bit variant of the [SHA-2](https://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functions) algorithm

With pseudonymization, note that the specified property in the enriched event POJO or self-describing event or context will be hashed using the `hashFunction` and then *the newly hashed value will replace (i.e. overwrite) the prior unhashed value in the POJO or JSON*.

The limitations of this approach are discussed below. 
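
The hash-and-overwrite step can be sketched in Python with the standard `hashlib` module. This is an illustration only, not the enrichment's Scala implementation: the UTF-8 encoding, the lower-case hex output and the names `HASHLIB_NAMES` and `pseudonymize` are all assumptions. `MD2` is left out because `hashlib` does not guarantee it is available.

```python
import hashlib

# Maps the hashFunction names from the configuration to hashlib names.
# MD2 is omitted here because hashlib does not guarantee it.
HASHLIB_NAMES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def pseudonymize(value, hash_function):
    """Return the hex digest of value (assumed UTF-8) under hash_function."""
    digest = hashlib.new(HASHLIB_NAMES[hash_function], value.encode("utf-8"))
    return digest.hexdigest()


# The hashed value overwrites the original in place; nothing else is kept.
event = {"user_id": "John Smith"}
event["user_id"] = pseudonymize(event["user_id"], "SHA-256")
print(len(event["user_id"]))  # 64 hex characters for a 256-bit digest
```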

### Example

Imagine an event where:

`user_id` is set to `John Smith`

The `contexts` array includes:

```json
{
  "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
  "data": {
    "subject": "Sensitive information",
    "emailAddress": "john@acme.com"
  }
}
```

Following processing by the PII Enrichment with the configuration provided above:

`user_id` would be mutated to:

`ED014A19BB67A85F9C8B1D81E04A0E7101725BE8627D79D02CA4F3BD803F33CF3B8FED53E80D2A12C0D0E426824D99D110F0919298A5055EFFF040A3FC091518`

The relevant context would become:

```json
{
  "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
  "data": {
    "subject": "Sensitive information",
    "emailAddress": "D63227AB419893C2483E7B8F5584AC49305191CAC19531E2F8F87C3F303B5F325B470AC51E307680E4B767E9DC685CBE025B1ADC4EA8A986EFD20BFD7E4B55E9"
  }
}
```
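
The transformation above can be approximated with the following Python sketch, which applies SHA-256 (as in the example configuration) to `user_id` and to `emailAddress` in matching contexts. It is a simplification under stated assumptions: the event is modelled as a plain dict, the schema match is a simple prefix check on MODEL `1`, and only the single path `$.emailAddress` is handled rather than general JSON Path. The function names are hypothetical. Note that a SHA-256 hex digest is 64 characters long; the 128-character values shown above have the length of a 512-bit digest such as SHA-512.

```python
import hashlib


def sha256_hex(value):
    """Hex SHA-256 digest of a string, assuming UTF-8 encoding."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def pseudonymize_event(event):
    """Hash user_id and emailAddress in place, mirroring the example config."""
    if isinstance(event.get("user_id"), str):
        event["user_id"] = sha256_hex(event["user_id"])
    for context in event.get("contexts", []):
        # Simplified stand-in for the 1-*-* schemaCriterion.
        if context["schema"].startswith("iglu:com.acme/email_sent/jsonschema/1-"):
            data = context["data"]
            if isinstance(data.get("emailAddress"), str):
                data["emailAddress"] = sha256_hex(data["emailAddress"])
    return event


event = {
    "user_id": "John Smith",
    "contexts": [
        {
            "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
            "data": {
                "subject": "Sensitive information",
                "emailAddress": "john@acme.com",
            },
        }
    ],
}
result = pseudonymize_event(event)
print(result["contexts"][0]["data"]["subject"])  # Sensitive information
```

Only the two configured properties change; `subject` is left untouched because it is not listed in the `pii` array.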

### Limitations

**In support of compliance with GDPR and ePrivacy, we strongly recommend that you familiarize yourself with the following limitations of the enrichment. This is a non-exhaustive list of limitations.**

#### Only supports strings

Because the enrichment *mutates* each property's value in place, replacing it with a hash string, it only works if the property's value is already typed as a string. If the value is not a string, it will be ignored by the enrichment.
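
A minimal Python sketch of this strings-only behaviour, assuming SHA-256 and UTF-8 encoding; `hash_if_string` is a hypothetical helper, not part of the enrichment:

```python
import hashlib


def hash_if_string(value):
    """Hash string values; return anything else unchanged."""
    if not isinstance(value, str):
        return value  # numbers, booleans, nulls, objects and arrays are ignored
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(hash_if_string(42))    # 42
print(hash_if_string(None))  # None
print(len(hash_if_string("john@acme.com")))  # 64
```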

#### Can cause downstream JSON Schema validation to fail

Remember that this enrichment:

1. Only supports hashing, not [format-preserving encryption](https://en.wikipedia.org/wiki/Format-preserving_encryption), and 
2. Mutates each property's value in place

Therefore, it is possible for the updated value to cause downstream validation, such as that performed by the RDB Loader, to fail. This will typically be because the length or format of the hashed value conflicts with that of the original value.
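
As an illustration of how such a conflict can arise, the Python sketch below checks a hashed value against a hypothetical schema property that constrains an email address to 64 characters. The property definition and the helper `fits_max_length` are assumptions made for this example; real validation would be performed by a JSON Schema validator.

```python
import hashlib

# Hypothetical property definition from a self-describing JSON schema.
email_property = {"type": "string", "format": "email", "maxLength": 64}


def fits_max_length(value, prop):
    """Return True if value respects the property's maxLength, if any."""
    max_length = prop.get("maxLength")
    return max_length is None or len(value) <= max_length


original = "john@acme.com"
hashed = hashlib.sha512(original.encode("utf-8")).hexdigest()

print(fits_max_length(original, email_property))  # True
print(fits_max_length(hashed, email_property))    # False: 128 characters
print("@" in hashed)                              # False: no longer email-shaped
```

Here the SHA-512 digest is both too long for `maxLength` and no longer resembles an email address, so either constraint could cause the event to be rejected downstream.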

#### Is lossy

The properties processed by this enrichment are hashed, not encrypted, and are mutated in place. The original value is therefore not recoverable without re-processing the raw collector logs.
