Scala Common Enrich: add PII Enrichment #3472

Closed
@alexanderdean

Description

Disclaimer: Snowplow Analytics Ltd make no claims that the use of this enrichment will ensure or help to ensure compliance with the EU's GDPR or ePrivacy regulations, and we will not be liable for any failure to comply with GDPR. Your use of this enrichment is governed by Apache License Version 2.0, January 2004.

The PII Enrichment lets you pseudonymize all fields in self-describing events and contexts that might contain PII.

The configuration JSON for this enrichment contains two sub-objects:

  • pii specifies the datapoint(s) from the Snowplow event which may represent PII
  • strategy defines how the enrichment should handle making the PII safe

Here is an example configuration:

{
  "enabled": true,
  "parameters": {
    "pii": [
      {
        "pojo": {
          "field": "user_id"
        }
      },
      {
        "json": {
          "field": "contexts",
          "schemaCriterion": "iglu:com.acme/email_sent/jsonschema/1-*-*",
          "jsonPath": "$.emailAddress"
        }
      }
    ],
    "strategy": {
      "pseudonymize": {
        "hashFunction": "SHA-256"
      }
    }
  }
}

To go through each of these sections in more detail:

pii

Specify an array of pii, namely properties in the enriched event which could represent PII. Each property is identified by its source: either pojo if the datapoint comes from the Snowplow enriched event POJO, or json if the datapoint comes from a self-describing JSON inside one of the three JSON fields.

For pojo, the field name must be specified. The field name will be ignored if it is not one of the following whitelisted PII fields:

  • user_id
  • user_ipaddress
  • user_fingerprint
  • domain_userid
  • network_userid
  • ip_organization
  • ip_domain
  • tr_orderid
  • ti_orderid
  • mkt_term
  • mkt_content
  • se_category
  • se_action
  • se_label
  • se_property
  • mkt_clickid
  • refr_domain_userid
  • domain_sessionid

For json, you must specify the field name as either unstruct_event, contexts or derived_contexts. You must then provide two additional fields:

  • schemaCriterion lets you specify which self-describing JSON to look for within the given JSON field. You can specify only the SchemaVer MODEL (e.g. 1-*-*), MODEL plus REVISION (e.g. 1-1-*) or a full MODEL-REVISION-ADDITION version (e.g. 1-1-1)
  • jsonPath lets you provide the JSON Path statement to navigate to the field inside the JSON that you want to pseudonymize
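
To make the wildcard behaviour of schemaCriterion concrete, here is a minimal Python sketch of matching a criterion against a schema URI. It is a simplified approximation written for this example, not the matching logic used by Iglu or the enrichment; the regular expression and the function name `matches_criterion` are assumptions.

```python
import re

# Assumed shape of an Iglu URI: iglu:vendor/name/format/MODEL-REVISION-ADDITION,
# where each version part is a number or, in a criterion, a "*" wildcard.
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)/"
    r"(?P<model>\d+|\*)-(?P<revision>\d+|\*)-(?P<addition>\d+|\*)$"
)


def matches_criterion(criterion, schema_uri):
    """Return True if schema_uri satisfies the (possibly wildcarded) criterion."""
    c = IGLU_URI.match(criterion)
    s = IGLU_URI.match(schema_uri)
    if c is None or s is None:
        return False
    # Vendor, name and format must match exactly.
    for part in ("vendor", "name", "format"):
        if c.group(part) != s.group(part):
            return False
    # Version parts match if the criterion has a wildcard or the same number.
    for part in ("model", "revision", "addition"):
        if c.group(part) != "*" and c.group(part) != s.group(part):
            return False
    return True


schema = "iglu:com.acme/email_sent/jsonschema/1-1-1"
print(matches_criterion("iglu:com.acme/email_sent/jsonschema/1-*-*", schema))  # True
print(matches_criterion("iglu:com.acme/email_sent/jsonschema/1-1-*", schema))  # True
print(matches_criterion("iglu:com.acme/email_sent/jsonschema/2-*-*", schema))  # False
```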

strategy

The strategy section lets you configure precisely how the PII is handled by the enrichment.

Currently the only supported strategy is pseudonymize, which has one configuration option:

hashFunction specifies the hash to apply to the properties identified by the pii array. Supported values for the hashFunction are:

  • MD2, the 128-bit algorithm MD2 (not recommended due to performance; see RFC 6149)
  • MD5, the 128-bit algorithm MD5
  • SHA-1, the 160-bit algorithm SHA-1
  • SHA-256, 256-bit variant of the SHA-2 algorithm
  • SHA-384, 384-bit variant of the SHA-2 algorithm
  • SHA-512, 512-bit variant of the SHA-2 algorithm

With pseudonymization, note that the specified property in the enriched event POJO or self-describing event or context will be hashed using the hashFunction and then the newly hashed value will replace (i.e. overwrite) the prior unhashed value in the POJO or JSON.
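
The hash-and-overwrite behaviour described above can be sketched in Python with the standard hashlib module. This is an illustration only: the mapping table, the `pseudonymize` function, the UTF-8 encoding and the lowercase hex output are all assumptions made for this example, and MD2 is left out because hashlib does not guarantee it.

```python
import hashlib

# Maps the hashFunction names listed above to hashlib algorithm names.
# MD2 is omitted: it is not among hashlib's guaranteed algorithms.
HASHLIB_NAMES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def pseudonymize(value, hash_function):
    """Return the hex digest of a string value (UTF-8 encoding assumed)."""
    digest = hashlib.new(HASHLIB_NAMES[hash_function], value.encode("utf-8"))
    return digest.hexdigest()


# The hashed value overwrites the original in place.
event = {"user_id": "John Smith"}
event["user_id"] = pseudonymize(event["user_id"], "SHA-256")
print(len(event["user_id"]))  # 64 hex characters for SHA-256
```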

The limitations of this approach are discussed below.

Example

Imagine an event where:

user_id is set to John Smith

The contexts array includes:

{
  "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
  "data": {
    "subject": "Sensitive information",
    "emailAddress": "john@acme.com"
  }
}

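Following processing by the PII Enrichment with the configuration provided above (note that the values shown below are 128 hex characters long, which is the length of a SHA-512 digest; the SHA-256 setting in that configuration would produce 64 hex characters):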

user_id would be mutated to:

ED014A19BB67A85F9C8B1D81E04A0E7101725BE8627D79D02CA4F3BD803F33CF3B8FED53E80D2A12C0D0E426824D99D110F0919298A5055EFFF040A3FC091518

The relevant context would become:

{
  "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
  "data": {
    "subject": "Sensitive information",
    "emailAddress": "D63227AB419893C2483E7B8F5584AC49305191CAC19531E2F8F87C3F303B5F325B470AC51E307680E4B767E9DC685CBE025B1ADC4EA8A986EFD20BFD7E4B55E9"
  }
}
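
The example above can be approximated end to end with a short Python sketch. It makes several assumptions: the event is modelled as an already-parsed dict rather than the real enriched event POJO, the JSON Path support is limited to simple dotted paths such as `$.emailAddress`, the schemaCriterion check is reduced to a prefix test, and the helper names are hypothetical. It uses SHA-256 as in the example configuration, so the resulting values are 64 hex characters long.

```python
import hashlib


def sha256_hex(value):
    """Hex SHA-256 digest of a string (UTF-8 encoding assumed)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_at_path(data, json_path):
    """Hash, in place, the string found at a simple `$.a.b` style path.

    Only dotted paths are handled here; real JSON Path is much richer.
    """
    if not json_path.startswith("$."):
        raise ValueError("only simple '$.field' paths are supported")
    keys = json_path[2:].split(".")
    parent = data
    for key in keys[:-1]:
        parent = parent.get(key)
        if not isinstance(parent, dict):
            return  # path does not exist, nothing to do
    leaf = keys[-1]
    if isinstance(parent.get(leaf), str):
        parent[leaf] = sha256_hex(parent[leaf])


event = {
    "user_id": "John Smith",
    "contexts": [
        {
            "schema": "iglu:com.acme/email_sent/jsonschema/1-1-1",
            "data": {
                "subject": "Sensitive information",
                "emailAddress": "john@acme.com",
            },
        }
    ],
}

# pojo entry: hash the user_id field in place.
event["user_id"] = sha256_hex(event["user_id"])

# json entry: hash $.emailAddress in contexts matching model 1 of email_sent.
for context in event["contexts"]:
    # Prefix test used as a stand-in for matching the criterion 1-*-*.
    if context["schema"].startswith("iglu:com.acme/email_sent/jsonschema/1-"):
        hash_at_path(context["data"], "$.emailAddress")

print(event["contexts"][0]["data"]["subject"])  # Sensitive information
print(len(event["contexts"][0]["data"]["emailAddress"]))  # 64
```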

Limitations

In support of compliance with GDPR and ePrivacy, we strongly recommend that you familiarize yourself with the following limitations of the enrichment. This is a non-exhaustive list of limitations.

Only supports strings

Because the enrichment mutates each property's value in place, replacing it with a hash string, it only works if the property's value is already typed as a string. If the value is not a string, it will be ignored by the enrichment.
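
A minimal Python sketch of this strings-only rule, assuming SHA-256 and a hypothetical helper name; non-string values are returned unchanged:

```python
import hashlib


def pseudonymize_if_string(value):
    """Hash string values; leave every other type untouched."""
    if isinstance(value, str):
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return value


print(pseudonymize_if_string(12345))  # 12345 (a number, so ignored)
print(len(pseudonymize_if_string("12345")))  # 64 (a string, so hashed)
```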

Can cause downstream JSON Schema validation to fail

Remember that this enrichment:

  1. Only supports hashing, not format-preserving encryption, and
  2. Mutates each property's value in place

Therefore, it is possible for the updated value to cause downstream validation, such as that performed by the RDB Loader, to fail. This will typically be because the length or format of the hashed value conflicts with that of the original value.
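
As an illustration of how such a conflict can arise, the Python sketch below checks a value against a hypothetical schema fragment with a maxLength of 32. The schema fragment and the `fits_schema` helper are invented for this example and only mimic a length check; they are not how the RDB Loader validates data.

```python
import hashlib

# Hypothetical JSON Schema fragment for the emailAddress property.
email_schema = {"type": "string", "maxLength": 32}


def fits_schema(value, schema):
    """Simplified stand-in for validation: string type and maxLength only."""
    return isinstance(value, str) and len(value) <= schema["maxLength"]


original = "john@acme.com"
hashed = hashlib.sha256(original.encode("utf-8")).hexdigest()

print(fits_schema(original, email_schema))  # True
print(fits_schema(hashed, email_schema))    # False: the digest is 64 characters
print("@" in hashed)                        # False: no longer looks like an email
```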

Is lossy

The properties processed by this enrichment are hashed, not encrypted, and are mutated in place. The original value is therefore not recoverable without re-processing the raw collector logs.
