Upgrade Guide
🚧 We are no longer doing umbrella Snowplow releases and are instead releasing each component on its own. You can find upgrade instructions in that respective component's repo.
This page lists the upgrade steps for each Snowplow umbrella release, with the latest release at the top. Upgrade sequentially, that is, from one release to the next one.
You can also use the Snowplow Version Matrix as a guide to the internal component dependencies for a particular release.
For easier navigation, follow the links below.
- Snowplow 119 Tycho Magnetic Anomaly Two (r119) 2020-04-30
- Snowplow 118 Morgantina (r118) 2019-12-24
- Snowplow 117 Biskupin (r117) 2019-12-03
- Snowplow 116 Madara Rider (r116) 2019-09-12
- Snowplow 115 Sigiriya (r115) 2019-07-17
- Snowplow 114 Polonnaruwa (r114) 2019-05-16
- Snowplow 113 Filitosa (r113) 2019-02-27
- Snowplow 112 Baalbek (r112) 2019-02-20
- Snowplow 111 Selinunte (r111) 2018-10-01
- Snowplow 110 Valle dei Templi (r110) 2018-09-07
- Snowplow 109 Lambaesis (r109) 2018-08-21
- Snowplow 108 Val Camonica (r108) 2018-07-24
- Snowplow 107 Trypillia (r107) 2018-07-18
- Snowplow 106 Acropolis (r106) 2018-06-01
- Snowplow 105 Pompeii (r105) 2018-05-07
- Snowplow 104 Stoplesteinan (r104) 2018-04-30
- Snowplow 103 Paestum (r103) 2018-04-17
- Snowplow 102 Afontova Gora (r102) 2018-04-03
- Snowplow 101 Neapolis (r101) 2018-03-21
- Snowplow 100 Epidaurus (r100) 2018-02-26
- Snowplow 99 Carnac (r99) 2018-01-25
- Snowplow 98 Argentomagus (r98) 2018-01-05
- Snowplow 97 Knossos (r97) 2017-12-18
- Snowplow 96 Zeugma (r96) 2017-11-21
- Snowplow 95 Ellora (r95) 2017-11-13
- Snowplow 94 Hill of Tara (r94) 2017-10-10
- Snowplow 93 Virunum (r93) 2017-10-03
- Snowplow 92 Maiden Castle (r92) 2017-09-11
- Snowplow 91 Stonehenge (r91) 2017-08-17
- Snowplow 90 Lascaux (r90) 2017-07-26
- Snowplow 89 Plain of Jars (r89) 2017-06-12
- Snowplow 88 Angkor Wat (r88) 2017-04-27
- Snowplow 87 Chichen Itza (r87) 2017-02-21
- Snowplow 86 Petra (r86) 2016-12-20
- Snowplow 85 Metamorphosis (r85) 2016-11-15
- Snowplow 84 Steller's Sea Eagle (r84) 2016-10-07
- Snowplow 83 Bald Eagle (r83) 2016-09-06
- Snowplow 82 Tawny Eagle (r82) 2016-08-08
- Snowplow 81 Kangaroo Island Emu (r81) 2016-06-16
- Snowplow 80 Southern Cassowary (r80) 2016-05-30
- Snowplow 79 Black Swan (r79) 2016-05-12
- Snowplow 78 Great Hornbill (r78) 2016-03-15
- Snowplow 77 Great Auk (r77) 2016-02-29
- Snowplow 76 Changeable Hawk-Eagle (r76) 2016-01-26
- Snowplow 75 Long-Legged Buzzard (r75) 2016-01-02
- Snowplow 74 European Honey Buzzard (r74) 2015-12-22
- Snowplow 73 Cuban Macaw (r73) 2015-12-04
- Snowplow 72 Great Spotted Kiwi (r72) 2015-10-15
- Snowplow 71 Stork-Billed Kingfisher (r71) 2015-10-02
- Snowplow 70 Bornean Green Magpie (r70) 2015-08-19
- Snowplow 69 Blue-Bellied Roller (r69) 2015-07-24
- Snowplow 68 Turquoise Jay (r68) 2015-07-23
- Snowplow 67 Bohemian Waxwing (r67) 2015-07-13
- Snowplow 66 Oriental Skylark (r66) 2015-06-16
- Snowplow 65 Scarlet Rosefinch (r65) 2015-05-08
- Snowplow 64 Palila (r64) 2015-04-16
- Snowplow 63 Red-Cheeked Cordon-Bleu (r63) 2015-04-02
- Snowplow 62 Tropical Parula (r62) 2015-03-17
- Snowplow 61 Pygmy Parrot (r61) 2015-03-02
- Snowplow 60 Bee Hummingbird (r60) 2015-02-03
Docker images available on Docker Hub:
An example configuration can be found here.
The configuration of the referer parser enrichment needs to be updated to become:
{
"schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/2-0-0",
"data": {
"vendor": "com.snowplowanalytics.snowplow",
"name": "referer_parser",
"enabled": true,
"parameters": {
"database": "referers-latest.json",
"internalDomains": [
"www.subdomain1.snowplowanalytics.com"
],
"uri": "https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/"
}
}
}
For GCP the URI to use is gs://sp-referer-parser/third-party/referer-parser.
Docker images available on Docker Hub:
- snowplow/scala-stream-collector-kinesis:1.0.1
- snowplow/scala-stream-collector-pubsub:1.0.1
- snowplow/scala-stream-collector-kafka:1.0.1
- snowplow/scala-stream-collector-nsq:1.0.1
- snowplow/scala-stream-collector-stdout:1.0.1
An example configuration can be found here.
When using it with PubSub, the configuration lines
sink {
enabled = googlepubsub
need to become
sink {
enabled = google-pub-sub
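If the collector configuration is maintained by a script, the rename can be applied mechanically. The following is a minimal sketch and not part of Snowplow: the function name `rename_pubsub_sink` is hypothetical, and it assumes the setting sits on its own line as in the snippets above. It rewrites every matching `enabled` line, so it is meant to be run on the collector configuration only.

```python
import re

# Matches `enabled = googlepubsub`, optionally quoted and optionally followed
# by a `#` or `//` comment. HOCON accepts both `=` and `:` as separators.
_OLD_SINK = re.compile(
    r'^([ \t]*enabled[ \t]*[=:][ \t]*)"?googlepubsub"?([ \t]*(?:(?:#|//).*)?)$',
    re.MULTILINE,
)


def rename_pubsub_sink(hocon_text: str) -> str:
    """Replace the old PubSub sink name with the new one; other lines are untouched."""
    return _OLD_SINK.sub(r"\g<1>google-pub-sub\g<2>", hocon_text)


if __name__ == "__main__":
    print(rename_pubsub_sink("sink {\n  enabled = googlepubsub\n}\n"))
```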
Available on Docker Hub.
Available on our Bintray.
The config doesn't change.
Available on Docker Hub.
An example configuration can be found here.
This version brings the possibility to partition the events on S3:
- by date with dateFormat
- by type with partitionedBucket in case of Self-Describing JSONs
For instance, with partitioning, a SchemaViolations bad row could be written to this path on S3: s3://bad-row-bucket/partitioned/com.snowplowanalytics.snowplow.badrows.schema_violations/.
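To make the example path easier to read, here is a rough sketch of how such a prefix could be derived from an Iglu schema key. This is an assumption based only on the example above (vendor and name joined by a dot); the helper name `partition_prefix` and the schema version used in the usage line are hypothetical, and the loader's actual logic may differ.

```python
def partition_prefix(bucket_root: str, schema_key: str) -> str:
    """Build a per-type prefix from an Iglu schema key.

    Assumed layout (inferred from the example path, not from the loader's
    source code): <bucket_root>/<vendor>.<name>/
    """
    if not schema_key.startswith("iglu:"):
        raise ValueError(f"not an Iglu schema key: {schema_key!r}")
    # An Iglu key has the shape iglu:<vendor>/<name>/<format>/<version>
    vendor, name, _format, _version = schema_key[len("iglu:"):].split("/")
    return f"{bucket_root.rstrip('/')}/{vendor}.{name}/"


if __name__ == "__main__":
    print(partition_prefix(
        "s3://bad-row-bucket/partitioned",
        # The version below is a placeholder for illustration.
        "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/1-0-0",
    ))
```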
Available on our Bintray as a Docker image or zipped.
As with the S3 loader, it is possible to partition data by type in the case of Self-Describing JSONs, with the --partitionedOutputDirectory= option.
Full list of parameters here.
Available on our Bintray as a Docker image or zipped.
An example configuration can be found here.
The only change is that the parameter clusterType has been renamed to documentType.
The release candidate of Beam Enrich can be found on Docker Hub with tag 1.0.0.
The release candidate of Stream Enrich can be found on Docker Hub with tag 1.0.0.
The release candidate of Scala Stream Collector can be found on Docker Hub with tag 1.0.0.
Although there has been a lot of refactoring in this release, the configuration is almost unchanged.
For the enrichments, only the config of the referer parser enrichment needs to be updated to become:
{
"schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/2-0-0",
"data": {
"vendor": "com.snowplowanalytics.snowplow",
"name": "referer_parser",
"enabled": true,
"parameters": {
"database": "referers-latest.json",
"internalDomains": [
"www.subdomain1.snowplowanalytics.com"
],
"uri": "https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/"
}
}
}
For GCP the URI to use is gs://sp-referer-parser/third-party/referer-parser.
The environment variable AWS_CBOR_DISABLE or the Java option -Dcom.amazonaws.sdk.disableCbor now needs to be set when running the Docker image of the collector.
Following an upgrade of the library used to parse the configuration of Scala Stream Collector, when using it with PubSub, the configuration lines
sink {
enabled = googlepubsub
need to become
sink {
enabled = google-pub-sub
In order to insert the new format of bad rows into ElasticSearch, this line needs to be set to plain-json (waiting for ES loader update).
A new version of the Scala Stream Collector can be found on our Docker Hub repository under the 0.17.0 tag.
For example, to start up an SSL-enabled, auto-upgrade server, the collector configuration should contain:
ssl {
enable = true
redirect = true
port = 443
}
However, this configuration will use the environment-defined, JVM-attached certificates. To override the default behaviour and use a custom certificate, the following low-level section (part of the Akka configuration) can be defined:
ssl-config {
keyManager = {
stores = [
{type = "PKCS12", classpath = false, path = ${CERT_FILE}, password = "pass" }
]
}
}
A new version of Snowplow Common Enrich can be found on the Maven repository.
The schema for the configuration of the enrichment has been updated to version 1-0-1:
{
"schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-1",
"data": {
"name": "anon_ip",
"vendor": "com.snowplowanalytics.snowplow",
"enabled": true,
"parameters": {
"anonOctets": 1,
"anonSegments": 1
}
}
}
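As an illustration of what the two parameters control, the sketch below masks the trailing parts of an address. It assumes that `anonOctets` counts IPv4 octets and `anonSegments` counts IPv6 segments, both from the right, and that masked parts are replaced with `x`; the function name and the mask character are assumptions, not taken from the enrichment's source code.

```python
import ipaddress


def anonymize_ip(ip: str, anon_octets: int = 1, anon_segments: int = 1, mask: str = "x") -> str:
    """Mask the last N octets (IPv4) or segments (IPv6) of an address."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        parts, sep, count = str(addr).split("."), ".", anon_octets
    else:
        # Use the exploded form so that "::" shorthand does not hide segments.
        parts, sep, count = addr.exploded.split(":"), ":", anon_segments
    count = max(0, min(count, len(parts)))
    keep = len(parts) - count
    return sep.join(parts[:keep] + [mask] * count)


if __name__ == "__main__":
    print(anonymize_ip("192.168.1.42"))
    print(anonymize_ip("2001:db8::1", anon_segments=2))
```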
A new version of Snowplow Common Enrich can be found on the Maven repository.
The schema for the configuration of the enrichment has been updated to version 1-0-1:
{
"schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-1",
"data": {
"name": "event_fingerprint_config",
"vendor": "com.snowplowanalytics.snowplow",
"enabled": true,
"parameters": {
"excludeParameters": ["cv", "eid", "nuid", "stm"],
"hashAlgorithm": "SHA1"
}
}
}
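To show the role of `excludeParameters` and `hashAlgorithm`, here is a minimal sketch of a fingerprint computed over the remaining fields. The field ordering, the separator byte and the function name are assumptions made for illustration; the enrichment's actual algorithm may differ.

```python
import hashlib


def event_fingerprint(params, exclude_parameters=("cv", "eid", "nuid", "stm"),
                      hash_algorithm="SHA1"):
    """Hash all key/value pairs except the excluded ones, in sorted key order."""
    digest = hashlib.new(hash_algorithm.lower())
    excluded = set(exclude_parameters)
    for key in sorted(params):
        if key in excluded:
            continue
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(key.encode("utf-8") + b"\x00")
        digest.update(str(params[key]).encode("utf-8") + b"\x00")
    return digest.hexdigest()


if __name__ == "__main__":
    first = {"e": "pv", "url": "https://example.com", "eid": "1", "stm": "100"}
    resent = {"e": "pv", "url": "https://example.com", "eid": "2", "stm": "200"}
    print(event_fingerprint(first) == event_fingerprint(resent))
```

Because `eid` and `stm` are excluded, two payloads that differ only in those fields produce the same fingerprint, which is what makes the value usable for de-duplication.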
A new version of EmrEtlRunner can be found on our Bintray repository under the r117-biskupin version.
In order to enable spot instances, add a core_instance_bid setting to your config.yml file. This setting specifies a bid for an hour of EC2 spot instance in USD.
aws:
  emr:
    jobflow:
      core_instance_bid: 0.3
A new version of Beam Enrich can be found on our Docker Hub repository under the 0.4.0 tag.
It contains the newest Snowplow Common Enrich.
A new version of Stream Enrich can be found on our Docker Hub repository under the 0.22.0 tag.
It contains the newest Snowplow Common Enrich.
A new version of Spark Enrich can be used by setting it in your EmrEtlRunner configuration:
enrich:
  version:
    spark_enrich: 1.19.0
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.19.0.jar.
It contains the newest Snowplow Common Enrich.
This release focuses on adding new features to the Scala Stream Collector, including the ability to set first-party cookies server-side on multiple domains and to use custom path mappings.
It also includes an update to EmrEtlRunner, to add support for shredded data in tsv format.
A new version of the Scala Stream Collector can be found on our Bintray.
You can also find the images on Docker Hub:
To make use of the new features, you'll need to update your configuration as follows:
- Add a collector.paths section if you want to provide custom path mappings:
paths {
"/com.acme/track" = "/com.snowplowanalytics.snowplow/tp2" # for tracker protocol 2 requests
"/com.acme/redirect" = "/r/tp2" # for redirect requests
"/com.acme/iglu" = "/com.snowplowanalytics.iglu/v1" # for Iglu webhook requests
}
- In collector.cookie there is no longer a domain setting. Instead, you can provide a list of domains to be used and/or a fallbackDomain in case none of the origin domains matches the ones you specified:
domains = [
"acme.com"
"acme.net"
]
fallbackDomain = "roadrunner.com" # no leading dot
If you don't wish to use multiple domains and want to preserve the previous behaviour, leave domains empty and specify a fallbackDomain with the same value as collector.cookie.domain from your previous configuration (but leave out any leading dots).
Both domains and fallbackDomain are optional settings, just like domain is an optional setting in earlier versions.
- Another addition to collector.cookie is a set of controls for extra directives to be passed in the Set-Cookie response header.
secure = false # set to true if you want to enforce secure connections
httpOnly = false # set to true if you want to make the cookie inaccessible to non-HTTP requests
sameSite = "None" # or `Lax`, or `Strict`. This is an optional parameter.
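The migration from the old `domain` setting to `fallbackDomain`, and the way a cookie domain could be chosen from `domains`, can be sketched as follows. Both function names are hypothetical, and the matching rule (exact host or subdomain of a configured domain) is an assumption for illustration rather than the collector's documented behaviour.

```python
from typing import Iterable, Optional


def fallback_from_legacy(domain: Optional[str]) -> Optional[str]:
    """Turn an old collector.cookie.domain value into a fallbackDomain (no leading dot)."""
    return domain.lstrip(".") if domain else None


def pick_cookie_domain(origin_hosts: Iterable[str], domains: Iterable[str],
                       fallback_domain: Optional[str] = None) -> Optional[str]:
    """Return the first configured domain matching an origin host, else the fallback."""
    configured = list(domains)
    for host in origin_hosts:
        for domain in configured:
            # Assumed rule: the host is the domain itself or one of its subdomains.
            if host == domain or host.endswith("." + domain):
                return domain
    return fallback_domain


if __name__ == "__main__":
    print(fallback_from_legacy(".acme.com"))
    print(pick_cookie_domain(["shop.acme.net"], ["acme.com", "acme.net"], "roadrunner.com"))
```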
This release includes 2 updates for EmrEtlRunner, one bug fix and one to improve its reliability.
It also includes an update to Event Manifest Populator, so that it can read the files containing the events produced by stream-enrich.
The latest version of EmrEtlRunner is available on our Bintray here.
This release includes a number of new features and updates, most of which live in Scala Common Enrich. Mainly, a new user agent enrichment has been added, as well as the possibility to use a remote adapter.
If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray:
- as a ZIP archive
- as a Docker image
If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.
Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:
enrich:
  version:
    spark_enrich: 1.18.0 # WAS 1.17.0
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar
A new version of EmrEtlRunner is also available in our Bintray.
This enrichment is based on in-memory HashMaps and requires roughly 400 MB of RAM (see here).
To use the new YAUAA enrichment, add yauaa_enrichment_config.json to the folder with configuration files for enrichments, with the following content:
{
"schema": "iglu:com.snowplowanalytics.snowplow.enrichments/yauaa_enrichment_config/jsonschema/1-0-0",
"data": {
"enabled": true,
"vendor": "com.snowplowanalytics.snowplow.enrichments",
"name": "yauaa_enrichment_config"
}
}
More information about this enrichment can be found on the dedicated wiki page.
This release focuses on improvements to the Scala Stream Collector as well as new features for Scala Common Enrich, such as HubSpot webhook integration and POST support in the API request enrichment.
A new version of the Scala Stream Collector incorporating the changes discussed above can be found on our Bintray.
To make use of this new version, you’ll need to amend your configuration in the following ways:
- Add a collector.cors section to specify the Access-Control-Max-Age duration:
cors {
accessControlMaxAge = 5 seconds # -1 seconds disables the cache
}
- Add a collector.prometheusMetrics section:
prometheusMetrics {
enabled = false
durationBuckets = [0.1, 3, 10] # optional buckets by which to group by the `http_request_duration_seconds` metric
}
- Modify the collector.doNotTrackCookie section if you want to make use of a regex:
doNotTrackCookie {
enabled = true
name = cookie-name
value = ".+cookie-value.+"
}
- Add the optional collector.streams.sink.producerConf if you want to specify additional Kafka producer configuration:
producerConf {
acks = all
}
This also holds true for Stream Enrich: enrich.streams.sourceSink.{producerConf, consumerConf}.
A full example configuration can be found in the repository.
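The regex form of collector.doNotTrackCookie.value shown above can be tried out before deploying it. The sketch below assumes that the pattern has to match the whole cookie value (as `re.fullmatch` does); whether the collector uses a full or a partial match is an assumption here, and the function name is hypothetical.

```python
import re


def is_do_not_track(cookies: dict, name: str, value_pattern: str) -> bool:
    """Return True when the named cookie exists and its value matches the pattern in full."""
    value = cookies.get(name)
    return value is not None and re.fullmatch(value_pattern, value) is not None


if __name__ == "__main__":
    pattern = ".+cookie-value.+"
    print(is_do_not_track({"cookie-name": "abc-cookie-value-xyz"}, "cookie-name", pattern))
    print(is_do_not_track({"cookie-name": "cookie-value"}, "cookie-name", pattern))
```

Note that with `.+cookie-value.+` the bare value `cookie-value` does not match, because each `.+` requires at least one character on either side.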
If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray:
- as a ZIP archive
- as a Docker image
If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.
Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:
enrich:
  version:
    spark_enrich: 1.17.0 # WAS 1.16.0
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.17.0.jar
A new version of EmrEtlRunner is also available in our Bintray.
This release focuses on reliability improvements for the batch pipeline. It also introduces support for persistent EMR clusters.
The latest version of the EmrEtlRunner is available from our Bintray here.
A new setting is needed to enable or disable compaction of the output of the shred job:
aws:
  s3:
    consolidate_shredded_output: false
If you're not making use of any enrichment and contexts, you'll need to disable this setting.
For a complete example, see our sample config.yml
template.
The new Clojure Collector is stored in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.3-standalone.war
.
This small release adds CORS-related headers to POST requests as a follow-up of R110 which added them to OPTIONS requests.
The new Clojure Collector is stored in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.2-standalone.war
.
This release brings a new enrichment platform for Google Cloud Platform: Beam Enrich as well as a couple of bugfixes.
Beam Enrich is the latest enrichment platform released by Snowplow, it runs on Google Cloud Dataflow.
To know more, check out the following resources:
- https://github.com/snowplow/snowplow/wiki/Beam-Enrich
- https://github.com/snowplow/snowplow/wiki/setting-up-beam-enrich
- https://github.com/snowplow/snowplow/tree/master/3-enrich/beam-enrich
The new version of Stream Enrich can be found in our Bintray here.
It incorporates a fix for users of the PII enrichment.
The new Clojure Collector is stored in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.1-standalone.war
.
It incorporates a fix for CORS requests.
This release brings the possibility to enable end-to-end encryption for the batch pipeline as well as a way to specify the cookie path for the Clojure Collector.
If you want to leverage the monthly-updated database of useragent regexes we host on S3, you'll need to update your enrichment configuration to the following:
{
  "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-1", # Was 1-0-0
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "ua_parser_config",
    "enabled": true,
    "parameters": {
      "database": "regexes.yaml", # New
      "uri": "s3://snowplow-hosted-assets/third-party/ua-parser/" # New
    }
  }
}
Note that this change is not mandatory.
If you are a real-time pipeline user, a version of Stream Enrich can be found on our Bintray here.
If you are a batch pipeline user, you'll need to either update your EmrEtlRunner configuration to the following:
enrich:
  version:
    spark_enrich: 1.16.0 # WAS 1.15.0
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.16.0.jar
.
The latest version of the Scala Stream Collector is available from our Bintray here.
collector {
crossDomain {
enabled = true
domains = [ "*"] # WAS domain and not an array
secure = true
}
doNotTrackCookie { # New section
enabled = false
name = cookie-name
value = cookie-value
}
rootResponse { # New section
enabled = false
statusCode = 200
body = "ok"
}
}
For a complete example, see our sample config.hocon
template.
The latest version of the EmrEtlRunner is available from our Bintray here.
We encourage people to change their S3 buckets to use the s3a scheme, because using the s3a protocol doesn't generate empty files:
aws:
  s3:
    raw:
      in:
        - "s3a://in-bucket"
      processing: "s3a://processing-bucket"
      archive: "s3a://archive-bucket/raw"
    enriched:
      good: "s3a://enriched-bucket/good"
      bad: "s3a://enriched-bucket/bad"
      errors: "s3a://enriched-bucket/errors"
      archive: "s3a://archive-bucket/enriched"
    shredded:
      good: "s3a://shredded-bucket/good"
      bad: "s3a://shredded-bucket/bad"
      errors: "s3a://shredded-bucket/errors"
      archive: "s3a://archive-bucket/shredded"
For a complete example, see our sample config.yml
template.
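If the bucket settings are available as a parsed structure (for example after loading config.yml with a YAML library), the scheme change can be applied in one pass. This is a minimal sketch with a hypothetical helper name; it rewrites every `s3://` string it finds, so it is best applied only to the subtree holding the buckets listed above.

```python
def to_s3a(node):
    """Return a copy of a nested dict/list structure with s3:// URIs rewritten to s3a://."""
    if isinstance(node, dict):
        return {key: to_s3a(value) for key, value in node.items()}
    if isinstance(node, list):
        return [to_s3a(value) for value in node]
    if isinstance(node, str) and node.startswith("s3://"):
        return "s3a://" + node[len("s3://"):]
    return node  # numbers, booleans, None and other strings are left as they are


if __name__ == "__main__":
    buckets = {
        "raw": {"in": ["s3://in-bucket"], "processing": "s3://processing-bucket"},
        "enriched": {"good": "s3://enriched-bucket/good"},
    }
    print(to_s3a(buckets))
```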
This release brings the possibility to enable end-to-end encryption for the batch pipeline as well as a way to specify the cookie path for the Clojure Collector.
The latest version of the EmrEtlRunner is available from our Bintray here.
This release brings the possibility to interact with SSE-S3 (AES 256 managed by S3) encrypted buckets through aws:s3:buckets:encrypted.
Additionally, you can now specify an EMR security configuration, which lets you configure local disk encryption as well as in-transit encryption, through aws:emr:security_configuration:
aws:
  s3:
    buckets:
      encrypted: false # Can be true or false depending on whether you interact with SSE-S3 encrypted buckets
  emr:
    security_configuration: name-of-the-security-configuration # Leave blank if you don't use a security configuration
monitoring:
  snowplow:
    port: 8080 # New and optional
    protocol: http # New and optional
For a complete example, see our sample config.yml
template.
For more background on end-to-end encryption for the batch pipeline, you can refer to our dedicated wiki page.
The new Clojure Collector is stored in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.0-standalone.war
.
By default, the cookie path will now be /. However, it can be customized by adding an SP_PATH environment property to your Elastic Beanstalk application.
This release introduces the IAB Spiders & Robots enrichment for detecting bots and spiders, as well as new Marketo and Vero webhook adapters and fixes to the Google Analytics enrichment.
If you are a streaming pipeline user, a version of Stream Enrich incorporating the new IAB enrichment can be found on our Bintray here.
If you are a batch pipeline user, you'll need to either update your EmrEtlRunner configuration to the following:
enrich:
  version:
    spark_enrich: 1.15.0 # WAS 1.14.0
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.15.0.jar
.
This release adds further capabilities to the PII Pseudonymization Enrichment to both stream and batch enrich. Specifically, it adds the capability to emit a stream of events which contain the original along with the modified value. The PII transformation event also contains information about the field and the parent event (the event whence this PII event originated).
To upgrade, update your EmrEtlRunner configuration to the following:
enrich:
  version:
    spark_enrich: 1.14.0 # WAS 1.13.0
The latest version of Stream Enrich is available from our Bintray here.
The following configuration is needed to enable the pii stream:
enrich {
streams {
in {...} # NO CHANGE
out {
enriched = my-enriched-output-event-without-pii # NO CHANGE
bad = my-events-that-failed-validation-during-enrichment # NO CHANGE
pii = my-output-event-that-contains-only-pii # NEW FIELD
partitionKey = "" # NO CHANGE
}
sourceSink {...} # NO CHANGE
buffer {...} # NO CHANGE
appName = "some-name" # NO CHANGE
}
}
In addition you need to configure the enrichment to emit events and also use a salt in hashing:
{
"schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/2-0-0", # NEW VERSION
"data": {
"vendor": "com.snowplowanalytics.snowplow.enrichments", # NO CHANGE
"name": "pii_enrichment_config", # NO CHANGE
"emitEvent": true, # NEW FIELD
"enabled": true, # NO CHANGE
"parameters": {
"pii": [...], # NO CHANGE
"strategy": {
"pseudonymize": {
"hashFunction": "SHA-1", # NO CHANGE
"salt": "pepper123" # NEW FIELD
}
}
}
}
}
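To illustrate why the salt matters, the sketch below hashes a value together with a salt using the configured hash function. How the enrichment actually combines the value and the salt is not stated here, so the simple concatenation, like the function name, is an assumption for illustration only.

```python
import hashlib


def pseudonymize(value: str, salt: str, hash_function: str = "SHA-1") -> str:
    """Hash value + salt with the named function and return the hex digest."""
    algorithm = hash_function.replace("-", "").lower()  # "SHA-1" -> "sha1"
    return hashlib.new(algorithm, (value + salt).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(pseudonymize("jane@example.com", "pepper123"))
    # A different salt yields an unrelated digest for the same input value.
    print(pseudonymize("jane@example.com", "another-salt"))
```

With a salt in place, precomputed tables of unsalted hashes of common values (emails, IP addresses) can no longer be used to reverse the pseudonymized fields.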
This release focuses on solving an issue with the real-time pipeline which may result in duplicate events if you're using Kinesis.
More information is available in issue #3745 and the dedicated Discourse post.
A version of Stream Enrich incorporating a fix can be found on our Bintray here.
This release most notably fixes bugs in the EmrEtlRunner Stream Enrich mode introduced in R102. More information is available in issues #3717 and #3722.
The latest version of the EmrEtlRunner is available from our Bintray here.
This release upgrades the IP lookups enrichment.
Whether you are using the batch or streaming pipeline, it is important to perform this upgrade if you make use of the IP lookups enrichment.
To make use of the new enrichment, you will need to update your ip_lookups.json so that it conforms to the new 2-0-0 schema.
An example is provided in the GitHub repository.
If you are a streaming pipeline user, a version of Stream Enrich incorporating the upgraded ip lookups enrichment can be found on our Bintray here.
If you are a batch pipeline user, you'll need to either update your EmrEtlRunner configuration to the following:
enrich:
  version:
    spark_enrich: 1.13.0 # WAS 1.12.0
or directly make use of the new Spark Enrich available at:
s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.13.0.jar
.
The new Clojure Collector is stored in S3 at:
s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.0.0-standalone.war
.
By default, the /crossdomain.xml route is disabled. It has to be manually re-enabled by adding the following two environment properties to your Elastic Beanstalk application:
- SP_CDP_DOMAIN: the domain that is granted access; *.acme.com will match both http://acme.com and http://sub.acme.com.
- SP_CDP_SECURE: a boolean indicating whether to grant access only to HTTPS sources or to both HTTPS and HTTP sources
This release brings stability improvements and a new "Stream Enrich" mode to EmrEtlRunner.
The latest version of the EmrEtlRunner is available from our Bintray here.
To turn this mode on, you need to add a new aws.s3.buckets.enriched.stream property to your config.yml file.
aws:
  s3:
    buckets:
      enriched:
        stream: s3://path-to-kinesis/output/
For a complete example, we now have a dedicated sample stream_config.yml template - this shows what you need to set, and what you can remove.
This release brings initial support for Google Cloud Platform to the realtime pipeline.
The latest version of the Scala Stream Collector is available from our Bintray here.
collector {
# Became non-optional
crossDomain {
enabled = true # NEW
domain = "*"
secure = true
}
}
For a complete example, see our sample config.hocon
template.
This release splits the JARs according to their targeted platform. As a result, you'll need to run one of the following depending on your needs:
java -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kinesis-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kafka-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-nsq-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-stdout-0.13.0.jar --config config.hocon
The latest version of Stream Enrich is available from our Bintray here.
enrich {
streams {
in { ... } # UNCHANGED
out { ... } # UNCHANGED
sourceSink { # NEW SECTION
enabled = kinesis
region = eu-west-1
aws {
accessKey = iam
secretKey = iam
}
maxRecords = 10000
initialPosition = TRIM_HORIZON
backoffPolicy {
minBackoff = 50
maxBackoff = 1000
}
}
buffer { ... } # UNCHANGED
appName = "" # UNCHANGED
}
monitoring { ... } # UNCHANGED
}
For a complete example, see our sample config.hocon
template.
- R101 Blog Post
- R101 Release Notes
- Getting started on GCP guide
- Setting up the Scala Stream Collector on GCP guide
- Setting up Stream Enrich on GCP guide
This release lets you pseudonymize PII fields in your streaming pipeline.
The latest version of Stream Enrich is available from our Bintray here.
If you are using Redshift as a storage target, it is important to update the atomic.events table using a migration script, so that the new fields will fit.
This release lets you seamlessly integrate Google Analytics events into your Snowplow batch pipeline.
The Snowplow Google Analytics plugin lets you tee your Google Analytics payloads directly to a Snowplow collector to be further processed.
Check out the setup guide to know more.
To benefit from the Google Analytics integration you'll need Spark Enrich 1.12.0 or higher:
enrich:
  version:
    spark_enrich: 1.12.0 # WAS 1.11.0
For a complete example, see our sample config.yml
template.
This release brings support for the webhooks introduced in Release 97 to the realtime pipeline as well as some nifty features to the Scala Stream Collector.
The latest version of the Scala Stream Collector is available from our Bintray here.
collector {
# Optional cross domain policy configuration.
# To disable, remove the "crossDomain" configuration and the collector will respond with a 404 to
# the /crossdomain.xml route.
crossDomain { # NEW
domain = "*"
secure = true
}
cookie {
# ...
# Optionally, specify the name of the header containing the originating protocol for use in the
# bounce redirect location. Use this if behind a load balancer that performs SSL termination.
# The value of this header must be http or https. Example, if behind an AWS Classic ELB.
forwardedProtocolHeader = "X-Forwarded-Proto" # NEW
}
# When enabled, the redirect url passed via the `u` query parameter is scanned for a placeholder
# token. All instances of that token are replaced with the network ID. If the placeholder isn't
# specified, the default value is `${SP_NUID}`.
redirectMacro { # NEW
enabled = false
placeholder = "[TOKEN]"
}
}
For a complete example, see our sample config.hocon
template.
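The effect of the redirectMacro setting described in the configuration comments above can be sketched as a plain string substitution. The function name is hypothetical, and whether the collector URL-encodes the network ID before inserting it is not covered by this sketch.

```python
def expand_redirect(url: str, network_user_id: str,
                    placeholder: str = "${SP_NUID}", enabled: bool = True) -> str:
    """Replace every occurrence of the placeholder token in the redirect URL."""
    if not enabled:
        return url  # macro disabled: the redirect URL is used unchanged
    return url.replace(placeholder, network_user_id)


if __name__ == "__main__":
    target = "https://example.com/landing?uid=[TOKEN]&again=[TOKEN]"
    print(expand_redirect(target, "abc-123", placeholder="[TOKEN]"))
```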
The latest version of Stream Enrich is available from our Bintray here.
This release brings 4 new webhook adapters (Mailgun, StatusGator, Unbounce, Olark) to Snowplow. Follow the corresponding webhook set-up guide in Setting up a webhook.
The latest version of the EmrEtlRunner is available from our Bintray here.
enrich:
  version:
    spark_enrich: 1.11.0 # WAS 1.10.0
For a complete example, see our sample config.yml
template.
This release brings NSQ support to the Scala Stream Collector and Stream Enrich.
The latest version of the Scala Stream Collector is available from our Bintray here.
collector {
#sink = kinesis # REMOVED
streams {
sink { # ADDED
enabled = kinesis # or kafka or nsq
# only the corresponding config is needed (e.g. kinesis or kafka config)
}
}
}
For a complete example, see our sample config.hocon
template.
The latest version of Stream Enrich is available from our Bintray here.
This release introduces ZSTD encoding to the Redshift model and updates the Spark components to 2.2.0, which is included in AMI 5.9.0.
The latest version of the EmrEtlRunner is available from our Bintray here.
This release updates the Spark Enrich and RDB Shredder jobs to Spark 2.2.0. As a result, an AMI bump is warranted. RDB Loader has been updated too:
aws:
  # ...
  emr:
    ami_version: 5.9.0 # WAS 5.5.0
    # ...
enrich:
  version:
    spark_enrich: 1.10.0 # WAS 1.9.0
storage:
  versions:
    rdb_loader: 0.14.0 # WAS 0.13.0
    rdb_shredder: 0.13.0 # WAS 0.12.0
For a complete example, see our sample config.yml
template.
Unlocking ZSTD compression relies on updating the atomic.events table through a migration script.
This script assumes that you're currently on version 0.8.0 of the atomic.events table. If you're upgrading from an earlier version, please refer to the appropriate migration script to get to version 0.8.0.
If you rely on an SSH tunnel to connect the RDB Loader to your Redshift cluster, you'll need to update your Redshift storage target to 2-1-0. Refer to the schema to incorporate a properly formatted sshTunnel field.
We've set up a mirror of Iglu central on Google Cloud Platform to maintain high availability in case of S3 outages. To benefit from this mirror, you'll need to add the following repository to your Iglu resolver JSON file:
{
  "name": "Iglu Central - Mirror 01",
  "priority": 1,
  "vendorPrefixes": [ "com.snowplowanalytics" ],
  "connection": {
    "http": {
      "uri": "http://mirror01.iglucentral.com"
    }
  }
}
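Adding the mirror to an existing resolver file can also be scripted. The sketch below assumes the usual resolver layout, with the repository list under `data.repositories`; the function name is hypothetical, and the resolver passed in the usage example is a stripped-down placeholder.

```python
import copy
import json

MIRROR = {
    "name": "Iglu Central - Mirror 01",
    "priority": 1,
    "vendorPrefixes": ["com.snowplowanalytics"],
    "connection": {"http": {"uri": "http://mirror01.iglucentral.com"}},
}


def add_repository(resolver: dict, repository: dict) -> dict:
    """Return a copy of the resolver with the repository appended, unless its name is already listed."""
    updated = copy.deepcopy(resolver)
    # Assumed layout: {"schema": ..., "data": {"repositories": [...]}}
    repositories = updated.setdefault("data", {}).setdefault("repositories", [])
    if all(existing.get("name") != repository["name"] for existing in repositories):
        repositories.append(copy.deepcopy(repository))
    return updated


if __name__ == "__main__":
    resolver = {"data": {"repositories": [{"name": "Iglu Central"}]}}
    print(json.dumps(add_repository(resolver, MIRROR), indent=2))
```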
This release fixes an issue in Stream Enrich introduced in R93.
The latest version of Stream Enrich is available from our Bintray here.
This release refreshes the streaming Snowplow pipeline: the Scala Stream Collector and Stream Enrich.
The latest version of the Scala Stream Collector is available from our Bintray here.
collector {
cookieBounce { # NEW
enabled = false
name = "n3pc"
fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
}
sink = kinesis # WAS sink.enabled
streams { # REORGANIZED
good = good-stream
bad = bad-stream
kinesis {
// ...
}
kafka {
// ...
retries = 0 # NEW
}
}
}
akka {
http.server { # WAS spray.can.server
// ...
}
}
For a complete example, see our sample config.hocon
template.
The Scala Stream Collector is no longer an executable jar. As a result, it will have to be launched as:
java -jar snowplow-stream-collector-0.10.0.jar --config config.hocon
The latest version of Stream Enrich is available from our Bintray here.
enrich {
// ...
streams {
// ...
out {
// ...
partitionKey = user_ipaddress # NEW
}
kinesis { # REORGANIZED
// ...
initialTimestamp = "2017-05-17T10:00:00Z" # NEW but optional
backoffPolicy { # MOVED
// ...
}
}
kafka {
// ...
retries = 0 # NEW
}
}
}
For a complete example, see our sample config.hocon
template.
Stream Enrich is no longer an executable jar. As a result, it will have to be launched as:
java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json
Additionally, a new --force-ip-lookups-download flag has been introduced in order to force the download of the IP lookups database when the application starts.
This release most notably solves a bug which occurred if one were to skip the shred step, more information is available in issue #3403 and the dedicated Discourse post.
The latest version of the EmrEtlRunner is available from our Bintray here.
In order to update RDB Loader, you need to make the following change to your configuration YAML:
storage:
  versions:
    rdb_loader: 0.13.0 # WAS 0.12.0
For a complete example, see our sample config.yml template.
This release revolves around making EmrEtlRunner, the component launching the EMR steps for the batch pipeline, significantly more robust. Most notably, this release fixes a long-standing bug in the way the staging step was performed, which affected all users of the Clojure Collector (issue #3085).
The latest version of the EmrEtlRunner is available from our Bintray here.
Make sure to use the run command when launching EmrEtlRunner, for example:
./snowplow-emr-etl-runner run \
-c config.yml \
-r resolver.json
Additionally, it is advised to set up a local (through a file) or distributed (through Consul) lock:
./snowplow-emr-etl-runner run \
-c config.yml \
-r resolver.json \
--lock path/to/lock \
--consul http://127.0.0.1:8500 # Optional address to your Consul server
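The point of the lock is to stop two runs from overlapping. The following is only an illustration of the file-based idea in Python, not EmrEtlRunner's actual implementation; the function names and the lock file name are hypothetical. The lock file is created atomically, so a second run that finds it already present can refuse to start:

```python
import os
import tempfile


def acquire_lock(path):
    """Try to create the lock file atomically; return True if we got the lock."""
    try:
        # O_EXCL makes the creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(path):
    """Remove the lock file so that the next run can start."""
    os.remove(path)


def demo():
    """Show that a second run is refused while the first one holds the lock."""
    lock_path = os.path.join(tempfile.mkdtemp(), "emr-etl-runner.lock")
    first = acquire_lock(lock_path)
    second = acquire_lock(lock_path)  # simulates an overlapping run
    release_lock(lock_path)
    third = acquire_lock(lock_path)   # the lock is free again
    release_lock(lock_path)
    return first, second, third
```

A file lock like this only protects runs started on the same machine, which is presumably why a distributed option through Consul is offered as well.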
This release introduces RDB Loader, a new EMR-run application replacing our StorageLoader, as proposed in our Splitting EmrEtlRunner RFC. This release also brings various enhancements and alterations in EmrEtlRunner.
The latest version of the EmrEtlRunner is available from our Bintray here.
In order to use RDB Loader, you need to make the following addition to your configuration YAML:
storage:
  versions:
    rdb_loader: 0.12.0 # NEW
The following settings no longer make sense, as Postgres loading now also happens on the EMR node, and can therefore be deleted:
storage:
  download: # REMOVE
    folder: # REMOVE
To configure your EMR applications in more detail, you can add the optional emr.configuration property:
emr:
  configuration: # NEW
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "true"
For a complete example, see our sample config.yml template.
EmrEtlRunner now accepts a new --include option with a single possible vacuum argument, which will be passed to RDB Loader.
Also, --skip now accepts new rdb_load, archive_enriched and analyze arguments. Skipping rdb_load and archive_enriched steps is identical to running R89 EmrEtlRunner without StorageLoader.
Finally, note that the StorageLoader is no longer part of the batch pipeline apps archive.
As RDB Loader is now an EMR step, we wanted to make sure that users' AWS credentials are not exposed anywhere. To load Redshift we're using IAM Roles, which allow Redshift to load data from S3.
To create an IAM Role, go to AWS Console » IAM » Roles » Create new role. Then choose Amazon Redshift » AmazonS3ReadOnlyAccess and pick a role name, for example "RedshiftLoadRole". Once created, copy the Role ARN as you will need it in the next section.
Now you need to attach the new role to your running Redshift cluster. Go to AWS Console » Redshift » Clusters » Manage IAM Roles » Attach the role you have just created.
Your EMR cluster’s master node will need to be whitelisted in Redshift in order to perform the load.
If you are using an "EC2 Classic" environment, from the Redshift UI you will need to create a Cluster Security Group and add the relevant EC2 Security Group, most likely called ElasticMapReduce-master. Make sure to enable this Cluster Security Group against your Redshift cluster.
If you are using a modern VPC-based environment, you will need to modify the Redshift cluster, and add a VPC security group, most likely called ElasticMapReduce-Master-Private.
In both cases, you only need to whitelist access from the EMR master node, because RDB Loader runs exclusively from the master node.
We have updated the Redshift storage target config - the new version requires the Role ARN that you noted down above:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0", // WAS 1-0-0
  "data": {
    "name": "AWS Redshift enriched events storage",
    ...
    "roleArn": "arn:aws:iam::719197435995:role/RedshiftLoadRole", // NEW
    ...
  }
}
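Before launching a run, it can help to check that a target file really carries both changes. The sketch below is a hypothetical pre-flight check in Python, not part of Snowplow; it assumes the target JSON has already been parsed into a dict (without the // comments shown above) and that the role lives in the standard "aws" partition:

```python
import re

# Assumption: standard partition and a 12-digit account id.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")
EXPECTED_SCHEMA = (
    "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0"
)


def check_redshift_target(target):
    """Return a list of problems found in a parsed Redshift target config."""
    problems = []
    if target.get("schema") != EXPECTED_SCHEMA:
        problems.append("schema is not redshift_config 2-0-0")
    role_arn = target.get("data", {}).get("roleArn")
    if role_arn is None:
        problems.append("data.roleArn is missing")
    elif not ROLE_ARN_PATTERN.match(role_arn):
        problems.append("data.roleArn does not look like an IAM role ARN")
    return problems
```

An empty list means both the schema version and the role ARN look plausible; it does not prove that the role is actually attached to the cluster.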
This release ports the batch pipeline from Twitter Scalding to Apache Spark.
The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
- Update ami_version to 5.5.0
- Move job_name to aws -> emr -> jobflow
- Remove hadoop_shred from enrich -> versions
- Add rdb_shredder to a newly created storage -> versions
- Move hadoop_elasticsearch to storage -> versions
- Replace hadoop_enrich by spark_enrich
aws:
  emr:
    ami_version: 5.5.0 # WAS 4.5.0
    . . .
    jobflow:
      job_name: Snowplow ETL # MOVED FROM enrich:
enrich:
  versions:
    spark_enrich: 1.9.0 # WAS 1.8.0
storage:
  versions:
    rdb_shredder: 0.12.0 # WAS 0.11.0
    hadoop_elasticsearch: 0.1.0 # UNCHANGED BUT MOVED
For a complete example, see our sample config.yml template.
Note that using the Spark artifacts is incompatible with instance types having only one virtual CPU, such as m1.medium.
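The configuration changes for this release can be sketched as a small transformation. This is an illustrative Python function, not an official migration tool; it assumes the YAML has been loaded into nested dicts that mirror the layout shown above, and the function name is hypothetical:

```python
import copy


def migrate_config_to_r89(config):
    """Return a copy of an R88-style config dict with the R89 changes applied."""
    new = copy.deepcopy(config)
    emr = new["aws"]["emr"]
    enrich = new["enrich"]
    versions = enrich["versions"]

    emr["ami_version"] = "5.5.0"                          # WAS 4.5.0
    emr["jobflow"]["job_name"] = enrich.pop("job_name")   # moved from enrich
    versions.pop("hadoop_shred")                          # superseded by rdb_shredder
    versions.pop("hadoop_enrich")                         # replaced by spark_enrich
    versions["spark_enrich"] = "1.9.0"

    # storage -> versions is newly created in this release.
    storage_versions = new.setdefault("storage", {}).setdefault("versions", {})
    storage_versions["rdb_shredder"] = "0.12.0"
    storage_versions["hadoop_elasticsearch"] = versions.pop("hadoop_elasticsearch")
    return new
```

The input is deep-copied so the original dict stays available for comparison; compare the result against the sample config.yml template before using it.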
This release introduces event de-duplication across different pipeline runs, powered by DynamoDB, along with an important refactoring of the batch pipeline configuration.
The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
Storage target configuration JSONs can be generated from your existing config.yml, using the 3-enrich/emr-etl-runner/config/convert_targets.rb script. These files should be stored in a folder, for example called targets, alongside your existing enrichments folder.
When complete, your folder layout will look something like this:
snowplow_config
├── config.yml
├── enrichments
│ ├── campaign_attribution.json
│ ├── ...
│ ├── user_agent_utils_config.json
├── iglu_resolver.json
├── targets
│ ├── duplicate_dynamodb.json
│ ├── enriched_redshift.json
For complete examples, see our storage target configuration JSONs. The properties are explained on the wiki page.
- Remove the whole storage.targets section (leaving storage.download.folder) from your config.yml file
- Update the hadoop_shred job version in your configuration YAML like so:
versions:
  hadoop_enrich: 1.8.0 # UNCHANGED
  hadoop_shred: 0.11.0 # WAS 0.10.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
- Append the option --targets $TARGETS_DIR to both the snowplow-emr-etl-runner and snowplow-storage-loader applications
- Append the option --resolver $IGLU_RESOLVER to the snowplow-storage-loader application. This is required to validate the storage target configurations
Please be aware that enabling cross-batch deduplication will have a potentially high cost and performance impact on your Snowplow batch pipeline.
If you want to start deduplicating events across batches, you need to add a new DynamoDB config target to your newly created targets directory.
Optionally, before the first run of the Shred job with cross-batch deduplication, you may want to run Event Manifest Populator to back-fill the DynamoDB table.
When Relational Database Shredder runs, if the table doesn’t exist it will be created automatically with the schema required to store and deduplicate events. By default, its provisioned throughput is set to 100 write capacity units and 100 read capacity units.
For relatively low volumes (1M events per run), the default settings will likely just work. However, we strongly recommend closely monitoring the EMR job and its AWS billing impact, and tweaking the DynamoDB provisioned throughput and your EMR cluster specification accordingly.
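A rough calculation shows why provisioned throughput deserves attention. One DynamoDB write capacity unit covers one standard write per second for an item of up to 1 KB. The Python sketch below assumes, purely for illustration, that each event costs exactly one such write; the real number of writes per event depends on the shredder and is not stated here:

```python
def hours_to_write(events, write_capacity_units, writes_per_event=1):
    """Lower bound on the hours needed to write `events` items at a fixed WCU.

    Assumption: every write consumes one write capacity unit (item <= 1 KB).
    """
    seconds = events * writes_per_event / write_capacity_units
    return seconds / 3600.0
```

Under these assumptions, hours_to_write(1_000_000, 100) comes to roughly 2.8 hours spent on manifest writes alone, while raising the table to 1000 write capacity units brings the same bound down to about 17 minutes, at a higher DynamoDB cost.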
This release contains a wide array of new features, stability enhancements and performance improvements for EmrEtlRunner and StorageLoader. As of this release EmrEtlRunner lets you specify EBS volumes for your Hadoop worker nodes; meanwhile StorageLoader now writes to a dedicated manifest table to record each load.
The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:
jobflow:
  master_instance_type: m1.medium
  core_instance_count: 1
  core_instance_type: c4.2xlarge
  core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
    volume_size: 200 # Gigabytes
    volume_type: "io1"
    volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
    ebs_optimized: false # Optional. Will default to true
The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD), each provisioned at 400 IOPS. The volumes will not be EBS optimized. Note that this configuration has finally allowed us to use the EBS-only c4 instance types for our core nodes.
For a complete example, see our sample config.yml template.
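The optional fields in core_instance_ebs follow the rules given in the comments of the YAML above. As a minimal sketch in Python (a hypothetical helper, not EmrEtlRunner code), resolving them could look like this, assuming the section has been loaded into a dict:

```python
def effective_ebs_settings(core_instance_ebs):
    """Resolve the optional fields of a core_instance_ebs mapping."""
    settings = {
        "volume_size": core_instance_ebs["volume_size"],
        "volume_type": core_instance_ebs["volume_type"],
        # ebs_optimized is optional and defaults to true.
        "ebs_optimized": core_instance_ebs.get("ebs_optimized", True),
    }
    # volume_iops is optional and only used for the "io1" volume type.
    if settings["volume_type"] == "io1" and "volume_iops" in core_instance_ebs:
        settings["volume_iops"] = core_instance_ebs["volume_iops"]
    return settings
```

With a "gp2" volume, for example, any volume_iops value is dropped and ebs_optimized falls back to true when omitted.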
You will also need to deploy the following manifest table for Redshift:
This table should be deployed into the same schema as your events and other tables.
This release introduces additional event de-duplication functionality for our Redshift load process, plus a brand new data model that makes it easier to get started with web data. It also adds support for AWS’s newest regions: Ohio, Montreal and London.
Upgrading is simple - update the hadoop_shred job version in your configuration YAML like so:
versions:
  hadoop_enrich: 1.8.0 # UNCHANGED
  hadoop_shred: 0.10.0 # WAS 0.9.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
You will also need to deploy the following table for Redshift:
This release brings initial beta support for using Apache Kafka with the Snowplow real-time pipeline, as an alternative to Amazon Kinesis.
Please note that this Kafka support is extremely beta - we want you to use it and test it; do not use it in production.
The real-time apps for R85 Metamorphosis are available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.10.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_1x.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_2x.zip
Or you can download all of the apps together in this zipfile:
https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r85_metamorphosis.zip
To upgrade the Stream Collector application:
- Install the new Collector on each server in your auto-scaling group
- Upgrade your config by:
  - Moving the collector.sink.kinesis.buffer section to collector.sink.buffer, as this section will be used to configure limits for both Kinesis and Kafka
  - Adding a new section within the collector.sink block:
collector {
  ...
  sink {
    ...
    buffer {
      byte-limit:
      record-limit: # Not supported by Kafka; will be ignored
      time-limit:
    }
    ...
    kafka {
      brokers: ""
      # Data will be stored in the following topics
      topic {
        good: ""
        bad: ""
      }
    }
    ...
  }
}
To upgrade the Stream Enrich application:
- Install the new Stream Enrich on each server in your auto-scaling group
- Upgrade your config by:
- Adding a new section within the enrich block:
enrich {
  ...
  # Kafka configuration
  kafka {
    brokers: "localhost:9092"
  }
  ...
}
Note: The app-name defined in your config will be used as your Kafka consumer group ID.
The Kinesis apps for R84 Stellers Sea Eagle are available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_collector_0.8.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_elasticsearch_sink_0.8.0.zip
Or you can download all of the apps together in this zipfile:
https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r84_stellers_sea_eagle.zip
Only the Elasticsearch Sink app config has changed. The change does not include breaking config changes. To upgrade the Elasticsearch Sink:
- Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
- Update your Elasticsearch Sink config with the new elasticsearch.client.http section:
  - elasticsearch.client.http.conn-timeout
  - elasticsearch.client.http.read-timeout
NOTE: These timeouts are optional and will default to 300000 if they cannot be found in your config.
See our sample config.hocon template.
This release introduces our powerful new SQL Query Enrichment, long-awaited support for the EU Frankfurt AWS region (eu-central-1), plus POST support for our Iglu webhook adapter.
Update the hadoop_enrich job version in your configuration YAML like so:
versions:
  hadoop_enrich: 1.8.0 # WAS 1.7.0
  hadoop_shred: 0.9.0 # UNCHANGED
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
This is a real-time pipeline release. This release updates the Kinesis Elasticsearch Sink with support for sending events via HTTP, allowing us to support Amazon Elasticsearch Service.
The Kinesis apps for R82 Tawny Eagle are all available in a single zip file here:
https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r82_tawny_eagle.zip
The individual Kinesis apps for R82 Tawny Eagle are also available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_collector_0.7.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.8.1.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_elasticsearch_sink_0.7.0.zip
Only the Elasticsearch Sink app has actually changed. The change does, however, include breaking config changes, so you will need to make some minor changes to your configuration file. To upgrade the Elasticsearch Sink:
- Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
- Update your Elasticsearch Sink config with the new elasticsearch section:
  - The only new fields are elasticsearch.client.type and elasticsearch.client.port
  - The following fields have been renamed:
    - elasticsearch.cluster-name is now elasticsearch.cluster.name
    - elasticsearch.endpoint is now elasticsearch.client.endpoint
    - elasticsearch.max-timeout is now elasticsearch.client.max-timeout
    - elasticsearch.index is now elasticsearch.cluster.index
    - elasticsearch.type is now elasticsearch.cluster.type
- Update your supervisor process to point to the new Kinesis Elasticsearch Sink app
- Restart the supervisor process on each server running the sink
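The field renames listed above amount to a simple lookup table. The following Python sketch is illustrative only; it assumes the configuration has been flattened into a dict keyed by dotted paths, which is a hypothetical representation rather than how the sink reads its HOCON:

```python
# Old dotted path -> new dotted path, as listed in the upgrade steps.
RENAMED_FIELDS = {
    "elasticsearch.cluster-name": "elasticsearch.cluster.name",
    "elasticsearch.endpoint": "elasticsearch.client.endpoint",
    "elasticsearch.max-timeout": "elasticsearch.client.max-timeout",
    "elasticsearch.index": "elasticsearch.cluster.index",
    "elasticsearch.type": "elasticsearch.cluster.type",
}


def rename_fields(flat_config):
    """Return a copy of a flat {dotted.path: value} dict using the new names."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in flat_config.items()}
```

Keys that are not in the table pass through unchanged; the two new fields, elasticsearch.client.type and elasticsearch.client.port, still have to be added by hand.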
This is a real-time pipeline release. At the heart of it is the Hadoop Event Recovery project, which allows you to fix up Snowplow bad rows and make them ready for reprocessing.
The Kinesis apps for R81 Kangaroo Island Emu are all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r81_kangaroo_island_emu.zip
Only the Stream Enrich app has actually changed. The change is not breaking, so you don’t have to make any changes to your configuration file. To upgrade Stream Enrich:
- Install the new Stream Enrich app on each server in your Stream Enrich auto-scaling group
- Update your supervisor process to point to the new Stream Enrich app
- Restart the supervisor process on each server running Stream Enrich
This is a real-time pipeline release which improves stability and brings the real-time pipeline up-to-date with our Hadoop pipeline.
As a result, you can now use R79 Black Swan’s API Request Enrichment and the HTTP Header Extractor Enrichment in your real-time pipeline. Also, you can now configure the number of records that the Kinesis Client Library should retrieve with each call to GetRecords.
The Kinesis apps for R80 Southern Cassowary are all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r80_southern_cassowary.zip
There are no breaking changes in this release - you can upgrade the individual Kinesis apps without worrying about having to update the configuration files or indeed the Kinesis streams.
If you want to configure how many records Stream Enrich should read from Kinesis at a time, update its configuration file to add a maxRecords property like so:
enrich {
  ...
  streams {
    in: {
      ...
      maxRecords: 5000 # Default is 10000
      ...
If you want to configure how many records Kinesis Elasticsearch Sink should read from Kinesis at a time, again update its configuration file to add a maxRecords property:
sink {
  ...
  kinesis {
    in: {
      ...
      maxRecords: 5000 # Default is 10000
      ...
This release introduces our powerful new API Request Enrichment, plus a new HTTP Header Extractor Enrichment and several other improvements on the enrichments side.
It also updates the Iglu client used by our Spark Enrich and Relational Database Shredder components. Version 1.4.0 lets you fetch your schemas from Iglu registries with authentication support, allowing you to keep your proprietary schemas private.
The recommended AMI version to run Snowplow is now 4.5.0 - update your configuration YAML as follows:
emr:
  ami_version: 4.5.0 # WAS 4.3.0
Next, update your hadoop_enrich and hadoop_shred job versions like so:
versions:
  hadoop_enrich: 1.7.0 # WAS 1.6.0
  hadoop_shred: 0.9.0 # WAS 0.8.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
If you want to use an Iglu registry with authentication, add a private apikey to the registry’s configuration entry and set the schema version to 1-0-1 as in the example below.
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Private Acme repository for com.acme",
        "priority": 1,
        "vendorPrefixes": [ "com.acme" ],
        "connection": {
          "http": {
            "uri": "http://iglu.acme.com/api",
            "apikey": "APIKEY-FOR-ACME"
          }
        }
      }
    ]
  }
}
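An easy mistake is to add the apikey while leaving the resolver schema at its old version. As a hedged sketch in Python (a hypothetical check, not part of the Iglu client), the rule from this section can be verified on a resolver that has been parsed into a dict:

```python
def resolver_problems(resolver):
    """Flag a resolver that uses an apikey without the 1-0-1 schema version."""
    repositories = resolver.get("data", {}).get("repositories", [])
    uses_apikey = any(
        "apikey" in repo.get("connection", {}).get("http", {})
        for repo in repositories
    )
    schema = resolver.get("schema", "")
    if uses_apikey and not schema.endswith("/jsonschema/1-0-1"):
        return ["apikey is set but the resolver schema version is not 1-0-1"]
    return []
```

The check only encodes the instruction given in this section; it does not validate the resolver against its JSON Schema.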
This release brings our Kinesis pipeline functionally up-to-date with our Hadoop pipeline, and makes various further improvements to the Kinesis pipeline.
The Kinesis apps for R78 Great Hornbill are now all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r78_great_hornbill.zip
Scala Kinesis Enrich has been renamed to Stream Enrich. The name of the artifact has changed to "snowplow-stream-enrich".
Upgrading will require the following configuration changes to the applications' individual HOCON configuration files.
Add a collector.cookie.name field to the HOCON and set its value to "sp".
Also, note that the configuration file no longer supports loading AWS credentials from the classpath using ClasspathPropertiesFileCredentialsProvider. If your configuration looks like this:
{
  "aws": {
    "access-key": "cpf",
    "secret-key": "cpf"
  }
}
then you should change "cpf" to "default" to use the DefaultAWSCredentialsProviderChain. You will need to ensure that your credentials are available in one of the places the AWS Java SDK looks. For more information about this, see the Javadoc.
Replace the sink.kinesis.out string with an object with two fields:
{
  "sink": {
    "good": "elasticsearch", # or "stdout"
    "bad": "kinesis" # or "stderr" or "none"
  }
}
Additionally, move the stream-type setting from the sink.kinesis.in section to the sink section.
If you are loading Snowplow bad rows into, for example, Elasticsearch, please make sure to update all applications.
For a complete example, see our sample config.hocon template.
This release focuses on the command-line applications used to orchestrate Snowplow, bringing Snowplow up-to-date with the new 4.x series of Elastic MapReduce releases.
Running EmrEtlRunner and StorageLoader as Ruby (rather than JRuby) apps is no longer actively supported.
The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
Note that the snowplow-runner-and-loader.sh script has also been updated to use the JRuby binaries rather than the raw Ruby project.
The recommended AMI version to run Snowplow is now 4.3.0 - update your configuration YAML as follows:
emr:
  ami_version: 4.3.0 # WAS 3.7.0
You will need to update the jar versions in the same section:
versions:
  hadoop_enrich: 1.6.0 # WAS 1.5.1
  hadoop_shred: 0.8.0 # WAS 0.7.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
This release introduces an event de-duplication process which runs on Hadoop, and also includes an important bug fix for our SendGrid webhook support.
Upgrading to this release is simple - the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.
In the config.yml file for your EmrEtlRunner, update your hadoop_enrich and hadoop_shred job versions like so:
versions:
  hadoop_enrich: 1.5.1 # WAS 1.5.0
  hadoop_shred: 0.7.0 # WAS 0.6.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
This release lets you warehouse the event streams generated by Urban Airship and SendGrid, and also updates our web-recalculate data model.
The corresponding versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
In your EmrEtlRunner’s config.yml file, update your hadoop_enrich job’s version to 1.5.0, like so:
versions:
  hadoop_enrich: 1.5.0 # WAS 1.4.0
For a complete example, see our sample config.yml template.
You'll need to deploy the Redshift tables for any webhooks you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:
This release adds a Weather Enrichment to the Hadoop pipeline - making Snowplow the first event analytics platform with built-in weather analytics!
Data provider: OpenWeatherMap
To take advantage of this new enrichment, update the hadoop_enrich jar version in the emr section of your configuration YAML:
versions:
  hadoop_enrich: 1.4.0 # WAS 1.3.0
  hadoop_shred: 0.6.0 # UNCHANGED
  hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample config.yml template.
Make sure to add a weather_enrichment_config.json configured as explained here into your enrichments folder too. The file should conform to this JSON Schema.
The corresponding JSONPaths file can be found here.
If you are using Snowplow with Amazon Redshift, you will need to deploy the org_openweathermap_weather_1 table into your database.
This release adds the ability to automatically load bad rows from the Snowplow Elastic MapReduce jobflow into Elasticsearch for analysis and formally separates the Snowplow enriched event format from the TSV format used to load Redshift.
The corresponding versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
You will need to update the jar versions in the emr section of your configuration YAML:
versions:
  hadoop_enrich: 1.3.0 # Version of the Hadoop Enrichment process
  hadoop_shred: 0.6.0 # Version of the Hadoop Shredding process
  hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
In order to start loading bad rows from the EMR jobflow into Elasticsearch, you will need to add an Elasticsearch target to the targets section of your configuration YAML.
targets:
  - name: "Our Elasticsearch cluster" # Name for the target - used to label the corresponding jobflow step
    type: elasticsearch # Marks the database type as Elasticsearch
    host: "ec2-43-1-854-22.compute-1.amazonaws.com" # Elasticsearch host
    database: snowplow # The Elasticsearch index
    port: 9200 # Port used to connect to Elasticsearch
    table: bad_rows # The Elasticsearch type
    es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
    username: # Not required for Elasticsearch
    password: # Not required for Elasticsearch
    sources: # Leave blank or specify: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
    maxerror: # Not required for Elasticsearch
    comprows: # Not required for Elasticsearch
Note that the database and table fields actually contain the index and type respectively where bad rows will be stored.
The sources field is an array of buckets from which to load bad rows. If you leave this field blank, then the bad rows buckets created by the current run of the EmrEtlRunner will be loaded. Alternatively, you can explicitly specify an array of bad row buckets to load.
For a complete example, see our sample config.yml template.
Note these updates to EmrEtlRunner's command-line arguments:
- You can skip loading data into Elasticsearch by running EmrEtlRunner with the --skip elasticsearch option
- To run just the Elasticsearch load without any other EmrEtlRunner steps, explicitly skip all other steps using --skip staging,s3distcp,enrich,shred,archive_raw
- Note that running EmrEtlRunner with --skip enrich,shred will no longer skip the EMR job, since there is still the Elasticsearch step to run
- If you are using Postgres rather than Redshift, you should no longer pass the --skip shred option to EmrEtlRunner. This is because the shred step now removes JSON fields from the enriched event TSV.
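Because --skip takes the steps to leave out rather than the steps to run, it can be convenient to derive it from the steps you actually want. The Python helper below is a hypothetical sketch; the step list and its order are assumptions pieced together from the options mentioned above, not an authoritative list:

```python
# Assumption: these are the skippable steps for this release, in run order.
ALL_STEPS = ["staging", "s3distcp", "enrich", "shred", "elasticsearch", "archive_raw"]


def skip_argument(steps_to_run):
    """Build the --skip arguments that leave only `steps_to_run` enabled."""
    skipped = [step for step in ALL_STEPS if step not in steps_to_run]
    if not skipped:
        return []
    return ["--skip", ",".join(skipped)]
```

For example, skip_argument(["elasticsearch"]) yields the same --skip value as the Elasticsearch-only bullet above.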
Use the appropriate migration script to update your version of the atomic.events table to the relevant schema:
If you are upgrading to this release from an older version of Snowplow, we also provide Redshift migration scripts to atomic.events version 0.8.0 from 0.4.0, 0.5.0 and 0.6.0 versions.
Warning: these migration scripts will alter your atomic.events table in-place, deleting the unstruct_event, contexts, and derived_contexts columns. We recommend that you make a full backup before running these scripts.
This release adds the ability to track clicks through the Snowplow Clojure Collector, adds a cookie extractor enrichment and introduces new de-duplication queries leveraging R71's event fingerprint.
This release bumps the Clojure Collector to version 1.1.0.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting “Save As…”
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector’s application
- Click the “Upload New Version” and upload your warfile
You need to update the version of the Enrich jar in your configuration file:
hadoop_enrich: 1.2.0 # Version of the Hadoop Enrichment process
If you wish to use the new cookie extractor enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.
This default configuration is capturing the Scala Stream Collector's own sp cookie - in practice, you would probably extract other more valuable cookies available on your domain. Each extracted cookie will end up as a single derived context following the JSON Schema org.ietf/http_cookie/jsonschema/1-0-0.
Note: This enrichment only works with events recorded by the Scala Stream Collector - the CloudFront and Clojure Collectors do not capture HTTP headers.
If you are using Snowplow with Amazon Redshift and wish to use the new cookie extractor enrichment, you will need to deploy the org_ietf_http_cookie_1 table into your database.
For the new URI redirect functionality, install the com_snowplowanalytics_snowplow_uri_redirect_1 table.
This release significantly overhauls Snowplow's handling of time and introduces event fingerprinting to support de-duplication efforts. It also brings our validation of unstructured events and custom context JSONs "upstream" from our Hadoop Shred process into our Hadoop Enrich process.
The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.
Unzip this file to a sensible location (e.g. /opt/snowplow-r71).
You should update the versions of the Enrich and Shred jars in your configuration file (https://github.com/snowplow/snowplow/blob/r71-stork-billed-kingfisher/3-enrich/emr-etl-runner/config/config.yml.sample):
hadoop_enrich: 1.1.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.5.0 # Version of the Hadoop Shredding process
You should also update the AMI version field:
ami_version: 3.7.0
For each of your database targets, you must add the new ssl_mode field:
targets:
  - name: "My Redshift database"
    ...
    ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
If you wish to use the new event fingerprint enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.
Use the appropriate migration script to update your version of the atomic.events table to the corresponding schema:
If you are ingesting CloudFront access logs with Snowplow, use the CloudFront access log migration script to update your com_amazon_aws_cloudfront_wd_access_log_1 table.
This release focuses on improving our StorageLoader and EmrEtlRunner components and is the first step towards combining the two into a single CLI application.
Download the EmrEtlRunner and StorageLoader from Bintray.
Unzip this file to a sensible location (e.g. /opt/snowplow-r70).
Check that you have a compatible JRE (1.7+) installed:
$ ./snowplow-emr-etl-runner --version
snowplow-emr-etl-runner 0.17.0
Your two old configuration files will no longer work. Use the aforementioned combine_configurations.rb script to turn them into a unified configuration file and a resolver JSON.
For reference:
- config/iglu_resolver.json - example resolver JSON
- emr-etl-runner/config/config.yml.sample - example unified configuration YAML
Note that field names in the unified configuration file no longer start with a colon - so region: us-east-1 not :region: us-east-1.
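To make the key change concrete, here is a small Python sketch that strips the leading colon from every key of an already-loaded configuration. It is only an illustration of the renaming and does not replace the combine_configurations.rb script, which also produces the resolver JSON; the function name is hypothetical:

```python
def strip_leading_colons(node):
    """Recursively rename ':key' to 'key' in nested dicts and lists."""
    if isinstance(node, dict):
        return {
            (key[1:] if isinstance(key, str) and key.startswith(":") else key):
                strip_leading_colons(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_leading_colons(item) for item in node]
    # Scalars (strings, numbers, None) are returned unchanged.
    return node
```

Only keys are touched, so a value such as a bucket path keeps any colons it contains.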
The EmrEtlRunner now requires a --resolver argument which should be the path to your new resolver JSON.
Also note that when specifying steps to skip using the --skip option, the "archive" step has been renamed to "archive_raw" in the EmrEtlRunner and "archive_enriched" in the StorageLoader. This is in preparation for merging the two applications into one.
This release contains new and updated SQL data models.
The SQL data models are a standalone and optional part of the Snowplow pipeline. Users who don't use the SQL data models are therefore not affected by this release.
To implement the SQL data models, first execute the relevant setup queries in Redshift. Then use SQL Runner to execute the queries on a regular basis. SQL Runner is an open source app that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.
The web and mobile data models come in two variants: recalculate and incremental.
The recalculate models drop and recalculate the derived tables using all events, and can therefore be replaced without having to upgrade the tables.
The incremental models update the derived tables using only the events from the most recent batch. The updated incremental model comes with a migration script.
This is a small release which adapts the EmrEtlRunner to use the new Elastic MapReduce API.
You need to update EmrEtlRunner to the version 0.16.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r68-turquoise-jay
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
This release brings a host of upgrades to our real-time Amazon Kinesis pipeline as well as the embedding of Snowplow tracking into this pipeline.
The Kinesis apps for r67 Bohemian Waxwing are now all available in a single zip file here. Upgrading will require various configuration changes to each of the three applications’ HOCON configuration files.
- Change collector.sink.kinesis.stream.name to collector.sink.kinesis.stream.good in the HOCON
- Add collector.sink.kinesis.stream.bad to the HOCON
If you want to include Snowplow tracking for this application please append the following:
enrich {
  ...
  monitoring {
    snowplow {
      collector-uri: ""
      collector-port: 80
      app-id: ""
      method: "GET"
    }
  }
}
Note that this is a wholly optional section; if you do not want to send application events to a second Snowplow instance, simply do not add this to your configuration file.
For a complete example, see our config.hocon.sample file.
- Add max-timeout into the elasticsearch fields
- Merge location fields into the elasticsearch section
- If you want to include Snowplow Tracking for this application please append the following:
sink {
  ...
  monitoring {
    snowplow {
      collector-uri: ""
      collector-port: 80
      app-id: ""
      method: "GET"
    }
  }
}
Again, note that Snowplow tracking is a wholly optional section.
For a complete example, see our config.hocon.sample file.
This release upgrades our Hadoop Enrichment process to run on Hadoop 2.4, re-enables our Kinesis-Hadoop lambda architecture and also introduces a new scriptable enrichment powered by JavaScript.
You need to update EmrEtlRunner to version 0.15.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r66-oriental-skylark
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
You need to update your EmrEtlRunner's config.yml file to reflect the new Hadoop 2.4.0 and AMI 3.6.0 support:
:emr:
  :ami_version: 3.6.0 # WAS 2.4.2
And:
:versions:
  :hadoop_enrich: 1.0.0 # WAS 0.14.1
You can enable the new JavaScript enrichment by creating a self-describing JSON and adding it to your enrichments folder. The configuration JSON should validate against the javascript_script_config schema.
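A minimal sketch of building such a configuration JSON follows. Several details are assumptions rather than facts taken from this guide: the schema version 1-0-0, the vendor and name fields, and a script parameter holding the Base64-encoded JavaScript source. Check them against the javascript_script_config schema before relying on this.

```python
import base64
import json


def javascript_enrichment_config(script_source, enabled=True):
    """Wrap JavaScript source in a self-describing enrichment configuration."""
    # Assumption: the schema expects the script as a Base64-encoded string.
    encoded = base64.b64encode(script_source.encode("utf-8")).decode("ascii")
    return {
        # Assumption: schema version 1-0-0.
        "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
        "data": {
            "vendor": "com.snowplowanalytics.snowplow",
            "name": "javascript_script_config",
            "enabled": enabled,
            "parameters": {"script": encoded},
        },
    }


# Placeholder script body, for illustration only.
config = javascript_enrichment_config("function process(event) { return []; }")
config_json = json.dumps(config, indent=2)
```

The resulting config_json string can then be saved as a .json file in your enrichments folder.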
This release greatly improves the speed, efficiency, and reliability of Snowplow’s real-time Kinesis pipeline.
The Kinesis apps for r65 Scarlet Rosefinch are all available in a single zip file here.
Upgrading will require various configuration changes to each of the four applications.
Add backoffPolicy and buffer fields to the configuration HOCON.
- Add backoffPolicy and buffer fields to the configuration HOCON
- Extract the resolver from the configuration HOCON into its own JSON file, which can be stored locally or in DynamoDB
- Update the command line arguments as detailed here
- Rename the outermost key in the configuration HOCON from "connector" to "sink"
- Replace the "s3/endpoint" field with an "s3/region" field (such as us-east-1)
Rename the outermost key in the configuration HOCON from "connector" to "sink"
This is a major release which adds a new data modeling stage to the Snowplow pipeline, as well as fixes a small number of important bugs across the rest of Snowplow.
You need to update EmrEtlRunner to version 0.14.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r64-palila
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
From this release onwards, you must specify IAM roles for Elastic MapReduce to use. If you have not already done so, you can create these default EMR roles using the AWS Command Line Interface, like so:
$ aws emr create-default-roles
Now update your EmrEtlRunner's config.yml file to add the default roles you just created:
:emr:
  :ami_version: 2.4.2 # Choose as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
  :region: eu-west-1 # Always set this
  :jobflow_role: EMR_EC2_DefaultRole # NEW LINE
  :service_role: EMR_DefaultRole # NEW LINE
This release also bumps the Hadoop Enrichment process to version 0.14.1. Update config.yml like so:
:versions:
  :hadoop_enrich: 0.14.1 # WAS 0.14.0
For a complete example, see our sample config.yml template.
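Before running EmrEtlRunner against the edited file, a quick sanity check along the following lines may help. It is a line-based sketch only: it does not parse YAML and does not verify that the keys sit under the :emr: section. The function name is hypothetical.

```python
import re

REQUIRED_EMR_KEYS = ("jobflow_role", "service_role")


def missing_emr_keys(config_text, required=REQUIRED_EMR_KEYS):
    """Return the required keys that have no ':key: value' line in the text."""
    missing = []
    for key in required:
        # Match ':key:' at the start of a line, followed by a non-comment value.
        pattern = r"^[ \t]*:" + re.escape(key) + r":[ \t]*[^\s#]"
        if not re.search(pattern, config_text, flags=re.MULTILINE):
            missing.append(key)
    return missing


sample = """:emr:
  :region: eu-west-1
  :jobflow_role: EMR_EC2_DefaultRole
  :service_role: EMR_DefaultRole
"""
```

Calling missing_emr_keys(sample) on the sample above returns an empty list; a config.yml that predates this release would report both roles as missing.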
This release widens the mkt_clickid field in atomic.events. You need to use the appropriate migration script to update to the new table definition:
This is a major release which adds two new enrichments, upgrades existing enrichments and significantly extends and improves our Canonical Event Model for loading into Redshift, Elasticsearch and Postgres.
The new and upgraded enrichments are as follows:
- New enrichment: parsing useragent strings using the ua_parser library
- New enrichment: converting the money amounts in e-commerce transactions into a base currency using Open Exchange Rates
- Upgraded: extracting click IDs in our campaign attribution enrichment, so that Snowplow event data can be more precisely joined with campaign data
- Upgraded: our existing MaxMind-powered IP lookups
- Upgraded: useragent parsing using the user_agent_utils library can now be disabled
To continue parsing useragent strings using the user_agent_utils library, you must add a new JSON configuration file into your folder of enrichment JSONs:
{
  "schema": "iglu:com.snowplowanalytics.snowplow/user_agent_utils_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "user_agent_utils_config",
    "enabled": true,
    "parameters": {}
  }
}
The name of the file is not important but must end in .json.
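To check a folder of enrichment JSONs before a run, something like the following sketch could be used. It only confirms that each .json file parses and has the top-level schema and data keys of a self-describing JSON; it does not validate against the Iglu schemas. The function name is hypothetical.

```python
import json
from pathlib import Path


def load_enrichment_configs(folder):
    """Load every *.json file in folder and check its self-describing shape."""
    configs = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # A self-describing JSON carries both a schema URI and a data payload.
        if not isinstance(document, dict) or not {"schema", "data"} <= document.keys():
            raise ValueError(f"{path.name} is missing 'schema' or 'data'")
        configs[path.name] = document
    return configs
```

Files that do not end in .json are skipped by the glob, which mirrors the naming rule above.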
Configuring other enrichments is at your discretion. Useful resources here are:
There are two steps to upgrading the EMR pipeline:
- Upgrade your EmrEtlRunner to use the latest Hadoop job versions
- Upgrade your Redshift and/or Postgres atomic.events table to the relevant definitions
This release bumps:
- The Hadoop Enrichment process to version 0.14.0
- The Hadoop Shredding process to version 0.4.0
In your EmrEtlRunner's config.yml file, update your Hadoop jobs versions like so:
:versions:
  :hadoop_enrich: 0.14.0 # WAS 0.13.0
  :hadoop_shred: 0.4.0 # WAS 0.3.0
For a complete example, see our sample config.yml template.
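If you maintain several configuration files, the version bump can be scripted. The helper below is a sketch that rewrites the version on matching lines of the file's text and leaves everything else, including trailing comments, untouched. It is line-based and does not parse YAML; the function name is hypothetical.

```python
import re


def bump_versions(config_text, new_versions):
    """Replace the value on each ':job: x.y.z' line with the new version."""
    for job, version in new_versions.items():
        pattern = re.compile(
            r"^([ \t]*:" + re.escape(job) + r":[ \t]*)\S+", flags=re.MULTILINE
        )
        # Keep the key and its spacing (group 1) and swap only the version token.
        config_text, count = pattern.subn(
            lambda match, v=version: match.group(1) + v, config_text
        )
        if count == 0:
            raise KeyError(f"no :{job}: entry found")
    return config_text


before = ":versions:\n  :hadoop_enrich: 0.13.0\n  :hadoop_shred: 0.3.0\n"
after = bump_versions(before, {"hadoop_enrich": "0.14.0", "hadoop_shred": "0.4.0"})
```

Raising an error when a key is absent is deliberate, so that a silently unchanged file is not mistaken for an upgraded one.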
You need to use the appropriate migration script to update to the new table definition:
If you want to make use of the new ua_parser based useragent parsing enrichment in Redshift, you must also deploy the new table into your atomic schema:
This release updates:
- Scala Kinesis Enrich, to version 0.4.0
- Kinesis Elasticsearch Sink, to version 0.2.0
The new version of the Kinesis pipeline is available on Bintray. The download contains the latest versions of all of the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink).
Our recommended approach for upgrading is as follows:
- Kill your Scala Kinesis Enrich cluster
- Leave your Kinesis Elasticsearch Sink cluster running until all remaining enriched events are loaded, then kill this cluster too
- Upgrade your Scala Kinesis Enrich cluster to the new version
- Upgrade your Kinesis Elasticsearch Sink cluster to the new version
- Restart your Scala Kinesis Enrich cluster
- Restart your Kinesis Elasticsearch Sink cluster
This release is designed to fix an incompatibility issue between r61's EmrEtlRunner and some older Elastic Beanstalk configurations. It also includes some other EmrEtlRunner improvements.
You need to update EmrEtlRunner to version 0.13.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r62-tropical-parula
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
You must also update your EmrEtlRunner's configuration file, or else you will get a Contract failure on start. See the next section for details.
Whether or not you use the new bootstrap option, you must update your EmrEtlRunner's config.yml file to include an entry for it.
In the :emr: section of your EmrEtlRunner's config.yml file, add in a :bootstrap: property like so:
:emr:
  ...
  :ec2_key_name: ADD HERE
  :bootstrap: [] # No custom bootstrap actions
  :software:
    ...
For a complete example, see our sample config.yml template.
This release has a variety of new features, operational enhancements and bug fixes. The major additions are:
- You can now parse Amazon CloudFront access logs using Snowplow
- The latest Clojure Collector version supports Tomcat 8 and CORS, ready for cross-domain POST from JavaScript and ActionScript
- EmrEtlRunner's failure handling and Clojure Collector log handling have been improved
You need to update EmrEtlRunner to version 0.12.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r61-pygmy-parrot
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
If you currently use snowplow-runner-and-loader.sh, upgrade to the relevant version too.
This release bumps the Hadoop Enrichment process to version 0.13.0.
In your EmrEtlRunner's config.yml file, update your hadoop_enrich and hadoop_shred jobs' versions like so:
:versions:
  :hadoop_enrich: 0.13.0 # WAS 0.12.0
For a complete example, see our sample config.yml template.
This release bumps the Clojure Collector to version 1.0.0.
You will not be able to upgrade an existing Tomcat 7 cluster to use this version. Instead, to upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting "Save As…"
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector's application
- Click the "Launch New Environment" action
- Click "Upload New Version" and upload your warfile
When you are confident that the new collector is performing as expected, you can choose the "Swap Environment URLs" action to put the new collector live.
This release focuses on the Snowplow Kinesis flow, and includes:
- A new Kinesis “sink app” that reads the Scala Stream Collector’s Kinesis stream of raw events and stores these raw events in Amazon S3 in an optimized format
- An updated version of our Hadoop Enrichment process that supports as an input format the events stored in S3 by the new Kinesis sink app
Together, these two features let you robustly archive your Kinesis event stream in S3, and process and re-process it at will using our tried-and-tested Hadoop Enrichment process.
Up until now, all Snowplow releases have used semantic versioning. We will continue to use semantic versioning for Snowplow's many constituent applications and libraries, but our releases of the Snowplow platform as a whole will be known by their release number plus a codename. The codenames for 2015 will be birds in ascending order of size, starting with the Bee Hummingbird.
We recommend upgrading EmrEtlRunner to version 0.11.0, given the bugs fixed in this release. You must also upgrade if you want to use Hadoop to process the events stored by the Kinesis LZO S3 Sink.
Upgrade is as follows:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r60-bee-hummingbird
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
This release bumps the Hadoop Enrichment process to version 0.12.0.
In your EmrEtlRunner's config.yml file, update your hadoop_enrich job's version like so:
:versions:
  :hadoop_enrich: 0.12.0 # WAS 0.11.0
If you want to run the Hadoop Enrichment process against the output of the Kinesis LZO S3 Sink, you will have to change the collector_format field in the configuration file to thrift:
:collector_format: thrift
For a complete example, see our sample config.yml template.
We are steadily moving over to Bintray for hosting binaries and artifacts which don't have to be hosted on S3. To make deployment easier, the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink) are now all available in a single zip file.