An experiment with Node streams, implementing a data processing pipeline as a CLI.
The data is an event log of pageviews: gzipped .tsv files with the header date, time, userid, url, ip, useragent.
The pipeline has the following stages (sketched in code after the list):
- Read .gz files from disk (input directory passed in via the CLI)
- Extract / unzip
- Parse the TSV (using csv-parse, which returns an array for each record)
- Convert that array to an object for more readable access
- Geocode the IP address to get city/country, using GeoIP2 databases for the lookup
- Parse the user agent string to get os/browser, using ua-parser-js
- Build a map of { country/city: numberOfEvents }
- Filter events so each userId appears only once
- Build a similar map of { browser/os: numberOfUsers }
- Iterate over those maps to find the top 5 in each category (see the aggregation sketch below)
- Write the final stream somewhere (file passed in via the cli)
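For orientation, here's a minimal sketch of the front half of that pipeline for a single input file. The file name and field order are assumptions based on the header above, and the geoip2 city/country lookup would slot in as another Transform alongside the user agent one:

```js
// Sketch only: the real pipeline reads every .gz file in the input directory;
// this processes a single (hypothetical) file to show the shape of the stages.
const fs = require('fs');
const zlib = require('zlib');
const { pipeline, Transform, Writable } = require('stream');
const { parse } = require('csv-parse'); // csv-parse v5; v4 exports the parser function directly
const UAParser = require('ua-parser-js');

// Field order assumed from the header described above.
const FIELDS = ['date', 'time', 'userid', 'url', 'ip', 'useragent'];

// csv-parse emits an array per record; turn it into a keyed object.
const toObject = new Transform({
  objectMode: true,
  transform(record, _enc, done) {
    done(null, Object.fromEntries(FIELDS.map((field, i) => [field, record[i]])));
  },
});

// Enrich each event with browser/os from the user agent string.
// A geoip2 city/country lookup would be another Transform just like this one.
const enrich = new Transform({
  objectMode: true,
  transform(event, _enc, done) {
    const { browser, os } = new UAParser(event.useragent).getResult();
    done(null, { ...event, browser: browser.name, os: os.name });
  },
});

pipeline(
  fs.createReadStream('./data/events.tsv.gz'), // hypothetical file name
  zlib.createGunzip(),
  parse({ delimiter: '\t', from_line: 2 }), // skip the header row
  toObject,
  enrich,
  new Writable({
    objectMode: true,
    write(event, _enc, done) {
      // dedup, aggregation, top 5 and output go here (see the other sketches)
      done();
    },
  }),
  (err) => {
    if (err) console.error('Pipeline failed:', err);
  }
);
```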
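And a companion sketch of the aggregation end: a Writable sink that builds the two maps and picks the top 5 of each when the stream finishes. It assumes events arriving here have already been enriched with country/city and browser/os, and deduplicated by userId:

```js
// Sketch only: assumes enriched, deduplicated events flow in from upstream.
const { Writable } = require('stream');

// Pick the n highest-counted keys from a plain { key: count } map.
const topN = (map, n = 5) =>
  Object.entries(map)
    .sort(([, a], [, b]) => b - a)
    .slice(0, n);

const aggregate = () => {
  const eventsByPlace = {}; // { "country/city": numberOfEvents }
  const usersByAgent = {};  // { "browser/os": numberOfUsers }
  return new Writable({
    objectMode: true,
    write(event, _enc, done) {
      const place = `${event.country}/${event.city}`;
      const agent = `${event.browser}/${event.os}`;
      eventsByPlace[place] = (eventsByPlace[place] || 0) + 1;
      usersByAgent[agent] = (usersByAgent[agent] || 0) + 1;
      done();
    },
    final(done) {
      // Emit the report; the real pipeline writes this to the output file.
      console.log({ topPlaces: topN(eventsByPlace), topAgents: topN(usersByAgent) });
      done();
    },
  });
};

module.exports = aggregate;
```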
- Install the latest version of Node (14), e.g. using nvm
- Install Yarn, then run `yarn` to install dependencies
- Drop your gzip files in the data dir
- Run `yarn generate ./data`
The CLI is built with Commander, so you can also run `yarn cli` to see the available options and get help.
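As a rough illustration of the Commander wiring (the option names and the `runPipeline` helper here are hypothetical; the real ones live in the repo's CLI entry point):

```js
// Hypothetical wiring; the actual commands/options are defined in the repo.
const { program } = require('commander'); // named export available since commander 5

// Placeholder for the stream pipeline sketched earlier.
const runPipeline = (inputDir, outputFile) =>
  console.log(`would process ${inputDir} and write the report to ${outputFile}`);

program
  .arguments('<inputDir>')
  .option('-o, --output <file>', 'file to write the final report to', './report.json')
  .action((inputDir) => runPipeline(inputDir, program.opts().output));

program.parse(process.argv);
```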
At the moment the scalability bottleneck is the unique-user filtering: the map of userIds is held in memory, so memory use grows with the number of distinct users. In production that map could be replaced with a key/value store such as Redis or DynamoDB.
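For reference, a minimal sketch of that dedup step as a Transform holding a Set of userIds; swapping the Set for Redis (e.g. SADD, which replies 1 only when the member is new) would move that state out of the process:

```js
// Sketch only: passes through the first event seen per userid.
// Replacing `seen` with a Redis set check would remove the in-memory limit.
const { Transform } = require('stream');

const uniqueUsers = () => {
  const seen = new Set();
  return new Transform({
    objectMode: true,
    transform(event, _enc, done) {
      if (seen.has(event.userid)) return done(); // drop repeat events for this user
      seen.add(event.userid);
      done(null, event);
    },
  });
};

module.exports = uniqueUsers;
```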
So far I've spent about a day on this; with more time I'd look to implement the following:
- More tests
- Add config params e.g. start/end date
- Replace in-memory user map with k/v store e.g. redis
- Deploy / host it somewhere