This project uses Terraform to schedule Athena queries in AWS us-east-1 against the Common Crawl dataset, which contains petabytes of data from billions of webpages in 40+ languages.
It is designed to run for free in us-east-1, where the data is located.
- Linux (also works from AWS/Google Cloud Shell)
- Terraform
- Go programming language
- AWS credentials configured in ~/.aws/credentials
- run `deploy.sh dev`
- wait for the second file to be created in s3://your_bucket/results
- upload a CSV file containing at least these 3 columns to s3://your_bucket/results (an upload sketch follows the example query below):
"warc_filename","warc_record_offset","warc_record_length"
"crawl-data/CC-MAIN-2021-10/segments/161.53/robotstxt/CC-MAIN-2021030106-2021030109-00602.warc.gz","1843978","19366"
This is the same output format that Athena creates if you run queries similar to:
SELECT warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"
WHERE url_host_registered_domain = 'wikipedia.org'
AND url_path LIKE '/wiki/%'
AND fetch_status = 200
LIMIT 1
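The CSV can be uploaded with any S3 client, for example `aws s3 cp your.csv s3://your_bucket/results/`. As a rough sketch, the same upload with the AWS SDK for Go v2 could look like this (the bucket name, local file name, and `results/` key prefix are placeholders, not values taken from this project):

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// A local CSV in the format shown above (placeholder file name).
	f, err := os.Open("records.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Writing under the results prefix is what notifies the WARC-download Lambda.
	if _, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("your_bucket"),
		Key:    aws.String("results/records.csv"),
		Body:   f,
	}); err != nil {
		log.Fatal(err)
	}
}
```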
Alternatively, you can upload a query like the one above to s3://your_bucket/queries; it will be executed automatically, creating a results file, which in turn triggers the download of the WARC files.
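For reference, here is a minimal sketch of what the query-handling Lambda presumably does, using the AWS SDK for Go v2 (the SDK choice, the `ccindex` database setting, and the `your_bucket` output location are assumptions; the deployed Lambda may differ):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/athena"
	"github.com/aws/aws-sdk-go-v2/service/athena/types"
)

func main() {
	ctx := context.Background()

	// Region and credentials come from ~/.aws/credentials; the data is in us-east-1.
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatal(err)
	}
	client := athena.NewFromConfig(cfg)

	// In the real pipeline the query text would be read from the object
	// uploaded to s3://your_bucket/queries.
	query := `SELECT warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"
WHERE url_host_registered_domain = 'wikipedia.org' AND fetch_status = 200
LIMIT 1`

	// "your_bucket" is a placeholder for the bucket created by the deployment.
	out, err := client.StartQueryExecution(ctx, &athena.StartQueryExecutionInput{
		QueryString:           aws.String(query),
		QueryExecutionContext: &types.QueryExecutionContext{Database: aws.String("ccindex")},
		ResultConfiguration: &types.ResultConfiguration{
			OutputLocation: aws.String("s3://your_bucket/results/"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("started query:", aws.ToString(out.QueryExecutionId))
}
```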
- Open the Athena query editor in the us-east-1 region where all the Common Crawl data is located
- Exploring Common Crawl with Athena
flowchart
user((User))
results([S3 /results])
warc([S3 /warcs])
user-- upload athena results csv --> results
queries([S3 /queries])
commoncrawl([S3 commoncrawl])
lambda1[Lambda]
lambda2[Lambda]
athena[Athena]
user-- upload athena query --> queries
queries-- notifies --> lambda1
queries<-- reads query --> lambda1
lambda1-- starts query --> athena
athena-- saves results --> results
results-- notifies --> lambda2
lambda2<-- reads results --> results
lambda2<-- requests warcs --> commoncrawl
lambda2-- uploads warcs --> warc
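To make the last two arrows concrete, here is a minimal sketch of the WARC-download step, assuming the AWS SDK for Go v2 and the public `commoncrawl` bucket in us-east-1; it fetches a single record using the byte range from the example CSV row above (the actual Lambda code may differ):

```go
package main

import (
	"compress/gzip"
	"context"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// One row of the results CSV: warc_filename, warc_record_offset, warc_record_length.
	key := "crawl-data/CC-MAIN-2021-10/segments/161.53/robotstxt/CC-MAIN-2021030106-2021030109-00602.warc.gz"
	offset, length := int64(1843978), int64(19366)

	// A ranged GET retrieves only this record instead of the whole WARC file.
	obj, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("commoncrawl"),
		Key:    aws.String(key),
		Range:  aws.String(fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Body.Close()

	// Common Crawl WARC records are individually gzipped, so the slice
	// decompresses on its own.
	gz, err := gzip.NewReader(obj.Body)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()
	if _, err := io.Copy(os.Stdout, gz); err != nil {
		log.Fatal(err)
	}
}
```

Ranged GETs download only the record bytes rather than the whole WARC file, which keeps transfers small and, when run inside us-east-1, free.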