This project uses Terraform to schedule Athena queries in AWS us-east-1 against the Common Crawl, a petabyte-scale archive of billions of web pages in 40+ languages.
This project was designed to run for free in us-east-1, the region where the data is located.
- Linux (also works from AWS/Google Cloud Shell)
- Terraform
- Go programming language
- AWS credentials in ~/.aws/credentials
- run dev
- wait for the second file to be created in s3://your_bucket/results
- upload a CSV file containing at least these 3 columns (warc_filename, warc_record_offset, warc_record_length) to s3://your_bucket/results
This is the same output format that Athena produces for queries similar to:
SELECT warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"
WHERE url_host_registered_domain = ''
AND url_path like '/wiki/%s'
AND fetch_status = 200
Alternatively, you can upload such a query to s3://your_bucket/queries. It will be executed and produce a results file, which in turn triggers the download of the WARC files.
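
As a rough illustration of what can be done with a results file, the sketch below reads a CSV with the three columns above and turns each row into a URL plus an HTTP `Range` header value that covers exactly one WARC record. It is not the project's Lambda code: the function names `range_header` and `warc_ranges`, the sample row, and the use of the public `https://data.commoncrawl.org/` endpoint are assumptions made for the example.

```python
import csv
import io

# Assumed public HTTP endpoint for Common Crawl data (illustrative only; the
# project's Lambda may read from the commoncrawl S3 bucket directly instead).
BASE_URL = "https://data.commoncrawl.org/"


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering one WARC record.

    HTTP byte ranges are inclusive on both ends, so the last byte is
    offset + length - 1.
    """
    return f"bytes={offset}-{offset + length - 1}"


def warc_ranges(csv_text: str):
    """Yield (url, range_header_value) for every row of a results CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        offset = int(row["warc_record_offset"])
        length = int(row["warc_record_length"])
        yield BASE_URL + row["warc_filename"], range_header(offset, length)


# Hypothetical sample row; the file path is made up for the example.
sample = (
    '"warc_filename","warc_record_offset","warc_record_length"\n'
    '"crawl-data/EXAMPLE/segment/warc/example.warc.gz","1000","250"\n'
)
for url, rng in warc_ranges(sample):
    print(url, rng)
```

Requesting only the record's byte range avoids downloading the whole WARC file, which is why the offset and length columns are required alongside the filename.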
- Open the Athena query editor in the us-east-1 region where all the Common Crawl data is located
- Exploring Common Crawl with Athena
results([S3 /results])
warc([S3 /warcs])
user-- upload athena results csv --> results
queries([S3 /queries])
commoncrawl([S3 commoncrawl])
user-- upload athena query --> queries
queries-- notifies --> lambda1
queries<-- reads query --> lambda1
lambda1-- starts query --> athena
athena-- saves results --> results
results-- notifies --> lambda2
lambda2<-- reads results --> results
lambda2<-- request warcs --> commoncrawl
lambda2-- upload warcs --> warc
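
The notification edges in the diagram can be sketched as a small routing step: an S3 event arrives, the bucket and key are extracted, and the key prefix decides which action follows. This is a simplified sketch, not the project's implementation. The diagram shows two separate Lambdas, and the single `route` function here is only a compact way to show the mapping. The names `s3_objects` and `route` and the returned action strings are hypothetical.

```python
from urllib.parse import unquote_plus


def s3_objects(event: dict):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def route(key: str) -> str:
    """Map an object key to the diagram step that would handle it."""
    if key.startswith("queries/"):
        return "start_athena_query"  # lambda1 in the diagram
    if key.startswith("results/"):
        return "download_warcs"  # lambda2 in the diagram
    return "ignore"


# Hypothetical event, trimmed to the fields the sketch reads.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "your_bucket"},
                "object": {"key": "queries/my+query.sql"}}}
    ]
}
for bucket, key in s3_objects(event):
    print(bucket, key, route(key))
```

Ignoring keys outside `queries/` and `results/` matters in a setup like this: if uploads to the WARC prefix also triggered the download step, the pipeline could loop on its own output.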