This directory is reserved for custom scripts. Based on experience from the project, the following scripts are already in place and may be useful:
- crawl-logstats.sh - Extracts the crawl statistics from Nutch logs. Please see USAGE for more details.
- crawl-fetchstats.sh - Extracts the fetch statistics from Nutch segments. Please see USAGE for more details.
- memex_cca_esindex.py - Converts the Nutch Common Crawl Dump to CDRv2 format. Please see USAGE for more details.
- splitter.py - Splits the CDRv2 JSON into multiple JSONs based on target websites. Please see USAGE for more details.
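To illustrate the kind of per-website split that splitter.py is described as doing, here is a minimal sketch. It is not the script's actual code: it assumes the input is one JSON document per line and that each document carries its source address in a `url` field. The function name `split_by_host` and the `unknown` bucket are hypothetical.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse


def split_by_host(lines, url_field="url"):
    """Group JSON documents by the host name found in their URL field.

    Assumption: `lines` yields one JSON object per line, and `url_field`
    names the key that holds the document's source URL.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # Documents with no usable URL go to a hypothetical "unknown" bucket.
        host = urlparse(doc.get(url_field) or "").netloc.lower() or "unknown"
        groups[host].append(doc)
    return dict(groups)
```

Each resulting group could then be written to its own file, for example with `json.dump(docs, fh)` into a file named after the host. Check USAGE for the options the real script supports.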