A crawl evaluation of Apache Nutch v1.12. We run our crawls on TACC Wrangler, an NSF-funded supercomputer, in both Hadoop and local mode, pushing the crawler to its limits to get the best throughput.
We are evaluating Nutch on all kinds of tasks: broad crawling, focused crawling, intelligent crawling, domain discovery, and more.
The project includes a sample crawling workspace for Wrangler that is both automated and portable. More details can be found in the respective README files.