Crawl - Evaluation

This repository evaluates crawling with Apache Nutch v1.12. We run our crawls on TACC Wrangler, an NSF-funded supercomputer, in both Hadoop and local mode, pushing the crawler to its limits to get the best throughput.

We are evaluating Nutch on all kinds of crazy stuff: broad crawling, focused crawling, intelligent crawling, domain discovery, and many more.

The project includes a sample crawling workspace for Wrangler that is both automated and portable. More details can be found in the respective README files.

Repository contents: nutch, scripts, workspace, .gitignore, LICENSE, README.md

karanjeets/PCF-Nutch-on-Wrangler
