Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Updated Dec 24, 2024 - Java
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Single Docker container running Heritrix 3, picking up jobs from a directory.
Dockerized Web Curator Tool with Heritrix 3 and pywb
Parse a Heritrix crawl.log into an XML sitemap