HEPcrawl is a harvesting library based on Scrapy (http://scrapy.org) for INSPIRE-HEP (http://inspirehep.net). It focuses on automatic and semi-automatic retrieval of new content from all the sources the site aggregates, in particular content from major and minor publishers in the field of High-Energy Physics.
The project is currently in an early stage of development.
We start by creating a virtual environment for our Python packages:
mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src
Now we grab the code and install it in development mode:
git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .
Development mode ensures that any changes you make to the sources are picked up automatically, so there is no need to reinstall the package after every change.
Finally, run the tests to make sure everything is set up correctly:
python setup.py test
Thanks to the command line tools provided by Scrapy, we can easily test the spiders as we are developing them. Here is an example using the simple sample spider:
cdvirtualenv src/hepcrawl
scrapy crawl Sample -a source_file=file://`pwd`/tests/responses/world_scientific/sample_ws_record.xml
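Under the hood, a spider like the one above parses the harvested XML into record fields. The following is a minimal, hypothetical sketch of that parsing step using only the standard library; the real spiders use Scrapy selectors and the actual record schema, and the tag names here are purely illustrative:

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a harvested record; real records from
# publishers use richer, publisher-specific markup.
SAMPLE = """<record>
  <title>A sample HEP paper</title>
  <author>Doe, J.</author>
  <author>Roe, R.</author>
</record>"""

def parse_record(xml_text):
    """Extract a title and author list from a simple record (illustrative tags)."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "authors": [author.text for author in root.findall("author")],
    }

record = parse_record(SAMPLE)
print(record["title"])    # -> A sample HEP paper
print(record["authors"])  # -> ['Doe, J.', 'Roe, R.']
```

In the actual spiders this extraction is done inside the spider's parse callback, which yields item objects that downstream pipelines turn into INSPIRE records.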
Thanks for contributing!