This crawler scrapes news articles from DailyHunt.
To recreate the dataset:
- Download the files from the Hugging Face Hub with (a Python equivalent is sketched after the key list below):

  ```
  wget https://huggingface.co/datasets/rahular/varta/raw/main/varta/<split>/<split>.json
  ```
- Set `INFILE` in `settings.py` to the path of the file you want to recreate; change `OUTFILE` to the path where you want the data to be saved.
- The output file will contain the following keys (a loading snippet also follows this list):
  - `id`: unique identifier of the article, in the format "nxxxxxxxxx"
  - `headline`: headline of the article
  - `text`: the main content of the article
  - `url`: the DailyHunt url of the article
  - `source_media`: name of the publisher from which DailyHunt aggregates this article
  - `source_url`: the url of this article from the original publisher
  - `publication_date`: timestamp
  - `tags`: a list of categories that the article belongs to
  - `reactions`: a dictionary of the reactions from the readers
  - `word_count`: word count based on whitespace delimiting
  - `langCode`: language code following the two-letter ISO 639-1 convention
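If you prefer to fetch the split files from Python rather than with `wget`, a minimal helper might look like the sketch below. It is not part of this repo; the `download_split` function name is hypothetical, and you should check the Hub repo for the exact split names before calling it.

```python
# Hypothetical download helper: a Python equivalent of the wget command above.
import urllib.request


def download_split(split, out_path=None):
    """Fetch varta/<split>/<split>.json from the rahular/varta dataset repo."""
    url = (
        "https://huggingface.co/datasets/rahular/varta/raw/main/"
        f"varta/{split}/{split}.json"
    )
    out_path = out_path or f"{split}.json"
    urllib.request.urlretrieve(url, out_path)
    return out_path
```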
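To sanity-check a recreated output file against the keys listed above, something like the following works, assuming the crawler writes one JSON object per line (JSON Lines); if it writes a single JSON array instead, use `json.load(f)` on the whole file.

```python
import json

# Use the OUTFILE path you set in settings.py
with open("path/to/outfile.json") as f:
    articles = [json.loads(line) for line in f if line.strip()]

sample = articles[0]
print(sample["id"], sample["langCode"], sample["word_count"])
print(sample["headline"])
```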
The code is based on Scrapy and BeautifulSoup. Install both into your environment:

```
conda activate <env_name>
conda install -c conda-forge scrapy
conda install -c anaconda beautifulsoup4
```

and run:

```
scrapy crawl dh
```
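For orientation only, a minimal sketch of what a Scrapy + BeautifulSoup spider named `dh` could look like is shown below. This is not the repository's actual spider: the `DailyHuntSpider` class, the assumption that `INFILE` is JSON Lines with a `url` field, and the `<p>`-tag text extraction are all illustrative.

```python
import json

import scrapy
from bs4 import BeautifulSoup


class DailyHuntSpider(scrapy.Spider):
    name = "dh"  # matches `scrapy crawl dh`

    def start_requests(self):
        # INFILE is configured in settings.py as described above;
        # assumed here to hold one JSON record per line with a "url" field.
        infile = self.settings.get("INFILE")
        with open(infile) as f:
            for line in f:
                record = json.loads(line)
                yield scrapy.Request(
                    record["url"],
                    callback=self.parse_article,
                    cb_kwargs={"record": record},
                )

    def parse_article(self, response, record):
        # BeautifulSoup pulls the readable text out of the article page
        soup = BeautifulSoup(response.text, "html.parser")
        text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
        record["text"] = text
        record["word_count"] = len(text.split())
        yield record
```

Writing the scraped items to `OUTFILE` would typically be handled by a Scrapy item pipeline or feed export configured in `settings.py`; the exact mechanism here depends on the repo's code.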