A very simple news crawler in Python. Developed at Humboldt University of Berlin.
Fundus is:
- A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code, be it from live websites or the CC-NEWS dataset.
- An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!
To install from pip, simply do:
pip install fundus
Fundus requires Python 3.8+.
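If you are unsure whether your environment meets these requirements, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `check_environment` is our own and not part of Fundus.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # minimum version stated in this README


def check_environment(package="fundus"):
    """Return (python_ok, installed_version_or_None) for the given package."""
    python_ok = sys.version_info >= MIN_PYTHON
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        installed = None
    return python_ok, installed


if __name__ == "__main__":
    python_ok, installed = check_environment()
    print(f"Python 3.8+ available: {python_ok}")
    print(f"fundus version: {installed or 'not installed'}")
```

If the second line reports "not installed", run the pip command above in the same environment.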
Let's use Fundus to crawl 2 articles from publishers based in the US.
from fundus import PublisherCollection, Crawler
# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
That's it!
If you run this code, it should print out something like this:
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees
through committee votes on Thursday thanks to a last-minute [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)
This printout tells you that you successfully crawled two articles!
For each article, the printout details:
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
- the news source it is "From"
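If you would rather work with these fields in code than read the printout, you could copy them into plain dictionaries. The sketch below assumes that article objects expose `title` and `plaintext` attributes (see Tutorial 3 for the actual Article class); the helper `article_to_record` and the stand-in object are hypothetical and only there so the example runs without network access.

```python
from types import SimpleNamespace


def article_to_record(article, max_chars=80):
    """Copy a few fields of an article-like object into a plain dict.

    Assumes the attributes `title` and `plaintext`; missing ones become None.
    Long texts are shortened in the style of the printout above.
    """
    title = getattr(article, "title", None)
    text = getattr(article, "plaintext", None)
    if text is not None and len(text) > max_chars:
        text = text[:max_chars].rstrip() + " [...]"
    return {"title": title, "text": text}


# Stand-in object, NOT a real Fundus article. With Fundus installed you
# would pass the articles yielded by crawler.crawl(...) instead.
demo = SimpleNamespace(title="Example headline", plaintext="word " * 40)
record = article_to_record(demo)
print(record["title"])
print(record["text"])
```

Using `getattr` with a default keeps the helper tolerant of articles where a field could not be extracted.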
Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
from fundus import PublisherCollection, Crawler
# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
Fundus can also crawl articles from the CC-NEWS dataset via the CCNewsCrawler. If you're not familiar with CC-NEWS, check out their paper.
from fundus import PublisherCollection, CCNewsCrawler
# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
We provide quick tutorials to get you started with the library:
- Tutorial 1: How to crawl news with Fundus
- Tutorial 2: How to crawl articles from CC-NEWS
- Tutorial 3: The Article Class
- Tutorial 4: How to filter articles
- Tutorial 5: Advanced topics
- Tutorial 6: Logging
If you wish to contribute, check out the contribution tutorials.
You can find the publishers currently supported here.
Also: Adding a new publisher is easy - consider contributing to the project!
Check out our evaluation benchmark.
Scraper | Precision | Recall | F1-Score |
---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 |
Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 |
news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 |
BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 |
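Note that the F1-Score column is not simply the harmonic mean of the Precision and Recall columns. For the Fundus row, the harmonic mean of 99.89 and 96.75 is about 98.29, while the table reports 97.69. A gap like this typically appears when scores are computed per article and then averaged, which the ± values suggest; this is our assumption about the benchmark, not a statement from it. The sketch below illustrates the effect with hypothetical per-article numbers.

```python
from statistics import mean


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 computed from the averaged Fundus columns in the table above.
f1_of_means = f1(99.89, 96.75)
print(round(f1_of_means, 2))

# Hypothetical per-article (precision, recall) pairs, NOT benchmark data.
per_article = [(100.0, 100.0), (100.0, 100.0), (100.0, 50.0)]
mean_p = mean(p for p, _ in per_article)
mean_r = mean(r for _, r in per_article)
mean_of_f1 = mean(f1(p, r) for p, r in per_article)

print(round(f1(mean_p, mean_r), 2))  # F1 of the averaged columns
print(round(mean_of_f1, 2))          # average of the per-article F1 scores
```

In the toy data, a single article with low recall pulls the averaged per-article F1 below the F1 of the averaged columns, the same direction as the gap in the Fundus row.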
Please cite the following paper when using Fundus or building upon our work:
@misc{dallabetta2024fundus,
title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions},
author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
year={2024},
eprint={2403.15279},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please email your questions or comments to Max Dallabetta.
Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.