Add support for CC-NEWS dataset #331

Merged: 34 commits, Jan 30, 2024

Conversation

Collaborator

@MaxDall MaxDall commented Jan 10, 2024

This PR adds functionality to crawl the CC-NEWS dataset.

The file structure mirrors that of the main scraping directory. For now, this code is mostly kept separate from the main files, because those still use asyncio and are therefore incompatible. Once the main crawler is rewritten, the two should be merged (meaning the code, not this PR).

Here is a quick tutorial on how to use the crawler.

Basic usage:

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)

With date range:

from fundus import CCNewsCrawler, PublisherCollection
from datetime import datetime

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(start=datetime(2020, 1, 1), end=datetime(2020, 3, 1), max_articles=20000):
    print(article)

Using the code block above, I crawled 20,000 articles in 1:35 minutes on a machine with 72 cores and 10 GB bandwidth.
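For reference, here is a small back-of-the-envelope sketch of the throughput these figures imply. It assumes that "1:35 minutes" means 1 minute and 35 seconds, and that all 72 cores were in use; the PR states neither explicitly, so treat the per-core number as a rough lower bound rather than a benchmark.

```python
from datetime import timedelta

# Figures reported in the PR description.
articles = 20_000
# Assumption: "1:35 minutes" is read as 1 minute 35 seconds.
elapsed = timedelta(minutes=1, seconds=35)
# Assumption: all 72 cores took part in the crawl.
cores = 72

seconds = elapsed.total_seconds()
articles_per_second = articles / seconds
articles_per_core_second = articles_per_second / cores

print(f"{articles_per_second:.1f} articles/s overall")       # 210.5 articles/s overall
print(f"{articles_per_core_second:.2f} articles/s per core")  # 2.92 articles/s per core
```

Under these assumptions the crawl runs at roughly 210 articles per second overall, or about 3 articles per second per core.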

closes #44
closes #45

@MaxDall MaxDall added the feature label Jan 10, 2024
Collaborator

@dobbersc dobbersc left a comment


Thank you very much for this great feature addition. Especially the multiprocessing part is very interesting and offered some new insights for me :).

MaxDall and others added 7 commits January 23, 2024 18:08
MaxDall and others added 2 commits January 29, 2024 18:22
Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>
Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>
Collaborator

@dobbersc dobbersc left a comment


The long-awaited and needed feature :).

@MaxDall MaxDall merged commit 358d229 into master Jan 30, 2024
5 checks passed
@MaxDall MaxDall deleted the add_cc_news branch January 30, 2024 15:23
Development

Successfully merging this pull request may close these issues.

Redesign stream support for CC-NEWS crawler
Adjust cc-news crawler to current project state