Add support for CC-NEWS dataset #331

MaxDall · 2024-01-10T21:21:57Z

This PR adds functionality to crawl the CC-NEWS dataset.

The file structure mirrors the one used in the main scraping directory. For now, this code is kept mostly separate from the main files, because those still use asyncio and are therefore incompatible. Once the main crawler is rewritten, the two code bases should be combined (this refers to the code, not to merging this PR).

Here is a quick tutorial on how to use the crawler.

Basic usage:

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)

With date range:

from fundus import CCNewsCrawler, PublisherCollection
from datetime import datetime

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(start=datetime(2020, 1, 1), end=datetime(2020, 3, 1), max_articles=20000):
    print(article)

Using the above block of code, I was able to crawl 20000 articles in 1:35 minutes on a machine with 10 GB bandwidth and 72 cores.
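
For readers wondering how a `start`/`end` range like the one in the date-range example could map onto CC-NEWS files, here is a minimal sketch. It assumes that each WARC file name carries its creation timestamp in the form `CC-NEWS-<YYYYMMDDhhmmss>-<serial>.warc.gz`. The helper name `filter_warc_paths` and the half-open interval `[start, end)` are assumptions made for illustration. This is not necessarily how the code in this PR selects files.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, Optional

# Assumed file-name layout: CC-NEWS-<YYYYMMDDhhmmss>-<serial>.warc.gz
_WARC_TIMESTAMP = re.compile(r"CC-NEWS-(\d{14})-\d+\.warc\.gz$")


def filter_warc_paths(
    paths: Iterable[str],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> Iterator[str]:
    """Yield only those paths whose embedded timestamp lies in [start, end).

    Hypothetical helper, not part of fundus. Paths that do not match the
    assumed naming scheme are skipped.
    """
    for path in paths:
        match = _WARC_TIMESTAMP.search(path)
        if match is None:
            continue  # not a CC-NEWS WARC file under the assumed scheme
        created = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        if start is not None and created < start:
            continue  # file was created before the requested range
        if end is not None and created >= end:
            continue  # file was created at or after the end of the range
        yield path
```

Filtering on the file name alone avoids downloading files that cannot fall into the requested range. Note that the timestamp describes the file, not the individual articles inside it.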

closes #44
closes #45

dobbersc

Thank you very much for this great feature addition. The multiprocessing part in particular is very interesting and gave me some new insights :).

dobbersc

A long-awaited and much-needed feature :).

MaxDall added 11 commits December 13, 2023 15:01

add CCNews warc iterator

55ea1c5

rename HTMLSource -> FundusSource

7482d4d

add CCNewsCrawler

64def94

spellchecking

6ea30e4

html file went missing

bb83cf2

remove leftover assertion

a6f1bd1

remove leftover function

8919371

replace cavant with pool

d8c7a1a

fix a bug where all filters were applied at once

f6a09e8

make multiprocessing optional but default

d06ae43

update README.md

d694741

MaxDall added the feature label Jan 10, 2024

MaxDall added 10 commits January 10, 2024 22:25

add requests to project dependencies

0899b5d

add tqdm, fastwarc, ftfy to project dependencies

c766a1c

adjust to new ftfy version

29bc92a

remove leftover attribute

c2d070a

some logic tweaks

05ffe2d

fixed a bug regarding date ranges

1bfc8e9

remove leftover

675b7b0

add dill serialization to allow unpicklable functions as crawl arguments.

fcfb42b

refactor _queue_wrapper

bbe6993

fix exception msg

9da4f9f

dobbersc requested changes Jan 23, 2024

MaxDall and others added 7 commits January 23, 2024 18:08

Apply suggestions from code review

c5e6d60

Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>

resolve review comments

e716fa3

add CCNewsCrawler to tutorials

39ec846

make ParamSpec P private

594a2fa

Apply suggestions from code review

dc2e84b

Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>

add comment about parallel crawl

6afd5e5

rename extraction -> extracted

e07947d

MaxDall added 3 commits January 24, 2024 15:37

Merge remote-tracking branch 'origin/add_cc_news' into add_cc_news

a7a5343

fix mypy

0a3a4b9

fix docstrings

20dfa9f

dobbersc reviewed Jan 28, 2024

src/fundus/scraping/common_crawl/pipeline.py (outdated)

src/fundus/scraping/common_crawl/pipeline.py (outdated)

dobbersc reviewed Jan 28, 2024

src/fundus/scraping/common_crawl/pipeline.py (outdated)

MaxDall mentioned this pull request Jan 29, 2024

Refactor doc strings according to google specifications #339

Open

MaxDall and others added 2 commits January 29, 2024 18:22

Spellchecking

2ab29c2

Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>

Finish review comments

b6b8eae

dobbersc reviewed Jan 30, 2024

Clarify log messages

899c4c9

Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>

dobbersc approved these changes Jan 30, 2024

MaxDall merged commit 358d229 into master Jan 30, 2024
5 checks passed

MaxDall deleted the add_cc_news branch January 30, 2024 15:23

MaxDall mentioned this pull request Feb 1, 2024

Fixes broken tutorial links #348

Merged
