Add support for CC-NEWS dataset #331
Thank you very much for this great feature addition. Especially the multiprocessing part is very interesting and offered some new insights for me :).
Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>
The long-awaited and needed feature :).
This PR adds functionality to crawl the CC-NEWS dataset.
The file structure mirrors the structure used in the main scraping directory. This code is currently mostly separate from the main files because those still use `asyncio` and are thus incompatible. Once the main crawler is rewritten, the two should be merged (meaning the code, not this PR).
Here is a quick tutorial on how to use the crawler.
Basic usage:
With date range:
Using the above block of code, I was able to crawl 20,000 articles in 1 minute 35 seconds on a machine with 10 GB bandwidth and 72 cores.
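For readers unfamiliar with how a date range maps onto CC-NEWS, here is a minimal standalone sketch of one way to restrict a crawl to a time window. It is not the crawler's API from this PR. It assumes that CC-NEWS WARC paths embed a 14-digit timestamp in the form `CC-NEWS-YYYYMMDDhhmmss-NNNNN.warc.gz`, and the helper name `filter_warc_paths` is hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List

# Assumption: CC-NEWS WARC file names look like
#   CC-NEWS-YYYYMMDDhhmmss-NNNNN.warc.gz
_TIMESTAMP = re.compile(r"CC-NEWS-(\d{14})-\d+\.warc\.gz$")


def filter_warc_paths(paths: Iterable[str], start: datetime, end: datetime) -> List[str]:
    """Hypothetical helper: keep paths whose timestamp lies in [start, end)."""
    selected = []
    for path in paths:
        match = _TIMESTAMP.search(path)
        if match is None:
            # Skip anything that does not look like a CC-NEWS WARC file.
            continue
        stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        if start <= stamp < end:
            selected.append(path)
    return selected


if __name__ == "__main__":
    example_paths = [
        "crawl-data/CC-NEWS/2020/01/CC-NEWS-20200101000000-00001.warc.gz",
        "crawl-data/CC-NEWS/2020/02/CC-NEWS-20200215120000-00002.warc.gz",
    ]
    # Select only the files captured in February 2020.
    print(filter_warc_paths(example_paths, datetime(2020, 2, 1), datetime(2020, 3, 1)))
```

Filtering on the file name before downloading keeps the work proportional to the requested window, and the selected paths can then be distributed across worker processes.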
closes #44
closes #45