Initial support for Italian publishers, starting with La Repubblica #670

Merged 12 commits on Jan 2, 2025

@ruggsea (Contributor) commented Dec 30, 2024

Add La Repubblica (Italian News)

Hey! I've added support for La Repubblica, one of Italy's leading newspapers. I would like this project to cover Italian newspapers too, and I figured I should start with this one.

Additions

  • Created a new parser for La Repubblica (www.repubblica.it)
  • Added Italy (IT) as a new publisher group
  • Set up RSS feed and sitemap crawling
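
For readers unfamiliar with sitemap crawling, here is a minimal, standard-library sketch of the general idea: pulling article URLs out of a sitemap document. It is not the code from this PR and does not use the Fundus API; the function name `extract_sitemap_urls` and the sample URLs are hypothetical, and the only assumption about the input is that it follows the sitemaps.org schema.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sample; the real sitemap contents are not shown in this PR.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.repubblica.it/example/article-1/</loc></url>
  <url><loc>https://www.repubblica.it/example/article-2/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

A real crawler would additionally fetch the sitemap over HTTP and follow nested sitemap indexes, which this sketch leaves out.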

What can it parse?

The parser can extract:

  • Article titles
  • Full article body
  • Author information
  • Publishing dates
  • Paywall status
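
As a rough illustration of how fields like these can be read from an article page, the sketch below extracts them from schema.org JSON-LD metadata using only the Python standard library. It is not the parser from this PR: the names `extract_article_fields` and `_LdJsonCollector`, the sample HTML, and the assumption that the page embeds a `NewsArticle` JSON-LD block are all hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class _LdJsonCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld = False
        self._buf: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self.blocks.append("".join(self._buf))
            self._in_ld = False


def extract_article_fields(html: str) -> Dict[str, Any]:
    """Return title, body, authors, date and paywall flag (hypothetical helper)."""
    collector = _LdJsonCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed metadata blocks
        if isinstance(data, dict) and data.get("@type") == "NewsArticle":
            authors = data.get("author", [])
            if isinstance(authors, dict):  # a single author may be a bare object
                authors = [authors]
            return {
                "title": data.get("headline"),
                "body": data.get("articleBody"),
                "authors": [a.get("name") for a in authors if isinstance(a, dict)],
                "publishing_date": data.get("datePublished"),
                # Assumption: a missing flag is treated as freely accessible.
                "free_access": data.get("isAccessibleForFree", True),
            }
    return {}


# Made-up page; real La Repubblica markup may differ.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Titolo di esempio",
 "author": {"@type": "Person", "name": "Mario Rossi"},
 "datePublished": "2024-12-30T10:00:00+01:00",
 "isAccessibleForFree": false}
</script></head><body></body></html>"""

if __name__ == "__main__":
    print(extract_article_fields(SAMPLE_HTML))
```

In practice the article body usually has to be selected from the page markup itself, since many publishers omit `articleBody` from their metadata.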

Missing

  • Topics

Testing

Created unit tests, all of which pass; ran the script to add La Repubblica to the Supported Sources docs. Ran linting and mypy checks as required.

@MaxDall (Collaborator) left a comment

@ruggsea Thanks a lot for pushing Italian publishers forward. That's a great addition and your code looks good! Just some minor adjustments.

Review comments (outdated, resolved):

  • src/fundus/publishers/it/__init__.py
  • src/fundus/publishers/it/la_repubblica.py (3 comments)

…h to get topics, modified the dynamic sitemap handling to conform to tagesspiel implementation, removed redundant free access check
@ruggsea (Contributor, Author) commented Dec 30, 2024

Thank you for your feedback. I have implemented your suggestions in my latest commit.

@MaxDall (Collaborator) left a comment

@ruggsea Huge thanks for adding this :)

@MaxDall MaxDall merged commit bc58f98 into flairNLP:master Jan 2, 2025
5 checks passed