Update FAZ parser #419

Merged
merged 5 commits into from
Apr 19, 2024

Conversation

Collaborator

@MaxDall MaxDall commented Apr 17, 2024

This updates the FAZ parser to the latest layout changes.

Collaborator

@addie9800 addie9800 left a comment

Thanks for keeping the parser updated


class V2(BaseParser):
    _summary_selector = CSSSelector("div.header-teaser")
    _paragraph_selector = CSSSelector("div.body-elements > p")
Collaborator

Not all paragraphs are direct children of div.body-elements. In some cases, paragraphs are enclosed in another div, div.body-elements__paragraph (e.g. the first paragraph in this article: https://www.faz.net/aktuell/sport/fussball/champions-league/bvb-im-halbfinale-der-champions-league-koennen-wir-dortmund-jemals-ernst-nehmen-19659537.html).
In the same article there is a paragraph that is just a strong block enclosed in the above-mentioned extra div. This is not caught either.

Collaborator Author

@MaxDall MaxDall Apr 18, 2024

Good catch with the paragraphs. How about simply using .body-elements__paragraph as a paragraph selector?

About the live ticker: Currently, Fundus has no intention to officially support live tickers, so not being able to parse them is fine with me.

Collaborator

Seems like a good solution.
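
The difference between the two selectors discussed above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the markup is hypothetical and simplified (it assumes every paragraph element carries the `body-elements__paragraph` class), and `xml.etree.ElementTree` path expressions stand in for the `CSSSelector` objects used in the parser. The exact-match `[@class='...']` predicate only behaves like a CSS class selector here because each element has a single class.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real FAZ page is more complex.
SAMPLE = """
<article>
  <div class="body-elements">
    <div class="body-elements__paragraph"><p>First paragraph, wrapped in an extra div.</p></div>
    <p class="body-elements__paragraph">Second paragraph, a direct child.</p>
    <div class="body-elements__paragraph"><strong>Paragraph that is only a strong block.</strong></div>
  </div>
</article>
"""

root = ET.fromstring(SAMPLE)


def text_of(element: ET.Element) -> str:
    """Concatenate all text inside an element, including nested tags."""
    return "".join(element.itertext()).strip()


# Rough equivalent of "div.body-elements > p": only direct <p> children match.
direct_children = [text_of(p) for p in root.findall(".//div[@class='body-elements']/p")]

# Rough equivalent of ".body-elements__paragraph": matches by class, at any depth.
by_class = [text_of(el) for el in root.findall(".//*[@class='body-elements__paragraph']")]

print(len(direct_children), len(by_class))
```

In this sample the child selector finds only the second paragraph, while the class-based selector finds all three, including the wrapped paragraph and the strong-only block.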

        return []

    @attribute
    def title(self) -> Optional[str]:
Collaborator

Although it may be an edge case, this does not seem to be a 100% reliable option: the title cannot be parsed in this article: https://www.faz.net/podcasts/f-a-z-finanzen-immobilien/wie-gefaehrlich-ist-die-krise-im-nahen-osten-fuer-die-boerse-19657512.html It also doesn't have a publishing date.
Alternatively, we could consider blocking the URL if it's a podcast.
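
For illustration, blocking podcast URLs could look roughly like the sketch below. The helper `is_podcast_url` is hypothetical and not part of Fundus; it assumes that podcast pages can be recognized by a leading `podcasts` path segment, as in the URL above.

```python
from urllib.parse import urlparse


def is_podcast_url(url: str) -> bool:
    """Return True if the first path segment of the URL is 'podcasts'."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return bool(segments) and segments[0] == "podcasts"


podcast = (
    "https://www.faz.net/podcasts/f-a-z-finanzen-immobilien/"
    "wie-gefaehrlich-ist-die-krise-im-nahen-osten-fuer-die-boerse-19657512.html"
)
article = (
    "https://www.faz.net/aktuell/sport/fussball/champions-league/"
    "bvb-im-halbfinale-der-champions-league-koennen-wir-dortmund-jemals-ernst-nehmen-19659537.html"
)
print(is_podcast_url(podcast), is_podcast_url(article))
```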

Collaborator Author

I added a generic function to get titles from the root element. I don't mind leaving the URLs. Users always have the option to filter articles based on attributes.
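
As a rough sketch of what filtering on attributes means in practice: the `ParsedArticle` class and the `with_required_attributes` helper below are hypothetical stand-ins, not the Fundus API. They only show the idea of dropping articles (such as the podcast page) that lack a title or a publishing date.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, Optional


@dataclass
class ParsedArticle:
    """Hypothetical stand-in for a parsed article, not the Fundus Article class."""

    url: str
    title: Optional[str] = None
    publishing_date: Optional[datetime] = None


def with_required_attributes(articles: Iterable[ParsedArticle]) -> Iterator[ParsedArticle]:
    """Yield only articles that have both a title and a publishing date."""
    for article in articles:
        if article.title and article.publishing_date is not None:
            yield article


articles = [
    ParsedArticle("https://example.com/news", "Headline", datetime(2024, 4, 17)),
    ParsedArticle("https://example.com/podcast"),  # no title, no publishing date
]
complete = list(with_required_attributes(articles))
print([article.url for article in complete])
```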

Collaborator

Imo, that option should just be used as a backup, since for most (maybe all) actual news articles the og:title element exists and corresponds to the actual headline of the article, while the <title> element and the headline usually do not seem to align.

Collaborator Author

Hmm, I thought <title> would align better with the actual title. Good to know 👍
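
The ordering discussed in this thread (prefer og:title, fall back to <title> only when it is missing) can be sketched as follows. This is a minimal standard-library illustration, not the code merged in this PR; the names `_TitleCollector` and `extract_title` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TitleCollector(HTMLParser):
    """Collects the og:title meta content and the text of the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.og_title: Optional[str] = None
        self.title_parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:title":
            self.og_title = attributes.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> Optional[str]:
    """Prefer og:title; use the <title> element only as a backup."""
    collector = _TitleCollector()
    collector.feed(html)
    if collector.og_title and collector.og_title.strip():
        return collector.og_title.strip()
    fallback = "".join(collector.title_parts).strip()
    return fallback or None


news = '<html><head><title>Headline | Site</title><meta property="og:title" content="Headline"></head></html>'
podcast = "<html><head><title>Podcast episode | Site</title></head></html>"
print(extract_title(news), extract_title(podcast))
```

With this ordering, pages without an og:title element, such as the podcast page mentioned above, still get a title, while regular articles keep the headline from og:title.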

Collaborator

Your reply came faster than expected. I had a little typo in the commit suggestion, but you had already committed it before I edited it.

MaxDall and others added 2 commits April 19, 2024 11:15
Co-authored-by: Adrian Breiding <ad123br@gmail.com>
Collaborator Author

MaxDall commented Apr 19, 2024

Yeah, I'm starting to think that code suggestions are a bad practice 😅 That's what happens roughly 90% of the time I commit one ... except for those related to docstrings.

@MaxDall MaxDall requested a review from addie9800 April 19, 2024 09:33
@MaxDall MaxDall merged commit 0ee27c3 into master Apr 19, 2024
5 checks passed
@MaxDall MaxDall deleted the fix-faz branch April 19, 2024 09:37
2 participants