Update FAZ parser #419
Thanks for keeping the parser updated
src/fundus/publishers/de/faz.py
class V2(BaseParser):
    _summary_selector = CSSSelector("div.header-teaser")
    _paragraph_selector = CSSSelector("div.body-elements > p")
Not all paragraphs are direct children of div.body-elements. In some cases, paragraphs are enclosed in another div, div.body-elements__paragraph (e.g. the first paragraph in this article: https://www.faz.net/aktuell/sport/fussball/champions-league/bvb-im-halbfinale-der-champions-league-koennen-wir-dortmund-jemals-ernst-nehmen-19659537.html).
In the same article there is a paragraph that is just a strong block enclosed in the above-mentioned extra div. This also does not get caught.
Also, live tickers aren't being parsed yet: https://www.faz.net/aktuell/politik/ukraine-liveticker-us-repraesentantenhaus-abstimmung-ueber-ukraine-hilfen-am-samstag-faz-19030454.html or https://www.faz.net/aktuell/politik/ausland/israel-krieg-im-liveticker-netanjahu-wir-treffen-unsere-eigenen-entscheidungen-faz-19589481.html
Good catch with the paragraphs. How about simply using .body-elements__paragraph as the paragraph selector?
About the live tickers: currently, Fundus has no intention of officially supporting live tickers, so not being able to parse them is fine with me.
Seems like a good solution.
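To make the difference between the two selectors concrete, here is a minimal, stdlib-only sketch. It does not use Fundus or lxml's CSSSelector; instead it mimics both selectors with ElementTree's limited XPath. The markup is made up and heavily simplified, and it assumes that the body-elements__paragraph class sits on the wrapper divs as well as on directly nested paragraphs, which the discussion above does not confirm.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real FAZ pages are more complex.
SAMPLE = """
<article>
  <div class="body-elements">
    <div class="body-elements__paragraph"><p>First paragraph, wrapped in an extra div.</p></div>
    <p class="body-elements__paragraph">Second paragraph, a direct child.</p>
    <div class="body-elements__paragraph"><strong>Bold-only paragraph.</strong></div>
  </div>
</article>
"""


def texts(elements):
    """Return the stripped text content of each element, including nested tags."""
    return ["".join(element.itertext()).strip() for element in elements]


root = ET.fromstring(SAMPLE)

# Rough equivalent of the CSS selector "div.body-elements > p":
# only <p> elements that are direct children of the container.
direct_children = texts(root.findall(".//div[@class='body-elements']/p"))

# Rough equivalent of ".body-elements__paragraph". Note that this XPath
# compares the whole class attribute, whereas a CSS class selector would
# also match elements carrying additional classes.
by_class = texts(root.findall(".//*[@class='body-elements__paragraph']"))

print(len(direct_children), "paragraph(s) via direct-child selector")
print(len(by_class), "paragraph(s) via class selector")
```

Under these assumptions, the direct-child variant only finds the second paragraph, while the class-based variant also picks up the wrapped paragraph and the strong-only block.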
        return []

    @attribute
    def title(self) -> Optional[str]:
Although this may be an edge case, it does not seem to be a 100% reliable option: the title cannot be parsed in this article: https://www.faz.net/podcasts/f-a-z-finanzen-immobilien/wie-gefaehrlich-ist-die-krise-im-nahen-osten-fuer-die-boerse-19657512.html It also doesn't have a publishing date.
Alternatively, we could consider blocking the URL if it's a podcast.
I added a generic function to get titles from the root element. I don't mind leaving the URLs. Users always have the option to filter articles based on attributes.
Imo, that option should only be used as a backup, since for most (maybe all) actual news articles the og:title element does exist and corresponds to the actual headline of the article, while the <title> element and the headline usually do not seem to align.
Hmm, I thought <title> would align better with the actual title. Good to know 👍
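The fallback order discussed in this thread (og:title first, <title> only as a backup) could look roughly like the sketch below. It is stdlib-only and not the Fundus implementation; the names _TitleCollector and extract_title are hypothetical and exist purely for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleCollector(HTMLParser):
    """Collects the og:title meta content and the <title> text of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.og_title: Optional[str] = None
        self.html_title: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:title":
            self.og_title = attributes.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append instead of overwriting.
        if self._in_title:
            self.html_title = (self.html_title or "") + data


def extract_title(html: str) -> Optional[str]:
    """Prefer og:title; fall back to <title>; return None if both are missing."""
    collector = _TitleCollector()
    collector.feed(html)
    for candidate in (collector.og_title, collector.html_title):
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

With this ordering, a regular article that has both elements yields the og:title value, while a page without og:title (such as the podcast page mentioned above, if it indeed lacks the element) would still get a title from <title>.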
Your reply came faster than expected. I had a little typo in the commit suggestion, but you had already committed it before I edited it.
Co-authored-by: Adrian Breiding <ad123br@gmail.com>
Yeah, I'm starting to think that code suggestions are a bad practice 😅 That's what happens roughly 90% of the time I commit one ... except for those related to docstrings.
This updates the FAZ parser to the latest layout changes.