Added UK Publisher The Sun #445
@@ -0,0 +1,44 @@
import datetime
from typing import List, Optional

from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
)


class TheSunParser(ParserProxy):
    class V1(BaseParser):
        _summary_selector = CSSSelector("div[data-gu-name='standfirst'] p")
        _paragraph_selector = CSSSelector("div.article__content > p")
        _sub_headline_selector = CSSSelector("div.toplist_container__jpTyX thesun_container__fty3s > h2")
Ah yes, but this is the expected behavior since you updated the selectors. If you run pytest now, the extracted content will differ from the test files you generated earlier. If you run
You are right, I am sorry for the inconvenience. I committed and pushed the feature change and the newly generated test :)
No worries :)

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                summary_selector=self._summary_selector,
                paragraph_selector=self._paragraph_selector,
                subheadline_selector=self._sub_headline_selector,
            )

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.ld.bf_search("headline")

        @attribute
        def topics(self) -> List[str]:
            return generic_topic_parsing(self.precomputed.meta.get("article:tag"))
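The parser above reads several attributes through `self.precomputed.ld.bf_search(...)`. Judging by the name, this looks like a breadth-first key search over the page's nested JSON-LD data. The following is a minimal sketch under that assumption, not Fundus's actual implementation; the standalone `bf_search` function and the sample `ld` payload are hypothetical.

```python
from collections import deque
from typing import Any, Optional


def bf_search(data: Any, key: str) -> Optional[Any]:
    """Return the value of the shallowest occurrence of `key`, or None."""
    queue = deque([data])
    while queue:
        node = queue.popleft()
        if isinstance(node, dict):
            if key in node:
                return node[key]
            # Visit nested values only after all nodes on the current level.
            queue.extend(node.values())
        elif isinstance(node, list):
            queue.extend(node)
    return None


# Hypothetical JSON-LD payload; not taken from a real page of The Sun.
ld = {
    "@graph": [
        {"@type": "WebPage", "name": "Example page"},
        {
            "@type": "NewsArticle",
            "headline": "Example headline",
            "author": {"@type": "Person", "name": "Jane Doe"},
            "datePublished": "2024-04-22T22:04:00+01:00",
        },
    ]
}

print(bf_search(ld, "headline"))  # Example headline
print(bf_search(ld, "name"))      # Example page (shallower than the author's name)
```

A breadth-first order prefers the shallowest match, which is why `"name"` resolves to the page name rather than the nested author name in this sketch.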
@@ -0,0 +1,49 @@
{
    "V1": {
        "authors": [
            "Tanyel Mustafa"
        ],
        "body": {
            "summary": [],
            "sections": [
                {
                    "headline": [],
                    "paragraphs": [
                        "REBECCA Cooke is the childhood sweetheart of England footballer Phil Foden.",
                        "And the couple have announced some great news with the pair expecting a new addition to the family.",
                        "Rebecca Cooke is the long-term partner of Manchester City midfielder Phil Foden.",
                        "Rebecca is thought to be 22 years old and the mother of two children with Phil.",
                        "She tends to keep out the spotlight and has her Instagram account currently set private, though it does seem to suggest that she goes by the nickname Becca.",
                        "The exact time at which they started dating is unknown, but they have been together since being teenagers.",
                        "At the age of 18 she became a mother to their son, Ronnie.",
                        "A fan account of the couple (@beccafodenx) on Instagram shows the two together, along with a closer look at the blonde bombshell.",
                        "Phil and Rebecca have a son called Ronnie, 4, and a daughter named True, 1.",
                        "In April 2024, the couple announced they are expecting a third child.",
                        "Speaking to Manchester City at the time of the birth of his son, Phil said: \"I was there for the birth. I walked out of the room, gave it a little tear and then went back in like nothing happened.",
                        "\"I’m not one for crying in front of people. I like to be on my own, but I was there in the room, watched it happen and it was a special moment.",
                        "\"Your life changes.\"",
                        "He continued, speaking of the things he misses Ronnie doing due to football training: \"There are things you miss when you’re not there because you’ve got an away game.",
                        "\"I was there when he started crawling, but I think I was in London when he started to walk.",
                        "\"Now he’s getting about and walking everywhere, so you have to have eyes in the back of your head or he starts running off.",
                        "\"It’s unfortunate to miss things like that but it’s a sacrifice that he’ll appreciate when he’s older.\""
                    ]
                }
            ]
        },
        "publishing_date": "2024-04-22 22:04:00+01:00",
        "title": "Who is Man City star Phil Foden’s girlfriend Rebecca Cooke and how many children do couple have?...",
        "topics": [
            "Pep Guardiola",
            "Phil Foden",
            "Celebrity relationships and break ups",
            "EVG",
            "Instagram",
            "Manchester City transfer news",
            "Pregnancy and childbirth",
            "Manchester",
            "England",
            "Manchester City",
            "Manchester United"
        ]
    }
}
Some articles also contain subheadlines, such as this one: https://www.thesun.co.uk/betting/21748039/best-monopoly-live-casinos/. It would be great if you could also add a subheadline selector.
As requested, I added a subheadline selector and successfully executed
python -m scripts.generate_parser_test_files -p TheSun -o
I also ran black, isort, mypy and pytest. All of them passed on my local machine without any errors :)
Perfect, thanks a lot :)
Did I change everything that was requested, or did I miss something? :)
I think I might've misunderstood what the subheadline of this article is. Could you maybe point it out for me please? :)
I think I picked the wrong subheadline. I changed the subheadline selector, re-generated the test files, and ran pytest, black, isort and mypy. PyCharm tells me there are no file changes, so I can't commit or push the tests.
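The point raised in this thread, that a changed selector makes freshly extracted content differ from the stored test file until the file is re-generated, can be illustrated with a small comparison. This is only a sketch: `stored`, `extracted` and `changed_attributes` are hypothetical stand-ins, and the real test files come from `python -m scripts.generate_parser_test_files -p TheSun -o`.

```python
import json
from typing import Any, Dict, List

# Stand-in for a stored test file generated before the selector change.
stored = json.loads(
    '{"V1": {"title": "Example", "body": {"sections": '
    '[{"headline": [], "paragraphs": ["First."]}]}}}'
)

# Hypothetical result of re-running the parser after the selector change.
extracted = {
    "V1": {
        "title": "Example",
        "body": {
            "sections": [{"headline": ["CAN IT BE REAL?"], "paragraphs": ["First."]}]
        },
    }
}


def changed_attributes(expected: Dict[str, Any], actual: Dict[str, Any]) -> List[str]:
    """Return the sorted attribute names whose values differ between the two results."""
    return sorted(
        key
        for key in expected.keys() | actual.keys()
        if expected.get(key) != actual.get(key)
    )


print(changed_attributes(stored["V1"], extracted["V1"]))  # ['body']
```

Once the stored file is re-generated from the updated parser, the same comparison returns an empty list.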
I just reviewed it and, judging by the subheadline selector you chose, I think you got the correct idea. A subheadline in Fundus is a line of text separating paragraphs into logical entities. For example, in https://www.thesun.co.uk/news/27470413/ukraine-torpedo-submarine-black-sea-battle/ the line
CAN IT BE REAL?
would be considered a subheadline. In this case, I would suggest something like this as the subheadline selector: div.article__content > h2.wp-block-heading
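To make the idea of a subheadline splitting paragraphs into sections concrete, here is a standard-library sketch that mimics `div.article__content > p` and `div.article__content > h2.wp-block-heading`. It is not how Fundus extracts bodies (the PR uses `lxml` selectors with `extract_article_body_with_selector`); the `SectionCollector` class and the sample HTML are hypothetical, and the output shape only mirrors the `sections` structure of the test file in this PR. As a side note, in the selector currently in the diff, `thesun_container__fty3s` has no leading dot, so CSS reads it as a tag name rather than a class.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SectionCollector(HTMLParser):
    """Group direct <p> children of div.article__content by h2.wp-block-heading."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: List[Dict[str, List[str]]] = [{"headline": [], "paragraphs": []}]
        self._depth = 0  # 0 = outside the content div, 1 = inside it
        self._kind: Optional[str] = None  # what the current direct child is
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a matching end tag
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if tag == "div" and "article__content" in classes:
                self._depth = 1
            return
        self._depth += 1
        if self._depth == 2:  # direct child, like the CSS ">" combinator
            self._text = []
            if tag == "h2" and "wp-block-heading" in classes:
                self._kind = "headline"
            elif tag == "p":
                self._kind = "paragraph"
            else:
                self._kind = None

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        if self._depth == 2 and self._kind:
            text = "".join(self._text).strip()
            if self._kind == "headline":
                # A subheadline starts a new section.
                self.sections.append({"headline": [text], "paragraphs": []})
            else:
                self.sections[-1]["paragraphs"].append(text)
            self._kind = None
        self._depth -= 1

    def handle_data(self, data):
        if self._kind:
            self._text.append(data)


# Hypothetical markup loosely modelled on the selectors discussed above.
html = """
<div class="article__content">
  <p>Intro paragraph.</p>
  <h2 class="wp-block-heading">CAN IT BE REAL?</h2>
  <p>Paragraph under the <a href="#">subheadline</a>.</p>
  <div class="related"><p>Ignored: not a direct child.</p></div>
</div>
"""

collector = SectionCollector()
collector.feed(html)
print(collector.sections)
```

The first section keeps an empty `headline` list, matching the shape of the generated test file when paragraphs precede any subheadline.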