[TheGuardian] Add Extractor for podcasts #8535
Converted this PR to draft in order to add the playlist extraction functionality.
yt_dlp/extractor/theguardian.py
title = self._generic_title(url, webpage, default='')
description = self._og_search_description(webpage) or get_element_by_class(
    'header__description', webpage)
Might be nice to extract a title without the " | The Guardian" junk at the end. Also use clean_html just in case.
Suggested change:
- title = self._generic_title(url, webpage, default='')
- description = self._og_search_description(webpage) or get_element_by_class(
-     'header__description', webpage)
+ title = clean_html(get_element_by_class(
+     'index-page-header__title', webpage)) or self._generic_title(url, webpage)
+ description = self._og_search_description(webpage) or clean_html(get_element_by_class(
+     'header__description', webpage))
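The "prefer the header element, otherwise fall back to the page title" pattern in this suggestion can be sketched outside yt-dlp roughly as follows. This is a minimal illustration only: `element_html_by_class`, `strip_markup` and `pick_title` are hypothetical stand-ins written for this example, not yt-dlp's `get_element_by_class` or `clean_html`, and the sample HTML is made up.

```python
import html
import re


def element_html_by_class(class_name, page):
    """Hypothetical stand-in: inner HTML of the first element whose class
    attribute is exactly class_name, or None when nothing matches."""
    pattern = r'<(\w+)[^>]*\bclass="%s"[^>]*>(.*?)</\1>' % re.escape(class_name)
    match = re.search(pattern, page, re.DOTALL)
    return match.group(2) if match else None


def strip_markup(fragment):
    """Hypothetical stand-in: drop tags, decode entities, trim whitespace."""
    if fragment is None:
        return None
    return html.unescape(re.sub(r'<[^>]+>', '', fragment)).strip()


def pick_title(page, fallback_title):
    # Prefer the dedicated header element; use the fallback only when the
    # element is missing or cleans down to an empty string.
    return strip_markup(
        element_html_by_class('index-page-header__title', page)) or fallback_title


# Made-up sample pages, assumed for illustration only.
with_header = '<h1 class="index-page-header__title"><a href="#">Today in Focus</a></h1>'
without_header = '<p>No header here</p>'
fallback = 'Today in Focus | News | The Guardian'

title_a = pick_title(with_header, fallback)
title_b = pick_title(without_header, fallback)
```

Because the `or` is applied after cleaning, an element that exists but contains only markup or whitespace also triggers the fallback.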
If the URL given by the user has a page, the webpage title will be something like this: "Today in Focus | Page 2 of 66 | News | The Guardian".
Is there a helper function that can be used here to clean the text? Or is it fine to do something like title, _ = title.split('|')?
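For reference, here is a small sketch of how that split behaves on the example title quoted above. It uses only built-in string methods; the variable names are chosen for this illustration.

```python
# Page title copied from the example in the comment above.
page_title = 'Today in Focus | Page 2 of 66 | News | The Guardian'

parts = page_title.split('|')

# Two-target unpacking only works when split() yields exactly two items;
# this title yields four, so the assignment raises ValueError.
try:
    title, _ = parts
    unpack_ok = True
except ValueError:
    unpack_ok = False

# Taking the first segment does not depend on how many segments follow.
first_segment = parts[0].strip()

# str.partition() always returns a 3-tuple, so it unpacks safely.
head, _sep, _rest = page_title.partition('|')
head = head.strip()
```

Either variant still assumes that the podcast name itself never contains a `|` character.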
remove_end is the helper typically used for this, but it needs a fixed string to remove, so it wouldn't be useful here.
IMO let's try to grab the clean title from one of these elements instead of doing string surgery with the title element:
title = clean_html(get_element_by_class(
'index-page-header__title', webpage) or get_element_by_class('flagship-audio__title', webpage))
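To illustrate why a fixed-suffix helper does not fit the paginated case, here is a rough sketch. The `remove_end` below is a local stand-in written for this example, assumed to behave like the helper described above (strip the suffix only when the string ends with it); the non-paginated title is a made-up sample.

```python
def remove_end(s, end):
    """Local stand-in (an assumption, not imported from yt-dlp): strip `end`
    from `s` only when `s` actually ends with it."""
    return s[:-len(end)] if s is not None and end and s.endswith(end) else s


fixed_suffix = ' | The Guardian'

# Made-up non-paginated title, assumed for illustration.
plain = 'Today in Focus | The Guardian'
# Paginated title from the earlier comment in this thread.
paged = 'Today in Focus | Page 2 of 66 | News | The Guardian'

plain_result = remove_end(plain, fixed_suffix)
paged_result = remove_end(paged, fixed_suffix)
```

With the paginated title, the page counter and section name survive, because they vary from page to page and cannot be expressed as a single fixed string.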
I implemented your suggestion, @bashonly. So far, it seems to pick up all the titles correctly.
…ue to number of episodes per page
Thank you @bashonly for all the suggestions provided. The code has been adjusted with them in mind.
Closes yt-dlp#8520
Authored by: SirElderling
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
The purpose of this extractor is to download The Guardian podcast playlists and single episodes.
Fixes #8520
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?
Copilot Summary
🤖 Generated by Copilot at c8f48ae
Summary
🎧📰🐍
Add a new extractor TheGuardianPodcastIE for The Guardian podcast pages in theguardian.py and import it in _extractors.py. This allows yt-dlp to download audio files from The Guardian podcast URLs.
Walkthrough
Import TheGuardianPodcastIE from theguardian.py in _extractors.py
Add TheGuardianPodcastIE class in theguardian.py