Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TheGuardian] Add Extractor for podcasts #8535

Merged
merged 8 commits into from
Nov 18, 2023

Conversation

SirElderling
Copy link
Contributor

@SirElderling SirElderling commented Nov 6, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

The purpose of this extractor is to download The Guardian podcast playlists and single episodes.

Fixes #8520

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at c8f48ae

Summary

🎧📰🐍

Add a new extractor TheGuardianPodcastIE for The Guardian podcast pages in theguardian.py and import it in _extractors.py. This allows yt-dlp to download audio files from The Guardian podcast URLs.

TheGuardianPodcastIE
New extractor for autumn podcasts
Inherits InfoExtractor

Walkthrough

  • Add support for The Guardian podcasts (link, link)
    • Import TheGuardianPodcastIE from theguardian.py in _extractors.py (link)
    • Define TheGuardianPodcastIE class in theguardian.py (link)

@bashonly bashonly added the site-request Request to support a new website label Nov 6, 2023
@bashonly bashonly self-requested a review November 6, 2023 22:44
@SirElderling SirElderling marked this pull request as draft November 10, 2023 07:05
@SirElderling
Copy link
Contributor Author

Converted this PR to draft in order to add the playlist extract functionality.

@SirElderling SirElderling marked this pull request as ready for review November 10, 2023 09:23
@SirElderling SirElderling changed the title [TheGuardian] Add Extractor [TheGuardian] Add Extractor for podcasts Nov 11, 2023
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
Comment on lines 128 to 130
title = self._generic_title(url, webpage, default='')
description = self._og_search_description(webpage) or get_element_by_class(
'header__description', webpage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to extract a title without the " | The Guardian" junk at the end. Also use clean_html just in case

Suggested change
title = self._generic_title(url, webpage, default='')
description = self._og_search_description(webpage) or get_element_by_class(
'header__description', webpage)
title = clean_html(get_element_by_class(
'index-page-header__title', webpage)) or self._generic_title(url, webpage)
description = self._og_search_description(webpage) or clean_html(get_element_by_class(
'header__description', webpage))

Copy link
Contributor Author

@SirElderling SirElderling Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the URL given by the user has a page, the webpage title will be something like this: Today in Focus | Page 2 of 66 | News | The Guardian.
Is there a helper function that can be used here to clean the text? Or is it fine to do something like title, _ = title.split('|') ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove_end is the helper typically used for this, but it needs a fixed string to remove; so it wouldn't be useful here

IMO let's try to grab the clean title from one of these elements instead of doing string surgery with the title element

        title = clean_html(get_element_by_class(
            'index-page-header__title', webpage) or get_element_by_class('flagship-audio__title', webpage))

Copy link
Contributor Author

@SirElderling SirElderling Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented your suggestion bashonly. So far, it seems to pick up all the titles correctly.

@bashonly bashonly added the pending-fixes PR has had changes requested label Nov 11, 2023
@bashonly bashonly mentioned this pull request Nov 11, 2023
9 tasks
@bashonly bashonly removed the pending-fixes PR has had changes requested label Nov 11, 2023
@bashonly bashonly self-requested a review November 11, 2023 19:48
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
@bashonly bashonly added the pending-fixes PR has had changes requested label Nov 15, 2023
@bashonly bashonly removed the pending-fixes PR has had changes requested label Nov 17, 2023
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
yt_dlp/extractor/theguardian.py Outdated Show resolved Hide resolved
@SirElderling
Copy link
Contributor Author

SirElderling commented Nov 18, 2023

Thank you @bashonly for all the suggestions provided. The code has been adjusted with them in mind.

@bashonly bashonly self-assigned this Nov 18, 2023
@bashonly bashonly merged commit 1fa3f24 into yt-dlp:master Nov 18, 2023
15 checks passed
@SirElderling SirElderling deleted the theguardian branch November 19, 2023 08:22
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
site-request Request to support a new website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support request for theguardian.com/news/audio/ podcasts
2 participants