Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make subtitle language selection more powerful #6264

Closed
wants to merge 2 commits into from

Conversation

sdht0
Copy link
Contributor

@sdht0 sdht0 commented Feb 17, 2023

Description of your pull request and other information

--sub-langs should allow selecting one or more subtitle languages of the following types: normal, auto, both, normal if available else auto. Currently --sub-langs does not support all such combinations (see #2262). For example, current behavior is to merge: a normal sub is selected if present otherwise the auto sub. Let's call this a simple merge. There is currently no way to download both normal and auto sub for the same lang.

When --sub-langs is not specified, the current default behavior (see my PR #6240) is to select the first matching sub in the following order. Let's call it a special merge.

  • Normal sub named exactly en
  • Normal sub that starts with en. This is to handle cases where subs are named as en-US or en-qlPKC2UN_YU.
  • Auto sub named exactly en
  • Auto sub that starts with en
  • Any normal sub and then any auto sub. This is handle the case when no en subs were found.

Note that this does not allow the ability to download only en subs. Thus, if en subs are missing, it will download an arbitrary (first) non-en lang.

This PR makes language selection much more flexible be making the following major changes.

  • Rename auto subs by prepending auto- (e.g., en -> auto-en). The main reason is to enable selecting auto subs separately from the normal subs (e.g., en.* vs auto-en.*). A secondary benefit is that auto subs are named differently on disk, which make is easy to distinguish them from normal subs just from the file name (also see handle subs and auto-subs separately #2262 (comment)).
  • Disables merging by default in --sub-langs and introduces a new special merging operator #<lang> to selectively enable merging as per the default behavior described above, with the change that if <lang> is not found, no other lang will be arbitrarily selected.
  • User can specify lang precedence: en/fr will download en if available, else fr if available, else nothing

I'm giving a comprehensive set of examples to show the new behavior of --sub-langs. The table assumes that subs are written to disk only. See #630 (comment) for corresponding embedding options.

Option Behavior Notes
--write-subs Select one normal sub
--write-auto-subs Select one auto sub
--write-subs --write-auto-subs Special merge subs and select one as described above
--write-subs --write-auto-subs --sub-langs 'all' Select all normal and auto subs without merge Changed behavior, selects both en and auto-en
--write-subs --write-auto-subs --sub-langs '#all' Select all normal and auto subs after simple merge New, selects en if present else auto-en, same as existing all
--write-subs --write-auto-subs --sub-langs 'allnorm' Select all normal subs New
--write-subs --write-auto-subs --sub-langs 'allauto' Select all auto subs New
--write-subs --write-auto-subs --sub-langs 'en' Select one en normal subs Changed, will select normal subs only
--write-subs --write-auto-subs --sub-langs 'auto-en' Select one en auto subs New
--write-subs --write-auto-subs --sub-langs 'en,auto-en' Select both en normal and auto subs New
--write-subs --write-auto-subs --sub-langs 'en/auto-en' Select one en normal sub if present else auto subs New, same as existing en
--write-subs --write-auto-subs --sub-langs '#en' Select one en sub after special merge New, shortcut for above but uses special merge
--write-subs --write-auto-subs --sub-langs 'en,ja,auto-ja,es' Select listed subs Changed: no merging, auto subs need to be explicitly selected
--write-subs --write-auto-subs --sub-langs 'en.*,auto-en.*' Select subs using regex Changed: no merging, auto subs need to be explicitly selected
--write-subs --write-auto-subs --sub-langs 'allnorm,auto-en-orig/auto-en' Select all normal subs and one en-orig auto sub if present else en auto sub New
--write-subs --write-auto-subs --sub-langs 'allnorm,auto-.*-orig' Select all normal subs and one *-orig auto sub New
--write-subs --write-auto-subs --sub-langs 'en/auto-en/fr/auto-fr' Set precedence order for subs, fr will only be selected if en is not present New
--write-subs --write-auto-subs --sub-langs '#en/fr' Same as above but with special merging New

The above design was chosen to make the code and options simpler, but it breaks existing behavior. Backward compatibility can be maintained at the cost of more complex code and likely needs 2 operators instead of 1. To summarize the break in behavior:

Option Current New
--write-subs --write-auto-subs Default behavior as described in the beginning No change
--write-subs --write-auto-subs --sub-langs 'all' Normal and auto subs are merged and only one is selected for each lang Both normal and auto subs for each lang will be selected without merging. Use --sub-langs '#all' for old behavior.
--write-subs --write-auto-subs --sub-langs 'en' Either normal or auto sub for en is selected Only normal sub for en is selected. Use --sub-langs 'en/orig-en' for old behavior.

TODOs

  • Perform and add tests. I have done some preliminary testing.
  • Update documentation. Need to decide how detailed the merging logic should be explained.

Let me know if this is an acceptable way to proceed. Once we come to a decision on the main design: renaming auto subs, # and / operators and the general code, I can work on the TODOs.

Fixes #2262 #4178

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

@pukkandan
Copy link
Member

pukkandan commented Feb 17, 2023

I wish you had discussed your idea before trying to implement it. The behavior of existing options should not be changed.

I have another idea for how to address these without breaking current compat.

@pukkandan pukkandan added enhancement New feature or request do-not-merge labels Feb 17, 2023
@pukkandan
Copy link
Member

Backward compatibility can be maintained at the cost of more complex code and likely needs 2 operators instead of 1.

Could you elaborate?

@sdht0
Copy link
Contributor Author

sdht0 commented Feb 17, 2023

I wish you had discussed your idea before trying to implement it. The behavior of existing options should not be changed.

Yeah I was thinking about the design and implementation was easy enough to do quickly, so no worries if this does not get accepted.

I have another idea for how to address these without breaking current compat. I'll type it out asap after I get the new release out

Sounds good. My only feedback for the design is to allow selecting the default behavior (startswith(lang)) but without selecting a non-en lang is en is absent, as this is what I'll personally be using most of the time.

Backward compatibility can be maintained at the cost of more complex code and likely needs 2 operators instead of 1.

Could you elaborate?

I tried preserving backward compatibility. But then there needs to be a way to have "simple merge" as is the current default and then be able to choose "no merging" to download both normal+auto or "special merging" (see PR description for the terminology). Thus two operators. maybe # and @.

Looking forward to your design.

@sdht0
Copy link
Contributor Author

sdht0 commented Mar 8, 2024

Closing this for now.

@sdht0 sdht0 closed this Mar 8, 2024
@sdht0 sdht0 deleted the subs-selection branch May 22, 2024 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

handle subs and auto-subs separately
2 participants