File names use "full-width" special characters: '？' vs '?' #5014

Closed
10 tasks done
AlbatorLaho opened this issue Sep 24, 2022 · 8 comments
Labels
duplicate This issue or pull request already exists question Question

Comments

@AlbatorLaho

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

  • I understand that I will be blocked if I remove or skip any mandatory* field

Checklist

Region

USA

Provide a description that is worded well enough to be understood

Recently (within the last month or so), all videos I download from YouTube have "full-width" special characters in their file names (question mark, exclamation mark, colon, semicolon, etc.).
Other characters such as [A-Za-z] are "normal".
The video name appears "normal" in the JSON file written by --write-info-json, so I guess the problem occurs when the actual video file is named?
Example: "full-width" question mark ？ (U+FF1F) vs. "normal" question mark ? (U+003F)

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['-vU', '--write-info-json', 'https://www.youtube.com/watch?v=aMu07rtD6cI']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.09.01 [5d7c7d6] (pip)
[debug] Python 3.10.6 (CPython 64bit) - macOS-10.14.6-x86_64-i386-64bit
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] exe versions: ffmpeg 5.1.1 (setts), ffprobe 5.1.1, phantomjs 2.1.1, rtmpdump 2.4
[debug] Optional libraries: Cryptodome-3.15.0, brotli-1.0.9, certifi-2022.06.15, mutagen-1.45.1, sqlite3-2.6.0, websockets-10.3
[debug] Proxy map: {}
[debug] Loaded 1670 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.09.01, Current version: 2022.09.01
yt-dlp is up to date (2022.09.01)
[debug] [youtube] Extracting URL: https://www.youtube.com/watch?v=aMu07rtD6cI
[youtube] aMu07rtD6cI: Downloading webpage
[youtube] aMu07rtD6cI: Downloading android player API JSON
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, filesize, fs_approx, tbr, vbr, abr, asr, vext, aext, hasaud, id
[debug] Default format spec: bestvideo*+bestaudio/best
[info] aMu07rtD6cI: Downloading 1 format(s): 313+251
[info] Writing video metadata as JSON to: Has Sony made the PERFECT party speaker?? - Sony SRS-XG300 [aMu07rtD6cI].info.json
[debug] Invoking http downloader on "https://rr4---sn-5hne6n6l.googlevideo.com/videoplayback?expire=1664000334&ei=7kwuY7_CEYWmgQeWzrjgCg&ip=185.107.57.81&id=o-AN1bI8VdB1eprY2PFFOVynifiL7VqM1SEcXxWlXrnHfS&itag=313&source=youtube&requiressl=yes&mh=MG&mm=31%2C29&mn=sn-5hne6n6l%2Csn-5hneknes&ms=au%2Crdu&mv=m&mvi=4&pl=24&initcwndbps=742500&spc=yR2vp3sGNQp7SHQxVzwXReu9qasZioY&vprv=1&svpuc=1&mime=video%2Fwebm&gir=yes&clen=960895517&dur=670.636&lmt=1663652915447588&mt=1663978408&fvip=2&keepalive=yes&fexp=24001373%2C24007246&c=ANDROID&txp=4432434&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=AOq0QJ8wRQIgOIxx2ERu1BYvGrE2ip0j70kg9oHqM-IUXtZXx0DrWI0CIQDS7KQdsR92BDUC5yeJwLJUVspkChI_SchEKbVgFbf5IA%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AG3C_xAwRQIhALQvcN8pZ12W3V9qYKxlfsw7BgZn89ETQdEIb6HUwKWkAiBJEFWOGFw_83C1Hb769yOZzH-EZeaGphU8dj5dXNMsEQ%3D%3D"
[download] Destination: Has Sony made the PERFECT party speaker?? - Sony SRS-XG300 [aMu07rtD6cI].f313.webm
AlbatorLaho added the site-bug (Issue with a specific website) and triage (Untriaged issue) labels on Sep 24, 2022
AlbatorLaho changed the title from "(YouTube) files names use "full-width" special characters: '？' vs '?'" to "(YouTube) file names use "full-width" special characters: '？' vs '?'" on Sep 24, 2022
@layercak3

layercak3 commented Sep 24, 2022

It's on purpose.

It's done here:

def replace_insane(char):

Take a look in the sanitize_filename function.

It is performed when --no-restrict-filenames is set (the default): it just adds 0xFEE0 to the Basic Latin code point to get the code point of the corresponding fullwidth form (for example, U+003F + 0xFEE0 = U+FF1F). In the case of --restrict-filenames it would just delete the character, which is worse.
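
To make the offset concrete, here is a minimal Python sketch of the mapping. It is not yt-dlp's actual code: the names FULLWIDTH_OFFSET, UNSAFE, to_fullwidth and sketch_sanitize are made up for illustration, and the character set is simply the one quoted further down in this thread.

```python
# Illustrative sketch only; NOT the real sanitize_filename implementation.
FULLWIDTH_OFFSET = 0xFEE0   # distance from Basic Latin to the fullwidth forms
UNSAFE = '"*:<>?|/\\'       # assumed set of affected characters


def to_fullwidth(char):
    """Return the fullwidth form of an affected character, else the character itself."""
    if char in UNSAFE:
        return chr(ord(char) + FULLWIDTH_OFFSET)
    return char


def sketch_sanitize(name):
    """Apply the fullwidth replacement to every character of a name."""
    return ''.join(to_fullwidth(c) for c in name)


if __name__ == '__main__':
    print(hex(ord(to_fullwidth('?'))))   # 0xff1f
    print(sketch_sanitize('What? Why: a/b'))
```

Letters and digits pass through unchanged, which matches the observation that [A-Za-z] stay "normal".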

There isn't a way to disable it, unless you edit the code (change what chars are affected by the fullwidth replacement in the loop, or remove all calls to the function). Maybe we should get a new option that disables sanitization to the bare minimum (only disallowing forward slash and allowing everything else). I don't mind it personally because the title is fully preserved in the accompanying metadata (embedded and in info.json).

Alternatively you can use --compat-options filename-sanitization to mimic youtube-dl's behavior which doesn't do this

git blame says it was added in 989a01c

@pukkandan
Member

pukkandan commented Sep 24, 2022

Maybe we should get a new option that disables sanitization to the bare minimum (only disallowing forward slash and allowing everything else)

We are doing the bare minimum sanitization, except that the code doesn't know what characters the filesystem allows. So we sanitize for the most restrictive. If you know of a way to correctly identify what characters need to be sanitized, let us know in #4547

@pukkandan
Member

Duplicate of #4767 and related

pukkandan closed this as not planned on Sep 24, 2022
pukkandan added the duplicate (This issue or pull request already exists) and question (Question) labels and removed the site-bug (Issue with a specific website) and triage (Untriaged issue) labels on Sep 24, 2022
pukkandan changed the title from "(YouTube) file names use "full-width" special characters: '？' vs '?'" to "File names use "full-width" special characters: '？' vs '?'" on Sep 24, 2022
@pukkandan
Member

Summary of all solutions:

  1. Use --restrict-filenames if you don't want any special characters
  2. Use --compat-options filename-sanitization to use youtube-dl's behavior
  3. Use --replace-in-metadata to control how each character should be replaced
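
As a rough sketch, the three options above could be turned into command lines like this. The helper build_command and the STRATEGIES table are hypothetical, and the regex and replacement passed to --replace-in-metadata (dropping question marks from the title) are only an example; the flags themselves are the ones named in the list.

```python
import shlex

# URL taken from the verbose log in the original report.
URL = 'https://www.youtube.com/watch?v=aMu07rtD6cI'

# Hypothetical lookup table: one argument list per solution above.
STRATEGIES = {
    'restrict': ['--restrict-filenames'],
    'compat': ['--compat-options', 'filename-sanitization'],
    # Example only: remove every '?' from the title before the file is named.
    'replace': ['--replace-in-metadata', 'title', r'\?', ''],
}


def build_command(strategy, url=URL):
    """Return the yt-dlp argument list for one of the strategies."""
    return ['yt-dlp', *STRATEGIES[strategy], url]


if __name__ == '__main__':
    for name in STRATEGIES:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(build_command(name)))
```

The script only prints the commands; it does not run yt-dlp.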

@AlbatorLaho
Author

Ok, it makes sense that it's on purpose. (Also, I apologize for not seeing #4767. I looked for probably 15 minutes accumulatively and could not find anything.)
I thought this was a YouTube-specific thing, though, because I downloaded a video from BitChute and (I thought) it used non-full-width characters... but I tried again, and it does use them. Oops.
Thanks for the help!

@pukkandan
Member

I looked for probably 15 minutes accumulatively and could not find anything)

No worries. GitHub issue search sucks unless you know the right search terms. But now that this issue is pinned, hopefully fewer people will open duplicates.

@wambiditu

wambiditu commented Nov 11, 2022

Although this is closed, I wanted to add information.

None of the above options (delete, make fullwidth, or make underscore) seem reasonable to me.

I thought I could tell yt-dlp to ignore certain characters by simply replacing them with themselves in metadata, but that produces an error.

I am currently working around this problem. I allow the default (fullwidth) option, which at least preserves the information about which character WAS there, so I can put it back at the command line, with things like:
for f in *？*; do mv -v "${f}" "${f//？/?}"; done
and
for f in *＊*; do mv -v "${f}" "${f//＊/*}"; done
(I didn't really test this, and it might have problems with quotes.)
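
The same rename idea can be sketched in Python, which sidesteps shell quoting. This assumes a filesystem that accepts these characters (e.g. Linux). The names RESTORABLE, restore_name and restore_directory are hypothetical. Both slashes are deliberately left out, since '/' can never appear in a file name and '\' is a path separator on Windows.

```python
from pathlib import Path

FULLWIDTH_OFFSET = 0xFEE0
# Characters to turn back into ASCII; '/' and '\\' are intentionally excluded.
RESTORABLE = '"*:<>?|'

# Translation table: fullwidth code point -> original ASCII character.
TABLE = {ord(c) + FULLWIDTH_OFFSET: c for c in RESTORABLE}


def restore_name(name):
    """Map fullwidth punctuation in a file name back to its ASCII form."""
    return name.translate(TABLE)


def restore_directory(directory, dry_run=True):
    """Rename files in a directory; return the (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        new_name = restore_name(path.name)
        if new_name == path.name:
            continue  # nothing to restore
        target = path.with_name(new_name)
        if target.exists():
            continue  # never overwrite an existing file
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

With dry_run=True (the default) nothing is renamed, so the returned pairs can be reviewed first.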

My ideal would be to allow the user to specify a string containing all the characters they DON'T want yt-dlp to help with.
(Edit:simplified a bit)

@siroccal

siroccal commented Feb 13, 2023

The replacement of '"*:<>?|/\\' with their full-width variants is awful on Linux, especially when using the command line and scripts to archive videos.

As there doesn't seem to be an option to disable this, the following can be used instead:

sed "s/return.*0xfee0.*/return '_' if char == '\/' else char/" -i .local/lib/python*/site-packages/yt_dlp/utils.py

or if you want to replace / with its full-width version instead of a _ this can be used:

sed "s/return.*0xfee0.*/return '\\\uFF0F' if char == '\/' else char/" -i .local/lib/python*/site-packages/yt_dlp/utils.py

or the "big" / (U+29F8, big solidus):

sed "s/return.*0xfee0.*/return '\\\u29F8' if char == '\/' else char/" -i .local/lib/python*/site-packages/yt_dlp/utils.py
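
For readers who prefer not to patch the installed package, the effect of the replaced return statement in the three sed commands can be sketched in plain Python. This only mirrors that one line, not the rest of yt-dlp's sanitization. The names minimal_sanitize and slash_replacement are hypothetical.

```python
def minimal_sanitize(name, slash_replacement='_'):
    """Replace only '/' and leave every other character untouched."""
    return ''.join(slash_replacement if char == '/' else char for char in name)


if __name__ == '__main__':
    title = 'AC/DC: Back in Black?'
    print(minimal_sanitize(title))             # underscore variant
    print(minimal_sanitize(title, '\uFF0F'))   # fullwidth solidus variant
    print(minimal_sanitize(title, '\u29F8'))   # big solidus variant
```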
