I want to stop hand-editing my filenames, but there's no way to suppress only unicode characters #7079
Comments
You are confusing the shell in use (cmd, powershell, bash, ...) with the console emulator. In your case you are using conhost, which is ancient and has now been obsoleted by Windows Terminal. Give Windows Terminal a try or even use a completely third-party console emulator; both options support the full range of unicode characters and the like. Also, if you don't have one already, install a monospaced font with full unicode character support. You are not restricted to conhost or a raster font. |
That's all very interesting, but in the end, I don't think that's the solution for me. Enumerating the many use cases where these characters are problematic is beyond the scope of this bug report. I'm tired of having to find unicode fixes for the myriad of programs I have in my workflows. I've spent upwards of ~20 hours troubleshooting unicode incompatibility issues in my various workflows in 2023 alone. These characters are problematic. (Nor am I looking to perturb my TCC command-line environment that I've been continuously developing my own layer over for 35 years.) I don't want the characters in my filename, and that is what this bug report is about. But perhaps it's really a feature request and not a bug report. My bad on that. |
Hmm, well, if something was missing, it must not have been produced as part of the output.
But alas is something I've tried in the past that created other problems, didn't solve other problems, and ended up not being the use case I need for my situations. I just don't want those characters to be in the filenames, and I don't want to manually have to edit them out, which is what I've been doing. Even if they display properly, they are still problems in some situations, so I'm still going to be editing them out. (Not all software on the planet can deal with them, not all software can be replaced, and I do a lot of niche things where sometimes I'm stuck running old software that can't easily be replaced.)
I have, but it looked way too complicated and I don't want to regenerate the whole filename template. I like the way it comes out now, and I already have things that ingest the filename format as it stands, so I really just want the character substitution, that's all. There's already character substitution support, it's just too much. Unicode and spaces are not in the same category of problematic filenames. I just want a more gentle mapping that only substitutes the unicode characters. This is seeming more like a feature request than a bug report, so I'm wishing I'd filed this in the correct place.
It's weird indeed. The youtube-dl.exe I used was a standalone exe, no clue about what kind of python bundling might exist. So many potential explanations, including me mis-remembering, me suddenly getting interested in music that has more unicode filenames, etc etc. Hard to look back and ever know why that was when it started to become a problem, it just was. Maybe youtube-dl did save them, but not as often? Or maybe they were omitted by some weird incompatibility? Who knows at this point, one could only speculate.
It does seem like a problem that's been solved umpteen times, and I'm honestly surprised there's not some "oneliner" library out there that would do exactly this automatically. Conversion to and from utf-8 is easy enough, but yeah, this is something a bit different than that. Imagine my horror the first time I saw a colon in a filename (name specifically, not the path) on a windows machine 🤣 |
See pinned issues #4836, #5014, especially #5014 (comment) |
There is a detox utility that you could run in WSL as a batch filename corrector: needs configuration. The Unicode site offers a file confusables.txt that lists "confusable" Unicode characters and their equivalent as a sequence of non-confusable characters. This file can be processed to generate various mappings. I generated a function:
$ python3.9
Python 3.9.16 (main, Jan 12 2023, 04:51:49)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import confusables
>>> x = '𝟖'
>>> ord(x)
120790
>>> confusables.asc_from_confusable(x)
'8'
>>>
$ |
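The `confusables` module in the session above is the commenter's own and is not shown in the thread. As a rough sketch of how such a mapping might be generated, the following parses lines in the `source ; target ; type # comment` layout used by confusables.txt and keeps only entries whose target is pure ASCII. The sample lines are illustrative stand-ins for the real file, and `load_ascii_map` and `to_ascii_lookalike` are hypothetical names, not the commenter's functions.

```python
# Illustrative sample in the layout of confusables.txt:
#   source code point ; target code point(s) ; type # comment
# Real use would read the full file from the Unicode site instead.
SAMPLE = """\
1D7D6 ; 0038 ; MA # MATHEMATICAL BOLD DIGIT EIGHT → DIGIT EIGHT
2474 ; 0028 0031 0029 ; MA # PARENTHESIZED DIGIT ONE → (1)
05AD ; 0596 ; MA # HEBREW ACCENT DEHI → HEBREW ACCENT TIPEHA
"""


def load_ascii_map(text):
    """Build {confusable char: ASCII replacement}, skipping non-ASCII targets."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        # Keep only single-character sources that map to plain ASCII.
        if len(source) == 1 and target.isascii():
            mapping[source] = target
    return mapping


def to_ascii_lookalike(text, mapping):
    """Replace each mapped character; leave everything else untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


mapping = load_ascii_map(SAMPLE)
print(to_ascii_lookalike("Part \U0001D7D6", mapping))
```

Entries whose target is itself non-ASCII (the third sample line) are skipped, so this only helps for characters that have an ASCII lookalike.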
That confusables file is interesting, but it isn't quite what's needed. It's just remapping unicodes to other unicodes. So they're just as unwanted after your function as before, for my use case. And yeah, I made my own filename fixer, but the problem with my own fixer is the same problem as using --replace-in-metadata -- the mappings still have to be added one character at a time. So far I have 2. And 2 places to put each one - my filename fixer, and my wrapper for yt-dlp (though I could skip that, I'd rather have redundancy). Ugh. I don't have the language knowledge to do this well, but I suppose with enough time and the right wikipedia page... I'm pretty much wanting a unicode => ascii mapping. I thought it wasn't a big ask, but probably it's bigger than I think (most development ends up being 🤣) |
try with --compat-options filename-sanitization https://www.reddit.com/r/youtubedl/comments/yz9ozo/how_do_i_get_ytdlp_downloads_without_forbidden/ |
Didn't change the behavior. At this point I've done an hour or so of work getting the polyglot library installed on python to detect language and romanize where possible, and then I have my own mapping table of characters not caught by that. It's ugly, but it's finally starting to automatically rename these bad filenames.
First, it runs the whole filename through polyglot as one string, which may or may not detect the language and transliterate.
Then it goes character by character, checking specifically for 3 languages I care about, and uses 3 language-specific libraries for those characters.
Finally, it goes through my own mapping table.
It's a doozy and I was probably asking too much for this before.... I guess I just solved my own problem.
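The layered approach described above can be sketched with the standard library alone. This is not the commenter's script: polyglot and the language-specific libraries are not reproduced, and Unicode compatibility decomposition (`unicodedata.normalize("NFKD", ...)`) stands in for them as an assumption. `CUSTOM_MAP` and `ascii_filename` are hypothetical names.

```python
import unicodedata

# Hypothetical hand-maintained table for characters that normalization
# cannot reduce to ASCII on its own.
CUSTOM_MAP = {
    "\u00DF": "ss",  # LATIN SMALL LETTER SHARP S
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
}

# Characters Windows rejects in filenames. Decomposition can produce them,
# for example a fullwidth colon decomposes to a plain ":".
FORBIDDEN = set('<>:"/\\|?*')


def ascii_filename(name, fallback="_"):
    """Return an ASCII-only version of name, leaving ASCII input untouched."""
    out = []
    for ch in name:
        if ch.isascii():
            out.append(ch)  # spaces, ampersands, etc. pass through unchanged
        elif ch in CUSTOM_MAP:
            out.append(CUSTOM_MAP[ch])
        elif unicodedata.combining(ch):
            continue  # stray combining accent: drop it
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            kept = "".join(
                c for c in decomposed if c.isascii() and c not in FORBIDDEN
            )
            out.append(kept if kept else fallback)
    return "".join(out)


print(ascii_filename("Caf\u00E9 \U0001D7D6"))
print(ascii_filename("Show\uFF1A Part 1"))
```

Anything that neither the table nor decomposition can handle (emoji, most non-Latin scripts) falls back to an underscore, so a table like this still needs an entry for each new character worth preserving.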
I use -compat filename-sanitization |
#11046 |
I'm not a bro, but check out my fix_unicode_filename projects on my github. It's a more current [and bulky] version of this. Alas, I still have to edit it when a new character I haven't encountered creeps up, so it's far from perfect, but I have it as part of my yt-dlp workflow to do what yt-dlp won't do. Emoji and unicode screw up a lot of software. My image viewer won't even view images if a folder has an emoji in it, for example. |
DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE
Checklist
Provide a description that is worded well enough to be understood
SHORT DESCRIPTION: I hope this description isn't too convoluted, but, I'm having what I would call an "unwanted filename" problem that doesn't seem to be solved with an easy level of complexity.
LONGER DESCRIPTION: There is no way to suppress the unicode characters that break certain windows command line functionality in the output filename without also suppressing a lot of other characters (e.g. turning all spaces into underscores), which just makes the "unwanted filename" problem that much worse.
COMPARED TO YOUTUBE-DL: This is a change from youtube-dl that has driven me crazy for a while. I don't remember what youtube-dl used to do in this situation -- but I know this problem didn't become a thorn in my side until migrating to yt-dlp (which I like 1000X better).
Are you using the latest version?
Yes, I tried with that.
Is the issue already documented?
I couldn't find any when I looked in both places.
Why are existing options not enough?
Well first off, the "--no-restrict-filenames" option is supposed to be required to allow unicode characters, ampersands, and spaces. Yet here I am using windows, and the default behavior is to allow unicode characters, ampersands, and spaces. This might be a bug in and of itself, but maybe not..
But using "--windows-filenames" .... Still doesn't stop the unicode characters. So the unwanted filename problem isn't fixed.
And using "--restrict-filenames" ... Changes every space to an underscore! So now instead of having 1 or 2 unicode characters to fix, I will have 10 or so underscores to fix.
So "--restrict-filenames" actually increases the unwanted filename problem, instead of decreasing it! 😅
I just want the unicode characters fixed. If there is a question mark or a slash or a pipe in the youtube video title, just make it an underscore. I don't want some weird unicode character that almost no command line utilities can deal with to be substituted in, and which doesn't properly render in any command-line.
That actually makes the problem worse in many ways, unfortunately 😢
Windows may "allow" those types of filenames, but in practice, they are un-workable for users under windows who actually do things with the files other than download them and click them in the windows GUI. 😲
Contrary to popular belief, there are a lot of windows users still using command line utilities 😉
Have you read and understood the changes, between youtube-dl and yt-dlp
Yup. That's what created this. I was using youtube-dl for years and this problem immediately began when I switched. (Though switching solved EVERY OTHER PROBLEM😎)
Is there enough context in your bug report?
I hope so!
Does the issue involve one problem, and one problem only?
I hope so!
Is anyone going to need the feature?
Pretty much anyone who wants to type a simple "dir" command at the command-line and actually see the actual character is affected by this behavior. 😲
Many windows command-lines simply do not deal with unicode characters properly -- even though they should 😒 It's been the bane of my existence as I try to update my complete audiovisual workflow toolset to work properly with unicode. It's been one of my 2023 goals. I'm pretty sure I'm not alone in dealing with unicode issues.
But really, we need an option that is less than "turn every weird character into an underscore" - I don't want to be cursed to a lifetime of changing every underscore back to a space. But which is more than "allow every unicode character to just sit there being ugly" - I don't want to be cursed to a lifetime of changing every unicode character to an underscore.
Maybe there's a way to already do this without disturbing the filename in any other way, but I couldn't figure it out from the documentation.
But in the end, the problem boils down to:
I would like to get out of the personal hell of having to hand-edit every single download this great utility makes, and there doesn't seem to be a way out of it. Smushing ALL special characters into an underscore prevents me from knowing what the character originally was. I only want unicode characters smushed, not all special characters.
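The behavior being asked for, replacing only non-ASCII characters while leaving spaces, ampersands and other ASCII punctuation alone, can be sketched as a post-download rename step. This is a minimal illustration, not an existing yt-dlp option; `suppress_non_ascii` is a hypothetical helper name.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def suppress_non_ascii(name, replacement="_"):
    """Replace each non-ASCII character; spaces and ASCII symbols are kept."""
    return NON_ASCII.sub(replacement, name)


# A fullwidth colon (U+FF1A) becomes "_", while the spaces and "&" survive.
print(suppress_non_ascii("Metalocalypse\uFF1A Dethklok & Friends.mp4"))
```

The thread mentions `--replace-in-metadata` as a place where substitutions can be configured; whether this exact pattern behaves the same way there is untested here.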
I've included a screenshot just so you can see the "question mark inside a rectangle" "can't render unicode characters" appearance. This is universal across 4 command lines under windows (cmd, powershell, tcc, bash). Ironically when I pasted it in the plaintext below, the paste was valid! But that is the essence of the problem: Windows command line utilities just DON'T work right with unicode filenames. They don't. It's the world we live in.
Provide verbose output that clearly demonstrates the problem
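Run your yt-dlp command with the -vU flag added (yt-dlp -vU <your command line>)
If using the API, add 'verbose': True to YoutubeDL params instead
Copy the whole output (starting with [debug] Command-line config) and insert it below
Complete Verbose Output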
^^^ ADDITIONAL NOTE! See that colon after "Metalocalypse" in the "Dir" command above? That actually isn't renderable on the screen in any command line. But because I copy and pasted that section with the mouse (rather than redirecting to the clipboard device at the command line), the character successfully made it to the clipboard, and then to this bug report. However, this simply doesn't work so smoothly in most other cases.
These characters can be copied out of the command-line world into the gui-world, but they don't typically work in the command line world. There's not even a way to properly render it on the console without doing strange things like codepage changes that don't actually work (changing codepage from 437 to 65001 does NOT fix these problems, not any of them).
In the end, the windows CLI world just isn't ready for these characters. Windows CLI has worked fine with spaces and ampersands for over 20 yrs, but it's still not there for command-line utilities.