I want to stop hand-editing my filenames, but there's no way to suppress only unicode characters #7079

Closed
10 tasks done
ClaireCJS opened this issue May 19, 2023 · 12 comments
Labels: question

Comments

ClaireCJS commented May 19, 2023

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

  • I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

  • I'm reporting a bug unrelated to a specific site
  • I've verified that I'm running yt-dlp version 2023.03.04 (update instructions) or later (specify commit)
  • I've checked that all provided URLs are playable in a browser with the same IP and same login details
  • I've checked that all URLs and arguments with special characters are properly quoted or escaped
  • I've searched known issues and the bugtracker for similar issues including closed ones. DO NOT post duplicates
  • I've read the guidelines for opening an issue

Provide a description that is worded well enough to be understood

SHORT DESCRIPTION: I hope this description isn't too convoluted, but, I'm having what I would call an "unwanted filename" problem that doesn't seem to be solved with an easy level of complexity.

LONGER DESCRIPTION: There is no way to suppress the unicode characters that break certain windows command-line functionality in the output filename without also suppressing a lot of other characters (i.e., turning all spaces into underscores), which just makes the "unwanted filename" problem that much worse.

COMPARED TO YOUTUBE-DL: This is a change from youtube-dl that has driven me crazy for a while. I don't remember what youtube-dl used to do in this situation -- but I know this problem didn't become a thorn in my side until migrating to yt-dlp (which I like 1000X better).

Are you using the latest version?
Yes, I tried with that.

Is the issue already documented?
I couldn't find any when I looked in both places.

Why are existing options not enough?
Well, first off, the "--no-restrict-filenames" option is supposed to be required to allow unicode characters, ampersands, and spaces. Yet here I am using windows, and the default behavior is to allow unicode characters, ampersands, and spaces. This might be a bug in and of itself, but maybe not.

But using "--windows-filenames" still doesn't stop the unicode characters. So the unwanted filename problem isn't fixed.

And using "--restrict-filenames" changes every space to an underscore! So now instead of having 1 or 2 unicode characters to fix, I will have 10 or so underscores to fix.
So "--restrict-filenames" actually increases the unwanted filename problem, instead of decreasing it! 😅

I just want the unicode characters fixed. If there is a question mark or a slash or a pipe in the youtube video title, just make it an underscore. I don't want some weird unicode character that almost no command line utilities can deal with to be substituted in, and which doesn't properly render in any command-line.
That actually makes the problem worse in many ways, unfortunately 😢
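
The behaviour being asked for here can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: `gentle_sanitize` is a hypothetical helper, not a yt-dlp option, and the forbidden-character set is the usual Windows one. It folds look-alike characters back to ASCII with NFKC, then replaces Windows-forbidden and remaining non-ASCII characters with an underscore while leaving spaces and ampersands alone.

```python
import re
import unicodedata

# Characters Windows does not allow in file names, plus ASCII control codes.
WINDOWS_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Anything outside printable ASCII (space through tilde).
NON_ASCII = re.compile(r'[^\x20-\x7e]')


def gentle_sanitize(title: str) -> str:
    """Replace only forbidden and non-ASCII characters; spaces are kept."""
    # NFKC folds compatibility forms, e.g. a fullwidth colon becomes ':'.
    folded = unicodedata.normalize("NFKC", title)
    folded = WINDOWS_FORBIDDEN.sub("_", folded)
    return NON_ASCII.sub("_", folded)


print(gentle_sanitize("Metalocalypse\uff1a Dethklok \uff5c Go Into the Water"))
```

Something like this could be run over the downloaded files afterwards, or over the title before it is used in a file name; which of those fits best depends on the workflow.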

Windows may "allow" those types of filenames, but in practice, they are un-workable for users under windows who actually do things with the files other than download them and click them in the windows GUI. 😲
Contrary to popular belief, there are a lot of windows users still using command line utilities 😉

Have you read and understood the changes, between youtube-dl and yt-dlp
Yup. That's what created this. I was using youtube-dl for years and this problem immediately began when I switched. (Though switching solved EVERY OTHER PROBLEM😎)

Is there enough context in your bug report?
I hope so!

Does the issue involve one problem, and one problem only?
I hope so!

Is anyone going to need the feature?
Pretty much anyone who wants to type a simple "dir" command at the command-line and actually see the actual character is affected by this behavior. 😲
Many windows command-lines simply do not deal with unicode characters properly -- even though they should 😒 It's been the bane of my existence as I try to update my complete audiovisual workflow toolset to work properly with unicode. It's been one of my 2023 goals. I'm pretty sure I'm not alone in dealing with unicode issues.

But really, we need an option that is less than "turn every weird character into an underscore" - I don't want to be cursed to a lifetime of changing every underscore back to a space. But which is more than "allow every unicode character to just sit there being ugly" - I don't want to be cursed to a lifetime of changing every unicode character to an underscore.

Maybe there's a way to already do this without disturbing the filename in any other way, but I couldn't figure it out from the documentation.

But in the end, the problem boils down to:

I would like to get out of the personal hell of having to hand-edit every single download this great utility makes, and there doesn't seem to be a way out of it. Smushing ALL special characters into an underscore prevents me from knowing what the character originally was. I only want unicode characters smushed, not all special characters.

I've included a screenshot just so you can see the "question mark inside a rectangle" "can't render unicode characters" appearance. This is universal across 4 command lines under windows (cmd, powershell, tcc, bash). Ironically when I pasted it in the plaintext below, the paste was valid! But that is the essence of the problem: Windows command line utilities just DON'T work right with unicode filenames. They don't. It's the world we live in.

[Screenshot: ytldpbug]

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • If using API, add 'verbose': True to YoutubeDL params instead
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

IMPORTANT NOTE ON THIS DEBUG OUTPUT!  

The unicode characters do not properly display in any 
windows command line (not CMD.exe, not PowerShell, 
not bash under WSL, not TakeCommand/TCC which is my primary).  

They also did not properly reach my clipboard,
so they do not appear in this output.

They do appear on my screen, though.

This is the essence of the problem -- unicode in filenames is problematic under windows.
If I redirected my output to the clipboard, no unicode characters were captured.
If I copied the output to my clipboard using mouse select, line breaks were destroyed.



The first character is the ":" after "Metalocalypse".  
It's a unicode colon that isn't the same character 
as the ASCII colon, but it breaks most everything 
command-line driven in windows besides double-clicking 
the file in GUI.

The "|" (pipe) character ends up creating yet another one. 
These ones display in all local command lines as a 
question mark inside a rectangular box. 
Not just that, but if you edit the character in your command-line, 
it is usually treated as 2 characters (due to unicode byte length), 
so you get some "spooky behavior" with things like hitting the delete key 
only to turn the unicode character into a single byte character, 
and other weird behaviors like file coloring not coloring to the end of the filename 
due to byte count being miscounted. Multiply those types of bugs across most 
every command line utility in existence that hasn't been specifically 
unicode-tested, and the file becomes de facto untouchable to a swath of 
useful command line utilities.  I keep having to create double-checks in every
aspect of my workflows only to run into these problems over and over again.

In the end, I am hand editing the filename of EVERY DOWNLOAD and it is a horrible existence.

Anyway, here is the output, but again, keep in mind that this output
is missing unicode characters, thus I provided a screenshot so they could be "seen".

This is consistent across four different windows command lines 
(I'm counting bash under WSL as a windows command line sorry!)


yt-dlp -vU --windows-filenames https://www.youtube.com/watch?v=xBnn3qG44hI
Available version: stable@2023.03.04, Current version: stable@2023.03.04
Current Build Hash: 5590c57bd0433ed239a2deaaf92e2ad6f37fe50f53664c821575cafe106a9421
yt-dlp is up to date (stable@2023.03.04)
[youtube] Extracting URL: https://www.youtube.com/watch?v=xBnn3qG44hI
[youtube] xBnn3qG44hI: Downloading webpage
[youtube] xBnn3qG44hI: Downloading android player API JSON
[info] xBnn3qG44hI: Downloading 1 format(s): 248+251
[dashsegments] Total fragments: 1
[download] Destination: Metalocalypse Dethklok  Go Into the Water (Gulf of Danzig Remix)  Adult Swim [xBnn3qG44hI].f248.webm

[download]   0.0% of ~   8.91MiB at   11.11KiB/s ETA 13:40 (frag 0/1)
[download]   0.0% of ~   8.91MiB at   33.33KiB/s ETA 04:33 (frag 0/1)
[download]   0.1% of ~   8.91MiB at   76.92KiB/s ETA 01:58 (frag 0/1)
[download]   0.2% of ~   8.91MiB at  164.83KiB/s ETA 00:55 (frag 0/1)
[download]   0.3% of ~   8.91MiB at  333.33KiB/s ETA 00:27 (frag 0/1)
[download]   0.7% of ~   8.91MiB at  656.24KiB/s ETA 00:13 (frag 0/1)
[download]   1.4% of ~   8.91MiB at    1.19MiB/s ETA 00:07 (frag 0/1)
[download]   2.8% of ~   8.91MiB at    2.17MiB/s ETA 00:03 (frag 0/1)
[download]   5.6% of ~   8.91MiB at    3.99MiB/s ETA 00:02 (frag 0/1)
[download]  11.2% of ~   8.91MiB at    6.80MiB/s ETA 00:01 (frag 0/1)
[download]  22.4% of ~   8.91MiB at   11.04MiB/s ETA 00:00 (frag 0/1)
[download]  44.9% of ~   8.91MiB at   15.81MiB/s ETA 00:00 (frag 0/1)
[download]  89.8% of ~   8.91MiB at   20.59MiB/s ETA 00:00 (frag 0/1)
[download] 100.0% of ~   8.91MiB at   22.19MiB/s ETA 00:00 (frag 0/1)
[download] 100.0% of ~   8.91MiB at   20.97MiB/s ETA 00:00 (frag 1/1)
[download] 100% of    8.91MiB in 00:00:00 at 13.97MiB/s              
[dashsegments] Total fragments: 1
[download] Destination: Metalocalypse Dethklok  Go Into the Water (Gulf of Danzig Remix)  Adult Swim [xBnn3qG44hI].f251.webm

[download]   0.0% of ~   4.20MiB at   12.82KiB/s ETA 05:35 (frag 0/1)
[download]   0.1% of ~   4.20MiB at   38.46KiB/s ETA 01:51 (frag 0/1)
[download]   0.2% of ~   4.20MiB at   88.61KiB/s ETA 00:48 (frag 0/1)
[download]   0.3% of ~   4.20MiB at  189.87KiB/s ETA 00:22 (frag 0/1)
[download]   0.7% of ~   4.20MiB at  382.71KiB/s ETA 00:11 (frag 0/1)
[download]   1.5% of ~   4.20MiB at  659.45KiB/s ETA 00:06 (frag 0/1)
[download]   2.9% of ~   4.20MiB at    1.18MiB/s ETA 00:03 (frag 0/1)
[download]   5.9% of ~   4.20MiB at    2.10MiB/s ETA 00:01 (frag 0/1)
[download]  11.9% of ~   4.20MiB at    3.77MiB/s ETA 00:00 (frag 0/1)
[download]  23.8% of ~   4.20MiB at    6.51MiB/s ETA 00:00 (frag 0/1)
[download]  47.5% of ~   4.20MiB at   10.64MiB/s ETA 00:00 (frag 0/1)
[download]  95.1% of ~   4.20MiB at   15.27MiB/s ETA 00:00 (frag 0/1)
[download] 100.0% of ~   4.20MiB at   15.81MiB/s ETA 00:00 (frag 0/1)
[download] 100.0% of ~   4.20MiB at   14.55MiB/s ETA 00:00 (frag 1/1)
[download] 100% of    4.20MiB in 00:00:00 at 10.31MiB/s              
[Merger] Merging formats into "Metalocalypse Dethklok  Go Into the Water (Gulf of Danzig Remix)  Adult Swim [xBnn3qG44hI].webm"
Deleting original file Metalocalypse Dethklok  Go Into the Water (Gulf of Danzig Remix)  Adult Swim [xBnn3qG44hI].f248.webm (pass -k to keep)
Deleting original file Metalocalypse Dethklok  Go Into the Water (Gulf of Danzig Remix)  Adult Swim [xBnn3qG44hI].f251.webm (pass -k to keep)


< 6:25a> <9%> O:\MEDIA\FOR-REVIEW\oh\Dethklok\test>dir

 Volume in drive O is HD18T - THAILOG18T - O Serial number is 782c:4fe5
 Directory of  O:\MEDIA\FOR-REVIEW\oh\Dethklok\test\*

 5/19/2023   6:25         <DIR>    .
 5/19/2023   6:25         <DIR>    ..
 5/18/2023  15:25      13,750,847  Metalocalypse： Dethklok ｜ Go Into the Water (Gulf of Danzig Remix) ｜ Adult Swim [xBnn3qG44hI].webm
 5/19/2023   6:24             130  test.bat
          13,750,977 bytes in 2 files and 2 dirs    13,762,560 bytes allocated

^^^ ADDITIONAL NOTE! See that colon after "Metalocalypse" in the "dir" output above? That actually isn't renderable on the screen in any command line. But because I copied and pasted that section with the mouse (rather than redirecting to the clipboard device at the command line), the character successfully made it to the clipboard, and then to this bug report. However, this simply doesn't work so smoothly in most other cases.

These characters can be copied out of the command-line world into the gui-world, but they don't typically work in the command line world. There's not even a way to properly render it on the console without doing strange things like codepage changes that don't actually work (changing codepage from 437 to 65001 does NOT fix these problems, not any of them).

In the end, the windows CLI world just isn't ready for these characters. Windows CLI has worked fine with spaces and ampersands for over 20 yrs, but it's still not there for command-line utilities.
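
To make the "treated as 2 characters" behaviour concrete, here is a small sketch that inspects one such character with Python's `unicodedata` module. It assumes the colon look-alike in the file name is U+FF1A (FULLWIDTH COLON); the post does not state the code point, so treat that as an assumption. `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Summarize one character: identity, encoded sizes, display width class."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),
        # 'F' means fullwidth: terminals normally draw it two columns wide.
        "east_asian_width": unicodedata.east_asian_width(ch),
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


# ASCII colon versus the assumed fullwidth look-alike.
for ch in (":", "\uff1a"):
    print(describe(ch))
```

A tool that counts bytes, or assumes one screen column per character, may misplace the cursor or truncate highlighting around a character like this, which would be consistent with the symptoms described above.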

ClaireCJS added the bug and triage labels May 19, 2023
dirkf (Contributor) commented May 19, 2023

  1. This SO commenter doesn't agree with you, though.
  2. Your -v log omits the line that shows the encodings in use; also the Python + libs version lines.
  3. --no-restrict-filenames is clearly labelled as (default) in the help text
  4. A console font with sufficiently wide Unicode support should display the "? in a box" characters.
  5. Have you investigated the +U format code for your filename output template?
  6. "this problem immediately began when I switched": that seems weird, since yt-dl doesn't filter weird "Unicode characters" unless --restrict-filenames is specified (yt-dl master first applies the +U (NFKC) format, but the result probably gets _ed anyway); maybe the bundled Python version is responsible?
  7. While a conversion that identifies any "confusable" Unicode character with its ASCII equivalent would be possible, the linked table is more strict than desirable, eg 1 (one) == l (lower-case el).
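
For point 2, the encodings in play can also be checked directly from Python. The sketch below is not yt-dlp's own debug code; `encoding_report` is a hypothetical helper that only gathers the values the standard library exposes.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python uses for text, file names and the console."""
    return {
        "locale": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        # stdout can be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        "python": sys.version.split()[0],
    }


for key, value in encoding_report().items():
    print(f"{key:>10}: {value}")
```

If the stdout encoding is a legacy code page rather than UTF-8, characters outside that code page cannot be written to a redirected stream as-is, which may explain why they went missing from the pasted log.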

Grub4K (Member) commented May 19, 2023

In the end, the windows CLI world just isn't ready for these characters. Windows CLI has worked fine with spaces and ampersands for over 20 yrs, but it's still not there for command-line utilities.

You are confusing the shell in use (cmd, powershell, bash, ...) with the console emulator. In your case you are using conhost, which is ancient and has now been superseded by Windows Terminal. Give Windows Terminal a try or even use a completely third-party console emulator; both options support the full range of unicode characters and the like. Also, if you don't have one already, install a monospaced font with full unicode character support. You are not restricted to conhost or raster fonts.

ClaireCJS (Author) commented May 19, 2023

In the end, the windows CLI world just isn't ready for these characters. Windows CLI has worked fine with spaces and ampersands for over 20 yrs, but it's still not there for command-line utilities.

You are confusing the shell in use (cmd, powershell, bash, ...) with the console emulator. In your case you are using conhost, which is ancient and has now been superseded by Windows Terminal. Give Windows Terminal a try or even use a completely third-party console emulator; both options support the full range of unicode characters and the like. Also, if you don't have one already, install a monospaced font with full unicode character support. You are not restricted to conhost or raster fonts.

That's all very interesting, but in the end, I don't think that's the solution for me.

Enumerating the many use cases where these characters are problematic is beyond the scope of this bug report. I'm tired of having to find unicode fixes for the myriad of programs I have in my workflows. It's eaten upwards of ~20 hours troubleshooting unicode incompatibility issues in my various workflows in 2023 alone.

These characters are problematic.
In many situations.

(Nor am I looking to perturb my TCC command-line environment that I've been continuously developing my own layer over for 35 years.)

I don't want the characters in my filename, and that is what this bug report is about.

But perhaps it's really a feature request and not a bug report. My bad on that.

ClaireCJS (Author) commented May 19, 2023

  2. Your -v log omits the line that shows the encodings in use; also the Python + libs version lines.

Hmm, well, if something was missing, it must not have been produced as part of the output.

  4. A console font with sufficiently wide Unicode support should display the "? in a box" characters.

But alas, that is something I've tried in the past. It created other problems, didn't solve the existing ones, and ended up not fitting the use cases I need.

I just don't want those characters to be in the filenames, and I don't want to manually have to edit them out, which is what I've been doing. Even if they display properly, they are still problems in situations, so I'm still going to be editing them out.

(Not all software on the planet can deal with them, not all software can be replaced, and I do a lot of niche things where sometimes I'm stuck running old software that can't easily be replaced)

  5. Have you investigated the +U format code for your filename output template?

I have, but it looked way too complicated and I don't want to regenerate the whole filename template. I like the way it comes out now, and I already have things that ingest the filename format as it stands, so I really just want the character substitution, that's all.

There's already character substitution support, it's just too much. Unicode and spaces are not in the same category of problematic filenames. I just want a more gentle mapping that only substitutes the unicode characters.

This is seeming more like a feature request than a bug report so I'm wishing I'd filed this in the correct place.

  6. "this problem immediately began when I switched": that seems weird, since yt-dl doesn't filter weird "Unicode characters" unless --restrict-filenames is specified (yt-dl master first applies the +U (NFKC) format, but the result probably gets _ed anyway); maybe the bundled Python version is responsible?

It's weird indeed. The youtube-dl.exe i used was a standalone exe, no clue about what kind of python bundling might exist. So many potential explanations including me mis-remembering, me suddenly getting interested in music that has more unicode filenames, etc etc. Hard to look back and ever know why that was when it started to become a problem, it just was. Maybe youtube-dl did save them, but not as often? Or maybe they were omitted by weird incompatibility? Who knows at this point, one could only speculate.

  7. While a conversion that identifies any "confusable" Unicode character with its ASCII equivalent would be possible, the linked table is more strict than desirable, eg 1 (one) == l (lower-case el).

It does seem like a problem that's been solved umpteen times, and I'm honestly surprised there's not some "oneliner" library out there that would do exactly this automatically. Conversion to and from utf-8 is easy enough, but yea, this is something a bit different than that.

Imagine my horror the first time i saw a colon in a filename (name specifically, not the path) on a windows machine 🤣

pukkandan (Member) commented May 19, 2023

See pinned issues #4836, #5014, especially #5014 (comment)

pukkandan added the question label and removed the bug and triage labels May 19, 2023
dirkf (Contributor) commented May 19, 2023

There is a detox utility that you could run in WSL as a batch filename corrector: needs configuration.

The Unicode site offers a file confusables.txt that lists "confusable" Unicode characters and their equivalent as a sequence of non-confusable characters. This file can be processed to generate various mappings.

I generated a function asc_from_confusable() that takes any Unicode character, finds its entry, if any, and, if all the non-confusable characters are ASCII, returns the string of those characters, or otherwise the original character. Such a function might improve filename sanitization, but the mapping table is 180kB.

$ python3.9
Python 3.9.16 (main, Jan 12 2023, 04:51:49) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import confusables
>>> x = '𝟖'
>>> ord(x)
120790
>>> confusables.asc_from_confusable(x)
'8'
>>> 
$
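
For readers who want to try the same idea, here is one way such a lookup could be built. It is not dirkf's implementation: `parse_confusables` is a hypothetical name, the sketch assumes the `source ; target ; type # comment` line layout with hexadecimal code points, and it uses a single sample line for the mapping shown in the session above instead of the real file. Note that `asc_from_confusable` here takes the table as a second argument, unlike the call above.

```python
def parse_confusables(lines):
    """Build {character: ASCII string} from lines in confusables.txt layout."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()     # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        if target.isascii():                     # keep ASCII-only targets
            table[source] = target
    return table


def asc_from_confusable(char, table):
    """Return the ASCII equivalent if one is known, else the character itself."""
    return table.get(char, char)


# One sample line in the assumed layout (U+1D7D6 maps to U+0038, the digit 8).
SAMPLE = ["1D7D6 ;\t0038 ;\tMA\t# MATHEMATICAL BOLD DIGIT EIGHT -> DIGIT EIGHT"]
table = parse_confusables(SAMPLE)
print(asc_from_confusable("\U0001d7d6", table))
```

With the real file, the lines would come from something like `open("confusables.txt", encoding="utf-8-sig")`; the path is an assumption, and the `utf-8-sig` codec is there in case the file starts with a byte-order mark.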

ClaireCJS (Author) commented May 19, 2023

That confusables file is interesting, but it isn't quite what's needed. It's just remapping unicodes to other unicodes. So they're just as unwanted after your function as before, for my use case.

And yea, I made my own filename fixer, but the problem with my own fixer is the same problem as using --replace-in-metadata -- the mappings still have to be added one character at a time.

So far I have 2.
I have full alphabets to go.

And 2 places to put each one - my filename fixer, and my wrapper for yt-dlp (though I could skip that, I'd rather have redundancy).

Ugh.
No clue what I'd want Japanese characters to go to, other than their English phonetic equivalent, so that at least I get something that I am pronouncing somewhat correctly.

I don't have the language knowledge to do this well but I suppose with enough time and the right wikipedia page...

I'm pretty much wanting a unicode => ascii mapping. I thought it wasn't a big ask but probably it's bigger than I think (most development ends up being 🤣)
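
As a rough illustration of why this is a bigger ask than it looks, the standard library can fold accented Latin letters and compatibility forms to ASCII, but it has nothing to offer for scripts like Japanese. `fold_to_ascii` below is a hypothetical helper, a sketch and not a full transliterator.

```python
import unicodedata


def fold_to_ascii(text: str, placeholder: str = "_") -> str:
    """Best-effort ASCII folding using only Unicode decomposition."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if ch.isascii():
            out.append(ch)
        elif unicodedata.combining(ch):
            continue                  # drop accents split off by NFKD
        else:
            out.append(placeholder)   # no ASCII decomposition available
    return "".join(out)


print(fold_to_ascii("M\u00f6tley Cr\u00fce"))   # accented Latin letters
print(fold_to_ascii("\u65e5\u672c"))            # two kanji: nothing to fold to
```

Anything that ends up as the placeholder is where a language-specific romanizer, or a hand-made mapping table, has to take over.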

@october262

ClaireCJS (Author) commented May 19, 2023

try with --compat-options filename-sanitization

Didn't change the behavior.

At this point I've done an hour or so of work getting the polyglot library installed on python to detect language and romanize where possible, and then built my own mapping table for characters not caught by that. It's ugly, but it's finally starting to automatically rename these bad filenames.

It runs the whole filename through polyglot as one string, which may or may not detect the language and transliterate it.

Then it goes character by character: my own mapping table is checked first, then the 3 languages I care about, using 3 language-specific libraries for those characters.

Finally, anything left over goes through unidecode.

It's a doozy and i was probably asking too much for this before.... I guess I just solved my own problem.

import os
import sys
import unidecode
from colorama import Fore, Style, init

DRY_RUN = False


# Mapping of unicode symbols to ASCII equivalents that are valid for filenames
unicode_to_valid_filename_ASCII_map = {
    '\uff5c': '-' ,  # ｜ unicode pipe (fullwidth vertical line)
    '\uff01': '!' ,  # ！ unicode exclamation mark (fullwidth)
    '\uff1f': '_' ,  # ？ unicode question mark (fullwidth)
    '\uff1a': '- ',  # ： unicode colon (fullwidth)
    '\uff1b': ';' ,  # ； unicode semicolon (fullwidth)
    '\uff0c': ',' ,  # ， unicode comma (fullwidth)
    '。' :   '.' ,  # unicode full stop
    '⧸' :   '--',  # unicode slash
}



def translate_character(char):
    """Translates a single character to its ASCII equivalent."""
    if char in unicode_to_valid_filename_ASCII_map: return unicode_to_valid_filename_ASCII_map[char]
    if   '\u4e00' <= char <= '\u9fff': ascii_equiv = translate_chinese_to_ascii (char)            # if Chinese
    elif '\u3040' <= char <= '\u30ff': ascii_equiv = translate_japanese_to_ascii(char)            # if Japanese
    elif '\uac00' <= char <= '\ud7af': ascii_equiv = translate_korean_to_ascii  (char)            # if Korean
    else:                              ascii_equiv = unidecode.unidecode        (char)            # if Unicode
    # return ascii_equiv if ascii_equiv.isalnum() else '_'  # BAD: turned " " into "_"
    return ascii_equiv


# Japanese
import romkan
def translate_japanese_to_ascii(char):
    return romkan.to_roma(char)

# Chinese
from pypinyin import lazy_pinyin, Style as PypinyinStyle
def translate_chinese_to_ascii(char):
    return ''.join(lazy_pinyin(char, style=PypinyinStyle.TONE3))

# Korean
from korean_romanizer.romanizer import Romanizer
def translate_korean_to_ascii(korean_text):
    r = Romanizer(korean_text)
    return r.romanize()



def romanize(text):
    """Return translated text, but fail very gracefully and transparently if there are any exceptions"""
    try:
        import logging
        logging.getLogger('polyglot').setLevel(logging.ERROR)   # Disable logging messages from Polyglot
        from polyglot.detect import Detector
        from polyglot.transliteration import Transliterator
        detector = Detector(text)
        source_lang = detector.language.code
        transliterator = Transliterator(source_lang=source_lang, target_lang="en")
        return transliterator.transliterate(text)
    except Exception:
        return text   # fall back to the untouched text if polyglot is missing or fails



def translate_filename(filename):
    """Translates a filename to its ASCII equivalent."""
    filename_romanized_with_polyglot = romanize(filename)
    return ''.join(translate_character(char) for char in filename_romanized_with_polyglot)

import msvcrt
def ask_permission(old_name, new_name):
    """Asks the user for permission to rename a file."""
    print(f"\n{Fore.YELLOW}{Style.BRIGHT}***** Rename:"                                                                   +
          f"\n{Fore.RED   }{Style.BRIGHT}From: {Style.NORMAL}{old_name}{Fore.CYAN}{Style.NORMAL}"                         +
          f"\n{Fore.GREEN }{Style.BRIGHT}  To: {Style.NORMAL}{new_name}{Fore.CYAN}{Style.NORMAL} "                        +
          f"\n{Fore.YELLOW}{Style.BRIGHT}***** Rename?"                                                                   +
          f" { Fore.BLUE  }{Style.BRIGHT}[{Fore.CYAN}Y{Fore.BLUE}/{Style.NORMAL}{Fore.CYAN}n{Style.BRIGHT}]{Style.NORMAL} ", end="")
    # Single keypress; Enter ('\r') strips to '' and counts as yes. Undecodable keys are ignored.
    response = msvcrt.getch().decode(errors="ignore").lower().strip()
    print(Style.BRIGHT, end="")
    if response in ('y', ''):
        print(f"{Fore.GREEN}Yes!", end="")
        return True
    print(f"{Fore.RED}No!", end="")
    return False

def rename_files_in_directory(directory):
    """Renames all files in a directory, replacing unicode characters."""
    # DRY_RUN is only read here, so the module-level setting takes effect.
    do_it_for_real = True
    automatic      = False
    permission     = False
    for filename in os.listdir(directory):
        new_name = translate_filename(filename)

        if filename != new_name:
            if       os.getenv('AUTOMATIC_UNICODE_CLEANING'):
                del os.environ['AUTOMATIC_UNICODE_CLEANING']                    #only let this directive work once!
                automatic      = True
                do_it_for_real = True
                action_string  = "  Auto-Renamed"
            else:
                permission = ask_permission(filename, new_name)
                do_it_for_real = permission
                action_string  = "       Renamed" if permission is True else f"{Fore.RED}Did not rename"
            if DRY_RUN:
                do_it_for_real = False

            old_file = os.path.join(directory, filename)
            new_file = os.path.join(directory, new_name)

            if do_it_for_real: os.rename(old_file, new_file)

            print("\n")
            if automatic: print(f"\t{Fore.YELLOW}Automatic Run: ")
            if DRY_RUN:   print(f"\t{Fore.YELLOW}" +  "Dry Run: ")
            print(f"{Fore.GREEN}{Style.NORMAL}\t{action_string}:\t{Fore.LIGHTBLACK_EX}{old_file} " +
                  f"{Fore.CYAN}\n\t\t    to:\t{Fore.GREEN}{new_file}{Style.NORMAL}\n\n\n")

def main():
    init()
    directory = sys.argv[1] if len(sys.argv) > 1 else '.'
    rename_files_in_directory(directory)


if __name__ == "__main__":
    main()
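
One thing the script above does not handle is two different names collapsing to the same ASCII name, in which case `os.rename` fails on Windows because the target already exists. A possible guard is sketched below; `safe_rename` is a hypothetical helper and is not part of the script as posted.

```python
import os


def safe_rename(old_path: str, new_path: str) -> str:
    """Rename old_path, appending ' (1)', ' (2)', ... if the target exists."""
    base, ext = os.path.splitext(new_path)
    candidate, n = new_path, 1
    while os.path.exists(candidate):
        candidate = f"{base} ({n}){ext}"
        n += 1
    os.rename(old_path, candidate)
    return candidate   # the name that was actually used
```

It would stand in for the bare `os.rename(old_file, new_file)` call. The existence check and the rename are two separate steps, so this is only suitable when nothing else is writing to the directory at the same time.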

rgiuffre commented Jun 1, 2023

I use --compat-options filename-sanitization

@good1uck

try with --compat-options filename-sanitization

Didn't change the behavior.

At this point I've done an hour or so of work getting the polyglot library installed on python to detect language and romanize where possible, and then built my own mapping table for characters not caught by that. It's ugly, but it's finally starting to automatically rename these bad filenames.

It runs the whole filename through polyglot as one string, which may or may not detect the language and transliterate it.

Then it goes character by character: my own mapping table is checked first, then the 3 languages I care about, using 3 language-specific libraries for those characters.

Finally, anything left over goes through unidecode.

It's a doozy and i was probably asking too much for this before.... I guess I just solved my own problem.

import os
import sys
import unidecode
from colorama import Fore, Style, init
import re

DRY_RUN = False


# Mapping of unicode symbols to ASCII equivalents that are valid for filenames
unicode_to_valid_filename_ASCII_map = {
    '\uff5c': '-' ,  # ｜ unicode pipe (fullwidth vertical line)
    '\uff01': '!' ,  # ！ unicode exclamation mark (fullwidth)
    '\uff1f': '_' ,  # ？ unicode question mark (fullwidth)
    '\uff1a': '- ',  # ： unicode colon (fullwidth)
    '\uff1b': ';' ,  # ； unicode semicolon (fullwidth)
    '\uff0c': ',' ,  # ， unicode comma (fullwidth)
    '。' :   '.' ,  # unicode full stop
    '⧸' :   '--',  # unicode slash
}



def translate_character(char):
    """Translates a single character to its ASCII equivalent."""
    if char in unicode_to_valid_filename_ASCII_map: return unicode_to_valid_filename_ASCII_map[char]
    if   '\u4e00' <= char <= '\u9fff': ascii_equiv = translate_chinese_to_ascii (char)            # if Chinese
    elif '\u3040' <= char <= '\u30ff': ascii_equiv = translate_japanese_to_ascii(char)            # if Japanese
    elif '\uac00' <= char <= '\ud7af': ascii_equiv = translate_korean_to_ascii  (char)            # if Korean
    else:                              ascii_equiv = unidecode.unidecode        (char)            # if Unicode
    #eturn ascii_equiv if ascii_equiv.isalnum() else '_'  #BAD: turned " " to "_"
    return ascii_equiv


# Japanese
import romkan
def translate_japanese_to_ascii(char):
    return romkan.to_roma(char)

# Chinese
from pypinyin import lazy_pinyin, Style as PypinyinStyle
def translate_chinese_to_ascii(char):
    return ''.join(lazy_pinyin(char, style=PypinyinStyle.TONE3))

# Korean
from korean_romanizer.romanizer import Romanizer
def translate_korean_to_ascii(korean_text):
    r = Romanizer(korean_text)
    return r.romanize()



def romanize(text):
    """Return translated text, but fail very gracefully and transparently if there are any exceptions"""
    try:
        import logging
        logging.getLogger('polyglot').setLevel(logging.ERROR)   # Disable logging messages from Polyglot
        from polyglot.detect import Detector
        from polyglot.transliteration import Transliterator
        detector = Detector(text)
        source_lang = detector.language.code
        transliterator = Transliterator(source_lang=source_lang, target_lang="en")
        return transliterator.transliterate(text)
    except Exception:
        return(text)



def translate_filename(filename):
    """Translates a filename to its ASCII equivalent."""
    filename_romanized_with_polyglot = romanize(filename)
    return ''.join(translate_character(char) for char in filename_romanized_with_polyglot)

import msvcrt
def ask_permission(old_name, new_name):
    """Asks the user for permission to rename a file."""
    print(f"\n{Fore.YELLOW}{Style.BRIGHT}***** Rename:"                                                                   +
          f"\n{Fore.RED   }{Style.BRIGHT}From: {Style.NORMAL}{old_name}{Fore.CYAN}{Style.NORMAL}"                         +
          f"\n{Fore.GREEN }{Style.BRIGHT}  To: {Style.NORMAL}{new_name}{Fore.CYAN}{Style.NORMAL} "                        +
          f"\n{Fore.YELLOW}{Style.BRIGHT}***** Rename?"                                                                   +
          f" { Fore.BLUE  }{Style.BRIGHT}[{Fore.CYAN}Y{Fore.BLUE}/{Style.NORMAL}{Fore.CYAN}n{Style.BRIGHT}]{Style.NORMAL} ", end="")
    response = msvcrt.getch().decode().lower().strip()
    print(Style.BRIGHT, end="")
    if response.lower() in ['y', 'yes', '']:
        print(f"{Fore.GREEN}Yes!", end="")
        return True
    print(f"{Fore.RED}No!", end="")
    return False

def rename_files_in_directory(directory):
    """Renames all files in a directory, replacing unicode characters."""
    # DRY_RUN is only read here, so the module-level setting is respected instead of being reset to False
    for filename in os.listdir(directory):
        new_name = translate_filename(filename)

        if filename != new_name:
            automatic = bool(os.getenv('AUTOMATIC_UNICODE_CLEANING'))           #recomputed per file so the "Automatic Run" label stays accurate
            if automatic:
                del os.environ['AUTOMATIC_UNICODE_CLEANING']                    #only let this directive work once!
                do_it_for_real = True
                action_string  = "  Auto-Renamed"
            else:
                do_it_for_real = ask_permission(filename, new_name)
                action_string  = "       Renamed" if do_it_for_real else f"{Fore.RED}Did not rename"
            if DRY_RUN:
                do_it_for_real = False

            old_file = os.path.join(directory, filename)
            new_file = os.path.join(directory, new_name)

            if do_it_for_real: os.rename(old_file, new_file)

            print("\n")
            if automatic: print(f"\t{Fore.YELLOW}Automatic Run: ")
            if DRY_RUN:   print(f"\t{Fore.YELLOW}" +  "Dry Run: ")
            print(f"{Fore.GREEN}{Style.NORMAL}\t{action_string}:\t{Fore.LIGHTBLACK_EX}{old_file} " +
                  f"{Fore.CYAN}\n\t\t    to:\t{Fore.GREEN}{new_file}{Style.NORMAL}\n\n\n")

def main():
    init()
    directory = sys.argv[1] if len(sys.argv) > 1 else '.'
    rename_files_in_directory(directory)


if __name__ == "__main__":
    main()

#11046
Can you save me bro

@ClaireCJS
Author

try with --compat-options filename-sanitization

Didn't change the behavior.
At this point I've spent an hour or so getting the polyglot library installed for Python to detect the language and romanize where possible, and I have my own mapping table for characters it doesn't catch. It's ugly, but it's finally starting to rename these bad filenames automatically.
It runs the whole filename through polyglot as one string, which may or may not detect the language and transliterate it.
Then it goes character by character, checking specifically for the 3 languages I care about, and uses 3 language-specific libraries for those characters.
Finally, there is my own mapping table (the code actually checks it first for each character), and anything left over goes through unidecode.
It's a doozy, and I was probably asking too much for this before... I guess I just solved my own problem.

import os
import sys
import unidecode
from colorama import Fore, Style, init
import re

DRY_RUN = False


# Mapping of unicode symbols to ASCII equivalents that are valid for filenames
unicode_to_valid_filename_ASCII_map = {
    '|' :   '-' ,  # unicode pipe             (U+FF5C)
    '!' :   '!' ,  # unicode exclamation mark (U+FF01)
    '?' :   '_' ,  # unicode question mark    (U+FF1F)
    ':' :   '- ',  # unicode colon            (U+FF1A)
    ';' :   ';' ,  # unicode semicolon        (U+FF1B)
    ',' :   ',' ,  # unicode comma            (U+FF0C)
    '。' :   '.' ,  # unicode full stop        (U+3002)
    '⧸' :   '--',  # unicode slash            (U+29F8)
}



def translate_character(char):
    """Translates a single character to its ASCII equivalent."""
    if char in unicode_to_valid_filename_ASCII_map: return unicode_to_valid_filename_ASCII_map[char]
    if   '\u4e00' <= char <= '\u9fff': ascii_equiv = translate_chinese_to_ascii (char)            # if Chinese
    elif '\u3040' <= char <= '\u30ff': ascii_equiv = translate_japanese_to_ascii(char)            # if Japanese
    elif '\uac00' <= char <= '\ud7af': ascii_equiv = translate_korean_to_ascii  (char)            # if Korean
    else:                              ascii_equiv = unidecode.unidecode        (char)            # if Unicode
    # return ascii_equiv if ascii_equiv.isalnum() else '_'  # BAD: turned " " into "_"
    return ascii_equiv


# Japanese
import romkan
def translate_japanese_to_ascii(char):
    return romkan.to_roma(char)

# Chinese
from pypinyin import lazy_pinyin, Style as PypinyinStyle
def translate_chinese_to_ascii(char):
    return ''.join(lazy_pinyin(char, style=PypinyinStyle.TONE3))

# Korean
from korean_romanizer.romanizer import Romanizer
def translate_korean_to_ascii(korean_text):
    r = Romanizer(korean_text)
    return r.romanize()



def romanize(text):
    """Return translated text, but fail very gracefully and transparently if there are any exceptions"""
    try:
        import logging
        logging.getLogger('polyglot').setLevel(logging.ERROR)   # Disable logging messages from Polyglot
        from polyglot.detect import Detector
        from polyglot.transliteration import Transliterator
        detector = Detector(text)
        source_lang = detector.language.code
        transliterator = Transliterator(source_lang=source_lang, target_lang="en")
        return transliterator.transliterate(text)
    except Exception:
        return text   # fall back to the untouched text if polyglot is missing or fails



def translate_filename(filename):
    """Translates a filename to its ASCII equivalent."""
    filename_romanized_with_polyglot = romanize(filename)
    return ''.join(translate_character(char) for char in filename_romanized_with_polyglot)

import msvcrt
def ask_permission(old_name, new_name):
    """Asks the user for permission to rename a file."""
    print(f"\n{Fore.YELLOW}{Style.BRIGHT}***** Rename:"                                                                   +
          f"\n{Fore.RED   }{Style.BRIGHT}From: {Style.NORMAL}{old_name}{Fore.CYAN}{Style.NORMAL}"                         +
          f"\n{Fore.GREEN }{Style.BRIGHT}  To: {Style.NORMAL}{new_name}{Fore.CYAN}{Style.NORMAL} "                        +
          f"\n{Fore.YELLOW}{Style.BRIGHT}***** Rename?"                                                                   +
          f" { Fore.BLUE  }{Style.BRIGHT}[{Fore.CYAN}Y{Fore.BLUE}/{Style.NORMAL}{Fore.CYAN}n{Style.BRIGHT}]{Style.NORMAL} ", end="")
    response = msvcrt.getch().decode().lower().strip()
    print(Style.BRIGHT, end="")
    if response.lower() in ['y', 'yes', '']:
        print(f"{Fore.GREEN}Yes!", end="")
        return True
    print(f"{Fore.RED}No!", end="")
    return False

def rename_files_in_directory(directory):
    """Renames all files in a directory, replacing unicode characters."""
    # DRY_RUN is only read here, so the module-level setting is respected instead of being reset to False
    for filename in os.listdir(directory):
        new_name = translate_filename(filename)

        if filename != new_name:
            automatic = bool(os.getenv('AUTOMATIC_UNICODE_CLEANING'))           #recomputed per file so the "Automatic Run" label stays accurate
            if automatic:
                del os.environ['AUTOMATIC_UNICODE_CLEANING']                    #only let this directive work once!
                do_it_for_real = True
                action_string  = "  Auto-Renamed"
            else:
                do_it_for_real = ask_permission(filename, new_name)
                action_string  = "       Renamed" if do_it_for_real else f"{Fore.RED}Did not rename"
            if DRY_RUN:
                do_it_for_real = False

            old_file = os.path.join(directory, filename)
            new_file = os.path.join(directory, new_name)

            if do_it_for_real: os.rename(old_file, new_file)

            print("\n")
            if automatic: print(f"\t{Fore.YELLOW}Automatic Run: ")
            if DRY_RUN:   print(f"\t{Fore.YELLOW}" +  "Dry Run: ")
            print(f"{Fore.GREEN}{Style.NORMAL}\t{action_string}:\t{Fore.LIGHTBLACK_EX}{old_file} " +
                  f"{Fore.CYAN}\n\t\t    to:\t{Fore.GREEN}{new_file}{Style.NORMAL}\n\n\n")

def main():
    init()
    directory = sys.argv[1] if len(sys.argv) > 1 else '.'
    rename_files_in_directory(directory)


if __name__ == "__main__":
    main()

#11046 Can you save me bro

I'm not a bro, but check out my fix_unicode_filename project on my GitHub. It's a more current [and bulkier] version of this. Alas, I still have to edit it whenever a new character I haven't encountered crops up, so it's far from perfect, but I have it as part of my yt-dlp workflow to do what yt-dlp won't do.

Emoji and unicode screw up a lot of software. My image viewer won't even view images if a folder has an emoji in it, for example.
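
For the "new character I haven't encountered" problem, one option is to list the stragglers before they break anything. This is only a rough sketch and not part of the script above: `report_non_ascii` and its `already_mapped` parameter are names made up for illustration, and it uses nothing beyond the standard library. It counts every non-ASCII character found in a list of filenames, skips the ones you say are already handled, and prints the code point and official Unicode name so the output itself stays ASCII-safe on a Windows console.

```python
import os
import sys
import unicodedata
from collections import Counter


def report_non_ascii(names, already_mapped=()):
    """Return (code point, Unicode name, count) for each non-ASCII character
    in `names` that is not listed in `already_mapped` (hypothetical helper)."""
    counts = Counter(
        ch
        for name in names
        for ch in name
        if ord(ch) > 127 and ch not in already_mapped
    )
    # unicodedata.name() raises ValueError for unnamed code points unless a
    # default is supplied, so a placeholder is passed as the second argument.
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for code_point, uni_name, n in report_non_ascii(os.listdir(directory)):
        print(f"{code_point}\t{n}\t{uni_name}")
```

Passing the keys of an existing mapping table as `already_mapped` would limit the report to characters that table does not cover yet. Characters that a transliteration library already handles would still show up, so treat the output as a list of candidates to review.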
