
[scsynth:windows] support unicode in argv #4479

Merged
merged 7 commits on Apr 6, 2020

Conversation

dyfer
Member

@dyfer dyfer commented Jul 7, 2019

Purpose and Motivation

In the process of working on #4009, it turned out that device names with unicode characters are not handled properly on Windows. AFAICT they need to be encoded in UTF-16 (AKA "wide characters") on the stdio level on Windows.

EDIT: This PR is now updated according to #4479 (review). Scsynth on Windows reads command line arguments in UTF-16 and converts them to UTF-8; it also sets the terminal to UTF-8, similarly to what sclang does. This should allow selecting devices with non-ASCII characters in their name.

Please note: while the requested (non-ASCII) device name is properly displayed in the terminal when scsynth boots, I'm not able to test whether device selection works properly for devices with non-ASCII names, as no devices on my system have non-ASCII characters in their name.

I'm hiding outdated information below

I started working on this issue, but it seems to be larger than just reading unicode characters in the arguments of scsynth. See the relevant discussion: https://github.com/supercollider/supercollider/pull/4009#issuecomment-509001234 and https://github.com/supercollider/supercollider/pull/4009#issuecomment-509016396

It seems that this PR solves the issue of passing unicode characters to scsynth on the command line, by having scsynth properly handle UTF-16 encoding on input. However, the characters are still not printed correctly outside the SC IDE: they are emitted in UTF-8, which makes them display incorrectly in the console. In SCIDE they are displayed correctly, as the IDE's post window seems to interpret characters as UTF-8.

With this in mind, it seems that we still need to fix:

  • printing scsynth's unicode characters on Windows in the console (in UTF-16)
  • reading these characters in scide (converting UTF-16 to UTF-8)
  • eventually bringing scsynth's changes to supernova

Please note, the current implementation has a memory leak, as pointed out by @Spacechild1: #4009 (comment)

My intention for this PR is to have it for reference for the work on fixing the unicode pipeline on Windows. As this seems to be a larger project, I don't think I'll be able to continue work on this.

Types of changes

  • Bug fix

To-do list

  • Code is tested
  • This PR is ready for review

@jamshark70
Contributor

I can confirm that:

s.options.device = "Windows WASAPI : 扬声器/听筒 (Realtek High Definition Audio)";
s.boot;

... produces the following in scsynth's startup output:

Booting with:  In: MME : Line 1/2 (M-Audio Fast Track Pr  Out: Windows WASAPI : 扬声器/听筒 (Realtek High Definition Audio)

(... which fails, but a/ this version doesn't support different I/O devices and b/ the point is that the Chinese characters came through fine.)

By contrast, the AppVeyor build for #4475 (independent I/O devices) prints (same input code):

Requested devices:
  In (matching device NOT found):  - Windows WASAPI : ������/��Ͳ (Realtek High Definition Audio)
  Out (matching device NOT found):
  - Windows WASAPI : ������/��Ͳ (Realtek High Definition Audio)
Selected default ASIO input/output devices

So the fix appears to be functionally correct.

@dyfer
Member Author

dyfer commented Jul 8, 2019

Thank you for testing @jamshark70!
BTW I think this is the last AppVeyor build from the old PR - it should have this fix, as well as independent I/O devices, if you'd like to test that.

@sonoro1234
Contributor

I can't use the command line to set
Windows WASAPI : Micr├│fono de los auriculares (IDT High Definition Audio CODEC)
The IDE instead shows
Windows WASAPI : Micrófono de los auriculares (IDT High Definition Audio CODEC)
and setting

s.options.inDevice_("Windows WASAPI : Micrófono de los auriculares (IDT High Definition Audio CODEC)"); //replace device
s.options.outDevice_("Windows WASAPI : Altavoces y auriculares (IDT High Definition Audio CODEC)"); //replace device

does

Booting with:
  In: Windows WASAPI : Micrófono de los auriculares (IDT High Definition Audio CODEC)
  Out: MME : Altavoces y auriculares (IDT Hi
SC_PortAudioDriver: PortAudio failed at Pa_OpenStream with error: 'Illegal combination of I/O devices'
could not initialize audio.

@dyfer
Member Author

dyfer commented Jul 8, 2019

Uhm.... I see it's hard to test this, as this PR does not include the independent input/output fix from #4475. Temporarily, you can test on the last AppVeyor build from #4009.
After #4475 is merged I'll rebase this so that work on this can be resumed.

@sonoro1234 if you want to give it another go (with the AppVeyor build), could you try on the command line with the proper characters (not the way scsynth currently prints to the cmdline), so something like
scsynth.exe -H "Windows WASAPI : Micrófono de los (...)" "Windows WASAPI : Altavoces (...)" ?

Also, since this is doing substring matching, does this work?
scsynth.exe -H "Windows WASAPI : Micr" "Windows WASAPI : Altavoces"

@sonoro1234
Contributor

sonoro1234 commented Jul 9, 2019

if you want to give it another go (with the AppVeyor build), could you try on the command line with the proper characters (not the way scsynth currently prints to the cmdline), so something like
scsynth.exe -H "Windows WASAPI : Micrófono de los (...)" "Windows WASAPI : Altavoces (...)" ?

From which PR?

Meanwhile I have done
-H "Windows DirectSound : Micrófono de los auriculares (IDT High Definition Audio CODEC)"
with scsynth 3.10.2, and:
this prints in the Lua2SC IDE (UTF-8 encoding):
Windows DirectSound : Micrófono de los auriculares (IDT High Definition Audio CODEC)
while in the Windows console (echoing the command):
Windows DirectSound : Micr├│fono de los auriculares (IDT High Definition Audio CODEC)

the device is recognized because it tries to boot with

Booting with:
  In: Windows DirectSound : Micrófono de los auriculares (IDT High Definition Audio CODEC)
  Out: MME : Line 1/2 (4- M-Audio Fast Track
SC_PortAudioDriver: PortAudio failed at Pa_OpenStream with error: 'Illegal combination of I/O devices'

So I think the solution could be to change nothing in the code, and only use the strings as printed in SCIDE (or the Lua2SC IDE).

Chinese names would be the final test, but I don't have such names in my devices.
As reported by @jamshark70, the printing with this change is different from without it.
Although I would like to see what the whole device listing looks like in standard SCIDE for these devices
(not only the requested devices)

@dyfer
Member Author

dyfer commented Jul 9, 2019

From which PR?

From #4009. The AppVeyor links were provided in the comment above.

Anyway, I think work on this should wait at least for #4475 to be merged.

I wanted to reiterate that this is (possibly) fixing input args only, not printing. As stated in the opening comment, printing characters to the console, and then to scide, still needs to be addressed (although it now seems to work in scide, since both scsynth and scide seem to use UTF-8 for output).

@jamshark70
Contributor

jamshark70 commented Jul 11, 2019

Unexpected trip to the office = unexpected testing opportunity.

Using the old 4009 build, it looks good! If the input command line is unicode-encoded, the devices are found.

s.options.outDevice = "Windows WDM-KS : Speakers";
s.options.inDevice = "Windows WDM-KS : 立体声混音";

s.options.asOptionsString
->  -u 57110 -a 1024 -i 2 -o 2 -H "Windows WDM-KS : 立体声混音" "Windows WDM-KS : Speakers" -R 0 -l 1

s.boot;

->

Requested devices:
  In (matching device found):  - Windows WDM-KS : 立体声混音
  Out (matching device found):
  - Windows WDM-KS : Speakers

Requested devices 立体声混音 (Realtek HD Audio Stereo input) and Speakers (Realtek HD Audio output) use different sample rates. Please set matching sample rates in the Windows Sound Control Panel and try again.

(Sample rate error isn't relevant.)

@sonoro1234
Contributor

@jamshark70
So, what you say is:
With this PR "Windows WDM-KS : 立体声混音" is recognized and is not recognized without it?

@jamshark70
Contributor

jamshark70 commented Jul 11, 2019

With this PR "Windows WDM-KS : 立体声混音" is recognized and is not recognized without it?

That's correct. Earlier in this PR conversation, I had reported that a different test build (without the Unicode fixes) garbles non-ASCII characters, and this result shows that the fixes not only allow non-ASCII characters to print correctly in scsynth --> IDE output, but they also allow device names with non-ASCII characters to match. So, as far as Unicode is concerned, it's a complete and correct fix.

I still suspect that Windows cmd.exe is using some encoding other than Unicode. This would explain your finding that device names don't print correctly in the console (when they do in scide) and that they aren't received correctly from the cmd.exe commandline (but they are received correctly from an sclang unixCmd). So your findings don't cast any doubt on the fixes for Unicode but they do raise the prospect that Unicode might not be sufficient for every case. (If that's the case, then... bloody Microsoft, it's 2019, get with the rest of the digital world and start using Unicode already, jeez.)

It would be good to get it to work with cmd.exe, but we will need to know a/ what the encoding is and b/ how to detect, within scsynth, which encoding is being used for the argv values.

@mossheim
Contributor

https://docs.microsoft.com/en-us/windows/win32/intl/code-pages

Also, Unicode is not an encoding. Let's please get that straight. UTF8 and UTF16 are encodings.

@sonoro1234
Contributor

sonoro1234 commented Jul 11, 2019

Only left to do: avoid the memory leak and replicate it in supernova

@mossheim
Contributor

I would like to review the scsynth portion of this before any more code is written.

@mossheim mossheim self-requested a review July 11, 2019 13:25
@dyfer
Member Author

dyfer commented Jul 11, 2019

Thanks @brianlheim. I also tried to update the initial comment to highlight UTF-8 vs UTF-16 encoding.

Contributor

@mossheim mossheim left a comment


This code can be made much simpler:

Convert the entirety of wchar_t** argv to char** in wmain and just malloc the memory you need. std::wstring constructor followed by std::string constructor followed by strdup is 3 allocations and 2 deallocations when you could have one allocation per argument.

you also don't need to special-case empty string.

after that, you have no need for any of the changes in the actual option parsing code; you can essentially do this:

#ifdef _WIN32
int wmain(int argc, wchar_t** argv) {
    // other setup
    char** utf8_argv = convert_args(argc, argv);
    return utf8_main(argc, utf8_argv);
}
#else
int main(int argc, char** argv) { return utf8_main(argc, argv); }
#endif

int utf8_main(int argc, char** argv) { /* ... */ }

you don't need to free this memory. this is a trivial amount of memory (likely less than 256 bytes) and many of these arguments need to stay around for the life of the application anyway (for instance anything assigned to mUGensPluginPath). you would be calling free right before the closing brace of main; just let the OS reclaim it. :)

@dyfer
Member Author

dyfer commented Jul 15, 2019

Thanks @brianlheim, this is very helpful! I had a feeling I wouldn't have to free the memory, especially since I don't know which arguments need to stay around.

I'll rework this later, but my question is for this PR - should printing UTF-16 in the IDE also be part of this PR, or a separate one?

@mossheim
Contributor

should printing UTF-16 in the IDE also be part of this PR, or a separate one?

I don't understand this question; did you mean "printing UTF-16 in the terminal"? Btw, the IDE (and everything in this project really) should treat all strings as UTF-8 except when interfacing with APIs that require something else (see http://utf8everywhere.org/#windows). scsynth (and I guess supernova when it compiles on Windows) AFAIK do not do this, but the code for sclang and the IDE gets it right. actually, Qt uses UTF-16 encoding IIRC (for QString), but their API for converting is very clean.

I was going to add that you might want to do this (from cmd-line sclang) so that the terminal is utf-8 once we have control of it - https://github.com/supercollider/supercollider/blob/develop/lang/LangSource/SC_TerminalClient.cpp#L702-L730. I wrote that code quite a while ago and I'm honestly not sure if it's correct, but I think that's the first thing to try unless you can find a better solution.

Btw, after this PR we would be using <codecvt> in common/SC_Codecvt.hpp and WideCharToMultiByte in this PR; we should choose one and stick with it. Do you have any idea which is more reliable?

@dyfer
Member Author

dyfer commented Jul 16, 2019

should printing UTF-16 in the IDE also be part of this PR, or a separate one?

I don't understand this question; did you mean "printing UTF-16 in the terminal"?

Yes. BUT when we do print UTF-16 strings in terminal in scsynth (I tried while debugging), AFAICT these are not printed properly in the IDE. ... because... (?):

I was going to add that you might want to do this (from cmd-line sclang) so that the terminal is utf-8 once we have control of it - https://github.com/supercollider/supercollider/blob/develop/lang/LangSource/SC_TerminalClient.cpp#L702-L730. I wrote that code quite a while ago and I'm honestly not sure if it's correct, but I think that's the first thing to try unless you can find a better solution.

Ah I see. So that's why printing from scsynth now works properly in the ide.

For the servers to play nice with other environments, shouldn't they use UTF-16 in the terminal (on Windows)? What happens if we do that but the terminal is set to UTF-8? Honest question, I'm learning as I go.

If we should print UTF-16 in the console, my understanding is that there are 2 more things to properly support this:

  • printing UTF-16 in the console (scsynth)
  • displaying these in scide

Is this correct?

Btw, after this PR we would be using <codecvt> in common/SC_Codecvt.hpp and WideCharToMultiByte in this PR; we should choose one and stick with it. Do you have any idea which is more reliable?

I have no previous expertise on this. I chose to use WideCharToMultiByte because PortAudio was using that. I'm happy to defer this choice to more experienced contributors :)

@mossheim
Contributor

@dyfer I am confused. In the PR comment above you wrote:

However, they are still not printed correctly outside of SC IDE, as they are still printed in UTF-8, which makes them display incorrectly in the console. However, in SCIDE they are displayed correctly, as the IDE's post window seems to interpret characters in UTF-8.

But then you write:

With this in mind, it seems that we still need to fix:

  • reading these characters in scide (converting UTF-16 to UTF-8)

and

BUT when we do print UTF-16 strings in terminal in scsynth (I tried while debugging), AFAICT these are not printed properly in the IDE. ... because... (?)

Here is what I think should happen on Windows (for macOS and Linux, everything is char & UTF-8):

  1. scsynth expects wchar_t & UTF-16 input, but immediately converts to char & UTF-8
  2. scsynth's output is all UTF-8. It changes the codepage to UTF-8 at program start (and changes it back at exit). IIUC, this should result in any const char*/std::strings holding UTF-8-encoded strings to be printed correctly. Not 100% sure but that would be a lot better than having to convert every const char* to const wchar_t* + printing with wprintf.

I don't know how (2) interacts with the IDE (or IPC in general). I would assume however that there is some functional way of combining everything. :)

I have no previous expertise on this. I chose to use WideCharToMultiByte because PortAudio was using that. I'm happy to defer this choice to more experienced contributors :)

After a bit of googling it seems like since we're only doing this on Windows, WideCharToMultiByte would be the better choice (for performance and for API ease). Plus, <codecvt> apparently is deprecated in C++17. Since the code you'll be writing directly mallocs char*, it is probably better to keep it local to this file. I can open a separate PR for doing that in the SC_Codecvt.hpp header.

@dyfer
Member Author

dyfer commented Jul 16, 2019

@dyfer I am confused. In the PR comment above you wrote:

Sorry! Please keep in mind my understanding of this is hazy - but I'm also probably failing to communicate what I think I understand.

However, they are still not printed correctly outside of SC IDE, as they are still printed in UTF-8, which makes them display incorrectly in the console. However, in SCIDE they are displayed correctly, as the IDE's post window seems to interpret characters in UTF-8.

But then you write:

With this in mind, it seems that we still need to fix:

  • reading these characters in scide (converting UTF-16 to UTF-8)

So this would need to happen after/if we changed scsynth printing to UTF-16. But I might be completely wrong about this approach.

Here is what I think should happen on Windows (for macOS and Linux, everything is char & UTF-8):

  1. scsynth expects wchar_t & UTF-16 input, but immediately converts to char & UTF-8
  2. scsynth's output is all UTF-8. It changes the codepage to UTF-8 at program start (and changes it back at exit). IIUC, this should result in any const char*/std::strings holding UTF-8-encoded strings to be printed correctly. Not 100% sure but that would be a lot better than having to convert every const char* to const wchar_t* + printing with wprintf.

I agree with 1.

I should stop commenting on UTF-8/16 on Windows - I just tried pasting what I thought were both UTF-8 and UTF-16 characters into cmd.exe as well as PowerShell. I was not able to even display them properly (neither UTF-8 nor UTF-16), let alone use them as an argument. So I just don't understand where which encoding should be used, and how, and whether things need to be set up for it. Sorry for creating confusion with my previous comments.

After a bit of googling it seems like since we're only doing this on Windows, WideCharToMultiByte would be the better choice (for performance and for API ease). Plus, <codecvt> apparently is deprecated in C++17. Since the code you'll be writing directly mallocs char*, it is probably better to keep it local to this file. I can open a separate PR for doing that in the SC_Codecvt.hpp header.

I think you meant that WideCharToMultiByte is not deprecated (but some conversions in codecvt are) right?

So, I'll clean up this PR to allow reading UTF-16 in scsynth args. I won't make any more attempts at encoding the output strings, 'cause I'm clearly not understanding the problem fully (and it's not a priority for me). We'll still get usable setting of audio devices.

@mossheim
Contributor

sorry, my comment did not display properly. i meant that the codecvt header is deprecated, but since i used angle brackets it was hidden by gh's markdown renderer.

i think the solution is to simply set and unset the codepage once scsynth starts and finishes respectively, like is done in SC_TerminalClient. i would really appreciate if you could add that to this PR. otherwise, no worries, just clean up the changes as already described, i can explain character encoding to you another time (it is truly such nonsense) :)

@mossheim
Contributor

@dyfer I notice this PR now has conflicts. what's the status? is it ready for review after rebasing?

@dyfer
Member Author

dyfer commented Dec 24, 2019

@brianlheim nope, this is not ready for review. I'm planning to get to this PR at some point. (sorry the WIP tag was not set)

@mossheim
Contributor

nope, this is not ready for review. I'm planning to get to this PR at some point. (sorry the WIP tag was not set)

no worries, just checking in!

@dyfer dyfer force-pushed the topic/scsynth-unicode-windows branch from fff4301 to 11c35cd Compare March 15, 2020 07:13
@dyfer
Member Author

dyfer commented Mar 15, 2020

I think this is ready for a review now!

@dyfer dyfer changed the title [WIP] [scsynth:windows] support unicode in argv [scsynth:windows] support unicode in argv Mar 15, 2020
Contributor

@mossheim mossheim left a comment


the cleanest set of transformations IMO would be:

  • create a new function above main and wmain, int scsynth_main(int argc, char** argv), containing most of what is currently in main
  • main calls it directly, wmain translates its arguments from wchar_t to char, does other setup, calls scsynth_main, then does WSACleanup() and SetConsoleOutputCP
  • you can use a local variable for the old code page with this design
  • double check that any actions we need to clean up manually, such as SetConsoleOutputCP or WSAStartup, are cleaned up on ALL paths out of the main function

i realized we actually should be freeing all manually allocated memory to avoid any accidental complaints from memory analyzers. it's up to you how to handle that, but personally i think a std::vector<char*> where each element has been new'd would be appropriate. creating a std::vector<std::string> which you then use to populate to a std::vector<char*> is also an option but a little unwieldy. IMO when you're essentially using C-like interfaces it's appropriate to use C-style memory management when C++ style would create clutter. if you do manually allocate memory, always do it with new and delete or their array versions rather than malloc and free.
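On the "cleaned up on ALL paths" point: one way to guarantee that is an RAII guard, sketched below. This is a hypothetical alternative, not what the PR does (the PR restores the codepage manually after scsynth_main returns); the class name is invented, and the Windows calls are compiled out elsewhere so the shape compiles on any platform:

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h> // GetConsoleOutputCP, SetConsoleOutputCP, CP_UTF8
#endif

// Switches the console to UTF-8 on construction and restores the saved
// codepage on destruction, so no early return from main can skip cleanup.
class Utf8ConsoleScope {
public:
    Utf8ConsoleScope() {
#ifdef _WIN32
        mOldCodePage = GetConsoleOutputCP(); // local state, no g-prefixed global
        SetConsoleOutputCP(CP_UTF8);
#endif
    }
    ~Utf8ConsoleScope() {
#ifdef _WIN32
        SetConsoleOutputCP(mOldCodePage);
#endif
    }
    // non-copyable: exactly one owner performs the restore
    Utf8ConsoleScope(const Utf8ConsoleScope&) = delete;
    Utf8ConsoleScope& operator=(const Utf8ConsoleScope&) = delete;

private:
#ifdef _WIN32
    UINT mOldCodePage = 0;
#endif
};
```

Declared as the first local in wmain (before any UTF-8 output), its destructor runs on every exit path; the same pattern could cover WSACleanup().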

#ifdef _WIN32

int wmain(int argc, wchar_t* wargv[]);
Contributor


remove these declarations for main and wmain, we don't need them.

@dyfer
Copy link
Member Author

dyfer commented Mar 20, 2020

Thanks for the feedback!

I think I addressed all the points: I've created a new function int scsynth_main(int argc, char** argv), called from main/wmain, and I used a vector for the converted args.

  • Do I still need that argv.clear() right before the exit? (argv is a vector there)
  • Is the memory allocation done properly now?
  • double check that any actions we need to clean up manually, such as SetConsoleOutputCP or WSAStartup, are cleaned up on ALL paths out of the main function

I think they are, as we call them after the main returns, regardless of the exit value. Is this correct?

EDIT:
I realized that we now set stdout line buffering after possibly displaying warning messages (from WSAStartup and from changing the codepage); is that an issue?


// set codepage to UTF-8
static UINT gOldCodePage; // for remembering the old codepage when we switch to UTF-8
gOldCodePage = GetConsoleOutputCP();
Contributor


can combine to auto oldCodePage = GetConsoleOutputCP();. we only use the g prefix for globals

// reset codepage from UTF-8
SetConsoleOutputCP(gOldCodePage);
// clear vector with converted args
argv.clear();
Contributor


here you just need to loop over argv and delete [] each element. no need to clear, that will happen when argv is destructed.

@dyfer
Member Author

dyfer commented Apr 5, 2020

PR updated :)


// set codepage to UTF-8
static UINT oldCodePage; // for remembering the old codepage when we switch to UTF-8
oldCodePage = GetConsoleOutputCP();
Contributor


this is better as auto oldCodePage = GetConsoleOutputCP() -- no static since it isn't needed here, and combining the declaration and assignment into a single initialization.

Member Author


ah sorry I missed it the first time through

SetConsoleOutputCP(oldCodePage);
// clear vector with converted args
for (int i = 0; i < argv.size(); i++) {
    delete[] argv[i];
}
Contributor


you can write this as a range-for, which is preferable in modern c++ when you don't need an index or iterator:

for (auto* arg : argv)
    delete[] arg;

Member Author


fixed, hopefully!

Contributor

@mossheim mossheim left a comment


LGTM, thanks!
