-
Notifications
You must be signed in to change notification settings - Fork 757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[scsynth:windows] support unicode in argv #4479
[scsynth:windows] support unicode in argv #4479
Conversation
a5815fa
to
cf79408
Compare
I can confirm that:
... produces the following in scsynth's startup output:
(... which fails, but a/ this version doesn't support different I/O devices and b/ the point is that the Chinese characters came through fine.) By contrast, the AppVeyor build for #4475 (independent I/O devices) prints (same input code):
So the fix appears to be functionally correct. |
Thank you for testing @jamshark70! |
Cant use command line to set
does
|
Uhm.... I see it's hard to test this, as this PR does not include the independent input/output fix from #4475. Temporarily you can test on last AppVeyor build from #4009. @sonoro1234 if you want to give it another go (with the AppVeyor build), could you try in command line with proper characters (not the way scsynth currently prints to cmdline), so something like Also, since this is doing substring matching, does this work? |
From which PR? Meanwhile I have done the device is recognized because it tries to boot with
So I think that the solution could be doing nothing to the code, but only use the strings as printed in SCIDE (or Lua2SC IDE) Chinesse names should be the final test but I dont have this names in my devices. |
From #4009. The AppVeyor links were provided in the comment above. Anyway, I think work on this should wait at least for #4475 to be merged. I wanted to reiterate, that this is (possibly) fixing input args only, not printing. As stated in the opening comment, printing characters to the console, and then to scide, still needs to be addressed (although it now seems to work in scide since both scsynth and scide seem to use UTF-8 for output). |
Unexpected trip to the office = unexpected testing opportunity. Using the old 4009 build, it looks good! If the input command line is unicode-encoded, the devices are found.
(Sample rate error isn't relevant.) |
@jamshark70 |
That's correct. Earlier in this PR conversation, I had reported that a different test build (without the Unicode fixes) garbles non-ASCII characters, and this result shows that the fixes not only allow non-ASCII characters to print correctly in scsynth --> IDE output, but they also allow device names with non-ASCII characters to match. So, as far as Unicode is concerned, it's a complete and correct fix. I still suspect that Windows cmd.exe is using some encoding other than Unicode. This would explain your finding that device names don't print correctly in the console (when they do in scide) and that they aren't received correctly from the cmd.exe commandline (but they are received correctly from an sclang unixCmd). So your findings don't cast any doubt on the fixes for Unicode but they do raise the prospect that Unicode might not be sufficient for every case. (If that's the case, then... bloody Microsoft, it's 2019, get with the rest of the digital world and start using Unicode already, jeez.) It would be good to get it to work with cmd.exe but we will need to know a/ what is the encoding and b/ how to detect, within scsynth, which encoding is being used for the argv values. |
https://docs.microsoft.com/en-us/windows/win32/intl/code-pages Also, Unicode is not an encoding. Let's please get that straight. UTF8 and UTF16 are encodings. |
Only left to do: avoid memory leaking and replicate on supernova |
I would like to review the scsynth portion of this before any more code is written. |
Thanks @brianlheim. I also tried to update the initial comment to highlight UTF-8 vs UTF-16 encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code can be made much simpler:
Convert the entirety of wchar_t** argv
to char**
in wmain
and just malloc
the memory you need. std::wstring
constructor followed by std::string
constructor followed by strdup
is 3 allocations and 2 deallocations when you could have one allocation per argument.
you also don't need to special-case empty string.
after that, you have no need for any of the changes in the actual option parsing code; you can essentially do this:
#ifdef _WIN32
int wmain(int argc, wchar_t** argv) {
// other setup
char** utf8_argv = convert_args(argc, argv);
return utf8_main(argc, utf8_argv);
}
#else
int main(int argc, char** argv) { return utf8_main(argc, utf8_argv); }
#endif
int utf8_main() { /* ... */ }
you don't need to free this memory. this is a trivial amount of memory (likely less than 256 bytes) and many of these arguments need to stay around for the life of the application anyway (for instance anything assigned to mUGensPluginPath
). you would be calling free
right before the closing brace of main
; just let the OS reclaim it. :)
Thanks @brianlheim this is very helpful! I had a feeling I wouldn't have to free the memory especially that I don't know which arguments need to stay around. I'll rework this later, but my question is for this PR - should printing UTF-16 in the IDE be also part of this PR, or a separate one? |
I don't understand this question; did you mean "printing UTF-16 in the terminal"? Btw, the IDE (and everything in this project really) should treat all strings as UTF-8 except when interfacing with APIs that require something else (see http://utf8everywhere.org/#windows). scsynth (and I guess supernova when it compiles on Windows) AFAIK do not do this, but the code for sclang and the IDE gets it right. actually, Qt uses UTF-16 encoding IIRC (for QString), but their API for converting is very clean. I was going to add that you might want to do this (from cmd-line sclang) so that the terminal is utf-8 once we have control of it - https://github.com/supercollider/supercollider/blob/develop/lang/LangSource/SC_TerminalClient.cpp#L702-L730. I wrote that code quite awhile ago and I'm honestly not sure if it's correct, but I think that's the first thing to try unless you can find a better solution. Btw, after this PR we would be using |
Yes. BUT when we do print UTF-16 strings in terminal in scsynth (I tried while debugging), AFAICT these are not printed properly in the IDE. ... because... (?):
Ah I see. So that's why printing from scsynth now works properly in the ide. For the servers to play nice with other environments, shouldn't they use UTF-16 in terminal (on Windows)? What happens if we do that but the terminal is set to UTF-8? Honest question, I'm learning as I go. If we should print UTF-16 in the console, my understanding is that there are 2 more things to properly support this:
Is this correct?
I have no previous expertise on this. I chose to use |
@dyfer I am confused. In the PR comment above you wrote:
But then you write:
and
Here is what I think should happen on Windows (for macOS and Linux, everything is char & UTF-8`:
I don't know how (2) interacts with the IDE (or IPC in general). I would assume however that there is some functional way of combining everything. :)
After a bit of googling it seems like since we're only doing this on windows, WideCharToMultiByte would be the better choice (for performance and for API ease). Plus, apparently is deprecated in C++17. Since the code you'll be writing directly mallocs |
Sorry! Please keep in mind my understanding of this is hazy - but I'm also probably failing to communicate what I think I understand.
So this would need to happen after/if we changed scsynth printing to UTF-16. But I might be completely wrong about this approach.
I agree with 1. I should stop commenting on UTF-8/16 in windows - I just tried just pasting what I thought was both UTF 8 and UTF 16 characters in
I think you meant that So, I'll clean up this PR to allow reading UTF-16 in scsynth args. I won't make any more attempts at encoding of the output strings, `cause I'm clearly not understanding the problem fully (and it's not a priority for me). We'll still get usable setting of audio devices. |
sorry, my comment did not display properly. i meant that the codecvt header is deprecated, but since i used angle brackets it was hidden by gh's markdown renderer. i think the solution is to simply set and unset the codepage once scsynth starts and finishes respectively, like is done in SC_TerminalClient. i would really appreciate if you could add that to this PR. otherwise, no worries, just clean up the changes as already described, i can explain character encoding to you another time (it is truly such nonsense) :) |
@dyfer I notice this PR now has conflicts. what's the status? is it ready for review after rebasing? |
@brianlheim nope, this is not ready for review. I'm planning to get to this PR at some point. (sorry the WIP tag was not set) |
no worries, just checking in! |
fff4301
to
11c35cd
Compare
11c35cd
to
5eb66e3
Compare
I think this is ready for a review now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cleanest set of transformations IMO would be:
- create a new function above
main
andwmain
,int scsynth_main(int argc, char** argv)
, containing most of what ismain
currently main
calls it directly,wmain
translates its arguments from wchar_t to char, does other setup, callsscsynth_main
, then doesWSACleanup()
andSetConsoleOutputCP
- you can use a local variable for the old code page with this design
- double check that any actions we need to clean up manually, such as SetConsoleOutputCP or WSAStartup, are cleaned up on ALL paths out of the main function
i realized we actually should be freeing all manually allocated memory to avoid any accidental complaints from memory analyzers. it's up to you how to handle that, but personally i think a std::vector<char*>
where each element has been new
'd would be appropriate. creating a std::vector<std::string>
which you then use to populate to a std::vector<char*>
is also an option but a little unwieldy. IMO when you're essentially using C-like interfaces it's appropriate to use C-style memory management when C++ style would create clutter. if you do manually allocate memory, always do it with new
and delete
or their array versions rather than malloc
and free
.
server/scsynth/scsynth_main.cpp
Outdated
#ifdef _WIN32 | ||
|
||
int wmain(int argc, wchar_t* wargv[]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove these declarations for main and wmain, we don't need them.
Thanks for the feedback! I think I addressed all the points, I've created a new function
I think they are, as we call them after the main returns, regardless of the exit value. Is this correct? EDIT: |
server/scsynth/scsynth_main.cpp
Outdated
|
||
// set codepage to UTF-8 | ||
static UINT gOldCodePage; // for remembering the old codepage when we switch to UTF-8 | ||
gOldCodePage = GetConsoleOutputCP(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can combine to auto oldCodePage = GetConsoleOutputCP();
. we only use the g
prefix for globals
server/scsynth/scsynth_main.cpp
Outdated
// reset codepage from UTF-8 | ||
SetConsoleOutputCP(gOldCodePage); | ||
// clear vector with converted args | ||
argv.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here you just need to loop over argv
and delete []
each element. no need to clear, that will happen when argv is destructed.
PR updated :) |
server/scsynth/scsynth_main.cpp
Outdated
|
||
// set codepage to UTF-8 | ||
static UINT oldCodePage; // for remembering the old codepage when we switch to UTF-8 | ||
oldCodePage = GetConsoleOutputCP(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is better as auto oldCodePage = GetConsoleOutputCP()
-- no static since it isnt needed here, and combining the declaration and assignment into a single initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry I missed it the first time through
server/scsynth/scsynth_main.cpp
Outdated
SetConsoleOutputCP(oldCodePage); | ||
// clear vector with converted args | ||
for (int i = 0; i < argv.size(); i++) { | ||
delete[] argv[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can write this as a range-for, which is preferable in modern c++ when you don't need an index or iterator:
for (auto* arg : argv)
delete[] arg;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, hopefully!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Purpose and Motivation
In the process of working on #4009, it turned out that device names with unicode characters are not handled properly on Windows. AFAICT they need to be encoded in UTF-16 (AKA "wide characters") on the stdio level on Windows.
EDIT: This PR is now updated according to #4479 (review). Scsynth on Windows reads command line arguments in UTF-16 and converts them to UTF-8; it also sets the terminal to UTF-8, similarly to what sclang does. This should allow selecting devices with non-ASCII characters in their name.
Please note, while the requested (non-ASCII) device name is properly displayed in the the terminal when scsynth boots, I'm not able to test if the device selection works properly for devices with non-ASCII names, as no devices on my system have non-ASCII characters in their name.
I'm hiding outdated information below
It seems that this PR solves the issue of passing unicode characters to scsynth on command line, having the scsynth properly handle UTF-16 encoding on the input. However, they are still not printed correctly outside of SC IDE, as they are still printed in UTF-8, which makes them display incorrectly in the console. However, in SCIDE they are displayed correctly, as the IDE's post window seems to interpret characters in UTF-8.
With this in mind, it seems that we still need to fix:
Please note, the current implementation has a memory leak, as pointed out by @Spacechild1: #4009 (comment)
My intention for this PR is to have it for reference for the work on fixing the unicode pipeline on Windows. As this seems to be a larger project, I don't think I'll be able to continue work on this.
Types of changes
To-do list