You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
So whether or not to remove quotes and where seems to have been plaguing this repo with endless issues and PRs.
Sorry to wake up the dead here but see #145, #82, #93, #140, #138, #153, #180, #201, #222 and probably more I didn't find.
Overall the goal of this issue, discussion, and potential future PR is to making typing a command into your favorite shell or into your online bot interface more consistent, and also to make quote handling easier to test and more predictable.
I have two examples that well demonstrate what I think the main issues currently are:
I loaded this test.js into the node REPL before both of these examples to set up the command line and tokenizers.
//To grab arguments as passed by a shell if we recieve argsif(process.argv.length>2){console.log(JSON.stringify(process.argv.slice(2)))process.exit()}constfs=require('fs');const{execSync}=require('child_process')constutil=require('util')constparse=require('yargs-parser')//TokenizersfunctionextTokenizer(exe,ext,prefix=""){returncommand=>{tkfile=`tokenize.${ext}`fs.writeFileSync(tkfile,`${prefix}node test.js ${command}`)result=execSync(`${exe}${tkfile}`);returnJSON.parse(result)}}constbasicTokenizer=str=>str.split(" ")//Simple whitespace splitconstyargsTokenizer=require('./tokenize-arg-string')constcmdTokenizer=extTokenizer('cmd /C ','bat',"@echo off\n")constpowershellTokenizer=extTokenizer('powershell -ExecutionPolicy Bypass -File ','ps1')constbashTokenizer=extTokenizer('bash ','sh',"/home/mike/.nvm/versions/node/v14.8.0/bin/")//Tested with WSLconstargfile="command.txt"constcommandline=fs.readFileSync(argfile).toString()
Here I created a gist containing a quick and dirty tool for trying this out for yourself
1. Inconsistent removal of quotes depending on context of token
Post tokenizer, the argument parser seems to pick and choose which arguments aught to be stripped of quotes.
>commandline`"positional" -a 'flag'`>tokens=yargsTokenizer(commandline)['"positional"','-a',"'flag'"]>parse(tokens){_: ['"positional"'],a: 'flag'}
Notice How the flag argument gets its quotes removed, but the positional argument does not. This inconsistency is confusing.
Now you may ask "then how do I enter a command like domything "to this other thing" from a string statement. Well that brings me to problem 2.
2. Tokenizer inconsistency with shell interfaces.
So if tokenizeArgString is supposed to be the middleware that converts your text string to the argument (token) array in the case where the shell you are in hasn't done it for you, arguably the output of tokenizeArgString should emulate the same expected results as you would get by reading process.argv. Since process.argv is controlled by the various shells, lets see how tokenArgString stacks up against the most popular ones (that I could easily test).
yargs nicely handles both single quotes and double quotes
cmd does not support single quotes, however it supports " escapes and strips the double quotes
powershell is a bit weirder, it supports both single and double quotes and strips both, but its behavior is strange when using \ escapes... (it also supports using backticks as escapes, or double quote marks to escape)
bash where everything is nice and makes sense supports single and double quotes and \ escapes
Which one of these methods is correct? None of them, they are all different flavors on the same end goal. However people are used to them and generally expect similar behavior (whether or not they get it) and more importantly, they are testable.
My Proposal
Do not strip quotes from arguments passed as an array, ever. Assume they have already been tokenized by a shell or tokenizer. I would argue that parse with an argument of any[] should leave every single argument unmolested as it should assume that it is receiving the arguments directly from process.argv. Otherwise it is difficult to figure out when and where quotations need to be added or removed. Therefore it should assume that the inputted quotes are intended to remain. This is probably a small breaking change, you could probably fix this with a deprecated option as well, but in my opinion maintaining both modes seems like a bit of a pain.
In the case where the user has specified in options that they want quotes removed the question becomes how many times to you strip the quotes, this needs further discussion as this seems to be a post processing measure.
Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers that are flavored like cmd, powershell, and bash. This way not only can the programmer select which style they want, it also means that there is no longer a "but i want my quotes removed here but not here" issue because you can respond, "you selected the powershell style and thats how powershell does it", removing any opinion on how the quotes are removed and ending this plague of issues.
You also get the benefit of being able to unit test directly against the shells themselves. Im not totally confident that the exact method to test them I wrote above is very CI friendly, however a similar concept could be used. Just take a bunch of predefined argument strings and see if the relate shell has the same output as what the flavored tokenizer outputs.
You could layout the top level API like the following:
//If you know you are tokenizing, don't accept tokensexporttypeTokenizer=(arg: string)=>string[];exporttypeCommandLineParser=(arg: string,opts?: Partial<Options>)=>Arguments;//Same stuff as before for backwards compatibilityexportinterfaceParser{(args: ArgsInput,opts?: Partial<Options>): Arguments;//Will now by default use tokenize.default to tokenizedetailed(args: ArgsInput,opts?: Partial<Options>): DetailedArguments;camelCase(str: string): string;decamelize(str: string,joinString?: string): string;//new stuff//included tokenizerstokenize{(arg: string)=>string[];//Calling require('yargs-parser').tokenize("") should just use the default tokenizer
default: Tokenizer;//freeze the current tokenizer for backward compat
cmd: Tokenizer;
powershell: Tokenizer;
bash: Tokenizer;};//Include tokenizer flavors as full parsers
cmd: CommandLineParser;
powershell: CommandLineParser;
bash: CommandLineParser;//Allow programmers to feed it their own tokenizer? So less complaining about format
parser: (tokenizer: Tokenizer)=>CommandLineParser;}
I understand this goes somewhat against previous efforts (See #153), but ultimatley the tokenizer is meant to take the place of a shell that does it for you, so it doesn't make sense not to try to emulate the function of the shell in this regard.
This can also maybe open up an opportunity to seperate the tokenizers into their own npm package. It would be neat to have a package that is just designed to emulate how shells tokenize to argv. yargs-tokenizer anybody?
Let me know what yalls thoughts are on this, I'd be willing to do some of the ground work for this. I also believe that most (not all) of this can be introduced without any breaking changes...
The text was updated successfully, but these errors were encountered:
Thanks for this great research you've done up front 👍
Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers that are flavored like cmd, powershell, and bash.
Over the years with yargs we've tried really hard to find compromises that meet the widest variety of user expectations and, ideally, are identical in a wide variety of environments.
Sometimes we introduce configuration options, when folks have irreconcilable goals ... and it seems like there are enough people on both sides of the fence that it's worth doing. Ideally these configuration options are added rarely.
I would rather start by looking at your findings, and seeing if there are a few additional bugs we can fix for some shell environments, without breaking the existing behavior. Only if it's absolutely necessary, e.g., a situation where we need to do things one way on powershell vs., bash, and it's impossible to find a non-breaking middle-ground, would I suggest that we add additional configuration options.
I do not like the idea of decomposing the tokenizers into their own packages. Ideally we'd have one tokenizer that supports as wide a variety of process.argv formats as possible, and we'd maybe introduce some additional config settings for the tokenizers.
Note of clarification: along with moving towards supporting Deno/TypeScript/ESM, I've been putting work into consolidating yargs' footprint, with regards to dependencies. Several popular alternatives like commander have zero dependencies ... I'm a proponent of decomposing libraries, but have definitely gotten some flack for yargs' footprint, tldr; I've been trying to be really selective about what additional dependencies we add or break out.
So whether or not to remove quotes and where seems to have been plaguing this repo with endless issues and PRs.
Sorry to wake up the dead here but see #145, #82, #93, #140, #138, #153, #180, #201, #222 and probably more I didn't find.
Overall the goal of this issue, discussion, and potential future PR is to making typing a command into your favorite shell or into your online bot interface more consistent, and also to make quote handling easier to test and more predictable.
I have two examples that well demonstrate what I think the main issues currently are:
I loaded this test.js into the node REPL before both of these examples to set up the command line and tokenizers.
Here I created a gist containing a quick and dirty tool for trying this out for yourself
1. Inconsistent removal of quotes depending on context of token
Post tokenizer, the argument parser seems to pick and choose which arguments aught to be stripped of quotes.
Notice How the flag argument gets its quotes removed, but the positional argument does not. This inconsistency is confusing.
Now you may ask "then how do I enter a command like
domything "to this other thing"
from a string statement. Well that brings me to problem 2.2. Tokenizer inconsistency with shell interfaces.
So if tokenizeArgString is supposed to be the middleware that converts your text string to the argument (token) array in the case where the shell you are in hasn't done it for you, arguably the output of tokenizeArgString should emulate the same expected results as you would get by reading process.argv. Since process.argv is controlled by the various shells, lets see how tokenArgString stacks up against the most popular ones (that I could easily test).
There are a few observations to be made here:
Which one of these methods is correct? None of them, they are all different flavors on the same end goal. However people are used to them and generally expect similar behavior (whether or not they get it) and more importantly, they are testable.
My Proposal
Do not strip quotes from arguments passed as an array, ever. Assume they have already been tokenized by a shell or tokenizer. I would argue that parse with an argument of any[] should leave every single argument unmolested as it should assume that it is receiving the arguments directly from process.argv. Otherwise it is difficult to figure out when and where quotations need to be added or removed. Therefore it should assume that the inputted quotes are intended to remain. This is probably a small breaking change, you could probably fix this with a deprecated option as well, but in my opinion maintaining both modes seems like a bit of a pain.
In the case where the user has specified in options that they want quotes removed the question becomes how many times to you strip the quotes, this needs further discussion as this seems to be a post processing measure.
Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers that are flavored like cmd, powershell, and bash. This way not only can the programmer select which style they want, it also means that there is no longer a "but i want my quotes removed here but not here" issue because you can respond, "you selected the powershell style and thats how powershell does it", removing any opinion on how the quotes are removed and ending this plague of issues.
You also get the benefit of being able to unit test directly against the shells themselves. Im not totally confident that the exact method to test them I wrote above is very CI friendly, however a similar concept could be used. Just take a bunch of predefined argument strings and see if the relate shell has the same output as what the flavored tokenizer outputs.
You could layout the top level API like the following:
I understand this goes somewhat against previous efforts (See #153), but ultimatley the tokenizer is meant to take the place of a shell that does it for you, so it doesn't make sense not to try to emulate the function of the shell in this regard.
This can also maybe open up an opportunity to seperate the tokenizers into their own npm package. It would be neat to have a package that is just designed to emulate how shells tokenize to argv.
yargs-tokenizer
anybody?Let me know what yalls thoughts are on this, I'd be willing to do some of the ground work for this. I also believe that most (not all) of this can be introduced without any breaking changes...
The text was updated successfully, but these errors were encountered: