Unclear whether control characters are allowed in comments #567

ghost · 2018-10-06T12:00:48Z

It is not clear from the specification whether control characters (U+0000 .. U+001F and U+007F) are allowed in comments.

The section Comment only states that "A hash symbol marks the rest of the line as a comment."
This implies that any non-newline characters are allowed after the hash symbol.

Discussion about control characters only appears in the section String, which should have no implication on comments.

The ABNF definition forbids control characters (except tab) in comments, but the ABNF is not authoritative.

I think the section Comment should explicitly state which characters are allowed in comments.
Note that it would be illogical to allow, e.g., form feed inside comments, because form feed is traditionally a stronger separator than newline.

pradyunsg · 2018-10-09T18:31:11Z

> ABNF is not authoritative.

It is.

> I think the section Comment should explicitly state which characters are allowed in comments.

If there are more people who think this would add value, we might do this. I don't want to complicate an otherwise straightforward explanation.

ghost · 2018-10-09T19:46:54Z

> > ABNF is not authoritative.
>
> It is.

The general theme in issues #566, #567, #568, #569 is that I believe the specification to be ambiguous. I assumed the text document is the complete, stand-alone, authoritative specification, while the ABNF has only experimental status.

If indeed the ABNF is authoritative, perhaps these issues are resolved.
However, I think in that case the specification text should explicitly declare the ABNF grammar to be authoritative and defer to it for details. Otherwise, how will anybody know where to look for a precise definition of TOML?

rmunn · 2018-10-25T13:54:42Z

From the point of view of a parser, it's probably easier to allow control characters (except for newline) in comments, because then the rule is "Once you see a comment marker, ignore everything until the next newline". This is simpler (at least for some parser libraries) to implement than "Once you see the comment marker, ignore all valid characters but throw an error for 0x00-0x1f and 0x7f, and switch modes at the newline".

So I'd personally prefer for the rule to be "control characters are legal (and ignored) in comments; after a comment marker is seen, only the next newline matters."
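A minimal sketch of the two rules compared above, assuming Python and two hypothetical helper names (`skip_comment_lenient`, `skip_comment_strict`); neither is taken from any real TOML parser. Both return the index of the line feed that ends the comment, or the length of the text if the comment runs to end of input. The strict variant assumes tab is the one permitted control character, as the ABNF is described in this thread.

```python
def skip_comment_lenient(text: str, pos: int) -> int:
    """Ignore everything after '#' until the next line feed."""
    assert text[pos] == "#"
    end = text.find("\n", pos)
    return len(text) if end == -1 else end


def skip_comment_strict(text: str, pos: int) -> int:
    """Reject control characters other than tab while scanning the comment."""
    assert text[pos] == "#"
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            return i
        # A CR immediately followed by LF is part of the newline itself.
        if ch == "\r" and text[i + 1:i + 2] == "\n":
            return i + 1
        cp = ord(ch)
        if (cp < 0x20 and ch != "\t") or cp == 0x7F:
            raise ValueError(f"control character U+{cp:04X} in comment at offset {i}")
        i += 1
    return i
```

The lenient version is a single `find` call, while the strict version has to visit every character and carry its own error path, which is the difference in effort being described.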

ChristianSi · 2018-10-26T18:22:47Z

@rmunn On the other hand, if control chars are prohibited in comments too, they would be prohibited anywhere in a TOML document since they are also prohibited in strings of all kinds (and they are certainly not allowed outside of strings or comments). So prohibiting them everywhere might actually be quite simple for many parsers.
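As a rough illustration of that argument: if control characters are forbidden everywhere, a parser could run one pass over the whole document before tokenizing. The sketch below assumes Python, a hypothetical function name (`find_forbidden_control`), and that tab, LF, and CRLF are the only control sequences a document may contain.

```python
import re
from typing import Optional

# Any control character other than tab, LF, and CR, or a CR not followed by LF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]|\r(?!\n)")


def find_forbidden_control(text: str) -> Optional[int]:
    """Return the offset of the first forbidden control character, or None."""
    match = _FORBIDDEN.search(text)
    return match.start() if match else None
```

With a check like this in front, the comment rule itself no longer needs any per-character validation.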

pradyunsg · 2018-11-09T10:43:07Z

Currently, I am ambivalent on this issue tbh. I don't see any major gains either way so I'm inclined to say status quo would win here.

I can be convinced either way though.

ChristianSi · 2018-11-10T19:46:58Z

@pradyunsg I assume if you say "status quo" you mean the status quo as defined by the ABNF (which prohibits control chars in comments)?

The original poster's point was exactly that, in the human-readable spec, the status quo is not defined. I too would advise clarifying this in the written spec in addition to the ABNF, for people who can read English better than Backus–Naur.

pradyunsg · 2019-12-12T15:53:15Z

Thanks everyone for the patience here -- I took a long time to come back around to this.

Yes, I think forbidding control characters in comments makes sense. If there's a significant use case for control characters in comments, do holler.

Meanwhile, I think what we need here is a PR clarifying in text what's already clear in the ABNF -- that control characters are not allowed in comments.

- implemented hexfloats
- handled trailing underscores in ints and floats
- improved parse performance for single-digit integers
- handled control characters in comments (per toml-lang/toml#567)

abelbraaksma · 2019-12-22T14:42:05Z

I understand this is closed now, and I also understand the logic behind the decision. However, I think the following has not been considered and might make you reconsider the decision here:

If a comment with certain Unicode characters (except newline) is to be considered invalid, then this has the following drawbacks:

- A comment can involuntarily invalidate the whole TOML file. I think such accidents should be prevented.
- Parsers must also parse each character in a comment. This may complicate parsing. Not because it's almost the same as 'string', but because of different handling of error situations, informing users why a comment is invalid.
- By the same token, it slows down parsing, whatever way you look at it.
- I think the status quo requires strings to be valid Unicode strings. That's a little more parsing than just checking for control chars, and a parser really shouldn't be bothered with this added complexity when it encounters comments.
- Conceptually, people consider comments not part of the syntax, and feel free to dump whatever they like in them. Disallowing certain chars contradicts this very human 'feeling' about what comments are.

Bottom line of this argument, I'd like to propose a simplification: define a comment as any range of code points, until a newline is encountered.

This is simpler, and clearer for end users.

cleishm · 2022-10-09T10:01:10Z

As an implementation author, I agree with @abelbraaksma above. Firstly, it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users - it results in documents being rejected for content that is commented out, which is itself an indication by the user that the content should be ignored by the parser.

eksortso · 2022-10-10T06:09:12Z

I don't seriously object to allowing control characters in comments. Are there any security concerns about them here? I could object to a delete char, 0x7F, immediately following the hash sign. But that's it.

If you think it's merited, let's reopen this issue, and I can compose a PR that would allow most everything but newlines in comments.

ChristianSi · 2022-10-10T16:35:45Z

I understand these sentiments, but currently the spec says "Control characters other than tab ... are not permitted in comments." So if we change this again, all 1.0-compatible parsers would supposedly have to be changed again.

Also some implementations might conceivably check for control chars before doing any further parsing step? If so, like I wrote earlier: "prohibiting them everywhere might actually be quite simple for many parsers."

Maybe this is a good case to let implementations decide on their behavior? The spec could say something like:

TOML implementations may throw an error or issue a warning if encountering any control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F) in a comment, but they don't have to do so.

That way, both rejecting control chars and ignoring them together with the rest of the comment would be fine.

This language would have to be adjusted regarding Unicode validity too, but here I think the same logic applies: some parsers might get the rejection of invalid Unicode sequences for free, because they use an OS/library function that does it for them. For them, a requirement that "any range of code points" be accepted in a comment, whether valid Unicode or not, might actually make things harder. So "it's implementation-dependent" may be the best course of action too.
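A minimal sketch of what such implementation-defined behavior could look like, assuming Python and a hypothetical `policy` parameter with the values `"ignore"`, `"warn"`, and `"error"`; none of these names come from the spec or from an existing parser. The `body` argument is the comment text without its terminating newline.

```python
import warnings

# U+0000 to U+0008, U+000A to U+001F, U+007F (every control character except tab).
CONTROL_EXCEPT_TAB = set(range(0x00, 0x09)) | set(range(0x0A, 0x20)) | {0x7F}


def check_comment(body: str, policy: str = "ignore") -> None:
    """Apply the chosen policy to control characters found in a comment."""
    if policy == "ignore":
        return
    for offset, ch in enumerate(body):
        if ord(ch) in CONTROL_EXCEPT_TAB:
            message = f"control character U+{ord(ch):04X} at offset {offset} in comment"
            if policy == "error":
                raise ValueError(message)
            warnings.warn(message)
            return  # report only the first occurrence
```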

eksortso · 2022-10-11T03:29:53Z

@ChristianSi We don't need to say anything if implementations are going to choose their own behavior for handling comments. We just need to be simple and obvious. So we can simplify things to their essentials, and leave the rest to interpretation.

It's a given that any TOML document must be a valid Unicode document encoded in UTF-8. So we need not worry about invalid code points (like surrogates) or invalid byte strings, which would yield well-defined errors of their own.

One concern that wasn't addressed is whether or not the 0x0D in a Windows-type newline ought to be ignored. Since the line feed character is central to how comments work, we need only to mention that either type of newline can mark the end of a comment. I propose this text to replace the restrictions at the end of the Comments section in toml.md:

All characters except line feeds (0x0A) are permitted in comments and may be
ignored. At their discretion, parsers that read comments may exclude a final
carriage return (0x0D) appearing before a terminating line feed.

That's about all that needs to be said, I think. Simple parsers can just start from the #, scan ahead for 0x0A, and throw out all that stuff, excluding the 0x0A. More sophisticated parsers can take Windows-style newlines into account when preserving comments or interpreting them. Although this may require existing TOML v1.0 parsers to reimplement their comment handling code, these changes do ostensibly simplify parsing and address each of @abelbraaksma's concerns.
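The proposed wording can be sketched as follows, assuming Python, UTF-8 input held as `bytes`, and a hypothetical `read_comment` helper with a `strip_cr` flag standing in for the parser's "discretion". It is an illustration of the rule, not code from any implementation.

```python
def read_comment(data: bytes, start: int, strip_cr: bool = True) -> tuple[bytes, int]:
    """Return (comment bytes, index of the terminating LF or len(data))."""
    assert data[start:start + 1] == b"#"
    end = data.find(b"\n", start)
    terminated = end != -1
    if not terminated:
        end = len(data)
    comment = data[start:end]
    # Optionally drop a final CR, but only when a line feed actually follows it.
    if strip_cr and terminated and comment.endswith(b"\r"):
        comment = comment[:-1]
    return comment, end
```

A parser that discards comments can ignore the first element of the result and simply continue from the returned index.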

cleishm · 2022-10-19T08:47:31Z

For now, I've implemented the stricter mechanism in TomlJ - but in a bit of a sketchy way: I lex out any comment as # through to any control character (COMMENT : '#' (~[\u0000-\u0008\u000A-\u001F\u007F])*).

That works only because I know control characters aren't allowed anywhere else, so will immediately generate an error (except for \r?\n). But the error isn't particularly descriptive - it'll only indicate that the control character isn't expected, e.g. Unexpected '\\r', expected a newline or end-of-input, when it should really indicate that it isn't allowed within a comment.
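For illustration, the same character class can be written with Python's `re` module, with an extra check after the match that produces a comment-specific message. This is a sketch under assumptions: `lex_comment` is a hypothetical name, and it is not how TomlJ or ANTLR report errors.

```python
import re

# Same class as the lexer rule: '#' followed by anything except
# control characters other than tab.
COMMENT = re.compile(r"#[^\x00-\x08\x0a-\x1f\x7f]*")
NEWLINE = re.compile(r"\r?\n")


def lex_comment(text: str, pos: int) -> int:
    """Return the index just past the comment, or raise a descriptive error."""
    match = COMMENT.match(text, pos)
    if match is None:
        raise ValueError(f"expected '#' at offset {pos}")
    end = match.end()
    # The match stops at the first control character; anything other than
    # a newline (or end of input) at that position is an error.
    if end < len(text) and NEWLINE.match(text, end) is None:
        raise ValueError(
            f"control character U+{ord(text[end]):04X} is not allowed "
            f"within a comment (offset {end})"
        )
    return end
```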

abelbraaksma · 2022-10-20T17:33:58Z

@cleishm, not sure if this is feasible, but if there are inconsistent line endings, I'd report that as a specific error. Something like "Inconsistent line endings detected in file". Personally, I think any combination of CR|LF|CRLF should be allowed for simplicity's sake, but that ship has sailed, as mentioned in #924.

cleishm · 2022-10-20T21:20:24Z

@abelbraaksma For most parsers, it's probably not feasible. In my case, I use the ANTLR lexer (tokenizer) to detect newlines so that the parser only has to deal with the newline token. To give that kind of error would mean the lexer needs to be stateful - recording that it saw a newline of one type and then of a different type later. Not impossible, of course, but not trivial either.

abelbraaksma · 2022-10-21T22:17:24Z

@cleishm, I’m not sure how the lexer in your case defines the errors, but there ought to be a location where that exception is thrown. Since it’s only ever going to be a newline character that could possibly mean “wrong/corrupt line endings”, and since it could mean nothing else, you could just make the method that throws the error itself smarter, i.e. by simply switching on whether the incorrect token is a \r, and adjusting the message accordingly.

It’s been many years since I worked with ANTLR, so forgive me if I’m missing the obvious and oversimplifying things…

eksortso · 2022-10-22T01:03:07Z

As it stands, a TOML document with mixed LFs and CRLFs for line endings should not produce an error. Either line ending would be handled properly as a newline. And within multiline strings, the parser will normalize the line endings in the resulting strings.

abelbraaksma · 2022-10-22T12:49:24Z

Yes. But I think we were talking about sole CRs here, which are explicitly disallowed, when not followed by an LF.

eksortso · 2022-10-22T16:24:30Z

Which I agree with. My point is that the error message should not complain about "inconsistent" line endings, because they actually can be inconsistent, as long as only the two permitted line endings are used.

abelbraaksma · 2022-10-23T13:45:56Z

Totally, I didn’t mean to muddy the waters. Sorry for the confusion!

cleishm · 2022-10-24T08:34:36Z

In TOMLJ, sole CRs are already raised as errors, and a newline is tokenized from \r?\n, meaning it will consume any CR immediately before the linefeed (https://github.com/tomlj/tomlj/blob/main/src/main/antlr/org/tomlj/internal/TomlLexer.g4#L32). So this fits the current specification.

It won't handle documents where only CRs are used as line endings, but that is not currently permitted by the spec (and such documents are very uncommon now anyway).
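The newline behavior described in the last few comments can be sketched in Python as below. The function name `split_lines` is hypothetical, and the sketch deliberately ignores multi-line strings, where a purely line-based split would not be enough.

```python
import re

# A newline is LF, optionally preceded by CR, as in the lexer rule above.
NEWLINE = re.compile(r"\r?\n")


def split_lines(text: str) -> list[str]:
    """Split on LF or CRLF (mixing them is fine) and reject bare CRs."""
    lines = NEWLINE.split(text)
    for number, line in enumerate(lines, start=1):
        # Any CR left after splitting was not immediately followed by LF.
        if "\r" in line:
            raise ValueError(f"line {number}: carriage return not followed by line feed")
    return lines
```

Mixed LF and CRLF endings pass without complaint here; only a CR without a following LF is reported.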

This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I fix it. No one reports a problem? There is nothing to fix so I go drink beer.

No one really reported this as a problem, so there isn't anything to fix. But it *does* introduce entirely needless churn for all TOML implementations.

Do we need to forbid *anything* in comments? Probably not. In strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

And [none of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... another (smaller) set. That's not really a substantial change.

The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this.

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of extensive comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only superficial bearing on actual pragmatic reality.

Fixes toml-lang#995

pradyunsg added the discussion label Oct 9, 2018

pradyunsg added the abnf label May 13, 2019

pradyunsg added clarification and removed discussion labels Jun 12, 2019

pradyunsg self-assigned this Dec 12, 2019

pradyunsg mentioned this issue Dec 18, 2019

Clarify that control characters are not permitted in comments #691

Merged

pradyunsg closed this as completed in #691 Dec 22, 2019

larpon mentioned this issue Oct 22, 2021

toml: check for illegal characters, fix all related skipped tests vlang/v#12270

Merged

eksortso added a commit to eksortso/toml that referenced this issue Oct 11, 2022

Relax comment parsing, per discussion on toml-lang#567

3ddb05f

eksortso mentioned this issue Oct 11, 2022

Permit more control characters in comments #924

Merged

arp242 mentioned this issue Sep 30, 2023

Excluding LF in 'allowed-comment-char' is confusing #995

Open

arp242 mentioned this issue Oct 1, 2023

Revert "Permit more control characters in comments (#924)" #996

Open
