Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear whether control characters are allowed in comments #567

Closed
ghost opened this issue Oct 6, 2018 · 21 comments · Fixed by #691
Closed

Unclear whether control characters are allowed in comments #567

ghost opened this issue Oct 6, 2018 · 21 comments · Fixed by #691
Assignees

Comments

@ghost
Copy link

ghost commented Oct 6, 2018

It is not clear from the specification whether control characters (U+0000 .. U+001F and U+007F) are allowed in comments.

The section Comment only states that "A hash symbol marks the rest of the line as a comment.".
This implies that any non-newline characters are allowed after the hash symbol.

Discussion about control characters only appears in the section String, which should have no implication on comments.

The ABNF definition forbids control characters (except tab) in comments, but the ABNF is not authoritative.

I think the section Comment should explicitly state which characters are allowed in comments.
Note that it would be unlogical to allow e.g. form-feed inside comments, because form-feed is traditionally a stronger separator than newline.

@pradyunsg
Copy link
Member

ABNF is not authoritative.

It is.

I think the section Comment should explicitly state which characters are allowed in comments.

If there's more people who think this would add value, we might do this. I don't want to complicate an otherwise straightforward explanation.

@ghost
Copy link
Author

ghost commented Oct 9, 2018

ABNF is not authoritative.

It is.

The general theme in issues #566, #567, #568, #569 is that I believe the specification to be ambiguous. I assumed the text document is the complete, stand-alone, authoritative specification, while the ABNF has only experimental status.

If indeed the ABNF is authoritative, perhaps these issues are resolved.
However I think in that case the specification text should explicitly declare the ABNF grammer to be authoritative and defer to it for details. Otherwise how will anybody know where to look for a precise definition of TOML ?

@rmunn
Copy link

rmunn commented Oct 25, 2018

From the point of view of a parser, it's probably easier to allow control characters (except for newline) in comments, because then the rule is "Once you see a comment marker, ignore everything until the next newline". This is simpler (at least for some parser libraries) to implement than "Once you see the comment marker, ignore all valid characters but throw an error for 0x00-0x1f and 0x7f, and switch modes at the newline".

So I'd personally prefer for the rule to be "control characters are legal (and ignored) in comments; after a comment marker is seen, only the next newline matters."

@ChristianSi
Copy link
Contributor

@rmunn On the other hand, if control chars are prohibited in comments too, they would be prohibited anywhere in a TOML document since they are also prohibited in strings of all kinds (and they are certainly not allowed outside of strings or comments). So prohibiting them everywhere might actually be quite simple for many parsers.

@pradyunsg
Copy link
Member

Currently, I am ambivalent on this issue tbh. I don't see any major gains either way so I'm inclined to say status quo would win here.

I can be convinced either way though.

@ChristianSi
Copy link
Contributor

ChristianSi commented Nov 10, 2018

@pradyunsg I assume if you say "status quo" you mean the status quo as defined by the ABNF (which prohibits control chars in comments)?

The original poster's point was exactly that, in the human-readable spec, the status quo is not defined. I too would advice clarifying this in the written spec in addition to the ABNF, for people who can read English better than Backus–Naur.

@pradyunsg
Copy link
Member

Thanks everyone for the patience here -- I took a long time to come back around to this.

Yes, I think forbidding control characters in comments makes sense. If there's a significant use case for control characters in comments, do holler.

Meanwhile, I think what we need here is a PR clarifying in text what's already clear in the ABNF -- that control characters are not allowed in comments.

@pradyunsg pradyunsg self-assigned this Dec 12, 2019
marzer added a commit to marzer/tomlplusplus that referenced this issue Dec 22, 2019
- implemented hexfloats
- handled trailing underscores in ints and floats
- improved parse performance for single-digit integers
- handled control characters in comments (per toml-lang/toml#567)
@abelbraaksma
Copy link
Contributor

abelbraaksma commented Dec 22, 2019

I understand this is closed now, and I also understand the logic behind the decision, however I think the following has not been considered and might make you reconsider the decision here:

If a comment with certain Unicode characters (except newline) is to be considered invalid, then this has the following drawbacks:

  • a comment can involuntarily invalidate the whole TOML file. I think such accidents should be prevented.
  • parsers must also parse each character in a comment. This may complicates parsing. Not because it's almost the same as 'string', but because of different handling of error situations, informing users why a comment is invalid.
  • on the same token, it slows down parsing,whatever way you look at it
  • I think the status quo requires strings to be valid Unicode strings. That's a little more parsing than just checking for control chars, and a parser really shouldn't be bothered with this added complexity when it encounters comments.
  • conceptually, people consider comments not part of the syntax, and feel free to dump whatever they like in them. Disallowing certain chars contradicts this very human 'feeling' about what comments are.

Bottom line of this argument, I'd like to propose a simplification: define a comment as any range of code points, until a newline is encountered.

This is simpler, and clearer for end users.

@cleishm
Copy link

cleishm commented Oct 9, 2022

As an implementation author, I agree with @abelbraaksma above. Firstly, it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users - it results in documents being rejected for content that is commented out, which is itself an indication by the user that the content should be ignored by the parser.

@eksortso
Copy link
Contributor

I don't seriously object to allowing control characters in comments. Are there any security concerns about them here? I could object to a delete char, 0x7F, immediately following the hash sign. But that's it.

If you think it's merited, let's reopen this issue, and I can compose a PR that would allow most everything but newlines in comments.

@ChristianSi
Copy link
Contributor

ChristianSi commented Oct 10, 2022

I understand these sentiments, but currently the spec says "Control characters other than tab ... are not permitted in comments." So if we change this again, all 1.0-compatible parsers would supposedly have to be changed again.

Also some implementations might conceivably check for control chars before doing any further parsing step? If so, like I wrote earlier: "prohibiting them everywhere might actually be quite simple for many parsers."

Maybe this is a good case to let implementations decide on their behavior? The spec could say something like:

TOML implementations may throw an error or issue a warning if encountering any control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F) in a comment, but they don't have to do so.

In that way, both rejecting control chars and ignoring them together with the rest of the comment would both be fine.

This language would have to be adjusted regarding Unicode validity too, but here I think the same logic applies: some parsers might get the rejection of invalid Unicode sequences for free, because they use an OS/library function that does it for them. For them, if we require that "any range of code points" is accepted in the comment, whether valid Unicode or not, might actually make things harder. So "it's implementation-dependent" may be the best course of action too.

eksortso added a commit to eksortso/toml that referenced this issue Oct 11, 2022
@eksortso
Copy link
Contributor

@ChristianSi We don't need to say anything if implementations are going to choose their own behavior for handling comments. We just need to be simple and obvious. So we can simplify things to their essentials, and leave the rest to interpretation.

It's a given that any TOML document must be a valid Unicode document encoded in UTF-8. So we need not worry about invalid code points (like surrogates) or invalid byte strings, which would yield well-defined errors of their own.

One concern that wasn't addressed is whether or not the 0x0D in a Windows-type newline ought to be ignored. Since the line feed character is central to how comments work, we need only to mention that either type of newline can mark the end of a comment. I propose this text to replace the restrictions at the end of the Comments section in toml.md:

All characters except line feeds (0x0A) are permitted in comments and may be
ignored. At their discretion, parsers that read comments may exclude a final
carriage return (0x0D) appearing before a terminating line feed.

That's about all that needs to be said, I think. Simple parsers can just start from the #, scan ahead for 0x0A, and throw out all that stuff, excluding the 0x0A. More sophisticated parsers can take Windows-style newlines into account when preserving comments or interpreting them. Although this may require existing TOML v1.0 parsers to reimplement their comment handling code, these changes do ostensibly simplify parsing and address each of @abelbraaksma's concerns.

@cleishm
Copy link

cleishm commented Oct 19, 2022

For now, I've implemented the stricter mechanism in TomlJ - but in a bit of a sketchy way: I lex out any comment as # through to any control character (COMMENT : '#' (~[\u0000-\u0008\u000A-\u001F\u007F])*).

That works only because I know control characters aren't allowed anywhere else, so will immediately generate an error (except for \r?\n). But the error isn't particularly descriptive - it'll only indicate that the control character isn't expected, e.g. Unexpected '\\r', expected a newline or end-of-input, when it should really indicate that it isn't allowed within a comment.

@abelbraaksma
Copy link
Contributor

@cleishm, not sure if this is feasible, but if there are inconsistent line endings, I'd report that as a specific error. Something like "Inconsistent line endings detected in file". Personally, I think any combination of CR|LF|CRLF should be allowed for simplicity's sake, but that ship has sailed, as mentioned in #924.

@cleishm
Copy link

cleishm commented Oct 20, 2022

@abelbraaksma For most parsers, it's probably not feasible. In my case, I use the ANTLR lexer (tokenizer) to detect newlines so that the parser only has to deal with the newline token. To give that kind of error would mean the lexer needs to be stateful - recording that it saw a newline of one type and then of a different type later. Not impossible, of course, but not trivial either.

@abelbraaksma
Copy link
Contributor

@cleishm, I’m not sure how the lexer in your case defines the errors, but there ought to be a location where that exception is thrown. Since it’s only ever going to be a new line character that could possibly mean “wrong/corrupt line endings” and since it could mean nothing else, you could just make the method that throws the error itself smarter, ie by simply switching over whether the incorrect token is a \r, and adjust the message accordingly.

It’s been many years ago that I worked with ANTLR, so forgive me if I’m missing the obvious, and oversimplify things…

@eksortso
Copy link
Contributor

As it stands, a TOML document with mixed LFs and CRLFs for line endings should not produce an error. Either line ending would be handled properly as a newline. And within multiline strings, the parser will normalize the line endings in the resulting strings.

@abelbraaksma
Copy link
Contributor

Yes. But I think we were talking about sole CRs here, which are explicitly disallowed, when not followed by an LF.

@eksortso
Copy link
Contributor

Which I agree with. My point is that the error message should not complain about "inconsistent" line endings, because they actually can be inconsistent, as long as only the two permitted line endings are used.

@abelbraaksma
Copy link
Contributor

Totally, I didn’t mean to muddy the waters. Sorry for the confusion!

@cleishm
Copy link

cleishm commented Oct 24, 2022

In TOMLJ, sole CRs are already raised as errors, and a newline is tokenized from \r?\n, meaning it will consume any CR immediately before the linefeed (https://github.com/tomlj/tomlj/blob/main/src/main/antlr/org/tomlj/internal/TomlLexer.g4#L32). So this fits the current specification.

It won't handle documents where only CRs are used as line endings, but that is not currently permitted by the spec (and such documents are very uncommon now anyway).

arp242 added a commit to arp242/toml that referenced this issue Oct 1, 2023
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I fix it. No one reports a problem? There is nothing to fix so I go drink beer.

No one really reported this as a problem, so there isn't anything to fix. But it *does* introduce entirely needless churn for all TOML implementations. Do we need to forbid *anything* in comments? Probably not. In strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

And [none of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are very close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... another (smaller) set. That's not really a substantial change. The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this.

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue written a fairly large number of extensive comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only superficial bearing to actual pragmatic reality.

Fixes toml-lang#995
arp242 added a commit to arp242/toml that referenced this issue Oct 1, 2023
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I drink coffee and fix it. No one reports a problem? There is nothing to fix and I go drink beer.

No one really reported this as a problem, but it *does* introduce needless churn for all TOML implementations and the test suite. Do we need to forbid *anything* in comments? Probably not, and in strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

[None of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are very close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... "disallow this set of control characters" (but for a different set). That's not really a substantial or meaningful change. The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this, so...

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue written a fairly large number of large comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only passing familiarity with any actual pragmatic reality.

Fixes toml-lang#995
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants