-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284
Conversation
…code escape sequences
Generated by 🚫 Danger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @wahajenius, thanks for the PR!
This looks almost good, but we should add a proper test. Please have a look at my comments.
assertEquals(2, getLexingErrors(CharStreams.fromString(s))); | ||
assertEquals(0, getLexingErrors(new CaseInsensitiveInputStream(CharStreams.fromString(s)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, that this test doesn't test PMD - it is testing the underlying library. But it doesn't make sure, that PMD is using the underlying library correctly...
This means, if your fix is removed, this test stays green, although PMD will be broken...
Please add a test case, e.g. in https://github.com/pmd/pmd/blob/main/pmd-apex/src/test/java/net/sourceforge/pmd/lang/apex/ast/ApexCommentTest.java
The test could look something like this:
@Test
void fileWithUnicodeEscapes() {
ASTApexFile file = apex.parse(FORMAL_COMMENT_CONTENT + "\n"
+ "class MyClass { String s = 'Fran\\u00E7ois'; }");
ASTFormalComment comment = file.descendants(ASTUserClass.class).children(ASTFormalComment.class).first();
assertEquals(FORMAL_COMMENT_CONTENT, comment.getImage());
}
Thanks to the custom error listener in ApexCommentBuilder, this should fail, if we don't apply your fix (using CaseInsensitiveInputStream) and should be working with your fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points!
I agree, my original test doesn't cover the specific situation on how the library is being used by PMD. I intended it more as a showcase of the peculiar behavior of the apex-lexer library rather than as a proper unit test for ApexCommentBuilder
. Because it's odd that the library apparently breaks on Unicode escape sequences in strings - those shouldn't be affected at all by the fact that the Apex language is case-insensitive. I will notify the maintainers of apex-parser because it may be a bug. My intention for the PR here was to raise awareness and provide a quick fix for PMD's use case while the underlying lexing issue is under investigation.
I noticed that ApexCommentBuilder
doesn't have unit tests yet and as it's fairly complex and I'm not familiar with the codebase, writing a proper unit test seemed like a daunting task. It's indeed a good idea to add it ApexCommentTest
.
Should I remove my original test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I remove my original test?
I think, we can keep these tests. Maybe add a link to apex-dev-tools/apex-parser#55 (thanks btw. to report this upstream).
return errorListener.getErrorCount(); | ||
} | ||
|
||
static class ErrorListener extends BaseErrorListener { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static class ErrorListener extends BaseErrorListener { | |
private static class ErrorListener extends BaseErrorListener { |
@@ -103,7 +104,8 @@ public void buildFormalComment(AbstractApexNode node) { | |||
} | |||
|
|||
private static CommentInformation extractInformationFromComments(TextDocument sourceCode, String suppressMarker) { | |||
ApexLexer lexer = new ApexLexer(CharStreams.fromString(sourceCode.getText().toString())); | |||
String source = sourceCode.getText().toString(); | |||
ApexLexer lexer = new ApexLexer(new CaseInsensitiveInputStream(CharStreams.fromString(source))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should here also remove the default error listener of antlr and add our own, that really fails instead of just logging to stderr:
lexer.removeErrorListeners();
lexer.addErrorListener(new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
throw new ParseException("line " + line + ":" + charPositionInLine + " " + msg);
}
});
(Note: ParseException = net.sourceforge.pmd.lang.ast.ParseException)
Otherwise, we won't be able to create an test easily, as we won't react on any lexer error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something similar is already used elsewhere:
- AntlrTokenManager, throws
LexException
- GroovyTokenManager, throws
LexException
- PmdSwiftParser, only logs, TODO for throwing
ParseException
Perhaps that custom error listener can be refactored into a dedicated class and reused in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good observation. However, I'd prefer to keep this PR limited and try to avoid growing it further.
These refactorings you mention makes sense, but should be done in a separate step - separate PR. Note, that we deliberately don't throw an exception in PmdSwiftParser, as the grammar is far from complete - throwing an exception here basically means, that only few Swift code can be parsed.
Note, that we should keep throwing a ParseException instead of a LexException in ApexCommentBuilder, as this is called from the PMD "check" goal rather than from CPD (copy paste detection).
Looking at the other examples, the exception should be probably build with:
throw new ParseException(msg).withLocation(FileLocation.caret(sourceCode.getFileId(), line, charPositionInLine));
Describe the PR
The Apex lexer fails to correctly parse Unicode escape sequences when it does not operate on a
CaseInsensitiveInputStream
.For example:
This causes a runtime error during
tokens.fill()
:The character stream must be wrapped in a
CaseInsensitiveInputStream
:Performing a
pmd check
on Apex classes eventually leads toApexCommentBuilder.extractInformationFromComments()
, which does not include the wrappedCaseInsensitiveInputStream
yet, causing analysis failures. This PR adds the wrapper and a unit test demonstrating the issue.Related issues
Ready?
./mvnw clean verify
passes (checked automatically by github actions)