Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

wahajenius
Copy link

@wahajenius wahajenius commented Oct 22, 2024

Describe the PR

The Apex lexer fails to correctly parse Unicode escape sequences when it does not operate on a CaseInsensitiveInputStream.
For example:

ApexLexer lexer = new ApexLexer(CharStreams.fromString("'Fran\\u00E7ois'"));
CommonTokenStream tokens  = new CommonTokenStream(lexer);
tokens.fill();

This causes a runtime error during tokens.fill():

line 1:0 token recognition error at: ''Fran\u00E'
line 1:14 token recognition error at: '''

The character stream must be wrapped in a CaseInsensitiveInputStream:

ApexLexer lexer = new ApexLexer(new CaseInsensitiveInputStream(CharStreams.fromString("'Fran\\u00E7ois'")));
...

Performing a pmd check on Apex classes eventually leads to ApexCommentBuilder.extractInformationFromComments(), which does not include the wrapped CaseInsensitiveInputStream yet, causing analysis failures. This PR adds the wrapper and a unit test demonstrating the issue.

Related issues

Ready?

  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

@pmd-test
Copy link

1 Message
📖 Compared to main:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Download full report as build artifact

Generated by 🚫 Danger

@adangel adangel added this to the 7.7.0 milestone Oct 22, 2024
@adangel adangel changed the title [apex] Must use case-insensitive input stream to avoid choking on Unicode escape sequences [apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences Oct 22, 2024
Copy link
Member

@adangel adangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wahajenius, thanks for the PR!

This looks almost good, but we should add a proper test. Please have a look at my comments.

Comment on lines +60 to +61
assertEquals(2, getLexingErrors(CharStreams.fromString(s)));
assertEquals(0, getLexingErrors(new CaseInsensitiveInputStream(CharStreams.fromString(s))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, that this test doesn't test PMD - it is testing the underlying library. But it doesn't make sure, that PMD is using the underlying library correctly...
This means, if your fix is removed, this test stays green, although PMD will be broken...

Please add a test case, e.g. in https://github.com/pmd/pmd/blob/main/pmd-apex/src/test/java/net/sourceforge/pmd/lang/apex/ast/ApexCommentTest.java

The test could look something like this:

    @Test
    void fileWithUnicodeEscapes() {
        ASTApexFile file = apex.parse(FORMAL_COMMENT_CONTENT + "\n"
                + "class MyClass { String s = 'Fran\\u00E7ois'; }");
        ASTFormalComment comment = file.descendants(ASTUserClass.class).children(ASTFormalComment.class).first();
        assertEquals(FORMAL_COMMENT_CONTENT, comment.getImage());
    }

Thanks to the custom error listener in ApexCommentBuilder, this should fail, if we don't apply your fix (using CaseInsensitiveInputStream) and should be working with your fix.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points!

I agree, my original test doesn't cover the specific situation on how the library is being used by PMD. I intended it more as a showcase of the peculiar behavior of the apex-lexer library rather than as a proper unit test for ApexCommentBuilder. Because it's odd that the library apparently breaks on Unicode escape sequences in strings - those shouldn't be affected at all by the fact that the Apex language is case-insensitive. I will notify the maintainers of apex-parser because it may be a bug. My intention for the PR here was to raise awareness and provide a quick fix for PMD's use case while the underlying lexing issue is under investigation.

I noticed that ApexCommentBuilder doesn't have unit tests yet and as it's fairly complex and I'm not familiar with the codebase, writing a proper unit test seemed like a daunting task. It's indeed a good idea to add it ApexCommentTest.

Should I remove my original test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I remove my original test?

I think, we can keep these tests. Maybe add a link to apex-dev-tools/apex-parser#55 (thanks btw. to report this upstream).

return errorListener.getErrorCount();
}

static class ErrorListener extends BaseErrorListener {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static class ErrorListener extends BaseErrorListener {
private static class ErrorListener extends BaseErrorListener {

@@ -103,7 +104,8 @@ public void buildFormalComment(AbstractApexNode node) {
}

private static CommentInformation extractInformationFromComments(TextDocument sourceCode, String suppressMarker) {
ApexLexer lexer = new ApexLexer(CharStreams.fromString(sourceCode.getText().toString()));
String source = sourceCode.getText().toString();
ApexLexer lexer = new ApexLexer(new CaseInsensitiveInputStream(CharStreams.fromString(source)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should here also remove the default error listener of antlr and add our own, that really fails instead of just logging to stderr:

        lexer.removeErrorListeners();
        lexer.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
                throw new ParseException("line " + line + ":" + charPositionInLine + " " + msg);
            }
        });

(Note: ParseException = net.sourceforge.pmd.lang.ast.ParseException)

Otherwise, we won't be able to create an test easily, as we won't react on any lexer error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something similar is already used elsewhere:

Perhaps that custom error listener can be refactored into a dedicated class and reused in all cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good observation. However, I'd prefer to keep this PR limited and try to avoid growing it further.

These refactorings you mention makes sense, but should be done in a separate step - separate PR. Note, that we deliberately don't throw an exception in PmdSwiftParser, as the grammar is far from complete - throwing an exception here basically means, that only few Swift code can be parsed.
Note, that we should keep throwing a ParseException instead of a LexException in ApexCommentBuilder, as this is called from the PMD "check" goal rather than from CPD (copy paste detection).
Looking at the other examples, the exception should be probably build with:

throw new ParseException(msg).withLocation(FileLocation.caret(sourceCode.getFileId(), line, charPositionInLine));

@adangel adangel modified the milestones: 7.7.0, 7.8.0 Oct 24, 2024
adangel added a commit that referenced this pull request Nov 14, 2024
adangel added a commit that referenced this pull request Nov 14, 2024
…escape sequences (#5284)

Merge pull request #5284 from wahajenius:main
@adangel adangel merged commit 8c58a0b into pmd:main Nov 14, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[apex] Token recognition errors for string containing unicode escape sequence
3 participants