[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284

wahajenius · 2024-10-22T12:39:57Z

Describe the PR

The Apex lexer fails to correctly parse Unicode escape sequences when it does not operate on a CaseInsensitiveInputStream.
For example:

ApexLexer lexer = new ApexLexer(CharStreams.fromString("'Fran\\u00E7ois'"));
CommonTokenStream tokens  = new CommonTokenStream(lexer);
tokens.fill();

This causes a runtime error during tokens.fill():

line 1:0 token recognition error at: ''Fran\u00E'
line 1:14 token recognition error at: '''

The character stream must be wrapped in a CaseInsensitiveInputStream:

ApexLexer lexer = new ApexLexer(new CaseInsensitiveInputStream(CharStreams.fromString("'Fran\\u00E7ois'")));
...

Performing a pmd check on Apex classes eventually leads to ApexCommentBuilder.extractInformationFromComments(), which does not include the wrapped CaseInsensitiveInputStream yet, causing analysis failures. This PR adds the wrapper and a unit test demonstrating the issue.

Related issues

Fixes [apex] Token recognition errors for string containing unicode escape sequence #5333

Ready?

Added unit tests for fixed bug/feature
Passing all unit tests
Complete build ./mvnw clean verify passes (checked automatically by github actions)
Added (in-code) documentation (if needed)

…code escape sequences

pmd-test · 2024-10-22T15:49:54Z

	1 Message
📖	Compared to main: This changeset changes 0 violations, introduces 0 new violations, 0 new errors and 0 new configuration errors, removes 0 violations, 0 errors and 0 configuration errors. Download full report as build artifact

Generated by 🚫 Danger

adangel

Hi @wahajenius, thanks for the PR!

This looks almost good, but we should add a proper test. Please have a look at my comments.

adangel · 2024-10-22T19:12:00Z

pmd-apex/src/test/java/net/sourceforge/pmd/lang/apex/ast/ApexLexerTest.java

+        assertEquals(2, getLexingErrors(CharStreams.fromString(s)));
+        assertEquals(0, getLexingErrors(new CaseInsensitiveInputStream(CharStreams.fromString(s))));


Note, that this test doesn't test PMD - it is testing the underlying library. But it doesn't make sure, that PMD is using the underlying library correctly...
This means, if your fix is removed, this test stays green, although PMD will be broken...

Please add a test case, e.g. in https://github.com/pmd/pmd/blob/main/pmd-apex/src/test/java/net/sourceforge/pmd/lang/apex/ast/ApexCommentTest.java

The test could look something like this:

@Test void fileWithUnicodeEscapes() { ASTApexFile file = apex.parse(FORMAL_COMMENT_CONTENT + "\n" + "class MyClass { String s = 'Fran\\u00E7ois'; }"); ASTFormalComment comment = file.descendants(ASTUserClass.class).children(ASTFormalComment.class).first(); assertEquals(FORMAL_COMMENT_CONTENT, comment.getImage()); }

Thanks to the custom error listener in ApexCommentBuilder, this should fail, if we don't apply your fix (using CaseInsensitiveInputStream) and should be working with your fix.

Good points!

I agree, my original test doesn't cover the specific situation on how the library is being used by PMD. I intended it more as a showcase of the peculiar behavior of the apex-lexer library rather than as a proper unit test for ApexCommentBuilder. Because it's odd that the library apparently breaks on Unicode escape sequences in strings - those shouldn't be affected at all by the fact that the Apex language is case-insensitive. I will notify the maintainers of apex-parser because it may be a bug. My intention for the PR here was to raise awareness and provide a quick fix for PMD's use case while the underlying lexing issue is under investigation.

I noticed that ApexCommentBuilder doesn't have unit tests yet and as it's fairly complex and I'm not familiar with the codebase, writing a proper unit test seemed like a daunting task. It's indeed a good idea to add it ApexCommentTest.

Should I remove my original test?

Should I remove my original test?

I think, we can keep these tests. Maybe add a link to apex-dev-tools/apex-parser#55 (thanks btw. to report this upstream).

adangel · 2024-10-22T19:12:14Z

pmd-apex/src/test/java/net/sourceforge/pmd/lang/apex/ast/ApexLexerTest.java

+        return errorListener.getErrorCount();
+    }
+
+    static class ErrorListener extends BaseErrorListener {


Suggested change

static class ErrorListener extends BaseErrorListener {

private static class ErrorListener extends BaseErrorListener {

adangel · 2024-10-22T19:22:31Z

pmd-apex/src/main/java/net/sourceforge/pmd/lang/apex/ast/ApexCommentBuilder.java

@@ -103,7 +104,8 @@ public void buildFormalComment(AbstractApexNode node) {
    }

    private static CommentInformation extractInformationFromComments(TextDocument sourceCode, String suppressMarker) {
-        ApexLexer lexer = new ApexLexer(CharStreams.fromString(sourceCode.getText().toString()));
+        String source = sourceCode.getText().toString();
+        ApexLexer lexer = new ApexLexer(new CaseInsensitiveInputStream(CharStreams.fromString(source)));


We should here also remove the default error listener of antlr and add our own, that really fails instead of just logging to stderr:

lexer.removeErrorListeners(); lexer.addErrorListener(new BaseErrorListener() { @Override public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) { throw new ParseException("line " + line + ":" + charPositionInLine + " " + msg); } });

(Note: ParseException = net.sourceforge.pmd.lang.ast.ParseException)

Otherwise, we won't be able to create an test easily, as we won't react on any lexer error.

Something similar is already used elsewhere:

AntlrTokenManager, throws LexException

GroovyTokenManager, throws LexException

PmdSwiftParser, only logs, TODO for throwing ParseException

Perhaps that custom error listener can be refactored into a dedicated class and reused in all cases.

That's a good observation. However, I'd prefer to keep this PR limited and try to avoid growing it further.

These refactorings you mention makes sense, but should be done in a separate step - separate PR. Note, that we deliberately don't throw an exception in PmdSwiftParser, as the grammar is far from complete - throwing an exception here basically means, that only few Swift code can be parsed.
Note, that we should keep throwing a ParseException instead of a LexException in ApexCommentBuilder, as this is called from the PMD "check" goal rather than from CPD (copy paste detection).
Looking at the other examples, the exception should be probably build with:

throw new ParseException(msg).withLocation(FileLocation.caret(sourceCode.getFileId(), line, charPositionInLine));

…escape sequences (#5284) Merge pull request #5284 from wahajenius:main

[apex] Must use case-insensitive input stream to avoid choking on Uni…

61b1c37

…code escape sequences

adangel added this to the 7.7.0 milestone Oct 22, 2024

adangel changed the title ~~[apex] Must use case-insensitive input stream to avoid choking on Unicode escape sequences~~ [apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences Oct 22, 2024

adangel requested changes Oct 22, 2024

View reviewed changes

adangel modified the milestones: 7.7.0, 7.8.0 Oct 24, 2024

adangel added a commit that referenced this pull request Nov 14, 2024

[doc] Update release notes (#5284, #5333)

093683b

adangel added a commit that referenced this pull request Nov 14, 2024

[apex] Use case-insensitive input stream to avoid choking on Unicode …

206ed8b

…escape sequences (#5284) Merge pull request #5284 from wahajenius:main

adangel merged commit 8c58a0b into pmd:main Nov 14, 2024
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284

[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284

wahajenius commented Oct 22, 2024 •

edited by adangel

Loading

pmd-test commented Oct 22, 2024

adangel left a comment

adangel Oct 22, 2024

wahajenius Oct 23, 2024

adangel Oct 24, 2024

adangel Oct 22, 2024

adangel Oct 22, 2024

wahajenius Oct 23, 2024

adangel Oct 24, 2024

		assertEquals(2, getLexingErrors(CharStreams.fromString(s)));
		assertEquals(0, getLexingErrors(new CaseInsensitiveInputStream(CharStreams.fromString(s))));

	static class ErrorListener extends BaseErrorListener {
	private static class ErrorListener extends BaseErrorListener {

[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284

[apex] Use case-insensitive input stream to avoid choking on Unicode escape sequences #5284

Conversation

wahajenius commented Oct 22, 2024 • edited by adangel Loading

Describe the PR

Related issues

Ready?

pmd-test commented Oct 22, 2024

adangel left a comment

Choose a reason for hiding this comment

adangel Oct 22, 2024

Choose a reason for hiding this comment

wahajenius Oct 23, 2024

Choose a reason for hiding this comment

adangel Oct 24, 2024

Choose a reason for hiding this comment

adangel Oct 22, 2024

Choose a reason for hiding this comment

adangel Oct 22, 2024

Choose a reason for hiding this comment

wahajenius Oct 23, 2024

Choose a reason for hiding this comment

adangel Oct 24, 2024

Choose a reason for hiding this comment

wahajenius commented Oct 22, 2024 •

edited by adangel

Loading