Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve SQL parsing of character literals (quoted strings) #5108

Merged
merged 13 commits into from
Nov 28, 2023

Conversation

erasmussen-first
Copy link
Contributor

@erasmussen-first erasmussen-first commented Oct 23, 2023

Impact

  • Bug fix (non-breaking change which fixes expected existing functionality)
  • Enhancement/New feature (adds functionality without impacting existing logic)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

This fixes an SQL parsing issue where a certain character sequence inside a quoted character literal (aka quoted string) is incorrectly tokenized (premature end-of-string detection), leading to errors executing the SQL command(s) on the database host. The issue affects users of MySQL, Postgres, and possibly other DBs.

Also, a SQL character literal test pattern which fails without this update is added to the SimpleSqlGrammarTest.groovy to confirm the fix and prevent future reversion.

Things to be aware of

Per highlightjs/highlight.js#1748

  • Some SQL implementations support only quote-char-twice as an escaped-quote, others only support backslash-quote, some support both, and sometimes it is configurable.
  • This means it is not possible to always conclusively identify quoted text boundaries without context about SQL platform and configuration.
  • A real fix is possibly to invoke a block of code defined in SimpleSqlGrammar.jj to dynamically check for false-positive matches on S_CHAR_LITERAL.

A related complication is that '\' is sometimes a valid string literal, and sometimes it is not.

A similar issue in a different SQL parser using JavaCC is mentioned here: JSQLParser/JSqlParser#1172.

Things to worry about

There are inconsistencies between different DB platforms in what is considered valid character literal syntax. Strings that are valid on one type of server may not be valid on another, and vice versa. This makes platform-agnostic end-of-string detection imprecise. Please test this very thoroughly on multiple DB platforms.

On the other hand, the current parser logic exposes MySQL, Postgres, and possibly other DBs to errors and/or unintended SQL command execution.

It may be that the truly correct fix involves detection of DB platform and configuration options.

Additional Context

The character pattern that this fixes was distilled out of changeSets that work correctly with MySQL 5.7 and Liquibase 3.5.4. I have not tested all the intervening releases to see exactly where it stopped working.

@filipelautert
Copy link
Collaborator

Hello @erasmussen-first ! Thanks for the PR.
We have one failed integration test . It's a mysql test, and it fails when running the following SQL:

CREATE PROCEDURE insert_shop ()
INSERT INTO shop 
VALUES
(1,'\'',3.45),
(1,'B',3.99),
(4,'\"',19.23),
(4,'\'\"',10.00)
;

CREATE FUNCTION f_insert_shop ()
RETURNS VARCHAR(20)
DETERMINISTIC
BEGIN
INSERT INTO shop 
VALUES
(5,'\'',3.45),
(5,'B',3.99),
(7,'\"',19.23),
(7,'\'\"',10.00);
RETURN ('STRING');
END
;

It is failing because it is not able to find the delimiter ";" to split the String into 2 SQLs, then it bundles it all together and we have a failure.
I managed to reproduce the issue adding the following line to the unit test file:

        "'a\'b;c\nd'"                                          | ["'a\'b;c\nd'"]

Notice that this is almost the same test that you added, but instead of using double \ I'm using just one. I tried some options but wasn't able to "fix" it.. any ideas?

@erasmussen-first
Copy link
Contributor Author

Thanks. I suspect the \' and \" within the same string are not handled correctly. I'll try to adjust the patterns later today.

…e inside double quotes (escape isn't needed in these cases, but should be supported)
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@filipelautert
Copy link
Collaborator

Hi @erasmussen-first ! thanks for the change, but it still not passing this check. I added the verification to the unit tests files so we are able to validate it using github build.

@erasmussen-first
Copy link
Contributor Author

erasmussen-first commented Oct 30, 2023

That's very strange. It seems to be working here with the pattern that I though you added in your comment. It may be that GitHub comments are turning a double backslash into a single backslash and I misinterpreted what you said. I'll merge in the latest master to make sure I'm not missing something else and try again with the new pattern added to the test.

Copy link
Collaborator

@filipelautert filipelautert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional tests passed here -> https://github.com/liquibase/liquibase-pro-tests/actions/runs/6825903760 .
@rberezen - this is a change to the heart of the
SQL parser, but @erasmussen-first did a great job and it is passing all of our tests.

| < #ESC_NON_QUOTE: "\\" ["n","t","b","r","f","\\","0"] >

/* SQL-standard is that string literals are delimited only by single-quote, and double-quotes are only for identifiers... */
//| < #S_QUOTED_STRING_A: ( "'" ( <ESC_S_QUOTE_A> | <ESC_NON_QUOTE> | ~["'"] )* "'") >
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erasmussen-first Hi! I believe we should uncomment these few lines, right?

//| < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_A> | <S_QUOTED_STRING_B> | <D_QUOTED_STRING_A> | <D_QUOTED_STRING_B> ) >
| < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? (<S_QUOTED_STRING_HYBRID> | <D_QUOTED_STRING_HYBRID>) >

// Previous logic...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erasmussen-first I do not think we should keep the previous logic. Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rberezen I think you mean remove those commented out lines?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@filipelautert yes, thank you ;)

Copy link
Contributor Author

@erasmussen-first erasmussen-first Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. I am happy to remove the comments. I had left them in to make it easy to compare the old and new patterns as well as to suggest ideas for further improvements. I will instead archive suggestions here in the comments and update the PR.

The patterns below may be of use in future work to more unambiguously determine end-of-string literal, but require the parser to know which syntax (A or B) applies to the database host. A possible solution for that is to implement optional parameters and default behavior in the <sql> and <sqlFile> tags. Trying all four patterns at once will produce mistakes, which is why this PR uses hybrid patterns.

    /* SQL-standard is that string literals are delimited only by single-quote, and double-quotes are only for identifiers... */
|   < #S_QUOTED_STRING_A: ( "'" ( <ESC_S_QUOTE_A> | <ESC_NON_QUOTE> | ~["'"] )* "'") >
|   < #S_QUOTED_STRING_B: ( "'" ( <ESC_S_QUOTE_A> | <ESC_S_QUOTE_B> | <ESC_D_QUOTE_B> | <ESC_NON_QUOTE>  | ~["\\","'"] )* "'") >
    /* ... but many DBs tolerate double-quotes around string literals, including MySQL (unless you enable ANSI SQL mode), and MSSQL (if you disable SET QUOTED_IDENTIFIER) */
|   < #D_QUOTED_STRING_A: ( "\"" ( <ESC_D_QUOTE_A> | <ESC_NON_QUOTE> | ~["\""] )* "\"") >
|   < #D_QUOTED_STRING_B: ( "\"" ( <ESC_S_QUOTE_B> | <ESC_D_QUOTE_A> | <ESC_D_QUOTE_B> | <ESC_NON_QUOTE> | ~["\\","\""] )* "\"") >
    /* Finally... (pick one based on DB host syntax) */
//|   < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_A> | <D_QUOTED_STRING_A> ) >
//|   < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_B> | <D_QUOTED_STRING_B> ) >

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erasmussen-first thank you very much for your help and contribution!

@filipelautert filipelautert added this to the 1NEXT milestone Nov 27, 2023
@filipelautert filipelautert merged commit dfb6c20 into liquibase:master Nov 28, 2023
38 of 42 checks passed
@filipelautert
Copy link
Collaborator

Thanks @erasmussen-first !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants