-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SQL parsing of character literals (quoted strings) #5108
Improve SQL parsing of character literals (quoted strings) #5108
Conversation
Hello @erasmussen-first ! Thanks for the PR.
It is failing because it is not able to find the delimiter ";" to split the String into 2 SQLs, then it bundles it all together and we have a failure.
Notice that this is almost the same test that you added, but instead of using double |
Thanks. I suspect the \' and \" within the same string are not handled correctly. I'll try to adjust the patterns later today. |
…e inside double quotes (escape isn't needed in these cases, but should be supported)
Kudos, SonarCloud Quality Gate passed! |
Hi @erasmussen-first ! thanks for the change, but it still not passing this check. I added the verification to the unit tests files so we are able to validate it using github build. |
That's very strange. It seems to be working here with the pattern that I though you added in your comment. It may be that GitHub comments are turning a double backslash into a single backslash and I misinterpreted what you said. I'll merge in the latest master to make sure I'm not missing something else and try again with the new pattern added to the test. |
…first/liquibase into improve-endDelimiter
…arsed test pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functional tests passed here -> https://github.com/liquibase/liquibase-pro-tests/actions/runs/6825903760 .
@rberezen - this is a change to the heart of the
SQL parser, but @erasmussen-first did a great job and it is passing all of our tests.
| < #ESC_NON_QUOTE: "\\" ["n","t","b","r","f","\\","0"] > | ||
|
||
/* SQL-standard is that string literals are delimited only by single-quote, and double-quotes are only for identifiers... */ | ||
//| < #S_QUOTED_STRING_A: ( "'" ( <ESC_S_QUOTE_A> | <ESC_NON_QUOTE> | ~["'"] )* "'") > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erasmussen-first Hi! I believe we should uncomment these few lines, right?
//| < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_A> | <S_QUOTED_STRING_B> | <D_QUOTED_STRING_A> | <D_QUOTED_STRING_B> ) > | ||
| < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? (<S_QUOTED_STRING_HYBRID> | <D_QUOTED_STRING_HYBRID>) > | ||
|
||
// Previous logic... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erasmussen-first I do not think we should keep the previous logic. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rberezen I think you mean remove those commented out lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@filipelautert yes, thank you ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi. I am happy to remove the comments. I had left them in to make it easy to compare the old and new patterns as well as to suggest ideas for further improvements. I will instead archive suggestions here in the comments and update the PR.
The patterns below may be of use in future work to more unambiguously determine end-of-string literal, but require the parser to know which syntax (A or B) applies to the database host. A possible solution for that is to implement optional parameters and default behavior in the <sql>
and <sqlFile>
tags. Trying all four patterns at once will produce mistakes, which is why this PR uses hybrid patterns.
/* SQL-standard is that string literals are delimited only by single-quote, and double-quotes are only for identifiers... */
| < #S_QUOTED_STRING_A: ( "'" ( <ESC_S_QUOTE_A> | <ESC_NON_QUOTE> | ~["'"] )* "'") >
| < #S_QUOTED_STRING_B: ( "'" ( <ESC_S_QUOTE_A> | <ESC_S_QUOTE_B> | <ESC_D_QUOTE_B> | <ESC_NON_QUOTE> | ~["\\","'"] )* "'") >
/* ... but many DBs tolerate double-quotes around string literals, including MySQL (unless you enable ANSI SQL mode), and MSSQL (if you disable SET QUOTED_IDENTIFIER) */
| < #D_QUOTED_STRING_A: ( "\"" ( <ESC_D_QUOTE_A> | <ESC_NON_QUOTE> | ~["\""] )* "\"") >
| < #D_QUOTED_STRING_B: ( "\"" ( <ESC_S_QUOTE_B> | <ESC_D_QUOTE_A> | <ESC_D_QUOTE_B> | <ESC_NON_QUOTE> | ~["\\","\""] )* "\"") >
/* Finally... (pick one based on DB host syntax) */
//| < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_A> | <D_QUOTED_STRING_A> ) >
//| < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_B> | <D_QUOTED_STRING_B> ) >
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erasmussen-first thank you very much for your help and contribution!
Thanks @erasmussen-first ! |
Impact
Description
This fixes an SQL parsing issue where a certain character sequence inside a quoted character literal (aka quoted string) is incorrectly tokenized (premature end-of-string detection), leading to errors executing the SQL command(s) on the database host. The issue affects users of MySQL, Postgres, and possibly other DBs.
Also, a SQL character literal test pattern which fails without this update is added to the SimpleSqlGrammarTest.groovy to confirm the fix and prevent future reversion.
Things to be aware of
Per highlightjs/highlight.js#1748
A related complication is that '\' is sometimes a valid string literal, and sometimes it is not.
A similar issue in a different SQL parser using JavaCC is mentioned here: JSQLParser/JSqlParser#1172.
Things to worry about
There are inconsistencies between different DB platforms in what is considered valid character literal syntax. Strings that are valid on one type of server may not be valid on another, and vice versa. This makes platform-agnostic end-of-string detection imprecise. Please test this very thoroughly on multiple DB platforms.
On the other hand, the current parser logic exposes MySQL, Postgres, and possibly other DBs to errors and/or unintended SQL command execution.
It may be that the truly correct fix involves detection of DB platform and configuration options.
Additional Context
The character pattern that this fixes was distilled out of changeSets that work correctly with MySQL 5.7 and Liquibase 3.5.4. I have not tested all the intervening releases to see exactly where it stopped working.