ctags: Add quick path for looking up too long strings in the keyword table #3433

Merged
merged 1 commit into from
Mar 24, 2023

Conversation

techee
Member

@techee techee commented Mar 21, 2023

Parser code like

vString *str = vStringNew();
while (someCondition)
{
    int c = getcFromInputFile();
    vStringPut(str, c);
    if (lookupCaseKeyword (vStringValue(str), some_lang))
    {
        do_stuff();
        vStringClear(str);
    }
}

is prone to quadratic complexity because when someCondition isn't satisfied in the parsed file, the string str grows by one in each iteration and in each iteration lookupCaseKeyword() has to go through strlen(str) characters to compute the hash.

Since we know the maximum length of the strings inside the keyword hash table, we can store this value and if the queried string is longer than this value, we can be sure it isn't present in the hash table and return quickly without having to compute the full hash of the string.

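A minimal sketch of the idea (not the actual ctags code; maxKeywordLen, hashWithQuickPath and the djb2-style hash below are only illustrative):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static size_t maxKeywordLen = 0;   /* longest keyword ever added to the table */

/* Called whenever a keyword is inserted: remember the longest key stored. */
static void rememberKeywordLength (const char *keyword)
{
    size_t len = strlen (keyword);
    if (len > maxKeywordLen)
        maxKeywordLen = len;
}

/* Hash the string, but give up as soon as it gets longer than the longest
 * keyword in the table - such a string cannot be stored there, so the
 * lookup can return KEYWORD_NONE without scanning the rest of the input. */
static unsigned int hashWithQuickPath (const char *string, bool *tooLong)
{
    unsigned int hash = 5381;      /* djb2-style hash, purely illustrative */
    size_t len = 0;
    const char *p;

    *tooLong = false;
    for (p = string; *p != '\0'; p++)
    {
        if (++len > maxKeywordLen)
        {
            *tooLong = true;       /* quick path: stop hashing early */
            return 0;
        }
        hash = hash * 33 + (unsigned char) *p;
    }
    return hash;
}
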
This code has already been merged into ctags master so we can cherry-pick it safely. I'll merge it in about a week if there are no objections.

Fixes the hang of the SQL parser reported in geany/geany-osx#42.

@elextr
Member

elextr commented Mar 22, 2023

Couple of comments/questions

  1. after the patch it still goes through strlen characters and computes the hash (at least up to maxlen)
  2. so computing the hash is unlikely to be the issue since it still happens with the patch,
  3. I am not sure about the characteristics of the hash implementation with unruly input, does it limit the chain lengths, or can they extend up to the maximum load factor?
  4. if the chain lengths can grow long then the cost of strcasecmp will dominate since it is slow, having to apply tolower on both strings before comparing, and IIRC SQL is case insensitive.
  5. why the **** is the problematic code searching the hash each character in the first place? Why not scan input until the next end of word and search for that word once?

[Edit: just to be clear in case it's not immediately obvious, "unruly" input is where hashes cluster on one or a few values, so those chains grow]

@techee
Member Author

techee commented Mar 22, 2023

after the patch it still goes through strlen characters and computes the hash (at least up to maxlen)

The key is that it's up to maxLen and not potentially the whole-file length. Assume, for instance, that you have a 1 MB file and this problem happens at the beginning of the file. Also assume that maxLen (which corresponds to the longest keyword) is, say, 30. With this patch, the code will go through at most 30 * 1 000 000 characters (30 in each iteration). Without this patch, it will go through 1 character in the first iteration, 2 characters in the second, ..., and 1 000 000 in the 1 000 000th iteration, so about 1 000 000 * 1 000 000 / 2 in total.

so computing the hash is unlikely to be the issue since it still happens with the patch,

Computing the hash isn't the problem - the problem is how many characters it is computed over unnecessarily when we know in advance that the string cannot be inside the hash table because it's too long. But instead of checking strlen(str) > maxLen before computing the hash, I compute the length as part of the hashing function so it reuses the loop and doesn't have to go through the whole string length.

I am not sure about the characteristics of the hash implementation with unruly input, does it limit the chain lengths, or can they extend up to the maximum load factor?

The patch doesn't try to do anything about the hashing itself. It just implements this simple logic: is the length of the string we are checking longer than the longest string in the hash table? If so, we can be sure it's not in the hash table and say no quickly without doing the full hashing and table lookup. Otherwise hashing works as before.

if the chain lengths can grow long then the cost of strcasecmp will dominate since it is slow, having to apply tolower on both strings before comparing, and IIRC SQL is case insensitive.

This line won't be reached because of this check before:

geany/ctags/main/keyword.c

Lines 162 to 163 in dea43ba

if (maxLenReached)
return KEYWORD_NONE;

why the **** is the problematic code searching the hash each character in the first place? Why not scan input until the next end of word and search for that word once?

This happened in the SQL parser because of the various dialects of SQL:

  1. In PostgreSQL you can have $$dollar quoted strings delimited with dollars$$
  2. In PL/SQL you can have $$SOME_KEYWORDS_STARTING_WITH_DOUBLE_DOLLAR

When reaching the $$, the parser was checking the growing string against the keyword table to see if it was one of the PL/SQL keywords. Of course, in this case one could check for spaces and, if a space was found, conclude it couldn't be a keyword. It's just that such problems aren't so obvious when writing parsers (I gave the example code above as an illustration only; the whole thing is much more hidden in the SQL parser), there may be similar problems in other parsers too, and it's better to prevent complete parser freezes at the hash table level for all parsers than to rely on the parser logic.

@techee
Member Author

techee commented Mar 22, 2023

Maybe just to be clear, when storing values in the hash table, the 1000 here

const unsigned int index = hashValue (string, language, 1000, &dummy) % TableSize;

means "infinity", not a real limit (the table is used for keywords only, there will, hopefully, be no languages with 1000 character keywords). So storing works as before and is unaffected by this patch.

@elextr
Member

elextr commented Mar 22, 2023

there may be similar problems in other parsers too and better to prevent complete parser freezes at the hash table level for all parsers than relying on the parser logic.

Firstly I'm definitely in favour of applying this to limit the impact of issues with badly written parsers. After all they are written for offline ctags, not interactive Geany.

But the point was that the real problem is the parser, and that should be fixed to be smarter, as you say:

  1. see $$
  2. scan until first non-keyword character
  3. now look it up in the keyword table
  4. if not keyword skip string to $$

Of course that's not infallible, $$PLSQL_LINE is a PL/SQL keyword in a PostgreSQL string$$, but then that will be misinterpreted now anyway.
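
A rough sketch of that scan-then-lookup idea, reusing the names from the example above (isKeywordChar() and skipUntilDoubleDollar() are made up for the illustration):

/* Hypothetical sketch, not actual parser code: after seeing "$$", collect
 * one whole word, look it up once, and otherwise treat what follows as a
 * dollar-quoted string. */
static void handleDoubleDollar (vString *str, langType lang)
{
    int c;

    vStringClear (str);
    while ((c = getcFromInputFile ()) != EOF && isKeywordChar (c))
        vStringPut (str, c);          /* 1+2: scan until a non-keyword character */

    if (lookupCaseKeyword (vStringValue (str), lang) != KEYWORD_NONE)
        do_stuff ();                  /* 3: a single lookup for the whole word */
    else
        skipUntilDoubleDollar ();     /* 4: not a keyword - skip to the closing $$ */
}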

be no languages with 1000 character keywords). So storing works as before and is unaffected by this patch.

Well, PL/SQL $$THINGS are really a type of compile-time identifier that users can also define IIRC (it's been [mumble] decades since I did PL/SQL), so they could be any length.

This line won't be reached because of this check before

Sorry for not being clear; I was talking about the existing code that would be case-comparing a whole $$ string, which I think is the cause of the slowness, not the hashing. Certainly the test will constrain it (unless an evil 😈 user made a long name, see above).

@techee
Member Author

techee commented Mar 22, 2023

But the point was that the real problem is the parser, and that should be fixed to be smarter, as you say:

  1. see $$
  2. scan until first non-keyword character
  3. now look it up in the keyword table
  4. if not keyword skip string to $$

This part of the parser is quite tricky. The strings starting with $$ are just a special case but the general one is

$SOME_STRING$real string here$SOME_STRING$

where SOME_STRING is an arbitrary prefix (and suffix), possibly empty, leading to $$string$$ delimiting the string both at the beginning and the end. So you have to detect these guys, plus the keyword guys, and the code isn't completely trivial, especially when you consider that inside these strings you can have sequences of characters that normally mean comments, like --, which starts a line comment:

  1. $$keyword-- this is a comment
  2. $$non_keyword-- this is the contents of a string until the closing double dollar

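For illustration only (a hypothetical helper, not the real SQL parser code), recognizing the opening delimiter roughly means collecting the tag between the two dollars so the parser can later search for the matching $TAG$:

/* Read the "$TAG$" that opens a dollar-quoted string and store TAG
 * (possibly empty, as in "$$") into 'tag'; returns false if what follows
 * the first '$' doesn't look like a tag, i.e. this isn't a dollar quote. */
static bool readDollarQuoteTag (vString *tag)
{
    int c;

    vStringClear (tag);
    while ((c = getcFromInputFile ()) != EOF)
    {
        if (c == '$')
            return true;             /* closing '$' of the tag reached */
        if (!isalnum (c) && c != '_')
        {
            ungetcToInputFile (c);   /* not a tag - put the character back */
            return false;
        }
        vStringPut (tag, c);
    }
    return false;
}
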
Not undoable, but I just don't want to be the one writing the proper fix ;-).

@elextr
Member

elextr commented Mar 22, 2023

$SOME_STRING$real string here$SOME_STRING$

Yes, and IIRC in some SQLs $thing is also meaningful, but only plain $$ strings can clash with keywords that are going to be looked up in the table.

Not undoable, but I just don't want to be the one writing the proper fix ;-).

Yeah, but since this patch limits the damage, we can assign implementing the proper solution to "somebody" else 😁

@techee
Member Author

techee commented Mar 22, 2023

Yes, and IIRC in some SQLs $thing is also meaningful, but only plain $$ strings can clash with keywords that are going to be looked up in the table.

Oh yeah, totally, and these can also be keywords and looked up in the table - I just added support to the parser for C-preprocessor-like ifdefs here

universal-ctags/ctags#3654

Just too many different things happening after $ in various SQL dialects.

Yeah, but since this patch limits the damage, we can assign implementing the proper solution to "somebody" else 😁

My plan :-).

@techee techee merged commit e2ce7db into geany:master Mar 24, 2023
@b4n b4n added this to the 1.39/2.0 milestone Apr 28, 2023