ctags: Add quick path for looking up too long strings in the keyword table #3433
Conversation
Parser code like

```c
vString *str = vStringNew();
while (someCondition)
{
    int c = getcFromInputFile();
    vStringPut(str, c);
    if (lookupCaseKeyword (vStringValue(str), some_lang))
    {
        do_stuff();
        vStringClear(str);
    }
}
```

is prone to quadratic complexity, because when someCondition isn't satisfied in the parsed file, the string str grows by one character in each iteration, and in each iteration lookupCaseKeyword() has to go through strlen(str) characters to compute the hash. Since we know the maximum length of the strings inside the keyword hash table, we can store this value, and if the queried string is longer than this value, we can be sure it isn't present in the hash table and return quickly without having to compute the full hash of the string.
A couple of comments/questions:
[Edit: just to be clear in case it's not immediately obvious, "unruly" input is where hashes cluster on one or a few values, so those chains grow]
The key is that it's up to maxLen and not potentially the whole file length. Assume, for instance, that you have a 1 MB file and this problem happens at the beginning of the file. Also assume that maxLen (which corresponds to the longest keyword) is, say, 30. With this patch, the code will go through 30 * 1 000 000 characters (30 in each iteration). Without this patch, it will go through 1 character in the first iteration, 2 characters in the second, ..., and 1 000 000 in the 1 000 000th iteration, so 1 000 000 * 1 000 000 / 2 characters in total.
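To make that arithmetic concrete, here is a throwaway calculation using the numbers from the example above; nothing in it is ctags-specific:

```c
#include <stdio.h>

int main (void)
{
    const unsigned long long n = 1000000ULL;  /* characters in the file */
    const unsigned long long maxLen = 30ULL;  /* longest keyword in the table */

    /* With the patch: each of the n lookups scans at most maxLen characters. */
    unsigned long long withPatch = maxLen * n;

    /* Without the patch: the i-th lookup hashes i characters, i = 1..n. */
    unsigned long long withoutPatch = n * (n + 1) / 2;

    printf ("with patch:    %llu characters\n", withPatch);     /* 30000000 */
    printf ("without patch: %llu characters\n", withoutPatch);  /* 500000500000 */
    return 0;
}
```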
Computing the hash isn't the problem; the problem is how many characters it is computed over unnecessarily when we know in advance that the string cannot be inside the hash table because it's too long. But instead of doing
The patch doesn't try to do anything about the hashing itself. It just implements this simple logic: is the string we are checking longer than the longest string in the hash table? If so, we can be sure it's not in the hash table and say "no" quickly, without doing the full hashing and table lookup. Otherwise, hashing works as before.
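A minimal sketch of that logic, assuming a module-level maxKeywordLength and a hypothetical doFullHashLookup() standing in for the existing hash-and-probe code (none of these identifiers are the real main/keyword.c names):

```c
#include <stdbool.h>
#include <stddef.h>

#define KEYWORD_NONE (-1)   /* illustrative "not found" value */

/* Hypothetical stand-in for the existing hash computation + bucket walk. */
extern int doFullHashLookup (const char *string, int language);

/* Longest keyword ever inserted into the table (updated on insertion). */
static size_t maxKeywordLength;

/* Scans at most maxKeywordLength + 1 characters, so the cost per call is
 * bounded by the longest keyword, not by strlen (string). */
static bool tooLongForTable (const char *string)
{
    size_t i;
    for (i = 0; i <= maxKeywordLength; i++)
        if (string[i] == '\0')
            return false;    /* fits within the longest keyword length */
    return true;             /* longer than any stored keyword */
}

int lookupCaseKeywordSketch (const char *string, int language)
{
    if (tooLongForTable (string))
        return KEYWORD_NONE;    /* quick path: cannot possibly be stored */
    return doFullHashLookup (string, language);   /* slow path, as before */
}
```

Scanning at most maxKeywordLength + 1 characters is what keeps the per-lookup cost bounded by the keyword length rather than by the length of the growing string.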
This line won't be reached because of this check before it (lines 162 to 163 in dea43ba).
This happened in the SQL parser because of the various dialects of SQL:
When reaching the
Maybe just to be clear: when storing values in the hash table, the value at line 121 in dea43ba means "infinity", not a real limit (the table is used for keywords only; there will, hopefully, be no languages with 1000-character keywords). So storing works as before and is unaffected by this patch.
Firstly I'm definitely in favour of applying this to limit the impact of issues with badly written parsers. After all they are written for offline ctags, not interactive Geany. But the point was that the real problem is the parser, and that should be fixed to be smarter, as you say:
Of course that's not infallible,
Well, PL/SQL $$THINGS are really a type of compile-time identifier that users can also define, IIRC (it's been [mumble] decades since I did PL/SQL), so they could be any length.
Sorry for not being clear: I was talking about the existing code that would be casing a whole $$ string, which I think is the cause of the slowness, not the hashing. Certainly the test will constrain it (unless an evil 😈 user made a long name, see above).
This part of the parser is quite tricky. The strings starting with
where
Not undoable, but I just don't want to be the one writing the proper fix ;-).
Yes, and IIRC in some SQLs
Yeah, but since this patch limits the damage, we can assign implementing the proper solution to "somebody" else 😁
Oh yeah, totally, and these can also be keywords and looked up in the table - I just added support to the parser for C-preprocessor-like ifdefs here. Just too many different things happening after
My plan :-).
This code has already been merged into ctags master so we can cherry-pick it safely. I'll merge it in about a week if there are no objections.
Fixes the hang of the SQL parser reported in geany/geany-osx#42.