-
Notifications
You must be signed in to change notification settings - Fork 629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
python: optimize skipEverything() #383
Conversation
Actually the performance exactly doubles: n: old time before the change the split between the function and the rest is: after optimizing it: The rest takes the same amount of time before and after so 0.89 * m = 0.45 * n so n/m = 2 End of secondary school math :-). |
Great! Could you tell me how do you find the hot spot? |
I think the old code is better readable because of the literal strings, but the improvement is too big not to merge this. So could you add one or two line of comment (or extend the existing one above the block) to clarify a bit, including the strings it is looking for? And can you add a test case to the Units directory for this? |
match = 1; | ||
cp += 1; | ||
} | ||
else if (*cp != 'r' && *cp != 'R' && (c1 == 'r' || c1 == 'R')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather move this before and simply advance cp
(well, compute the offset rather), and then test for '
and "
, as this check is the same in both paths.
Something like this
if (!match && (*cp == 'u' || *cp == 'r' || *cp == 'b' ||
*cp == 'U' || *cp == 'R' || *cp == 'B'))
{
unsigned int i = 1;
if (*cp != 'r' && *cp != 'R' && (cp[i] == 'r' || cp[i] == 'R'))
i++;
if (cp[i] == '\'' || cp[i] == '"')
{
match = 1;
cp += i;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@b4n Thanks, it looks nicer, I'll use this.
LGTM |
@masatake I've just written a geany wiki page describing the profiling method I use here (thanks for motivating me to finally do that, I've been planning to write something for a long time): https://wiki.geany.org/howtos/profiling/gperftools The picture there is from this very issue. Actually I found this issue in Geany when I noticed the Python parser was about twice as slow as the C parser. For ctags the easiest way to test will be the first described method (without the signals). I think this issue is rather rare in the parsers - the parsers that use tokens shouldn't suffer from something like this because the token is built up by reading character by character and no multiple comparisons like that happen. In general the most performance-critical part of the parsers is the input-skipping code because this is what runs most of the time (function/variable/etc. declarations/definitions take a small part of sources and the rest like function bodies has to be skipped). @ffes I agree it's a bit harder to read - I'll add some comments as you suggest and will try to come up with some test (have to check first if there isn't some test already testing this). |
@techee, than you. I will put the link to the page in docs/tips.rst. |
Most of the time there's no start of a string which means all the 10 strcmp()s are done for every character of the input. This is very expensive: before this patch this function alone takes 55% of the parser time. When comparing by character (and avoiding further comparison if the first character doesn't match), this function takes only 11% of the parser time so the performance of the parser nearly doubles. In addition check for the "rb" prefix which is possible in Python 3.
OK, I have noticed that in the original code the "rb" Python 3 prefix is missing https://docs.python.org/3.5/reference/lexical_analysis.html#string-and-bytes-literals so I added it (which complicates things a little more). About the test - I actually don't know how to test it. After looking at the Python specification for some time and the way the parser works I think even if something like rb"whatever" was parsed as an identifier "rb" and string "whatever", the parser would work alright. But it's of course possible I missed some special case. |
Indeed. The only case I could imagine breaking would be if such a string appears somewhere it would be mistaken for an identifier, but I can't think of any valid code doing this. So yeah, we could add a test including all kind of strings and that would probably be good, but I doubt it would really test for much. So IMO the fact no test breaks is basically good enough. |
Ready to merge? |
@masatake From my side yes (someone should review the updated patch though). |
python: optimize skipEverything()
Thank you. |
Most of the time there's no start of a string which means all the 10
strcmp()s are done for every character of the input. This is very expensive:
before this patch this function alone takes 55% of the parser time.
When comparing by character (and avoiding further comparison if the first
character doesn't match), this function takes only 11% of the parser time
so the performance of the parser nearly doubles.