
python: optimize skipEverything() #383

Merged: 1 commit merged into universal-ctags:master on Jun 24, 2015

Conversation

techee
Contributor

@techee techee commented Jun 18, 2015

Most of the time there's no start of a string, which means all 10
strcmp() calls are made for every character of the input. This is very
expensive: before this patch, this function alone takes 55% of the parser
time. When comparing by character (and avoiding further comparison when the
first character doesn't match), this function takes only 11% of the parser
time, so the performance of the parser nearly doubles.
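The kind of change described can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual skipEverything() code): reject on the first byte before paying for a full string comparison.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the optimization described above (the real
 * ctags code differs): check the first character before doing a full
 * prefix comparison.  Most input characters fail the one-byte test,
 * so the ten full comparisons are almost never made. */
static int startsWithPrefix (const char *cp, const char *prefix)
{
	if (*cp != *prefix)   /* cheap one-byte rejection, taken most of the time */
		return 0;
	return strncmp (cp, prefix, strlen (prefix)) == 0;
}
```

For ordinary identifier characters the strncmp() call is skipped entirely; only characters matching a prefix's first byte pay the full comparison cost.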

@techee
Contributor Author

techee commented Jun 18, 2015

Actually the performance almost exactly doubles:

n: old time
m: new time

Before the change, the split between the function and the rest is:

    0.55 * n + 0.45 * n

After optimizing it:

    0.11 * m + 0.89 * m

The rest takes the same amount of time before and after, so

    0.89 * m = 0.45 * n

which gives

    n / m = 0.89 / 0.45 ≈ 1.98 ≈ 2

End of secondary school math :-).
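The arithmetic above can be double-checked mechanically; this small helper (an illustration, with names of my own choosing) computes the overall speedup from the two profile fractions:

```c
#include <assert.h>

/* Given the fraction of total time one function takes before and after
 * an optimization, and assuming the rest of the program is unchanged in
 * absolute time (rest = (1 - fracBefore) * n = (1 - fracAfter) * m),
 * return the overall speedup n / m. */
static double overallSpeedup (double fracBefore, double fracAfter)
{
	return (1.0 - fracAfter) / (1.0 - fracBefore);
}
```

overallSpeedup(0.55, 0.11) is 0.89 / 0.45 ≈ 1.98, i.e. just under 2x.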

@masatake
Member

Great!

Could you tell me how you found the hot spot?
And can I ask you to write it up somewhere in docs/tips.rst?

@ffes
Member

ffes commented Jun 19, 2015

I think the old code is more readable because of the literal strings, but the improvement is too big not to merge this. So could you add a line or two of comments (or extend the existing comment above the block) to clarify it a bit, including the strings it is looking for?

And can you add a test case to the Units directory for this?

match = 1;
cp += 1;
}
else if (*cp != 'r' && *cp != 'R' && (c1 == 'r' || c1 == 'R'))
Member


I'd rather move this before and simply advance cp (well, compute the offset rather), and then test for ' and ", as this check is the same in both paths.

Something like this

if (!match && (*cp == 'u' || *cp == 'r' || *cp == 'b' ||
               *cp == 'U' || *cp == 'R' || *cp == 'B'))
{
    unsigned int i = 1;

    if (*cp != 'r' && *cp != 'R' && (cp[i] == 'r' || cp[i] == 'R'))
        i++;
    if (cp[i] == '\'' || cp[i] == '"')
    {
        match = 1;
        cp += i;
    }
}
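For reference, the suggested logic can be wrapped into a self-contained, testable helper (the function name and the return-an-offset shape are my own; the body mirrors the suggestion above):

```c
/* Mirror of the suggested check: return how far to advance cp when a
 * Python string-literal prefix (u"...", b'...', br"...", ...) starts
 * at cp, or 0 when none does.  As in the suggestion, an "rb"-ordered
 * prefix is not handled here; that case was added later in the PR. */
static unsigned int stringPrefixLength (const char *cp)
{
	if (*cp == 'u' || *cp == 'r' || *cp == 'b' ||
	    *cp == 'U' || *cp == 'R' || *cp == 'B')
	{
		unsigned int i = 1;

		/* allow a second 'r', as in br"..." or uR'...' */
		if (*cp != 'r' && *cp != 'R' && (cp[i] == 'r' || cp[i] == 'R'))
			i++;
		if (cp[i] == '\'' || cp[i] == '"')
			return i;
	}
	return 0;
}
```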

Contributor Author


@b4n Thanks, it looks nicer, I'll use this.

@b4n
Member

b4n commented Jun 19, 2015

LGTM

@techee
Contributor Author

techee commented Jun 19, 2015

@masatake I've just written a geany wiki page describing the profiling method I use here (thanks for motivating me to finally do that, I've been planning to write something for a long time):

https://wiki.geany.org/howtos/profiling/gperftools

The picture there is from this very issue. Actually I found this issue in Geany when I noticed the Python parser was about twice as slow as the C parser. For ctags the easiest way to test will be the first described method (without the signals).

I think this issue is rather rare in the parsers - parsers that use tokens shouldn't suffer from something like this, because the token is built up by reading character by character and no such repeated comparisons happen. In general, the most performance-critical part of a parser is the input-skipping code, because that is what runs most of the time (function/variable/etc. declarations and definitions make up a small part of the sources; the rest, like function bodies, has to be skipped).

@ffes I agree it's a bit harder to read - I'll add some comments as you suggest and will try to come up with some test (have to check first if there isn't some test already testing this).

@masatake
Member

@techee, thank you. I will put a link to the page in docs/tips.rst.

(The amended commit message repeats the description above and adds:)

In addition check for the "rb" prefix which is possible in Python 3.
@techee techee force-pushed the python_optimize branch from 40a832b to 98e2521 on June 20, 2015 at 09:41
@techee
Contributor Author

techee commented Jun 20, 2015

OK, I have noticed that in the original code the "rb" Python 3 prefix is missing

https://docs.python.org/3.5/reference/lexical_analysis.html#string-and-bytes-literals

so I added it (which complicates things a little more).

About the test - I actually don't know how to test it. After looking at the Python specification for some time and the way the parser works I think even if something like rb"whatever" was parsed as an identifier "rb" and string "whatever", the parser would work alright. But it's of course possible I missed some special case.

@b4n
Member

b4n commented Jun 20, 2015

About the test - I actually don't know how to test it. After looking at the Python specification for some time and the way the parser works I think even if something like rb"whatever" was parsed as an identifier "rb" and string "whatever", the parser would work alright. But it's of course possible I missed some special case.

Indeed. The only case I could imagine breaking would be if such a string appeared somewhere it would be mistaken for an identifier, but I can't think of any valid code doing this.

So yeah, we could add a test including all kinds of strings, and that would probably be good, but I doubt it would really test much.

So IMO the fact that no test breaks is basically good enough.

@masatake
Member

Ready to merge?

@techee
Contributor Author

techee commented Jun 22, 2015

@masatake From my side yes (someone should review the updated patch though).

masatake added a commit that referenced this pull request Jun 24, 2015
python: optimize skipEverything()
@masatake masatake merged commit 61ac9e7 into universal-ctags:master Jun 24, 2015
@masatake
Member

Thank you.
