Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: optimise strings.ToLower() #1170

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

soujanyanmbri
Copy link
Contributor

@soujanyanmbri soujanyanmbri commented Oct 9, 2024

Thank you for contributing to Coraza WAF, your effort is greatly appreciated
Before submitting check if what you want to add to coraza list meets quality standards before sending pull request. Thanks!

Make sure that you've checked the boxes below before you submit PR:

Thanks for your contribution ❤️

@soujanyanmbri soujanyanmbri requested a review from a team as a code owner October 9, 2024 13:45
@soujanyanmbri
Copy link
Contributor Author

Benchmark:

BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Lowercase_Sentence-10         	40185970	        29.50 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Uppercase_Sentence-10         	17557453	        66.52 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Mixed_Case_Sentence-10              	18643068	        64.06 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Non-Alphabetic_Characters-10        	46125682	        26.00 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Greek_Alphabet-10                 	29133343	        41.43 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Cyrillic_Alphabet-10              	26644413	        44.82 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Mixed_Greek_and_ASCII-10          	18659374	        66.00 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Emoji-10                          	19809511	        59.83 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Empty_String-10                           	570688939	         2.225 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Punctuation-10                       	196423014	         6.113 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Whitespace-10                        	233590377	         5.154 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Long_Mixed_Case_String-10                 	 6503064	       183.7 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Special_Turkish_Case-10                   	16458063	        72.57 ns/op	      80 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Lowercase_Sentence-10             	15182882	        80.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Uppercase_Sentence-10             	14452549	        82.19 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Mixed_Case_Sentence-10                  	14902519	        81.55 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Non-Alphabetic_Characters-10            	16164508	        75.16 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Greek_Alphabet-10                     	12417586	       103.0 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Cyrillic_Alphabet-10                  	10973074	       103.6 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Mixed_Greek_and_ASCII-10              	15417500	        78.68 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Emoji-10                              	15793966	        77.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Empty_String-10                               	232960081	         5.279 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Punctuation-10                           	99651907	        12.02 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Whitespace-10                            	100000000	        11.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Long_Mixed_Case_String-10                     	 6063015	       195.0 ns/op	     448 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Special_Turkish_Case-10                       	13682554	        99.41 ns/op	     160 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Lowercase_Sentence-10            	20465416	        62.36 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Uppercase_Sentence-10            	 7061458	       172.2 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Mixed_Case_Sentence-10                 	 9851926	       129.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Non-Alphabetic_Characters-10           	24569835	        50.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Greek_Alphabet-10                    	 1287247	       929.4 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Cyrillic_Alphabet-10                 	 1000000	      1035 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Mixed_Greek_and_ASCII-10             	 2851934	       421.8 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Emoji-10                             	 4843082	       246.4 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Empty_String-10                              	568174978	         2.111 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Punctuation-10                          	84454507	        14.13 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Whitespace-10                           	100000000	        11.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Long_Mixed_Case_String-10                    	 3768148	       318.0 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Special_Turkish_Case-10                      	 3724746	       318.9 ns/op	      80 B/op	       1 allocs/op

Copy link

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 97.26027% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.50%. Comparing base (4ff1f76) to head (9b2743b).
Report is 154 commits behind head on main.

Files with missing lines Patch % Lines
internal/seclang/rule_parser.go 71.42% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   82.72%   83.50%   +0.78%     
==========================================
  Files         162      166       +4     
  Lines        9080     9632     +552     
==========================================
+ Hits         7511     8043     +532     
- Misses       1319     1338      +19     
- Partials      250      251       +1     
Flag Coverage Δ
default 83.50% <97.26%> (+5.66%) ⬆️
examples 83.50% <97.26%> (+57.07%) ⬆️
ftw 83.50% <97.26%> (+36.13%) ⬆️
ftw-multiphase 83.50% <97.26%> (+33.95%) ⬆️
tinygo 83.50% <97.26%> (+8.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// asciiToLowerInPlace converts ASCII characters in the string to lowercase
// starting from the specified index.
func asciiToLowerInPlace(s string, start int) string {
res := []byte(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"returned byte slice MUST NOT be modified since it shares the same backing array with the given string."
Not sure, if this is the best use here, i tried to copy the byte slice before modifying it, and it is causing unexpected behaviour.

Copy link
Member

@M4tteoP M4tteoP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not per se against this change (performance is a key concept), but I would like to get a better understanding of the tradeoff we have in order to gain this extra performance.

  • Being just meant for ASCII chars, what are the transformations that we would not perform anymore? (First of all, I'm thinking at chars like ÀÈÉÖ).
  • Should we restrict the usage of the new stringsutil.AsciiToLower for comparisons where we expect to deal with ASCII and keep the standard one to deal with generic inputs? (but I get that it would not really be that beneficial in terms of performance gains)
  • Otherwise, we should consider it a breaking change, shouldn't we? Running the official ToLower tests with the new function might give more clarity about the behavior's change: https://github.com/golang/go/blob/master/src/strings/strings_test.go#L640-L650.

Thanks!

@jcchavezs
Copy link
Member

@M4tteoP I think the main assumption here is that variable keys and collection names will always be ascii.

@soujanyanmbri
Copy link
Contributor Author

Hey @M4tteoP ,
I've added a test case for unicode cases by the way:

{"Unicode Unchanged", "Привет Мир", "Привет Мир"}, // Unicode characters remain unchanged

As @jcchavezs mentioned, the main assumption here is that variable keys and collection names will always be ascii, and this should not cause coraza to miss any evasions as well.

@soujanyanmbri
Copy link
Contributor Author

Hey @jcchavezs @M4tteoP ,
Are there any additional checks i should be doing?

Thanks!

jcchavezs

This comment was marked as outdated.

@jcchavezs jcchavezs dismissed their stale review October 28, 2024 08:34

change of mind

@jcchavezs
Copy link
Member

@soujanyanmbri sorry for the back and forth.

after a chat with anuraaga and reading golang/go#17859 I was wondering if this change is truly needed. On one hand I believe we should only apply to those places where we truly believe this is going to improve performance, on the other hand I think benchmarks must be done with real data e.g. real strings we are optimizing for rather than other kind of input to see real performance difference as the assumption was:

I think the main assumption here is that variable keys and collection names will always be ascii.

and not everywhere. How does that sound?

@M4tteoP
Copy link
Member

M4tteoP commented Oct 28, 2024

My main concern here is if we can be sure about the assumption. Assuming only ASCII, we skip lowercase conversions for some UTF-8 characters that have lowercase representations. A trivial example that would lead to a change compared to current detection is the following:

SecRule REQUEST_BODY "@contains è" "id:101,phase:2,deny,t:lowercase,status:403,msg:'Invalid char',log,auditlog"
~/Repo/Coraza  18ae0d83(main) ✔
▶ curl -i 'localhost:8090' --data "È"
HTTP/1.1 403 Forbidden

[DEBUG] Evaluating operator: MATCH tx_id="HTxystXLQrrxvxSxjua" rule_id=101 variable="REQUEST_BODY" operator_function="@contains" operator_data="è" arg="è"
▶ git checkout soujanyanmbri/main
▶ curl -i 'localhost:8090' --data "È"
HTTP/1.1 200 OK

[DEBUG] Evaluating operator: NO MATCH tx_id="nmvdcNgWOFemGsrovLB" rule_id=101 variable="REQUEST_BODY" operator_function="@contains" operator_data="è" arg="È"

@soujanyanmbri
Copy link
Contributor Author

Hey @M4tteoP and @jcchavezs , made a minor fix, Added a check where we rollback to strings.toLower() in case we encounter any unicode character. What do you guys think ?

goos: darwin
goarch: arm64
pkg: github.com/corazawaf/coraza/v3/internal/strings
cpu: Apple M1 Pro
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Lowercase_Sentence-10         	25895461	        44.97 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Uppercase_Sentence-10         	18426256	        65.55 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Mixed_Case_Sentence-10              	20168433	        59.82 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Non-Alphabetic_Characters-10        	33194751	        36.07 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Greek_Alphabet-10                 	 1306746	       915.9 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Cyrillic_Alphabet-10              	 1000000	      1043 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Mixed_Greek_and_ASCII-10          	19852059	        61.32 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Emoji-10                          	 5101520	       236.6 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Empty_String-10                           	482705482	         2.668 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Punctuation-10                       	154285611	         7.949 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Whitespace-10                        	198415596	         6.275 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Long_Mixed_Case_String-10                 	 7410427	       168.4 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Special_Turkish_Case-10                   	18378188	        67.35 ns/op	      80 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Lowercase_Sentence-10            	19861150	        60.44 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Uppercase_Sentence-10            	 5104068	       216.2 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Mixed_Case_Sentence-10                 	 9451922	       129.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Non-Alphabetic_Characters-10           	19565535	        56.18 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Greek_Alphabet-10                    	 1294094	       933.0 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Cyrillic_Alphabet-10                 	 1000000	      1020 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Mixed_Greek_and_ASCII-10             	 3142808	       384.8 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Emoji-10                             	 5080602	       234.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Empty_String-10                              	460176516	         2.603 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Punctuation-10                          	72635012	        16.56 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Whitespace-10                           	90726379	        13.16 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Long_Mixed_Case_String-10                    	 3842942	       316.8 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Special_Turkish_Case-10                      	 4221205	       285.5 ns/op	      80 B/op	       1 allocs/op
PASS
ok  	github.com/corazawaf/coraza/v3/internal/strings	38.532s

@fzipi
Copy link
Member

fzipi commented Dec 29, 2024

ping @jcchavezs @M4tteoP

@fzipi fzipi changed the title Optimise strings.ToLower() refactor: optimise strings.ToLower() Dec 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants