-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unescape regex captures #1446
Unescape regex captures #1446
Conversation
lib/sinatra/base.rb
Outdated
@@ -1028,7 +1028,7 @@ def process_route(pattern, conditions, block = nil, values = []) | |||
|
|||
regexp_exists = pattern.is_a?(Mustermann::Regular) || (pattern.respond_to?(:patterns) && pattern.patterns.any? {|subpattern| subpattern.is_a?(Mustermann::Regular)} ) | |||
if regexp_exists | |||
captures = pattern.match(route).captures | |||
captures = pattern.match(route).captures.map { |c| force_encoding URI_INSTANCE.unescape(c) if c } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't seem to get the tests running here, but I think this might not be correct. It seems like force_encoding
will set the strings to UTF-8, but you can encode any arbitrary bytes with URI encoding. force_encoding
means that we'll end up with strings that are tagged as UTF-8 but aren't actually UTF-8 data.
This problem is actually demonstrated by the test in this PR. \x80
is not valid unicode:
irb(main):005:0> x = '/foo%80/bar%e2%80%8cbaz'
=> "/foo%80/bar%e2%80%8cbaz"
irb(main):006:0> y = URI.unescape x
=> "/foo\x80/barbaz"
irb(main):007:0> y.force_encoding "UTF-8"
=> "/foo\x80/barbaz"
irb(main):008:0> y.valid_encoding?
=> false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm good point. I was trying to take into account how things were done in v1.4.8:
https://github.com/sinatra/sinatra/blob/v1.4.8/lib/sinatra/base.rb#L1006
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! Good job tracking that down!!
I think just not calling force_encoding
is the right thing to do (that's what we're doing in a monkey patch), but it looks like calling force_encoding
is what Sinatra used to do, so maybe everyone has written their apps based on that behavior.
I'm not sure what is the best path forward (pun intended but also a true statement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out URI::Parser#unescape
already calls force_encoding
, so that was redundant anyway. You're right though, the results are tagged UTF-8 but valid_encoding?
returns false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will call force_encoding
, but it uses the encoding of the string that was passed in. When we read the path from the socket, the socket will return a string that is encoded in binary, not utf-8.
For example, a path encoded with UTF-8:
irb(main):002:0> path = "/foo%80/bar%e2%80%8cbaz"
=> "/foo%80/bar%e2%80%8cbaz"
irb(main):003:0> path.encoding
=> #<Encoding:UTF-8>
irb(main):004:0> x = URI.unescape path
=> "/foo\x80/barbaz"
irb(main):005:0> x.encoding
=> #<Encoding:UTF-8>
irb(main):006:0> x.valid_encoding?
=> false
Now if the path is binary:
irb(main):007:0> path = "/foo%80/bar%e2%80%8cbaz".b
=> "/foo%80/bar%e2%80%8cbaz"
irb(main):008:0> path.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):009:0> x = URI.unescape path
=> "/foo\x80/bar\xE2\x80\x8Cbaz"
irb(main):010:0> x.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):011:0> x.valid_encoding?
=> true
The test differs from production here because the default encoding of strings inside a file is UTF-8 (so the string literal will be tagged with UTF-8), but the strings we read from the socket will be tagged as binary.
Doing this might be more closely represent production:
get '/foo%80/bar%e2%80%8cbaz'.b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha!
test/routing_test.rb
Outdated
it 'unescapes named parameters and splats' do | ||
mock_app { | ||
get '/:foo/*' do |a, b| | ||
assert_equal "foo\x80", params['foo'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we should also do:
assert_predicate params['foo'], :valid_encoding?
I suspect it will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right on, this assertion fails.
I think in Rails, we default to force encoding things to UTF-8 since that is the most common case. At GitHub we have to deal with non UTF-8 data, so I added a thing to Rails such that a particular controller / action can override the encoding. Maybe the right thing to do here is to call Thanks for looking in to this! |
After a little more digging, I see Sinatra provides an option to set
This doesn't help though if you only need it for particular routes, I'll try to see what that takes 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, this looks good to me. 😀
I just realized I was confusing code for handling "regular" routes with regex routes. I think we're probably good on the regex front with this PR, but I realized a relatively recent commit is conflicting a bit: I think we'll need to decide on a consistent approach for handling both. @namusyaka any thoughts? |
test/routing_test.rb
Outdated
mock_app { | ||
get '/:foo/*' do |a, b| | ||
assert_equal "foo\x80", params['foo'] | ||
assert_predicate params['foo'], :valid_encoding? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails because params are force encoded with the default encoding here:
https://github.com/sinatra/sinatra/blob/master/lib/sinatra/base.rb#L1026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just droping the permalink version of that link here as the line it points to might change in the future:
Line 1026 in 1b35eaa
force_encoding(params) |
@namusyaka I don't know all of the history of #1412, but the question has come up if we should try to force encode route params the default_encoding if it potentially results in an invalid encoding? |
@jkowens Sorry, I'm busy for a few days. Thanks for your patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenderlove @jkowens Thank you for your contributions.
I'm really pleasure to see your discussions and fixes.
The current change set looks good to me.
If you have potential invalid encoding case, please open a new issue.
## 2.0.4 / 2018-09-15 * Don't blow up when passing frozen string to `send_file` disposition [#1137](sinatra/sinatra#1137) by Andrew Selder * Fix ubygems LoadError [#1436](sinatra/sinatra#1436) by Pavel Rosick�«ò * Unescape regex captures [#1446](sinatra/sinatra#1446) by Jordan Owens * Slight performance improvements for IndifferentHash [#1427](sinatra/sinatra#1427) by Mike Pastore * Improve development support and documentation and source code by Will Yang, Jake Craige, Grey Baker and Guilherme Goettems Schneider ## 2.0.3 / 2018-06-09 * Fix the backports gem regression [#1442](sinatra/sinatra#1442) by Marc-Andr�«± Lafortune ## 2.0.2 / 2018-06-05 * Escape invalid query parameters [#1432](sinatra/sinatra#1432) by Kunpei Sakai * The patch fixes [CVE-2018-11627](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-11627). * Fix undefined method error for `Sinatra::RequiredParams` with hash key [#1431](sinatra/sinatra#1431) by Arpit Chauhan * Add xml content-types to valid html_types for Rack::Protection [#1413](sinatra/sinatra#1413) by Reenan Arbitrario * Encode route parameters using :default_encoding setting [#1412](sinatra/sinatra#1412) by Brian m. Carlson * Fix unpredictable behaviour from Sinatra::ConfigFile [#1244](sinatra/sinatra#1244) by John Hope * Add Sinatra::IndifferentHash#slice [#1405](sinatra/sinatra#1405) by Shota Iguchi * Remove status code 205 from drop body response [#1398](sinatra/sinatra#1398) by Shota Iguchi * Ignore empty captures from params [#1390](sinatra/sinatra#1390) by Shota Iguchi * Improve development support and documentation and source code by Zp Yuan, Andreas Finger, Olle Jonsson, Shota Iguchi, Nikita Bulai and Joshua O'Brien
This adds tests to make sure named parameters and regex captures are unescaped, and also provides a potential fix for #1443. I'm not familiar enough with Mustermann to know if it should handle unescaping captures.