Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescape regex captures #1446

Merged
merged 2 commits into from
Aug 4, 2018
Merged

Unescape regex captures #1446

merged 2 commits into from
Aug 4, 2018

Conversation

jkowens
Copy link
Member

@jkowens jkowens commented Jun 11, 2018

This adds tests to make sure named parameters and regex captures are unescaped, and also provides a potential fix for #1443. I'm not familiar enough with Mustermann to know if it should handle unescaping captures.

@@ -1028,7 +1028,7 @@ def process_route(pattern, conditions, block = nil, values = [])

regexp_exists = pattern.is_a?(Mustermann::Regular) || (pattern.respond_to?(:patterns) && pattern.patterns.any? {|subpattern| subpattern.is_a?(Mustermann::Regular)} )
if regexp_exists
captures = pattern.match(route).captures
captures = pattern.match(route).captures.map { |c| force_encoding URI_INSTANCE.unescape(c) if c }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to get the tests running here, but I think this might not be correct. It seems like force_encoding will set the strings to UTF-8, but you can encode any arbitrary bytes with URI encoding. force_encoding means that we'll end up with strings that are tagged as UTF-8 but aren't actually UTF-8 data.

This problem is actually demonstrated by the test in this PR. \x80 is not valid unicode:

irb(main):005:0> x = '/foo%80/bar%e2%80%8cbaz'
=> "/foo%80/bar%e2%80%8cbaz"
irb(main):006:0> y = URI.unescape x
=> "/foo\x80/bar‌baz"
irb(main):007:0> y.force_encoding "UTF-8"
=> "/foo\x80/bar‌baz"
irb(main):008:0> y.valid_encoding?
=> false

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm good point. I was trying to take into account how things were done in v1.4.8:

https://github.com/sinatra/sinatra/blob/v1.4.8/lib/sinatra/base.rb#L1006

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Good job tracking that down!!

I think just not calling force_encoding is the right thing to do (that's what we're doing in a monkey patch), but it looks like calling force_encoding is what Sinatra used to do, so maybe everyone has written their apps based on that behavior.

I'm not sure what is the best path forward (pun intended but also a true statement).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out URI::Parser#unescape already calls force_encoding, so that was redundant anyway. You're right though, the results are tagged UTF-8 but valid_encoding? returns false.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will call force_encoding, but it uses the encoding of the string that was passed in. When we read the path from the socket, the socket will return a string that is encoded in binary, not utf-8.

For example, a path encoded with UTF-8:

irb(main):002:0> path = "/foo%80/bar%e2%80%8cbaz"
=> "/foo%80/bar%e2%80%8cbaz"
irb(main):003:0> path.encoding
=> #<Encoding:UTF-8>
irb(main):004:0> x = URI.unescape path
=> "/foo\x80/bar‌baz"
irb(main):005:0> x.encoding
=> #<Encoding:UTF-8>
irb(main):006:0> x.valid_encoding?
=> false

Now if the path is binary:

irb(main):007:0> path = "/foo%80/bar%e2%80%8cbaz".b
=> "/foo%80/bar%e2%80%8cbaz"
irb(main):008:0> path.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):009:0> x = URI.unescape path
=> "/foo\x80/bar\xE2\x80\x8Cbaz"
irb(main):010:0> x.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):011:0> x.valid_encoding?
=> true

The test differs from production here because the default encoding of strings inside a file is UTF-8 (so the string literal will be tagged with UTF-8), but the strings we read from the socket will be tagged as binary.

Doing this might be more closely represent production:

get '/foo%80/bar%e2%80%8cbaz'.b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha!

it 'unescapes named parameters and splats' do
mock_app {
get '/:foo/*' do |a, b|
assert_equal "foo\x80", params['foo']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should also do:

assert_predicate params['foo'], :valid_encoding?

I suspect it will fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right on, this assertion fails.

@tenderlove
Copy link

I think in Rails, we default to force encoding things to UTF-8 since that is the most common case. At GitHub we have to deal with non UTF-8 data, so I added a thing to Rails such that a particular controller / action can override the encoding.

Maybe the right thing to do here is to call force_encoding, but also provide a way to override which encoding gets used on a particular route.

Thanks for looking in to this!

@jkowens
Copy link
Member Author

jkowens commented Jun 14, 2018

After a little more digging, I see Sinatra provides an option to set default_encoding for the app:

set :default_encoding,  'ascii-8bit'

This doesn't help though if you only need it for particular routes, I'll try to see what that takes 👍

Copy link

@tenderlove tenderlove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, this looks good to me. 😀

@jkowens
Copy link
Member Author

jkowens commented Jun 14, 2018

I just realized I was confusing code for handling "regular" routes with regex routes. I think we're probably good on the regex front with this PR, but I realized a relatively recent commit is conflicting a bit:

b058835

I think we'll need to decide on a consistent approach for handling both. @namusyaka any thoughts?

mock_app {
get '/:foo/*' do |a, b|
assert_equal "foo\x80", params['foo']
assert_predicate params['foo'], :valid_encoding?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails because params are force encoded with the default encoding here:

https://github.com/sinatra/sinatra/blob/master/lib/sinatra/base.rb#L1026

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just droping the permalink version of that link here as the line it points to might change in the future:

force_encoding(params)

@jkowens
Copy link
Member Author

jkowens commented Jun 15, 2018

@namusyaka I don't know all of the history of #1412, but the question has come up if we should try to force encode route params the default_encoding if it potentially results in an invalid encoding?

@namusyaka
Copy link
Member

@jkowens Sorry, I'm busy for a few days.
I'm going to review this carefuly next week.

Thanks for your patch.

@namusyaka namusyaka added this to the v2.0.4 milestone Jun 24, 2018
Copy link
Member

@namusyaka namusyaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenderlove @jkowens Thank you for your contributions.
I'm really pleasure to see your discussions and fixes.
The current change set looks good to me.

If you have potential invalid encoding case, please open a new issue.

@namusyaka namusyaka merged commit 9590706 into sinatra:master Aug 4, 2018
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Sep 23, 2018
## 2.0.4 / 2018-09-15

* Don't blow up when passing frozen string to `send_file` disposition [#1137](sinatra/sinatra#1137) by Andrew Selder

* Fix ubygems LoadError [#1436](sinatra/sinatra#1436) by Pavel Rosick�«ò

* Unescape regex captures [#1446](sinatra/sinatra#1446) by Jordan Owens

* Slight performance improvements for IndifferentHash [#1427](sinatra/sinatra#1427) by Mike Pastore

* Improve development support and documentation and source code by Will Yang, Jake Craige, Grey Baker and Guilherme Goettems Schneider

## 2.0.3 / 2018-06-09

* Fix the backports gem regression [#1442](sinatra/sinatra#1442) by Marc-Andr�«± Lafortune

## 2.0.2 / 2018-06-05

* Escape invalid query parameters [#1432](sinatra/sinatra#1432) by Kunpei Sakai
  * The patch fixes [CVE-2018-11627](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-11627).

* Fix undefined method error for `Sinatra::RequiredParams` with hash key [#1431](sinatra/sinatra#1431) by Arpit Chauhan

* Add xml content-types to valid html_types for Rack::Protection [#1413](sinatra/sinatra#1413) by Reenan Arbitrario

* Encode route parameters using :default_encoding setting [#1412](sinatra/sinatra#1412) by Brian m. Carlson

* Fix unpredictable behaviour from Sinatra::ConfigFile [#1244](sinatra/sinatra#1244) by John Hope

* Add Sinatra::IndifferentHash#slice [#1405](sinatra/sinatra#1405) by Shota Iguchi

* Remove status code 205 from drop body response [#1398](sinatra/sinatra#1398) by Shota Iguchi

* Ignore empty captures from params [#1390](sinatra/sinatra#1390) by Shota Iguchi

* Improve development support and documentation and source code by Zp Yuan, Andreas Finger, Olle Jonsson, Shota Iguchi, Nikita Bulai and Joshua O'Brien
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants