Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect encoding and decode text response #256

Merged
merged 5 commits into from
Feb 15, 2018
Merged

Detect encoding and decode text response #256

merged 5 commits into from
Feb 15, 2018

Conversation

messense
Copy link
Contributor

Fixes #246

Cargo.toml Outdated
@@ -27,6 +28,7 @@ tokio-tls = "0.1"
url = "1.2"
uuid = { version = "0.5", features = ["v4"] }
hyper-proxy = "0.4.0"
uchardet = "2.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use chardet because of license issue, it's LGPL-3.0

src/response.rs Outdated
self.read_to_end(&mut content).map_err(::error::from)?;
let encoding_name = uchardet::detect_encoding_name(&content).unwrap_or_else(|_| "utf-8".to_string());
let encoding = Encoding::for_label(encoding_name.as_bytes()).unwrap_or(UTF_8);
let (text, _, _) = encoding.decode(&content);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this.

@messense
Copy link
Contributor Author

messense commented Feb 10, 2018

This patch mimics the behavior of the Python requests library. Removed encoding detection.

@messense
Copy link
Contributor Author

Unfortunately, the uchardet crate doesn't compile on x86_64-pc-windows-gnu and i686-pc-windows-gnu targets, maybe we can add a feature gate for it or just remove encoding detection?

@messense
Copy link
Contributor Author

So I have removed uchardet, I think get encoding from Content-Type and default to utf-8 should work for most cases.

@seanmonstar
Copy link
Owner

Thanks!

I just read through the docs of encoding_rs, and it seems that this will replace any characters it can't understand with the UTF-8 replacement character. Is this sort of behavior what someone would expect from calling res.text()? It might be surprising...

@messense
Copy link
Contributor Author

I thought about it but I am not sure what's the best way to deal with it.

I think Python requests use UTF-8 replacement character too.

https://github.com/requests/requests/blob/3c1d36b827417fdeaf5a1c106129de30dac371d7/requests/models.py#L855-L864

Copy link
Owner

@seanmonstar seanmonstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, awesome! I think it makes sense to do this as a convenience method. The behavior should likely be documented on the method.

src/response.rs Outdated
let encoding_name = self.headers().get::<::header::ContentType>()
.and_then(|content_type| {
content_type.get_param("charset")
.map(|charset| charset.as_str().to_string())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to copy the string, and so could just be charset.as_str().

src/response.rs Outdated
content_type.get_param("charset")
.map(|charset| charset.as_str().to_string())
})
.unwrap_or_else(|| "utf-8".to_string());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With no copy of the string above, this can just be .unwrap_or("utf-8").

src/response.rs Outdated
})
.unwrap_or_else(|| "utf-8".to_string());
let encoding = Encoding::for_label(encoding_name.as_bytes()).unwrap_or(UTF_8);
let (text, _, _) = encoding.decode(&content);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like decode returns a Cow<str>, since it may have detected that the bytes were valid UTF-8 and didn't need to do any copying. So, we can handle the Cow if it is Cow::Borrowed, that means we don't need to make a new copy, since the bytes in content were valid! Eliminating this copy is a bigger deal depending on how big the body was.

So, seems this could be handled like so:

// a block because of borrow checker
{
    let (text, _, _) = encoding.decode(&content);
    match text {
        Cow::Owned(s) => return Ok(s),
        _ => (),
    }
}
unsafe {
    // decoding returned Cow::Borrowed, meaning these bytes
    // are already valid utf8
    Ok(String::from_utf8_unchecked(content))
}

@messense
Copy link
Contributor Author

Addressed review comments.

@seanmonstar seanmonstar merged commit 0203fad into seanmonstar:master Feb 15, 2018
@seanmonstar
Copy link
Owner

Woot! Thanks!

@messense messense deleted the feature/text-encoding branch February 16, 2018 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants