-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect encoding and decode text response #256
Detect encoding and decode text response #256
Conversation
Cargo.toml
Outdated
@@ -27,6 +28,7 @@ tokio-tls = "0.1" | |||
url = "1.2" | |||
uuid = { version = "0.5", features = ["v4"] } | |||
hyper-proxy = "0.4.0" | |||
uchardet = "2.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't use chardet because of license issue, it's LGPL-3.0
src/response.rs
Outdated
self.read_to_end(&mut content).map_err(::error::from)?; | ||
let encoding_name = uchardet::detect_encoding_name(&content).unwrap_or_else(|_| "utf-8".to_string()); | ||
let encoding = Encoding::for_label(encoding_name.as_bytes()).unwrap_or(UTF_8); | ||
let (text, _, _) = encoding.decode(&content); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this.
|
Unfortunately, the |
So I have removed |
Thanks! I just read through the docs of encoding_rs, and it seems that this will replace any characters it can't understand with the UTF-8 replacement character. Is this sort of behavior what someone would expect from calling |
I thought about it but I am not sure what's the best way to deal with it. I think Python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, awesome! I think it makes sense to do this as a convenience method. The behavior should likely be documented on the method.
src/response.rs
Outdated
let encoding_name = self.headers().get::<::header::ContentType>() | ||
.and_then(|content_type| { | ||
content_type.get_param("charset") | ||
.map(|charset| charset.as_str().to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to copy the string, and so could just be charset.as_str()
.
src/response.rs
Outdated
content_type.get_param("charset") | ||
.map(|charset| charset.as_str().to_string()) | ||
}) | ||
.unwrap_or_else(|| "utf-8".to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With no copy of the string above, this can just be .unwrap_or("utf-8")
.
src/response.rs
Outdated
}) | ||
.unwrap_or_else(|| "utf-8".to_string()); | ||
let encoding = Encoding::for_label(encoding_name.as_bytes()).unwrap_or(UTF_8); | ||
let (text, _, _) = encoding.decode(&content); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like decode
returns a Cow<str>
, since it may have detected that the bytes were valid UTF-8 and didn't need to do any copying. So, we can handle the Cow
if it is Cow::Borrowed
, that means we don't need to make a new copy, since the bytes in content
were valid! Eliminating this copy is a bigger deal depending on how big the body was.
So, seems this could be handled like so:
// a block because of borrow checker
{
let (text, _, _) = encoding.decode(&content);
match text {
Cow::Owned(s) => return Ok(s),
_ => (),
}
}
unsafe {
// decoding returned Cow::Borrowed, meaning these bytes
// are already valid utf8
Ok(String::from_utf8_unchecked(content))
}
Addressed review comments. |
Woot! Thanks! |
Fixes #246