Detect encoding and decode text response #256

messense · 2018-02-10T11:43:23Z

Fixes #246

messense · 2018-02-10T11:45:32Z

Cargo.toml

@@ -27,6 +28,7 @@ tokio-tls = "0.1"
 url = "1.2"
 uuid = { version = "0.5", features = ["v4"] }
 hyper-proxy = "0.4.0"
+uchardet = "2.0"


Can't use chardet because of a license issue: it's LGPL-3.0.

messense · 2018-02-10T11:45:49Z

src/response.rs

+        self.read_to_end(&mut content).map_err(::error::from)?;
+        let encoding_name = uchardet::detect_encoding_name(&content).unwrap_or_else(|_| "utf-8".to_string());
+        let encoding = Encoding::for_label(encoding_name.as_bytes()).unwrap_or(UTF_8);
+        let (text, _, _) = encoding.decode(&content);


I am not sure about this.

messense · 2018-02-10T12:12:07Z

~~This patch mimics the behavior of the Python requests library.~~ Removed encoding detection.

messense · 2018-02-10T12:18:38Z

Unfortunately, the uchardet crate doesn't compile on the x86_64-pc-windows-gnu and i686-pc-windows-gnu targets. Maybe we can add a feature gate for it, or just remove encoding detection?

messense · 2018-02-11T01:41:38Z

So I have removed uchardet. I think getting the encoding from Content-Type and defaulting to utf-8 should work for most cases.

seanmonstar · 2018-02-12T17:54:36Z

Thanks!

I just read through the docs of encoding_rs, and it seems that this will replace any byte sequences it can't decode with the Unicode replacement character (U+FFFD). Is this sort of behavior what someone would expect from calling res.text()? It might be surprising...

messense · 2018-02-13T08:02:55Z

I thought about it, but I am not sure what the best way to deal with it is.

I think Python requests uses the replacement character too.

https://github.com/requests/requests/blob/3c1d36b827417fdeaf5a1c106129de30dac371d7/requests/models.py#L855-L864
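To make the trade-off in this exchange concrete, here is a minimal sketch of strict versus lossy decoding. It uses only the Rust standard library as a stand-in for encoding_rs, so it handles UTF-8 only; the function names `decode_strict` and `decode_lossy` are hypothetical and are not part of this PR.

```rust
use std::str::Utf8Error;

/// Strict decoding: any invalid UTF-8 sequence is reported as an error.
/// (Hypothetical helper, for illustration only.)
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: each invalid sequence becomes U+FFFD, the Unicode
/// replacement character, and decoding never fails.
/// (Hypothetical helper, for illustration only.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

With the Latin-1 bytes `b"caf\xE9"`, the strict variant returns an error, while the lossy variant silently yields `"caf\u{FFFD}"`. That silent substitution is the behavior that would be worth documenting on the method.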

seanmonstar

Alright, awesome! I think it makes sense to do this as a convenience method. The behavior should likely be documented on the method.

seanmonstar · 2018-02-14T19:58:29Z

src/response.rs

+        let encoding_name = self.headers().get::<::header::ContentType>()
+            .and_then(|content_type| {
+                content_type.get_param("charset")
+                    .map(|charset| charset.as_str().to_string())


I don't think this needs to copy the string, so it could just be charset.as_str().

seanmonstar · 2018-02-14T19:58:58Z

src/response.rs

+                content_type.get_param("charset")
+                    .map(|charset| charset.as_str().to_string())
+            })
+            .unwrap_or_else(|| "utf-8".to_string());


With no copy of the string above, this can just be .unwrap_or("utf-8").
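The two suggestions above amount to keeping the charset as a borrowed `&str` all the way through, so that the default can be a string literal rather than an allocated `String`. A rough, standard-library-only sketch of that idea follows. The helpers `charset_param` and `encoding_label` are hypothetical and parse a raw header value by hand; they are not the hyper `ContentType` API used in the diff.

```rust
/// Hypothetical helper: returns the `charset` parameter of a Content-Type
/// header value, borrowing from the input instead of allocating.
fn charset_param(content_type: &str) -> Option<&str> {
    content_type
        .split(';')
        .skip(1) // the first segment is the media type itself
        .find_map(|param| {
            let mut parts = param.splitn(2, '=');
            let name = parts.next()?.trim();
            let value = parts.next()?.trim().trim_matches('"');
            if name.eq_ignore_ascii_case("charset") {
                Some(value)
            } else {
                None
            }
        })
}

/// Hypothetical helper: picks the encoding label, defaulting to "utf-8".
/// No `String` is created on either path.
fn encoding_label(content_type: Option<&str>) -> &str {
    content_type.and_then(charset_param).unwrap_or("utf-8")
}
```

Because both branches produce a `&str`, `unwrap_or("utf-8")` is enough and the closure-based `unwrap_or_else` is no longer needed.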

seanmonstar · 2018-02-14T20:09:16Z

src/response.rs

+            })
+            .unwrap_or_else(|| "utf-8".to_string());
+        let encoding = Encoding::for_label(encoding_name.as_bytes()).unwrap_or(UTF_8);
+        let (text, _, _) = encoding.decode(&content);


It looks like decode returns a Cow<str>, since it may have detected that the bytes were valid UTF-8 and didn't need to do any copying. So we can inspect the Cow: if it is Cow::Borrowed, we don't need to make a new copy, since the bytes in content were already valid! Eliminating this copy matters more the bigger the body is.

So, it seems this could be handled like so:

    // a block because of borrow checker
    {
        let (text, _, _) = encoding.decode(&content);
        match text {
            Cow::Owned(s) => return Ok(s),
            _ => (),
        }
    }
    unsafe {
        // decoding returned Cow::Borrowed, meaning these bytes
        // are already valid utf8
        Ok(String::from_utf8_unchecked(content))
    }
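The borrowed/owned split described in this comment can be tried out with the standard library alone. The sketch below substitutes `String::from_utf8_lossy`, which also returns a `Cow<str>`, for `encoding.decode`, so it is an illustration of the pattern rather than the code that was merged. The function name `bytes_to_string` is hypothetical.

```rust
use std::borrow::Cow;

/// Hypothetical helper: converts a body buffer to a `String`, reusing the
/// buffer's allocation when the bytes are already valid UTF-8.
fn bytes_to_string(content: Vec<u8>) -> String {
    // If decoding had to replace anything, it produced a fresh String.
    // The borrow of `content` ends with this statement, so the buffer
    // can be moved afterwards.
    if let Cow::Owned(s) = String::from_utf8_lossy(&content) {
        return s;
    }
    // SAFETY: `from_utf8_lossy` returns `Cow::Borrowed` only when the whole
    // input slice is valid UTF-8, so `content` is valid UTF-8 here.
    unsafe { String::from_utf8_unchecked(content) }
}
```

One assumption to check before carrying this over to encoding_rs: there, a borrowed result may cover only part of the input (for example, if a byte order mark was stripped), in which case converting the whole buffer would not be equivalent to the decoded text.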

messense · 2018-02-15T03:54:55Z

Addressed review comments.

seanmonstar · 2018-02-15T19:02:06Z

Woot! Thanks!

Detect encoding and decode text response

25378d6

Fixes #246

Try to get encoding from Content-Type header

f5d00b6

messense mentioned this pull request Feb 10, 2018

Relicensing thuleqaid/rust-chardet#3

Open

Remove uchardet encoding detection for now

2337017

Add non utf-8 test case for Response::text()

19c57f6

messense mentioned this pull request Feb 11, 2018

Response::text errors on non-utf8 bytes #246

Closed

seanmonstar reviewed Feb 14, 2018

Reduce copies

8c08daa

seanmonstar merged commit 0203fad into seanmonstar:master Feb 15, 2018

messense deleted the feature/text-encoding branch February 16, 2018 02:40

messense mentioned this pull request Jul 8, 2019

Remove unnecessary unsafe in Text::poll #559

Closed

seanmonstar mentioned this pull request Jul 5, 2023

BOM in Response::text_with_charset #1897

Closed
