Response::text errors on non-utf8 bytes #246

Closed
rusterize opened this issue Jan 16, 2018 · 2 comments

Comments


rusterize commented Jan 16, 2018

@seanmonstar Thank you very much for this great crate! It is badly needed, and I appreciate that you shared it.

Run the following simple test:

extern crate reqwest;
use reqwest::Error;

fn main() {
    match run() {
        Ok(_) => println!("success!"),
        Err(e) => eprintln!("Error: {}", e),
    }
}

fn run() -> Result<(), Error> {
    // Fetch the page; Google answers with charset=ISO-8859-1.
    let mut res = reqwest::get("http://google.com")?;
    // text() fails here because the body is not valid UTF-8.
    let _text = res.text()?;
    Ok(())
}

This is the output:

sh-4.4$ ./target/debug/rtest
Error: stream did not contain valid UTF-8

The error happens because Response::text() ignores the Content-Type: text/html; charset=ISO-8859-1 header from Google. Response::text() uses read_to_string() from the standard library, which explicitly requires UTF-8 input.
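
A minimal sketch of the failure mode, using only the standard library and no network access. The helper name `read_as_utf8` and the sample bytes are made up for illustration; the sketch only mirrors the `read_to_string()` behavior described above and is not reqwest's actual code.

```rust
use std::io::Read;

/// Hypothetical helper: reads a body with `read_to_string()`,
/// which only accepts valid UTF-8.
fn read_as_utf8(mut body: &[u8]) -> std::io::Result<String> {
    let mut text = String::new();
    body.read_to_string(&mut text)?;
    Ok(text)
}

fn main() {
    // "café" encoded as ISO-8859-1: the lone byte 0xE9 ('é') is not valid UTF-8.
    let latin1_body: &[u8] = b"caf\xE9";
    match read_as_utf8(latin1_body) {
        Ok(text) => println!("decoded: {}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```

Any body containing a byte sequence that is not valid UTF-8 takes the `Err` branch, regardless of what the Content-Type header declares.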

I think it is a rather big problem if reqwest can't handle google.com. You could use the encoding crate and honor the encoding header. As a short-term workaround, you could provide a method that returns a &[u8] rather than a String, so the user can work around the bug.
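
If the raw bytes were available, the workaround could look roughly like the sketch below. The names `charset_from_content_type` and `decode_body` are hypothetical and not part of reqwest. It handles only strict ISO-8859-1 (each byte maps directly to the Unicode code point of the same value) and falls back to UTF-8 for everything else; a real implementation would probably delegate to an encoding library instead.

```rust
/// Hypothetical helper: extracts the `charset` parameter from a
/// Content-Type header value, lowercased.
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        .split(';')
        .skip(1) // skip the media type itself, e.g. "text/html"
        .filter_map(|param| {
            let mut parts = param.splitn(2, '=');
            let name = parts.next()?.trim();
            let val = parts.next()?.trim().trim_matches('"');
            if name.eq_ignore_ascii_case("charset") {
                Some(val.to_ascii_lowercase())
            } else {
                None
            }
        })
        .next()
}

/// Hypothetical helper: decodes a body according to the declared charset.
/// Only strict ISO-8859-1 is special-cased; anything else is treated as UTF-8.
fn decode_body(
    bytes: &[u8],
    content_type: Option<&str>,
) -> Result<String, std::string::FromUtf8Error> {
    let charset = content_type.and_then(charset_from_content_type);
    match charset.as_deref() {
        // In ISO-8859-1 every byte is the code point of the same value.
        Some("iso-8859-1") | Some("latin1") => Ok(bytes.iter().map(|&b| b as char).collect()),
        _ => String::from_utf8(bytes.to_vec()),
    }
}

fn main() {
    let header = "text/html; charset=ISO-8859-1";
    let body = b"caf\xE9"; // "café" in ISO-8859-1
    match decode_body(body, Some(header)) {
        Ok(text) => println!("{}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```

Without the header, the same bytes are rejected as invalid UTF-8, which is why honoring the declared charset matters here.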

NOTE: Google may change their page tomorrow and everything will work fine. Nonetheless, I am glad it broke because otherwise this would have been hard to discover!

Here is the data in case google changes their pages:
2018_01_16_www.google.com.data_non_utf8.txt.gz
2018_01_16_www.google.com.header.txt.gz

BTW, it is rather funny that the offending bytes are around the "Advertising Program" string :)

@seanmonstar
Owner

This is actually expected, as text() just uses read_to_string, which requires the body to be UTF-8. However, you bring up a good point, which is that perhaps reqwest can do a better job of just making this work. For instance, python-requests handles this automatically.

seanmonstar changed the title from "reqwest can't handle http://google.com - Encoding problem!" to "Response::text errors on non-utf8 bytes" on Jan 16, 2018
rleungx mentioned this issue Feb 8, 2018
@messense
Contributor

#256 will fix this.

seanmonstar pushed a commit that referenced this issue Feb 15, 2018
* Detect encoding and decode text response

Fixes #246

* Try to get encoding from Content-Type header

* Remove uchardet encoding detection for now

* Add non utf-8 test case for Response::text()

* Reduce copies