In this section:
http://www.nltk.org/book/ch03.html#accessing-text-from-the-web-and-from-disk
The following code is used to decode the bytes returned by the Project Gutenberg web server:
raw = response.read().decode('utf8')
With Python 3.7.4, the value of raw will contain a byte-order mark (BOM).
'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'
and the return value of len(raw) will be 1176967 rather than 1176893.
The Python Unicode HOWTO recommends the use of utf-8-sig as an encoding value to exclude the BOM, which really isn't needed for UTF-8.
from urllib import request
url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
response = request.urlopen(url)
raw = response.read().decode(encoding='utf-8-sig')
type(raw)
str
len(raw)
1176966
raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
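The difference between the two codecs can also be checked offline, without fetching the file. The following is a minimal sketch, assuming a short made-up byte string (`sample`, not the real Gutenberg download) that begins with a UTF-8 BOM:

```python
import codecs

# Hypothetical stand-in for the downloaded bytes: a UTF-8 BOM
# (EF BB BF) followed by a short ASCII line.
sample = codecs.BOM_UTF8 + b"The Project Gutenberg EBook\r\n"

with_bom = sample.decode("utf-8")         # BOM is kept as U+FEFF
without_bom = sample.decode("utf-8-sig")  # leading BOM is skipped

print(repr(with_bom[0]))                  # '\ufeff'
print(len(with_bom) - len(without_bom))   # 1

# utf-8-sig also decodes input that has no BOM, leaving it unchanged.
plain = b"no BOM here".decode("utf-8-sig")
print(plain)
```

Since `utf-8-sig` only skips a BOM at the very start of the data and otherwise behaves like `utf-8` when decoding, it should be safe to use whether or not the server includes one.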