fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue #2435) #2449
+54
−12
Description
This pull request fixes #2435.
The readability feature may encounter decoding errors when a page uses a charset such as UTF-8, GBK, or ISO-8859.
Before:
![image](https://private-user-images.githubusercontent.com/73331790/400145749-6063531c-b8ee-44de-b0bd-fd66cad201e9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk1MjQxMzIsIm5iZiI6MTczOTUyMzgzMiwicGF0aCI6Ii83MzMzMTc5MC80MDAxNDU3NDktNjA2MzUzMWMtYjhlZS00NGRlLWIwYmQtZmQ2NmNhZDIwMWU5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE0VDA5MDM1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc2NjcwMTZkYTYzOWU1Mzg1NzhhZTI1NjMxMTJmODVmZDEwNDM3YTVhNjI3ZWU2YmM1YWM3MTcwMGM0OGVkZWMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Z17d43d8LBi8cdHhDFZ7AF8o_P9SaJV7Yki86U8jqvI)
After:
![1736001323084](https://private-user-images.githubusercontent.com/73331790/400145762-35ae95f4-5cba-4082-a629-69ae1918290d.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk1MjQxMzIsIm5iZiI6MTczOTUyMzgzMiwicGF0aCI6Ii83MzMzMTc5MC80MDAxNDU3NjItMzVhZTk1ZjQtNWNiYS00MDgyLWE2MjktNjlhZTE5MTgyOTBkLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE0VDA5MDM1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTFlYmM4NjJmZjMyMDEwOGM3ZmQxZThlZjQ3MTE0YzM3ZmM3N2UwYjJkMGY2MzgxMTZjODZiYjU3YmY0YWI2NTYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.tquDpmDcUejOoNo_4VW00Fj1ee-pHDOi5LOfscrPMDA)
The original code had shortcomings when decoding the webpage text: it only read the `charset` attribute value from the `Content-Type` header. This PR therefore reworks the text encoding detection and decoding process.
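To illustrate the failure mode, here is a minimal sketch, not the project's actual code. It assumes the old behaviour amounted to decoding as UTF-8 whenever the header carried no charset; the sample bytes and the `isValidUtf8` helper are hypothetical names chosen for this example.

```typescript
// Bytes of "中文" encoded as GBK (0xD6D0 0xCEC4).
const gbkBytes = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]);

// Assumed legacy behaviour: no charset in the header, so decode as UTF-8.
// The decoder does not throw; it silently emits U+FFFD replacement characters.
const garbled = new TextDecoder("utf-8").decode(gbkBytes);
const hasReplacementChars = garbled.includes("\uFFFD");

// Hypothetical helper: a strict decoder turns the silent corruption
// into an error that the caller can detect.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}
```

The point of the sketch is that a wrong charset guess does not fail loudly by default, which is why the garbled text in the "Before" screenshot can go unnoticed by the code itself.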
Optimized Code Flow for Charset Detection and Decoding
Here's the optimized code flow for detecting the charset and decoding the document, in priority order:
1. Get the encoding from the `Content-Type` response header.
2. Otherwise, use `chardet` to detect the encoding.
Detailed Optimizations
- Get charset from the `Content-Type` header: read `charset` from the response header. This is the standard HTTP approach for specifying encoding.
- Fallback to `chardet`: if the `Content-Type` header provides no charset (or an ambiguous one), or the content provides no `<meta charset>`, use `chardet` for charset detection. `chardet` statistically analyzes the byte distribution of the content, which makes its guess fairly accurate.
- Stream handling: read the response body into an `ArrayBuffer` only once, ensuring the stream is not consumed multiple times.
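The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `charsetFromContentType`, `decodeBody`, and the `DetectEncoding` type are hypothetical names, and the detector is passed in as a parameter so the sketch does not depend on the exact `chardet` API.

```typescript
// Hypothetical detector signature; in the PR this role is played by chardet.
type DetectEncoding = (bytes: Uint8Array) => string | null;

// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetFromContentType(contentType: string | null): string | undefined {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType ?? "");
  return match?.[1]?.toLowerCase();
}

async function decodeBody(response: Response, detect: DetectEncoding): Promise<string> {
  // Read the body exactly once; every later step reuses these bytes.
  const bytes = new Uint8Array(await response.arrayBuffer());

  // Priority: header charset, then detection, then an assumed UTF-8 default.
  const charset =
    charsetFromContentType(response.headers.get("content-type")) ??
    detect(bytes) ??
    "utf-8";

  try {
    return new TextDecoder(charset).decode(bytes);
  } catch {
    // TextDecoder throws a RangeError for labels it does not know;
    // falling back to UTF-8 here is an assumption of this sketch.
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```

A caller would pass the fetched `Response` and a detector, for example `decodeBody(await fetch(url), detect)`, where `detect` wraps whichever detection library is in use. Keeping the bytes in one `Uint8Array` is what lets detection and decoding share a single read of the stream.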
PR Type
Linked Issues
#2435
Additional context
The documentation for `chardet`
Changelog