fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue #2435) #2449

PrinOrange · 2025-01-04T14:49:52Z

Description

This pull request fixes #2435.
Readability may encounter a decoding error on some pages.

before:

After:

The original code had some deficiencies in decoding the original webpage text. It only considered obtaining the charset attribute value from the content type. So I optimized the process of text encoding detection and decoding.

Optimized Code Flow for Charset Detection and Decoding

Here’s the optimized code flow for detecting the charset and decoding the document:

1. First, get the charset from the Content-Type response header.
2. If the charset is not found in the headers, use chardet to detect the encoding.

Detailed Optimizations

Get Charset from Content-Type Header
- Directly extract charset from the response header. This is the standard HTTP approach for specifying encoding.
- Example code uses a regular expression to parse the charset:
```
const httpCharset = contentType?.match(/charset=([\w-]+)/i)?.[1];
```
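The one-line regular expression above does not match a quoted value such as `charset="utf-8"`, which HTTP also permits. A minimal sketch of a slightly more tolerant parser follows; the helper name `charsetFromContentType` is hypothetical and is not part of this PR.

```typescript
// Hypothetical helper (not from the PR): extract the charset parameter
// from a Content-Type header value, or return undefined if there is none.
function charsetFromContentType(
  contentType: string | null | undefined,
): string | undefined {
  if (!contentType) return undefined
  // Accept both charset=utf-8 and charset="utf-8"; matching is case-insensitive.
  const match = contentType.match(/charset\s*=\s*"?([\w-]+)"?/i)
  // Normalize to lower case so later comparisons are straightforward.
  return match?.[1]?.toLowerCase()
}
```

Lower-casing the result is a design choice for this sketch: charset labels are case-insensitive, so normalizing once avoids repeated `toLowerCase()` calls later.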
Fallback to chardet
- If the Content-Type header does not specify a charset, use chardet for charset detection.
- chardet statistically analyzes the byte distribution of the content to estimate its most likely encoding.
Stream Handling
- Read the content into an ArrayBuffer only once, ensuring the stream is not consumed multiple times.
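The three points above can be sketched together as follows. This is an illustration under stated assumptions, not the code merged in this PR: `decodeBody`, `fetchAndDecode`, and the `Detector` type are hypothetical names, the detector is passed in as a parameter so the sketch stays self-contained (chardet's `detect` function could be supplied there), and the final UTF-8 fallback is an assumption of the sketch.

```typescript
// Hypothetical detector signature; chardet's detect() could be passed here.
type Detector = (bytes: Uint8Array) => string | null

// Decode an already-read body: header charset first, detector second.
function decodeBody(
  buffer: ArrayBuffer,
  contentType: string | null,
  detect: Detector,
): string {
  const bytes = new Uint8Array(buffer)
  // Step 1: charset from the Content-Type header, if present.
  const httpCharset = contentType?.match(/charset=([\w-]+)/i)?.[1]
  // Step 2: fall back to detection only when the header gives no charset.
  // The last-resort "utf-8" default is an assumption of this sketch.
  const charset = httpCharset ?? detect(bytes) ?? "utf-8"
  try {
    return new TextDecoder(charset, { fatal: false }).decode(bytes)
  } catch {
    // TextDecoder throws a RangeError for labels it does not recognize.
    return new TextDecoder("utf-8").decode(bytes)
  }
}

// Read the response body exactly once, then decode from the buffer.
async function fetchAndDecode(url: string, detect: Detector): Promise<string> {
  const res = await fetch(url)
  const buffer = await res.arrayBuffer() // the stream is consumed only here
  return decodeBody(buffer, res.headers.get("content-type"), detect)
}
```

Because decoding works on the buffered bytes rather than on the response stream, the same bytes can be inspected by the detector and decoded without reading the body a second time.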

Flowchart

[Start]
   ↓
Fetch content
   ↓
Check Content-Type header for charset
   ↓
No charset? → Use chardet to detect
   ↓
Decode using detected charset
   ↓
[Done]

PR Type

Feature
Bugfix
Hotfix
Other (please describe):

Linked Issues

#2435

Additional context

The documentation for chardet

Changelog

I have updated the changelog/next.md with my changes.

follow-reviewer-bot · 2025-01-04T14:49:55Z

Thank you for your contribution. We will review it promptly.

vercel · 2025-01-04T14:49:56Z

@PrinOrange is attempting to deploy a commit to the RSS3 Team on Vercel.

A member of the Team first needs to authorize it.

follow-reviewer-bot · 2025-01-04T14:50:00Z

Suggested PR Title:

feat(readability): enhance charset detection with chardet

Change Summary:
Added chardet library for enhanced charset detection in readability function and refactored code to handle charset detection more robustly, potentially improving text decoding accuracy.

Code Review:

apps/main/src/lib/readability.ts, lines 41-42: The text variable is declared with const but is reassigned here. TypeScript rejects assignment to a const at compile time, and the same code would throw a TypeError at run time. Either declare text using let or introduce a new variable for the re-decoding step when the charsets differ.

        if (finalCharset.toLowerCase() !== detectedCharset.toLowerCase()) {
          // Reassigning text here will cause an error
          // eslint-disable-next-line no-param-reassign
          text = new TextDecoder(finalCharset, { fatal: false }).decode(buffer)
        }

For example, keep text as it is and introduce a new variable (declaring text itself with let instead of const also works):

        let finalText = text;
        if (finalCharset.toLowerCase() !== detectedCharset.toLowerCase()) {
          finalText = new TextDecoder(finalCharset, { fatal: false }).decode(buffer);
        }

        return finalText;

No further change requests necessary.

hyoban · 2025-01-06T00:58:37Z

thank you for your contribution

follow-reviewer-bot · 2025-01-06T00:59:59Z

Thank you @PrinOrange for your contribution! 🎉

Your pull request has been merged and we really appreciate your help in making this project better. We hope to see more contributions from you in the future! 💪

PrinOrange added 2 commits January 4, 2025 22:10

fix: fix decoding error by detecting charset.

bb8c6fd

fix: add auto detect origin encoding charset for readability.

b413f50

PrinOrange mentioned this pull request Jan 4, 2025

Chars decoding error (gb2312, UTF-8) in Readability. #2435

Closed

5 tasks

PrinOrange changed the title ~~fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue2435)~~ fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue #2435) Jan 4, 2025

PrinOrange added 3 commits January 4, 2025 23:01

lint: fix code style.

e363b94

fix: no longer use meta tags to detect encoding

0054ade

lint: adjust code style and error handle

f7e4046

hyoban merged commit 1de1cd9 into RSSNext:dev Jan 6, 2025
5 of 8 checks passed
