fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue #2435) #2449

Merged
merged 5 commits into from
Jan 6, 2025

@PrinOrange PrinOrange commented Jan 4, 2025

Description

This pull request fixes #2435.
Readability can hit decoding errors on pages whose charset is not declared in the Content-Type header.

Before: [screenshot]

After: [screenshot]

The original code decoded the fetched page text unreliably: it only looked at the charset attribute of the Content-Type header. This PR improves how the text encoding is detected and how the text is decoded.


Optimized Code Flow for Charset Detection and Decoding

Here’s the optimized code flow for detecting the charset and decoding the document:

  1. First, get the charset from the Content-Type response header.
  2. If the charset is not found in the headers, use chardet to detect the encoding.
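
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `DetectCharset` type is a hypothetical stand-in for chardet (passed in as a parameter so the sketch stays self-contained), and the final fallback to UTF-8 is an assumption of the sketch.

```typescript
// Hypothetical detector signature standing in for chardet; not the PR's actual code.
type DetectCharset = (bytes: Uint8Array) => string | null;

// Step 1: charset from the Content-Type header. Step 2: fall back to detection.
function pickCharset(
  contentType: string | null,
  bytes: Uint8Array,
  detect: DetectCharset,
): string {
  const httpCharset = contentType?.match(/charset=([\w-]+)/i)?.[1];
  // Assumption of this sketch: default to UTF-8 when neither source yields a charset.
  return httpCharset ?? detect(bytes) ?? "utf-8";
}

function decodeBody(
  contentType: string | null,
  bytes: Uint8Array,
  detect: DetectCharset,
): string {
  const charset = pickCharset(contentType, bytes, detect);
  try {
    return new TextDecoder(charset, { fatal: false }).decode(bytes);
  } catch {
    // TextDecoder throws a RangeError for labels it does not recognize,
    // so this sketch retries with UTF-8 instead of failing.
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```

Because the detector is only called when the header yields nothing, a correct Content-Type header never pays the cost of byte-level detection.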

Detailed Optimizations

  1. Get Charset from Content-Type Header

    • Directly extract charset from the response header. This is the standard HTTP approach for specifying encoding.
    • Example code uses a regular expression to parse the charset:
      const httpCharset = contentType?.match(/charset=([\w-]+)/i)?.[1];
  2. Fallback to chardet

    • If the Content-Type header carries no charset, or the charset is ambiguous, use chardet for charset detection.
    • chardet analyzes the byte distribution of the content to estimate its most likely encoding.
  3. Stream Handling

    • Read the content into an ArrayBuffer only once, ensuring the stream is not consumed multiple times.
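
To make the behavior of the regular expression in item 1 concrete, here is a small sketch that wraps it in a helper and runs it on a few header values. The helper name `charsetFromContentType` is hypothetical; only the regular expression comes from this PR description.

```typescript
// Hypothetical helper name; the regular expression is the one quoted above.
function charsetFromContentType(
  contentType: string | null | undefined,
): string | undefined {
  return contentType?.match(/charset=([\w-]+)/i)?.[1];
}

charsetFromContentType("text/html; charset=gbk");          // "gbk"
charsetFromContentType("text/html; Charset=ISO-8859-1");   // "ISO-8859-1" (the i flag ignores case)
charsetFromContentType("text/html");                       // undefined, so detection takes over
charsetFromContentType('text/html; charset="utf-8"');      // undefined: [\w-] does not match the quote
charsetFromContentType(null);                              // undefined
```

Note that a quoted charset value is not matched by this pattern, so in this sketch such responses would fall through to the chardet step.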

Flowchart

[Start]
   ↓
Fetch content
   ↓
Check Content-Type header for charset
   ↓
No charset? → Use chardet to detect
   ↓
Decode using detected charset
   ↓
[Done]
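
The "Fetch content" step depends on reading the body exactly once. The sketch below assumes the standard Fetch API `Response` (global in browsers and Node 18+); `readBodyOnce` and `demo` are hypothetical names, and the UTF-8 default is an assumption of the sketch rather than the PR's behavior.

```typescript
// Sketch only: reads the response body a single time and keeps the bytes
// so that both charset detection and decoding can reuse them.
async function readBodyOnce(
  res: Response,
): Promise<{ contentType: string | null; bytes: Uint8Array }> {
  const contentType = res.headers.get("content-type");
  // arrayBuffer() consumes the stream; afterwards res.bodyUsed is true
  // and a second read of the same Response would reject.
  const bytes = new Uint8Array(await res.arrayBuffer());
  return { contentType, bytes };
}

async function demo(): Promise<string> {
  // A string body is encoded as UTF-8 by the Response constructor.
  const res = new Response("你好", {
    headers: { "content-type": "text/html; charset=utf-8" },
  });
  const { contentType, bytes } = await readBodyOnce(res);
  const charset = contentType?.match(/charset=([\w-]+)/i)?.[1] ?? "utf-8";
  return new TextDecoder(charset).decode(bytes);
}
```

Keeping the raw bytes around is what lets a caller decode a second time with a different charset without touching the network or the stream again.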

PR Type

  • Feature
  • Bugfix
  • Hotfix
  • Other (please describe):

Linked Issues

#2435

Additional context

The document for chardet

Changelog

  • I have updated the changelog/next.md with my changes.

@follow-reviewer-bot

Thank you for your contribution. We will review it promptly.

vercel bot commented Jan 4, 2025

@PrinOrange is attempting to deploy a commit to the RSS3 Team on Vercel.

A member of the Team first needs to authorize it.

@follow-reviewer-bot

Suggested PR Title:

feat(readability): enhance charset detection with chardet

Change Summary:
Added chardet library for enhanced charset detection in readability function and refactored code to handle charset detection more robustly, potentially improving text decoding accuracy.

Code Review:

  • apps/main/src/lib/readability.ts, lines 41-42: The text variable is declared with const but is reassigned here. TypeScript rejects assignment to a const at compile time, and at run time the assignment throws a TypeError. Either declare text with let or introduce a new variable for the re-decoded result when the charsets differ.
        if (finalCharset.toLowerCase() !== detectedCharset.toLowerCase()) {
          // Reassigning text here will cause an error
          // eslint-disable-next-line no-param-reassign
          text = new TextDecoder(finalCharset, { fatal: false }).decode(buffer)
        }

Either declare text with let where it is first assigned, or keep text as it is and introduce a new variable:

        let finalText = text;
        if (finalCharset.toLowerCase() !== detectedCharset.toLowerCase()) {
          finalText = new TextDecoder(finalCharset, { fatal: false }).decode(buffer);
        }

        return finalText;

No further change requests necessary.

@PrinOrange PrinOrange changed the title fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue2435) fix: fix decoding error(utf-8,gbk,iso-8859 and other charsets) in readability (issue #2435) Jan 4, 2025
@hyoban
Member

hyoban commented Jan 6, 2025

Thank you for your contribution.

@hyoban hyoban merged commit 1de1cd9 into RSSNext:dev Jan 6, 2025
5 of 8 checks passed
@follow-reviewer-bot

Thank you @PrinOrange for your contribution! 🎉

Your pull request has been merged and we really appreciate your help in making this project better. We hope to see more contributions from you in the future! 💪

Development

Successfully merging this pull request may close these issues.

Chars decoding error (gb2312, UTF-8) in Readability.
2 participants