Skip to content

Commit

Permalink
Improve binary detection for cases starting by chance by a BOM
Browse files Browse the repository at this point in the history
The presence of a sequence of bytes resembling a BOM does not guarantee
that the data is text. We can in those cases use the detection provided
by Qt. If the codec matches the one selected, we can consider that text.

See issue #2197
  • Loading branch information
mgrojo committed Jun 21, 2020
1 parent f4d78ad commit efb5b1a
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions src/Data.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,12 @@ static const QByteArray bom4b("\xFF\xFE\x00\x00", 4);

bool isTextOnly(QByteArray data, const QString& encoding, bool quickTest)
{
// If the data starts with a Unicode BOM, we always assume it is text
if(startsWithBom(data))
return true;
// If the data starts with a Unicode BOM, we can use detection provided by QTextCodec.
if(startsWithBom(data)) {
QTextCodec *codec = encoding.isEmpty()? QTextCodec::codecForName("UTF-8") : QTextCodec::codecForName(encoding.toUtf8());
QTextCodec *detectedCodec = QTextCodec::codecForUtfText(data, nullptr);
return detectedCodec == codec;
}

// Truncate to the first few bytes for quick testing
int testSize = quickTest? std::min(512, data.size()) : data.size();
Expand Down

2 comments on commit efb5b1a

@chrisjlocke
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, we're using the detection provided by QTextCodec on the full cell contents and not the first few bytes for quick testing? That's what the code is doing, so just making sure that was our intention...

@mgrojo
Copy link
Member Author

@mgrojo mgrojo commented on efb5b1a Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I supposed it would only read the BOM and some bytes. Looking now at the source code, it seems that it reads only the BOM (or I didn't get something right). So the improvement here seems to be, that it is not only enough to start by a BOM, it has to match the encoding that the user has selected. But I'm not expert in Unicode, and don't know how to test all the possible encodings. In Linux, everything seems to use UTF-8 and BOMs are rare.

Please sign in to comment.