fread bug in v1.15.4 when reading a CSV with no headers and first variable is BZh #6304

Closed
grainnemcguire opened this issue Jul 21, 2024 · 10 comments · Fixed by #6308


@grainnemcguire

fread produces an error or incorrect results when the first value in a CSV file with no headers is BZh.

  • In this case, fread() treats the file as a bz2-compressed file and attempts to process it accordingly via the R.utils package.
  • It generates an error message asking to install R.utils if that package is not installed
  • It generates an empty file error if R.utils is installed

fread runs as expected if the data have headers. data.table v1.14.8 also runs as expected.

This looks to be related to the issue in #5461 and the related changes in PR #5474 and is triggered by matching against bz2_signature in fread().

Reproducible example

library(data.table)  # v 1.15.4

# these all error ----;
dt_out <- data.table(c1 = "BZh")
fwrite(dt_out, "c:/gtemp/tem1.csv", col.names = FALSE)
(dt_in <- fread("c:/gtemp/tem1.csv"))
# error if R.utils not installed:
# Error: To read gz and bz2 files directly, fread() requires 'R.utils' package which cannot be found. Please install 'R.utils' using 'install.packages('R.utils')'.

# Error if R.utils is installed:
# Error in fread("c:/gtemp/tem1.csv") : 
#   File is empty: C:\Users\grain\AppData\Local\Temp\RtmpmKTA65\file33a84c417794

(dt_in <- fread(file = "c:/gtemp/tem1.csv"))

(dt_in <- fread(file = "c:/gtemp/tem1.csv", header = FALSE))


dt_out <- data.table(c1 = c("BZh", "BZh"))
fwrite(dt_out, "c:/gtemp/tem1.csv", col.names = FALSE)
(dt_in <- fread(file = "c:/gtemp/tem1.csv", header = FALSE))

dt_out <- data.table(c1 = c("BZh"), c2 = "Bzh")
fwrite(dt_out, "c:/gtemp/tem1.csv", col.names = FALSE)
(dt_in <- fread(file = "c:/gtemp/tem1.csv", header = FALSE))

dt_out <- data.table(c1 = c("BZh", "BZh", "BZ", "GZ"), c2 = c("Bzh", "Bzh", "Bzh", "Bzh"))
fwrite(dt_out, "c:/gtemp/tem1.csv", col.names = FALSE)
(dt_in <- fread(file = "c:/gtemp/tem1.csv", header = FALSE))

# this works ----
dt_out <- data.table(c1 = "BZh")
fwrite(dt_out, "c:/gtemp/tem1.csv", col.names = TRUE)
(dt_in <- fread("c:/gtemp/tem1.csv"))

Output of sessionInfo()

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Ireland.utf8  LC_CTYPE=English_Ireland.utf8    LC_MONETARY=English_Ireland.utf8 LC_NUMERIC=C                     LC_TIME=English_Ireland.utf8    

time zone: Europe/Dublin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.15.4

loaded via a namespace (and not attached):
[1] compiler_4.4.1    parallel_4.4.1    tools_4.4.1       R.methodsS3_1.8.2 R.utils_2.12.3    R.oo_1.26.0 
@ben-schwen
Member

Thanks for the report. Not reproducible on Ubuntu, but I can reproduce it on current master on Windows.

library(data.table)
dt = data.table(c1="BZh")
f = tempfile()
fwrite(dt, f, col.names=FALSE)
fread(f)
#> Error in fread(f) :
#>   File is empty: C:\Users\~\AppData\Local\Temp\Rtmp4Cc6ek\file1ff86e2d449f

@MichaelChirico
Member

MichaelChirico commented Jul 22, 2024

Is fread(file = f) a workaround? Any insight from verbose=TRUE?

@grainnemcguire
Author

fread(file = f) doesn't help. Here is verbose output:

[image: screenshot of the verbose=TRUE output]

From stepping through fread in debug mode:

  • there is a variable file_signature defined as readBin(file, raw(), 8L) = c(42, 5a, 68, 0d, 0a) where file = f in this case.
  • bz2_signature evaluates as c(42, 5a, 68) (defined as as.raw(c(0x42, 0x5A, 0x68)) in function code)
  • this condition: identical(head(file_signature, 3L), bz2_signature) then evaluates as TRUE [both are c(42, 5a, 68)] which triggers the handling as a bz2 file and thus the error
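The comparison described in the bullets above can be reproduced with base R alone. This is a minimal sketch, not the fread() source: the bytes are written explicitly with writeBin() so the result does not depend on the platform's line endings, and only file_signature and bz2_signature are names taken from the comment above.

```r
# Write the 5 bytes of a one-row, one-column CSV with Windows line endings: "BZh\r\n"
f <- tempfile(fileext = ".csv")
writeBin(charToRaw("BZh\r\n"), f)

# The same two values that were inspected in the debugger
file_signature <- readBin(f, raw(), 8L)         # up to 8 bytes; only 5 exist here
bz2_signature  <- as.raw(c(0x42, 0x5A, 0x68))   # ASCII "BZh"

# A 3-byte prefix comparison cannot tell the plain text "BZh"
# apart from the start of a bzip2 header
is_match <- identical(head(file_signature, 3L), bz2_signature)
print(file_signature)
print(is_match)

unlink(f)
```

Because only the first three bytes are compared, any text file whose content begins with BZh satisfies the condition.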

@ben-schwen
Member

Out of curiosity, is this a real file starting with "BZh"?

Interestingly, readLines also seems to have certain problems here

f = tempfile()
fwrite(data.table(c1=c("BZh")), f, col.names=FALSE)
readLines(f)
#> [1] "BZh"
fwrite(data.table(c1=c("BZh", "x")), f, col.names=FALSE)
readLines(f)
#> character(0)

@MichaelChirico
Member

Per the bzip2 file format description linked below, we could make our detection safer against false positives like this by also checking that the 4th byte is a digit 1-9:

https://en.wikipedia.org/wiki/Bzip2#File_format
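A rough sketch of what that stricter check might look like, assuming the first bytes of the file are already available as a raw vector. The helper name looks_like_bz2 is hypothetical and is not part of data.table; this is not necessarily how the eventual fix is implemented.

```r
# Hypothetical helper (not part of data.table): TRUE only if the bytes start
# with "BZh" AND the 4th byte is an ASCII digit "1".."9" (0x31..0x39)
looks_like_bz2 <- function(bytes) {
  length(bytes) >= 4L &&
    identical(bytes[1:3], as.raw(c(0x42, 0x5A, 0x68))) &&
    as.integer(bytes[4L]) %in% 0x31:0x39
}

looks_like_bz2(charToRaw("BZh\r\n"))   # 4th byte is 0x0d, not a digit
looks_like_bz2(charToRaw("BZh9"))      # 4th byte is the digit "9"
looks_like_bz2(charToRaw("BZh7abc"))   # plain text, but 4th byte is a digit
looks_like_bz2(charToRaw("BZ"))        # too short to decide
```

The extra byte rules out the one-value file from the report, but a text field such as BZh7abc would still pass, so this narrows the false positives rather than eliminating them.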

@grainnemcguire
Author

@ben-schwen yes, we found the issue in a real CSV with no headers. It contains many more variables and rows, but the BZh in the first entry is enough to trigger the false positive.

@ben-schwen
Member

Apparently this is also a problem for readLines, so I will file a report at the R bug tracker.

At R source:
https://github.com/wch/r-source/blob/709158ab78d11f3dbe855b6f3b71bf3892a1b3be/src/main/connections.c#L2418-L2447

@MichaelChirico
Member

Thanks @grainnemcguire, can you share readBin(f, raw(), 10) for your file?

@MichaelChirico
Member

Went ahead and filed https://bugs.r-project.org/show_bug.cgi?id=18768 before I forget

@grainnemcguire
Author

Thanks for the rapid response to this. Just FYI, the data field was a character field more than 3 characters long - the next 7 characters after the BZh were a mixture of letters and numbers.

Thanks in general for maintaining data.table. It's such a useful package.
