Another seemingly bug in dealing with NAs #5273

Open
kwhkim opened this issue Nov 25, 2021 · 4 comments

kwhkim commented Nov 25, 2021

There were a lot of requirements to read through... I think I did my best :)
NA seems to cause all sorts of problems...

Here is my Minimal Example.

library(data.table)

txt = r"(
NA,"NA","",
"NA",,"",NA
"",NA,,"NA"
)"
fread(txt) # whether header=TRUE or FALSE
#    V1 V2 V3 V4
# 1: NA NA NA NA
# 2: NA NA NA NA
# 3: NA NA NA NA

Here is what it should do (the only change is that "NA" is replaced with "XX"):

txt = r"(
NA,"XX","",
"XX",,"",NA
"",NA,,"XX"
)"
fread(txt, header=FALSE)
#      V1   V2 V3   V4
# 1: <NA>   XX NA     
# 2:   XX      NA <NA>
# 3:      <NA> NA   XX

except for one thing: V3 is all NA. Where did the "" go?

It does not matter whether or not you set na.strings = 'NA'.

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949    LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C                 LC_TIME=Korean_Korea.949    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2 dplyr_1.0.7      

loaded via a namespace (and not attached):
 [1] fansi_0.5.0      assertthat_0.2.1 utf8_1.2.2       crayon_1.4.2     R6_2.5.1        
 [6] DBI_1.1.1        lifecycle_1.0.1  magrittr_2.0.1   pillar_1.6.4     cli_3.1.0       
[11] rlang_0.4.12     rstudioapi_0.13  vctrs_0.3.8      generics_0.1.1   ellipsis_0.3.2  
[16] tools_4.1.0      glue_1.4.2       purrr_0.3.4      compiler_4.1.0   pkgconfig_2.0.3 
[21] tidyselect_1.1.1 tibble_3.1.5    

kwhkim commented Nov 25, 2021

I tried read.csv() and, voilà, it behaves basically the same as data.table, so I wonder whether data.table's fread is in some way based on read.csv().

Here is an example for read.csv:

> read.csv('test2_read_csv.csv', header=FALSE)
    V1   V2 V3   V4
1 <NA>   XX NA     
2   XX      NA <NA>
3      <NA> NA   XX
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'test2_read_csv.csv'
> read.csv('test_read_csv.csv', header=FALSE)
  V1 V2 V3 V4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'test_read_csv.csv'

test_read_csv.csv and test2_read_csv.csv are files with the same contents as the txt variables above (the "NA" version and the "XX" version, respectively).

And readr::read_csv() behaves just like read.csv() and fread(). Strange.
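
For anyone who wants to reproduce the read.csv() comparison without creating the two files, here is a minimal base-R sketch. It assumes that passing the string through the `text=` argument behaves the same as reading the files above; the names `txt_na`, `txt_xx`, `df_na`, and `df_xx` are only illustrative.

```r
# Raw strings need R >= 4.0.0; the contents mirror the two test files.
txt_na <- r"(
NA,"NA","",
"NA",,"",NA
"",NA,,"NA"
)"

# Build the "XX" variant by replacing only the quoted "NA" fields.
txt_xx <- gsub('"NA"', '"XX"', txt_na, fixed = TRUE)

df_na <- read.csv(text = txt_na, header = FALSE)
df_xx <- read.csv(text = txt_xx, header = FALSE)

# Column classes show how each column was type-guessed.
print(sapply(df_na, class))
print(sapply(df_xx, class))
```

Comparing the two class vectors side by side makes it easy to see which columns change type when the quoted "NA" is swapped for "XX".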

@avimallu
Contributor

My interpretation of

txt = r"(
NA,"XX","",
"XX",,"",NA
"",NA,,"XX"
)"

is based on cat(txt), which comes out as:

NA,"XX","",
"XX",,"",NA
"",NA,,"XX"

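The leading blank line in the raw string can be confirmed directly in base R. This is a small sketch; the variable name `lines` is just illustrative.

```r
txt <- r"(
NA,"XX","",
"XX",,"",NA
"",NA,,"XX"
)"

# The raw string starts right after `(`, so its first character is a newline.
print(startsWith(txt, "\n"))

# Splitting on newlines makes the empty first line visible as "".
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
print(lines)
```
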
Notice the empty line on the top that fread seems to skip automatically. The documentation states that:

,"", is unambiguous and read as an empty string.

Which explains why the third column is read as a set of logical NAs in your sample code. Now, the output:

> fread(txt, header = FALSE)
       V1     V2     V3     V4
   <char> <char> <lgcl> <char>
1:   <NA>     XX     NA       
2:     XX            NA   <NA>
3:          <NA>     NA     XX

is a little confusing, because the field after the trailing comma , in the first line is read as blank, but that is consistent with how ,, in (row, column) positions (2, 2) and (3,1) is read as a blank string. It still contradicts the documentation:

By default, ",," for columns of all types, including type character is read as NA for consistency

but that is also probably because fread detects that ,"", is used to explicitly denote NA values in the file, so ,"", overrides this default.


kwhkim commented Dec 9, 2021

@avimallu
Well, the main issue here is that although the only difference is replacing "NA" with "XX",
the results are totally different. With "NA", everything is NA!
That is weird. I have not looked into the source code,
but I think it is based on a wrong assumption.
P.S. The first blank line is just for readability;
I checked, and the result is the same with or without it.


kwhkim commented Dec 9, 2021

According to my analysis of text data files,
given " as the quotation mark and , as the column separator,
I think both ,, and ,NA, should be read as NA, while ,"", and ,"NA", should be read as the empty string and the string "NA".
The philosophy is that if a character or sequence of characters is given a special meaning beyond its literal one,
there should be some way to recover its literal meaning.
That is why we need the quotation mark in the first place: "," should be recognized as a literal comma rather than a column separator.
,, and ,NA, could be read as "" and "NA", because they do not contain , (the column separator).
But we can instead give them another meaning (missing value). In that case,
we must have another way to represent the strings "" and "NA", and ,"", and ,"NA", naturally do that job.
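
The rule proposed above can be sketched in a few lines of base R. This only illustrates the proposal and is not how fread or read.csv work internally; `parse_field` is a hypothetical helper, and it deliberately ignores escaped quotes inside quoted fields.

```r
# Hypothetical helper: interpret one raw CSV field under the proposed rule.
# Unquoted empty or unquoted NA -> missing; quoted fields keep their text.
parse_field <- function(raw) {
  quoted <- nchar(raw) >= 2 && startsWith(raw, '"') && endsWith(raw, '"')
  if (quoted) {
    # Strip the surrounding quotes: "" becomes an empty string, "NA" stays "NA".
    return(substr(raw, 2, nchar(raw) - 1))
  }
  if (raw == "" || raw == "NA") NA_character_ else raw
}

# Raw fields as they would appear between separators in the file.
fields <- c('', 'NA', '""', '"NA"', 'XX')
parsed <- vapply(fields, parse_field, character(1), USE.NAMES = FALSE)
print(parsed)
```

Under this rule the quoted and unquoted forms never collide, so both a missing value and the literal strings "" and "NA" stay representable in the same column.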
