dplyr::arrange interfering with data.table's auto-indexing #5042

Waldi73 · 2021-06-11T21:52:56Z

`Minimal reproducible example`

This is a follow-up on this StackOverflow question/answer.
dplyr::arrange seems to interfere with auto-indexing, leading to unexpected wrong results.
A work-around is to disable auto-indexing with options(datatable.auto.index = FALSE).
As this issue is quite tricky, a warning, a mention to it in FAQ or a protection against this happening would be useful!

library(dplyr); 
library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE) # Default
DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table uses the old index where row were rearranged:
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# this works because (...) prevents the parser to use auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

`Output of sessionInfo()`

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.0 dplyr_1.0.6

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2021-06-12T01:28:38Z

could you report this with dplyr as well? at first glance it seems like a bug on that end

avimallu · 2021-06-12T03:53:35Z

I think it might be related somewhat to #4889, and not fixed by #4893. dplyr, AFAIK works with either tibbles or data.frames, so a data.table's index attribute would be ignored during that conversion but retained, leading to an incorrect result. I took a quick look at as.data.table.data.frame, which I believe would be called here, and it doesn't look like it removed the index:

require(data.table)
mtcars = as.data.table(mtcars)
setindex(mtcars, mpg)
mtcars2 = as.data.frame(mtcars)
attributes(mtcars2)$index

The last row returns an output:

integer(0)
attr(,"__mpg")
 [1] 15 16 24  7 17 31 14 23 22 29 12 13 11  6  5 10 25 30  1  2  4 32 21  3  9  8 27 26 19 28 18 20

which would remain unchanged in an operation such as dplyr::arrange followed by a as.data.table. Source codes for easy reference:

as.data.frame.data.table

function (x, ...) 
{
    ans = copy(x)
    setattr(ans, "row.names", .set_row_names(nrow(x)))
    setattr(ans, "class", "data.frame")
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}
<bytecode: 0x00000256511304d0>
<environment: namespace:data.table>

as.data.table.data.frame

function (x, keep.rownames = FALSE, key = NULL, ...) 
{
    if (!identical(keep.rownames, FALSE)) {
        ans = data.table(rn = rownames(x), x, keep.rownames = FALSE, 
            key = key)
        if (is.character(keep.rownames)) 
            setnames(ans, "rn", keep.rownames[1L])
        return(ans)
    }
    if (any(vapply_1i(x, function(xi) length(dim(xi))))) {
        return(as.data.table.list(as.list(x), keep.rownames = keep.rownames, 
            ...))
    }
    ans = copy(x)
    setattr(ans, "row.names", .set_row_names(nrow(x)))
    setattr(ans, "class", .resetclass(x, "data.frame"))
    setalloccol(ans)
    setkeyv(ans, key)
    ans
}
<bytecode: 0x0000025651853d60>
<environment: namespace:data.table>

jangorecki · 2021-06-12T06:22:01Z

I don't have dplyr on hand to try that out, but as said by @avimallu, as.data.frame should remove index. I pushed that change to #5043 Let me know if it does fix the problem so I will add NEWS entry.

tlapak · 2021-06-12T06:27:34Z

As far as I can tell, dplyr explicitly preserves all attributes of the input. There needs to be a dplyr_row_slice.data.table function to accommodate the attributes that depend on row order, see
https://github.com/tidyverse/dplyr/blob/55917b6194aa8ba5b0ad8b2109ac746193996c96/R/generics.R#L25-L47

Waldi73 · 2021-06-12T06:36:20Z

Thanks all for analyzing this!
@MichaelChirico , reported on dplyr/issues/5917 as well

tlapak · 2021-06-14T10:13:56Z

@jangorecki I've tested your patch in the meantime and as I expected it does not solve the problem. One possibility (with probably minimal future maintainence, but that tends to be wishful thinking) would be to clear the attributes in as.data.table and setDT but I'm not really a fan of that.

The following would fix the problem as well as obviating the need to call as.data.table at all here:

dplyr_row_slice.data.table <- function(data, i, ...) {
  ans <- NextMethod()
  setattr(ans,"sorted",NULL)
  setattr(ans,"index",NULL)
  setalloccol(ans)
  return(ans)
}

but that could start to get hairy as to who has to maintain it and beg the question of how much more should be done to make dplyr play nicely with data.table.

avimallu · 2021-06-14T10:59:30Z

Wouldn't getting as.data.table and setDT to recognize an index attribute and updating it at the time of function call work that doesn't depend on other packages and have much maintenance overhead? I don't know much about the source code for data.table, but I do remember reading discussions about a computationally efficient manner of checking if a vector is already ordered that is in use right now.

A potential problem is if a different package or the user specifically creates an index attribute for a data.frame or derived object, that conflicts with how data.table uses that attribute - a chance that is much less likely.

hadley · 2021-06-16T22:45:43Z

As @tlapak points out, dplyr defaults to preserving all attributes (since a reasonable number of people store arbitrary metadata in attributes and expect it to be maintained). As mentioned, dplyr_row_slice.data.table method would fix the problem, but it's unclear where it should live — since it affects the combination of dplyr and data.table, it feels odd to put it in either. I think it's slightly more correct for it to live in data.table (since y'all know best how to handle your own attributes), but would that require adding dplyr as a suggested package and a little work to conditionally register the method. Given that, I'd be happy to include the method in dplyr.

@tlapak, you'd be welcome to do a PR if you wanted, and it would be great to get a sense from other data.table developers what attributes you think we should preserve/drop.

mattdowle · 2021-07-27T21:07:58Z

Thanks all. #4893 now merged too and I confirmed it doesn't fix this one either, as @avimallu and @tlapak expected.

The root of this issue with dplyr::arrange appears to be vctrs::vec_slice which dplyr:::dplyr_row_slice.data.frame calls.

> DT
    iso3c    country income
   <char>     <char> <char>
1:    MOZ Mozambique    LIC
2:    ZMB     Zambia   LMIC
3:    ALB    Albania   UMIC
> attr(DT, "index")
integer(0)
attr(,"__iso3c")
[1] 3 1 2
> 
> DT2 = vctrs::vec_slice(DT, c(3,1,2))
> 
> DT2
    iso3c    country income
   <char>     <char> <char>
1:    ALB    Albania   UMIC
2:    MOZ Mozambique    LIC
3:    ZMB     Zambia   LMIC
> attr(DT2, "index")
integer(0)
attr(,"__iso3c")
[1] 3 1 2    # index left intact and now incorrect
> class(DT2)
[1] "data.table" "data.frame"  # still a data.table

The intention is that [.data.table detects that dplyr and vctrs are not data.table-aware, and any data.frame-code (in this case [ on a data.table from those not-data-table-aware packages) gets redirected through the cedta() catch which drops the key and indices. That happens here :

data.table/R/data.table.R

Line 153 in 6fa5cab

if (!missing(i) && is.data.table(ans)) setkey(ans, NULL) # See test 304

where that setkey(ans, NULL) removes the index attribute too as well the key. Test 304 was moved to be the plyr::arrange test in other.Rraw which tests that :

data.table/inst/tests/other.Rraw

Line 59 in 0613157

if (loaded[["plyr"]]) {

But for dplyr::arrange (dplyr not plyr), vctrs::vec_slice is implemented in C :

> vctrs::vec_slice
function (x, i) 
{
    .Call(vctrs_slice, x, i)
}
<bytecode: 0x5617bbed46d8>
<environment: namespace:vctrs>

Since vctrs::vec_slice is doing its own subsetting at C level, it's bypassing [ method dispatch.

There's been some attention here to as.data.table apparently because there's two calls to as.data.table in the original reproducible example at the top :

DT <- distinct(DT) %>% as.data.table()
....
# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

But neither of those as.data.table() are needed. dplyr::distinct(DT) and dplyr::arrange(DT) already retain the data.table class so those as.data.table() calls just return their input. Hence, even if any potential fix was applied to as.data.table (for example like checking any index was valid and dropping it if not) wouldn't work if as.data.table() isn't used.

Even more minimal example :

require(data.table)
require(dplyr)
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT[A=="c"]   # correctly returns row 2
DT2 = dplyr::arrange(DT, A)
class(DT2)   # retains data.tabe class
DT2[A=="c"]  # incorrectly returns no rows

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.6       data.table_1.14.1

loaded via a namespace (and not attached):
 [1] fansi_0.4.2      utf8_1.2.1       crayon_1.4.1     R6_2.5.0        
 [5] lifecycle_1.0.0  magrittr_2.0.1   pillar_1.6.0     rlang_0.4.11    
 [9] vctrs_0.3.8      generics_0.1.0   ellipsis_0.3.2   glue_1.4.2      
[13] purrr_0.3.4      compiler_4.1.0   pkgconfig_2.0.3  tidyselect_1.1.1
[17] tibble_3.1.1

mattdowle · 2021-07-27T21:49:04Z

The .internal.selfref is detecting the invalid status as intended :

require(data.table)
require(dplyr)
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT[A=="c"]   # correctly returns row 2
data.table:::selfrefok(DT)  # 1 i.e. ok
DT2 = dplyr::arrange(DT, A)
data.table:::selfrefok(DT2)  # 0 i.e. not ok
class(DT2)   # retains data.tabe class
DT2[A=="c", verbose=TRUE]  # incorrectly returns no rows
# ...
# ... using index
# ...

We shouldn't be using indexes when the selfref is not ok.

mattdowle · 2021-07-28T19:10:42Z

#5084 fixes this and adds a dplyr::arrange test to test.data.table("other.Rraw")

But now we know what we're looking for, I suspect there are other entry points that need catching too. setkey.R::getindex() could be a central place to drop index if selfref isn't ok. Would need to think of a test to trigger that without going through bmerge() which is now caught by this fix to shallow(). [.data.table could be another central place, at the top.

Current thoughts are that it's probably best to merge #5084 so that folk can test dev, and then follow up in a new issue on a more comprehensive review adding tests for dropping indexes when selfref detects the prior copy by another package.

The advantage of this approach over adding dplyr_row_slice.data.table is that it tackles the root more directly and thus would fix other packages that may now or in future do the same thing as vctrs::vec_slice at C level which bypasses [ dispatch. The selfref mechanism was introduced to catch base R copies after all, so it's probably possible to reproduce an invalid index using base R (like the test in #4893) rather than dplyr::arrange (and if so #5084+#5085 would fix that too). We're just using dplyr::arrange as the first known example for the test really. There may be other methods in dplyr that find their way to vctrs::vec_slice, and maybe there are other functions in vctrs that work similarly to vec_slice, or dplyr's method names might change in future. All that should be covered by the selfref approach, so we insulate ourselves from maintenance, and users from specific version incompatibilities.

That's the way I'm thinking currently, anyway. Happy to reconsider.

jangorecki mentioned this issue Jun 12, 2021

as.df removes index #5043

Merged

Waldi73 mentioned this issue Jun 16, 2021

[dplyr::arrange] interfering with data.table's auto-indexing tidyverse/dtplyr#259

Closed

mattdowle added this to the 1.14.1 milestone Jun 21, 2021

This was referenced Jul 28, 2021

tidy plyr::arrange (not dplyr) comments and test #5082

Merged

don't use index when selfref detects prior copy by another package #5084

Merged

mattdowle mentioned this issue Jul 28, 2021

More tests detecting invalid index using selfrefok #5085

Open

4 tasks

mattdowle closed this as completed in #5084 Jul 28, 2021

mattdowle mentioned this issue Sep 1, 2021

revdep chicane affected #5125

Closed

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dplyr::arrange interfering with data.table's auto-indexing #5042

dplyr::arrange interfering with data.table's auto-indexing #5042

Waldi73 commented Jun 11, 2021 •

edited

Loading

MichaelChirico commented Jun 12, 2021

avimallu commented Jun 12, 2021 •

edited

Loading

jangorecki commented Jun 12, 2021

tlapak commented Jun 12, 2021

Waldi73 commented Jun 12, 2021 •

edited

Loading

tlapak commented Jun 14, 2021

avimallu commented Jun 14, 2021 •

edited

Loading

hadley commented Jun 16, 2021

mattdowle commented Jul 27, 2021 •

edited

Loading

mattdowle commented Jul 27, 2021 •

edited

Loading

mattdowle commented Jul 28, 2021 •

edited

Loading

dplyr::arrange interfering with data.table's auto-indexing #5042

dplyr::arrange interfering with data.table's auto-indexing #5042

Comments

Waldi73 commented Jun 11, 2021 • edited Loading

Minimal reproducible example

Output of sessionInfo()

MichaelChirico commented Jun 12, 2021

avimallu commented Jun 12, 2021 • edited Loading

jangorecki commented Jun 12, 2021

tlapak commented Jun 12, 2021

Waldi73 commented Jun 12, 2021 • edited Loading

tlapak commented Jun 14, 2021

avimallu commented Jun 14, 2021 • edited Loading

hadley commented Jun 16, 2021

mattdowle commented Jul 27, 2021 • edited Loading

mattdowle commented Jul 27, 2021 • edited Loading

mattdowle commented Jul 28, 2021 • edited Loading

Waldi73 commented Jun 11, 2021 •

edited

Loading

`Minimal reproducible example`

`Output of sessionInfo()`

avimallu commented Jun 12, 2021 •

edited

Loading

Waldi73 commented Jun 12, 2021 •

edited

Loading

avimallu commented Jun 14, 2021 •

edited

Loading

mattdowle commented Jul 27, 2021 •

edited

Loading

mattdowle commented Jul 27, 2021 •

edited

Loading

mattdowle commented Jul 28, 2021 •

edited

Loading