-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dplyr::arrange interfering with data.table's auto-indexing #5042
Comments
could you report this with dplyr as well? at first glance it seems like a bug on that end |
I think it might be related somewhat to #4889, and not fixed by #4893.
The last row returns an output:
which would remain unchanged in an operation such as as.data.frame.data.table
as.data.table.data.frame
|
As far as I can tell, dplyr explicitly preserves all attributes of the input. There needs to be a |
Thanks all for analyzing this! |
@jangorecki I've tested your patch in the meantime and as I expected it does not solve the problem. One possibility (with probably minimal future maintainence, but that tends to be wishful thinking) would be to clear the attributes in The following would fix the problem as well as obviating the need to call dplyr_row_slice.data.table <- function(data, i, ...) {
ans <- NextMethod()
setattr(ans,"sorted",NULL)
setattr(ans,"index",NULL)
setalloccol(ans)
return(ans)
} but that could start to get hairy as to who has to maintain it and beg the question of how much more should be done to make dplyr play nicely with data.table. |
Wouldn't getting A potential problem is if a different package or the user specifically creates an |
As @tlapak points out, dplyr defaults to preserving all attributes (since a reasonable number of people store arbitrary metadata in attributes and expect it to be maintained). As mentioned, @tlapak, you'd be welcome to do a PR if you wanted, and it would be great to get a sense from other data.table developers what attributes you think we should preserve/drop. |
Thanks all. #4893 now merged too and I confirmed it doesn't fix this one either, as @avimallu and @tlapak expected. The root of this issue with > DT
iso3c country income
<char> <char> <char>
1: MOZ Mozambique LIC
2: ZMB Zambia LMIC
3: ALB Albania UMIC
> attr(DT, "index")
integer(0)
attr(,"__iso3c")
[1] 3 1 2
>
> DT2 = vctrs::vec_slice(DT, c(3,1,2))
>
> DT2
iso3c country income
<char> <char> <char>
1: ALB Albania UMIC
2: MOZ Mozambique LIC
3: ZMB Zambia LMIC
> attr(DT2, "index")
integer(0)
attr(,"__iso3c")
[1] 3 1 2 # index left intact and now incorrect
> class(DT2)
[1] "data.table" "data.frame" # still a data.table The intention is that Line 153 in 6fa5cab
where that setkey(ans, NULL) removes the index attribute too as well the key. Test 304 was moved to be the plyr::arrange test in other.Rraw which tests that :data.table/inst/tests/other.Rraw Line 59 in 0613157
But for > vctrs::vec_slice
function (x, i)
{
.Call(vctrs_slice, x, i)
}
<bytecode: 0x5617bbed46d8>
<environment: namespace:vctrs> Since There's been some attention here to DT <- distinct(DT) %>% as.data.table()
....
# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table() But neither of those Even more minimal example : require(data.table)
require(dplyr)
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT[A=="c"] # correctly returns row 2
DT2 = dplyr::arrange(DT, A)
class(DT2) # retains data.tabe class
DT2[A=="c"] # incorrectly returns no rows
|
The require(data.table)
require(dplyr)
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT[A=="c"] # correctly returns row 2
data.table:::selfrefok(DT) # 1 i.e. ok
DT2 = dplyr::arrange(DT, A)
data.table:::selfrefok(DT2) # 0 i.e. not ok
class(DT2) # retains data.tabe class
DT2[A=="c", verbose=TRUE] # incorrectly returns no rows
# ...
# ... using index
# ... We shouldn't be using indexes when the selfref is not ok. |
#5084 fixes this and adds a But now we know what we're looking for, I suspect there are other entry points that need catching too. Current thoughts are that it's probably best to merge #5084 so that folk can test dev, and then follow up in a new issue on a more comprehensive review adding tests for dropping indexes when selfref detects the prior copy by another package. The advantage of this approach over adding That's the way I'm thinking currently, anyway. Happy to reconsider. |
Minimal reproducible example
This is a follow-up on this StackOverflow question/answer.
dplyr::arrange
seems to interfere with auto-indexing, leading to unexpected wrong results.A work-around is to disable auto-indexing with
options(datatable.auto.index = FALSE)
.As this issue is quite tricky, a warning, a mention to it in FAQ or a protection against this happening would be useful!
Output of sessionInfo()
The text was updated successfully, but these errors were encountered: