Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I had expected rbindlist() to preserve attributes, but it doesn't. It'd be a nice feature; or an important limitation to disclose in the documentation. #5569

Open
cthombor opened this issue Dec 20, 2022 · 4 comments · May be fixed by #5570

Comments

@cthombor
Copy link

cthombor commented Dec 20, 2022

# Minimal reproducible example

library(data.table)
x <- data.table(0)
setattr(x,"xa",0)
y <- data.table(1)
setattr(y,"ya",1)
z <- rbindlist(list(x,y))
attributes(z)
#> $names
#> [1] "V1"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x00000239858df980>
# I had expected attributes(z) to include at least "xa", or both "xa" and "ya".

Created on 2022-12-20 with reprex v2.0.2

# Output of sessionInfo()

sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_New Zealand.utf8  LC_CTYPE=English_New Zealand.utf8   
#> [3] LC_MONETARY=English_New Zealand.utf8 LC_NUMERIC=C                        
#> [5] LC_TIME=English_New Zealand.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.14   knitr_1.41        magrittr_2.0.3    R.cache_0.16.0   
#>  [5] rlang_1.0.6       fastmap_1.1.0     stringr_1.5.0     styler_1.8.1     
#>  [9] highr_0.9         tools_4.2.2       xfun_0.34         R.oo_1.25.0      
#> [13] cli_3.4.1         withr_2.5.0       htmltools_0.5.3   yaml_2.3.6       
#> [17] digest_0.6.29     lifecycle_1.0.3   purrr_0.3.5       vctrs_0.5.0      
#> [21] R.utils_2.12.2    fs_1.5.2          glue_1.6.2        evaluate_0.18    
#> [25] rmarkdown_2.18    reprex_2.0.2      stringi_1.7.8     compiler_4.2.2   
#> [29] R.methodsS3_1.8.2

Created on 2022-12-20 with reprex v2.0.2

@ben-schwen
Copy link
Member

ben-schwen commented Dec 20, 2022

To be fair the base equivalent does also not preserve all attributes.

x <- data.frame(a=0)
attr(x, "xa") <- 1
y <- data.frame(a=1)
attr(y, "ya") <- 1
z <- do.call(rbind, list(x,y))
attributes(z)
#> $names
#> [1] "a"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $xa
#> [1] 1
#> 
#> $class
#> [1] "data.frame"

Neither does dplyr::bind_rows()

x <- data.frame(a=0)
y <- data.frame(a=1)
attr(x, "xa") <- 1
attr(y, "ya") <- 1
attributes(dplyr::bind_rows(x, y))
#> $names
#> [1] "a"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $xa
#> [1] 1

@cthombor
Copy link
Author

Indeed, I'd be happy if rbindlist() was similar to either rbind() or dplyr::bind_rows() with respect to the preservation (or destruction, or copying) of table-level attributes. However I can't find any documentation of their intended behaviour! Furthermore, I wouldn't be surprised if the modify-in-place reference semantics of data.table methods will make it impossible (or undesirable) to mimic the behaviour of either of those data.frame methods.

My use-case for these methods is to add a row of experimental data to an existing table. I now realise that this is much more easily done with data.frame methods than with data.table methods:

xdf <- data.frame(data=c(0))
attr(xdf, "xa") <- 0
xdf <- dplyr::bind_rows(xdf, c(data=1))
# I expect attributes(xdf) to include "xa".  (PASS)
attributes(xdf)
#> $names
#> [1] "data"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $xa
#> [1] 0

xdf <- data.frame(data=c(0))
attr(xdf, "xa") <- 0
xdf <- rbind(xdf, c(data=1))
# I expect attributes(xdf) to include "xa".  (PASS)
attributes(xdf)
#> $names
#> [1] "data"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $xa
#> [1] 0
#> 
#> $class
#> [1] "data.frame"

library(data.table)
xdt <- data.table(data=0)
setattr(xdt, "xa", 0)
xdt <- dplyr::bind_rows(xdt, c(data=1))
# I expect class(xdt) to have "data.table" as its first element (PASS)
class(xdt)
#> [1] "data.table" "data.frame"
# I expect xdt to have an attribute "xa" (PASS)
attributes(xdt)
#> $names
#> [1] "data"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x000002c36fbc4bb0>
#> 
#> $xa
#> [1] 0
# I expect class(xdt) to have a valid .internal.selfref structure.  (FAIL)
data.table:::selfrefok(xdt)
#> [1] 0

library(data.table)
xdt <- data.table(data=0)
setattr(xdt, "xa", 0)
xdt <- rbindlist(list(xdt, data.table(data=1)))
# I expect class(xdt) to have a valid .internal.selfref structure.  (PASS)
data.table:::selfrefok(xdt)
#> [1] 1
# I expect attributes(xdt) to include "xa".  (FAIL)
attributes(xdt)
#> $names
#> [1] "data"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x000002c36fbc4bb0>

I'll leave it to the DT team to decide whether to continue treating this as a feature-request, or (as I'd now suggest) as a request to amend the documentation of rbindlist(). A sentence would suffice, I should think, to inform users that -- unlike DF <- rbind(DF, ...) and DF <- dplyr::bind_rows(DF, ... ) -- the new data.table DT created by DT <- rbindlist(l, ...) has none of the attributes of the tables in its argument l aside from their column names.

Your prompt response to my feature-request is very impressive, thanks! I'm extremely grateful for the bazillions of hours of volunteer labour, over the decades, which is making my current coding experience in R so so so much more pleasant and productive than my only prior experience with stochastic experimentation -- which was way back in the early 1990s, using S (yeah I'm an old codger! ;-)

@ben-schwen ben-schwen linked a pull request Dec 21, 2022 that will close this issue
4 tasks
@ben-schwen
Copy link
Member

Preserve attributes with rbindlist seems trickier than I initially thought, since we also store key and indices as attributes.

This is something that will definitely break a data.table when using dplyr::bind_rows together with a keyed data.table as first argument (similar to #5361)

@cthombor
Copy link
Author

cthombor commented Dec 21, 2022

Yeah, it'd be a big shift in semantics of rbindlist, if its first arg was redefined to be a DT that'll be modified by reference, followed by a list of DTs to be concatenated! There must be many people (and codes) who are currently using rbindlist as a UNION ALL, i.e. are relying on it not to modify its first argument.

So... maybe... it'd make sense to define a new method for DTs which (essentially) overloads append with the usual DT design pattern of modifying its first arg by reference and returning it invisibly? Maybe call it rappend? As with rbindlist I guess it'd be invalidating the keys, to avoid the performance overhead of rebuilding them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants