Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should support DT[FLAG, ] when FLAG is a logical vector of DT #4255

Closed
shrektan opened this issue Feb 23, 2020 · 8 comments
Closed

should support DT[FLAG, ] when FLAG is a logical vector of DT #4255

shrektan opened this issue Feb 23, 2020 · 8 comments

Comments

@shrektan
Copy link
Member

If DT contains a logical vector (call it the FLAG column), we have to use DT[FLAG==TRUE] while DT[FLAG] will raise errors. It's kind of confusing for the newbies.

Any reason that we can't support this?

library(data.table)
tbl <- data.table(FLAG = c(TRUE, FALSE), VALUE = c(1, 0))
tbl[FLAG == TRUE]
#>    FLAG VALUE
#> 1: TRUE     1
tbl[FLAG == FALSE]
#>     FLAG VALUE
#> 1: FALSE     0
tbl[FLAG]
#> Error in `[.data.table`(tbl, FLAG): FLAG is not found in calling scope but it is a column of type logical. If you wish to select rows where that column contains TRUE, or perhaps that column contains row numbers of itself to select, try DT[(col)], DT[DT$col], or DT[col==TRUE] is particularly clear and is optimized. When the first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table looks for var in calling scope.
tbl[!FLAG]
#> Error in `[.data.table`(tbl, !FLAG): FLAG is not found in calling scope but it is a column of type logical. If you wish to select rows where that column contains TRUE, or perhaps that column contains row numbers of itself to select, try DT[(col)], DT[DT$col], or DT[col==TRUE] is particularly clear and is optimized. When the first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table looks for var in calling scope.
flag <- tbl$FLAG
tbl[flag]
#>    FLAG VALUE
#> 1: TRUE     1
tbl[!flag]
#>     FLAG VALUE
#> 1: FALSE     0

Created on 2020-02-23 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.2 (2019-12-12)
#>  os       macOS Catalina 10.15.3      
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Asia/Shanghai               
#>  date     2020-02-23                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                            
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                    
#>  backports     1.1.5      2019-10-02 [1] CRAN (R 3.6.0)                    
#>  callr         3.4.1      2020-01-24 [1] CRAN (R 3.6.0)                    
#>  cli           2.0.1      2020-01-08 [1] CRAN (R 3.6.0)                    
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                    
#>  data.table  * 1.12.8     2019-12-09 [1] CRAN (R 3.6.0)                    
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                    
#>  devtools      2.2.1      2019-09-24 [1] CRAN (R 3.6.0)                    
#>  digest        0.6.23     2019-11-23 [1] CRAN (R 3.6.0)                    
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 3.6.0)                    
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.0)                    
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.0)                    
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.0)                    
#>  glue          1.3.1      2019-03-12 [1] CRAN (R 3.6.0)                    
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.6.0)                    
#>  htmltools     0.4.0.9002 2020-01-27 [1] Github (rstudio/htmltools@e07546c)
#>  knitr         1.27.2     2020-01-31 [1] local                             
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)                    
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)                    
#>  pkgbuild      1.0.6      2019-10-09 [1] CRAN (R 3.6.0)                    
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.6.0)                    
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 3.6.0)                    
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.6.0)                    
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.6.0)                    
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.0)                    
#>  Rcpp          1.0.3      2019-11-08 [1] CRAN (R 3.6.0)                    
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.6.0)                    
#>  rlang         0.4.4      2020-01-28 [1] CRAN (R 3.6.2)                    
#>  rmarkdown     2.1        2020-01-20 [1] CRAN (R 3.6.0)                    
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)                    
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)                    
#>  stringi       1.4.5      2020-01-11 [1] CRAN (R 3.6.0)                    
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.6.0)                    
#>  testthat      2.3.1      2019-12-01 [1] CRAN (R 3.6.0)                    
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.6.0)                    
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.0)                    
#>  xfun          0.12       2020-01-13 [1] CRAN (R 3.6.0)                    
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 3.6.2)                    
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
@MichaelChirico
Copy link
Member

DT[(FLAG)] works though; allowing DT[FLAG] opens the Pandora's box for ambiguity -- is FLAG in DT or is it in the parent frame? what if FLAG is in both DT & parent.frame?

@renkun-ken
Copy link
Member

I use DT[(FLAG)] in this case.

I think the data.table behavior/semantics should not be determined dynamically by whether the column exists in the data.table or it would be hard for script reader to guess the intention of the code because the reader needs to know if this column exists. If the semantics is quite static, it would be easier to manage and work with.

@shrektan
Copy link
Member Author

shrektan commented Feb 23, 2020

Sorry, I don't get your guys' point. Even within (), the variable is determined dynamically. Isn't it?
That's said, the variable inside () doesn't guarantee it's a column inside of DT.

library(data.table)
tbl <- data.table(VALUE = c(1, 0))
flag <- c(FALSE, TRUE)
tbl[(flag)]
#>    VALUE
#> 1:     0
tbl[, flag := !flag]
tbl[(flag)]
#>    VALUE flag
#> 1:     1 TRUE

Created on 2020-02-23 by the reprex package (v0.3.0)

@renkun-ken
Copy link
Member

renkun-ken commented Feb 23, 2020

Sorry for not making my point clear.

My point is that we already use a lot of NSE in data table syntax by customizing the interpretation of expressions in [.data.table which is dynamic but the semantics or the meaning of the code we are trying to define is not necessarily (and in my point of view, should not) dynamically determined (not depending on if a column exists or something we only know at runtime).

Suppose we are only looking at the following code:

tbl[flag]

If the meaning or the semantics of the code depends on if flag is a column in tbl, then one has to know if tbl has flag at runtime at this point to know what the author is trying to do.

By contrast, if tbl[flag] is always evaluated in the same way without depending on whether flag is a column of tbl, then the reader of this code knows what the author is trying to do without having to know any runtime information, if the reader knows the evaluation rule. At this level, @shrektan is trying to say that the evaluation rule here might not be natural enough.

Currently, the evaluation rule of tbl[flag] is simple: if flag is a symbol then dt assumes it is an external selector and always tries to find it in the calling scope. Although the evaluation of flag follows dynamic scoping, but the semantics or how it is evaluated is predetermined and is not determined at runtime, not depending on if a column exists.

This makes the code much easier to maintain and read in complex and collaborative data project.

@MichaelChirico
Copy link
Member

Agree with @renkun-ken -- in principle, we could examine whether flag is a logical column of tbl during evaluation of [, and then assume the user is trying to subset with this column;

The current behavior allows (with just two extra characters) the user to signal the intent that flag is a column unambiguously.

The former comes at the cost of (1) increasing the complexity of our codebase (2) introducing some assumptions which are hard to get around [i.e., suppose flag shouldn't be read as being a column of tbl, how can the user assert this so that we don't treat it like that?]

@MichaelChirico
Copy link
Member

About whether ( forces flag to be a column of tbl -- it doesn't , but it clearly establishes the scoping rules --

  • with (, scoping goes: inside tbl, then parent.frame(), then upwards as normal
  • without (, scoping goes: parent.frame(), then upwards as normal

@shrektan
Copy link
Member Author

shrektan commented Feb 23, 2020

Thanks, I see your points. Since I always use UPPERCASE as the column and smallercase as the variable name to avoid the ambiguity, I ignored one possibility that is the i in DT[i] could be more things like a character vector or another data.table.

By always evaluating the i above in the external frame indeed is better to understand in such cases (as well as the maintainability).

So I'm closing this issue.

@renkun-ken @MichaelChirico Thanks for both of your answers.

@ColeMiller1
Copy link
Contributor

ColeMiller1 commented Feb 23, 2020

@MichaelChirico [ already checks to see if flag is a logical. That's why there's a helpful error telling users that they can use (FLAG) | dt$FLAG | FLAG == TRUE. But it is surprising behavior to a user- why does dt[FLAG & FLAG] work?

We could easily accommodate the request with adding an if else statement to the symbol evaluation branch. I compiled the below changes - only test 1773.01 is affected. edit: added else statement at the end to make it work for real on tbl[FLAG].

##########see line 366 in data.table.R
      # isub is a single symbol name such as B in DT[B]  
      i = try(eval(isub, parent.frame(), parent.frame()), silent=TRUE)
      if (inherits(i,"try-error")) {
        # must be "not found" since isub is a mere symbol
        col = try(eval(isub, x), silent=TRUE)  # is it a column name?
  #########wrap with new if statement#################
 if(typeof(col) != "logical") { #####new line 1

        msg = if (inherits(col,"try-error")) " and it is not a column name either."
        else paste0(" but it is a column of type ", typeof(col),". If you wish to select rows where that column contains TRUE",
                    ", or perhaps that column contains row numbers of itself to select, try DT[(col)], DT[DT$col], or DT[col==TRUE] is particularly clear and is optimized.")
        stop(as.character(isub), " is not found in calling scope", msg,
             " When the first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table looks for var in calling scope.")
}  else { #####new
          i = col   ####new
        } ####new

Regarding scoping, the description is inconsistent with j. edit - had these reversed initially.

  • without (, scoping is limited to inside tbl
  • with (, scoping is limited to only the parent.frame().
library(data.table)
tbl <- data.table(FLAG = c(TRUE, FALSE), VALUE = c(1, 0))
col = "surprise"
tbl[, col := 1L][]
#>      FLAG VALUE   col
#>    <lgcl> <num> <int>
#> 1:   TRUE     1     1
#> 2:  FALSE     0     1
tbl[, (col) := 1L][]
#>      FLAG VALUE   col surprise
#>    <lgcl> <num> <int>    <int>
#> 1:   TRUE     1     1        1
#> 2:  FALSE     0     1        1

tbl <- data.table(FLAG = c(TRUE, FALSE), VALUE = c(1, 0))
tbl[, (new_col) := 1L]
#> Error in eval(lhs, parent.frame(), parent.frame()): object 'new_col' not found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants