Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle NA column names from pivot_wider() #395

Merged
merged 12 commits into from
Oct 21, 2022

Conversation

markfairbanks
Copy link
Collaborator

@markfairbanks markfairbanks commented Oct 14, 2022

Closes #394

Pretty sure this is failing checks due to things updated in #393. So we can wait to merge until that is ready to go.

@markfairbanks
Copy link
Collaborator Author

markfairbanks commented Oct 17, 2022

FYI checks are passing now that #393 is merged (so we're definitely good for review).

@eutwt
Copy link
Collaborator

eutwt commented Oct 19, 2022

I don't follow the numeric substitution logic. I don't think I'd expect step_setnames(lazdt, c(4, NA), c('s', 'j')) to rename the second column toj. Is it related to the same issue that we'd want that behavior?

Do we even need to change step_setnames? It seems that pivot_wider() already attempts to replicate the data.table varnames created by dcast by creating new_vars, it's just wrong in this edge case. If we add

new_vars <- vctrs::vec_assign(new_vars, is.na(new_vars), "NA") 

right after this:

if (length(names_from) > 1) {
new_vars <- mutate(shallow_dt(data), .names_from = paste(!!!syms(names_from), sep = names_sep))
new_vars <- unique(pull(new_vars, .names_from))
} else {
new_vars <- unique(pull(data, !!sym(names_from)))
new_vars <- as.character(new_vars)
}

Does that solve the issue?

@markfairbanks
Copy link
Collaborator Author

markfairbanks commented Oct 20, 2022

I don't follow the numeric substitution logic. I don't think I'd expect step_setnames(lazdt, c(4, NA), c('s', 'j')) to rename the second column toj. Is it related to the same issue that we'd want that behavior?

Hmm now that you mention it it is probably better to just ignore the numeric case in replace_setnames_na(). Not sure why I thought it was necessary to build in 😅

@markfairbanks
Copy link
Collaborator Author

@eutwt ready for review again.

Do we even need to change step_setnames?

I think it's a good idea to fix this in step_setnames() so that it is fixed in all cases in the future. If you still think the fix should be in pivot_wider() let me know, I'm open to discuss more.

@eutwt
Copy link
Collaborator

eutwt commented Oct 20, 2022

In my mind it would be cleaner if the step_setnames function took valid old/new names and just applied them. It feels a little unintuitive that it would be changing your old/new names for you, possibly increasing confusion if we somehow have unintentionally missing values in input vectors, see examples below.

But, this is an extremely minor point, up to opinion, and the function is internal anyway. I think this is fine to merge as-is

Behavior comparison

Main:

new_name <- names(ToothGrowth)[6] # missing by mistaken index
step_setnames(
  lazy_dt(ToothGrowth),
  "dose", 
  new_name,
  in_place = FALSE
)
#> Error: NA in 'new' at positions [1]
old_name <- names(ToothGrowth)[6] # missing by mistaken index
step_setnames(
  lazy_dt(ToothGrowth %>% mutate("NA" = 3)),
  old_name, 
  "new",
  in_place = FALSE
)
#> Error: NA in 'new' at positions [1]

This PR:

new_name <- names(ToothGrowth)[6] # missing by mistaken index
step_setnames(
  lazy_dt(ToothGrowth),
  "dose", 
  new_name,
  in_place = FALSE
)
#> Source: local data table [60 x 3]
#> Call:   setnames(`_DT1`, "dose", "NA")
#> 
#>     len supp   `NA`
#>   <dbl> <fct> <dbl>
#> 1   4.2 VC      0.5
#> 2  11.5 VC      0.5
#> 3   7.3 VC      0.5
#> 4   5.8 VC      0.5
#> 5   6.4 VC      0.5
#> 6  10   VC      0.5
#> # … with 54 more rows
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
old_name <- names(ToothGrowth)[6] # missing by mistaken index
step_setnames(
  lazy_dt(ToothGrowth %>% mutate("NA" = 3)),
  old_name, 
  "new",
  in_place = FALSE
)
#> Source: local data table [60 x 4]
#> Call:   setnames(`_DT1`, "NA", "new")
#> 
#>     len supp   dose   new
#>   <dbl> <fct> <dbl> <dbl>
#> 1   4.2 VC      0.5     3
#> 2  11.5 VC      0.5     3
#> 3   7.3 VC      0.5     3
#> 4   5.8 VC      0.5     3
#> 5   6.4 VC      0.5     3
#> 6  10   VC      0.5     3
#> # … with 54 more rows
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

Copy link
Collaborator

@eutwt eutwt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See minor comment, but don't necessarily need to address

R/step-setnames.R Outdated Show resolved Hide resolved
@markfairbanks
Copy link
Collaborator Author

It feels a little unintuitive that it would be changing your old/new names for you, possibly increasing confusion if we somehow have unintentionally missing values in input vectors

Wouldn't any workaround we build run into this issue? (like the one mentioned in here)

@eutwt
Copy link
Collaborator

eutwt commented Oct 21, 2022

Basically I see this as being due to the new_vars vector being mis-constructed, not due to some problem in how the new names are applied. Concretely, the example below still errors in this PR, because the NA->"NA" is done in step_setnames() which isn't used.

lazy_dt(tibble(x = c('a', NA), y = 2)) %>% 
  pivot_wider(names_from = 'x', values_from = 'y')

@markfairbanks
Copy link
Collaborator Author

Ah yep, that makes sense.

@markfairbanks markfairbanks requested a review from eutwt October 21, 2022 18:15
Copy link
Collaborator

@eutwt eutwt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@markfairbanks markfairbanks changed the title Correctly handle NA column names in step_setnames() Correctly handle NA column names from pivot_wider() Oct 21, 2022
@markfairbanks markfairbanks removed the request for review from mgirlich October 21, 2022 18:23
@markfairbanks markfairbanks merged commit b42d698 into main Oct 21, 2022
@markfairbanks markfairbanks deleted the step_setnames-NA branch October 21, 2022 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to use names_glue within pivot_wider() when NA is the only unique value
2 participants