[R] stringr binding for str_sub() silently mishandles negative start/stop values #43960

Closed
@coussens

Description

Describe the bug, including details regarding any error messages, version, and platform.

I noticed some unusual behavior when using negative start/end values (i.e., counting from the end of the string) with str_sub() in arrow. The examples below contrast how str_sub() behaves on a tibble in R and on an Arrow table:

library(arrow)
library(tidyverse)
library(reprex)

str_tbl <- tibble(my_string = 'abcde')
str_tbl_a <- as_arrow_table(str_tbl)

# example #1: extract first through second-to-last characters from string

# works fine with a tibble
str_tbl %>% mutate(my_substring = str_sub(my_string, start = 1, end = -2))
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     abcd

# but with arrow: ruh-roh -- missing all characters
str_tbl_a %>% mutate(my_substring = str_sub(my_string, start = 1, end = -2)) %>% collect()
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     ""


# example #2: extract third-to-last through second-to-last characters from string

# works fine with a tibble
str_tbl %>% mutate(my_substring = str_sub(my_string, start = -3, end = -2))
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     cd

# but with arrow: ruh-roh -- missing a character
str_tbl_a %>% mutate(my_substring = str_sub(my_string, -3, -2)) %>% collect()
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     c


# example #3: extract third-to-last through last characters from string

# works fine with a tibble
str_tbl %>% mutate(my_substring = str_sub(my_string, start = -3, end = -1))
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     cde

# but with arrow: bizarrely, this is also fine
str_tbl_a %>% mutate(my_substring = str_sub(my_string, -3, -1)) %>% collect()
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     cde

Created on 2024-09-05 with reprex v2.1.1

Note: the above reprex was created on an Ubuntu 22.04 system running R 4.4.1 and Arrow 16.1.0
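
For reference, the tibble results above are consistent with mapping each negative position to a positive one, where -1 is the last character. Here is a minimal base-R sketch of that mapping. The helper names `resolve_pos` and `str_sub_ref` are hypothetical, used only for illustration, and are not part of stringr or arrow. The sketch handles a single string with scalar start/end and makes no claim about how either package implements this internally.

```r
# Hypothetical helper: map a stringr-style position to a 1-based index.
# Negative values count back from the end, so -1 is the last character.
resolve_pos <- function(pos, n) {
  if (pos < 0) n + pos + 1 else pos
}

# Hypothetical reference for one string and scalar start/end positions.
str_sub_ref <- function(x, start, end) {
  n <- nchar(x)
  substr(x, resolve_pos(start, n), resolve_pos(end, n))
}

str_sub_ref("abcde", 1, -2)   # "abcd" (example #1)
str_sub_ref("abcde", -3, -2)  # "cd"   (example #2)
str_sub_ref("abcde", -3, -1)  # "cde"  (example #3)
```

These match the tibble outputs in all three examples, so the Arrow results in examples #1 and #2 are the ones that deviate.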

Component(s)

R
