Non-ascii symbols on systems without UTF-8 #6339

Closed
ben-schwen opened this issue Aug 2, 2024 · 4 comments · Fixed by #6375 or #6567

Comments

@ben-schwen
Member

The test for #4711 does not work on systems without UTF-8 encoding, e.g. our test-lin-rel-vanilla container.

Output from a new container started from the image registry.gitlab.com/jangorecki/dockerfiles/r-base-gcc:

DT = data.table(a = rep(1:3, 2))
setnames(DT, "a", "a\U00F1o")
DT[ , .N, 'a<U+00F1>o']
#>    a<U+00F1>o     N
#>         <int> <int>
#> 1:          1     2
#> 2:          2     2
#> 3:          3     2
#> Warning message:
#> In eval(bysub, x, parent.frame()) :
#>   unable to translate 'a<U+00F1>o' to native encoding
DT[ , .N, a<U+00F1>o]
#> Error: unexpected symbol in "DT[ , .N, a<U+00F1"
@aitap
Contributor

aitap commented Aug 18, 2024

Bare variable names (symbols) are required to be in the native encoding. On systems incapable of representing ñ in the native encoding (LC_ALL=C, or, e.g., KOI8-R), there is no way to preserve an ñ in a variable name.

On non-UTF-8 systems that can represent ñ in the native encoding, the code will work fine:

$ LC_ALL=en_GB.ISO-8859-15 luit R -q -s -e 'as.name("\uf1"); parse(text = "DT[, .N, a\U00F1o]$N[1L]")'
ñ
expression(DT[, .N, año]$N[1L])

If the current locale has no ñ, translateChar(), which parse() calls internally, substitutes an escape such as <U+00F1> and you get a syntax error, but iconv() seems to help:

# this works
LC_ALL=en_GB.ISO-8859-15 luit R -q -s -e 'text <- iconv("DT[, .N, a\U00F1o]$N[1L]", "UTF-8", ""); if (!is.na(text)) parse(text = text)'
# expression(DT[, .N, año]$N[1L])

# this doesn't crash
LC_ALL=C R -q -s -e 'text <- iconv("DT[, .N, a\U00F1o]$N[1L]", "UTF-8", ""); if (!is.na(text)) parse(text = text)'
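
The same guard can be sketched as a small R helper. This is only an illustration: `parse_if_convertible` is a hypothetical name, not something in data.table, and it assumes that iconv() returns NA when the text cannot be converted to the native encoding.

```r
# Hypothetical helper (not part of data.table): parse `code` only if every
# element can be converted from UTF-8 to the native encoding.
# Assumption: iconv() signals a failed conversion by returning NA.
parse_if_convertible <- function(code) {
  text <- iconv(code, from = "UTF-8", to = "")  # "" means the current locale's encoding
  if (anyNA(text)) {
    return(NULL)  # the caller can skip the test instead of hitting a syntax error
  }
  parse(text = text)
}

# Usage: evaluate only when parsing was possible in this locale.
expr <- parse_if_convertible("DT[, .N, a\u00f1o]$N[1L]")
if (is.null(expr)) message("skipping: not representable in the native encoding")
```

Returning NULL rather than raising an error lets a test decide whether to skip, which keeps the test running on non-UTF-8 systems that can represent the character.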

@MichaelChirico
Member

Thanks @aitap. What's luit?

iconv() looks as good a solution as any -- definitely good to still run those tests on non-UTF-8 systems, rather than just skip if parsing fails.

@aitap
Contributor

aitap commented Aug 18, 2024

luit converts between the UTF-8 terminal session and the non-UTF-8 encoding used by its child process.

@aitap
Contributor

aitap commented Oct 10, 2024

#6559 demonstrates that we cannot rely on iconv() to return NA if conversion fails: on FreeBSD we instead get a?o.
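
One possible way to avoid depending on the NA return is a round-trip check: convert to the target encoding, convert back to UTF-8, and compare with the original. The sketch below is an assumption about how such a check might look, not what the linked fixes do; `survives_round_trip` and its `to` argument are hypothetical names.

```r
# Hypothetical check (not part of data.table): TRUE only if `x` survives
# conversion to the encoding `to` and back to UTF-8 unchanged.
# This catches both an NA result and a silent substitution such as "a?o".
survives_round_trip <- function(x, to = "") {
  native <- iconv(x, from = "UTF-8", to = to)  # "" means the current locale's encoding
  if (anyNA(native)) return(FALSE)
  back <- iconv(native, from = to, to = "UTF-8")
  !anyNA(back) && identical(back, x)
}

# Usage: the `to` argument is only there so the check can be tried without
# changing the locale; the default tests the native encoding.
survives_round_trip("a\u00f1o")
survives_round_trip("a\u00f1o", to = "latin1")
```

The comparison relies on R treating strings as identical when they agree after translation to UTF-8, so a lossy substitution shows up as a mismatch.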
