-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed char.trunc to better handle combining and full-width multibyte characters #6048
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6048 +/- ##
==========================================
- Coverage 97.53% 97.51% -0.02%
==========================================
Files 80 80
Lines 14916 14918 +2
==========================================
- Hits 14548 14547 -1
- Misses 368 371 +3 ☔ View full report in Codecov by Sentry. |
this looks good to me. |
Done! |
Interesting that it passes on GHA-on-Windows but not Appveyor, hmm. Let's try submitting this to CRAN's winbuilder service to see if it passes there at least. |
OK, it passes on winbuilder: https://win-builder.r-project.org/vpDSszNIo19V/00check.log I am happy to merge and revisit this if someone else encounters the issue. It may be related to GHA and CRAN using UCRT-enabled machines. I see all the Windows checkers on CRAN use Likely there's an issue in the tests for non-UTF8 users, but it will be easier to work with a user encountering this who can do fast iteration on a fix themselves. |
clean_regex = "^\\d+:\\s+" # removes row numbering from beginning of output | ||
# Tests for combining character latin a and acute accent, single row | ||
DT = data.table(strrep(accented_a, 4L)) | ||
test(2253.01, gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(accented_a, 4L), options=list(datatable.prettyprint.char = 4L)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stylistic request before we merge -- use style test(n, options=, ...)
to make options=
clear up-front.
Can you also try going back to the simple output=
approach of the tests? It will make it a lot cleaner to read. Hopefully the simpler test also passes on windows GHA, if not we can revert back to this functioning version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this once I get home later! I used this style as I found it easier to read the expected output of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first few tests, the simplified output=
works well:
...
test(2253.04, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(ja_ichi, 4L))
test(2253.05, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(ja_ichi, 3L), dots))
test(2253.06, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(ja_ichi, 1L), dots))
...
But for the later tests (Once we have more columns/rows), I find that using this approach requires something like:
test(2253.07, options=list(datatable.prettyprint.char = 4L), DT, output=" V1\n1: á\n2: áá\n3: ááá\n4: áááá")
# vs
test(2253.07, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c("á", "áá", "ááá", "áááá"))
I'm not a huge fan of how the expected output looks, but LMK if you think this way is more readable and I can refactor the rest of the tests. For now, I'll commit the options syntax style change and using the simple approach with the first few tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I had something else in mind but it looks like the force push erased the history in the other branch. oh well. I won't belabor it any further :)
great thanks for the contribution. I invited you to https://github.com/orgs/Rdatatable/teams/project-members which gives you permission to create branches in this repo, so next time you make a PR, can you please create the branch in this repo? (instead of your fork) |
Closes #5096
Second PR, see: #5995.
The old version of
trunc.char
indata.table.print
usednchar(x)
internally which works fine in most cases except when dealing with combining characters such as"á"
(The Latin "a" and accent) asnchar(x)
recognizes two distinct readable characters. In this case usingnchar(x, 'width')
works betterThis PR changes
trunc.char
to recognize when the char is combining or full-width and indexes accordingly. Additionally,strtrim()
is used now in place ofsubstr()
because as mentioned in?substr
:Tasks: