Skip to content

Commit

Permalink
Add support for detecting multipart strings
Browse files Browse the repository at this point in the history
  • Loading branch information
Blake-Madden committed Dec 25, 2024
1 parent 6b0f0ca commit dc442ac
Show file tree
Hide file tree
Showing 18 changed files with 308 additions and 11 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,5 @@
/docs/manual/.quarto
/docs/manual/book/site_libs
/docs/manual/*.tex
/docs/manual/book/issues/mulitpart-strings_files/
/docs/manual/issues/mulitpart-strings_files/
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -176,6 +176,8 @@ loaded by your application for integration testing.
lengthInconsistency: Check for suspect lengths of translations compared to their source strings.
excessiveNonL10NContent: Check for translatable strings that contain large blocks on non-translatable content.
halfWidth: Check for halfwidth Kanas, Hanguls, and punctuation in source and target strings.
multipartString: Check for strings that appear to contain multiple parts that are
being sliced at runtime.
--disable: Which checks to not perform. (Refer to options available above.)
This will override any options passed to "--enable".
Expand Down
3 changes: 3 additions & 0 deletions docs/manual/_quarto.yml
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,9 @@ book:
- gui/editing.qmd
- gui/tools.qmd
- gui/options.qmd
- part: "i18n & l10n Issues"
chapters:
- issues/mulitpart-strings.qmd
- part: "Additional Features"
chapters:
- suppression.qmd
Expand Down
3 changes: 3 additions & 0 deletions docs/manual/_quarto.yml.in
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,9 @@ book:
- gui/editing.qmd
- gui/tools.qmd
- gui/options.qmd
- part: "i18n & l10n Issues"
chapters:
- issues/mulitpart-strings.qmd
- part: "Additional Features"
chapters:
- suppression.qmd
Expand Down
11 changes: 11 additions & 0 deletions docs/manual/cites.bib
Original file line number Diff line number Diff line change
Expand Up @@ -79,4 +79,15 @@ @article{boyero:2020
volume = {31},
number = {2},
pages = {37}
}

@article{madden:2019,
author = {Blake Madden},
title = {Localization workarounds for non-internationalized software},
shorthand = {Workarounds},
year = {2019},
journal = {MultiLingual},
volume = {30},
number = {6},
pages = {65-66}
}
6 changes: 6 additions & 0 deletions docs/manual/command-line/checks.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -177,6 +177,12 @@ Check for halfwidth Kanas, Hanguls, and punctuation in source and target strings

This is performed on *gettext* \*.po files.

- `multipartString`

Check for strings that appear to contain multiple parts that are being sliced at runtime.

This is performed on *gettext* \*.po files.

- `lengthInconsistency`

Check for suspect lengths of translations compared to their source strings.
Expand Down
21 changes: 16 additions & 5 deletions docs/manual/gui/options.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,12 @@ For example, a resource such as "%d of %zu" will emit this warning.
Although not much can be done about the majority of this string being `printf` commands, a translator comment would help with explaining what they represent.
By adding a comment, this warning (along with any ambiguity warnings) will be removed.
:::
::::

:::: {.optionssection data-latex=""}
::: {.optionssectiontitle data-latex=""}
Run the following checks (*continued...*)
:::
**Translatable strings that are surrounded by spaces**: select this to check for translatable strings that start or begin with spaces.
(This is limited to actual space characters, and does not include tabs or newlines.)

Expand All @@ -66,12 +71,13 @@ piecing them together.
An exception is made for strings that end with a colon followed by a space (": ").
It is assumed that this part of tabular data being concatenated and will be ignored.
:::
::::

:::: {.optionssection data-latex=""}
::: {.optionssectiontitle data-latex=""}
Run the following checks (*continued...*)
:::
**Multipart strings ("mega strings")**: select this to check for strings that appear to contain multiple parts that are being sliced into separate strings at runtime.

Refer to @sec-multipart-strings for an overview of this issue.

This check should be ran when modernizing legacy software.

**Suspect i18n function usage**: select this to check for suspect usage of i18n functions.

This will check for issues such as:
Expand Down Expand Up @@ -100,7 +106,12 @@ A string will be considered ambiguous if:
For example, a string such as "Name: %s\\nDate: %s\\Location: %s" may not need a comment.
It is assumed that the label in front of the colon serves as a description for the following "%s"; thus, a warning will not be emitted.
:::
::::

:::: {.optionssection data-latex=""}
::: {.optionssectiontitle data-latex=""}
Run the following checks (*continued...*)
:::
**Deprecated macros and functions**: select this to check for deprecated text macros and functions.
This will detect the usage of functions that are no longer relevant, and provide a suggested replacement.

Expand Down
59 changes: 59 additions & 0 deletions docs/manual/issues/mulitpart-strings.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@

```{r}
#| include: false
#| eval: true
source('../R/appdown.r')
```

# Internationalization Issues

Internationalization (i18n) occurs during the development stage of software, which includes (but not limited to) the following areas:

- Preparing the program to display numbers and dates using the client's regional settings.
- Exposing strings for translation.
- Ensuring that strings not meant for translation are not exposed to translators.

I18N issues include any problems during this stage which will later affect either translation efforts or regional display issues for clients.
This chapter will discuss these various issues and provide recommendations for how to detect and correct them using *Quneiform*.

## Multipart Strings {#sec-multipart-strings}

Multipart strings (or "mega strings") are single string resources that are split at runtime into smaller strings [@madden:2019].

For example, consider a resource string such as:

`"Name Password Domain "`

While this is one string, it contains suspicious looking blocks of spaces in between each word.
As it turns out, an application may be splitting this string into 10-character sections at runtime:

```{r}
#| echo: false
#| out-width: 100%
#| fig-format: png
#| fig-align: center
library(DiagrammeR)
DiagrammeR('
graph TB
A["“Name      Password  Domain    ”"]-->B["“Name      ”"]
A["“Name      Password  Domain    ”"]-->C["“Password  ”"]
A["“Name      Password  Domain    ”"]-->D["“Domain    ”"]
')
```

Then, it will use each of these sections as a separate resource.
This is a bad practice found in legacy software, and can be thought of as the opposite of concatenating strings (another bad i18n practice).

The issue with this practice is that it forces translations to a constrained length (10-characters for each word in this example).
Next, it is difficult to translate, as the translator must count the characters and spaces for each word segment, not just the entire string.
(And this is assuming that the behavior of this string was even communicated to the translators.)
Finally, l10n quality assurance tools are not designed to check unusual strings like this.
Validating translations for these resources requires custom tools, which creates unnecessary technical debt.

If a translator doesn't format each segment of this string perfectly, then at best the translations will overlap or be clipped at runtime.
At worst, the program will crash in situations where the full string is less than 30 characters.

It is preferred to use separate resources for each string.
This ensures that the translations aren't constrained to arbitrary lengths and QA tools will be able to review the resources with ease.
34 changes: 32 additions & 2 deletions src/analyze.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -388,6 +388,7 @@ namespace i18n_check
<< ((m_cpp->get_style() & check_needing_context) ? L"L10NStringNeedsContext\n" :
L"")
<< ((m_cpp->get_style() & check_l10n_contains_url) ? L"urlInL10NString\n" : L"")
<< ((m_cpp->get_style() & check_multipart_strings) ? L"multipartString\n" : L"")
<< ((m_cpp->get_style() & check_l10n_contains_excessive_nonl10n_content) ?
L"excessiveNonL10NContent\n" :
L"")
Expand Down Expand Up @@ -480,6 +481,16 @@ namespace i18n_check
<< L"\"\t[spacesAroundL10NString]\n";
}

for (const auto& val : m_rc->get_multipart_strings())
{
report << val.m_file_name << L"\t" << val.m_line << L"\t\t\""
<< replaceSpecialSpaces(val.m_string) << L"\"\t\""
<< _(L"String available for translation may contain multiple sections "
"being sliced at runtime. Consider splitting each section into "
"a separate resource.")
<< L"\"\t[multipartString]\n";
}

for (const auto& val : m_rc->get_localizable_strings_with_halfwidths())
{
report << val.m_file_name << L"\t" << val.m_line << L"\t" << val.m_column << L"\t"
Expand Down Expand Up @@ -558,6 +569,15 @@ namespace i18n_check
"lacking a translator comment.")
<< "\"\t[L10NStringNeedsContext]\n";
}
else if (issue.first == translation_issue::multipart_string)
{
report << catEntry.first << L"\t" << catEntry.second.m_line << L"\t\t\""
<< issue.second << L"\"\t\""
<< _(L"String available for translation may contain multiple sections "
"being sliced at runtime. Consider splitting each section into "
"a separate resource.")
<< "\"\t[multipartString]\n";
}
else if (issue.first == translation_issue::accelerator_issue)
{
report << catEntry.first << L"\t" << catEntry.second.m_line << L"\t\t\""
Expand Down Expand Up @@ -696,13 +716,23 @@ namespace i18n_check
"using a formatting function.");
report << L"\"\t[spacesAroundL10NString]\n";
}


for (const auto& val : sourceParser->get_multipart_strings())
{
report << val.m_file_name << L"\t" << val.m_line << L"\t" << val.m_column << L"\t"
<< L"\"" << replaceSpecialSpaces(val.m_string) << L"\"\t\"";
report << _(L"String available for translation may contain multiple sections "
"being sliced at runtime. Consider splitting each section into "
"a separate resource.");
report << L"\"\t[multipartString]\n";
}

for (const auto& val : sourceParser->get_localizable_strings_with_halfwidths())
{
report << val.m_file_name << L"\t" << val.m_line << L"\t" << val.m_column << L"\t"
<< L"\"" << replaceSpecialSpaces(val.m_string) << L"\"\t\"";
report << _(L"String available for translation that contains halfwidth characters. "
"Fullwidth characters are recommended.");
"Fullwidth characters are recommended.");
report << L"\"\t[halfWidth]\n";
}

Expand Down
1 change: 1 addition & 0 deletions src/gui/i18nframe.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -780,6 +780,7 @@ void I18NFrame::OnIgnoreSelectedWarning([[maybe_unused]] wxCommandEvent&)
excludeFlag(L"[L10NStringNeedsContext]",
i18n_check::review_style::check_needing_context);
excludeFlag(L"[urlInL10NString]", i18n_check::review_style::check_l10n_contains_url);
excludeFlag(L"[multipartString]", i18n_check::review_style::check_multipart_strings);
excludeFlag(L"[excessiveNonL10NContent]",
i18n_check::review_style::check_l10n_contains_excessive_nonl10n_content);
excludeFlag(L"[spacesAroundL10NString]",
Expand Down
13 changes: 13 additions & 0 deletions src/gui/projectdlg.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,7 @@ void NewProjectDialog::SetOptions(const i18n_check::review_style style)
m_lengthInconsistency = (m_options & i18n_check::review_style::check_length);
m_needsContext = (m_options & i18n_check::review_style::check_needing_context);
m_urlInL10NString = (m_options & i18n_check::review_style::check_l10n_contains_url);
m_multipartString = (m_options & i18n_check::review_style::check_multipart_strings);
m_excessiveNonTranslatableContentInL10NString =
(m_options & i18n_check::review_style::check_l10n_contains_excessive_nonl10n_content);
m_spacesAroundL10NString =
Expand Down Expand Up @@ -243,6 +244,10 @@ void NewProjectDialog::OnOK([[maybe_unused]] wxCommandEvent&)
{
m_options |= i18n_check::review_style::check_l10n_contains_excessive_nonl10n_content;
}
if (m_multipartString)
{
m_options |= i18n_check::review_style::check_multipart_strings;
}
if (m_spacesAroundL10NString)
{
m_options |= i18n_check::review_style::check_l10n_has_surrounding_spaces;
Expand Down Expand Up @@ -582,6 +587,14 @@ void NewProjectDialog::CreateControls()
wxGBPosition(currentRow, 0), wxGBSpan{});
gbSizer->Add(buildCodeLabel(L"spacesAroundL10NString", checkOptionsSizer->GetStaticBox()),
wxGBPosition(currentRow++, 1), wxGBSpan{});

gbSizer->Add(new wxCheckBox(checkOptionsSizer->GetStaticBox(), wxID_ANY,
_(LR"(Multipart strings ("mega strings"))"),
wxDefaultPosition, wxDefaultSize, 0,
wxGenericValidator(&m_multipartString)),
wxGBPosition(currentRow, 0), wxGBSpan{});
gbSizer->Add(buildCodeLabel(L"multipartString", checkOptionsSizer->GetStaticBox()),
wxGBPosition(currentRow++, 1), wxGBSpan{});

gbSizer->Add(new wxCheckBox(checkOptionsSizer->GetStaticBox(), wxID_ANY,
_(L"Suspect i18n function usage"), wxDefaultPosition,
Expand Down
1 change: 1 addition & 0 deletions src/gui/projectdlg.h
Original file line number Diff line number Diff line change
Expand Up @@ -307,6 +307,7 @@ class NewProjectDialog final : public wxDialog
bool m_lengthInconsistency{ false };
bool m_needsContext{ false };
bool m_urlInL10NString{ true };
bool m_multipartString{ false };
bool m_excessiveNonTranslatableContentInL10NString{ false };
bool m_spacesAroundL10NString{ false };
bool m_deprecatedMacro{ true };
Expand Down
12 changes: 12 additions & 0 deletions src/i18n_review.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1126,6 +1126,10 @@ namespace i18n_check
{
m_unsafe_localizable_strings.push_back(str);
}
if ((m_review_styles & check_multipart_strings) && is_string_multipart(str.m_string))
{
m_multipart_strings.push_back(str);
}
if ((m_review_styles & check_l10n_contains_url) &&
(std::regex_search(str.m_string, results, m_url_email_regex) ||
std::regex_search(str.m_string, results, m_us_phone_number_regex) ||
Expand Down Expand Up @@ -1572,6 +1576,13 @@ namespace i18n_check
}
}

//--------------------------------------------------
bool i18n_review::is_string_multipart(std::wstring_view str)
{
static const std::wregex multiConsecSpaces{ LR"(([ ]{2,}|\\t|\b\|\b))" };
return load_matches(str, multiConsecSpaces).size() > 2;
}

//--------------------------------------------------
bool i18n_review::is_string_ambiguous(std::wstring_view str)
{
Expand Down Expand Up @@ -2079,6 +2090,7 @@ namespace i18n_check
m_localizable_strings_in_internal_call.clear();
m_localizable_strings_with_surrounding_spaces.clear();
m_localizable_strings_with_halfwidths.clear();
m_multipart_strings.clear();
m_not_available_for_localization_strings.clear();
m_marked_as_non_localizable_strings.clear();
m_internal_strings.clear();
Expand Down
28 changes: 24 additions & 4 deletions src/i18n_review.h
Original file line number Diff line number Diff line change
Expand Up @@ -117,11 +117,12 @@ namespace i18n_check
check_needing_context = (static_cast<int64_t>(1) << 14),
/// @brief Check for suspect usage of i18n functions.
check_suspect_i18n_usage = (static_cast<int64_t>(1) << 15),
/// @@brief Check for translatable strings that contain
/// @brief Check for translatable strings that contain
/// large blocks on non-translatable content.
check_l10n_contains_excessive_nonl10n_content = (static_cast<int64_t>(1) << 16),
/// @private
i18n_reserved5 = (static_cast<int64_t>(1) << 17),
/// @brief Check for strings that appear to contain multiple parts that are
/// being sliced at runtime.
check_multipart_strings = (static_cast<int64_t>(1) << 17),
/// @private
i18n_reserved6 = (static_cast<int64_t>(1) << 18),
/// @private
Expand Down Expand Up @@ -225,7 +226,10 @@ namespace i18n_check
/// @brief Translation contains malformed syntax (e.g., a bad accelerator key).
malformed_translation,
/// @brief Translation contains halfwidth Kanas, Hanguls, and punctuation.
halfwidth
halfwidth,
/// @brief Source strings that appears to contain multiple parts that are
/// being sliced at runtime.
multipart_string
};

/// @brief File types that can be analyzed.
Expand Down Expand Up @@ -555,6 +559,13 @@ namespace i18n_check
return m_localizable_strings_with_halfwidths;
}

/// @returns The strings that appear to contain multiple sections.
[[nodiscard]]
const std::vector<string_info>& get_multipart_strings() const noexcept
{
return m_multipart_strings;
}

/// @returns The strings that contain extended ASCII characters, but are not encoded.
[[nodiscard]]
const std::vector<string_info>& get_unencoded_ext_ascii_strings() const noexcept
Expand Down Expand Up @@ -938,6 +949,14 @@ namespace i18n_check
[[nodiscard]]
static bool is_string_ambiguous(std::wstring_view str);

/// @returns @c true if a string appears to contain sections that are being sliced at
/// runtime. This is deduced from looking at embedded tabs, pipes (between words),
/// and consecutive spaces. If there are more than two of these in a string, then it
/// is probably being sliced at runtime.
/// @param str The string to review.
[[nodiscard]]
static bool is_string_multipart(std::wstring_view str);

/** @brief Processes a quote after its positions and respective
function/variable assignment has been found.
@param[in,out] currentTextPos The current position into the text buffer.\n
Expand Down Expand Up @@ -1221,6 +1240,7 @@ namespace i18n_check
std::vector<string_info> m_localizable_strings_with_surrounding_spaces;
std::vector<string_info> m_localizable_strings_with_halfwidths;
std::vector<string_info> m_not_available_for_localization_strings;
std::vector<string_info> m_multipart_strings;
std::vector<string_info> m_deprecated_macros;
std::vector<string_info> m_unencoded_strings;
std::vector<string_info> m_printf_single_numbers;
Expand Down
Loading

0 comments on commit dc442ac

Please sign in to comment.