Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transliterator DatagenProvider #3877

Merged
merged 43 commits into from
Aug 23, 2023

Conversation

skius
Copy link
Member

@skius skius commented Aug 16, 2023

Depends on #3872
Part of #3736

This PR

  • adds datagen support for transliterators
  • adds sample transliterator definitions to the checked in cldr-json as a workaround until they land upstream
  • adds compiled data support for transliterators

@skius skius mentioned this pull request Aug 16, 2023
41 tasks
skius added 18 commits August 18, 2023 14:50
commit 7f84837
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 19:31:59 2023 -0500

    Update request.rs

commit 905e697
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 16:50:55 2023 -0700

    gn-gen

commit 1e52541
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 16:50:06 2023 -0700

    Update sizes (probably need to fix nightly sizes too)

commit 45778ce
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 16:47:06 2023 -0700

    Use the TinyStr variant

commit 04ee1df
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 16:42:26 2023 -0700

    Remove Default impl

commit 3d83b95
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 16:42:02 2023 -0700

    Add impls based on Deref

commit 3b14fbd
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 16:10:47 2023 -0700

    Add 3-way enum (not using Stack variant yet)

commit cda1498
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 13:57:26 2023 -0700

    Fix build post merge

commit 71c1087
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 13:56:30 2023 -0700

    Support multipart auxiliary keys

commit f31643a
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 13:17:17 2023 -0700

    Use `+` for aux key

commit dbefbf9
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 13:04:55 2023 -0700

    More more is_und

commit f8e310e
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 13:03:07 2023 -0700

    is_und in driver.rs (post merge cleanup)

commit 33fad72
Merge: 5551d6b ea1ba9f
Author: Shane F. Carr <shane@unicode.org>
Date:   Thu Aug 17 12:58:16 2023 -0700

    Merge branch 'main' into auxkey

    Conflicts:
    	provider/datagen/src/lib.rs

commit 5551d6b
Author: Shane F. Carr <shane@unicode.org>
Date:   Wed Aug 16 17:07:24 2023 -0700

    fmt and refactor

commit cd0d86c
Author: Shane F. Carr <shane@unicode.org>
Date:   Wed Aug 16 17:00:56 2023 -0700

    Fix baked_exporter.rs

commit 241b563
Author: Shane F. Carr <shane@unicode.org>
Date:   Wed Aug 16 14:51:39 2023 -0700

    Pull the separator character into a function where possible.

commit 93f0969
Author: Shane F. Carr <shane@unicode.org>
Date:   Wed Aug 16 14:47:19 2023 -0700

    Handle $ in baked_exporter

commit 3493b75
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 19:24:09 2023 -0700

    Use is_und instead of is_empty more consistently in fallback iterator

commit 52ab723
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 19:19:23 2023 -0700

    Add DataLocale::is_und; gen hello world testdata

commit f36b405
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 19:05:19 2023 -0700

    Use auxiliary keys in HelloWorldProvider

commit 54101ad
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 18:57:06 2023 -0700

    Forbid the empty string in AuxiliaryKey

commit d0e30e2
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 18:54:44 2023 -0700

    Docs, tests, cleanup

commit 06e453c
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 18:29:18 2023 -0700

    Change custom error type to DataError::KeyLocaleSyntax

commit 6cb45f6
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 18:18:00 2023 -0700

    Make comparison operations work

commit 59aa0b4
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 15:55:52 2023 -0700

    Start writing impl FromStr for DataLocale

commit 3d36855
Author: Shane F. Carr <shane@unicode.org>
Date:   Tue Aug 15 13:23:16 2023 -0700

    Initial auxiliary key APIs
@dpulls
Copy link

dpulls bot commented Aug 18, 2023

🎉 All dependencies have been resolved !

@skius skius requested review from sffc, robertbastian and a team as code owners August 21, 2023 10:03
@skius skius removed request for a team and Manishearth August 21, 2023 10:03
@skius
Copy link
Member Author

skius commented Aug 21, 2023

This fails CI because the sample transform rule directory (cldr-transforms-main) does not actually exist when downloading fresh CLDR data, how should I resolve that?

@robertbastian
Copy link
Member

This fails CI because the sample transform rule directory (cldr-transforms-main) does not actually exist when downloading fresh CLDR data, how should I resolve that?

In https://github.com/unicode-org/icu4x/blob/main/tools/testdata-scripts/src/bin/download-repo-sources.rs#L121 you need to move cldr-transforms-main into a temp dir and recover it after.

@skius skius requested a review from echeran as a code owner August 21, 2023 14:31
@skius skius removed the request for review from echeran August 21, 2023 14:32
robertbastian
robertbastian previously approved these changes Aug 22, 2023
"README.md"
]

[dependencies]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove

#[derive(
Debug,
Clone,
zerofrom::ZeroFrom, // QUESTION: does ZeroFrom need to be behind a feature?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's required by DataPayload::with_mut and DataPayload::try_map_project, which is used by .as_deserializing() only, so it could be behind the serde feature. However, it's a dependency of icu_provider, so making it optional doesn't gain much, and we don't have it behind a feature anywhere.

#[doc(hidden)]
pub fn legacy_id_to_internal_id(source: &str, target: &str, variant: Option<&str>) -> String {
// TODO(#3891): Decide representation for unknown BCP47 IDs
let mut id = format!("x-{source}-{target}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: good solution for now

Comment on lines 240 to 248
let dir_suffix = self.0.dir_suffix()?;
let path = format!("{}-{dir_suffix}/main/{transform}/metadata.json", self.1);
if self.0.serde_cache.file_exists(&path)? {
self.0.serde_cache.read_and_parse_json(&path)
} else {
Err(DataErrorKind::Io(std::io::ErrorKind::NotFound)
.into_error()
.with_display_context(&path))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to construct a CldrDirLang on the fly to call its read_and_parse

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you meant CldrDirNoLang, because there is no language identifier around. Doing this now:

let cldr_dir = CldrDirNoLang(self.0, format!("{}-{dir_suffix}/main/{transform}", self.1));
cldr_dir.read_and_parse("metadata.json")

let dir_suffix = self.0.dir_suffix()?;
let path = format!("{}-{dir_suffix}/main/{transform}/source.txt", self.1);
if self.0.serde_cache.file_exists(&path)? {
let buf = self.0.serde_cache.root.read_to_buf(&path)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change read_to_buf by read_to_string, then the AbstractFs can handle UTF-8 errors. JSON and TOML deserialization can be done on strings as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added read_to_string to AbstractFs

fn lookup_dir_from_internal_id(
&self,
internal_id: &str,
) -> Result<Option<(String, bool)>, DataError> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicer to return icu_transliterator_parser::Direction


fn find_bcp47_in_list(list: &[String]) -> Option<String> {
for item in list {
if item.contains("-t-") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better to parse to Locale.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing aliases as an enum over Locale (for -t-) and String (for legacy IDs) now

]
},
{
"cp_inv_list": [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh these are quite big 😬

@skius skius force-pushed the datagen-transform-rules branch from 57e1af3 to fb3a112 Compare August 22, 2023 17:28
robertbastian
robertbastian previously approved these changes Aug 22, 2023
robertbastian
robertbastian previously approved these changes Aug 22, 2023
@robertbastian robertbastian merged commit c5731f7 into unicode-org:main Aug 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants