Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add icu_preferences util #1833

Closed
wants to merge 1 commit into from
Closed

Conversation

zbraniecki
Copy link
Member

Fixes #419.

This is an early WIP that finally captures what I had in mind for #419. It also creates a foundation for #594 (the data will have to come from DataProvider).

}

impl DateTimeFormat {
pub fn new(locale: &Locale, prefs: &DTFPreferences) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion: If we add the preferences bag as a separate argument, we end up with every formatter constructor taking 4 arguments: locale, provider, options, and preferences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should. I'm ok with preferences being part of options. This is just an example.

}

impl DateTimeFormat {
pub fn new(locale: &Locale, prefs: &DTFPreferences) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: I want to make the Locale argument be a trait (see #1662). Should we consider making that trait also encompass the preferences bag? For example, plain Locale can implement the trait, and so can DTFPreferences, which would wrap the Locale and add extra overrides.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that preferences shouldn't be folded into Locale. Preferences can be resolved without locale, and I am planning to make the resolver/constructor also accept None for Locale and just resolve between Preferences and ResolvedPreferences.

@zbraniecki
Copy link
Member Author

@Manishearth do you have any suggestion on how to approach handling of ca key from Locale for selection of calendar? With the current approach of typed calendars it seems to me that we can only handle ca key in AnyCalendar. Is that correct?

@Manishearth
Copy link
Member

@Manishearth do you have any suggestion on how to approach handling of ca key from Locale for selection of calendar? With the current approach of typed calendars it seems to me that we can only handle ca key in AnyCalendar. Is that correct?

Yes, this is correct, and the plan is that the "regular" DateTimeFormat type will handle AnyCalendars, and what we currently call DateTimeFormat will instead be called TypedDateTimeFormat or something

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • utils/preferences/Cargo.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • utils/preferences/Cargo.toml is different
  • utils/preferences/src/lib.rs is different
  • utils/preferences/src/macros.rs is different
  • utils/preferences/src/preferences.rs is now changed in the branch
  • utils/preferences/tests/dtf.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@zbraniecki zbraniecki requested a review from sffc May 15, 2022 01:23
@zbraniecki
Copy link
Member Author

@sffc, @Manishearth - requesting early feedback on the discussed model.

I'm introducing generic Preferences that can be constructed from Locale, and each component can define its own XPreferences trait that implements additional getters and XPreferencesBag + ResolvedXPreferencesBag which are structs that implement it.

I focused only on architecture and DX so far and I think it starts looking quite nice.
I really like how simple the constructor of the component looks like, I like that a lot of retrievals are lazy, I like the Resolved* for storing because it means that internally DTF will have all the fields accessible.

One hiccup I found so far architecturally is that if I pass Locale as an impl of DTFPreferences then hour_cycle should be fallible because in the value of the keyword may not be a valid enum variant. Not sure how to workaround it. I'd love to avoid making those getters fallible, I'd love to avoid having to eagerly resolve preferences etc. but maybe we'll have to? Not sure.

The other imperfection I see is that ResolvedXPreferencesBag stores lid which is unnecessary for defaults, so I assume that DataProvider will have DefaultXPreferencesBag that will contain just the non-lid fields for lid key. The DTF constructor will then ask DP for the defaults, and build Resolved from Default + Prefs.

I haven't looked at all at performance/memory of this approach. I'd appreciate your early feedback on that as well since I'm trying to avoid switching to that mode until DX is complete.

@zbraniecki zbraniecki requested a review from Manishearth May 15, 2022 01:30
@Manishearth
Copy link
Member

I think this looks pretty promising!

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • utils/preferences/Cargo.toml is different
  • utils/preferences/src/macros.rs is different
  • utils/preferences/src/preferences.rs is different
  • utils/preferences/tests/dtf.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@zbraniecki zbraniecki removed the request for review from Manishearth June 2, 2022 21:59
@zbraniecki
Copy link
Member Author

@sffc - updated based on our HLD. Unfortunately, we didn't capture notes so I was working from memory. Please let me know how it looks to you now and if there's something I missed.

@sffc
Copy link
Member

sffc commented Jun 6, 2022

I have some scattered thoughts. But before I get into detail on those, I want to step back and look at what the resulting DTF constructor looks like. I'd like to start there and work backwards.

I see several paths, all of which have pros and cons:

  1. Into<Locale> + provider + options bag (what we have now)
  2. Locale + provider + options bag
  3. struct DTFPreferenceBag + provider + options bag (what is being proposed)
  4. trait DTFPreferences + provider + options bag
  5. trait LocaleLike + provider + options and preferences bag (put prefs into the same bag with options)
  6. struct DataLocale + provider + options and preferences bag

Options 5 and 6 are more aligned with my mental model coming in. It is also what ECMA-402 is doing. Inside the constructor, we resolve the locale with the fields of the bag that are preferences using the same logic as ECMA-402 (favor options bag, then locale, then default fallback):

new Intl.DateTimeFormat("en").resolvedOptions().calendar
// 'gregory'
new Intl.DateTimeFormat("en-u-ca-hebrew").resolvedOptions().calendar
// 'hebrew'
new Intl.DateTimeFormat("en-u-ca-hebrew", { calendar: "japanese" }).resolvedOptions().calendar
// 'japanese'

My understanding is that @zbraniecki's main concern with using a locale to carry preferences is that we may want to scale to preferences that don't fit into BCP-47. My response to that is that trait LocaleLike or struct DataLocale can support such preferences; they need not be tied to BCP-47.

Note: struct DataLocale can be what is currently known as ResourceOptions; see #1995.

@zbraniecki
Copy link
Member Author

struct DTFPreferenceBag + provider + options bag (what is being proposed)

This is not what I propose here. I propose just DTFPreferenceBag + provider.

it is intentional, as:

  • Locale is a subset of Preferences
  • Conversion of Locale to Preferences is fallible so I want it external to constructor

trait DTFPreferences + provider + options bag

Why would we do that? I did it before and the problem is that the trait is mainly useful to have an implicit From<Locale> which is fallible, so it's better to have it as explicit TryFrom.

trait LocaleLike + provider + options and preferences bag (put prefs into the same bag with options)

What would it give us? This makes Locale a superset of Preferences, which it is not.

struct DataLocale + provider + options and preferences bag

How is DataLocale different from DTFPreferencesBag? Is that about having a single struct or trait for all components? I have concerns about scaling of that solution.

@sffc
Copy link
Member

sffc commented Jun 7, 2022

struct DTFPreferenceBag + provider + options bag (what is being proposed)

This is not what I propose here. I propose just DTFPreferenceBag + provider.

it is intentional, as:

  • Locale is a subset of Preferences
  • Conversion of Locale to Preferences is fallible so I want it external to constructor

Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")

trait DTFPreferences + provider + options bag

Why would we do that? I did it before and the problem is that the trait is mainly useful to have an implicit From<Locale> which is fallible, so it's better to have it as explicit TryFrom.

We discussed this as an option in our meeting and the idea was so that foreign types could implement the trait and be compatible with ICU4X, but I think it's infeasible if every formatter has a separate trait, so I agree; we can dismiss this option.

trait LocaleLike + provider + options and preferences bag (put prefs into the same bag with options)

What would it give us? This makes Locale a superset of Preferences, which it is not.

There would be a single trait LocaleLike for all formatters that foreign types could implement, and which we could implement for Locale, LanguageIdentifier, LocaleStr, etc.

struct DataLocale + provider + options and preferences bag

How is DataLocale different from DTFPreferencesBag? Is that about having a single struct or trait for all components? I have concerns about scaling of that solution.

Yes, a single struct for all components.

I would like to understand more about your concern of scaling that solution. As far as I know, your only concern was the ability to support non-BCP-47 values. I think is a tractable problem that can be supported in a single struct: in the future, we add more fields (public or private) to hold more structured data.

Please note that the data provider framework wants everything in a single standard format (be it Locale or DataLocale or otherwise). It wants to work the same way across components.

@sffc
Copy link
Member

sffc commented Jun 7, 2022

Note: my envisioned data model for DataLocale is a Language Identifier + map from Unicode Extension Key to Unicode Extension Value. (This is exactly what ResourceOptions is currently.)

@zbraniecki
Copy link
Member Author

Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")

They are part of DTFPreferencesBag. Locale can be either TryInto preferences or merged into it (both operations are fallible).

I would like to understand more about your concern of scaling that solution. As far as I know, your only concern was the ability to support non-BCP-47 values.

It is not my only one.

My concern is that operating on a union of all possible fields that any component may need is going to lead to awkward DX and limit DCE. It will also mean that such type has to keep getting extended for every option that any component needs and that requires revision of icu_preferences.

Note: my envisioned data model for DataLocale is a Language Identifier + map from Unicode Extension Key to Unicode Extension Value. (This is exactly what ResourceOptions is currently.)

We discussed that maybe in the future we'll want to add such "union" type, but I'd prefer not to and I'd prefer not to start with it.

The type needed by DP is suboptimal for internal use by components - in particular, an open ended key-value list can contain invalid values. One of the benefits of the design I'm providing here is that the validation of the values is enforced when working with DTFPreferencesBag and validated explicitly when converted/merged from Locale. It means that the type inside DTF is guaranteed to contain valid values.

I understand that for DP that is not the case, and I recognize that we will want to add such LID + KV map type for DP use.

@sffc
Copy link
Member

sffc commented Jun 7, 2022

Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")

They are part of DTFPreferencesBag. Locale can be either TryInto preferences or merged into it (both operations are fallible).

OK, so this model essentially removes the locale as an explicit argument?

Have we considered the implications of this on people knowing how and where to pass in the locale? I think it's a good design choice on the part of both ICU4C and ECMA-402 to basically always require a standalone locale argument.

My concern is that operating on a union of all possible fields that any component may need is going to lead to awkward DX and limit DCE.

I don't see how this affects either DX or DCE.

  1. For DX, the developer should put programmatic preferences in the options bag (third argument). The locale only carries the user's preferences.
  2. For DCE, irrelevant preferences in the locale are simply dropped/ignored. We only carry code to process the relevant ones, which is code we need to carry anyway.

It will also mean that such type has to keep getting extended for every option that any component needs and that requires revision of icu_preferences.

It's a map. More items can be added to the map without requiring code revision, so long as they conform to BCP-47.

The type needed by DP is suboptimal for internal use by components - in particular, an open ended key-value list can contain invalid values. One of the benefits of the design I'm providing here is that the validation of the values is enforced when working with DTFPreferencesBag and validated explicitly when converted/merged from Locale. It means that the type inside DTF is guaranteed to contain valid values.

That's a nice benefit, I agree. I also think that this case is fine to stick in an error variant if the values are invalid.

I understand that for DP that is not the case, and I recognize that we will want to add such LID + KV map type for DP use.

Understand that whatever type we add here needs to have a conversion into the DP type right off the bat. I see this as a primary motivator of the signature of formatter constructors.

@zbraniecki
Copy link
Member Author

OK, so this model essentially removes the locale as an explicit argument?

Correct.

Have we considered the implications of this on people knowing how and where to pass in the locale? I think it's a good design choice on the part of both ICU4C and ECMA-402 to basically always require a standalone locale argument.

I'm not convinced anymore, hence this design.

My take is that Locale input from the perspective of formatters is fallible. en-us-hc-h98 is a correct Locale but the incorrect hour cycle must be handled in some way. If we accept Locale then the DTF has to decide how to handle that, either implicitly rejecting, implicitly correcting, or providing additional options to decide how to handle invalid values.

In the API I'm proposing this is moved outside of the constructor giving developers ability to decide how to handle failures and allowing us to provide more nuanced methods like .merge_leniently or .merge_strictly etc.

I'm growing belief that Locale is just a formation of preferences such as language, script, region, hour_cycle, calendar, date_style etc. I don't see reason we shouldn't allow, even in ECMA-402 like API for:

new Intl.DateTimeFormat("en-US", {
  region: "CA",
});

For DX, the developer should put programmatic preferences in the options bag (third argument). The locale only carries the user's preferences.

What is a programmatic preference in your mental model? Is hour_cycle a programmatic preference or user preference?

I argue that they are all user preferences, just provided in two ways (preferences bag vs locale) and merged.

It is also lossless one way - you can create preferences bag from locale which is a fallible operation, but the other way may be lossy, but infallible.

I agree that it is nice to allow for both - locale and preferences bag - to be provided, but I don't think they need to be provided to the constructor. It is sufficient that they can be merged prior to construction.

@sffc
Copy link
Member

sffc commented Jun 7, 2022

Thanks for explaining your mental model more.

I think it is a valid model, and a perspective to consider, but I'm not yet convinced that it is superior to the proven traditional model, and I'm hesitant to use ICU4X as a guinea pig to validate it.

My take is that Locale input from the perspective of formatters is fallible. en-us-hc-h98 is a correct Locale but the incorrect hour cycle must be handled in some way. If we accept Locale then the DTF has to decide how to handle that, either implicitly rejecting, implicitly correcting, or providing additional options to decide how to handle invalid values.

To me, this seems like a solution without a real-world problem:

  1. Misspelled subtags are far down the list of problems that I encounter in the i18n space; BCP-47 strings are usually generated by either a machine or a power user.
  2. In cases where we do come across a misspelled subtag, the behavior is well-defined and understood: drop the subtag, just like you do when performing vertical fallback.
  3. The average developer won't know how to resolve this sort of error when it comes up; it seems like the kind of thing that is well suited to a "best effort" approach.
  4. For developers who really want this, it seems perfectly reasonable to have an external function for validating the semantic content of extension keywords; it could go in the locale_canonicalizer component and be driven by the CLDR validity data.

I'm growing belief that Locale is just a formation of preferences such as language, script, region, hour_cycle, calendar, date_style etc.

Yes, but I don't see how this particular belief translates into the structured all-in-one preferences bag proposal.

What is a programmatic preference in your mental model? Is hour_cycle a programmatic preference or user preference?

In my mental model, the bag of preferences comes from the operating system in the form of a BCP-47 string, a well-understood, extremely popular serialization format. It may contain -u-hc, indicating the user's preferred default hour cycle, which may differ from the one used in their region.

However, hour_cycle, as specified in the options bag, is the programmer specifying that they want a specific style of output. This setting overrides the locale preference.

So, -u-hc is the user override, and hour_cycle is the programmer override.

I agree that it is nice to allow for both - locale and preferences bag - to be provided, but I don't think they need to be provided to the constructor. It is sufficient that they can be merged prior to construction.

A major reason why I still think it's wise to keep the locale in the constructor signature is that it reminds the developer where they need to inject the locale. If they just provide a bag of options, they are at high risk of specifying date style and time style, for instance, and unintentionally leaving language, region, etc. at their default values. This problem could be partially mitigated by forcing DTFPreferenceBag to be built with a locale in some way, such as with a builder pattern, constructors, etc., at the expense of some ergonomics.

Making it clear where developers need to put their locale parameter is, in my mind, the most basic tenant of i18n library design.

@zbraniecki
Copy link
Member Author

To me, this seems like a solution without a real-world problem:

Architecture of an API should be designed along the axis of fallible/infallible and lossy/lossless. Locale as is is fallible and lossy compared to Preferences.

I don't understand your argument of "solution without a problem" - we have a "problem" - we need to design API to encode information and provide DX. I provide a solution that I see as superior to what has been done, reflecting the ICU4X API composition and easily usable from ECMA-402 APIs.

In my mental model

I think your mental model is incomplete here.

"preferences bag" does not come from OS. It can come from developer choice, or from user preferences stored in the application.

Locale can come from OS, or from third-party system or can be stored as a user preference in the app as well. Both can have multiple sources and the locale comes from parent, preferences come from developer is but one of multiple models possible.

The API I'm proposing allows for composition of those models including one you evaluate, but also allowing for others including ones where developer has preferences, and user has preferences, and parent provides preferences and allows for chained multi-level combination of those in any order that the developer prefers allowing them to decide how to handle fallibility and lossiness.

The model used by ECMA-402 does not allow for any of that - it forces one architecture, forces particular failure resolution and doesn't allow for composition.

A major reason why I still think it's wise to keep the locale in the constructor signature is that it reminds the developer where they need to inject the locale.

That's a good point. I'll think about how to incorporate that into the DX.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a lot of nitpicks, which I won't post yet since I'm not sold on the architecture, but I thought I'd post one to start.

Comment on lines 13 to 14
fn region(&self) -> Option<&Region> {
None
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There is more than one concept of a region. There is the region in the sense of a localized version of a language, and there is a region in the sense of setting default preferences for things like currencies and measurement units. This distinction is handled automatically in BCP-47, but if you are forcing a locale into the Preferences mold, you should support both types of regions so that we don't lose information across the conversion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ this question about language region versus preference region remains unresolved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you refer to region subtag vs rg unicode extension.
TR35 calls the former region and the latter region_override with definition:

A Region Override specifies an alternate region to use for obtaining certain region-specific default values (those specified by the element), instead of using the region specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the unicode_language_subtag).

I think we should follow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need both types of region for data loading. For example, a British person living in the US who wants to make a MeasureFormat may want British translations ("metre") and American preferences ("miles per gallon") in the same constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that DP may want to have access to more than just LID. In that case we will pass Preferences (or DataLocale::from(Preferences) to DP rather than Preferences::LID. This will allow DP to reason about region and region_override.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the other type of region get stored? Does this basically mean that every preferences object has a region_override field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every Preference that has rg as a relevant extension.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm not sure if I like codifying whether or not rg is a relevant extension in each and every Preference struct; I see that more as a data provider decision. CLDR could decide to change a certain piece of data from being language-based to region-based in a future edition; for example, currency symbols are currently being discussed as a possible candidate for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm not sure if I like codifying whether or not rg is a relevant extension in each and every Preference struct; I see that more as a data provider decision.

I disagree. It is a potential Preferences field and each struct should decide if it is. I can see it being converted to Option in DataLocale where preferences that have relevant and present will use Some and everything else will result in None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, the set of what extension keys are relevant is dependent on the data keys reachable from a constructor. That set is changeable, so we should definitely at least make the preferences struct be non_exhaustive

Hmm, I wonder if there's a way to derive Preferences from the ResourceProvider bound?

@zbraniecki zbraniecki force-pushed the preferences branch 4 times, most recently from 814d03b to 847296d Compare May 9, 2024 08:24
@zbraniecki zbraniecki requested a review from sffc May 9, 2024 08:25
@zbraniecki
Copy link
Member Author

@sffc ready for your review. Can you mainly review the tests to verify that the architecture fits your expectations?

sffc
sffc previously approved these changes May 9, 2024
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to land; I have only small issues at this point. The shape looks good.

pub list: &'static [DateTimeFormatResolvedPreferences],
}

const DEFAULT_PREFS: DefaultPrefs = DefaultPrefs {
Copy link
Member

@sffc sffc May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: I think most components won't want to use this exact pattern for DefaultPrefs; they will likely want to load their data first and then populate the ResolvedPreferences based on that. Consider adding a comment saying that this is just a test, not the way we expect all clients to write DefaultPrefs.

$(
_ if *s == unicode::value!($key) => $name::$variant $(($v2::try_from(s).ok()))?,
$(
$(_ if *s == unicode::value!($key, $subk) => $name::$variant(Some($v2::$subv))),*,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This parses the unicode::Value each time over the match statement. It is more efficient to use writeable_cmp_bytes so you don't need to parse anything.

Comment on lines 194 to 253
($value:literal, $value2:literal) => {{
let v: &str = concat!($value, "-", $value2);
let R: $crate::extensions::unicode::Value =
match $crate::extensions::unicode::Value::try_from_bytes(v.as_bytes()) {
Ok(r) => r,
_ => panic!(concat!("Invalid Unicode extension value: ", $value)),
};
R
}};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This branch of the macro does runtime parsing and can't be used in a const. It doesn't do anything you can't do in userland; users should just use concat!($value, "-", $value2).parse::<unicode::Value>().

Custom(unicode::Value)
}

impl TryFrom<&unicode::Value> for $name {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: unknown values go to $name::Custom. What is the error case?

Comment on lines 11 to 13
//XXX: Perf
let s = input.to_string();
let v = TinyAsciiStr::from_str(&s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: use the doc-hidden fn input.as_single_subtag(); you can consider making it public if you like

"rgsa" => Rgsa
});

enum_keyword!(Calendar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: this list is missing some aliases. Some of the others might be, too, but I haven't checked. We should do a thorough review of all these enums before 2.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this enum still doesn't include aliases: is that intentional?)

@zbraniecki zbraniecki force-pushed the preferences branch 3 times, most recently from 23696e2 to 79e8ba9 Compare May 17, 2024 22:18
@zbraniecki zbraniecki force-pushed the preferences branch 2 times, most recently from 9063db8 to 573bb04 Compare May 22, 2024 12:47
@zbraniecki zbraniecki requested a review from sffc May 22, 2024 12:47
@zbraniecki
Copy link
Member Author

I think this is the final shape before I start carving out pieces for landing.

Changes:

  • I populated all known keys/values
  • I decided not to add Custom - I kept the macro machinery for it, but I don't think it makes sense now. Unknown values cannot be used so they're ignored. If we want to use a value, we should update the list of values in icu_preferences. I imagine that eventually the list will be generated from CLDR data.
  • I did some arbitrary choices between when to use a struct and when to use enum based on the countable number of values.
  • I moved Subtag to subtags - I think in 2.0 we should remodel the Value code to operate on subtags (from_iter, into_iter etc.) rather than tinystr.
  • I introduced SubdivisionId and SubdivisionSuffix to locid to supply rg and sd.
  • I extended support for try_from_bytes for more extensions/keywords and FromStr

@sffc - can you take a last look at this?

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecture still looks good, but there are details. Might be nice to make smaller PRs but I'm also okay landing the whole thing and working on it in tree.

/// ```
/// use icu::locid::extensions::Extensions;
///
/// Extensions::try_from_bytes(b"u-hc-h12").unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we maybe didn't have this in part because it's unclear whether it should include the leading - or not, as in -u-hc-h12. I think the ToString impl adds the - but I'm not 100% sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to become opinionated, but I'm aligned that parse should be more lenient (maybe support both) and serialization should be more strict.

Comment on lines +5 to +11
crate::enum_keyword!(Collation {
"standard" => Standard,
"search" => Search,
"phonetic" => Phonetic,
"pinyin" => Pinyin,
"searchjl" => Searchjl
}, "co");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: This is far from a comprehensive list of collations. Also, I think this should probably be a struct (open enum) instead of a closed enum. @hsivonen might have thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "search" doesn't belong here, since it's for a different mode of usage. It makes sense for the user to have a preference between "stroke" and "zhuyin". It does not make sense for the user to express a preference between "stroke" and "search".

I'm not quite sure what the exact semantics of "search" vs. "searchjl" are, but 1) we don't have a search API at this point anyway and 2) I believe it's a contextual application design decision whether a given search operation with the Korean UI locale should use "search" or "searchjl".

So let's not put either "search" or "searchjl" here.

That isn't enough to decide what to put here.

As I understand it, the main use case for this is deciding whether for Traditional Chinese you want zhuyin instead of the default-for-zh-Hant stroke. stroke makes sense for any of the varieties of Chinese (Mandarin, Cantonese, etc.), but zhuyin is Mandarin-focused. (I don't know if using zhuyin sorting for e.g. Hakka is nonetheless a thing that user do.)

From theorizing in the reverse, one might except it to make sense for a user to override the default Mandarin-focused pinyin collation order of zh-Hans to stroke when not using Mandarin, but I don't know what the actual user practice is.

Other than that choosing between Mandarin-phonetic (pinyin and zhuyin) vs. stroke for zh, it's not really clear how strong the use cases are. compat for Arabic looks a lot like something that's about ICU change management from long ago (as were big5han and gb2312). Is standard vs. dict for Sinhala something that makes sense as a user preference? Is standard vs. phonebk for German something that makes sense as a user preference? Theoretically, standard vs. trad for Spanish looks potentially something that could make sense as user preference, but what's the level of demand, really? standard vs. trad for Finnish and Swedish is being removed. I have no idea of the user demand for standard vs. trad for Bangla.

Copy link
Member

@hsivonen hsivonen May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the warnings I put on MDN for an idea of why this shouldn't just be "everything the LDML spec has" if the purpose is to model user preferences: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Locale/getCollations (When I wrote that, I thought the und locale worked in Intl.Collator and didn't realize if was an illusion of testing with en-US browser UI locale…)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, I appreciate your feedback and I'm glad to discuss it in context of introducing particular keys. For now I wanted to verify that the architecture works for us. We'll discuss this key in a PR for "co".

"rgsa" => Rgsa
});

enum_keyword!(Calendar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this enum still doesn't include aliases: is that intentional?)


#[inline]
fn writeable_length_hint(&self) -> writeable::LengthHint {
todo!()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to find and replace any todo!()s before merging


struct_keyword!(
RegionOverride,
"ro",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is ro? Did you mean rg?

use tinystr::TinyAsciiStr;

struct_keyword!(
Timezone,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Timezone,
TimeZoneId,


use crate::enum_keyword;

enum_keyword!(Variant {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be an open enum (struct), right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we'll discuss it in the PR specific for this key, but my logic here is that we should narrow down the keys to only "known" values - e.g. values we actually may want to act on in the code.

In other words - the fact that "va" may have any subtag is not meaningful for this crate. If we have nowhere in ICU4X where we operate on the particular value of "va" then it's ok to skip "va" - it's meaningless from the perspective of icu_preferences.

If in the next version of ICU4X we will start having logic specific to va=foo then we will in the next version add Foo as a variant of this enum.

The only open question for me is if we'd like to support Custom which might be useful for non-ICU4X users of icu_preferences - for example some crate like Fluent may want to use icu_preferences and have logic that uses a variant that we don't have logic for. But the way I wrote the architecture allows us to add it in a backward compatible way incrementally.

prefs.merge_locale(loc);
let dtf = DateTimeFormat::new(prefs, Default::default());
assert_eq!(dtf.format(0), String::from("Monday, June 23rd 2022, 24:13"));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ I don't think we decided for sure on the borrowing and ownership considerations.

use crate::parser::ParserError;
use crate::subtags::Region;

impl_tinystr_subtag!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: I'm not completely sure about whether this is the right place to put this subdivision logic. It does largely make sense in icu_locid, but as you pointed out, we don't currently return any values from icu_locid that make use of this type. I also don't necessarily think we should continue making the icu_locid APIs more complex. I think I could get behind a function on extensions::Unicode that returned a typesafe subdivision or region override.

@zbraniecki zbraniecki marked this pull request as draft May 23, 2024 12:38
zbraniecki added a commit that referenced this pull request May 29, 2024
First patch from the #1833 patchset.

It moves the `Subtag` to `locid::subtags::Subtag` - this is a generic
subtag (2..=8, separate from private::Subtag which is 1..=8) which will
be used to iterate over multi-part subtags like
`extensions::unicode::Value` to retrieve specific items without relying
on the underlying `TinyStr` storage.

It's a breaking change because we are removing
`extensions::other::Subtag`. I assume this is ok as we're working on 2.0
so I don't need to add alias and deprecate it, but please confirm.
zbraniecki added a commit that referenced this pull request Jun 1, 2024
Second part of #1833.

This one cleans up from/to str for locid extensions. This prepares
ground for reuse of Subtag in Value API which will be introduced in the
next PR.
zbraniecki added a commit that referenced this pull request Jun 3, 2024
This is part of #1833 switching Value API to use Subtag.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resolution between API options and -u- Locale extensions (migrate constructors to ICU4X Preferences)
6 participants