Skip to content

ensure consistent, reproducible choice when multiple parameter aliases are given #5304

Closed
@jameslamb

Description

Summary

Many LightGBM interfaces (in R, Python, C++, and others) accept a key-value map params, which can be used to override LightGBM's default configuration. The valid values are documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

For many of those parameters, LightGBM recognizes a "main" parameter name and one or more "aliases" (other names which set the same configuration).

For example, main parameter num_iterations can also be referred to in user code as n_iter, num_round, and more (docs link).

On the C++ side, LightGBM guarantees reproducible behavior whenever multiple of these aliases are provided in the same call, like this:

{
    "num_round": 100,
    "n_iter": 200
}

LightGBM's C++, R, and Python packages should all make identical choices in such situations.

Motivation

Ensuring that LightGBM always chooses the same configuration given a certain content of params eliminates one possible source of the same code producing different results at different times or in different environments. That might save maintainers and users of the project time that would otherwise be lost investigating changes in results.

Description

Currently, the C++ side has some logic to make the choice of alias reproducible.

if (alias_set != tmp_map.end()) { // alias already set
// set priority by length & alphabetically to ensure reproducible behavior
if (alias_set->second.size() < pair.first.size() ||
(alias_set->second.size() == pair.first.size() && alias_set->second < pair.first)) {

Instead of replicating that logic in Python and R code, I believe this feature should be implemented similarly to the approach taken in #4829. The full list of recognized aliases is known at compile time, so it shouldn't necessary to write R and Python code similar to that C++ code which checks name lengths and alphabetic ordering every time params is processed.

I think these could all be kept in sync by:

  • moving that size + alphabetic ordering logic out of ParameterAlias::KeyAliasTransform() and instead having https://github.com/microsoft/LightGBM/blob/master/helpers/parameter_generator.py pre-sort all aliases that way
  • changing ParameterAlias::KeyAliasTransform() in C++ to use the output of Config::DumpAliases() or some other code on Config, iterate over aliases in order, and prefer the first one that it finds (taking advantage of the fact that the aliases have already been sorted)

To avoid the overhead of serializing and deserializing a JSON string, it might also be useful to add an intermediate method for Config::DumpAliases() that returns arrays of names, and re-use that across both Config::DumpAliases() and that alias-resolution code in ParameterAlias::KeyAliasTransform().

References

Created based on #5289 (comment).

The changes in #4829 are highly relevant to this issue, and reading that PR will help those looking to understand this issue more thoroughly.

Initial PR with deterministic aliases resolution method at cpp side: #961.

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions