ensure consistent, reproducible choice when multiple parameter aliases are given #5304
Summary
Many LightGBM interfaces (in R, Python, C++, and others) accept a key-value map params, which can be used to override LightGBM's default configuration. The valid values are documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.
For many of those parameters, LightGBM recognizes a "main" parameter name and one or more "aliases" (other names which set the same configuration).
For example, the main parameter num_iterations can also be referred to in user code as n_iter, num_round, and more (docs link).
On the C++ side, LightGBM guarantees reproducible behavior when more than one of these aliases is provided in the same call, like this:
{
"num_round": 100,
"n_iter": 200
}
LightGBM's C++, R, and Python packages should all make identical choices in such situations.
Motivation
Ensuring that LightGBM always chooses the same configuration for the same contents of params eliminates one possible source of the same code producing different results at different times or in different environments. That could save maintainers and users of the project time that would otherwise be lost investigating changes in results.
Description
Currently, the C++ side has some logic to make the choice of alias reproducible.
See LightGBM/include/LightGBM/config.h, lines 1141 to 1144 in fc0c8fd.
Instead of replicating that logic in Python and R code, I believe this feature should be implemented similarly to the approach taken in #4829. The full list of recognized aliases is known at compile time, so it shouldn't be necessary to write R and Python code, similar to that C++ code, which checks name lengths and alphabetic ordering every time params is processed.
I think these could all be kept in sync by:
- moving that size + alphabetic ordering logic out of ParameterAlias::KeyAliasTransform() and instead having https://github.com/microsoft/LightGBM/blob/master/helpers/parameter_generator.py pre-sort all aliases that way
- changing ParameterAlias::KeyAliasTransform() in C++ to use the output of Config::DumpAliases() or some other code on Config, iterate over aliases in order, and prefer the first one that it finds (taking advantage of the fact that the aliases have already been sorted)
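The two steps above could look roughly like the following Python sketch. It is an illustration only, not LightGBM's actual code: the alias table is a small hand-written subset, the helper names presort_aliases and resolve_aliases are hypothetical, and the assumed priority (main name first, then shorter aliases before longer ones, with ties broken alphabetically) should be checked against ParameterAlias::KeyAliasTransform().

```python
from typing import Any, Dict, List

# Hand-written subset of the alias table, for illustration only.
RAW_ALIASES: Dict[str, List[str]] = {
    "num_iterations": ["num_round", "n_iter", "num_trees", "num_boost_round"],
    "learning_rate": ["shrinkage_rate", "eta"],
}


def presort_aliases(raw: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Sort each alias list once, by length and then alphabetically.

    Assumption: shorter names take priority. A generator script could do
    this ahead of time so that no runtime comparison is needed.
    """
    return {main: sorted(names, key=lambda n: (len(n), n))
            for main, names in raw.items()}


SORTED_ALIASES = presort_aliases(RAW_ALIASES)


def resolve_aliases(params: Dict[str, Any]) -> Dict[str, Any]:
    """Replace aliases with main names, keeping the first match in sorted order."""
    alias_names = {a for names in SORTED_ALIASES.values() for a in names}
    # Keys that are not aliases (main names, unrelated keys) pass through unchanged.
    resolved = {k: v for k, v in params.items() if k not in alias_names}
    for main, names in SORTED_ALIASES.items():
        if main in params:
            continue  # assumption: an explicitly given main name always wins
        for name in names:
            if name in params:
                resolved[main] = params[name]
                break  # aliases are pre-sorted, so the first hit is the winner
    return resolved
```

Under these assumptions, both {"num_round": 100, "n_iter": 200} and {"n_iter": 200, "num_round": 100} resolve to num_iterations = 200, because n_iter sorts ahead of num_round and insertion order plays no role.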
To avoid the overhead of serializing and deserializing a JSON string, it might also be useful to add an intermediate method for Config::DumpAliases() that returns arrays of names, and re-use that across both Config::DumpAliases() and that alias-resolution code in ParameterAlias::KeyAliasTransform().
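The shape of that intermediate method can be sketched as follows. The real change would be in C++; Python is used here only to show the layering, and the names alias_arrays, dump_aliases, and main_name_for are hypothetical, as is the two-entry alias table.

```python
import json
from typing import Dict, List, Optional

# Hypothetical stand-in for the compile-time alias table, already pre-sorted.
_ALIAS_TABLE: Dict[str, List[str]] = {
    "num_iterations": ["n_iter", "num_round"],
}


def alias_arrays() -> Dict[str, List[str]]:
    """Intermediate accessor: alias names as plain lists, with no JSON involved."""
    return {main: list(names) for main, names in _ALIAS_TABLE.items()}


def dump_aliases() -> str:
    """JSON view for external callers, built on top of alias_arrays()."""
    return json.dumps(alias_arrays())


def main_name_for(key: str) -> Optional[str]:
    """Internal lookup that reads the arrays directly instead of parsing JSON."""
    for main, names in alias_arrays().items():
        if key == main or key in names:
            return main
    return None
```

Both the JSON dump and the internal lookup read from the same accessor, so the two cannot drift apart and the lookup never pays for a serialize/parse round trip.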
References
Created based on #5289 (comment).
The changes in #4829 are highly relevant to this issue, and reading that PR will help those looking to understand this issue more thoroughly.
Initial PR adding a deterministic alias-resolution method on the C++ side: #961.