# nmbx (NuMBoX)

A box of tools that deal with numbers.
In `nmbx.convergence` we have tools for testing convergence in a sequence of numbers, such as the loss when training machine learning models, or any other iterative process with a converging metric.

`SlopeZero` detects a flat plateau ("zero slope") and is a general-purpose method. `SlopeRise` detects a rise in the history after a flat plateau, which could be used to monitor a validation loss, for instance.

The idea is to call either one in a training loop, passing a history of loss values.
```python
from nmbx.convergence import SlopeZero

# Early stopping with wait=10 iterations of "patience", an absolute tolerance
# of 0.01 and a moving average window of wlen=25 points. Start checking not
# before 100 iterations have been performed.
conv = SlopeZero(wlen=25, atol=0.01, wait=10, delay=100)

history = []
while True:
    history.append(compute_loss(model, data))
    if conv.check(history):
        print("converged")
        break
```
`SlopeZero` implements the same logic as found in Keras' or Lightning's `EarlyStopping(mode=...)` with `mode="min"` or `"max"`. In addition we provide `mode="abs"` (detect convergence without assuming a direction).
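For example, a plateau in a metric whose direction is not known up front could be detected like so (a minimal sketch; we assume here that `mode` is passed to the constructor alongside the other arguments from the example above):

```python
# Detect a plateau no matter whether the history falls or rises towards it.
conv = SlopeZero(wlen=25, atol=0.01, mode="abs")
```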
Since we only work with a given list of numbers, the `atol` and `rtol` parameters are to be understood w.r.t. what `tol` does in each method, where `tol = atol` or `tol = rtol * abs(prev)`. In short:

- `SlopeRise`: `last - tol > prev`
- `SlopeZero`:
  - `mode="abs"`: `|last - prev| < tol`
  - `mode="min"`: `last + tol > prev`
  - `mode="max"`: `last - tol < prev`
`last` and `prev` are the mean/median/... (see `wlen_avg`) over the last and previous non-overlapping windows of `wlen` points each. This means that the earliest convergence point can be detected after `2 * wlen` iterations. With `delay`, the first possible convergence point is after `2 * wlen + delay` iterations.
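A minimal sketch of this windowing, assuming a plain mean reduction (i.e. `wlen_avg=np.mean`):

```python
import numpy as np

def window_means(history, wlen):
    # Reduce the last two non-overlapping windows of wlen points each.
    assert len(history) >= 2 * wlen, "need at least 2 * wlen points"
    h = np.asarray(history, dtype=float)
    prev = np.mean(h[-2 * wlen : -wlen])
    last = np.mean(h[-wlen:])
    return last, prev

last, prev = window_means([5.0, 4.0, 3.0, 2.5, 2.45, 2.44], wlen=3)
# mode="min" check with atol=0.1: last + tol > prev
print(last + 0.1 > prev)  # False: still decreasing by more than atol
```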
We implement several options that can make convergence checks more robust and versatile than vanilla "early stopping".
- Noise filtering (smoothing): Histories are often noisy (e.g. when using stochastic optimizers). In vanilla early stopping, the only countermeasure is "patience". We have the option to smooth the history using
  - a Gaussian filter (set `smooth_sigma`; see the sketch after this list) and/or
  - a moving reduction over windows of size `wlen` (reduction = mean/median/..., see `wlen_avg`). `wlen=1` means a window of one point, so no noise filtering of this kind. You can still use the Gaussian filter by setting `smooth_sigma`.
- You may use some `delay` to make sure to run at least this many iterations before checking for convergence. This can help to avoid early false positive convergence detection.
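Here is a sketch of what Gaussian smoothing does to a noisy history, using `scipy.ndimage.gaussian_filter1d` as a stand-in (we don't claim this is exactly what `smooth_sigma` triggers internally):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(42)
xs = np.linspace(0.0, 6.0, 300)
noisy = np.exp(-xs) + rng.normal(0.0, 0.05, len(xs))

# Larger sigma = stronger noise suppression, but also more lag in
# reacting to real changes in the history.
smooth = gaussian_filter1d(noisy, sigma=5)
```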
Can we get "transferable" tolerances? Well, kind of.
- Absolute (`atol`) or relative (`rtol`) tolerances: If you know the unit of the history and can say something like "we call changes below 0.01 converged", then use `atol`. Else, try to use a relative tolerance `rtol`, in which case we use `tol = rtol * abs(prev)`.
  - Pro: This will be invariant to a scaling $y' = y s$.
  - Con: It will not be invariant to a shift $y' = y + c$.
- Standardization: You can standardize `history`, for example using a z-score (set `std="std"` and `std_avg=np.mean`) to zero mean and unit standard deviation such that, at each iteration $i$, `atol` will be in units of $\sigma_i$. Now the convergence criterion is "stop if changes are below `atol` standard deviations". See the sketch after this list.
  - Pro: This is helpful for histories of very different numerical scale but similar "shape", and where you don't know or care about the unit of $y$. More precisely, you can apply the same `atol` to all histories which differ from $y$ by an affine transform $y' = y s + c$.
  - Con: Since `check()` is an online method, the standardization is performed at each iteration, using all history values provided so far. Therefore $\sigma_i$, and thus the unit of `atol`, will change, which makes the effect of standardization more difficult to interpret. There are corner cases where this method doesn't work (for example a noise-free constant history where $\sigma$ is zero). Also, some experimentation is needed to find a good `atol` value. Check `examples/convergence/visualize_std.py` and all `test_atol_std*` tests.
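The affine invariance claim is easy to verify with a toy z-score (a sketch, not nmbx's implementation):

```python
import numpy as np

def z_score(history):
    # Standardize using all values seen so far (the online setting).
    h = np.asarray(history, dtype=float)
    return (h - np.mean(h)) / np.std(h)

y = np.exp(-np.linspace(0.0, 5.0, 200))  # toy decaying "loss"
y_affine = 1000.0 * y + 42.0             # y' = y s + c

# Identical after standardization, so the same atol applies to both.
np.testing.assert_allclose(z_score(y), z_score(y_affine), atol=1e-12)
```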
Here are results from a parameter study in `examples/convergence/param_study.py` with noise-free and noisy histories, where we explore the above parameters. Blue points are the histories. The other points indicate when `check()` is True. The points marked with vertical dashed lines are the first points where the check is True, i.e. where you would break out of the training loop. If no colored points show up, this means that the corresponding parameter setting leads to no convergence detection.
We observe that `SlopeZero` is pretty robust against noise, while `SlopeRise` is more tricky, i.e. it is not clear what the right parameters are in this case.
- Use `wlen > 1` (smoothing by moving reduction) and `wait > 1` ("patience"). Typical values are `wlen=10...20` (but that depends very much on your data) and `wait=5...10`. The Gaussian filter (`smooth_sigma`) is a more effective smoothing option, but some `wlen` can still help.
- When data is noisy, it can help to raise `tol` to prevent too early / false convergence detection. But a good setting for `wlen` or `smooth_sigma` is preferable.
- To find a good `smooth_sigma` value, start with `smooth_sigma = wlen/3`, where `wlen` is the value you would use for smoothing instead (e.g. `wlen=15 -> smooth_sigma=5`). See `examples/convergence/find_sigma_from_wlen.py`.
- Use `delay` if you know that you need to run at least this many iterations. This also helps to avoid early false positives.
- When using standardization (e.g. `std="std"`), start with `atol=0.01`, so "stop when things fluctuate by less than 0.01 standard deviations".
- Take the ballpark numbers used in `examples/convergence/` as a starting point; see the sketch after this list.
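Combining these tips, a starting configuration could look like this (values are illustrative ballpark numbers, not tuned for your data):

```python
from nmbx.convergence import SlopeZero

# Moving-window smoothing (wlen), "patience" (wait), a Gaussian filter
# with smooth_sigma ~ wlen/3, and a warm-up delay before the first check.
conv = SlopeZero(wlen=15, wait=5, atol=0.01, smooth_sigma=5, delay=100)
```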
One can frame this as an optimization problem. We have an example using Optuna in `examples/convergence/param_opt.py`. You can use this if you have a representative history recorded for your application and plan to use convergence detection for many similar runs.
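For illustration, a hypothetical Optuna setup could minimize the distance between the detected convergence point and a hand-labeled target iteration (the actual objective in `examples/convergence/param_opt.py` may differ; `target_iter`, the toy history and the search ranges are assumptions here):

```python
import numpy as np
import optuna
from nmbx.convergence import SlopeZero

rng = np.random.default_rng(0)
history = list(np.exp(-np.linspace(0.0, 6.0, 300)) + rng.normal(0.0, 0.01, 300))
target_iter = 150  # hand-labeled "correct" convergence point (assumption)

def detect(conv, history):
    # Replay the history and return the first iteration where check() fires.
    for i in range(1, len(history) + 1):
        if conv.check(history[:i]):
            return i
    return None

def objective(trial):
    conv = SlopeZero(
        wlen=trial.suggest_int("wlen", 1, 30),
        atol=trial.suggest_float("atol", 1e-4, 1e-1, log=True),
        wait=trial.suggest_int("wait", 1, 10),
    )
    found = detect(conv, history)
    # Penalize parameter settings that never detect convergence.
    return abs(found - target_iter) if found is not None else len(history)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```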