HistGradientBoosting counts and sample weights #26128
Description
Related issues: #25210
Current State
`HistGradientBoostingClassifier` and `HistGradientBoostingRegressor` both:
- Calculate the sample size `count` in histograms
- Use `count` for splitting (mostly for excluding split candidates)
- Save the `count` in the final trees and use it in partial dependence computations
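To make the discussion concrete, here is a minimal NumPy sketch (not the actual scikit-learn internals) of a per-feature histogram that accumulates gradient sums, hessian sums, and the per-bin sample `count` this issue is about:

```python
import numpy as np

# Hypothetical histogram layout: one record per bin, holding the
# gradient sum, hessian sum, and the sample count discussed here.
HISTOGRAM_DTYPE = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

def build_histogram(binned_feature, gradients, hessians, n_bins):
    """Accumulate per-bin statistics for a single binned feature."""
    hist = np.zeros(n_bins, dtype=HISTOGRAM_DTYPE)
    np.add.at(hist["sum_gradients"], binned_feature, gradients)
    np.add.at(hist["sum_hessians"], binned_feature, hessians)
    np.add.at(hist["count"], binned_feature, 1)
    return hist

binned = np.array([0, 0, 1, 2, 2, 2])
grads = np.array([0.5, -0.5, 1.0, 0.2, 0.3, 0.5])
hess = np.ones(6)
hist = build_histogram(binned, grads, hess, n_bins=3)
```

Dropping the `count` field here (as LightGBM does) would remove one of the three accumulation passes per bin, which is the source of the potential speed-up discussed below.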
Proposition
- Evaluate whether removing `count` from the histograms gives a good speed-up (LightGBM only sums gradient and hessian in histograms, no count).
  Edit: LightGBM uses an approximate count based on the hessian to check for the minimum sample size, so this might not be what we want.
- Add an option to save counts and sample weights to the final trees at the very end of `fit` (where the binned training `X` is still available).
- Use partial dependence `method='recursion'` if the above option was set, else use `method='brute'`.
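The second bullet could be implemented with a single pass over the training data once the trees are grown. A minimal sketch, assuming we already have the leaf index each sample lands in (`leaf_idx` is a hypothetical input, not an existing scikit-learn attribute):

```python
import numpy as np

def leaf_statistics(leaf_idx, sample_weight, n_leaves):
    """Aggregate per-leaf sample counts and sample-weight sums.

    Hypothetical post-`fit` pass: route each training sample to its
    leaf once and store the aggregates on the tree for later use by
    partial dependence with method='recursion'.
    """
    counts = np.bincount(leaf_idx, minlength=n_leaves)
    weight_sums = np.bincount(
        leaf_idx, weights=sample_weight, minlength=n_leaves
    )
    return counts, weight_sums

leaf_idx = np.array([0, 2, 2, 1, 0, 2])
sample_weight = np.array([1.0, 2.0, 1.0, 0.5, 1.0, 1.0])
counts, weight_sums = leaf_statistics(leaf_idx, sample_weight, n_leaves=3)
```

Because this runs once at the end of `fit` rather than inside the per-node histogram loop, it avoids the per-split cost that made earlier approaches too expensive.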
Why?
#25431 concluded that adding weights to the trees is too expensive. The proposition above gives the user a clear choice: faster training, or faster partial dependence computations afterwards.
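The resulting dispatch is simple. A sketch of the proposed behavior, where `store_leaf_weights` is an assumed option name (not an existing parameter):

```python
def choose_pd_method(store_leaf_weights: bool) -> str:
    """Pick the partial dependence method under the proposal.

    'recursion' is fast but needs per-leaf counts/weights stored at
    fit time; 'brute' works without them but is slower.
    """
    return "recursion" if store_leaf_weights else "brute"
```

Users who expect to compute many partial dependence plots would opt in at fit time; everyone else keeps the current, faster `fit`.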