-
-
Notifications
You must be signed in to change notification settings - Fork 25.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH reuse parent histogram in HGBT #26189
base: main
Are you sure you want to change the base?
ENH reuse parent histogram in HGBT #26189
Conversation
Simple benchmarkOn the higgs dataset, this PR gives ~4% speed-up, which is about the expected amount (1 / 28 features = 3.5%). Running
on main gives
On this PR:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR! I left a first pass review.
9fb9006
to
b1efb34
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I do not think this change is too complex, it still adds more complexity to the codebase. Overall, do you think the complexity is worth the 1-1/n_features
performance bump?
if n_samples_left < n_samples_right: | ||
smallest_child = left_child_node | ||
largest_child = right_child_node | ||
is_left_child = True | ||
else: | ||
smallest_child = right_child_node | ||
largest_child = left_child_node | ||
is_left_child = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
is_left_child = n_samples_left < n_samples_right
if is_left_child:
smallest_child = left_child_node
largest_child = right_child_node
else:
smallest_child = right_child_node
largest_child = left_child_node
unsigned int split_bin_end | ||
unsigned char is_categorical | ||
BITSET_INNER_DTYPE_C [:] left_cat_bitset | ||
bint has_parent_hist = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bint has_parent_hist = False | |
bint has_parent_hist = parent_split_info is not None |
if parent_split_info is not None: | ||
has_parent_hist = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if parent_split_info is not None: | |
has_parent_hist = True | |
if has_parent_hist: |
@@ -141,8 +156,30 @@ cdef class HistogramBuilder: | |||
dtype=HISTOGRAM_DTYPE | |||
) | |||
bint has_interaction_cst = allowed_features is not None | |||
# Feature index of the feature that the parent node was split on. | |||
int split_feature_idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int split_feature_idx | |
int parent_split_feature_idx |
unsigned int split_bin_start | ||
# End (+1) of the bin indices belonging to the feature that was split on. | ||
unsigned int split_bin_end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are also for the parent.
unsigned int split_bin_start | |
# End (+1) of the bin indices belonging to the feature that was split on. | |
unsigned int split_bin_end | |
unsigned int parent_split_bin_start | |
# End (+1) of the bin indices belonging to the feature that was split on. | |
unsigned int parent_split_bin_end |
@@ -422,6 +422,13 @@ Changelog | |||
In effect, less memory has to be allocated and deallocated. | |||
:pr:`27865` by :user:`Christian Lorentzen <lorentzenchr>`. | |||
|
|||
- |Efficiency| :class:`ensemble.HistGradientBoostingClassifier` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be moved to 1.5
@lorentzenchr: should we continue and merge this PR, or not? |
Only if I get the commitment of 2 core devs willing to review and finally approve. |
I would be glad reviewing this PR, but I would rather finish what I started first. |
Reference Issues/PRs
None
What does this implement/fix? Explain your changes.
This PR reuses the parent's histogram for the feature that was split on. This saves a little time.
Any other comments?
The implementation is no beauty. If we want to include this improvement, suggestions or alternative PRs are welcome.