
Categorical feature support #108

Merged: 26 commits merged into master from categorical-feature-support on Dec 5, 2016
Conversation

guolinke (Collaborator) commented Dec 3, 2016:

  1. Supports input with categorical features directly (int type only).
  2. Much faster than one-hot encoding (about a 10x speedup on FlightDelayData).

wxchan (Contributor) commented Dec 3, 2016:

Dump model still has a bug, caused by the objective function not being loaded from file. Can you delete L404-L406 in gbdt.cpp for now? You could also add feature names to the dump model if you like.

feature_names = std::ref(train_data_->feature_names());
}

ss << "\"feature_names\":" << Common::Join(feature_names.get(), ' ') << std::endl;
Contributor:

A comma is needed at the end: ss << "\"feature_names\":" << Common::Join(feature_names.get(), ' ') << "," << std::endl;

Contributor:

Sorry, still not right here; quotation marks are needed around the string. It should be ss << "\"feature_names\":\"" << Common::Join(feature_names.get(), ' ') << "\"," << std::endl;.

If Common::Join supported a string delimiter, ss << "\"feature_names\":[\"" << Common::Join(feature_names.get(), "\",\"") << "\"]," << std::endl; would be better.

Could you reinstall the python-package and run simple_example.py to check?
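For illustration, here is a minimal sketch of a Join overload that accepts a string delimiter; the signature and body are assumptions for this sketch, not the PR's actual Common::Join:

#include <sstream>
#include <string>
#include <vector>

// Hypothetical overload (an assumption, not the PR's API): join strings
// with a delimiter that can be longer than one character.
inline std::string Join(const std::vector<std::string>& strs, const std::string& delimiter) {
  if (strs.empty()) {
    return std::string();
  }
  std::stringstream ss;
  ss << strs[0];
  for (size_t i = 1; i < strs.size(); ++i) {
    ss << delimiter << strs[i];
  }
  return ss.str();
}

With feature names {"f0", "f1"}, the suggested line would then emit the valid JSON fragment "feature_names":["f0","f1"],.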

Contributor:

@guolinke btw, will you add get_feature_importance in c_api?

Collaborator (Author):

@wxchan, I think I will not add this in this PR.

ss << "\"leaf_parent\":" << leaf_parent_.data()[index] << "," << std::endl;
ss << "\"leaf_value\":" << leaf_value_.data()[index] << std::endl;
ss << "\"leaf_parent\":" << leaf_parent_[index] << "," << std::endl;
ss << "\"leaf_value\":" << leaf_value_[index] << std::endl;
Contributor:

A comma is needed: ss << "\"leaf_value\":" << leaf_value_[index] << "," << std::endl;

Collaborator (Author):

OK, I see

guolinke force-pushed the categorical-feature-support branch from ab3850e to 2e2a4b2 on December 3, 2016 06:55
guolinke force-pushed the categorical-feature-support branch from b2f6676 to ae672b8 on December 3, 2016 11:35

namespace LightGBM {

enum BinType {
NumericalBin,
CategoracilBin
Contributor:

Typo? Should be CategoricalBin.

guolinke force-pushed the categorical-feature-support branch from fe994b9 to 75bd396 on December 3, 2016 12:17
int r = num_bin_ - 1;
while (l < r) {
int m = (r + l - 1) / 2;
if (value <= bin_upper_bound_[m]) {
Contributor:

If value == bin_upper_bound_[m], why not return m directly?

Collaborator (Author):

The comparison here is <=, so the equal case already falls into this branch.
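For context, a self-contained sketch of the full bisection, reconstructed around the snippet above (the free-function signature is an assumption). Because the test is value <= bin_upper_bound_[m], the equal case shrinks the right boundary, and the loop still converges to the first upper bound that is >= value; an early return on equality would only save a few iterations:

#include <vector>

// Sketch, not the PR's exact code: return the index of the first bin whose
// upper bound is >= value; bin_upper_bound is assumed to be sorted ascending.
inline int ValueToBin(double value, const std::vector<double>& bin_upper_bound) {
  int l = 0;
  int r = static_cast<int>(bin_upper_bound.size()) - 1;
  while (l < r) {
    int m = (r + l - 1) / 2;
    if (value <= bin_upper_bound[m]) {
      r = m;  // equality lands here and keeps narrowing toward the answer
    } else {
      l = m + 1;
    }
  }
  return l;
}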

@@ -378,7 +378,18 @@ class Dataset {
inline int label_idx() const { return label_idx_; }

/*! \brief Get names of current data set */
inline std::vector<std::string> feature_names() const { return feature_names_; }
inline const std::vector<std::string>& feature_names() const { return feature_names_; }
Contributor:

Why not call this function something like get_feature_names?

Collaborator (Author):

Google C++ style: simple accessors are named after the member variable, without a get_ prefix.

return;
}
feature_names_ = std::vector<std::string>(num_total_features_);
for (int i = 0; i < num_total_features_; ++i) {
Contributor:

Or you could use assign instead, which is easier to read; a minimal sketch follows.
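A minimal sketch of that suggestion, assuming the construction above exists only to size the vector:

// Instead of: feature_names_ = std::vector<std::string>(num_total_features_);
feature_names_.assign(num_total_features_, std::string());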

inline static std::string ArrayToString(const T* arr, int n, char delimiter) {
if (n <= 0) {
inline static std::string ArrayToString(const std::vector<T>& arr, char delimiter) {
if (arr.size() <= 0) {
Contributor:

Or you could use empty() here, which reads more clearly; see the sketch below.
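A minimal sketch of that change; the early-return value is an assumption about the original body:

if (arr.empty()) {  // clearer than arr.size() <= 0: size() is unsigned and can never be negative
  return std::string("");
}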

// used for leaf node
/*! \brief The parent of leaf */
std::vector<int> leaf_parent_;
/*! \brief Output of leaves */
std::vector<double> leaf_value_;
/*! \brief DataCount of leaves */
std::vector<data_size_t> leaf_count_;
/*! \brief Output of internal nodes(save internal output for per inference feature importance calc) */
Contributor:

mark here

wxchan (Contributor) commented Dec 4, 2016:

Two bug reports:

  1. Tree load and save order are not the same.
  2. input_model in config causes a segfault. I'm not sure if I'm doing something wrong; you can enable input_model in the example to try it.

guolinke (Collaborator, Author) commented Dec 4, 2016:

@wxchan
Fixed 2.
For 1, I am not sure what you mean. It uses a dict to load the tree model, so the order has no effect.

wxchan (Contributor) commented Dec 4, 2016:

@guolinke Sorry, that might not be the reason, but there is still a bug with input_model: the loaded model is not the same as the saved one. You can see the logs below.

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Finished loading data in 0.070819 seconds
[LightGBM] [Info] Number of data: 7000, number of features: 28
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Iteration:1, training l2 : 0.705793
[LightGBM] [Info] Iteration:1, valid_1 l2 : 0.715089
[LightGBM] [Info] 0.008182 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training l2 : 0.684535
[LightGBM] [Info] Iteration:2, valid_1 l2 : 0.694169
[LightGBM] [Info] 0.015747 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training l2 : 0.664742
[LightGBM] [Info] Iteration:3, valid_1 l2 : 0.674488
[LightGBM] [Info] 0.036652 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training l2 : 0.646386
[LightGBM] [Info] Iteration:4, valid_1 l2 : 0.656487
[LightGBM] [Info] 0.047157 seconds elapsed, finished iteration 4
[LightGBM] [Info] Iteration:5, training l2 : 0.629229
[LightGBM] [Info] Iteration:5, valid_1 l2 : 0.639572
[LightGBM] [Info] 0.062816 seconds elapsed, finished iteration 5
[LightGBM] [Info] Iteration:6, training l2 : 0.613348
[LightGBM] [Info] Iteration:6, valid_1 l2 : 0.623840
[LightGBM] [Info] 0.072787 seconds elapsed, finished iteration 6
[LightGBM] [Info] Iteration:7, training l2 : 0.598412
[LightGBM] [Info] Iteration:7, valid_1 l2 : 0.608859
[LightGBM] [Info] 0.080484 seconds elapsed, finished iteration 7
[LightGBM] [Info] Iteration:8, training l2 : 0.584939
[LightGBM] [Info] Iteration:8, valid_1 l2 : 0.595311
[LightGBM] [Info] 0.088233 seconds elapsed, finished iteration 8
[LightGBM] [Info] Iteration:9, training l2 : 0.572036
[LightGBM] [Info] Iteration:9, valid_1 l2 : 0.582532
[LightGBM] [Info] 0.096979 seconds elapsed, finished iteration 9
[LightGBM] [Info] Iteration:10, training l2 : 0.560268
[LightGBM] [Info] Iteration:10, valid_1 l2 : 0.571050
[LightGBM] [Info] 0.104230 seconds elapsed, finished iteration 10
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Finished loading 10 models
[LightGBM] [Info] Finished loading data in 0.057781 seconds
[LightGBM] [Info] Number of data: 7000, number of features: 28
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Iteration:1, training l2 : 0.905469
[LightGBM] [Info] Iteration:1, valid_1 l2 : 0.923409
[LightGBM] [Info] 0.008511 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training l2 : 0.871719
[LightGBM] [Info] Iteration:2, valid_1 l2 : 0.890034
[LightGBM] [Info] 0.016249 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training l2 : 0.839945
[LightGBM] [Info] Iteration:3, valid_1 l2 : 0.858110
[LightGBM] [Info] 0.023572 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training l2 : 0.810232
[LightGBM] [Info] Iteration:4, valid_1 l2 : 0.828193
[LightGBM] [Info] 0.030878 seconds elapsed, finished iteration 4
[LightGBM] [Info] Iteration:5, training l2 : 0.782431
[LightGBM] [Info] Iteration:5, valid_1 l2 : 0.800635
[LightGBM] [Info] 0.038691 seconds elapsed, finished iteration 5
[LightGBM] [Info] Iteration:6, training l2 : 0.756445
[LightGBM] [Info] Iteration:6, valid_1 l2 : 0.774320
[LightGBM] [Info] 0.046078 seconds elapsed, finished iteration 6
[LightGBM] [Info] Iteration:7, training l2 : 0.732074
[LightGBM] [Info] Iteration:7, valid_1 l2 : 0.749998
[LightGBM] [Info] 0.054507 seconds elapsed, finished iteration 7
[LightGBM] [Info] Iteration:8, training l2 : 0.720591
[LightGBM] [Info] Iteration:8, valid_1 l2 : 0.737566
[LightGBM] [Info] 0.061623 seconds elapsed, finished iteration 8
[LightGBM] [Info] Iteration:9, training l2 : 0.698595
[LightGBM] [Info] Iteration:9, valid_1 l2 : 0.715463
[LightGBM] [Info] 0.068917 seconds elapsed, finished iteration 9
[LightGBM] [Info] Iteration:10, training l2 : 0.678236
[LightGBM] [Info] Iteration:10, valid_1 l2 : 0.695107
[LightGBM] [Info] 0.076339 seconds elapsed, finished iteration 10
[LightGBM] [Info] Finished training

@@ -408,6 +408,11 @@ def _label_from_pandas(label):
if not all(dtype.name in PANDAS_DTYPE_MAPPER for dtype in label_dtypes):
raise ValueError('DataFrame.dtypes for label must be int, float or bool')
label = label.values.astype('float')
elif isinstance(label, Series):
label_dtypes = label.dtypes
if not all(dtype.name in PANDAS_DTYPE_MAPPER for dtype in label_dtypes):
Contributor:

I think pandas.Series.dtypes is not iterable (it is a single value). Did you test this? This part is not needed; in simple_example.py I already use a Series as the label and it works fine.

Collaborator (Author):

OK, I can roll this back.

guolinke force-pushed the categorical-feature-support branch from 552cbe9 to 0a7712b on December 5, 2016 03:30
wxchan (Contributor) commented Dec 5, 2016:

Is there any way to reset categorical_feature after the dataset is constructed?

Actually, I suggest not using (X, y) to represent a dataset in Python, because:

  1. most importantly, it is inaccessible after training once you set some other fields
  2. it is iterable, which makes it complicated to distinguish from multiple datasets
  3. train_fields and valid_fields make the logic very complicated, and users may not know what they mean. It would be better to set these up when the user constructs the dataset.

I plan to open another PR with some other python-package modifications; we can talk about this in that thread.

guolinke (Collaborator, Author) commented Dec 5, 2016:

@wxchan OK.
It is hard to reset categorical_feature after construction, since the Dataset does not store the raw values of the data.
I think letting the user interact with the Dataset directly is also tricky. The main reasons are the construction of validation data and the init_predictor; that logic is not easy to follow and may cause mistakes.
Anyway, I think we can figure out a better way.

guolinke merged commit 1466f90 into master on Dec 5, 2016.