-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categorical feature support #108
Conversation
guolinke
commented
Dec 3, 2016
•
edited
Loading
edited
- support input with categorical features (only support int type) directly
- much faster than one-hot coding. (about 10x speed up in FlightDelayData)
Dump model still causes bug, it's caused by objective function which is not loaded from file. Can you delete L404-L406 in gbdt.cpp for now? You can also add feature names into dump model if you will. |
feature_names = std::ref(train_data_->feature_names()); | ||
} | ||
|
||
ss << "\"feature_names\":" << Common::Join(feature_names.get(), ' ') << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a comma in the end, ss << "\"feature_names\":" << Common::Join(feature_names.get(), ' ') << "," << std::endl;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, still not right here, need to add quotation marks outside string. Should be ss << "\"feature_names\":\"" << Common::Join(feature_names.get(), ' ') << "\"," << std::endl;
.
If Common::Join support delimiter as string, ss << "\"feature_names\":[\"" << Common::Join(feature_names.get(), "\",\"") << "\"]," << std::endl;
will be better.
Could you help reinstall python-package and run simple_example.py to check it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guolinke btw, will you add get_feature_importance in c_api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wxchan , i think i will not add this on this PR.
ss << "\"leaf_parent\":" << leaf_parent_.data()[index] << "," << std::endl; | ||
ss << "\"leaf_value\":" << leaf_value_.data()[index] << std::endl; | ||
ss << "\"leaf_parent\":" << leaf_parent_[index] << "," << std::endl; | ||
ss << "\"leaf_value\":" << leaf_value_[index] << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a comma, ss << "\"leaf_value\":" << leaf_value_[index] << "," << std::endl;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see
ab3850e
to
2e2a4b2
Compare
b2f6676
to
ae672b8
Compare
|
||
namespace LightGBM { | ||
|
||
enum BinType { | ||
NumericalBin, | ||
CategoracilBin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mistype? (CategoricalBin)
fe994b9
to
75bd396
Compare
int r = num_bin_ - 1; | ||
while (l < r) { | ||
int m = (r + l - 1) / 2; | ||
if (value <= bin_upper_bound_[m]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if value == bin_upper_bound_[m] , why not return m directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is <=
@@ -378,7 +378,18 @@ class Dataset { | |||
inline int label_idx() const { return label_idx_; } | |||
|
|||
/*! \brief Get names of current data set */ | |||
inline std::vector<std::string> feature_names() const { return feature_names_; } | |||
inline const std::vector<std::string>& feature_names() const { return feature_names_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why call this function like "get_feature_names"....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
google cpp style.
return; | ||
} | ||
feature_names_ = std::vector<std::string>(num_total_features_); | ||
for (int i = 0; i < num_total_features_; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or you could use assign instead for easy understanding...
inline static std::string ArrayToString(const T* arr, int n, char delimiter) { | ||
if (n <= 0) { | ||
inline static std::string ArrayToString(const std::vector<T>& arr, char delimiter) { | ||
if (arr.size() <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or could use empty() for understanding..
// used for leaf node | ||
/*! \brief The parent of leaf */ | ||
std::vector<int> leaf_parent_; | ||
/*! \brief Output of leaves */ | ||
std::vector<double> leaf_value_; | ||
/*! \brief DataCount of leaves */ | ||
std::vector<data_size_t> leaf_count_; | ||
/*! \brief Output of internal nodes(save internal output for per inference feature importance calc) */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mark here
2 bugs report:
|
@wxchan |
@guolinke Sorry, it might not be the reason, but it still has bug with input_model. The loaded model is not same as the saved one. You can see the logs below.
|
@@ -408,6 +408,11 @@ def _label_from_pandas(label): | |||
if not all(dtype.name in PANDAS_DTYPE_MAPPER for dtype in label_dtypes): | |||
raise ValueError('DataFrame.dtypes for label must be int, float or bool') | |||
label = label.values.astype('float') | |||
elif isinstance(label, Series): | |||
label_dtypes = label.dtypes | |||
if not all(dtype.name in PANDAS_DTYPE_MAPPER for dtype in label_dtypes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think pandas.Series.dtypes is not iterable (only one value). Did you test on it? This part is not needed, in simple_example.py I already used Series as label and it works fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I can rollback this version
552cbe9
to
0a7712b
Compare
Is there any way to reset categorical_feature after dataset is constructed? Actually I suggest not to use (X, y) to represent a dataset in python, because:
I plan to open another PR with some other python-package modifications, we can talk about this in that thread. |
@wxchan OK. |