[WIP] Add support for more than max(int_32_t) data points #4809
base: master
Conversation
I think we should benchmark the training speed with regard to this change.
@thvasilo Thanks for opening this PR. (Refers to `LightGBM/include/LightGBM/meta.h`, line 28 in 06e3c4a.)
There's no need to touch so many files. @guolinke I can help to add templates for the `data_size_t`-dependent code. I think it would be better to make `data_size_t` a template parameter.
It seems that it would be difficult to use a hyperparameter to enable `int64_t` support at runtime.
I think a compile option is much easier.
Hi @shiyu1994, I was trying to follow @guolinke's task list from #2818.
I can go back to simply changing the type of `data_size_t`. Agreed that a compile option or a template that allows for runtime selection would be the best choice. I think for a first iteration a compile-time option would be easiest.
Force-pushed from 93ff0c7 to 3ccfa71
Switched to simple substitution of the typedef for `data_size_t`.
Force-pushed from 3ccfa71 to dbfe69d
Force-pushed from dbfe69d to eec5d70
@shiyu1994 could you provide some guidance on resolving the Python/C API errors? It seems some symbols are now missing; I'm getting errors like:
I'm assuming the signatures of the functions have changed as a result of the type change, but some parts of the code were probably left unchanged, causing header/implementation mismatches. It's hard to figure out all the parts of the code that need changing, as CLion doesn't seem to recognize all targets (e.g. `testlightgbm` is not recognized). What's the easiest way to run the tests that the CI runs locally?
@thvasilo Thanks for working on this. I just had a quick test on your branch. The declaration and implementation of `LGBM_DatasetGetField` are now inconsistent.
```diff
@@ -1491,7 +1491,7 @@ int LGBM_DatasetSetField(DatasetHandle handle,

 int LGBM_DatasetGetField(DatasetHandle handle,
                          const char* field_name,
-                         int* out_len,
+                         data_size_t* out_len,
```
We should change the declaration of `LGBM_DatasetGetField` in `c_api.h`.
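For reference, a sketch of what the matching change in `c_api.h` could look like (the surrounding parameters are reproduced from the current header to the best of my knowledge; only `out_len` changes):

```cpp
// c_api.h -- declaration must be updated to match the new implementation,
// otherwise the exported symbol and the bindings' expectations diverge.
LIGHTGBM_C_EXPORT int LGBM_DatasetGetField(DatasetHandle handle,
                                           const char* field_name,
                                           data_size_t* out_len,  // was: int* out_len
                                           const void** out_ptr,
                                           int* out_type);
```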
In addition, if we change the parameter types of these C APIs, we will break the Python and R-package code where these C APIs are called and `int32_t*` is passed as the argument.
I think in the first step, it would be good enough to enable `int64_t` for the CLI version and keep the changes within the C++ code, so we can keep these C APIs unchanged. That requires converting these `int32_t*` arrays into `int64_t*` arrays manually within these C APIs.
Not sure if I follow: is it possible to make the switch from int32 to int64 for the C++ package while isolating the Python package from it? My understanding is that the `c_api` changes are needed because that's what Python uses for its bindings.
So moving to int64 breaks the Python bindings unless we also make the changes in the `c_api` as well?
Is it possible to isolate changes to the CLI version alone?
In the Python/R code, the C API is called directly, e.g. `LightGBM/python-package/lightgbm/basic.py`, lines 1843 to 1848 in 946817a:

```python
_safe_call(_LIB.LGBM_DatasetGetSubset(
    self.reference.construct().handle,
    used_indices.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
    ctypes.c_int32(used_indices.shape[0]),
    c_str(params_str),
    ctypes.byref(self.handle)))
```
And if we change the C API, we have to change these calls in the Python/R packages; for example, the `ctypes.c_int` should be changed to `ctypes.c_int64` in line 1798.
So I think we can keep the C APIs unchanged, so that we don't have to make many changes in the Python/R packages now. But that would require some manual casting of `int32_t` into `int64_t` within the C APIs. For example, in `LGBM_DatasetGetSubset` (lines 1386 to 1410 in 946817a):
```cpp
int LGBM_DatasetGetSubset(
    const DatasetHandle handle,
    const int32_t* used_row_indices,
    int32_t num_used_row_indices,
    const char* parameters,
    DatasetHandle* out) {
  API_BEGIN();
  auto param = Config::Str2Map(parameters);
  Config config;
  config.Set(param);
  OMP_SET_NUM_THREADS(config.num_threads);
  auto full_dataset = reinterpret_cast<const Dataset*>(handle);
  CHECK_GT(num_used_row_indices, 0);
  const int32_t lower = 0;
  const int32_t upper = full_dataset->num_data() - 1;
  CheckElementsIntervalClosed(used_row_indices, lower, upper, num_used_row_indices, "Used indices of subset");
  if (!std::is_sorted(used_row_indices, used_row_indices + num_used_row_indices)) {
    Log::Fatal("used_row_indices should be sorted in Subset");
  }
  auto ret = std::unique_ptr<Dataset>(new Dataset(num_used_row_indices));
  ret->CopyFeatureMapperFrom(full_dataset);
  ret->CopySubrow(full_dataset, used_row_indices, num_used_row_indices, true);
  *out = ret.release();
  API_END();
}
```
we will do something like this to convert the `int32_t` array into an `int64_t` array:

```cpp
std::vector<int64_t> used_row_indices_int64(num_used_row_indices);
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < num_used_row_indices; ++i) {
  used_row_indices_int64[i] = static_cast<int64_t>(used_row_indices[i]);
}
```
and pass `used_row_indices_int64.data()` in place of `used_row_indices` in subsequent function calls.
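Concretely, the `CopySubrow` call in the snippet above would then take the widened copy (assuming its index parameter becomes `data_size_t`, i.e. `int64_t`, under this change):

```cpp
// Sketch: pass the int64_t copy instead of the original int32_t* argument.
ret->CopySubrow(full_dataset, used_row_indices_int64.data(),
                num_used_row_indices, true);
```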
```diff
@@ -25,7 +25,7 @@
 namespace LightGBM {

 /*! \brief Type of data size, it is better to use signed type*/
-typedef int32_t data_size_t;
+typedef int64_t data_size_t;
```
Add a compile option to switch between `int64_t` and `int32_t`.
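A minimal sketch of what such a switch could look like in `meta.h` (the macro name `LGBM_USE_INT64_DATA_SIZE` is hypothetical; the real option would be defined via the build system):

```cpp
// meta.h -- sketch: choose the width of data_size_t at compile time.
// LGBM_USE_INT64_DATA_SIZE is a hypothetical macro a CMake option could set.
#ifdef LGBM_USE_INT64_DATA_SIZE
typedef int64_t data_size_t;   /*!< \brief Type of data size, 64-bit build */
#else
typedef int32_t data_size_t;   /*!< \brief Type of data size, default 32-bit build */
#endif
```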
I'll try this. There are a lot of places with mixed references to `data_size_t` and `int32`/`int64`; hopefully I'll be able to isolate them :)
Hi, is this PR still active? I've hit the same issue, so I'm willing to work on this.
Hi, I'm no longer working on this project so feel free to pick it up.
Aimed to address #4804 and #2818.
This is a WIP PR to start work on allowing more than the ~2B data points that `int32_t` allows for. For now I've simply replaced all references to `data_size_t` with `int64_t` and made some post-hoc fixes in certain places, replacing `int64_t` with `int32_t` where it was needed for the project to compile and the tests to run. Currently I see 20/20 tests passing, but I also see this output when running them:
Due to the prevalence of `data_size_t`, it might be a better strategy to selectively update certain files/classes to move to `int64_t`, to make reviewing easier and the changes more gradual (splitting this PR up into multiple ones). To do that I would need some guidance from the core devs as to which references to `data_size_t` are the best candidates to change to enable support for `int64_t` data points. Another area where the devs could help is suggesting stress tests and ways to measure the performance/memory cost impact, perhaps by running benchmarks or other performance checks.