
[WIP] Add support for more than max(int32_t) data points #4809

Open · wants to merge 1 commit into base: master
Conversation

thvasilo

Aims to address #4804 and #2818.

This is a WIP PR to start work on allowing more than the ~2B data points that int32_t allows.

For now I've simply replaced all references to data_size_t with int64_t, and made some post-hoc fixes in certain places, replacing int64_t back with int32_t where it was needed for the project to compile and the tests to run.

Currently I see 20/20 tests passing, but I also see this output when running them:

```
[LightGBM] [Fatal] ChunkedArray chunk size must be larger than 0!
[LightGBM] [Fatal] no conversion to double for: x1
```

Due to the prevalence of data_size_t, it might be a better strategy to selectively update certain files/classes to int64_t, to make reviewing easier and the changes more gradual (splitting this PR into multiple ones). To do that I would need some guidance from the core devs as to which references to data_size_t are the best candidates to change to enable support for int64_t data points.

Another area where the devs could help is suggesting stress tests and ways to measure the performance/memory cost impact, perhaps by running benchmarks or other performance checks.

@guolinke
Collaborator

I think we should benchmark the training speed with regard to this change.

@shiyu1994
Collaborator

shiyu1994 commented Nov 18, 2021

@thvasilo Thanks for opening this PR.
To replace data_size_t with int64_t, we only need to change the definition of data_size_t here

typedef int32_t data_size_t;

There's no need to touch so many files.

@guolinke I can help to add templates for data_partition.cpp, and check that all usages of data_size_t remain correct with int64_t.

I think it would be better to make the int64_t data size an option instead of the default setting. Do you think a compile option or a hyperparameter is preferred?

@shiyu1994
Collaborator

It seems that it would be difficult to use a hyperparameter to enable int64_t without large changes.

@guolinke
Collaborator

I think a compile option is much easier.
Another solution is using templates, and automatically choosing data_size_t according to #data.

@thvasilo
Author

Hi @shiyu1994, I was trying to follow @guolinke's task list from #2818, which had listed:

> remove data_size_t completely in code, directly used int64_t

I can go back to simply changing the type of data_size_t if you think that's a better option. Let me know and I'll make the changes accordingly.

Agreed that a compile option, or a template that allows for runtime selection, would be the best choice. I think for a first iteration a compile-time option would be easiest.

@thvasilo
Author

Switched to a simple substitution of the typedef for data_size_t. I'll try a few learning tasks and report back; if you have particular benchmarks you'd like to see, please let me know.

@ghost

ghost commented Nov 25, 2021

CLA assistant check
All CLA requirements met.

@thvasilo
Author

thvasilo commented Dec 1, 2021

@shiyu1994 could you provide some guidance on resolving the Python/C API errors?

It seems some symbols are now missing; I'm getting errors like:

```
E AttributeError: /home/AzDevOps_azpcontainer/.local/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so: undefined symbol: LGBM_DatasetGetField
```

I'm assuming the signatures of the functions have changed as a result of the type change, but some parts of the code were probably left unchanged, causing header/implementation mismatches. It's hard to figure out all the parts of the code that need changing, as CLion doesn't seem to recognize all targets (e.g. testlightgbm is not recognized).

What's the easiest way to run the tests that the CI runs locally?

@shiyu1994
Collaborator

@thvasilo Thanks for working on this. I just had a quick test on your branch. The declaration and implementation of LGBM_DatasetGetField are not compatible:

```diff
@@ -1491,7 +1491,7 @@ int LGBM_DatasetSetField(DatasetHandle handle,
 int LGBM_DatasetGetField(DatasetHandle handle,
                          const char* field_name,
-                         int* out_len,
+                         data_size_t* out_len,
```
Collaborator

We should change the declaration of LGBM_DatasetGetField in c_api.h.
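For illustration, a minimal sketch of what keeping declaration and definition in sync means here — simplified signature and stub body, with a `_sketch` suffix to make clear this is not the real c_api code:

```cpp
#include <cstdint>

typedef int64_t data_size_t;

// c_api.h side: the declaration must use the new type...
extern "C" int LGBM_DatasetGetField_sketch(void* handle, const char* field_name,
                                           data_size_t* out_len);

// c_api.cpp side: ...and the definition must match exactly. C-linkage
// functions cannot be overloaded, so a header declaring `int* out_len`
// against a definition taking `data_size_t* out_len` fails to compile,
// which is one way lib_lightgbm.so can end up missing the symbol.
extern "C" int LGBM_DatasetGetField_sketch(void* handle, const char* field_name,
                                           data_size_t* out_len) {
  (void)handle;
  (void)field_name;
  *out_len = 0;  // stub body for illustration only
  return 0;
}
```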

Collaborator

In addition, if we change the parameter types of these C APIs, we will break the Python and R-package code where these C APIs are called and int32_t* is passed as the argument.

I think as a first step it would be good enough to enable int64_t for the CLI version and keep the changes within the C++ code, so we can keep these C APIs unchanged. That requires converting these int32_t* arrays into int64_t* arrays manually within the C APIs.

Author

Not sure if I follow: Is it possible to make the switch from int32 to int64 for the C++ package while isolating it from the Python package? My understanding is that the c_api changes are needed because that's what Python uses for its bindings.

So moving to int64 breaks the Python bindings, unless we also make the changes in the c_api?

Is it possible to isolate the changes to the CLI version alone?

Collaborator

In the Python/R code, the C API is called like this:

```python
_safe_call(_LIB.LGBM_DatasetGetSubset(
    self.reference.construct().handle,
    used_indices.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
    ctypes.c_int32(used_indices.shape[0]),
    c_str(params_str),
    ctypes.byref(self.handle)))
```

And if we change the C API, we have to change these calls in Python/R; for example, the ctypes.c_int should be changed to ctypes.c_int64 in line 1798.

So I think we can keep the C APIs unchanged, so that we don't have to make many changes in the Python/R packages for now.

But that would require some manual casting of int32_t into int64_t within the C APIs, for example in src/c_api.cpp (lines 1386 to 1410 at 946817a):

```cpp
int LGBM_DatasetGetSubset(
  const DatasetHandle handle,
  const int32_t* used_row_indices,
  int32_t num_used_row_indices,
  const char* parameters,
  DatasetHandle* out) {
  API_BEGIN();
  auto param = Config::Str2Map(parameters);
  Config config;
  config.Set(param);
  OMP_SET_NUM_THREADS(config.num_threads);
  auto full_dataset = reinterpret_cast<const Dataset*>(handle);
  CHECK_GT(num_used_row_indices, 0);
  const int32_t lower = 0;
  const int32_t upper = full_dataset->num_data() - 1;
  CheckElementsIntervalClosed(used_row_indices, lower, upper, num_used_row_indices, "Used indices of subset");
  if (!std::is_sorted(used_row_indices, used_row_indices + num_used_row_indices)) {
    Log::Fatal("used_row_indices should be sorted in Subset");
  }
  auto ret = std::unique_ptr<Dataset>(new Dataset(num_used_row_indices));
  ret->CopyFeatureMapperFrom(full_dataset);
  ret->CopySubrow(full_dataset, used_row_indices, num_used_row_indices, true);
  *out = ret.release();
  API_END();
}
```

we would do something like this to convert the int32_t array into int64_t:

```cpp
std::vector<int64_t> used_row_indices_int64(num_used_row_indices);
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < num_used_row_indices; ++i) {
  used_row_indices_int64[i] = static_cast<int64_t>(used_row_indices[i]);
}
```

and pass used_row_indices_int64.data() in place of used_row_indices in the subsequent function calls.

```diff
@@ -25,7 +25,7 @@
 namespace LightGBM {

 /*! \brief Type of data size, it is better to use signed type*/
-typedef int32_t data_size_t;
+typedef int64_t data_size_t;
```
Collaborator

Add a compile option to switch between int64_t and int32_t.
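Such an option could be wired up roughly as follows — a sketch only, with a hypothetical macro name (LGBM_USE_INT64_DATA_SIZE) and hypothetical CMake wiring, assuming the typedef lives in a single header as shown in the diff above:

```cpp
#include <cstdint>

// meta.h sketch: select the row-index width at compile time.
// Hypothetical build wiring, e.g. in CMakeLists.txt:
//   option(USE_INT64_DATA_SIZE "Use 64-bit row indices" OFF)
//   if(USE_INT64_DATA_SIZE)
//     add_definitions(-DLGBM_USE_INT64_DATA_SIZE)
//   endif()
#ifdef LGBM_USE_INT64_DATA_SIZE
typedef int64_t data_size_t;   // opt-in: datasets beyond ~2B rows
#else
typedef int32_t data_size_t;   // default: current 32-bit behaviour
#endif

static_assert(sizeof(data_size_t) >= 4, "data_size_t must hold at least 32 bits");
```

This keeps the default build's memory footprint and ABI unchanged, at the cost of shipping two binary variants.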

Author

thvasilo commented Dec 2, 2021

I'll try this. There are a lot of places with mixed references to data_size_t and int32/64; hopefully I'll be able to isolate them :)

@sinhrks
Copy link

sinhrks commented Jun 14, 2022

Hi, is this PR still active? I'm facing the same issue, so I'm willing to work on this.

@thvasilo
Author

thvasilo commented Jun 15, 2022 via email
