
[WIP] Add support for more than max(int32_t) data points #4809

Open · wants to merge 1 commit into base: master
Conversation

thvasilo

Aims to address #4804 and #2818.

This is a WIP PR to start work on allowing more than the ~2B data points that int32_t allows.

For now I've simply replaced all references to data_size_t with int64_t, and made some post-hoc fixes in certain places, replacing int64_t back with int32_t where it was needed for the project to compile and the tests to run.

Currently I see 20/20 tests passing, but I also see this output when running them:

```
[LightGBM] [Fatal] ChunkedArray chunk size must be larger than 0!
[LightGBM] [Fatal] no conversion to double for: x1
```

Due to the prevalence of data_size_t, it might be a better strategy to selectively update certain files/classes to int64_t, to make reviewing easier and the changes more gradual (splitting this PR into multiple ones). To do that I would need some guidance from the core devs as to which references to data_size_t are the best candidates to change to enable support for int64_t data points.

Another area where the devs could help is suggesting stress tests and ways to measure the performance/memory cost impact, perhaps by running benchmarks or other performance checks.

@guolinke
Collaborator

I think we should benchmark the training speed with regard to this change.

@shiyu1994
Collaborator

shiyu1994 commented Nov 18, 2021

@thvasilo Thanks for opening this PR.
To replace data_size_t with int64_t, we only need to change the definition of data_size_t here

typedef int32_t data_size_t;

There's no need to touch so many files.

@guolinke I can help to add templates for data_partition.cpp, and check that all usages of data_size_t remain correct with int64_t.

I think it would be better to make the int64_t data size an option instead of the default setting. Do you think a compile option or a hyperparameter is preferred?

@shiyu1994
Collaborator

It seems that it would be difficult to use a hyperparameter to enable int64_t without large changes.

@guolinke
Collaborator

I think a compile option is much easier.
Another solution is using templates, and automatically choosing data_size_t according to #data.

@thvasilo
Author

Hi @shiyu1994, I was trying to follow @guolinke's task list from #2818, which had listed:

> remove data_size_t completely in code, directly used int64_t

I can go back to simply changing the type of data_size_t if you think that's a better option. Let me know and I'll make the changes accordingly.

Agreed that a compile option, or a template that allows for runtime selection, would be the best choice. I think for a first iteration a compile-time option would be easiest.

@thvasilo
Author

Switched to a simple substitution of the typedef for data_size_t. I'll try a few learning tasks and report back; if you have particular benchmarks you'd like to see, please let me know.

@ghost

ghost commented Nov 25, 2021

CLA assistant check
All CLA requirements met.

@thvasilo
Author

thvasilo commented Dec 1, 2021

@shiyu1994 could you provide some guidance on resolving the Python/C API errors?

It seems some symbols are now missing; I'm getting errors like:

```
E AttributeError: /home/AzDevOps_azpcontainer/.local/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so: undefined symbol: LGBM_DatasetGetField
```

I'm assuming the signatures of the functions have changed as a result of the type change, but some parts of the code were probably left unchanged, causing header/implementation mismatches. It's hard to figure out all the parts of the code that need changing, as CLion doesn't seem to recognize all targets (e.g. testlightgbm is not recognized).

What's the easiest way to run the tests that the CI runs locally?

@shiyu1994
Collaborator

@thvasilo Thanks for working on this. I just had a quick test on your branch. The declaration and implementation of LGBM_DatasetGetField are not compatible:

```diff
@@ -1491,7 +1491,7 @@ int LGBM_DatasetSetField(DatasetHandle handle,
 int LGBM_DatasetGetField(DatasetHandle handle,
                          const char* field_name,
-                         int* out_len,
+                         data_size_t* out_len,
```
Collaborator

We should change the declaration of LGBM_DatasetGetField in c_api.h.
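For illustration, a minimal sketch of what keeping declaration and definition in sync means here — simplified signature and stub body, with a `_sketch` suffix to make clear this is not the real c_api code:

```cpp
#include <cstdint>

typedef int64_t data_size_t;

// c_api.h side: the declaration must use the new type...
extern "C" int LGBM_DatasetGetField_sketch(void* handle, const char* field_name,
                                           data_size_t* out_len);

// c_api.cpp side: ...and the definition must match exactly. C-linkage
// functions cannot be overloaded, so a header declaring `int* out_len`
// against a definition taking `data_size_t* out_len` fails to compile,
// which is one way lib_lightgbm.so can end up missing the symbol.
extern "C" int LGBM_DatasetGetField_sketch(void* handle, const char* field_name,
                                           data_size_t* out_len) {
  (void)handle;
  (void)field_name;
  *out_len = 0;  // stub body for illustration only
  return 0;
}
```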

Collaborator

In addition, if we change the parameter types of these C APIs, we will break the Python and R-package code where these C APIs are called and int32_t* is passed as the argument.

I think as a first step it would be good enough to enable int64_t for the CLI version and keep the changes within the C++ code, so we can keep these C APIs unchanged. That requires converting these int32_t* arrays into int64_t* arrays manually within the C APIs.

Author

Not sure if I follow: Is it possible to make the switch from int32 to int64 for the C++ package while isolating it from the Python package? My understanding is that the c_api changes are needed because that's what Python uses for its bindings.

So moving to int64 breaks the Python bindings, unless we also make the changes in the c_api?

Is it possible to isolate the changes to the CLI version alone?

Collaborator

In the Python/R code, the C API is called like this:

```python
_safe_call(_LIB.LGBM_DatasetGetSubset(
    self.reference.construct().handle,
    used_indices.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
    ctypes.c_int32(used_indices.shape[0]),
    c_str(params_str),
    ctypes.byref(self.handle)))
```

And if we change the C API, we have to change these calls in Python/R; for example, the ctypes.c_int should be changed to ctypes.c_int64 in line 1798.

So I think we can keep the C APIs unchanged, so that we don't have to make many changes in the Python/R packages for now.

But that would require some manual casting of int32_t into int64_t within the C APIs, for example in src/c_api.cpp (lines 1386 to 1410 at 946817a):

```cpp
int LGBM_DatasetGetSubset(
  const DatasetHandle handle,
  const int32_t* used_row_indices,
  int32_t num_used_row_indices,
  const char* parameters,
  DatasetHandle* out) {
  API_BEGIN();
  auto param = Config::Str2Map(parameters);
  Config config;
  config.Set(param);
  OMP_SET_NUM_THREADS(config.num_threads);
  auto full_dataset = reinterpret_cast<const Dataset*>(handle);
  CHECK_GT(num_used_row_indices, 0);
  const int32_t lower = 0;
  const int32_t upper = full_dataset->num_data() - 1;
  CheckElementsIntervalClosed(used_row_indices, lower, upper, num_used_row_indices, "Used indices of subset");
  if (!std::is_sorted(used_row_indices, used_row_indices + num_used_row_indices)) {
    Log::Fatal("used_row_indices should be sorted in Subset");
  }
  auto ret = std::unique_ptr<Dataset>(new Dataset(num_used_row_indices));
  ret->CopyFeatureMapperFrom(full_dataset);
  ret->CopySubrow(full_dataset, used_row_indices, num_used_row_indices, true);
  *out = ret.release();
  API_END();
}
```

we would do something like this to convert the int32_t array into int64_t:

```cpp
std::vector<int64_t> used_row_indices_int64(num_used_row_indices);
#pragma omp parallel for schedule(static)
for (data_size_t i = 0; i < num_used_row_indices; ++i) {
  used_row_indices_int64[i] = static_cast<int64_t>(used_row_indices[i]);
}
```

and pass used_row_indices_int64.data() in place of used_row_indices in the subsequent function calls.

```diff
@@ -25,7 +25,7 @@
 namespace LightGBM {

 /*! \brief Type of data size, it is better to use signed type*/
-typedef int32_t data_size_t;
+typedef int64_t data_size_t;
```
Collaborator

Add a compile option to switch between int64_t and int32_t.
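Such an option could be wired up roughly as follows — a sketch only, with a hypothetical macro name (LGBM_USE_INT64_DATA_SIZE) and hypothetical CMake wiring, assuming the typedef lives in a single header as shown in the diff above:

```cpp
#include <cstdint>

// meta.h sketch: select the row-index width at compile time.
// Hypothetical build wiring, e.g. in CMakeLists.txt:
//   option(USE_INT64_DATA_SIZE "Use 64-bit row indices" OFF)
//   if(USE_INT64_DATA_SIZE)
//     add_definitions(-DLGBM_USE_INT64_DATA_SIZE)
//   endif()
#ifdef LGBM_USE_INT64_DATA_SIZE
typedef int64_t data_size_t;   // opt-in: datasets beyond ~2B rows
#else
typedef int32_t data_size_t;   // default: current 32-bit behaviour
#endif

static_assert(sizeof(data_size_t) >= 4, "data_size_t must hold at least 32 bits");
```

This keeps the default build's memory footprint and ABI unchanged, at the cost of shipping two binary variants.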

Author

thvasilo commented Dec 2, 2021

I'll try this. There are a lot of places with mixed references to data_size_t and int32/64; hopefully I'll be able to isolate them :)

@sinhrks
Copy link

sinhrks commented Jun 14, 2022

Hi, is this PR still active? I'm facing the same issue, so I'm willing to work on this.

@thvasilo
Author

thvasilo commented Jun 15, 2022 via email
