Add Some More GPU documentation #401

Merged · 100 commits · Apr 12, 2017

Changes from 1 commit
4810c79
add dummy gpu solver code
huanzhang12 Feb 10, 2017
e41ba15
initial GPU code
huanzhang12 Feb 12, 2017
6dde565
fix crash bug
huanzhang12 Feb 12, 2017
2dce7d1
first working version
huanzhang12 Feb 12, 2017
146b2dd
use asynchronous copy
huanzhang12 Feb 12, 2017
1f39a03
use a better kernel for root
huanzhang12 Feb 13, 2017
435674d
parallel read histogram
huanzhang12 Feb 13, 2017
22f478a
sparse features now work, but with no acceleration (computed on CPU)
huanzhang12 Feb 13, 2017
cfd77ae
compute sparse feature on CPU simultaneously
huanzhang12 Feb 13, 2017
40c3212
fix big bug; add gpu selection; add kernel selection
huanzhang12 Feb 14, 2017
c3398c9
better debugging
huanzhang12 Feb 14, 2017
76a13c7
clean up
huanzhang12 Feb 15, 2017
2dc4555
add feature scatter
huanzhang12 Feb 15, 2017
d4c1c01
Add sparse_threshold control
huanzhang12 Feb 15, 2017
97da274
fix a bug in feature scatter
huanzhang12 Feb 15, 2017
a96ca80
clean up debug
huanzhang12 Feb 15, 2017
9be6438
temporarily add OpenCL kernels for k=64,256
huanzhang12 Feb 27, 2017
cbef453
fix up CMakeList and definition USE_GPU
huanzhang12 Feb 27, 2017
4d08152
add OpenCL kernels as string literals
huanzhang12 Feb 28, 2017
624d405
Add boost.compute as a submodule
huanzhang12 Feb 28, 2017
11b241f
add boost dependency into CMakeList
huanzhang12 Feb 28, 2017
5142f19
fix opencl pragma
huanzhang12 Feb 28, 2017
508b48c
use pinned memory for histogram
huanzhang12 Feb 28, 2017
1a63b99
use pinned buffer for gradients and hessians
huanzhang12 Mar 1, 2017
e2166b1
better debugging message
huanzhang12 Mar 1, 2017
3b24e33
add double precision support on GPU
huanzhang12 Mar 9, 2017
e7336ee
fix boost version in CMakeList
huanzhang12 Mar 9, 2017
b29fec7
Add a README
huanzhang12 Mar 9, 2017
97fed3e
reconstruct GPU initialization code for ResetTrainingData
huanzhang12 Mar 12, 2017
164dbd1
move data to GPU in parallel
huanzhang12 Mar 12, 2017
c1c605e
fix a bug during feature copy
huanzhang12 Mar 13, 2017
c5ab1ae
update gpu kernels
huanzhang12 Mar 13, 2017
947629a
update gpu code
huanzhang12 Mar 15, 2017
105b0dd
initial port to LightGBM v2
huanzhang12 Mar 19, 2017
ba2c0a3
speedup GPU data loading process
huanzhang12 Mar 21, 2017
a6cb794
Add 4-bit bin support to GPU
huanzhang12 Mar 22, 2017
ed929cb
re-add sparse_threshold parameter
huanzhang12 Mar 23, 2017
2cd3d85
remove kMaxNumWorkgroups and allow an unlimited number of features
huanzhang12 Mar 23, 2017
4d2758f
add feature mask support for skipping unused features
huanzhang12 Mar 24, 2017
62bc04e
enable kernel cache
huanzhang12 Mar 24, 2017
e4dd344
use GPU kernels without feature masks when all features are used
huanzhang12 Mar 24, 2017
61b09a3
README.
Mar 25, 2017
da20fc0
README.
Mar 25, 2017
2d43e36
update README
huanzhang12 Mar 25, 2017
9602cd7
update to v2
huanzhang12 Mar 25, 2017
cd52bb0
fix typos (#349)
wxchan Mar 17, 2017
be91a98
change compile to gcc on Apple as default
chivee Mar 18, 2017
8f1d05e
clean vscode related file
chivee Mar 19, 2017
411383f
refine api of constructing from sampling data.
guolinke Mar 21, 2017
487660e
fix bug in the last commit.
guolinke Mar 21, 2017
882f420
more efficient algorithm to sample k from n.
guolinke Mar 22, 2017
7d0f338
fix bug in filter bin
guolinke Mar 22, 2017
0b44817
change to boost from average output.
guolinke Mar 22, 2017
85a3ba4
fix tests.
guolinke Mar 22, 2017
f615ba0
only stop training when all classes are finished in multi-class.
guolinke Mar 23, 2017
fbed3ca
limit the max tree output. change hessian in multi-class objective.
guolinke Mar 24, 2017
8eb961b
robust tree model loading.
guolinke Mar 24, 2017
10cd85f
fix test.
guolinke Mar 24, 2017
e57ec49
convert the probabilities to raw score in boost_from_average of class…
guolinke Mar 24, 2017
39965a0
fix the average label for binary classification.
guolinke Mar 24, 2017
8ac77dc
Add boost_from_average to docs (#354)
Laurae2 Mar 24, 2017
25f6268
don't use "ConvertToRawScore" for self-defined objective function.
guolinke Mar 24, 2017
bf3dfb6
boost_from_average doesn't seem to work well in binary classification. …
guolinke Mar 24, 2017
22df883
For a better jump link (#355)
JayveeHe Mar 25, 2017
9f4d2f0
add FitByExistingTree.
guolinke Mar 25, 2017
f54ac4d
adapt GPU tree learner for FitByExistingTree
huanzhang12 Mar 26, 2017
59c473b
avoid NaN output.
guolinke Mar 26, 2017
a0549d1
update boost.compute
huanzhang12 Mar 26, 2017
5e945d2
fix typos (#361)
zhangyafeikimi Mar 26, 2017
3891cdb
fix broken links (#359)
wxchan Mar 26, 2017
48b4d9d
update README
huanzhang12 Mar 27, 2017
7248e58
disable GPU acceleration by default
huanzhang12 Mar 27, 2017
56fe2cc
fix image url
huanzhang12 Mar 27, 2017
1c51775
cleanup debug macro
huanzhang12 Mar 27, 2017
78ae386
Initial GPU acceleration
huanzhang12 Mar 27, 2017
2690181
Merge remote-tracking branch 'gpudev/master'
huanzhang12 Mar 27, 2017
f3573d5
remove old README
huanzhang12 Mar 27, 2017
12e5b82
do not save sparse_threshold_ in FeatureGroup
huanzhang12 Mar 27, 2017
1159854
add details for new GPU settings
huanzhang12 Mar 27, 2017
c719ead
ignore submodule when doing pep8 check
huanzhang12 Mar 27, 2017
15c97b4
allocate workspace for at least one thread during building Feature4
huanzhang12 Mar 27, 2017
cb35a02
move sparse_threshold to class Dataset
huanzhang12 Mar 28, 2017
a039a3a
remove duplicated code in GPUTreeLearner::Split
huanzhang12 Mar 29, 2017
35ab97f
Remove duplicated code in FindBestThresholds and BeforeFindBestSplit
huanzhang12 Mar 29, 2017
28c1715
do not rebuild ordered gradients and hessians for sparse features
huanzhang12 Mar 29, 2017
2af1860
support feature groups in GPUTreeLearner
huanzhang12 Apr 4, 2017
475cf8c
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 5, 2017
4d5d957
Initial parallel learners with GPU support
huanzhang12 Apr 5, 2017
4b44173
add option device, cleanup code
huanzhang12 Apr 5, 2017
b948c1f
clean up FindBestThresholds; add some omp parallel
huanzhang12 Apr 6, 2017
50f7da1
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
3a16753
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
2b0514e
constant hessian optimization for GPU
huanzhang12 Apr 8, 2017
e72d8cd
Fix GPUTreeLearner crash when there is zero feature
huanzhang12 Apr 9, 2017
a68ae52
use np.testing.assert_almost_equal() to compare lists of floats in tests
huanzhang12 Apr 9, 2017
2ac5103
travis for GPU
huanzhang12 Apr 9, 2017
edb30a6
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
0c5eb15
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
b121443
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 11, 2017
74bc952
add tutorial and more GPU docs
huanzhang12 Apr 12, 2017
use a better kernel for root
huanzhang12 committed Feb 13, 2017
commit 1f39a037b75f07096b53261b77614dc8017cae5a
106 changes: 73 additions & 33 deletions src/treelearner/gpu_tree_learner.cpp
@@ -137,9 +137,7 @@ void CompareHistograms(HistogramBinEntry* h1, HistogramBinEntry* h2, size_t size
Log::Fatal("Mismatched histograms found at location %lu.", i);
}

void GPUTreeLearner::GPUHistogram(data_size_t leaf_num_data, FeatureHistogram* histograms) {
// we have already copied ordered gradients, ordered hessians and indices to GPU
// decide the best number of workgroups working on one feature4 tuple
int GPUTreeLearner::GetNumWorkgroupsPerFeature(data_size_t leaf_num_data) {
// we roughly want 256 workgroups per device, and we have num_feature4_ feature tuples.
// also guarantee that there are at least 2K examples per workgroup
double x = 256.0 / num_feature4_;
@@ -155,11 +153,20 @@ void GPUTreeLearner::GPUHistogram(data_size_t leaf_num_data, FeatureHistogram* h
exp_workgroups_per_feature = 0;
if (exp_workgroups_per_feature > max_exp_workgroups_per_feature_)
exp_workgroups_per_feature = max_exp_workgroups_per_feature_;
return exp_workgroups_per_feature;
}

void GPUTreeLearner::GPUHistogram(data_size_t leaf_num_data, FeatureHistogram* histograms) {
// we have already copied ordered gradients, ordered hessians and indices to GPU
// decide the best number of workgroups working on one feature4 tuple
// set work group size based on feature size
// each 2^exp_workgroups_per_feature workgroups work on a feature4 tuple
int exp_workgroups_per_feature = GetNumWorkgroupsPerFeature(leaf_num_data);
int num_workgroups = (1 << exp_workgroups_per_feature) * num_feature4_;
if (num_workgroups > max_num_workgroups_)
if (num_workgroups > max_num_workgroups_) {
num_workgroups = max_num_workgroups_;
Log::Warning("BUG detected, num_workgroups too large!");
}
#ifdef DEBUG_GPU
printf("setting exp_workgroups_per_feature to %d, using %u work groups\n", exp_workgroups_per_feature, num_workgroups);
#endif
@@ -170,14 +177,23 @@ void GPUTreeLearner::GPUHistogram(data_size_t leaf_num_data, FeatureHistogram* h
// process one feature4 tuple

histogram_kernels_[exp_workgroups_per_feature].set_arg(2, leaf_num_data);
indices_future_.wait();
// for the root node, indices are not copied
if (leaf_num_data != num_data_) {
indices_future_.wait();
}
hessians_future_.wait();
gradients_future_.wait();
// printf("launching kernel!\n");
// there will be 2^exp_workgroups_per_feature = num_workgroups / num_feature4 sub-histogram per feature4
// and we will launch num_feature workgroups for this kernel
// will launch threads for all features
queue_.enqueue_1d_range_kernel(histogram_kernels_[exp_workgroups_per_feature], 0, num_workgroups * 256, 256);
if (leaf_num_data == num_data_) {
// printf("using full data kernel with exp_workgroups_per_feature = %d and %d workgroups\n", exp_workgroups_per_feature, num_workgroups);
queue_.enqueue_1d_range_kernel(histogram_fulldata_kernel_, 0, num_workgroups * 256, 256);
}
else {
queue_.enqueue_1d_range_kernel(histogram_kernels_[exp_workgroups_per_feature], 0, num_workgroups * 256, 256);
}
queue_.finish();
// all features finished, copy results to out
// printf("Copying histogram back to host...\n");
@@ -245,26 +261,48 @@ void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
Log::Info("Using GPU Device: %s, Vendor: %s", dev_.name().c_str(), dev_.vendor().c_str());
Log::Info("Compiling OpenCL Kernels...");
for (int i = 0; i <= max_exp_workgroups_per_feature_; ++i) {
auto program_ = boost::compute::program::create_with_source_file("histogram.cl", ctx_);
std::ostringstream opts;
// FIXME: sparse data
opts << "-D FEATURE_SIZE=" << num_data_ << " -D POWER_FEATURE_WORKGROUPS=" << i
<< " -D USE_CONSTANT_BUF=" << use_constants
<< " -cl-strict-aliasing -cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math -save-temps";
std::cout << "Building options: " << opts.str() << std::endl;
try {
program_.build(opts.str());
}
catch (boost::compute::opencl_error &e) {
Log::Fatal("GPU program built failure:\n %s", program_.build_log().c_str());
}
histogram_kernels_.push_back(program_.create_kernel("histogram256"));
// setup kernel arguments
// The only argument that needs to be changed is num_data_
histogram_kernels_.back().set_args(*device_features_,
*device_data_indices_, num_data_, *device_gradients_, *device_hessians_,
*device_subhistograms_, *sync_counters_, *device_histogram_outputs_);
auto program = boost::compute::program::create_with_source_file("histogram.cl", ctx_);
std::ostringstream opts;
// FIXME: sparse data
opts << "-D FEATURE_SIZE=" << num_data_ << " -D POWER_FEATURE_WORKGROUPS=" << i
<< " -D USE_CONSTANT_BUF=" << use_constants
<< " -cl-strict-aliasing -cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math -save-temps";
std::cout << "Building options: " << opts.str() << std::endl;
try {
program.build(opts.str());
}
catch (boost::compute::opencl_error &e) {
Log::Fatal("GPU program built failure:\n %s", program.build_log().c_str());
}
histogram_kernels_.push_back(program.create_kernel("histogram256"));
// setup kernel arguments
// The only argument that needs to be changed is num_data_
histogram_kernels_.back().set_args(*device_features_,
*device_data_indices_, num_data_, *device_gradients_, *device_hessians_,
*device_subhistograms_, *sync_counters_, *device_histogram_outputs_);
}
// create the OpenCL kernel for the root node (all data)
int full_exp_workgroups_per_feature = GetNumWorkgroupsPerFeature(num_data_);
auto program = boost::compute::program::create_with_source_file("histogram.cl", ctx_);
std::ostringstream opts;
// FIXME: sparse data
opts << "-D FEATURE_SIZE=" << num_data_ << " -D POWER_FEATURE_WORKGROUPS=" << full_exp_workgroups_per_feature
<< " -D IGNORE_INDICES=1"
<< " -D USE_CONSTANT_BUF=" << use_constants
<< " -cl-strict-aliasing -cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math -save-temps";
std::cout << "Building options: " << opts.str() << std::endl;
try {
program.build(opts.str());
}
catch (boost::compute::opencl_error &e) {
Log::Fatal("GPU program built failure:\n %s", program.build_log().c_str());
}
histogram_fulldata_kernel_ = program.create_kernel("histogram256");
// setup kernel arguments
// The only argument that needs to be changed is num_data_
histogram_fulldata_kernel_.set_args(*device_features_,
*device_data_indices_, num_data_, *device_gradients_, *device_hessians_,
*device_subhistograms_, *sync_counters_, *device_histogram_outputs_);

// Now generate new data structure feature4, and copy data to the device
int i;
@@ -426,6 +464,13 @@ Tree* GPUTreeLearner::Train(const score_t* gradients, const score_t *hessians) {

void GPUTreeLearner::BeforeTrain() {

// copy indices, gradients and hessians to device, start as early as possible
#ifdef GPU_DEBUG
printf("Copying intial full gradients and hessians to device\n");
#endif
hessians_future_ = boost::compute::copy_async(hessians_, hessians_ + num_data_, device_hessians_->begin(), queue_);
gradients_future_ = boost::compute::copy_async(gradients_, gradients_ + num_data_, device_gradients_->begin(), queue_);

// reset histogram pool
histogram_pool_.ResetMap();
// initialize used features
@@ -449,14 +494,9 @@ void GPUTreeLearner::BeforeTrain() {

// Sumup for root
if (data_partition_->leaf_count(0) == num_data_) {
// copy indices, gradients and hessians to device (TODO: async)
#ifdef GPU_DEBUG
printf("Copying intial indices, gradients and hessians to device\n");
#endif
indices_future_ = boost::compute::copy_async(data_partition_->indices(), data_partition_->indices() + num_data_,
device_data_indices_->begin(), queue_);
hessians_future_ = boost::compute::copy_async(hessians_, hessians_ + num_data_, device_hessians_->begin(), queue_);
gradients_future_ = boost::compute::copy_async(gradients_, gradients_ + num_data_, device_gradients_->begin(), queue_);
// No need to copy the index for the root
// indices_future_ = boost::compute::copy_async(data_partition_->indices(), data_partition_->indices() + num_data_,
// device_data_indices_->begin(), queue_);
// use all data
smaller_leaf_splits_->Init(gradients_, hessians_);
// point to gradients, avoid copy
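The heuristic this commit factors out of GPUHistogram() is worth spelling out. Below is a minimal standalone sketch; the view above collapses a few lines, so the ceil/log2 rounding and the back-off loop are assumptions based on the surrounding comments (roughly 256 workgroups per device, at least ~2K examples per workgroup), not the verbatim implementation.

#include <algorithm>
#include <cmath>

// Sketch of GetNumWorkgroupsPerFeature(): returns the exponent e such that
// 2^e workgroups cooperate on each feature4 tuple.
int GetNumWorkgroupsPerFeature(int leaf_num_data, int num_feature4,
                               int max_exp_workgroups_per_feature = 10) {
  // aim for roughly 256 workgroups per device, spread over the feature4 tuples
  double x = 256.0 / num_feature4;
  int exp_wg = static_cast<int>(std::ceil(std::log2(x)));
  // assumed: back off while a workgroup would see fewer than ~2K examples
  while (exp_wg > 0 && (leaf_num_data >> exp_wg) < 2048) {
    --exp_wg;
  }
  // clamp to [0, max]; the header below caps the exponent at 2^10
  return std::max(0, std::min(exp_wg, max_exp_workgroups_per_feature));
}

The launch configuration in GPUHistogram() then follows directly: num_workgroups = (1 << e) * num_feature4_, capped at max_num_workgroups_, with 256 threads per workgroup.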
6 changes: 4 additions & 2 deletions src/treelearner/gpu_tree_learner.h
@@ -112,7 +112,9 @@ class GPUTreeLearner: public TreeLearner {
*/
virtual void Split(Tree* tree, int best_leaf, int* left_leaf, int* right_leaf);

virtual void InitGPU(int platform_id, int device_id);
int GetNumWorkgroupsPerFeature(data_size_t leaf_num_data);

void InitGPU(int platform_id, int device_id);

void GPUHistogram(data_size_t leaf_num_data, FeatureHistogram* histograms);

@@ -191,10 +193,10 @@ class GPUTreeLearner: public TreeLearner {
boost::compute::device dev_;
boost::compute::context ctx_;
boost::compute::command_queue queue_;
boost::compute::program program_;
/*! \brief a array of histogram kernels with different number
of workgroups per feature */
std::vector<boost::compute::kernel> histogram_kernels_;
boost::compute::kernel histogram_fulldata_kernel_;
boost::compute::kernel reduction_kernel_;
int num_feature4_;
const int max_exp_workgroups_per_feature_ = 10; // 2^10
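Read together, the two files implement the dispatch that gives this commit its name: the root node covers all of num_data_, so its histogram kernel is compiled separately with -D IGNORE_INDICES=1 and can read features in order, and BeforeTrain() no longer copies the data indices for it at all. A condensed sketch of the launch path follows; LaunchHistogram is a hypothetical free function, and its parameters merely stand in for the GPUTreeLearner members shown above (histogram_kernels_, histogram_fulldata_kernel_, the copy futures).

#include <boost/compute.hpp>
#include <vector>

namespace compute = boost::compute;

// Hypothetical stand-in for GPUTreeLearner::GPUHistogram()'s launch logic.
void LaunchHistogram(compute::command_queue& queue,
                     std::vector<compute::kernel>& kernels,  // histogram_kernels_
                     compute::kernel& fulldata_kernel,       // histogram_fulldata_kernel_
                     compute::future<void>& indices_future,
                     compute::future<void>& gradients_future,
                     compute::future<void>& hessians_future,
                     int leaf_num_data, int num_data,
                     int exp_workgroups_per_feature, int num_workgroups) {
  // indices are only transferred for non-root leaves, so only wait there
  if (leaf_num_data != num_data) {
    indices_future.wait();
  }
  hessians_future.wait();
  gradients_future.wait();
  if (leaf_num_data == num_data) {
    // root: the IGNORE_INDICES build reads features sequentially, no indirection
    queue.enqueue_1d_range_kernel(fulldata_kernel, 0, num_workgroups * 256, 256);
  } else {
    queue.enqueue_1d_range_kernel(kernels[exp_workgroups_per_feature], 0,
                                  num_workgroups * 256, 256);
  }
  queue.finish();
}

The same reasoning explains why the gradient and hessian copies move to the top of BeforeTrain(): boost::compute::copy_async returns immediately with a future, so the host-to-device transfers overlap the CPU-side setup, and the waits above only block if the data has not yet arrived.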