Add Some More GPU documentation #401

Merged · 100 commits · Apr 12, 2017

Commits:
4810c79
add dummy gpu solver code
huanzhang12 Feb 10, 2017
e41ba15
initial GPU code
huanzhang12 Feb 12, 2017
6dde565
fix crash bug
huanzhang12 Feb 12, 2017
2dce7d1
first working version
huanzhang12 Feb 12, 2017
146b2dd
use asynchronous copy
huanzhang12 Feb 12, 2017
1f39a03
use a better kernel for root
huanzhang12 Feb 13, 2017
435674d
parallel read histogram
huanzhang12 Feb 13, 2017
22f478a
sparse features now work, but no acceleration; compute on CPU
huanzhang12 Feb 13, 2017
cfd77ae
compute sparse feature on CPU simultaneously
huanzhang12 Feb 13, 2017
40c3212
fix big bug; add gpu selection; add kernel selection
huanzhang12 Feb 14, 2017
c3398c9
better debugging
huanzhang12 Feb 14, 2017
76a13c7
clean up
huanzhang12 Feb 15, 2017
2dc4555
add feature scatter
huanzhang12 Feb 15, 2017
d4c1c01
Add sparse_threshold control
huanzhang12 Feb 15, 2017
97da274
fix a bug in feature scatter
huanzhang12 Feb 15, 2017
a96ca80
clean up debug
huanzhang12 Feb 15, 2017
9be6438
temporarily add OpenCL kernels for k=64,256
huanzhang12 Feb 27, 2017
cbef453
fix up CMakeList and definition USE_GPU
huanzhang12 Feb 27, 2017
4d08152
add OpenCL kernels as string literals
huanzhang12 Feb 28, 2017
624d405
Add boost.compute as a submodule
huanzhang12 Feb 28, 2017
11b241f
add boost dependency into CMakeList
huanzhang12 Feb 28, 2017
5142f19
fix opencl pragma
huanzhang12 Feb 28, 2017
508b48c
use pinned memory for histogram
huanzhang12 Feb 28, 2017
1a63b99
use pinned buffer for gradients and hessians
huanzhang12 Mar 1, 2017
e2166b1
better debugging message
huanzhang12 Mar 1, 2017
3b24e33
add double precision support on GPU
huanzhang12 Mar 9, 2017
e7336ee
fix boost version in CMakeList
huanzhang12 Mar 9, 2017
b29fec7
Add a README
huanzhang12 Mar 9, 2017
97fed3e
reconstruct GPU initialization code for ResetTrainingData
huanzhang12 Mar 12, 2017
164dbd1
move data to GPU in parallel
huanzhang12 Mar 12, 2017
c1c605e
fix a bug during feature copy
huanzhang12 Mar 13, 2017
c5ab1ae
update gpu kernels
huanzhang12 Mar 13, 2017
947629a
update gpu code
huanzhang12 Mar 15, 2017
105b0dd
initial port to LightGBM v2
huanzhang12 Mar 19, 2017
ba2c0a3
speedup GPU data loading process
huanzhang12 Mar 21, 2017
a6cb794
Add 4-bit bin support to GPU
huanzhang12 Mar 22, 2017
ed929cb
re-add sparse_threshold parameter
huanzhang12 Mar 23, 2017
2cd3d85
remove kMaxNumWorkgroups and allow an unlimited number of features
huanzhang12 Mar 23, 2017
4d2758f
add feature mask support for skipping unused features
huanzhang12 Mar 24, 2017
62bc04e
enable kernel cache
huanzhang12 Mar 24, 2017
e4dd344
use GPU kernels without feature masks when all features are used
huanzhang12 Mar 24, 2017
61b09a3
README.
Mar 25, 2017
da20fc0
README.
Mar 25, 2017
2d43e36
update README
huanzhang12 Mar 25, 2017
9602cd7
update to v2
huanzhang12 Mar 25, 2017
cd52bb0
fix typos (#349)
wxchan Mar 17, 2017
be91a98
change compile to gcc on Apple as default
chivee Mar 18, 2017
8f1d05e
clean vscode related file
chivee Mar 19, 2017
411383f
refine api of constructing from sampling data.
guolinke Mar 21, 2017
487660e
fix bug in the last commit.
guolinke Mar 21, 2017
882f420
more efficient algorithm to sample k from n.
guolinke Mar 22, 2017
7d0f338
fix bug in filter bin
guolinke Mar 22, 2017
0b44817
change to boost from average output.
guolinke Mar 22, 2017
85a3ba4
fix tests.
guolinke Mar 22, 2017
f615ba0
only stop training when all classes are finished in multi-class.
guolinke Mar 23, 2017
fbed3ca
limit the max tree output. change hessian in multi-class objective.
guolinke Mar 24, 2017
8eb961b
robust tree model loading.
guolinke Mar 24, 2017
10cd85f
fix test.
guolinke Mar 24, 2017
e57ec49
convert the probabilities to raw score in boost_from_average of class…
guolinke Mar 24, 2017
39965a0
fix the average label for binary classification.
guolinke Mar 24, 2017
8ac77dc
Add boost_from_average to docs (#354)
Laurae2 Mar 24, 2017
25f6268
don't use "ConvertToRawScore" for self-defined objective function.
guolinke Mar 24, 2017
bf3dfb6
boost_from_average doesn't seem to work well in binary classification. …
guolinke Mar 24, 2017
22df883
For a better jump link (#355)
JayveeHe Mar 25, 2017
9f4d2f0
add FitByExistingTree.
guolinke Mar 25, 2017
f54ac4d
adapt GPU tree learner for FitByExistingTree
huanzhang12 Mar 26, 2017
59c473b
avoid NaN output.
guolinke Mar 26, 2017
a0549d1
update boost.compute
huanzhang12 Mar 26, 2017
5e945d2
fix typos (#361)
zhangyafeikimi Mar 26, 2017
3891cdb
fix broken links (#359)
wxchan Mar 26, 2017
48b4d9d
update README
huanzhang12 Mar 27, 2017
7248e58
disable GPU acceleration by default
huanzhang12 Mar 27, 2017
56fe2cc
fix image url
huanzhang12 Mar 27, 2017
1c51775
cleanup debug macro
huanzhang12 Mar 27, 2017
78ae386
Initial GPU acceleration
huanzhang12 Mar 27, 2017
2690181
Merge remote-tracking branch 'gpudev/master'
huanzhang12 Mar 27, 2017
f3573d5
remove old README
huanzhang12 Mar 27, 2017
12e5b82
do not save sparse_threshold_ in FeatureGroup
huanzhang12 Mar 27, 2017
1159854
add details for new GPU settings
huanzhang12 Mar 27, 2017
c719ead
ignore submodule when doing pep8 check
huanzhang12 Mar 27, 2017
15c97b4
allocate workspace for at least one thread during building Feature4
huanzhang12 Mar 27, 2017
cb35a02
move sparse_threshold to class Dataset
huanzhang12 Mar 28, 2017
a039a3a
remove duplicated code in GPUTreeLearner::Split
huanzhang12 Mar 29, 2017
35ab97f
Remove duplicated code in FindBestThresholds and BeforeFindBestSplit
huanzhang12 Mar 29, 2017
28c1715
do not rebuild ordered gradients and hessians for sparse features
huanzhang12 Mar 29, 2017
2af1860
support feature groups in GPUTreeLearner
huanzhang12 Apr 4, 2017
475cf8c
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 5, 2017
4d5d957
Initial parallel learners with GPU support
huanzhang12 Apr 5, 2017
4b44173
add option device, cleanup code
huanzhang12 Apr 5, 2017
b948c1f
clean up FindBestThresholds; add some omp parallel
huanzhang12 Apr 6, 2017
50f7da1
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
3a16753
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
2b0514e
constant hessian optimization for GPU
huanzhang12 Apr 8, 2017
e72d8cd
Fix GPUTreeLearner crash when there is zero feature
huanzhang12 Apr 9, 2017
a68ae52
use np.testing.assert_almost_equal() to compare lists of floats in tests
huanzhang12 Apr 9, 2017
2ac5103
travis for GPU
huanzhang12 Apr 9, 2017
edb30a6
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
0c5eb15
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
b121443
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 11, 2017
74bc952
add tutorial and more GPU docs
huanzhang12 Apr 12, 2017
Changes from 1 commit:
use pinned buffer for gradients and hessians
huanzhang12 committed Mar 1, 2017
commit 1a63b995d91e4cb2a53601f96a367b9227750ab1
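Editor's note, before the diff: "pinned" memory here is host memory that the OpenCL driver has page-locked, so the device can DMA from it and an asynchronous write can return immediately instead of staging the data through a driver-internal bounce buffer. The sketch below is a minimal, self-contained illustration of the pattern this commit adopts (page-aligned host storage, a use_host_ptr buffer kept mapped, an async write to a plain device buffer); it is not code from the PR, and the names n, host, and device_buf are hypothetical.

#include <boost/compute/core.hpp>
#include <boost/align/aligned_allocator.hpp>
#include <cstdio>
#include <vector>

namespace compute = boost::compute;

int main() {
  compute::device dev = compute::system::default_device();
  compute::context ctx(dev);
  compute::command_queue queue(ctx, dev);

  const size_t n = 1 << 20;  // hypothetical element count
  // Page-aligned host storage (4096 matches the allocator added to
  // gpu_tree_learner.h in this commit).
  std::vector<float, boost::alignment::aligned_allocator<float, 4096>> host(n);

  // Wrap the host allocation with CL_MEM_USE_HOST_PTR and map it once;
  // while mapped, the driver keeps the pages pinned.
  compute::buffer pinned(ctx, n * sizeof(float),
                         compute::memory_object::read_write |
                         compute::memory_object::use_host_ptr,
                         host.data());
  float* ptr = static_cast<float*>(queue.enqueue_map_buffer(
      pinned, compute::command_queue::map_write_invalidate_region,
      0, n * sizeof(float)));

  // Plain device-side buffer that kernels would read.
  compute::buffer device_buf(ctx, n * sizeof(float),
                             compute::memory_object::read_only);

  // Produce data in the pinned region (stand-in for the gradient re-order loops).
  for (size_t i = 0; i < n; ++i) ptr[i] = static_cast<float>(i);

  // Returns immediately with an event; the DMA overlaps later CPU work.
  compute::event done = queue.enqueue_write_buffer_async(
      device_buf, 0, n * sizeof(float), ptr);
  // ... other CPU work could run here ...
  done.wait();  // block only when a kernel actually needs the data

  queue.enqueue_unmap_buffer(pinned, ptr);  // mirrors the destructor change below
  std::printf("copied %zu floats from a pinned staging buffer\n", n);
  return 0;
}

On most OpenCL implementations the map call is what actually pins the pages, which is why the commit's destructor change unmaps both buffers on teardown.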
49 changes: 40 additions & 9 deletions src/treelearner/gpu_tree_learner.cpp
@@ -20,7 +20,8 @@ GPUTreeLearner::GPUTreeLearner(const TreeConfig* tree_config)
 }
 
 GPUTreeLearner::~GPUTreeLearner() {
-
+  queue_.enqueue_unmap_buffer(pinned_gradients_, ptr_pinned_gradients_);
+  queue_.enqueue_unmap_buffer(pinned_hessians_, ptr_pinned_hessians_);
 }
 
 void GPUTreeLearner::Init(const Dataset* train_data) {
@@ -314,8 +315,27 @@ void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
   device_features_ = std::unique_ptr<boost::compute::vector<Feature4>>(new boost::compute::vector<Feature4>(num_dense_feature4_ * num_data_, ctx_));
   // allocate space for gradients and hessians on device
   // we will copy gradients and hessians in after ordered_gradients_ and ordered_hessians_ are constructed
-  device_gradients_ = std::unique_ptr<boost::compute::vector<score_t>>(new boost::compute::vector<score_t>(allocated_num_data_, ctx_));
-  device_hessians_ = std::unique_ptr<boost::compute::vector<score_t>>(new boost::compute::vector<score_t>(allocated_num_data_, ctx_));
+  // device_gradients_ = std::unique_ptr<boost::compute::vector<score_t>>(new boost::compute::vector<score_t>(allocated_num_data_, ctx_));
+  // device_hessians_ = std::unique_ptr<boost::compute::vector<score_t>>(new boost::compute::vector<score_t>(allocated_num_data_, ctx_));
+  // Use mapped GPU buffers as ordered gradient and hessian buffers
+  ordered_gradients_.reserve(allocated_num_data_);
+  ordered_hessians_.reserve(allocated_num_data_);
+  pinned_gradients_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
+    boost::compute::memory_object::read_write | boost::compute::memory_object::use_host_ptr,
+    ordered_gradients_.data());
+  ptr_pinned_gradients_ = queue_.enqueue_map_buffer(pinned_gradients_, boost::compute::command_queue::map_write_invalidate_region,
+    0, allocated_num_data_ * sizeof(score_t));
+  pinned_hessians_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
+    boost::compute::memory_object::read_write | boost::compute::memory_object::use_host_ptr,
+    ordered_hessians_.data());
+  ptr_pinned_hessians_ = queue_.enqueue_map_buffer(pinned_hessians_, boost::compute::command_queue::map_write_invalidate_region,
+    0, allocated_num_data_ * sizeof(score_t));
+  Log::Info("gradients=%p, pinned_gradients=%p, hessian=%p, pinned_hessian=%p\n",
+    ordered_gradients_.data(), ptr_pinned_gradients_, ordered_hessians_.data(), ptr_pinned_hessians_);
+  device_gradients_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
+    boost::compute::memory_object::read_only, nullptr);
+  device_hessians_ = boost::compute::buffer(ctx_, allocated_num_data_ * sizeof(score_t),
+    boost::compute::memory_object::read_only, nullptr);
   // copy indices to the device
   device_data_indices_ = std::unique_ptr<boost::compute::vector<data_size_t>>(new boost::compute::vector<data_size_t>(allocated_num_data_, ctx_));
   boost::compute::fill(device_data_indices_->begin(), device_data_indices_->end(), 0, queue_);
@@ -330,6 +350,7 @@ void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
   // device_histogram_outputs_ = std::unique_ptr<boost::compute::vector<char>>(new boost::compute::vector<char>(
   //   num_dense_feature4_ * 4 * device_bin_size_ * sizeof(GPUHistogramBinEntry), ctx_));
   // create OpenCL kernels for different number of workgroups per feature
+  // The output buffer is allocated to host directly, to overlap compute and data transfer
   device_histogram_outputs_ = boost::compute::buffer(ctx_, num_dense_feature4_ * 4 * device_bin_size_ * sizeof(GPUHistogramBinEntry),
     boost::compute::memory_object::write_only | boost::compute::memory_object::alloc_host_ptr, nullptr);
   Log::Info("Using GPU Device: %s, Vendor: %s", dev_.name().c_str(), dev_.vendor().c_str());
@@ -355,7 +376,7 @@ void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
     // setup kernel arguments
     // The only argument that needs to be changed is num_data_
     histogram_kernels_.back().set_args(*device_features_,
-      *device_data_indices_, num_data_, *device_gradients_, *device_hessians_,
+      *device_data_indices_, num_data_, device_gradients_, device_hessians_,
       *device_subhistograms_, *sync_counters_, device_histogram_outputs_);
   }
   // create the OpenCL kernel for the root node (all data)
@@ -380,7 +401,7 @@ void GPUTreeLearner::InitGPU(int platform_id, int device_id) {
   // setup kernel arguments
   // The only argument that needs to be changed is num_data_
   histogram_fulldata_kernel_.set_args(*device_features_,
-    *device_data_indices_, num_data_, *device_gradients_, *device_hessians_,
+    *device_data_indices_, num_data_, device_gradients_, device_hessians_,
     *device_subhistograms_, *sync_counters_, device_histogram_outputs_);
 
   // Now generate new data structure feature4, and copy data to the device
@@ -554,6 +575,7 @@ void GPUTreeLearner::ResetConfig(const TreeConfig* tree_config) {
 Tree* GPUTreeLearner::Train(const score_t* gradients, const score_t *hessians) {
   gradients_ = gradients;
   hessians_ = hessians;
+
   // some initial works before training
   BeforeTrain();
   auto tree = std::unique_ptr<Tree>(new Tree(tree_config_->num_leaves));
@@ -594,8 +616,11 @@ void GPUTreeLearner::BeforeTrain() {
 #if GPU_DEBUG >= 2
   printf("Copying intial full gradients and hessians to device\n");
 #endif
-  hessians_future_ = boost::compute::copy_async(hessians_, hessians_ + num_data_, device_hessians_->begin(), queue_);
-  gradients_future_ = boost::compute::copy_async(gradients_, gradients_ + num_data_, device_gradients_->begin(), queue_);
+  // TODO: use bagging
+  // initial copy will just use an enqueue write buffer
+  hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, num_data_ * sizeof(score_t), hessians_);
+  gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, num_data_ * sizeof(score_t), gradients_);
+
 
   // reset histogram pool
   histogram_pool_.ResetMap();
@@ -745,19 +770,25 @@ bool GPUTreeLearner::BeforeFindBestSplit(int left_leaf, int right_leaf) {
 #endif
   indices_future_ = boost::compute::copy_async(indices + begin, indices + end, device_data_indices_->begin(), queue_);
 
+
+  // This is about 7% of time, to re-order gradient and hessians
   #pragma omp parallel for schedule(static)
   for (data_size_t i = begin; i < end; ++i) {
     ordered_hessians_[i - begin] = hessians_[indices[i]];
   }
   // copy ordered hessians to the GPU:
-  hessians_future_ = boost::compute::copy_async(ordered_hessians_.begin(), ordered_hessians_.begin() + end - begin, device_hessians_->begin(), queue_);
+  // hessians_future_ = boost::compute::copy_async(ordered_hessians_.begin(), ordered_hessians_.begin() + end - begin, device_hessians_->begin(), queue_);
+  hessians_future_ = queue_.enqueue_write_buffer_async(device_hessians_, 0, (end - begin) * sizeof(score_t), ptr_pinned_hessians_);
+
 
   #pragma omp parallel for schedule(static)
   for (data_size_t i = begin; i < end; ++i) {
     ordered_gradients_[i - begin] = gradients_[indices[i]];
   }
   // copy ordered gradients to the GPU:
-  gradients_future_ = boost::compute::copy_async(ordered_gradients_.begin(), ordered_gradients_.begin() + end - begin, device_gradients_->begin(), queue_);
+  // gradients_future_ = boost::compute::copy_async(ordered_gradients_.begin(), ordered_gradients_.begin() + end - begin, device_gradients_->begin(), queue_);
+  gradients_future_ = queue_.enqueue_write_buffer_async(device_gradients_, 0, (end - begin) * sizeof(score_t), ptr_pinned_gradients_);
+
   // assign pointer
   ptr_to_ordered_gradients_smaller_leaf_ = ordered_gradients_.data();
   ptr_to_ordered_hessians_smaller_leaf_ = ordered_hessians_.data();
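Editor's note on the .cpp changes above (a summary, not part of the diff): boost::compute::copy_async from ordinary pageable vectors typically makes the driver stage the data through its own temporary pinned buffer, which limits how much of the copy can overlap with host work. After this commit the ordered gradient/hessian arrays themselves back a mapped use_host_ptr buffer, so queue_.enqueue_write_buffer_async can start the transfer straight from the re-ordered data and return an event immediately; the OpenMP re-order loops, the CPU-side sparse-feature work, and the transfer can then proceed concurrently, with the event waited on only when the histogram kernel is launched.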
19 changes: 13 additions & 6 deletions src/treelearner/gpu_tree_learner.h
@@ -24,6 +24,7 @@
 #include <boost/compute/algorithm/transform.hpp>
 #include <boost/compute/container/vector.hpp>
 #include <boost/compute/functional/math.hpp>
+#include <boost/align/aligned_allocator.hpp>
 
 namespace LightGBM {
 
@@ -167,9 +168,9 @@ class GPUTreeLearner: public TreeLearner {
   std::unique_ptr<LeafSplits> larger_leaf_splits_;
 
   /*! \brief gradients of current iteration, ordered for cache optimized */
-  std::vector<score_t> ordered_gradients_;
+  std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_gradients_;
   /*! \brief hessians of current iteration, ordered for cache optimized */
-  std::vector<score_t> ordered_hessians_;
+  std::vector<score_t, boost::alignment::aligned_allocator<score_t, 4096>> ordered_hessians_;
 
   /*! \brief Pointer to ordered_gradients_, use this to avoid copy at BeforeTrain */
   const score_t* ptr_to_ordered_gradients_smaller_leaf_;
@@ -220,8 +221,14 @@ class GPUTreeLearner: public TreeLearner {
   std::vector<int> sparse_feature_map_;
   std::vector<int> device_bin_mults_;
   std::unique_ptr<boost::compute::vector<Feature4>> device_features_;
-  std::unique_ptr<boost::compute::vector<score_t>> device_gradients_;
-  std::unique_ptr<boost::compute::vector<score_t>> device_hessians_;
+  // std::unique_ptr<boost::compute::vector<score_t>> device_gradients_;
+  boost::compute::buffer device_gradients_;
+  boost::compute::buffer pinned_gradients_;
+  void * ptr_pinned_gradients_;
+  // std::unique_ptr<boost::compute::vector<score_t>> device_hessians_;
+  boost::compute::buffer device_hessians_;
+  boost::compute::buffer pinned_hessians_;
+  void * ptr_pinned_hessians_;
   std::unique_ptr<boost::compute::vector<data_size_t>> device_data_indices_;
   std::unique_ptr<boost::compute::vector<int>> sync_counters_;
   std::unique_ptr<boost::compute::vector<char>> device_subhistograms_;
@@ -230,8 +237,8 @@ class GPUTreeLearner: public TreeLearner {
   boost::compute::wait_list histograms_wait_obj_;
   GPUHistogramBinEntry* host_histogram_outputs_;
   boost::compute::future<void> indices_future_;
-  boost::compute::future<void> gradients_future_;
-  boost::compute::future<void> hessians_future_;
+  boost::compute::event gradients_future_;
+  boost::compute::event hessians_future_;
 };
 
 
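Editor's note on the header changes: the 4096-byte aligned_allocator gives ordered_gradients_ and ordered_hessians_ page-aligned storage, which is presumably what lets the runtime pin the existing allocation in place when it is wrapped with use_host_ptr, and gradients_future_/hessians_future_ switch from boost::compute::future<void> to boost::compute::event because enqueue_write_buffer_async returns a bare event rather than a typed future; both types expose the same wait() call used before kernel launch.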