
Add Some More GPU documentation #401

Merged · 100 commits · Apr 12, 2017

Commits
4810c79
add dummy gpu solver code
huanzhang12 Feb 10, 2017
e41ba15
initial GPU code
huanzhang12 Feb 12, 2017
6dde565
fix crash bug
huanzhang12 Feb 12, 2017
2dce7d1
first working version
huanzhang12 Feb 12, 2017
146b2dd
use asynchronous copy
huanzhang12 Feb 12, 2017
1f39a03
use a better kernel for root
huanzhang12 Feb 13, 2017
435674d
parallel read histogram
huanzhang12 Feb 13, 2017
22f478a
sparse features now work, but no acceleration; compute on CPU
huanzhang12 Feb 13, 2017
cfd77ae
compute sparse feature on CPU simultaneously
huanzhang12 Feb 13, 2017
40c3212
fix big bug; add gpu selection; add kernel selection
huanzhang12 Feb 14, 2017
c3398c9
better debugging
huanzhang12 Feb 14, 2017
76a13c7
clean up
huanzhang12 Feb 15, 2017
2dc4555
add feature scatter
huanzhang12 Feb 15, 2017
d4c1c01
Add sparse_threshold control
huanzhang12 Feb 15, 2017
97da274
fix a bug in feature scatter
huanzhang12 Feb 15, 2017
a96ca80
clean up debug
huanzhang12 Feb 15, 2017
9be6438
temporarily add OpenCL kernels for k=64,256
huanzhang12 Feb 27, 2017
cbef453
fix up CMakeList and definition USE_GPU
huanzhang12 Feb 27, 2017
4d08152
add OpenCL kernels as string literals
huanzhang12 Feb 28, 2017
624d405
Add boost.compute as a submodule
huanzhang12 Feb 28, 2017
11b241f
add boost dependency into CMakeList
huanzhang12 Feb 28, 2017
5142f19
fix opencl pragma
huanzhang12 Feb 28, 2017
508b48c
use pinned memory for histogram
huanzhang12 Feb 28, 2017
1a63b99
use pinned buffer for gradients and hessians
huanzhang12 Mar 1, 2017
e2166b1
better debugging message
huanzhang12 Mar 1, 2017
3b24e33
add double precision support on GPU
huanzhang12 Mar 9, 2017
e7336ee
fix boost version in CMakeList
huanzhang12 Mar 9, 2017
b29fec7
Add a README
huanzhang12 Mar 9, 2017
97fed3e
reconstruct GPU initialization code for ResetTrainingData
huanzhang12 Mar 12, 2017
164dbd1
move data to GPU in parallel
huanzhang12 Mar 12, 2017
c1c605e
fix a bug during feature copy
huanzhang12 Mar 13, 2017
c5ab1ae
update gpu kernels
huanzhang12 Mar 13, 2017
947629a
update gpu code
huanzhang12 Mar 15, 2017
105b0dd
initial port to LightGBM v2
huanzhang12 Mar 19, 2017
ba2c0a3
speedup GPU data loading process
huanzhang12 Mar 21, 2017
a6cb794
Add 4-bit bin support to GPU
huanzhang12 Mar 22, 2017
ed929cb
re-add sparse_threshold parameter
huanzhang12 Mar 23, 2017
2cd3d85
remove kMaxNumWorkgroups and allow an unlimited number of features
huanzhang12 Mar 23, 2017
4d2758f
add feature mask support for skipping unused features
huanzhang12 Mar 24, 2017
62bc04e
enable kernel cache
huanzhang12 Mar 24, 2017
e4dd344
use GPU kernels without feature masks when all features are used
huanzhang12 Mar 24, 2017
61b09a3
README.
Mar 25, 2017
da20fc0
README.
Mar 25, 2017
2d43e36
update README
huanzhang12 Mar 25, 2017
9602cd7
update to v2
huanzhang12 Mar 25, 2017
cd52bb0
fix typos (#349)
wxchan Mar 17, 2017
be91a98
change compile to gcc on Apple as default
chivee Mar 18, 2017
8f1d05e
clean vscode related file
chivee Mar 19, 2017
411383f
refine api of constructing from sampling data.
guolinke Mar 21, 2017
487660e
fix bug in the last commit.
guolinke Mar 21, 2017
882f420
more efficient algorithm to sample k from n.
guolinke Mar 22, 2017
7d0f338
fix bug in filter bin
guolinke Mar 22, 2017
0b44817
change to boost from average output.
guolinke Mar 22, 2017
85a3ba4
fix tests.
guolinke Mar 22, 2017
f615ba0
only stop training when all classes are finished in multi-class.
guolinke Mar 23, 2017
fbed3ca
limit the max tree output. change hessian in multi-class objective.
guolinke Mar 24, 2017
8eb961b
robust tree model loading.
guolinke Mar 24, 2017
10cd85f
fix test.
guolinke Mar 24, 2017
e57ec49
convert the probabilities to raw score in boost_from_average of class…
guolinke Mar 24, 2017
39965a0
fix the average label for binary classification.
guolinke Mar 24, 2017
8ac77dc
Add boost_from_average to docs (#354)
Laurae2 Mar 24, 2017
25f6268
don't use "ConvertToRawScore" for self-defined objective function.
guolinke Mar 24, 2017
bf3dfb6
boost_from_average doesn't seem to work well in binary classification. …
guolinke Mar 24, 2017
22df883
For a better jump link (#355)
JayveeHe Mar 25, 2017
9f4d2f0
add FitByExistingTree.
guolinke Mar 25, 2017
f54ac4d
adapt GPU tree learner for FitByExistingTree
huanzhang12 Mar 26, 2017
59c473b
avoid NaN output.
guolinke Mar 26, 2017
a0549d1
update boost.compute
huanzhang12 Mar 26, 2017
5e945d2
fix typos (#361)
zhangyafeikimi Mar 26, 2017
3891cdb
fix broken links (#359)
wxchan Mar 26, 2017
48b4d9d
update README
huanzhang12 Mar 27, 2017
7248e58
disable GPU acceleration by default
huanzhang12 Mar 27, 2017
56fe2cc
fix image url
huanzhang12 Mar 27, 2017
1c51775
cleanup debug macro
huanzhang12 Mar 27, 2017
78ae386
Initial GPU acceleration
huanzhang12 Mar 27, 2017
2690181
Merge remote-tracking branch 'gpudev/master'
huanzhang12 Mar 27, 2017
f3573d5
remove old README
huanzhang12 Mar 27, 2017
12e5b82
do not save sparse_threshold_ in FeatureGroup
huanzhang12 Mar 27, 2017
1159854
add details for new GPU settings
huanzhang12 Mar 27, 2017
c719ead
ignore submodule when doing pep8 check
huanzhang12 Mar 27, 2017
15c97b4
allocate workspace for at least one thread during building Feature4
huanzhang12 Mar 27, 2017
cb35a02
move sparse_threshold to class Dataset
huanzhang12 Mar 28, 2017
a039a3a
remove duplicated code in GPUTreeLearner::Split
huanzhang12 Mar 29, 2017
35ab97f
Remove duplicated code in FindBestThresholds and BeforeFindBestSplit
huanzhang12 Mar 29, 2017
28c1715
do not rebuild ordered gradients and hessians for sparse features
huanzhang12 Mar 29, 2017
2af1860
support feature groups in GPUTreeLearner
huanzhang12 Apr 4, 2017
475cf8c
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 5, 2017
4d5d957
Initial parallel learners with GPU support
huanzhang12 Apr 5, 2017
4b44173
add option device, cleanup code
huanzhang12 Apr 5, 2017
b948c1f
clean up FindBestThresholds; add some omp parallel
huanzhang12 Apr 6, 2017
50f7da1
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
3a16753
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 7, 2017
2b0514e
constant hessian optimization for GPU
huanzhang12 Apr 8, 2017
e72d8cd
Fix GPUTreeLearner crash when there is zero feature
huanzhang12 Apr 9, 2017
a68ae52
use np.testing.assert_almost_equal() to compare lists of floats in tests
huanzhang12 Apr 9, 2017
2ac5103
travis for GPU
huanzhang12 Apr 9, 2017
edb30a6
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
0c5eb15
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 9, 2017
b121443
Merge remote-tracking branch 'upstream/master'
huanzhang12 Apr 11, 2017
74bc952
add tutorial and more GPU docs
huanzhang12 Apr 12, 2017
Changes from 1 commit:
constant hessian optimization for GPU
huanzhang12 committed Apr 8, 2017
commit 2b0514e78ce1e443759a29e37bb4246b3d77665e
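When the objective's hessian is the same for every sample, the histogram kernels no longer need to accumulate per-sample hessians: the per-bin hessian sum can be reconstructed from the bin counters. A minimal CPU sketch of the idea (names are illustrative, not the PR's GPU code):

```cpp
#include <cstddef>

// One histogram bin: gradient sum plus sample count. With a constant
// hessian h0, the per-bin hessian sum is simply h0 * count, so it never
// needs to be accumulated separately.
struct BinEntry {
  double sum_gradient = 0.0;
  int count = 0;
};

void ConstHessianHistogram(const float* gradients, const unsigned char* bins,
                           std::size_t num_data, BinEntry* hist) {
  for (std::size_t i = 0; i < num_data; ++i) {
    hist[bins[i]].sum_gradient += gradients[i];  // gradient bank only
    hist[bins[i]].count += 1;                    // counter bank
  }
  // hessian sum for bin b is recovered later as const_hessian * hist[b].count
}
```

On the GPU this roughly halves the local-memory atomic traffic, which is what the CONST_HESSIAN=1 paths in the kernel diff below implement.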
3 changes: 2 additions & 1 deletion include/LightGBM/tree_learner.h
@@ -24,8 +24,9 @@ class TreeLearner {
/*!
* \brief Initialize tree learner with training dataset
* \param train_data The used training data
* \param is_constant_hessian True if all hessians share the same value
*/
-  virtual void Init(const Dataset* train_data) = 0;
+  virtual void Init(const Dataset* train_data, bool is_constant_hessian) = 0;

virtual void ResetTrainingData(const Dataset* train_data) = 0;

2 changes: 1 addition & 1 deletion src/boosting/gbdt.cpp
@@ -95,7 +95,7 @@ void GBDT::ResetTrainingData(const BoostingConfig* config, const Dataset* train_
tree_learner_ = std::unique_ptr<TreeLearner>(TreeLearner::CreateTreeLearner(new_config->tree_learner_type, new_config->device_type, &new_config->tree_config));
}
// init tree learner
-  tree_learner_->Init(train_data);
+  tree_learner_->Init(train_data, is_constant_hessian_);

// push training metrics
training_metrics_.clear();
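For intuition about when is_constant_hessian_ is true: with the plain L2 regression objective, the second derivative of the loss is the same for every sample, so a per-sample hessian array carries no information. A hypothetical illustration of such an objective (ours, not the PR's code):

```cpp
#include <cstddef>

// For L2 loss 0.5 * (y - p)^2 the per-sample derivatives are
//   gradient_i = p_i - y_i   and   hessian_i = 1.0,
// so the hessian is one constant and the boosting layer can pass a single
// flag down to the tree learner instead of a full hessian array.
void GetGradientsL2(const double* label, const double* score, std::size_t n,
                    double* grad, double* hess) {
  for (std::size_t i = 0; i < n; ++i) {
    grad[i] = score[i] - label[i];
    hess[i] = 1.0;  // constant for every sample
  }
}
```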
4 changes: 2 additions & 2 deletions src/treelearner/data_parallel_tree_learner.cpp
@@ -18,9 +18,9 @@ DataParallelTreeLearner<TREELEARNER_T>::~DataParallelTreeLearner() {
}

template <typename TREELEARNER_T>
-void DataParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data) {
+void DataParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data, bool is_constant_hessian) {
  // initialize SerialTreeLearner
-  TREELEARNER_T::Init(train_data);
+  TREELEARNER_T::Init(train_data, is_constant_hessian);
// Get local rank and global machine size
rank_ = Network::rank();
num_machines_ = Network::num_machines();
4 changes: 2 additions & 2 deletions src/treelearner/feature_parallel_tree_learner.cpp
@@ -18,8 +18,8 @@ FeatureParallelTreeLearner<TREELEARNER_T>::~FeatureParallelTreeLearner() {
}

template <typename TREELEARNER_T>
-void FeatureParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data) {
-  TREELEARNER_T::Init(train_data);
+void FeatureParallelTreeLearner<TREELEARNER_T>::Init(const Dataset* train_data, bool is_constant_hessian) {
+  TREELEARNER_T::Init(train_data, is_constant_hessian);
rank_ = Network::rank();
num_machines_ = Network::num_machines();
input_buffer_.resize(sizeof(SplitInfo) * 2);
263 changes: 165 additions & 98 deletions src/treelearner/gpu_tree_learner.cpp

Large diffs are not rendered by default.

17 changes: 16 additions & 1 deletion src/treelearner/gpu_tree_learner.h
@@ -38,8 +38,9 @@ class GPUTreeLearner: public SerialTreeLearner {
public:
explicit GPUTreeLearner(const TreeConfig* tree_config);
~GPUTreeLearner();
-  void Init(const Dataset* train_data) override;
+  void Init(const Dataset* train_data, bool is_constant_hessian) override;
void ResetTrainingData(const Dataset* train_data) override;
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_constant_hessian) override;

void SetBaggingData(const data_size_t* used_indices, data_size_t num_data) override {
SerialTreeLearner::SetBaggingData(used_indices, num_data);
@@ -101,6 +102,16 @@ class GPUTreeLearner: public SerialTreeLearner {
*/
void AllocateGPUMemory();

/*!
* \brief Compile OpenCL GPU source code to kernel binaries
*/
void BuildGPUKernels();

/*!
* \brief Setup GPU kernel arguments, preparing for launching
*/
void SetupKernelArguments();

/*!
* \brief Compute GPU feature histogram for the current leaf.
* Indices, gradients and hessians have been copied to the device.
@@ -166,6 +177,10 @@ class GPUTreeLearner: public SerialTreeLearner {
const char *kernel16_src_ =
#include "ocl/histogram16.cl"
;
/*! \brief Currently used kernel source */
std::string kernel_source_;
/*! \brief Currently used kernel name */
std::string kernel_name_;

/*! \brief a array of histogram kernels with different number
of workgroups per feature */
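The new BuildGPUKernels() method together with the kernel_source_ and kernel_name_ members implies the kernel variant is chosen when the OpenCL program is compiled. A minimal sketch, assuming boost.compute (already vendored as a submodule on this branch); apart from the CONST_HESSIAN and USE_DP_FLOAT macros, which appear in the kernel sources, all names and options here are assumptions rather than the PR's exact code:

```cpp
#include <boost/compute/context.hpp>
#include <boost/compute/kernel.hpp>
#include <boost/compute/program.hpp>
#include <string>

namespace bc = boost::compute;

// Build one histogram kernel, selecting the constant-hessian variant via a
// -D option so the OpenCL compiler strips the ordered_hessians argument.
bc::kernel BuildHistogramKernel(const bc::context& ctx,
                                const std::string& kernel_source,
                                const std::string& kernel_name,
                                bool is_constant_hessian,
                                bool use_double_precision) {
  std::string options;
  options += is_constant_hessian ? " -D CONST_HESSIAN=1" : " -D CONST_HESSIAN=0";
  options += use_double_precision ? " -D USE_DP_FLOAT=1" : " -D USE_DP_FLOAT=0";
  bc::program program =
      bc::program::build_with_source(kernel_source, ctx, options);
  return program.create_kernel(kernel_name);
}
```

SetupKernelArguments() would then bind a single score_t scalar in place of the ordered_hessians buffer whenever CONST_HESSIAN=1, matching the kernel signatures in the diff below.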
62 changes: 61 additions & 1 deletion src/treelearner/ocl/histogram16.cl
@@ -29,6 +29,10 @@ R""()
#ifndef USE_DP_FLOAT
#define USE_DP_FLOAT 0
#endif
// ignore hessian, and use the local memory for hessian as an additional bank for gradient
#ifndef CONST_HESSIAN
#define CONST_HESSIAN 0
#endif


#define LOCAL_SIZE_0 256
@@ -208,7 +212,11 @@ __kernel void histogram16(__global const uchar4* restrict feature_data_base,
__constant const data_size_t* restrict data_indices __attribute__((max_constant_size(65536))),
const data_size_t num_data,
__constant const score_t* restrict ordered_gradients __attribute__((max_constant_size(65536))),
#if CONST_HESSIAN == 0
__constant const score_t* restrict ordered_hessians __attribute__((max_constant_size(65536))),
#else
const score_t const_hessian,
#endif
__global char* restrict output_buf,
__global volatile int * sync_counters,
__global acc_type* restrict hist_buf_base) {
@@ -219,7 +227,11 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
__global const data_size_t* data_indices,
const data_size_t num_data,
__global const score_t* ordered_gradients,
#if CONST_HESSIAN == 0
__global const score_t* ordered_hessians,
#else
const score_t const_hessian,
#endif
__global char* restrict output_buf,
__global volatile int * sync_counters,
__global acc_type* restrict hist_buf_base) {
@@ -357,7 +369,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
data_size_t ind;
data_size_t ind_next;
stat1 = ordered_gradients[subglobal_tid];
#if CONST_HESSIAN == 0
stat2 = ordered_hessians[subglobal_tid];
#endif
#ifdef IGNORE_INDICES
ind = subglobal_tid;
#else
@@ -370,7 +384,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
// prefetch the next iteration variables
// we don't need a boundary check because we have made the buffer larger
stat1_next = ordered_gradients[i + subglobal_size];
#if CONST_HESSIAN == 0
stat2_next = ordered_hessians[i + subglobal_size];
#endif
#ifdef IGNORE_INDICES
// we need to check the bounds here
ind_next = i + subglobal_size < num_data ? i + subglobal_size : i;
@@ -379,12 +395,14 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
#else
ind_next = data_indices[i + subglobal_size];
#endif
#if CONST_HESSIAN == 0
// swap gradient and hessian for threads 8, 9, 10, 11, 12, 13, 14, 15
float tmp = stat1;
stat1 = is_hessian_first ? stat2 : stat1;
stat2 = is_hessian_first ? tmp : stat2;
// stat1 = select(stat1, stat2, is_hessian_first);
// stat2 = select(stat2, tmp, is_hessian_first);
#endif

// STAGE 2: accumulate gradient and hessian
offset = (ltid & DWORD_FEATURES_MASK);
@@ -399,7 +417,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 0, 1, 2, 3, 4, 5, 6, 7's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 0, 1, 2, 3, 4, 5, 6, 7's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}
offset = (offset + 1) & DWORD_FEATURES_MASK;
if (feature_mask.s6) {
@@ -411,7 +431,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 1, 2, 3, 4, 5, 6, 7, 0's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 1, 2, 3, 4, 5, 6, 7, 0's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}

offset = (offset + 1) & DWORD_FEATURES_MASK;
@@ -424,7 +446,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 2, 3, 4, 5, 6, 7, 0, 1's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 2, 3, 4, 5, 6, 7, 0, 1's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}
offset = (offset + 1) & DWORD_FEATURES_MASK;
if (feature_mask.s4) {
@@ -436,7 +460,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 3, 4, 5, 6, 7, 0, 1, 2's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 3, 4, 5, 6, 7, 0, 1, 2's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}


@@ -456,7 +482,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 4, 5, 6, 7, 0, 1, 2, 3's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 4, 5, 6, 7, 0, 1, 2, 3's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}
offset = (offset + 1) & DWORD_FEATURES_MASK;
if (feature_mask.s2) {
@@ -468,7 +496,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 5, 6, 7, 0, 1, 2, 3, 4's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 5, 6, 7, 0, 1, 2, 3, 4's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}

offset = (offset + 1) & DWORD_FEATURES_MASK;
@@ -481,7 +511,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 6, 7, 0, 1, 2, 3, 4, 5's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 6, 7, 0, 1, 2, 3, 4, 5's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}
offset = (offset + 1) & DWORD_FEATURES_MASK;
if (feature_mask.s0) {
@@ -493,7 +525,9 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
atomic_local_add_f(gh_hist + addr, stat1);
// thread 0, 1, 2, 3, 4, 5, 6, 7 now process feature 7, 0, 1, 2, 3, 4, 5, 6's hessians for example 0, 1, 2, 3, 4, 5, 6, 7
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 7, 0, 1, 2, 3, 4, 5, 6's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#endif
}

// STAGE 3: accumulate counter
@@ -598,9 +632,35 @@ __kernel void histogram16(__global const uchar4* feature_data_base,
// now thread 0 - 7 holds feature 0 - 7's gradient for bin 0 and counter bin 0
// now thread 8 - 15 holds feature 0 - 7's hessian for bin 0 and counter bin 1
// now thread 16- 23 holds feature 0 - 7's gradient for bin 1 and counter bin 2
-  // now thread 24- 31 holds feature 0 - 7's hessian for bin 2 and counter bin 3
+  // now thread 24- 31 holds feature 0 - 7's hessian for bin 1 and counter bin 3
// etc,

#if CONST_HESSIAN == 1
// Combine the two banks into one, and fill the hessians with counter value * hessian constant
barrier(CLK_LOCAL_MEM_FENCE);
gh_hist[ltid] = stat_val;
if (ltid < LOCAL_SIZE_0 / 2) {
cnt_hist[ltid] = cnt_val;
}
barrier(CLK_LOCAL_MEM_FENCE);
if (is_hessian_first) {
// this is the hessians
// thread 8 - 15 read counters stored by thread 0 - 7
// thread 24- 31 read counters stored by thread 8 - 15
// thread 40- 47 read counters stored by thread 16- 23, etc
stat_val = const_hessian *
cnt_hist[((ltid - DWORD_FEATURES) >> (LOG2_DWORD_FEATURES + 1)) * DWORD_FEATURES + (ltid & DWORD_FEATURES_MASK)];
}
else {
// this is the gradients
// thread 0 - 7 read gradients stored by thread 8 - 15
// thread 16- 23 read gradients stored by thread 24- 31
// thread 32- 39 read gradients stored by thread 40- 47, etc
stat_val += gh_hist[ltid + DWORD_FEATURES];
}
barrier(CLK_LOCAL_MEM_FENCE);
#endif

// write to output
// write gradients and hessians histogram for all 4 features
// output data in linear order for further reduction
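Why the CONST_HESSIAN == 1 combine step above is valid: when every sample carries the same hessian value h0, the per-bin hessian sum collapses to a scaled bin count, which is exactly what the kernel reconstructs from cnt_hist before writing the output (notation ours):

```latex
\sum_{i \,:\, \mathrm{bin}(i) = b} h_i \;=\; h_0 \cdot \bigl|\{\, i : \mathrm{bin}(i) = b \,\}\bigr| \;=\; h_0 \cdot \mathrm{count}_b
```

The two gradient banks are summed into one, and the freed hessian slots are then filled with const_hessian times the per-bin counters.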