Update image example with hash encoding
- Use a higher-resolution sample image to better showcase the encoding's ability
- Switch from OpenEXR image format to regular jpg
- Improve legibility of compilation instructions
Tom94 committed Feb 12, 2022
1 parent c835bd2 commit b3d5cae
Showing 113 changed files with 9,889 additions and 41,825 deletions.
52 changes: 29 additions & 23 deletions README.md
@@ -5,7 +5,7 @@ This is a small, self-contained framework for training and querying neural netwo
## Performance

![Image](data/readme/fully-fused-vs-tensorflow.png)
_Fully fused networks vs. TensorFlow v2.5.0 w/ XLA. Measured on 64 (solid line) and 128 (dashed line) neurons wide multi-layer perceptrons on an RTX 3090. Generated by `benchmarks/bench_ours.cu` and `benchmarks/bench_tensorflow.py`._
_Fully fused networks vs. TensorFlow v2.5.0 w/ XLA. Measured on 64 (solid line) and 128 (dashed line) neurons wide multi-layer perceptrons on an RTX 3090. Generated by `benchmarks/bench_ours.cu` and `benchmarks/bench_tensorflow.py` using `data/config_oneblob.json`._


## Usage
@@ -66,50 +66,56 @@ model.network->inference(inference_inputs, inference_outputs);
We provide a sample application where an image function _(x,y) -> (R,G,B)_ is learned. It can be run via
```sh
tiny-cuda-nn/build> ./mlp_learning_an_image ../data/images/albert.exr ../data/config.json
tiny-cuda-nn/build$ ./mlp_learning_an_image ../data/images/albert.jpg ../data/config_hash.json
```
producing an image every 1000 training steps. Each 1000 steps should take roughly 0.8 seconds with the default configuration on an RTX 3090.
producing an image every 1000 training steps. Each 1000 steps should take roughly 0.42 seconds with the default configuration on an RTX 3090.

| Learned image after 1,000 steps | Learned image after 10,000 steps | Reference image |
|:---:|:---:|:---:|
| ![1,000 steps](data/readme/learned_image_after_1000_steps.jpg) | ![10,000 steps](data/readme/learned_image_after_10000_steps.jpg) | ![reference](data/readme/reference_image.jpg) |
| 10 steps (4.2 ms) | 100 steps (42 ms) | 1000 steps (420 ms) | Reference image |
|:---:|:---:|:---:|:---:|
| ![10steps](data/readme/10.jpg) | ![100steps](data/readme/100.jpg) | ![1000steps](data/readme/1000.jpg) | ![reference](data/images/albert.jpg) |
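
To make the phrase "an image function _(x,y) -> (R,G,B)_ is learned" concrete, the following is a minimal NumPy sketch of how training pairs for such a function could be drawn. It is an illustration only: the helper name `sample_training_batch` and the nearest-neighbour lookup are assumptions of this sketch, not the sample application's actual sampling code.

```python
import numpy as np

def sample_training_batch(image, batch_size, rng):
    """Draw random (x, y) coordinates in [0, 1)^2 and look up their RGB targets.

    `image` is an (H, W, 3) float array. Nearest-neighbour lookup is used for
    brevity; this is an assumption of the sketch, not the sample's filter.
    """
    height, width, _ = image.shape
    xy = rng.random((batch_size, 2), dtype=np.float32)  # columns: x, y
    # Clamp the indices because float32 rounding can push x * width up to width.
    cols = np.minimum((xy[:, 0] * width).astype(np.int64), width - 1)
    rows = np.minimum((xy[:, 1] * height).astype(np.int64), height - 1)
    rgb = image[rows, cols]  # network targets, shape (batch_size, 3)
    return xy, rgb

rng = np.random.default_rng(0)
image = rng.random((4, 6, 3), dtype=np.float32)  # stand-in for a loaded image
xy, rgb = sample_training_batch(image, batch_size=8, rng=rng)
print(xy.shape, rgb.shape)
```

A network trained on such pairs takes `xy` as input and is optimized so that its output matches `rgb`.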



## Requirements

- CUDA __v10.2 or higher__.
- CMake __v3.18 or higher__.
- A __C++14__ capable compiler.
- A high-end NVIDIA GPU that supports TensorCores and has a large amount of shared memory. The framework was tested primarily with an RTX 3090.
- An __NVIDIA GPU__; tensor cores increase performance when available. All shown results come from an RTX 3090.
- A __C++14__ capable compiler. The following choices are recommended and have been tested:
- __Windows:__ Visual Studio 2019
- __Linux:__ GCC/G++ 7.5 or higher
- __[CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2 or higher__ and __[CMake](https://cmake.org/) v3.21 or higher__.
- The fully fused MLP component of this framework requires a __very large__ amount of shared memory in its default configuration. It will likely only work on an RTX 3090, an RTX 2080 Ti, or high-end enterprise GPUs. Lower end cards must reduce the `n_neurons` parameter or use the `CutlassMLP` (better compatibility but slower) instead.
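
For lower-end cards, the two workarounds named above (a smaller `n_neurons`, or `CutlassMLP`) amount to a small change in the JSON config. The sketch below assumes the config has a `network` object with `otype` and `n_neurons` keys; the helper `make_compatible_network_config` and the starting values are hypothetical, so check DOCUMENTATION.md for the actual schema.

```python
import copy
import json

def make_compatible_network_config(config, n_neurons=None, use_cutlass=False):
    """Return a copy of `config` with a narrower and/or more portable MLP.

    Assumes a "network" object with "otype" and "n_neurons" keys (an
    assumption of this sketch; see DOCUMENTATION.md for the real schema).
    """
    result = copy.deepcopy(config)
    network = result.setdefault("network", {})
    if n_neurons is not None:
        network["n_neurons"] = n_neurons  # fewer neurons need less shared memory
    if use_cutlass:
        network["otype"] = "CutlassMLP"  # better compatibility but slower
    return result

# Hypothetical starting configuration; the values are placeholders.
config = {"network": {"otype": "FullyFusedMLP", "n_neurons": 128, "n_hidden_layers": 2}}
portable = make_compatible_network_config(config, n_neurons=64, use_cutlass=True)
print(json.dumps(portable, indent=2))
```

The original `config` is left untouched, so both variants can be benchmarked side by side.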

## Compilation
If you are using Linux, install the following packages
```sh
sudo apt-get install build-essential git
```

We also recommend installing [CUDA](https://developer.nvidia.com/cuda-toolkit) in `/usr/local/` and adding the CUDA installation to your PATH.
For example, if you have CUDA 11.4, add the following to your `~/.bashrc`
```sh
export PATH="/usr/local/cuda-11.4/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH"
```


## Compilation (Windows & Linux)

Begin by cloning this repository and all its submodules using the following command:
```sh
$ git clone --recursive https://github.com/nvlabs/tiny-cuda-nn
$ cd tiny-cuda-nn
```

Then, use CMake to generate build files:

Then, use CMake to build the project: (on Windows, this must be in a [developer command prompt](https://docs.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-160#developer_command_prompt))
```sh
tiny-cuda-nn$ mkdir build
tiny-cuda-nn$ cd build
tiny-cuda-nn/build$ cmake ..
tiny-cuda-nn$ cmake . -B build
tiny-cuda-nn$ cmake --build build --config RelWithDebInfo -j 16
```

The last step differs by operating system.
- Windows: open `tiny-cuda-nn/build/tiny-cuda-nn.sln` in Visual Studio and click the "Build" button.
- Linux: run the command
```sh
tiny-cuda-nn/build$ make -j
```

## Components

The following is a summary of all components of this framework that are currently released. Please consult [the JSON documentation](DOCUMENTATION.md) for how to configure them.
Following is a summary of the components of this framework. See [the JSON documentation](DOCUMENTATION.md) for how to configure each.


| Networks |   |  
2 changes: 1 addition & 1 deletion benchmarks/image/CMakeLists.txt
@@ -20,6 +20,6 @@
# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

add_executable(bench_image_ours bench_ours.cu tinyexr.cpp)
add_executable(bench_image_ours bench_ours.cu ../../dependencies/stbi/stbi_wrapper.cpp)
target_link_libraries(bench_image_ours PUBLIC ${CUDA_LIBRARIES} tiny-cuda-nn cublas)
target_compile_options(bench_image_ours PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:${CUDA_NVCC_FLAGS}>)
117 changes: 24 additions & 93 deletions benchmarks/image/bench_ours.cu
@@ -39,7 +39,7 @@

#include <tiny-cuda-nn/trainer.h>

#include <tinyexr/tinyexr.h>
#include <stbi/stbi_wrapper.h>

#include <chrono>
#include <cstdlib>
@@ -56,87 +56,9 @@ using namespace tcnn;
using precision_t = network_precision_t;


bool SaveEXR(const float* data, int width, int height, int nChannels, int channelStride, const char* outfilename) {
EXRHeader header;
InitEXRHeader(&header);

EXRImage image;
InitEXRImage(&image);

image.num_channels = nChannels;

std::vector<std::vector<float>> images(nChannels);
std::vector<float*> image_ptr(nChannels);
for (int i = 0; i < nChannels; ++i) {
images[i].resize(width * height);
}

for (int i = 0; i < nChannels; ++i) {
image_ptr[i] = images[nChannels - i - 1].data();
}

for (size_t i = 0; i < (size_t)width * height; i++) {
for (int c = 0; c < nChannels; ++c) {
images[c][i] = data[channelStride*i+c];
}
}

image.images = (unsigned char**)image_ptr.data();
image.width = width;
image.height = height;

header.num_channels = nChannels;
header.channels = (EXRChannelInfo *)malloc(sizeof(EXRChannelInfo) * header.num_channels);
// Must be (A)BGR order, since most of EXR viewers expect this channel order.
strncpy(header.channels[0].name, "B", 255); header.channels[0].name[strlen("B")] = '\0';
if (nChannels > 1) {
strncpy(header.channels[1].name, "G", 255); header.channels[1].name[strlen("G")] = '\0';
}
if (nChannels > 2) {
strncpy(header.channels[2].name, "R", 255); header.channels[2].name[strlen("R")] = '\0';
}
if (nChannels > 3) {
strncpy(header.channels[3].name, "A", 255); header.channels[3].name[strlen("A")] = '\0';
}

header.pixel_types = (int *)malloc(sizeof(int) * header.num_channels);
header.requested_pixel_types = (int *)malloc(sizeof(int) * header.num_channels);
for (int i = 0; i < header.num_channels; i++) {
header.pixel_types[i] = TINYEXR_PIXELTYPE_FLOAT; // pixel type of input image
header.requested_pixel_types[i] = TINYEXR_PIXELTYPE_HALF; // pixel type of output image to be stored in .EXR
}

const char* err = NULL; // or nullptr in C++11 or later.
int ret = SaveEXRImageToFile(&image, &header, outfilename, &err);
if (ret != TINYEXR_SUCCESS) {
fprintf(stderr, "Save EXR err: %s\n", err);
FreeEXRErrorMessage(err); // free's buffer for an error message
return ret;
}
printf("Saved exr file. [ %s ] \n", outfilename);

free(header.channels);
free(header.pixel_types);
free(header.requested_pixel_types);
return true;
}


GPUMemory<float> load_image(const std::string& filename, int& width, int& height) {
float* out; // width * height * RGBA
const char* err = nullptr;

int ret = LoadEXR(&out, &width, &height, filename.c_str(), &err);

if (ret != TINYEXR_SUCCESS) {
if (err) {
std::string error_message = std::string("Failed to load EXR image: ") + err;
FreeEXRErrorMessage(err);
throw std::runtime_error(error_message);
} else {
throw std::runtime_error("Failed to load EXR image");
}
}
// width * height * RGBA
float* out = load_stbi(&width, &height, filename.c_str());

GPUMemory<float> result(width * height * 4);
result.copy_from_host(out);
@@ -146,16 +68,25 @@ GPUMemory<float> load_image(const std::string& filename, int& width, int& height
}

template <typename T>
void save_image(const GPUMemory<T>& image, int width, int height, int n_channels, int channel_stride, const std::string& filename) {
std::vector<T> host_data(image.size());
image.copy_to_host(host_data.data());
__global__ void to_ldr(const uint64_t num_elements, const uint32_t n_channels, const uint32_t stride, const T* __restrict__ in, uint8_t* __restrict__ out) {
const uint64_t i = threadIdx.x + blockIdx.x * blockDim.x;
if (i >= num_elements) return;

std::vector<float> float_host_data(host_data.size());
for (size_t i = 0; i < host_data.size(); ++i) {
float_host_data[i] = (float)host_data[i];
}
const uint64_t pixel = i / n_channels;
const uint32_t channel = i - pixel * n_channels;

out[i] = (uint8_t)(powf(fmaxf(fminf(in[pixel * stride + channel], 1.0f), 0.0f), 1.0f/2.2f) * 255.0f + 0.5f);
}

template <typename T>
void save_image(const T* image, int width, int height, int n_channels, int channel_stride, const std::string& filename) {
GPUMemory<uint8_t> image_ldr(width * height * n_channels);
linear_kernel(to_ldr<T>, 0, nullptr, width * height * n_channels, n_channels, channel_stride, image, image_ldr.data());

std::vector<uint8_t> image_ldr_host(width * height * n_channels);
CUDA_CHECK_THROW(cudaMemcpy(image_ldr_host.data(), image_ldr.data(), image_ldr.size(), cudaMemcpyDeviceToHost));

SaveEXR(float_host_data.data(), width, height, n_channels, channel_stride, filename.c_str());
save_stbi(image_ldr_host.data(), width, height, n_channels, filename.c_str());
}

template <uint32_t stride>
@@ -192,8 +123,8 @@ int main(int argc, char* argv[]) {
}

if (argc < 3) {
std::cout << "USAGE: " << argv[0] << " " << "path-to-image.exr path-to-config.json" << std::endl;
std::cout << "Sample EXR files are provided in 'data/images'." << std::endl;
std::cout << "USAGE: " << argv[0] << " " << "path-to-image.jpg path-to-config.json" << std::endl;
std::cout << "A sample image is provided in 'data/images'." << std::endl;
return 0;
}

@@ -256,7 +187,7 @@ int main(int argc, char* argv[]) {

eval_image<3><<<n_blocks_linear(n_coords), n_threads_linear>>>(n_coords, texture, filter, width, height, xs_and_ys.data(), sampled_image.data());

save_image(sampled_image, sampling_width, sampling_height, 3, 3, "reference.exr");
save_image(sampled_image.data(), sampling_width, sampling_height, 3, 3, "reference.jpg");

// Fourth step: train the model by sampling the above image and optimizing relative squared error using Adam.
std::vector<uint32_t> batch_sizes = {1 << 14, 1 << 15, 1 << 16, 1 << 17, 1 << 18, 1 << 19, 1 << 20, 1 << 21};
@@ -362,7 +293,7 @@ int main(int argc, char* argv[]) {
encoding->encode(inference_stream, n_coords, {xs_and_ys.data(), num_dims_encoded}, {eval_obe_out.data(), num_output_dims});
network->inference(inference_stream, eval_obe_out, prediction);

save_image(prediction_data, sampling_width, sampling_height, 3, num_output_dims, std::to_string(batch_size) + "-after-" + std::to_string(n_iterations) + "-iters-" + method + ".exr");
save_image(prediction_data.data(), sampling_width, sampling_height, 3, num_output_dims, std::to_string(batch_size) + "-after-" + std::to_string(n_iterations) + "-iters-" + method + ".jpg");

std::cout << "Finished training benchmark. Mean throughput is " << mean_training_throughput << "/s. Waiting 10 seconds for GPU to cool down." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds{10});
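
The new `to_ldr` kernel in this file replaces the EXR writer with an 8-bit conversion: clamp to [0, 1], apply a 1/2.2 gamma curve, scale to 255 and round. As a CPU cross-check, the same arithmetic can be sketched in NumPy. This is a reference sketch written for this note, not code from the repository; the `gamma` parameter is added here for illustration only.

```python
import numpy as np

def to_ldr(image, n_channels, stride, gamma=2.2):
    """CPU reference for the kernel: clamp, gamma-encode, quantize to 8 bits.

    `image` is a flat float array holding `stride` values per pixel, of which
    the first `n_channels` are written to the output.
    """
    pixels = np.asarray(image, dtype=np.float32).reshape(-1, stride)[:, :n_channels]
    clamped = np.clip(pixels, 0.0, 1.0)
    encoded = np.power(clamped, 1.0 / gamma) * 255.0 + 0.5
    # The cast truncates, which together with the +0.5 rounds to nearest.
    return encoded.astype(np.uint8).reshape(-1)

# Two pixels stored with stride 4; the fourth value of each pixel is skipped.
hdr = [0.0, 0.5, 1.0, 99.0, -1.0, 2.0, 0.25, 99.0]
ldr = to_ldr(hdr, n_channels=3, stride=4)
print(ldr)
```

Out-of-range inputs saturate at 0 and 255, mirroring the `fminf`/`fmaxf` clamp in the kernel.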
76 changes: 70 additions & 6 deletions benchmarks/image/bench_tensorflow.py
@@ -35,6 +35,9 @@
import pyexr as exr
import commentjson as json

import PIL.Image
PIL.Image.MAX_IMAGE_PIXELS = 10000000000

import time

import argparse
@@ -45,6 +48,68 @@
IMAGES_DIR = os.path.join(DATA_DIR, "images")


def write_image_pillow(img_file, img, quality):
img_array = (np.clip(img, 0.0, 1.0) * 255.0 + 0.5).astype(np.uint8)
im = PIL.Image.fromarray(img_array)
if os.path.splitext(img_file)[1] == ".jpg":
im = im.convert("RGB") # Bake the alpha channel
im.save(img_file, quality=quality, subsampling=0)

def read_image_pillow(img_file):
img = PIL.Image.open(img_file, "r")
if os.path.splitext(img_file)[1] == ".jpg":
img = img.convert("RGB")
else:
img = img.convert("RGBA")
img = np.asarray(img).astype(np.float32)
return img / 255.0

def srgb_to_linear(img):
limit = 0.04045
return np.where(img > limit, np.power((img + 0.055) / 1.055, 2.4), img / 12.92)

def linear_to_srgb(img):
limit = 0.0031308
return np.where(img > limit, 1.055 * (img ** (1.0 / 2.4)) - 0.055, 12.92 * img)

def read_image(file):
if os.path.splitext(file)[1] == ".exr":
img = exr.read(file).astype(np.float32)
elif os.path.splitext(file)[1] == ".bin":
with open(file, "rb") as f:
bytes = f.read()
h, w = struct.unpack("ii", bytes[:8])
img = np.frombuffer(bytes, dtype=np.float16, count=h*w*4, offset=8).astype(np.float32).reshape([h, w, 4])
else:
img = read_image_pillow(file)
if img.shape[2] == 4:
img[...,0:3] = srgb_to_linear(img[...,0:3])
# Premultiply alpha
img[...,0:3] *= img[...,3:4]
else:
img = srgb_to_linear(img)
return img

def write_image(file, img, quality=100):
if os.path.splitext(file)[1] == ".exr":
img = exr.write(file, img)
elif os.path.splitext(file)[1] == ".bin":
if img.shape[2] < 4:
img = np.dstack((img, np.ones([img.shape[0], img.shape[1], 4 - img.shape[2]])))
with open(file, "wb") as f:
f.write(struct.pack("ii", img.shape[0], img.shape[1]))
f.write(img.astype(np.float16).tobytes())
else:
if img.shape[2] == 4:
img = np.copy(img)
# Unmultiply alpha
img[...,0:3] = np.divide(img[...,0:3], img[...,3:4], out=np.zeros_like(img[...,0:3]), where=img[...,3:4] != 0)
img[...,0:3] = linear_to_srgb(img[...,0:3])
else:
img = linear_to_srgb(img)
write_image_pillow(file, img, quality)


class Function:
def __init__(self, domain, n_channels, n_dims, wraparound_dims, n_conditionals, n_raw_conditionals):
self.domain = domain
@@ -65,7 +130,7 @@ def __init__(self, filename):
if not paths:
raise ValueError(f"Invalid image name '{filename}'")
path = paths[0] # Use first path that exists
self.data = exr.read(path)
self.data = read_image(path)
if self.data.shape[-1] > 3:
self.data = self.data[:,:,0:3]
self.data_tf = tf.constant(self.data, dtype=tf.float32)
@@ -149,7 +214,7 @@ def gaussian_cdf(x, radius):
def get_args():
parser = argparse.ArgumentParser(description="Image benchmark using TensorFlow.")

parser.add_argument("-c", "--config", default="config.json", type=str, help="JSON config filename")
parser.add_argument("-c", "--config", default="config_hash.json", type=str, help="JSON config filename")
parser.add_argument("-i", "--image", default="albert", type=str, help="Image to match")

args = parser.parse_args()
@@ -248,7 +313,6 @@ def make_graph():
return train_op, loss, input_tensor, output_tensor



if __name__ == "__main__":
tf.disable_eager_execution()
args = get_args()
@@ -275,7 +339,7 @@ def make_graph():

xy = np.stack((xv.flatten(), yv.flatten())).transpose()
gt = np.reshape(target_fun(xy), img_shape)
exr.write("reference.exr", gt)
write_image("reference.jpg", gt)

# Enable XLA compiler (important for good TensorFlow performance)
session_config = tf.ConfigProto()
@@ -322,9 +386,9 @@ def body(it, sequencer, _):


img = np.reshape(sess.run(output_tensor, feed_dict={ input_tensor: xy, batch_size_tensor: xy.shape[0] }), img_shape)
filename = f"{batch_size}-after-{N_ITERS}-iters-tensorflow.exr"
filename = f"{batch_size}-after-{N_ITERS}-iters-tensorflow.jpg"
print(f"Saving {filename}")
exr.write(filename, img)
write_image(filename, img)

mean_training_throughput = np.mean(throughputs[1:])
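
The `read_image` and `write_image` helpers added in this file rely on two invertible steps: sRGB to linear conversion and alpha premultiplication. The standalone sketch below copies the two transfer functions from the diff and checks the round trip on synthetic data. The test values are made up for illustration, and the alpha handling is restated inline rather than calling the helpers.

```python
import numpy as np

def srgb_to_linear(img):
    limit = 0.04045
    return np.where(img > limit, np.power((img + 0.055) / 1.055, 2.4), img / 12.92)

def linear_to_srgb(img):
    limit = 0.0031308
    return np.where(img > limit, 1.055 * (img ** (1.0 / 2.4)) - 0.055, 12.92 * img)

# Round trip over all 8-bit sRGB levels: the two functions should be inverses.
srgb = np.linspace(0.0, 1.0, 256)
round_trip = linear_to_srgb(srgb_to_linear(srgb))
max_error = np.max(np.abs(round_trip - srgb))
print("max round-trip error:", max_error)

# Premultiply and unmultiply alpha, guarding against division by zero.
rgba = np.array([[[0.5, 0.25, 1.0, 0.5], [0.2, 0.4, 0.6, 0.0]]])
premultiplied = rgba.copy()
premultiplied[..., 0:3] *= premultiplied[..., 3:4]
restored = premultiplied.copy()
restored[..., 0:3] = np.divide(
    premultiplied[..., 0:3],
    premultiplied[..., 3:4],
    out=np.zeros_like(premultiplied[..., 0:3]),
    where=premultiplied[..., 3:4] != 0,
)
print(restored)
```

Note that the colour of a fully transparent pixel cannot be recovered: after the round trip its RGB values are zero, which is why `write_image` fills those entries with zeros instead of dividing.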
