Update image example with hash encoding

- Use a higher-resolution sample image to better showcase the encoding's ability
- Switch from OpenEXR image format to regular jpg
- Improve legibility of compilation instructions
YOUSIKI · Feb 12, 2022 · b3d5cae
1 parent c835bd2

Showing 113 changed files with 9,889 additions and 41,825 deletions.
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@ This is a small, self-contained framework for training and querying neural netwo
 ## Performance
 
 ![Image](data/readme/fully-fused-vs-tensorflow.png)
-_Fully fused networks vs. TensorFlow v2.5.0 w/ XLA. Measured on 64 (solid line) and 128 (dashed line) neurons wide multi-layer perceptrons on an RTX 3090. Generated by `benchmarks/bench_ours.cu` and `benchmarks/bench_tensorflow.py`._
+_Fully fused networks vs. TensorFlow v2.5.0 w/ XLA. Measured on 64 (solid line) and 128 (dashed line) neurons wide multi-layer perceptrons on an RTX 3090. Generated by `benchmarks/bench_ours.cu` and `benchmarks/bench_tensorflow.py` using `data/config_oneblob.json`._
 
 
 ## Usage
@@ -66,50 +66,56 @@ model.network->inference(inference_inputs, inference_outputs);
 
 We provide a sample application where an image function _(x,y) -> (R,G,B)_ is learned. It can be run via
 ```sh
-tiny-cuda-nn/build> ./mlp_learning_an_image ../data/images/albert.exr ../data/config.json
+tiny-cuda-nn/build$ ./mlp_learning_an_image ../data/images/albert.jpg ../data/config_hash.json
 ```
-producing an image every 1000 training steps. Each 1000 steps should take roughly 0.8 seconds with the default configuration on an RTX 3090.
+producing an image every 1000 training steps. Each 1000 steps should take roughly 0.42 seconds with the default configuration on an RTX 3090.
 
-| Learned image after 1,000 steps | Learned image after 10,000 steps | Reference image |
-|:---:|:---:|:---:|
-| ![1,000 steps](data/readme/learned_image_after_1000_steps.jpg) | ![10,000 steps](data/readme/learned_image_after_10000_steps.jpg) | ![reference](data/readme/reference_image.jpg) |
+| 10 steps (4.2 ms) | 100 steps (42 ms) | 1000 steps (420 ms) | Reference image |
+|:---:|:---:|:---:|:---:|
+| ![10steps](data/readme/10.jpg) | ![100steps](data/readme/100.jpg) | ![1000steps](data/readme/1000.jpg) | ![reference](data/images/albert.jpg) |
 
 
 
 ## Requirements
 
-- CUDA __v10.2 or higher__.
-- CMake __v3.18 or higher__.
-- A __C++14__ capable compiler.
-- A high-end NVIDIA GPU that supports TensorCores and has a large amount of shared memory. The framework was tested primarily with an RTX 3090.
+- An __NVIDIA GPU__; tensor cores increase performance when available. All shown results come from an RTX 3090.
+- A __C++14__ capable compiler. The following choices are recommended and have been tested:
+  - __Windows:__ Visual Studio 2019
+  - __Linux:__ GCC/G++ 7.5 or higher
+- __[CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2 or higher__ and __[CMake](https://cmake.org/) v3.21 or higher__.
 - The fully fused MLP component of this framework requires a __very large__ amount of shared memory in its default configuration. It will likely only work on an RTX 3090, an RTX 2080 Ti, or high-end enterprise GPUs. Lower end cards must reduce the `n_neurons` parameter or use the `CutlassMLP` (better compatibility but slower) instead.
 
-## Compilation
+If you are using Linux, install the following packages
+```sh
+sudo apt-get install build-essential git
+```
+
+We also recommend installing [CUDA](https://developer.nvidia.com/cuda-toolkit) in `/usr/local/` and adding the CUDA installation to your PATH.
+For example, if you have CUDA 11.4, add the following to your `~/.bashrc`
+```sh
+export PATH="/usr/local/cuda-11.4/bin:$PATH"
+export LD_LIBRARY_PATH="/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH"
+```
+
+
+## Compilation (Windows & Linux)
 
 Begin by cloning this repository and all its submodules using the following command:
 ```sh
 $ git clone --recursive https://github.com/nvlabs/tiny-cuda-nn
 $ cd tiny-cuda-nn
 ```
 
-Then, use CMake to generate build files:
-
+Then, use CMake to build the project: (on Windows, this must be in a [developer command prompt](https://docs.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-160#developer_command_prompt))
 ```sh
-tiny-cuda-nn$ mkdir build
-tiny-cuda-nn$ cd build
-tiny-cuda-nn/build$ cmake ..
+tiny-cuda-nn$ cmake . -B build
+tiny-cuda-nn$ cmake --build build --config RelWithDebInfo -j 16
 ```
 
-The last step differs by operating system.
-- Windows: open `tiny-cuda-nn/build/tiny-cuda-nn.sln` in Visual Studio and click the "Build" button.
-- Linux: run the command
-  ```sh
-  tiny-cuda-nn/build$ make -j
-  ```
 
 ## Components
 
-The following is a summary of all components of this framework that are currently released. Please consult [the JSON documentation](DOCUMENTATION.md) for how to configure them.
+Following is a summary of the components of this framework. See [the JSON documentation](DOCUMENTATION.md) for how to configure each.
 
 
 | Networks | &nbsp; | &nbsp;

diff --git a/benchmarks/image/CMakeLists.txt b/benchmarks/image/CMakeLists.txt
@@ -20,6 +20,6 @@
 # STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-add_executable(bench_image_ours bench_ours.cu tinyexr.cpp)
+add_executable(bench_image_ours bench_ours.cu ../../dependencies/stbi/stbi_wrapper.cpp)
 target_link_libraries(bench_image_ours PUBLIC ${CUDA_LIBRARIES} tiny-cuda-nn cublas)
 target_compile_options(bench_image_ours PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:${CUDA_NVCC_FLAGS}>)
diff --git a/benchmarks/image/bench_ours.cu b/benchmarks/image/bench_ours.cu
@@ -39,7 +39,7 @@
 
 #include <tiny-cuda-nn/trainer.h>
 
-#include <tinyexr/tinyexr.h>
+#include <stbi/stbi_wrapper.h>
 
 #include <chrono>
 #include <cstdlib>
@@ -56,87 +56,9 @@ using namespace tcnn;
 using precision_t = network_precision_t;
 
 
-bool SaveEXR(const float* data, int width, int height, int nChannels, int channelStride, const char* outfilename) {
-	EXRHeader header;
-	InitEXRHeader(&header);
-
-	EXRImage image;
-	InitEXRImage(&image);
-
-	image.num_channels = nChannels;
-
-	std::vector<std::vector<float>> images(nChannels);
-	std::vector<float*> image_ptr(nChannels);
-	for (int i = 0; i < nChannels; ++i) {
-		images[i].resize(width * height);
-	}
-
-	for (int i = 0; i < nChannels; ++i) {
-		image_ptr[i] = images[nChannels - i - 1].data();
-	}
-
-	for (size_t i = 0; i < (size_t)width * height; i++) {
-		for (int c = 0; c < nChannels; ++c) {
-			images[c][i] = data[channelStride*i+c];
-		}
-	}
-
-	image.images = (unsigned char**)image_ptr.data();
-	image.width = width;
-	image.height = height;
-
-	header.num_channels = nChannels;
-	header.channels = (EXRChannelInfo *)malloc(sizeof(EXRChannelInfo) * header.num_channels);
-	// Must be (A)BGR order, since most of EXR viewers expect this channel order.
-	strncpy(header.channels[0].name, "B", 255); header.channels[0].name[strlen("B")] = '\0';
-	if (nChannels > 1) {
-		strncpy(header.channels[1].name, "G", 255); header.channels[1].name[strlen("G")] = '\0';
-	}
-	if (nChannels > 2) {
-		strncpy(header.channels[2].name, "R", 255); header.channels[2].name[strlen("R")] = '\0';
-	}
-	if (nChannels > 3) {
-		strncpy(header.channels[3].name, "A", 255); header.channels[3].name[strlen("A")] = '\0';
-	}
-
-	header.pixel_types = (int *)malloc(sizeof(int) * header.num_channels);
-	header.requested_pixel_types = (int *)malloc(sizeof(int) * header.num_channels);
-	for (int i = 0; i < header.num_channels; i++) {
-		header.pixel_types[i] = TINYEXR_PIXELTYPE_FLOAT; // pixel type of input image
-		header.requested_pixel_types[i] = TINYEXR_PIXELTYPE_HALF; // pixel type of output image to be stored in .EXR
-	}
-
-	const char* err = NULL; // or nullptr in C++11 or later.
-	int ret = SaveEXRImageToFile(&image, &header, outfilename, &err);
-	if (ret != TINYEXR_SUCCESS) {
-		fprintf(stderr, "Save EXR err: %s\n", err);
-		FreeEXRErrorMessage(err); // free's buffer for an error message
-		return ret;
-	}
-	printf("Saved exr file. [ %s ] \n", outfilename);
-
-	free(header.channels);
-	free(header.pixel_types);
-	free(header.requested_pixel_types);
-	return true;
-}
-
-
 GPUMemory<float> load_image(const std::string& filename, int& width, int& height) {
-	float* out; // width * height * RGBA
-	const char* err = nullptr;
-
-	int ret = LoadEXR(&out, &width, &height, filename.c_str(), &err);
-
-	if (ret != TINYEXR_SUCCESS) {
-		if (err) {
-			std::string error_message = std::string("Failed to load EXR image: ") + err;
-			FreeEXRErrorMessage(err);
-			throw std::runtime_error(error_message);
-		} else {
-			throw std::runtime_error("Failed to load EXR image");
-		}
-	}
+	// width * height * RGBA
+	float* out = load_stbi(&width, &height, filename.c_str());
 
 	GPUMemory<float> result(width * height * 4);
 	result.copy_from_host(out);
@@ -146,16 +68,25 @@ GPUMemory<float> load_image(const std::string& filename, int& width, int& height
 }
 
 template <typename T>
-void save_image(const GPUMemory<T>& image, int width, int height, int n_channels, int channel_stride, const std::string& filename) {
-	std::vector<T> host_data(image.size());
-	image.copy_to_host(host_data.data());
+__global__ void to_ldr(const uint64_t num_elements, const uint32_t n_channels, const uint32_t stride, const T* __restrict__ in, uint8_t* __restrict__ out) {
+	const uint64_t i = threadIdx.x + blockIdx.x * blockDim.x;
+	if (i >= num_elements) return;
 
-	std::vector<float> float_host_data(host_data.size());
-	for (size_t i = 0; i < host_data.size(); ++i) {
-		float_host_data[i] = (float)host_data[i];
-	}
+	const uint64_t pixel = i / n_channels;
+	const uint32_t channel = i - pixel * n_channels;
+
+	out[i] = (uint8_t)(powf(fmaxf(fminf(in[pixel * stride + channel], 1.0f), 0.0f), 1.0f/2.2f) * 255.0f + 0.5f);
+}
+
+template <typename T>
+void save_image(const T* image, int width, int height, int n_channels, int channel_stride, const std::string& filename) {
+	GPUMemory<uint8_t> image_ldr(width * height * n_channels);
+	linear_kernel(to_ldr<T>, 0, nullptr, width * height * n_channels, n_channels, channel_stride, image, image_ldr.data());
+
+	std::vector<uint8_t> image_ldr_host(width * height * n_channels);
+	CUDA_CHECK_THROW(cudaMemcpy(image_ldr_host.data(), image_ldr.data(), image_ldr.size(), cudaMemcpyDeviceToHost));
 
-	SaveEXR(float_host_data.data(), width, height, n_channels, channel_stride, filename.c_str());
+	save_stbi(image_ldr_host.data(), width, height, n_channels, filename.c_str());
 }
 
 template <uint32_t stride>
@@ -192,8 +123,8 @@ int main(int argc, char* argv[]) {
 		}
 
 		if (argc < 3) {
-			std::cout << "USAGE: " << argv[0] << " " << "path-to-image.exr path-to-config.json" << std::endl;
-			std::cout << "Sample EXR files are provided in 'data/images'." << std::endl;
+			std::cout << "USAGE: " << argv[0] << " " << "path-to-image.jpg path-to-config.json" << std::endl;
+			std::cout << "A sample image is provided in 'data/images'." << std::endl;
 			return 0;
 		}
 
@@ -256,7 +187,7 @@ int main(int argc, char* argv[]) {
 
 		eval_image<3><<<n_blocks_linear(n_coords), n_threads_linear>>>(n_coords, texture, filter, width, height, xs_and_ys.data(), sampled_image.data());
 
-		save_image(sampled_image, sampling_width, sampling_height, 3, 3, "reference.exr");
+		save_image(sampled_image.data(), sampling_width, sampling_height, 3, 3, "reference.jpg");
 
 		// Fourth step: train the model by sampling the above image and optimizing relative squared error using Adam.
 		std::vector<uint32_t> batch_sizes = {1 << 14, 1 << 15, 1 << 16, 1 << 17, 1 << 18, 1 << 19, 1 << 20, 1 << 21};
@@ -362,7 +293,7 @@ int main(int argc, char* argv[]) {
 				encoding->encode(inference_stream, n_coords, {xs_and_ys.data(), num_dims_encoded}, {eval_obe_out.data(), num_output_dims});
 				network->inference(inference_stream, eval_obe_out, prediction);
 
-				save_image(prediction_data, sampling_width, sampling_height, 3, num_output_dims, std::to_string(batch_size) + "-after-" + std::to_string(n_iterations) + "-iters-" + method + ".exr");
+				save_image(prediction_data.data(), sampling_width, sampling_height, 3, num_output_dims, std::to_string(batch_size) + "-after-" + std::to_string(n_iterations) + "-iters-" + method + ".jpg");
 
 				std::cout << "Finished training benchmark. Mean throughput is " << mean_training_throughput << "/s. Waiting 10 seconds for GPU to cool down." << std::endl;
 				std::this_thread::sleep_for(std::chrono::seconds{10});
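
Note on the new `to_ldr` kernel: each output byte is produced by clamping the input to [0, 1], applying a 1/2.2 gamma, scaling to 255, and rounding by adding 0.5 before the integer cast. The following is a minimal CPU sketch of that per-element math for checking values by hand. The function name `to_ldr_reference` and the sample data are illustrative assumptions and are not part of the commit.

```python
def to_ldr_reference(values, n_channels, stride):
    """CPU reference (hypothetical helper) for the math in the `to_ldr` kernel.

    `values` is a flat list of floats, `stride` floats per pixel;
    only the first `n_channels` floats of each pixel are converted.
    """
    num_pixels = len(values) // stride
    out = []
    for i in range(num_pixels * n_channels):
        pixel = i // n_channels
        channel = i - pixel * n_channels
        v = values[pixel * stride + channel]
        v = max(min(v, 1.0), 0.0)         # clamp to [0, 1]
        v = v ** (1.0 / 2.2)              # gamma-encode with exponent 1/2.2
        out.append(int(v * 255.0 + 0.5))  # round to nearest, store as 8-bit value
    return out


# Two pixels with a stride of 4 floats (one padding float each), 3 channels kept.
pixels = [0.0, 0.5, 1.0, 9.9,
          -1.0, 2.0, 0.25, 9.9]
ldr = to_ldr_reference(pixels, n_channels=3, stride=4)
print(ldr)
```

Out-of-range inputs (here `-1.0` and `2.0`) saturate to 0 and 255, and the padding float of each pixel is skipped because `stride` is larger than `n_channels`.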

diff --git a/benchmarks/image/bench_tensorflow.py b/benchmarks/image/bench_tensorflow.py
@@ -35,6 +35,9 @@
 import pyexr as exr
 import commentjson as json
 
+import PIL.Image
+PIL.Image.MAX_IMAGE_PIXELS = 10000000000
+
 import time
 
 import argparse
@@ -45,6 +48,68 @@
 IMAGES_DIR = os.path.join(DATA_DIR, "images")
 
 
+def write_image_pillow(img_file, img, quality):
+	img_array = (np.clip(img, 0.0, 1.0) * 255.0 + 0.5).astype(np.uint8)
+	im = PIL.Image.fromarray(img_array)
+	if os.path.splitext(img_file)[1] == ".jpg":
+		im = im.convert("RGB") # Bake the alpha channel
+	im.save(img_file, quality=quality, subsampling=0)
+
+def read_image_pillow(img_file):
+	img = PIL.Image.open(img_file, "r")
+	if os.path.splitext(img_file)[1] == ".jpg":
+		img = img.convert("RGB")
+	else:
+		img = img.convert("RGBA")
+	img = np.asarray(img).astype(np.float32)
+	return img / 255.0
+
+def srgb_to_linear(img):
+	limit = 0.04045
+	return np.where(img > limit, np.power((img + 0.055) / 1.055, 2.4), img / 12.92)
+
+def linear_to_srgb(img):
+	limit = 0.0031308
+	return np.where(img > limit, 1.055 * (img ** (1.0 / 2.4)) - 0.055, 12.92 * img)
+
+def read_image(file):
+	if os.path.splitext(file)[1] == ".exr":
+		img = exr.read(file).astype(np.float32)
+	elif os.path.splitext(file)[1] == ".bin":
+		with open(file, "rb") as f:
+			bytes = f.read()
+			h, w = struct.unpack("ii", bytes[:8])
+			img = np.frombuffer(bytes, dtype=np.float16, count=h*w*4, offset=8).astype(np.float32).reshape([h, w, 4])
+	else:
+		img = read_image_pillow(file)
+		if img.shape[2] == 4:
+			img[...,0:3] = srgb_to_linear(img[...,0:3])
+			# Premultiply alpha
+			img[...,0:3] *= img[...,3:4]
+		else:
+			img = srgb_to_linear(img)
+	return img
+
+def write_image(file, img, quality=100):
+	if os.path.splitext(file)[1] == ".exr":
+		img = exr.write(file, img)
+	elif os.path.splitext(file)[1] == ".bin":
+		if img.shape[2] < 4:
+			img = np.dstack((img, np.ones([img.shape[0], img.shape[1], 4 - img.shape[2]])))
+		with open(file, "wb") as f:
+			f.write(struct.pack("ii", img.shape[0], img.shape[1]))
+			f.write(img.astype(np.float16).tobytes())
+	else:
+		if img.shape[2] == 4:
+			img = np.copy(img)
+			# Unmultiply alpha
+			img[...,0:3] = np.divide(img[...,0:3], img[...,3:4], out=np.zeros_like(img[...,0:3]), where=img[...,3:4] != 0)
+			img[...,0:3] = linear_to_srgb(img[...,0:3])
+		else:
+			img = linear_to_srgb(img)
+		write_image_pillow(file, img, quality)
+
+
 class Function:
 	def __init__(self, domain, n_channels, n_dims, wraparound_dims, n_conditionals, n_raw_conditionals):
 		self.domain = domain
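
Note on the `.bin` branch of `read_image` and `write_image`: as written, the file is two native-endian 32-bit integers (height, then width) followed by `h*w*4` half-precision floats in RGBA order. The sketch below reproduces that layout with only the standard library so it can be inspected without NumPy. The helper names `write_bin` and `read_bin` are assumptions made for illustration.

```python
import io
import struct


def write_bin(f, pixels, h, w):
    """Write the header (h, w) and h*w*4 half floats, as the diff's .bin branch does."""
    assert len(pixels) == h * w * 4
    f.write(struct.pack("ii", h, w))
    f.write(struct.pack(f"{h * w * 4}e", *pixels))  # "e" is IEEE 754 half precision


def read_bin(f):
    """Read back the (h, w) header and the flat RGBA half-float payload."""
    data = f.read()
    h, w = struct.unpack("ii", data[:8])
    pixels = struct.unpack_from(f"{h * w * 4}e", data, 8)
    return h, w, list(pixels)


# A 1x2 RGBA image; the values are chosen to be exactly representable in float16.
src = [0.0, 0.5, 1.0, 1.0,
       0.25, 0.75, 0.125, 1.0]
buf = io.BytesIO()
write_bin(buf, src, h=1, w=2)
buf.seek(0)
h, w, dst = read_bin(buf)
print(h, w, dst)
```

The header stores height before width, so a reader that assumes width first would silently transpose non-square images.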
@@ -65,7 +130,7 @@ def __init__(self, filename):
 		if not paths:
 			raise ValueError(f"Invalid image name '{filename}''")
 		path = paths[0] # Use first path that exists
-		self.data = exr.read(path)
+		self.data = read_image(path)
 		if self.data.shape[-1] > 3:
 			self.data = self.data[:,:,0:3]
 		self.data_tf = tf.constant(self.data, dtype=tf.float32)
@@ -149,7 +214,7 @@ def gaussian_cdf(x, radius):
 def get_args():
 	parser = argparse.ArgumentParser(description="Image benchmark using TensorFlow.")
 
-	parser.add_argument("-c", "--config", default="config.json", type=str, help="JSON config filename")
+	parser.add_argument("-c", "--config", default="config_hash.json", type=str, help="JSON config filename")
 	parser.add_argument("-i", "--image", default="albert", type=str, help="Image to match")
 
 	args = parser.parse_args()
@@ -248,7 +313,6 @@ def make_graph():
 	return train_op, loss, input_tensor, output_tensor
 
 
-
 if __name__ == "__main__":
 	tf.disable_eager_execution()
 	args = get_args()
@@ -275,7 +339,7 @@ def make_graph():
 
 	xy = np.stack((xv.flatten(), yv.flatten())).transpose()
 	gt = np.reshape(target_fun(xy), img_shape)
-	exr.write("reference.exr", gt)
+	write_image("reference.jpg", gt)
 
 	# Enable XLA compiler (important for good TensorFlow performance)
 	session_config = tf.ConfigProto()
@@ -322,9 +386,9 @@ def body(it, sequencer, _):
 
 
 			img = np.reshape(sess.run(output_tensor, feed_dict={ input_tensor: xy, batch_size_tensor: xy.shape[0] }), img_shape)
-			filename = f"{batch_size}-after-{N_ITERS}-iters-tensorflow.exr"
+			filename = f"{batch_size}-after-{N_ITERS}-iters-tensorflow.jpg"
 			print(f"Saving {filename}")
-			exr.write(filename, img)
+			write_image(filename, img)
 
 			mean_training_throughput = np.mean(throughputs[1:])
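
Note on colour encoding: the Python path above writes images through the piecewise sRGB curve (`linear_to_srgb`), while the `to_ldr` kernel in `bench_ours.cu` uses a plain 1/2.2 gamma. The sketch below uses scalar versions of the two functions from this diff to check that they invert each other on all 8-bit levels, and to measure how far the sRGB curve is from the 1/2.2 approximation. The scalar rewrites and the variable names are assumptions for illustration, and whether the difference matters for comparing output images is not something this commit states.

```python
def srgb_to_linear(c):
    # Scalar version of the function added in this diff.
    return ((c + 0.055) / 1.055) ** 2.4 if c > 0.04045 else c / 12.92


def linear_to_srgb(c):
    # Scalar version of the function added in this diff.
    return 1.055 * c ** (1.0 / 2.4) - 0.055 if c > 0.0031308 else 12.92 * c


levels = [i / 255.0 for i in range(256)]

# Round trip: decode each 8-bit sRGB level to linear and encode it again.
max_roundtrip_err = max(abs(linear_to_srgb(srgb_to_linear(c)) - c) for c in levels)

# Largest gap between the sRGB encoding and a plain 1/2.2 gamma on linear inputs.
max_gamma_gap = max(abs(linear_to_srgb(c) - c ** (1.0 / 2.2)) for c in levels)

print(f"max round-trip error: {max_roundtrip_err:.2e}")
print(f"max sRGB vs. gamma-2.2 gap: {max_gamma_gap:.4f}")
```

The gap is largest for very dark values, where the sRGB curve is linear and the pure power curve is not.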