Commit

Add FAQ page (microsoft#3324)
* Create FAQ.md

* Update README.md

* Update README.md

* Update FAQ.md

* Minor update

* Resync readme page from master

* Update structure and wordings

* Minor update

* Updates based on feedback

* Fix links

* Update to include common perf questions

* Update ONNX_Runtime_Perf_Tuning.md

* Update FAQ.md

* Update README.md

* Update FAQ.md

* Update docs/ONNX_Runtime_Perf_Tuning.md

Co-Authored-By: Nat Kershaw (MSFT) <nakersha@microsoft.com>

* Update docs/ONNX_Runtime_Perf_Tuning.md

Co-Authored-By: Nat Kershaw (MSFT) <nakersha@microsoft.com>

* Update docs/ONNX_Runtime_Perf_Tuning.md

Co-Authored-By: Nat Kershaw (MSFT) <nakersha@microsoft.com>

* Update docs/ONNX_Runtime_Perf_Tuning.md

Co-Authored-By: Nat Kershaw (MSFT) <nakersha@microsoft.com>

* Update ONNX_Runtime_Perf_Tuning.md

* Update FAQ.md

* Update README.md

* Update FAQ.md

Co-authored-by: Nat Kershaw (MSFT) <nakersha@microsoft.com>
faxu and natke authored May 6, 2020
1 parent 0e59668 commit 9cca219
Showing 3 changed files with 154 additions and 34 deletions.
20 changes: 16 additions & 4 deletions README.md
@@ -28,6 +28,7 @@ ONNX Runtime stays up to date with the ONNX standard and supports all operators
* [Builds and Packages](#Builds-and-Packages)
* **[Usage](#usage)**
* [Samples and Tutorials](./samples)
* [Frequently Asked Questions](./docs/FAQ.md)
* [Getting ONNX Models](#getting-onnx-models)
* [Deploying ONNX Runtime](#deploying-onnx-runtime)
* [Data/Telemetry](#Data/Telemetry)
@@ -50,9 +51,11 @@ Using various graph optimizations and accelerators, ONNX Runtime can provide low
### Supported Accelerators
The list of currently supported accelerators (termed [Execution Providers](./docs/execution_providers)) is below. Please see [BUILD.md](./BUILD.md) for build instructions. If you are interested in contributing a new execution provider, please see [this page](docs/AddingExecutionProvider.md).

Please refer to [Roadmap](./docs/Roadmap.md#accelerators-and-execution-providers) for a list of upcoming accelerators.

#### CPU
* Default CPU - *MLAS (Microsoft Linear Algebra Subprograms) + Eigen*
* [Intel DNNL](./docs/execution_providers/MKL-DNN-ExecutionProvider.md)
* [Intel DNNL](./docs/execution_providers/DNNL-ExecutionProvider.md)
* [Intel nGraph](./docs/execution_providers/nGraph-ExecutionProvider.md)
* Intel MKL-ML

@@ -94,9 +97,16 @@ The list of currently supported accelerators (termed [Execution Providers](./doc

## Builds and Packages

Official builds are published for the default CPU Provider (Eigen + MLAS), as well as GPU with CUDA. Python packages can be found on PyPi, and C#/C/C++ packages on Nuget. Please view the table on [aka.ms/onnxruntime](https://aka.ms/onnxruntime) for instructions for different build combinations.
Official builds are available for:
* Default CPU Provider (Eigen + MLAS)
* GPU Provider - NVIDIA CUDA
* *note: If your deployment target is Windows, the [DirectML execution provider](./docs/execution_providers/DirectML-ExecutionProvider.md) is recommended for optimal performance and compatibility with a broad set of GPUs. This will be an official package soon. In the meantime, see the build instructions at [BUILD.md](./BUILD.md#directml).*

Python packages can be found on PyPi, and C#/C/C++ packages on Nuget. Please view the table on [aka.ms/onnxruntime](https://aka.ms/onnxruntime) for instructions for different build combinations.

For additional build flavors and/or dockerfiles, please see [BUILD.md](BUILD.md). For production scenarios, it's strongly recommended to build only from an [official release branch](https://github.com/microsoft/onnxruntime/releases).
For additional build flavors and/or dockerfiles, please carefully read through [BUILD.md](./BUILD.md). If you encounter problems, please provide as much information as possible when filing an [issue](https://github.com/Microsoft/onnxruntime/issues).

For production scenarios, it's strongly recommended to build only from an [official release branch](https://github.com/microsoft/onnxruntime/releases).

#### PyPi (Python):
*If using `pip` to download the Python binaries, run `pip install --upgrade pip` prior to downloading.*
@@ -138,7 +148,9 @@ system.
* For requirements and dependencies of other build options, see detailed build instructions on the [BUILD.md](./BUILD.md#additional-build-instructions) page.
***
# Usage
Please see [Samples and Tutorials](./samples) for examples.
## [Samples and Tutorials](./samples)

## [Frequently Asked Questions](./docs/FAQ.md)

## Getting ONNX Models
To get an ONNX model, please view these [ONNX Tutorials](https://github.com/onnx/tutorials#getting-onnx-models).
68 changes: 68 additions & 0 deletions docs/FAQ.md
@@ -0,0 +1,68 @@
# FAQ
Here are answers to some questions commonly raised by ONNX Runtime users and in [Issues](https://github.com/microsoft/onnxruntime/issues).

## Do the GPU builds support quantized models?
The default CUDA build supports 3 standard quantization operators: QuantizeLinear, DequantizeLinear, and MatMulInteger. The TensorRT EP has limited support for INT8 quantized ops. In general, support of quantized models through ORT is continuing to expand on a model-driven basis. For performance improvements, quantization is not always required, and we suggest trying alternative strategies to [performance tune](./ONNX_Runtime_Perf_Tuning.md) before determining that quantization is necessary.

## How do I change the severity level of the default logger to something other than the default (WARNING)?
Setting the severity level to VERBOSE is most useful when debugging errors.

Refer to the API documentation:
* Python - [RunOptions.log_severity_level](https://microsoft.github.io/onnxruntime/python/api_summary.html#onnxruntime.RunOptions.log_severity_level)
```python
import onnxruntime as ort
ort.set_default_logger_severity(0)
```
* C - [SetSessionLogSeverityLevel](./../include/onnxruntime/core/session/onnxruntime_c_api.h)
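
For per-session or per-run verbosity from Python, a minimal sketch (the model path and input name below are placeholders):
```python
import onnxruntime as ort

# Per-session verbosity: 0 = VERBOSE, 1 = INFO, 2 = WARNING, 3 = ERROR, 4 = FATAL
so = ort.SessionOptions()
so.log_severity_level = 0
sess = ort.InferenceSession("/path/to/model.onnx", sess_options=so)

# Per-run verbosity via RunOptions.log_severity_level
ro = ort.RunOptions()
ro.log_severity_level = 0
# outputs = sess.run(None, {"input": input_data}, run_options=ro)
```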

## How do I load and run models that have multiple inputs and outputs using the C/C++ API?
See an example from the 'override initializer' test in [test_inference.cc](./../onnxruntime/test/shared_lib/test_inference.cc) that has 3 inputs and 3 outputs.
```cpp
std::vector<Ort::Value> ort_inputs;
ort_inputs.push_back(std::move(label_input_tensor));
ort_inputs.push_back(std::move(f2_input_tensor));
ort_inputs.push_back(std::move(f11_input_tensor));
std::vector<const char*> input_names = {"Label", "F2", "F1"};
const char* const output_names[] = {"Label0", "F20", "F11"};
std::vector<Ort::Value> ort_outputs = session.Run(Ort::RunOptions{nullptr}, input_names.data(),
                                                  ort_inputs.data(), ort_inputs.size(),
                                                  output_names, countof(output_names));
```
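
For comparison, here is a minimal sketch of the same multi-input/multi-output pattern through the Python API (the model path, input/output names, shapes, and dtypes are placeholders):
```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("/path/to/model.onnx")

# Feed each input by name; shapes and dtypes must match the model.
feeds = {
    "Label": np.zeros((1, 1), dtype=np.float32),
    "F2":    np.zeros((1, 1), dtype=np.float32),
    "F1":    np.zeros((1, 1), dtype=np.float32),
}

# Request outputs by name; run() returns a list in the same order.
outputs = sess.run(["Label0", "F20", "F11"], feeds)
```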

## How do I force single-threaded execution in ORT? By default, session.run() uses all of the computer's cores.

To limit use to a single thread only:
* If built with OpenMP, set the environment variable OMP_NUM_THREADS to 1. The default inter_op_num_threads in session options is already 1.
* If not built with OpenMP, set the session options intra_op_num_threads to 1. Do not change the default inter_op_num_threads (1).

It's recommended to build ONNX Runtime without OpenMP if you only need single-threaded execution.

This is supported in ONNX Runtime v1.3.0+.

**Python example:**
```python
#!/usr/bin/python3
import os

# Must be set before onnxruntime is imported (only relevant for OpenMP builds)
os.environ["OMP_NUM_THREADS"] = "1"

import onnxruntime

opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 1
# opts.intra_op_num_threads = 1  # use this instead of OMP_NUM_THREADS if ORT was built without OpenMP
opts.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
ort_session = onnxruntime.InferenceSession('/path/to/model.onnx', sess_options=opts)
```

**C++ example:**
```cpp
// Initialize environment...one environment per process.
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

// Initialize session options if needed.
Ort::SessionOptions session_options;
session_options.SetInterOpNumThreads(1);
// If ORT was built without OpenMP, also limit the intra-op thread pool:
// session_options.SetIntraOpNumThreads(1);

#ifdef _WIN32
const wchar_t* model_path = L"squeezenet.onnx";
#else
const char* model_path = "squeezenet.onnx";
#endif

Ort::Session session(env, model_path, session_options);
```
100 changes: 70 additions & 30 deletions docs/ONNX_Runtime_Perf_Tuning.md
@@ -1,19 +1,45 @@
# ONNX Runtime Performance Tuning

## Why do we need to tune performance?
ONNX Runtime is designed to be open and extensible with its concept of "Execution Provider" to represent different execution kernels. See the [design overview](./HighLevelDesign.md).
ONNX Runtime delivers high performance across a range of hardware options by providing "Execution Providers" that interface with different execution environments. See: [design overview](./HighLevelDesign.md), [supported execution providers](../README.md#supported-accelerators).

ONNX Runtime supports a variety of execution providers across CPU and GPU: [see the list here](../README.md#high-performance).
For different models and different hardware, there is no silver bullet that can always perform the best. Even for a single execution provider, often there are several knobs that can be tuned (e.g. thread number, wait policy etc.).
Along with this flexibility come decisions about tuning and usage. For each model running with each execution provider, there are settings that can be tuned (e.g. thread number, wait policy, etc.) to improve performance.

This document covers basic tools and knobs that can be leveraged to find the best performance for your model and hardware.

## Is there a tool to help with performance tuning?
Yes, the onnxruntime_perf_test.exe tool (available from the build drop) can be used to test various knobs. Please find the usage instructions using `onnxruntime_perf_test.exe -h`.
**Topics**
* [Performance Tuning Tools](#Performance-Tuning-Tools)
* [Using different Execution Providers](#Using-different-Execution-Providers)
* [Which Execution Provider will provide the best performance?](#Which-Execution-Provider-will-provide-the-best-performance)
* [Tuning performance for specific Execution Providers](#Tuning-performance-for-specific-Execution-Providers)
* [Troubleshooting model performance issues](#Troubleshooting-model-performance-issues)
***

Additionally, the [ONNX Go Live "OLive" tool](https://github.com/microsoft/OLive) provides an easy-to-use pipeline for converting models to ONNX and optimizing performance with ONNX Runtime. The tool can help identify the optimal runtime configuration to get the best performance on the target hardware for the model. For quickstart, check out the notebooks on how to use OLive [here](https://github.com/microsoft/OLive/blob/master/notebook/Convert_Models_and_Tune_Performance_with_OLive_Python_SDK.ipynb) (using Python) and [here](https://github.com/microsoft/OLive/blob/master/notebook/Convert_Models_and_Tune_Performance_with_OLive_Docker_Images.ipynb) (using Docker).
## Performance Tuning Tools
The [ONNX Go Live "OLive" tool](https://github.com/microsoft/OLive) is an easy-to-use pipeline for converting models to ONNX and optimizing performance with ONNX Runtime. The tool can help identify the optimal runtime configuration to get the best performance on the target hardware for the model.
To get started quickly, see the notebooks: [Python](https://github.com/microsoft/OLive/blob/master/notebook/Convert_Models_and_Tune_Performance_with_OLive_Python_SDK.ipynb), [Docker images](https://github.com/microsoft/OLive/blob/master/notebook/Convert_Models_and_Tune_Performance_with_OLive_Docker_Images.ipynb).

## Using different execution providers

### Profiling and Performance Report

The onnxruntime_perf_test.exe tool (available from the build drop) can be used to test various knobs. Please find the usage instructions using `onnxruntime_perf_test.exe -h`.

You can enable ONNX Runtime latency profiling in code:

```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
```
If you are using the onnxruntime_perf_test.exe tool, you can add `-p [profile_file]` to enable performance profiling.

In both cases, you will get a JSON file containing detailed performance data (threading, latency of each operator, etc.). This is a standard performance tracing file; to view it in a user-friendly way, open it with chrome://tracing:
* Open the Chrome browser
* Type chrome://tracing in the address bar
* Load the generated JSON file
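
A minimal end-to-end sketch of generating the trace from Python (the model path, input name, and shape are placeholders):
```python
import numpy as np
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
sess = rt.InferenceSession("/path/to/model.onnx", sess_options=sess_options)

# Run at least once so the profiler has events to record.
sess.run(None, {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)})

# Stop profiling and retrieve the generated JSON file name; open it with chrome://tracing.
profile_file = sess.end_profiling()
print(profile_file)
```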

## Using different Execution Providers
To learn more about different Execution Providers, see [docs/execution_providers](./execution_providers).

### Python API
Official Python packages on PyPI only support the default CPU (MLAS) and default GPU (CUDA) execution providers. For other execution providers, you need to build from source. Please refer to the [build instructions](../BUILD.md). The recommended instructions build the wheel with debug info in parallel.
@@ -65,8 +91,25 @@ so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
session = rt.InferenceSession(model, sess_options=so)
session.set_providers(['CUDAExecutionProvider'])
```
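
As a quick sanity check, the sketch below (model path is a placeholder) prints which execution providers are available in your installed package and which ones a session actually uses:
```python
import onnxruntime as rt

# Execution providers compiled into this build, in default priority order.
print(rt.get_available_providers())

sess = rt.InferenceSession("/path/to/model.onnx")
# Execution providers this session will actually use.
print(sess.get_providers())
```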
## How to tune performance for a specific execution provider?
* In general if ORT is built with OpenMP, use the OpenMP env variables to control the number of intra op num threads.

## Which Execution Provider will provide the best performance?
Performance depends on the specific model you're trying to run, the session and run options you've selected, and, of course, your specific hardware target. Below is some information that may help you select the right Execution Provider.

### CUDA (Default GPU) or CPU?
The CPU version of ONNX Runtime provides a complete implementation of all operators in the ONNX spec. This ensures that your ONNX-compliant model can execute successfully. To keep the binary size small, only common data types are supported for these ops. If you are using an uncommon data type that is not supported, you can file an issue and/or contribute a PR (see examples: [PR #2112](https://github.com/microsoft/onnxruntime/pull/2112), [PR #2034](https://github.com/microsoft/onnxruntime/pull/2034), [PR #1565](https://github.com/microsoft/onnxruntime/pull/1565)). Please make sure you provide details justifying the usage.

Additionally, not all CUDA kernels are implemented, as these have been prioritized on an as-needed basis. This means that if your model contains operators that do not have a CUDA implementation, they will fall back to CPU. Switching between CPU and GPU within a model can have a significant performance impact. If you require a specific operator that is not currently supported, please consider [contributing](./../CONTRIBUTING.md) and/or [filing an issue](https://github.com/microsoft/onnxruntime/issues) that clearly describes your use case, and share your model if possible.

### TensorRT or CUDA?
TensorRT and CUDA are separate execution providers for ONNX Runtime. On the same hardware, TensorRT will generally provide better performance; however, this depends on the specific model and whether the operators in the model can be supported by TensorRT. In cases where TensorRT cannot handle the subgraph(s), it will fall back to CUDA. Note that the TensorRT EP may depend on a different version of CUDA than the CUDA EP.

### TensorRT/CUDA or DirectML?
DirectML is the hardware-accelerated DirectX 12 library for machine learning on Windows and supports all DirectX 12 capable devices (NVIDIA, Intel, AMD). This means that if you are targeting Windows GPUs, the DirectML Execution Provider is likely your best bet. It can be used with both ONNX Runtime and the [WinML APIs](./WinRT_API.md).

## Tuning performance for specific Execution Providers

### Thread management
* If ORT is built with OpenMP, use the OpenMP environment variables to control the number of intra-op threads.
* If ORT is not built with OpenMP, use the appropriate ORT API to control the number of intra-op threads.
* Inter-op threads (used only when parallel execution is enabled) are not affected by OpenMP settings and should always be set using the ORT APIs (see the sketch below).
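
A sketch of setting these through the Python API (the thread counts are illustrative; intra_op_num_threads only takes effect on non-OpenMP builds):
```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # threads within an operator (non-OpenMP builds; otherwise use OMP_NUM_THREADS)
opts.inter_op_num_threads = 1  # threads across operators; only used with parallel execution
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL  # enable inter-op parallelism
sess = ort.InferenceSession("/path/to/model.onnx", sess_options=opts)
```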
@@ -109,29 +152,26 @@ The most widely used environment variables are:
* Use PASSIVE if your CPU usage is already high, and use ACTIVE when you want to trade CPU for lower latency.


## Troubleshooting model performance issues
The answers below are troubleshooting suggestions based on commonly filed user issues and questions. This list is by no means exhaustive, and there is a lot of case-by-case variation depending on the model and specific usage scenario. Please use this information to guide your troubleshooting, search through previously filed issues for related topics, and/or file a new issue if your problem is still not resolved.

### Performance Troubleshooting Checklist
Here is a list of things to check when assessing performance issues.
* Are you using OpenMP? OpenMP parallelizes some of the code for potential performance improvements; it is not recommended for single-threaded execution.
* Have you enabled all [graph optimizations](./ONNX_Runtime_Graph_Optimizations.md)? The official published packages enable all optimizations by default, but when building from source, check that they are enabled in your build.
* Have you searched through previously filed [GitHub issues](https://github.com/microsoft/onnxruntime/issues) to see if your problem has been discussed before? Please do this before filing new issues.
* If using CUDA or TensorRT, do you have the right versions of the dependent libraries installed?

### I need help performance tuning for BERT models.
For BERT models, sometimes ONNX Runtime cannot apply the best optimization due to reasons such as framework version updates. We recommend trying out the [BERT optimization tool](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert), which reflects the latest changes in graph pattern matching and model conversions, and a set of [notebooks](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/bert/notebooks) to help get started.

### Why is the model graph not optimized even with graph_optimization_level set to ORT_ENABLE_ALL?
ONNX models from IR_VERSION 4 onward only treat initializers that appear in graph inputs as non-constant. This may prevent some graph optimizations, such as constant folding and operator fusion. Move initializers out of graph inputs if there is no need to override them, either by re-generating the model with the latest exporter/converter or with the tool [remove_initializer_from_input.py](./../tools/python/remove_initializer_from_input.py).
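
To check what the optimizer actually produced for a model, a sketch that writes the optimized graph to disk via SessionOptions.optimized_model_filepath (paths are placeholders):
```python
import onnxruntime as rt

so = rt.SessionOptions()
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "/path/to/model_optimized.onnx"  # where the optimized graph is saved
sess = rt.InferenceSession("/path/to/model.onnx", sess_options=so)
# Inspect the saved file to see which foldings/fusions were applied.
```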

### Why is my model running slower on GPU than CPU?
Depending on which execution provider you're using, it may not have full support for all the operators in your model. Falling back to CPU ops can hurt performance. Moreover, even if an op is implemented by the CUDA execution provider, ORT may not necessarily place the op on the CUDA EP for performance reasons. To see the placement decided by ORT, turn on verbose logging and look at the console output, as in the sketch below.
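
A sketch of enabling verbose logging to inspect node placement (model path is a placeholder):
```python
import onnxruntime as rt

rt.set_default_logger_severity(0)  # VERBOSE: placement decisions are written to the log
sess = rt.InferenceSession("/path/to/model.onnx")
sess.set_providers(["CUDAExecutionProvider"])  # re-initializes the session with the CUDA EP
# Check the console output for lines showing which provider each node was assigned to.
```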

### My converted Tensorflow model is slow - why?
NCHW and NHWC are two different memory layouts for 4-D tensors.

Most TensorFlow operations used by a CNN support both NHWC and NCHW data formats. The TensorFlow team suggests that NCHW is faster on GPU, while NHWC is sometimes faster on CPU. However, ONNX only supports NCHW. As a result, if the original model is in NHWC format, extra transposes may be added when the model is converted. The [tensorflow-onnx](https://github.com/onnx/tensorflow-onnx) and [keras-onnx](https://github.com/onnx/keras-onnx) converters remove many of these transposes, but if this doesn't help sufficiently, consider retraining the model using NCHW.
