Another case study based on this YOLO v3 model is available here.
See here for YOLO v4 usage.
We use the YOLO v3 algorithm for object detection in C# with ML.Net. We start with a PyTorch model, convert it to ONNX format, and then use it in ML.Net.
This is a case study on a YOLO model trained for document layout analysis. The model comes from the following Medium article: Object Detection — Document Layout Analysis Using Monk AI.
- The ONNX conversion removes one feature, the objectness score pc. The original model outputs (5 + classes) features for each bounding box; the ONNX model outputs (4 + classes) features per bounding box. We will use the maximum class probability as a proxy for the objectness score when performing the Non-Maximum Suppression (NMS) step (a sketch of this step follows this list). This is a known issue, more info here.
- Image resizing is not optimised and always yields a 416x416 image. This is not the case in the original model (see this issue: RECTANGULAR INFERENCE); a letterbox sketch of this preprocessing also follows below.
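For illustration, here is a minimal sketch of that NMS step in NumPy, assuming the boxes have already been converted from (x, y, h, w) to corner coordinates; the nms helper below is hypothetical, not part of the Monk library:

import numpy as np

def nms(boxes, scores, iou_thres=0.5):
    # boxes: [N, 4] as (x1, y1, x2, y2); scores: [N] max class probabilities
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # best box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap of the best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thres]  # drop boxes that overlap too much
    return keep

# scores = classes_output.max(axis=1)  # class probability as objectness proxy
# kept = nms(corner_boxes, scores, iou_thres=0.5)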
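And because the exported model always expects a square 416x416 input, non-square images must be resized and padded first. A minimal letterbox sketch using Pillow (the helper name and the grey padding colour are assumptions, not from the original repo):

from PIL import Image as PILImage

def letterbox(path, size=416):
    # Resize keeping the aspect ratio, then pad to a size x size square
    img = PILImage.open(path).convert("RGB")
    scale = size / max(img.size)
    new_w, new_h = int(img.width * scale), int(img.height * scale)
    canvas = PILImage.new("RGB", (size, size), (128, 128, 128))
    canvas.paste(img.resize((new_w, new_h)), ((size - new_w) // 2, (size - new_h) // 2))
    return canvas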
This step is based on the article Object Detection — Document Layout Analysis Using Monk AI.
import sys
from IPython.display import Image

# Make the Monk Object Detection YOLO v3 library importable
sys.path.append("../Monk_Object_Detection/7_yolov3/lib")
from infer_detector import Infer

gtf = Infer()

# Load the class names, one per line
with open("dla_yolov3/classes.txt") as f:
    class_list = [line.strip() for line in f]

# Load the trained weights
model_name = "yolov3"
weights = "dla_yolov3/dla_yolov3.pt"
gtf.Model(model_name, class_list, weights, use_gpu=False, input_size=(416, 416))

# Run inference and display the annotated image
img_path = "test_square.jpg"
gtf.Predict(img_path, conf_thres=0.2, iou_thres=0.5)
Image(filename='output/test_square.jpg')
You need to set ONNX_EXPORT = True in ...\Monk_Object_Detection\7_yolov3\lib\models.py before loading the model.
We name the input layer image and the two output layers classes and bboxes. This is not required, but it helps clarity.
import torch

# Create an input with the right shape: 1 batch x 3 channels x 416 x 416
dummy_input = torch.randn(1, 3, 416, 416)
# Squash values into (0, 1), like normalised pixel values (possibly superfluous)
dummy_input = torch.nn.Sigmoid()(dummy_input)

torch.onnx.export(gtf.system_dict["local"]["model"],
                  dummy_input,
                  "dla_yolov3.onnx",
                  input_names=["image"],
                  output_names=["classes", "bboxes"],
                  opset_version=9)
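Before moving to ML.Net, it is worth sanity-checking the exported file. A minimal sketch, assuming the onnxruntime Python package is installed:

import onnxruntime as ort

# Open the exported model on CPU and inspect its input/output layers
sess = ort.InferenceSession("dla_yolov3.onnx", providers=["CPUExecutionProvider"])
print([(i.name, i.shape) for i in sess.get_inputs()])   # expect [('image', [1, 3, 416, 416])]
print([(o.name, o.shape) for o in sess.get_outputs()])  # expect 'classes' and 'bboxes'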
The ONNX model can be viewed in Netron. Our model looks like this:
- The input layer size is [1 x 3 x 416 x 416]. This corresponds to 1 (batch size) x 3 (colour channels) x 416 (pixels height) x 416 (pixels width) (more info about fixed batch size here).
As per this article:
For an image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10,647 bounding boxes.
- The bboxes output layer is of size [10,647 x 4]. This corresponds to 10,647 bounding boxes x 4 bounding box coordinates (x, y, h, w).
- The classes output layer is of size [10,647 x 18]. This corresponds to 10,647 bounding boxes x 18 classes (this model has only 18 classes).
Hence, each bounding box has (4 + classes) = 22 features. The total number of predictions in this model is 22 x 10,647 = 234,234.
NB: The ONNX conversion removes one feature, the objectness score pc. The original model has (5 + classes) features for each bounding box. We will use the maximum class probability as a proxy for the objectness score.
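These counts are easy to verify; a quick sketch, assuming the three standard YOLO v3 grid sizes for a 416 x 416 input:

grids = (52, 26, 13)                      # the three YOLO v3 detection scales
n_boxes = sum(g * g for g in grids) * 3   # 3 anchor boxes per grid cell
n_classes = 18
print(n_boxes)                    # 10647
print(n_boxes * (4 + n_classes))  # 234234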
More information can be found in this article: YOLO v3 theory explained
- https://medium.com/towards-artificial-intelligence/object-detection-document-layout-analysis-using-monk-object-detection-toolkit-6c57200bde5
- https://medium.com/analytics-vidhya/yolo-v3-theory-explained-33100f6d193
- https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c
- https://michhar.github.io/convert-pytorch-onnx/