Add quantized version of nms #3601

Merged: 21 commits merged into pytorch:master on Mar 30, 2021

Conversation

NicolasHug (Member)

This PR introduces the forward pass of quantized nms on CPU.

NicolasHug added the module: ops, module: models.quantization, and improvement labels on Mar 24, 2021
NicolasHug (Member, Author)

@fmassa this is still WIP but a first pass would be helpful to make sure this is going in the right direction :)

As far as I can tell from the quick test below, the results seem to be consistent with those of the non-quantized version:

import torch
import numpy as np
from torchvision import ops

def create_tensors_with_iou(N, iou_thresh):
    # taken from test_ops. Makes sure at least one box should be removed
    # Note: this might not actually happen as we convert to ints but should be
    # the case most of the time
    boxes = torch.rand(N, 4) * 100
    boxes[:, 2:] += boxes[:, :2]
    boxes[-1, :] = boxes[0, :]
    x0, y0, x1, y1 = boxes[-1].tolist()
    iou_thresh += 1e-5
    boxes[-1, 2] += (x1 - x0) * (1 - iou_thresh) / iou_thresh
    scores = torch.rand(N) * 100
    return boxes, scores

n_removed = 0
for seed in range(1000):
    torch.manual_seed(seed)
    for iou_threshold in (.5, .8, .9):

        n_boxes = 200
        boxes, scores = create_tensors_with_iou(n_boxes, iou_threshold)

        # use integer values and clamp to the quint8 range for a fair comparison
        boxes = boxes.to(torch.int).to(torch.float).clamp(0, 255)
        scores = scores.to(torch.int).to(torch.float).clamp(0, 255)

        qboxes = torch.quantize_per_tensor(boxes, scale=1, zero_point=0, dtype=torch.quint8)
        qscores = torch.quantize_per_tensor(scores, scale=1, zero_point=0, dtype=torch.quint8)

        keep = ops.nms(boxes, scores, iou_threshold)
        qkeep = ops.nms(qboxes, qscores, iou_threshold)

        n_removed += (boxes.size(0) - keep.size(0))
        assert (keep == qkeep).all()

print(f"looks good. Removed {n_removed} boxes in total")

NicolasHug marked this pull request as draft on March 24, 2021 at 15:32
fmassa (Member) commented Mar 24, 2021

This is a great starting point!

A few things I think would be good to do next:

  • add tests
  • try to see if it's possible to remove the dequantize_val calls: for the operations we do here, the scale and zero_point can be factored out and might cancel out in most places (a sketch of the cancellation is below)
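
A minimal sketch of the cancellation argument (a hypothetical helper, not the kernel code): with per-tensor affine quantization x = scale * (q - zero_point), the zero_point drops out of every coordinate difference, and scale^2 appears in both the numerator and the denominator of the IoU, so the ratio can be computed directly on the raw integer values.

#include <algorithm>
#include <cstdint>

// Hypothetical illustration only: IoU of two boxes (x1, y1, x2, y2) computed
// on their raw quantized coordinates. The true coordinates are
// scale * (q - zero_point); zero_point cancels in the differences and scale^2
// cancels between intersection and union, so no dequantization is needed.
double quantized_iou(
    int32_t ax1, int32_t ay1, int32_t ax2, int32_t ay2,
    int32_t bx1, int32_t by1, int32_t bx2, int32_t by2) {
  int32_t area_a = (ax2 - ax1) * (ay2 - ay1);  // true area = area_a * scale^2
  int32_t area_b = (bx2 - bx1) * (by2 - by1);
  int32_t w = std::max<int32_t>(0, std::min(ax2, bx2) - std::max(ax1, bx1));
  int32_t h = std::max<int32_t>(0, std::min(ay2, by2) - std::max(ay1, by1));
  int32_t inter = w * h;                       // true inter = inter * scale^2
  // scale^2 is common to the numerator and the denominator and cancels out
  return static_cast<double>(inter) / (area_a + area_b - inter);
}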

fmassa (Member) left a comment


Looks great, thanks!

I've made a few comments; let me know what you think.


auto areas_a = areas_t.accessor<float, 1>();
for (int64_t i = 0; i < ndets; i++) {
  areas_a[i] = dets_scale * dets_scale * (x2[i].val_ - x1[i].val_) * (y2[i].val_ - y1[i].val_);
fmassa (Member):

Given that we only use the areas in a relative way, we don't need to multiply them by dets_scale ** 2.
Additionally, this could even be kept as int32_t or some type larger than uint8_t, as without the scale we would only be dealing with the quantized (integer) areas here.

Another thing to keep in mind is that if the boxes are malformed (i.e., x2 < x1) the operation will underflow and the results won't be compatible with the fp32 version anymore.

auto xx2 = std::min(ix2val, x2[j].val_);
auto yy2 = std::min(iy2val, y2[j].val_);

float w = dets_scale * std::max(0, xx2 - xx1);
fmassa (Member):

Same comment here about dets_scale not being needed.
Additionally, you need to keep an eye on degenerate boxes, otherwise xx2 - xx1 might underflow.

NicolasHug marked this pull request as ready for review on March 25, 2021 at 10:46
NicolasHug changed the title from "WIP Add quantized version of nms" to "Add quantized version of nms" on Mar 25, 2021
NicolasHug (Member, Author)

Thanks for the review @fmassa!

Regarding using an integer type larger than uint8: there are different quantized dtypes, the largest being torch.qint32. Should we use long instead of float then?

Regarding malformed boxes and the risk of overflow, there was a similar discussion in #2582 (see in particular #2582 (comment)). I realize pytorch/pytorch#36853 was just closed, so I'll try to submit another PR with the enforcement on the Python side so that the function can't be called from Python on degenerate boxes. Do we still need to take care of degenerate boxes on the C++ side?

fmassa (Member) commented Mar 25, 2021

I would recommend first opening an issue to discuss adding torch.assert_async to torchvision, as assert_async can invalidate the CUDA context when an assert fires (so the user would need to restart the runtime).

Do we still need to take care of degenerate boxes on the C++ side?

We don't currently handle degenerate boxes at all in the implementation. My comments were about underflows that can happen due to the unsigned type being used.

auto w = std::max(0, xx2 - xx1); // * scale (gets canceled below)
auto h = std::max(0, yy2 - yy1); // * scale (gets canceled below)
auto inter = w * h;
double ovr = inter / (iarea + areas[j] - inter);
fmassa (Member):

Don't you need to cast one of those terms to float before the division? Otherwise there will be truncation in the integer division before the cast happens.

Alternatively, if the exact value of ovr is not needed and approximate values are acceptable, we could also shift the numerator by a large integer factor (say 2 ** 8), perform an integer division (e.g. on int16_t), and shift the iou_threshold by the same factor; a rough sketch is below.
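
For reference, a rough sketch of that fixed-point alternative (hypothetical code, not what this PR ends up doing; the intermediates are kept in int64_t here just to stay clear of overflow):

#include <cstdint>

// Hypothetical illustration: approximate the IoU in units of 1/256 with an
// integer division, and compare it against the threshold expressed in the
// same units. Assumes union_area > 0.
bool over_threshold_fixed_point(int64_t inter, int64_t union_area, float iou_threshold) {
  int64_t ovr_shifted = (inter << 8) / union_area;  // ~ IoU * 256, truncated
  int64_t thresh_shifted = static_cast<int64_t>(iou_threshold * 256.0f);
  return ovr_shifted > thresh_shifted;
}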

NicolasHug (Member, Author) commented Mar 25, 2021

Another thing to keep in mind is that if the boxes are malformed (i.e., x2 < x1) the operation will underflow and the results won't be compatible with the fp32 version anymore

@fmassa what's the rationale behind enforcing the same results between float nms and qnms on degenerate boxes?

The docstring of nms is quite clear regarding the expected format:

boxes (Tensor[N, 4])): boxes to perform NMS on. They are expected to be in (x1, y1, x2, y2) format with 0 <= x1 < x2 and 0 <= y1 < y2.

From my understanding, that means that the results are undefined on degenerate boxes, so for now I don't see a reason to enforce consistency on undefined results. But perhaps I'm missing something?

I think the correct way to go would be to validate the boxes in Python before they're passed to the C++ functions. Whether we can use assert_async is not clear yet, but maybe we could at least do the validation for CPU tensors? That would prevent any inconsistency between float nms and qnms, as qnms only supports CPU anyway.

fmassa (Member) commented Mar 25, 2021

From my understanding, that means that the results are undefined on degenerate boxes, so for now I don't see a reason to enforce consistency on undefined results. But perhaps I'm missing something?

It's to avoid surprises for the user. But let's not worry about that for now, and just put a comment in the code saying that underflow might happen.

Comment on lines 77 to 78
auto w = std::max(0, xx2 - xx1); // * scale (gets canceled below)
auto h = std::max(0, yy2 - yy1); // * scale (gets canceled below)
NicolasHug (Member, Author):

In fact I think it's possible to get an underflow on unsigned types here even when the boxes aren't degenerate: if box j is strictly to the left of box i and they don't overlap, then we'll have xx2 < xx1.

fmassa (Member):

Yes, good catch: this happens more often than just with degenerate boxes here.

NicolasHug (Member, Author) commented Mar 29, 2021:

Trying to come up with a reproducible example of an underflow was quite informative, and I came to the conclusion that:

  1. They actually don't happen.
  2. Even when they do (by manually forcing them to happen), they're effectively harmless, i.e. they don't change the end result. Since they don't change the result, it's actually impossible to test against underflow.

The reason for 1) is that the result of a - b is an int, even if both a and b are uint8_t: TIL about integral promotion (https://stackoverflow.com/questions/21421998/subtraction-operation-on-unsigned-char). Since quint8 is the only supported unsigned quantized type, such values can always be represented as ints, and integral promotion will always happen.
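
A tiny standalone example of that promotion rule (not from the PR):

#include <cstdint>
#include <iostream>
#include <type_traits>

int main() {
  uint8_t a = 5;
  uint8_t b = 10;
  auto d = a - b;  // both operands are promoted to int before the subtraction
  static_assert(std::is_same<decltype(d), int>::value, "a - b is an int");
  std::cout << d << std::endl;  // prints -5, not 251
  return 0;
}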

Regarding 2): It's possible to get an underflow by doing some explicit casts to uint8_t in the qnms code. The following boxes and scores:

    boxes = torch.as_tensor([[0, 0, 5, 5], [10, 0, 20, 20]]).to(torch.float)
    scores = torch.as_tensor([10, 9]).to(torch.float)

should produce xx1 = 10, xx2 = 5, yy1 = 0, yy2 = 5, and thus lead to an underflow when computing w = xx2 - xx1.

However, the underflow leads to w = 251 and inter = 1255 (some big value), and thus ovr ~= -2 (note the negative sign, since that big inter value gets subtracted from the denominator).

Hence the if (ovr > iou_threshold) condition is never satisfied when there's an underflow, and that's actually correct since the correct value for ovr should have been 0 anyway: underflows can only happen when the boxes are disjoint along at least one of the dimensions.

So the results are unchanged, and it's thus impossible to actually test against underflows... I think.

EDIT: we now explicitly cast the coordinates to float32 to allow vectorization, so there's definitely no risk of underflow anymore when computing w or h.

NicolasHug (Member, Author) commented Mar 29, 2021

I updated the PR in 7e27337 to explicitly use floats for the temporary variables storing the coordinates. This has a (very) minimal memory overhead over using [u]int[8] (i.e. leaving the variable declarations as auto), but it allows the compiler to produce vectorized instructions, which significantly improves computation time. In addition, this completely avoids underflows (which IMHO don't even happen anyway, see #3601 (comment)). On a small benchmark over 500 boxes:

nms on floats:
386 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

qnms on qint32 with implicit auto declaration:
2.06 ms ± 84.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

qnms on qint32 with explicit float declaration:
399 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Similar improvements are observed on the other quantized dtypes.
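
For context, the change is essentially of this kind (an illustrative sketch with hypothetical names, not the exact diff):

#include <algorithm>
#include <cstdint>

// Illustrative only: with `auto`, the temporaries keep the integral underlying
// type of the quantized values and the inner loop stays scalar. Declaring them
// as float keeps the arithmetic in float, which the compiler can vectorize,
// and it also rules out any unsigned wrap-around.
float intersection_width(uint8_t ix1, uint8_t ix2, uint8_t jx1, uint8_t jx2) {
  float xx1 = std::max(static_cast<float>(ix1), static_cast<float>(jx1));
  float xx2 = std::min(static_cast<float>(ix2), static_cast<float>(jx2));
  return std::max(0.0f, xx2 - xx1);
}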

Overall I observed that qnms is slightly slower than nms, but still of the same order of magnitude. Note that it's not unexpected for a quantized version to be slower than its non-quantized counterpart. For example, hardsigmoid on a tensor with 10,000 elements gives:

float 3.75 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
quint8 5.59 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
qint8 5.32 µs ± 295 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
qint32 44.9 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

fmassa (Member) left a comment


Thanks for the thorough investigation!

fmassa merged commit f74bfab into pytorch:master on Mar 30, 2021
facebook-github-bot pushed a commit that referenced this pull request on Apr 2, 2021:
Summary:
* Add quantized version of nms

* Added tests

* Compute areas only once

* remove calls to dequantize_val

* fix return type for empty tensor

* flake8

* remove use of scale as it gets cancelled out

* simpler int conversion in tests

* explicitly set ovr to double

* add tests for more values of scale and zero_point

* comment about underflow

* remove unnecessary accessor

* properly convert to float for division

* Add comments about underflow

* explicitly cast coordinates to float to allow vectorization

* clang

* clang  again

* hopefully OK now

Reviewed By: fmassa

Differential Revision: D27433913

fbshipit-source-id: c0b80c3b6817898682a78af705ff383855248a37
NicolasHug mentioned this pull request on Jul 16, 2021
Labels: cla signed, enhancement, module: models.quantization, module: ops