Add quantized version of nms #3601

Merged: 21 commits merged into pytorch:master on Mar 30, 2021

Conversation

NicolasHug (Member)

This PR introduces the forward pass of quantized nms on CPU.

NicolasHug added the module: ops, module: models.quantization, and improvement labels on Mar 24, 2021
NicolasHug (Member, Author)

@fmassa this is still WIP but a first pass would be helpful to make sure this is going in the right direction :)

As far as I can tell from the quick test below, the results seem to be consistent with those of the non-quantized version:

import torch
import numpy as np
from torchvision import ops

def create_tensors_with_iou(N, iou_thresh):
    # taken from test_ops. Makes sure at least one box should be removed
    # Note: this might not actually happen as we convert to ints but should be
    # the case most of the time
    boxes = torch.rand(N, 4) * 100
    boxes[:, 2:] += boxes[:, :2]
    boxes[-1, :] = boxes[0, :]
    x0, y0, x1, y1 = boxes[-1].tolist()
    iou_thresh += 1e-5
    boxes[-1, 2] += (x1 - x0) * (1 - iou_thresh) / iou_thresh
    scores = torch.rand(N) * 100
    return boxes, scores

n_removed = 0
for seed in range(1000):
    torch.manual_seed(seed)
    for iou_threshold in (.5, .8, .9):

        n_boxes = 200
        boxes, scores = create_tensors_with_iou(n_boxes, iou_threshold)

        # use integer values and clamp to the quint8 range for a fair comparison
        boxes = boxes.to(torch.int).to(torch.float).clamp(0, 255)
        scores = scores.to(torch.int).to(torch.float).clamp(0, 255)

        qboxes = torch.quantize_per_tensor(boxes, scale=1, zero_point=0, dtype=torch.quint8)
        qscores = torch.quantize_per_tensor(scores, scale=1, zero_point=0, dtype=torch.quint8)

        keep = ops.nms(boxes, scores, iou_threshold)
        qkeep = ops.nms(qboxes, qscores, iou_threshold)

        n_removed += (boxes.size(0) - keep.size(0))
        assert (keep == qkeep).all()

print(f"looks good. Removed {n_removed} boxes in total")

NicolasHug marked this pull request as draft on March 24, 2021 at 15:32
fmassa (Member) commented Mar 24, 2021

This is a great starting point!

A few things I think would be good to do next:

  • add tests
  • try to see if it's possible to remove the dequantize_val calls: for the operations we do here, the scale and zero_point can be factored out and might cancel out in most places (a sketch of the cancellation is below)
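
A minimal sketch of the cancellation argument (a hypothetical helper, not the kernel code): with per-tensor affine quantization x = scale * (q - zero_point), the zero_point drops out of every coordinate difference, and scale^2 appears in both the numerator and the denominator of the IoU, so the ratio can be computed directly on the raw integer values.

#include <algorithm>
#include <cstdint>

// Hypothetical illustration only: IoU of two boxes (x1, y1, x2, y2) computed
// on their raw quantized coordinates. The true coordinates are
// scale * (q - zero_point); zero_point cancels in the differences and scale^2
// cancels between intersection and union, so no dequantization is needed.
double quantized_iou(
    int32_t ax1, int32_t ay1, int32_t ax2, int32_t ay2,
    int32_t bx1, int32_t by1, int32_t bx2, int32_t by2) {
  int32_t area_a = (ax2 - ax1) * (ay2 - ay1);  // true area = area_a * scale^2
  int32_t area_b = (bx2 - bx1) * (by2 - by1);
  int32_t w = std::max<int32_t>(0, std::min(ax2, bx2) - std::max(ax1, bx1));
  int32_t h = std::max<int32_t>(0, std::min(ay2, by2) - std::max(ay1, by1));
  int32_t inter = w * h;                       // true inter = inter * scale^2
  // scale^2 is common to the numerator and the denominator and cancels out
  return static_cast<double>(inter) / (area_a + area_b - inter);
}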

fmassa (Member) left a comment


Looks great, thanks!

I've made a few comments; let me know what you think.


auto areas_a = areas_t.accessor<float, 1>();
for (int64_t i = 0; i < ndets; i++) {
  areas_a[i] = dets_scale * dets_scale * (x2[i].val_ - x1[i].val_) * (y2[i].val_ - y1[i].val_);
fmassa (Member):

Given that we only use the areas in a relative way, we don't need to multiply them by dets_scale ** 2.
Additionally, this could even be kept as int32_t or some type larger than uint8_t, as without the scale we would only be dealing with the quantized (integer) areas here.

Another thing to keep in mind is that if the boxes are malformed (i.e., x2 < x1) the operation will underflow and the results won't be compatible with the fp32 version anymore.

auto xx2 = std::min(ix2val, x2[j].val_);
auto yy2 = std::min(iy2val, y2[j].val_);

float w = dets_scale * std::max(0, xx2 - xx1);
fmassa (Member):

Same comment here about dets_scale not being needed.
Additionally, you need to keep an eye on degenerate boxes, otherwise xx2 - xx1 might underflow.

NicolasHug marked this pull request as ready for review on March 25, 2021 at 10:46
NicolasHug changed the title from "WIP Add quantized version of nms" to "Add quantized version of nms" on Mar 25, 2021
NicolasHug (Member, Author)

Thanks for the review @fmassa!

Regarding using an integer type larger than uint8: there are different quantized dtypes, the largest being torch.qint32. Should we use long instead of float then?

Regarding malformed boxes and the risk of overflow, there was a similar discussion in #2582 (see in particular #2582 (comment)). I realize pytorch/pytorch#36853 was just closed, so I'll try to submit another PR with the enforcement on the Python side so that the function can't be called from Python on degenerate boxes. Do we still need to take care of degenerate boxes on the C++ side?

fmassa (Member) commented Mar 25, 2021

I would recommend first opening an issue to discuss adding torch.assert_async to torchvision, as assert_async can invalidate the CUDA context when an assert fires (so the user would need to restart the runtime).

Do we still need to take care of degenerate boxes on the C++ side?

We don't currently handle degenerate boxes at all in the implementation. My comments were about underflows that can happen due to the unsigned type being used.

auto w = std::max(0, xx2 - xx1); // * scale (gets canceled below)
auto h = std::max(0, yy2 - yy1); // * scale (gets canceled below)
auto inter = w * h;
double ovr = inter / (iarea + areas[j] - inter);
fmassa (Member):

Don't you need to cast one of those terms to float before the division? Otherwise there will be truncation in the integer division before the cast happens.

Alternatively, if the exact value of ovr is not needed and approximate values are acceptable, we could also shift the numerator by a large integer factor (say 2 ** 8), perform an integer division (e.g. on int16_t), and shift the iou_threshold by the same factor; a rough sketch is below.
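
For reference, a rough sketch of that fixed-point alternative (hypothetical code, not what this PR ends up doing; the intermediates are kept in int64_t here just to stay clear of overflow):

#include <cstdint>

// Hypothetical illustration: approximate the IoU in units of 1/256 with an
// integer division, and compare it against the threshold expressed in the
// same units. Assumes union_area > 0.
bool over_threshold_fixed_point(int64_t inter, int64_t union_area, float iou_threshold) {
  int64_t ovr_shifted = (inter << 8) / union_area;  // ~ IoU * 256, truncated
  int64_t thresh_shifted = static_cast<int64_t>(iou_threshold * 256.0f);
  return ovr_shifted > thresh_shifted;
}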

NicolasHug (Member, Author) commented Mar 25, 2021

Another thing to keep in mind is that if the boxes are malformed (i.e., x2 < x1) the operation will underflow and the results won't be compatible with the fp32 version anymore

@fmassa what's the rationale behind enforcing the same results between float nms and qnms on degenerate boxes?

The docstring of nms is quite clear regarding the expected format:

boxes (Tensor[N, 4])): boxes to perform NMS on. They are expected to be in (x1, y1, x2, y2) format with 0 <= x1 < x2 and 0 <= y1 < y2.

From my understanding, that means that the results are undefined on degenerate boxes, so for now I don't see a reason to enforce consistency on undefined results. But perhaps I'm missing something?

I think the correct way to go would be to validate the boxes in Python before they're passed to the C++ functions. Whether we can use assert_async is not clear yet, but maybe we could at least do the validation for CPU tensors? That would prevent any inconsistency between float nms and qnms, as qnms only supports CPU anyway.

fmassa (Member) commented Mar 25, 2021

From my understanding, that means that the results are undefined on degenerate boxes, so for now I don't see a reason to enforce consistency on undefined results. But perhaps I'm missing something?

It's to avoid surprises for the user. But let's not worry about that for now, and just put a comment in the code saying that underflow might happen.

Comment on lines 77 to 78
auto w = std::max(0, xx2 - xx1); // * scale (gets canceled below)
auto h = std::max(0, yy2 - yy1); // * scale (gets canceled below)
NicolasHug (Member, Author):

In fact I think it's possible to get an underflow on unsigned types here even when the boxes aren't degenerate: if box j is strictly to the left of box i and they don't overlap, then we'll have xx2 < xx1.

fmassa (Member):

Yes, good catch: this happens more often than just with degenerate boxes here.

NicolasHug (Member, Author) commented Mar 29, 2021:

Trying to come up with a reproducible example of an underflow was quite informative, and I came to the conclusion that:

  1. They actually don't happen.
  2. Even when they do (by manually forcing them to happen), they're effectively harmless, i.e. they don't change the end result. Since they don't change the result, it's actually impossible to test against underflow.

The reason for 1) is that the result of a - b is an int, even if both a and b are uint8_t: TIL about integral promotion (https://stackoverflow.com/questions/21421998/subtraction-operation-on-unsigned-char). Since quint8 is the only supported unsigned quantized type, such values can always be represented as ints, and integral promotion will always happen.
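
A tiny standalone example of that promotion rule (not from the PR):

#include <cstdint>
#include <iostream>
#include <type_traits>

int main() {
  uint8_t a = 5;
  uint8_t b = 10;
  auto d = a - b;  // both operands are promoted to int before the subtraction
  static_assert(std::is_same<decltype(d), int>::value, "a - b is an int");
  std::cout << d << std::endl;  // prints -5, not 251
  return 0;
}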

Regarding 2): It's possible to get an underflow by doing some explicit casts to uint8_t in the qnms code. The following boxes and scores:

    boxes = torch.as_tensor([[0, 0, 5, 5], [10, 0, 20, 20]]).to(torch.float)
    scores = torch.as_tensor([10, 9]).to(torch.float)

should produce xx1 = 10, xx2 = 5, yy1 = 0, yy2 = 5, and thus lead to an underflow when computing w = xx2 - xx1.

However, the underflow leads to w = 251 and inter = 1255 (some big value), and thus ovr ~= -2 (note the negative sign, since that big inter value gets subtracted from the denominator).

Hence the if (ovr > iou_threshold) condition is never satisfied when there's an underflow, and that's actually correct since the correct value for ovr should have been 0 anyway: underflows can only happen when the boxes are disjoint along at least one of the dimensions.

So the results are unchanged, and it's thus impossible to actually test against underflows... I think.

EDIT: we now explicitly cast the coordinates to float32 to allow vectorization, so there's definitely no risk of underflow anymore when computing w or h.

NicolasHug (Member, Author) commented Mar 29, 2021

I updated the PR in 7e27337 to explicitly use floats for the temporary variables storing the coordinates. This has a (very) minimal memory overhead over using [u]int[8] (i.e. leaving the variable declarations as auto), but it allows the compiler to produce vectorized instructions, which significantly improves computation time. In addition, this completely avoids underflows (which IMHO don't even happen anyway, see #3601 (comment)). On a small benchmark over 500 boxes:

nms on floats:
386 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

qnms on qint32 with implicit auto declaration:
2.06 ms ± 84.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

qnms on qint32 with explicit float declaration:
399 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Similar improvements are observed on the other quantized dtypes.
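
For context, the change is essentially of this kind (an illustrative sketch with hypothetical names, not the exact diff):

#include <algorithm>
#include <cstdint>

// Illustrative only: with `auto`, the temporaries keep the integral underlying
// type of the quantized values and the inner loop stays scalar. Declaring them
// as float keeps the arithmetic in float, which the compiler can vectorize,
// and it also rules out any unsigned wrap-around.
float intersection_width(uint8_t ix1, uint8_t ix2, uint8_t jx1, uint8_t jx2) {
  float xx1 = std::max(static_cast<float>(ix1), static_cast<float>(jx1));
  float xx2 = std::min(static_cast<float>(ix2), static_cast<float>(jx2));
  return std::max(0.0f, xx2 - xx1);
}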

Overall I observed that qnms is slightly slower than nms, but still of the same order of magnitude. Note that it's not unexpected for a quantized version to be slower than its non-quantized counterpart. For example, hardsigmoid on a tensor with 10,000 elements gives:

float 3.75 µs ± 50 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
quint8 5.59 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
qint8 5.32 µs ± 295 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
qint32 44.9 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

fmassa (Member) left a comment


Thanks for the thorough investigation!

fmassa merged commit f74bfab into pytorch:master on Mar 30, 2021
facebook-github-bot pushed a commit that referenced this pull request on Apr 2, 2021:
Summary:
* Add quantized version of nms

* Added tests

* Compute areas only once

* remove calls to dequantize_val

* fix return type for empty tensor

* flake8

* remove use of scale as it gets cancelled out

* simpler int conversion in tests

* explicitly set ovr to double

* add tests for more values of scale and zero_point

* comment about underflow

* remove unnecessary accessor

* properly convert to float for division

* Add comments about underflow

* explicitly cast coordinates to float to allow vectorization

* clang

* clang  again

* hopefully OK now

Reviewed By: fmassa

Differential Revision: D27433913

fbshipit-source-id: c0b80c3b6817898682a78af705ff383855248a37
NicolasHug mentioned this pull request on Jul 16, 2021
Labels: cla signed, enhancement, module: models.quantization, module: ops