Add quantized version of nms #3601
Conversation
@fmassa this is still WIP but a first pass would be helpful to make sure this is going in the right direction :) As far as I can tell from the quick test below, the results seem to be consistent with those of the non-quantized version:

import torch
import numpy as np
from torchvision import ops


def create_tensors_with_iou(N, iou_thresh):
    # taken from test_ops. Makes sure at least one box should be removed.
    # Note: this might not actually happen as we convert to ints, but it
    # should be the case most of the time.
    boxes = torch.rand(N, 4) * 100
    boxes[:, 2:] += boxes[:, :2]
    boxes[-1, :] = boxes[0, :]
    x0, y0, x1, y1 = boxes[-1].tolist()
    iou_thresh += 1e-5
    boxes[-1, 2] += (x1 - x0) * (1 - iou_thresh) / iou_thresh
    scores = torch.rand(N) * 100
    return boxes, scores
n_removed = 0
for seed in range(1000):
    torch.manual_seed(seed)
    for iou_threshold in (.5, .8, .9):
        n_boxes = 200
        boxes, scores = create_tensors_with_iou(n_boxes, iou_threshold)
        # use integer values and clamp to the quint8 range for a fair comparison
        boxes = boxes.to(torch.int).to(torch.float).clamp(0, 255)
        scores = scores.to(torch.int).to(torch.float).clamp(0, 255)
        qboxes = torch.quantize_per_tensor(boxes, scale=1, zero_point=0, dtype=torch.quint8)
        qscores = torch.quantize_per_tensor(scores, scale=1, zero_point=0, dtype=torch.quint8)
        keep = ops.nms(boxes, scores, iou_threshold)
        qkeep = ops.nms(qboxes, qscores, iou_threshold)
        n_removed += (boxes.size(0) - keep.size(0))
        assert (keep == qkeep).all()

print(f"looks good. Removed {n_removed} boxes in total")
This is a great starting point! A few things I think would be good to do next:
Looks great, thanks!
I've made a few comments, let me know what you think
auto areas_a = areas_t.accessor<float, 1>();
for (int64_t i = 0; i < ndets; i++) {
  areas_a[i] = dets_scale * dets_scale * (x2[i].val_ - x1[i].val_) * (y2[i].val_ - y1[i].val_);
Given that we only use areas in a relative way, we don't need to multiply them by dets_scale ** 2. Additionally, the areas could even be kept as int32_t, or some type larger than uint8_t, since without the scale we would only be considering the quantized areas here.

Another thing to keep in mind: if the boxes are malformed (i.e., x2 < x1), the operation will underflow and the results won't be compatible with the fp32 version anymore.
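A minimal sketch of that suggestion (an illustration, not the PR's final code): it reuses the accessor names from the diff above, and assumes areas_t has been allocated as an int32 tensor.

// Scale-free area computation: the dets_scale^2 factor cancels in the
// IoU ratio, so the raw quantized coordinates are enough. With quint8
// coordinates (<= 255) each product fits comfortably in int32_t.
auto areas_a = areas_t.accessor<int32_t, 1>();
for (int64_t i = 0; i < ndets; i++) {
  areas_a[i] = (x2[i].val_ - x1[i].val_) * (y2[i].val_ - y1[i].val_);
}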
auto xx2 = std::min(ix2val, x2[j].val_);
auto yy2 = std::min(iy2val, y2[j].val_);
float w = dets_scale * std::max(0, xx2 - xx1);
Same comment here about dets_scale not being needed. Additionally, you need to keep an eye on degenerate boxes, otherwise xx2 - xx1 might underflow.
Thanks for the review @fmassa! Regarding using an int type larger than int8: there are different quantized dtypes, the biggest one being torch.qint32. Regarding malformed boxes and the risk of overflow, there was a similar discussion in #2582 (see in particular #2582 (comment)). I realize pytorch/pytorch#36853 was just closed, so I'll try to submit another PR with the enforcement on the Python side, so that the function can't be called from Python on degenerate boxes. Do we still need to take care of degenerate boxes on the C++ side?
I would recommend first opening an issue to discuss adding support for it.

We don't currently handle degenerate boxes at all in the implementation. My comments were about underflows that can happen due to the unsigned type being used.
auto w = std::max(0, xx2 - xx1); // * scale (gets canceled below)
auto h = std::max(0, yy2 - yy1); // * scale (gets canceled below)
auto inter = w * h;
double ovr = inter / (iarea + areas[j] - inter);
Don't you need to cast one of those terms to float before the division? Otherwise the division is performed on integers and truncates before the cast happens.

Alternatively, if the exact value of ovr is not needed and approximate values are acceptable, we could also shift the numerator by a large integer factor (say 2 ** 8) and perform integer division on int16_t, shifting iou_threshold by the same factor.
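A sketch of both options (variable names follow the diff above; iou_threshold_shifted is a hypothetical value that would be precomputed once outside the loop):

// Option 1: cast one term before dividing; without the cast,
// inter / (iarea + areas[j] - inter) is an integer division that
// truncates before the implicit conversion to double.
double ovr = static_cast<float>(inter) / (iarea + areas[j] - inter);

// Option 2 (approximate): scale the numerator by 2**8 and stay in
// integer arithmetic, comparing against a threshold shifted by the
// same factor, e.g. iou_threshold_shifted = int(iou_threshold * 256).
auto ovr_shifted = (inter << 8) / (iarea + areas[j] - inter);
if (ovr_shifted > iou_threshold_shifted) {
  // suppress box j
}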
@fmassa what's the rationale behind enforcing the same results between the quantized and non-quantized versions on degenerate boxes? The docstring of nms says that the boxes are expected to be in (x1, y1, x2, y2) format with 0 <= x1 < x2 and 0 <= y1 < y2.

From my understanding, that means that the results are undefined on degenerate boxes, so for now I don't see a reason to enforce consistency on undefined results. But perhaps I'm missing something? I think that the correct way to go would be to validate the boxes in Python before they're passed to the C++ functions.
It's to avoid surprises to the user. But let's not worry about that for now, and just put a comment in the code saying that there might be underflow happening.
auto w = std::max(0, xx2 - xx1); // * scale (gets canceled below)
auto h = std::max(0, yy2 - yy1); // * scale (gets canceled below)
In fact I think it's possible to get an underflow on unsigned types here even when the boxes aren't degenerate: if box j is strictly to the left of box i and they don't overlap, then we'll have xx2 < xx1.
yes, good catch, this happens more often than just with degenerate boxes here
Trying to come up with a reproducible example of an underflow was quite informative, and I came to the conclusion that:

- they actually don't happen;
- even when they do (by manually forcing them to happen), they're effectively harmless, i.e. they don't change the end result. Since they don't change the result, it's actually impossible to test against underflow.

The reason for 1) is that the result of a - b is an int, even if both a and b are uint8_t: TIL about integral promotion (https://stackoverflow.com/questions/21421998/subtraction-operation-on-unsigned-char). Since quint8 is the only supported quantized unsigned type, such values can always be represented as ints, and integral promotion will always happen.
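A standalone illustration of that promotion (not part of the PR):

#include <cstdint>
#include <iostream>

int main() {
  uint8_t a = 5, b = 10;
  // Both operands are promoted to int before the subtraction, so the
  // result is the int -5, not a wrapped-around uint8_t.
  auto d = a - b;          // decltype(d) is int
  std::cout << d << "\n";  // prints -5
  // The wraparound only appears if the result is forced back into uint8_t:
  uint8_t w = a - b;       // 251 (mod-256 arithmetic)
  std::cout << +w << "\n"; // prints 251
  return 0;
}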
Regarding 2): it's possible to get an underflow by doing some explicit casts to uint8_t in the qnms code. The following boxes and scores:

boxes = torch.as_tensor([[0, 0, 5, 5], [10, 0, 20, 20]]).to(torch.float)
scores = torch.as_tensor([10, 9]).to(torch.float)
should produce 10, 5, 0, 5 for xx1, xx2, yy1, yy2, and thus lead to an underflow when computing w = xx2 - xx1. However, the underflow leads to w = 251 and inter = 1255, and thus ovr ≈ -2 (note the negative sign: the oversized inter gets subtracted from the denominator, making it negative). Hence the if (ovr > iou_threshold) condition is never satisfied when there's an underflow, and that's actually correct, since the correct value for ovr should have been 0 anyway: underflows can only happen when the boxes are disjoint along at least one of the dimensions.

So the results are unchanged, and it's thus impossible to actually test against underflows... I think.
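For reference, here is a standalone reproduction of the arithmetic above, with the intersection window hard-coded from the example (it mimics, rather than copies, the qnms code):

#include <cstdint>
#include <iostream>

int main() {
  // Intersection window of the two boxes above: xx1, xx2, yy1, yy2.
  uint8_t xx1 = 10, xx2 = 5, yy1 = 0, yy2 = 5;
  uint8_t w = xx2 - xx1;   // forced underflow: 251 instead of -5
  uint8_t h = yy2 - yy1;   // 5
  int inter = w * h;       // 1255
  int iarea = 5 * 5;       // area of box 0
  int jarea = 10 * 20;     // area of box 1
  double ovr = static_cast<float>(inter) / (iarea + jarea - inter);
  // ovr is negative (about -1.2 with this arithmetic): the oversized
  // inter makes the denominator negative, so (ovr > iou_threshold)
  // can never be satisfied.
  std::cout << ovr << "\n";
  return 0;
}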
EDIT: we now explicitly cast coordinates to float32 to allow vectorization, so there's definitely no risk of underflow anymore when computing w or h.
I updated the PR in 7e27337 to explicitly cast the coordinates to float, which allows vectorization. Similar improvements are observed on the other quantized dtypes.
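As a sketch of what the float-cast coordinate math could look like (assumed names and structure, not the exact PR code):

#include <algorithm>
#include <cstdint>

// IoU of two quantized boxes after casting the raw uint8 coordinates to
// float: w and h are computed in float arithmetic, so there is no
// unsigned wraparound, and the loop body becomes auto-vectorizable.
float quantized_box_iou(uint8_t ix1, uint8_t iy1, uint8_t ix2, uint8_t iy2,
                        uint8_t jx1, uint8_t jy1, uint8_t jx2, uint8_t jy2) {
  float xx1 = std::max((float)ix1, (float)jx1);
  float yy1 = std::max((float)iy1, (float)jy1);
  float xx2 = std::min((float)ix2, (float)jx2);
  float yy2 = std::min((float)iy2, (float)jy2);
  float w = std::max(0.f, xx2 - xx1);
  float h = std::max(0.f, yy2 - yy1);
  float inter = w * h;
  // In the real kernel the areas would be precomputed once per box.
  float iarea = float(ix2 - ix1) * float(iy2 - iy1);
  float jarea = float(jx2 - jx1) * float(jy2 - jy1);
  return inter / (iarea + jarea - inter);
}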
Thanks for the thorough investigation!
Summary:
- Add quantized version of nms
- Added tests
- Compute areas only once
- remove calls to dequantize_val
- fix return type for empty tensor
- flake8
- remove use of scale as it gets cancelled out
- simpler int conversion in tests
- explicitly set ovr to double
- add tests for more values of scale and zero_point
- comment about underflow
- remove unnecessary accessor
- properly convert to float for division
- Add comments about underflow
- explicitly cast coordinates to float to allow vectorization
- clang
- clang again
- hopefully OK now

Reviewed By: fmassa

Differential Revision: D27433913

fbshipit-source-id: c0b80c3b6817898682a78af705ff383855248a37
This PR introduces the forward pass of quantized nms on CPU.