@Namespace(value="tensorflow::ops") @NoOffset @Properties(inherit=tensorflow.class) public class QuantizeV2 extends Pointer
out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)
if T == qint8: out[i] -= (range(T) + 1) / 2.0
here range(T) = numeric_limits<T>::max() - numeric_limits<T>::min()
*MIN_COMBINED Mode Example*
Assume the input is type float and has a possible range of [0.0, 6.0] and the
output type is quint8 ([0, 255]). The min_range and max_range values should be
specified as 0.0 and 6.0. Quantizing from float to quint8 will multiply each
value of the input by 255/6 and cast to quint8.
If the output type was qint8 ([-128, 127]), the operation will additionally
subtract each value by 128 prior to casting, so that the range of values aligns
with the range of qint8.
If the mode is 'MIN_FIRST', then this approach is used:
num_discrete_values = 1 << (# of bits in T)
range_adjust = num_discrete_values / (num_discrete_values - 1)
range = (range_max - range_min) * range_adjust
range_scale = num_discrete_values / range
quantized = round(input * range_scale) - round(range_min * range_scale) +
numeric_limits<T>::min()
quantized = max(quantized, numeric_limits<T>::min())
quantized = min(quantized, numeric_limits<T>::max())
The biggest difference between this and MIN_COMBINED is that the minimum range
is rounded first, before it's subtracted from the rounded value. With
MIN_COMBINED, a small bias is introduced where repeated iterations of quantizing
and dequantizing will introduce a larger and larger error.
*SCALED mode Example*
SCALED
mode matches the quantization approach used in
QuantizeAndDequantize{V2|V3}
.
If the mode is SCALED
, we do not use the full range of the output type,
choosing to elide the lowest possible value for symmetry (e.g., output range is
-127 to 127, not -128 to 127 for signed 8 bit quantization), so that 0.0 maps to
0.
We first find the range of values in our tensor. The
range we use is always centered on 0, so we find m such that
c++
m = max(abs(input_min), abs(input_max))
Our input tensor range is then [-m, m]
.
Next, we choose our fixed-point quantization buckets, [min_fixed, max_fixed]
.
If T is signed, this is
num_bits = sizeof(T) * 8
[min_fixed, max_fixed] =
[-(1 << (num_bits - 1) - 1), (1 << (num_bits - 1)) - 1]
Otherwise, if T is unsigned, the fixed-point range is
[min_fixed, max_fixed] = [0, (1 << num_bits) - 1]
From this we compute our scaling factor, s:
c++
s = (max_fixed - min_fixed) / (2 * m)
Now we can quantize the elements of our tensor:
c++
result = round(input * s)
One thing to watch out for is that the operator may choose to adjust the
requested minimum and maximum values slightly during the quantization process,
so you should always use the output ports as the range for further calculations.
For example, if the requested minimum and maximum values are close to equal,
they will be separated by a small epsilon value to prevent ill-formed quantized
buffers from being created. Otherwise, you can end up with buffers where all the
quantized values map to the same float value, which causes problems for
operations that have to perform further calculations on them.
Arguments:
* scope: A Scope object
* min_range: The minimum scalar value possibly produced for the input.
* max_range: The maximum scalar value possibly produced for the input.
Returns:
* Output
output: The quantized data produced from the float input.
* Output
output_min: The actual minimum scalar value used for the output.
* Output
output_max: The actual maximum scalar value used for the output.Modifier and Type | Class and Description |
---|---|
static class |
QuantizeV2.Attrs
Optional attribute setters for QuantizeV2
|
Pointer.CustomDeallocator, Pointer.Deallocator, Pointer.NativeDeallocator, Pointer.ReferenceCounter
Constructor and Description |
---|
QuantizeV2(Pointer p)
Pointer cast constructor.
|
QuantizeV2(Scope scope,
Input input,
Input min_range,
Input max_range,
int T) |
QuantizeV2(Scope scope,
Input input,
Input min_range,
Input max_range,
int T,
QuantizeV2.Attrs attrs) |
Modifier and Type | Method and Description |
---|---|
static QuantizeV2.Attrs |
Mode(BytePointer x) |
static QuantizeV2.Attrs |
Mode(String x) |
Operation |
operation() |
QuantizeV2 |
operation(Operation setter) |
Output |
output_max() |
QuantizeV2 |
output_max(Output setter) |
Output |
output_min() |
QuantizeV2 |
output_min(Output setter) |
Output |
output() |
QuantizeV2 |
output(Output setter) |
static QuantizeV2.Attrs |
RoundMode(BytePointer x) |
static QuantizeV2.Attrs |
RoundMode(String x) |
address, asBuffer, asByteBuffer, availablePhysicalBytes, calloc, capacity, capacity, close, deallocate, deallocate, deallocateReferences, deallocator, deallocator, equals, fill, formatBytes, free, getDirectBufferAddress, getPointer, getPointer, getPointer, getPointer, hashCode, interruptDeallocatorThread, isNull, isNull, limit, limit, malloc, maxBytes, maxPhysicalBytes, memchr, memcmp, memcpy, memmove, memset, offsetAddress, offsetof, offsetof, parseBytes, physicalBytes, physicalBytesInaccurate, position, position, put, realloc, referenceCount, releaseReference, retainReference, setNull, sizeof, sizeof, toString, totalBytes, totalCount, totalPhysicalBytes, withDeallocator, zero
public QuantizeV2(Pointer p)
Pointer(Pointer)
.public QuantizeV2(@Const @ByRef Scope scope, @ByVal Input input, @ByVal Input min_range, @ByVal Input max_range, @Cast(value="tensorflow::DataType") int T)
@ByVal public static QuantizeV2.Attrs Mode(@tensorflow.StringPiece BytePointer x)
@ByVal public static QuantizeV2.Attrs Mode(@tensorflow.StringPiece String x)
@ByVal public static QuantizeV2.Attrs RoundMode(@tensorflow.StringPiece BytePointer x)
@ByVal public static QuantizeV2.Attrs RoundMode(@tensorflow.StringPiece String x)
public QuantizeV2 operation(Operation setter)
public QuantizeV2 output(Output setter)
public QuantizeV2 output_min(Output setter)
public QuantizeV2 output_max(Output setter)
Copyright © 2022. All rights reserved.