NumPy Inference
This is a software re-implementation of LiteRT Inference in NumPy and fxpmath. This implementation is useful for understanding quantization math before moving on to a hardware implementation in PyRTL Inference.
It produces nearly the same output tensors as LiteRT Inference and exactly the same output tensors as PyRTL Inference.
Implement quantized inference with NumPy and fxpmath.
This does not use the LiteRT Inference reference implementation.
This implements the equations in Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. All Equation references in documentation and code comments refer to equations in this paper.
The first layer is quantized per-axis, which is not described in the paper above. See per-axis quantization for details.
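For orientation, the core relations from the paper can be summarized as follows. This is a brief sketch in the paper's notation, not a substitute for it: S denotes a scale factor, Z a zero point, and the subscripts 1, 2, and 3 denote the two matrix multiplication inputs and the output. In this module's function signatures, Z_1, Z_2, Z_3, M_0, and n appear as z1, z2, z3, m0, and n. Equation numbers are deliberately omitted here; consult the paper for them.

```latex
% Affine quantization: real value r, quantized integer q, scale S, zero point Z.
r = S\,(q - Z)

% Quantized matrix multiplication followed by normalization.
q_3^{(i,k)} = Z_3 + M \sum_{j} \bigl(q_1^{(i,j)} - Z_1\bigr)\bigl(q_2^{(j,k)} - Z_2\bigr),
\qquad M = \frac{S_1 S_2}{S_3} = 2^{-n} M_0
```

The sum is computed by quantized_matmul(), and the multiplication by M and addition of Z_3 are performed by normalize().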
The numpy_inference demo uses NumPyInference to implement quantized
inference with NumPy.
- class pyrtlnet.numpy_inference.NumPyInference(tensor_path)
Run quantized inference on an input batch with NumPy and fxpmath.
- __init__(tensor_path)
Collect weights, biases, and quantization metadata from a .npz file created by quantize_model().
- Raises:
FileNotFoundError – If the .npz file is not found.
- Parameters:
tensor_path (str) – Path to the .npz file created by quantize_model().
- run(test_batch)
Run quantized inference on a batch.
All calculations are done with NumPy and fxpmath.
- Parameters:
test_batch (ndarray) – An image batch of shape (batch_size, 12, 12) to run through the NumPy inference implementation.
- Return type:
- Returns:
(layer0_outputs, layer1_outputs, actuals), where layer0_outputs is the first layer’s raw tensor output, with shape (batch_size, 18). layer1_outputs is the second layer’s raw tensor output, with shape (batch_size, 10). actuals is a numpy.ndarray of predicted digits with shape (batch_size,).
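As a rough illustration of how the three return values relate, the sketch below derives predicted digits from a stand-in layer1_outputs array. It assumes that each predicted digit is the index of the largest second-layer output in its row; that is a common convention for classifiers, not something confirmed by the text above. The sample values are made up.

```python
import numpy as np

# Stand-in for the second layer's raw output, shape (batch_size, 10).
# These values are invented for illustration only.
layer1_outputs = np.array(
    [
        [-128, -90, 45, -60, -128, -100, -77, -20, -5, -110],
        [-120, -128, -30, -64, -99, -12, -128, 101, -80, -45],
    ],
    dtype=np.int8,
)

# Assumption: the predicted digit is the position of the largest output.
actuals = layer1_outputs.argmax(axis=1)

print(actuals.tolist())  # [2, 7]
print(actuals.shape)     # (2,)
```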
- pyrtlnet.numpy_inference.normalize(product, m0, n, z3)
Convert a 32-bit layer output to a normalized 8-bit output.
This function effectively multiplies the layer’s output by its scale factor m and adds its zero point z3. m is a floating-point number, which can also be represented by a 32-bit fixed-point multiplier m0 and a bitwise right shift n; see normalization_constants(). So instead of doing a floating-point multiplication, we do a fixed-point multiplication followed by a bitwise right shift. This multiplication and shift reduce 32-bit product values to 8-bit outputs, using the 8-bit output range as effectively as possible.
Layers can have per-axis scale factors, in which case m0 and n will be vectors of scale factors and shift amounts. See per-axis quantization for details.
- Parameters:
- Return type:
- Returns:
z3 + (product * m0) >> n, where * is elementwise fixed-point multiplication, and >> is a rounding right shift. The return value has the same shape as product and dtype int8.
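The multiply-and-shift idea can be sketched in plain NumPy, without fxpmath. The code below is an illustration under stated assumptions, not the pyrtlnet implementation: split_scale and normalize_sketch are hypothetical names, m0 is stored as an integer with 31 fractional bits, the fixed-point multiplication and the shift by n are folded into a single rounding shift that rounds halves up, and the result is saturated to the int8 range. The real function may differ in any of these choices.

```python
import math

import numpy as np


def split_scale(m: float) -> tuple[int, int]:
    """Hypothetical helper: find (m0, n) with m ~= (m0 / 2**31) * 2**-n.

    Assumes 0 < m < 1, so that n is non-negative.
    """
    # math.frexp returns mantissa in [0.5, 1) with m == mantissa * 2**exponent.
    mantissa, exponent = math.frexp(m)
    m0 = round(mantissa * (1 << 31))
    if m0 == 1 << 31:  # The mantissa rounded up to 1.0, so renormalize.
        m0 >>= 1
        exponent += 1
    return m0, -exponent


def normalize_sketch(product, m0, n, z3):
    """Sketch of z3 + (product * m0) >> n with a rounding right shift."""
    product = np.asarray(product, dtype=np.int64)
    # 31 bits undo the fixed-point scaling of m0; n is the extra shift.
    shift = 31 + np.asarray(n, dtype=np.int64)
    scaled = product * np.asarray(m0, dtype=np.int64)
    # Adding half of the divisor before shifting rounds halves up.
    rounded = (scaled + (np.int64(1) << (shift - 1))) >> shift
    # Assumption: saturate to the int8 range before casting.
    return np.clip(z3 + rounded, -128, 127).astype(np.int8)


# Per-axis example: one scale factor per output column (made-up values).
scales = [0.0037, 0.0037, 0.001]
m0, n = zip(*(split_scale(m) for m in scales))

product = np.array([[1000, -2000, 30000]], dtype=np.int32)
out = normalize_sketch(product, m0, n, z3=5)
print(out.tolist())  # [[9, -2, 35]]
```

Because m0 and n are sequences of the same length as the last axis of product, NumPy broadcasting applies each column's multiplier and shift to that column, which mirrors the per-axis case described above.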
- pyrtlnet.numpy_inference.quantized_matmul(q1, z1, q2, z2)
Quantized matrix multiplication of q1 and q2.
This function returns the un-normalized matrix multiplication output, which has dtype int32. See Sections 2.3 and 2.4 in Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. The layer’s int32 bias can be added to this function’s output, and the relu() activation function applied, if necessary. The output must then be normalized back to int8 with normalize() before proceeding to the next layer.
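A minimal sketch of the steps described above: subtract the zero points, multiply in int32, add a bias, and apply ReLU. It assumes scalar zero points and uses invented values. quantized_matmul_sketch is a hypothetical stand-in, not the pyrtlnet function, which may organize the computation differently.

```python
import numpy as np


def quantized_matmul_sketch(q1, z1, q2, z2):
    """Un-normalized product of (q1 - z1) and (q2 - z2), computed in int32."""
    # Widen to int32 before subtracting so int8 inputs cannot overflow.
    a = np.asarray(q1, dtype=np.int32) - np.int32(z1)
    b = np.asarray(q2, dtype=np.int32) - np.int32(z2)
    return a @ b


# Invented example: a (2, 3) matrix times a (3, 2) matrix.
q1 = np.array([[1, -2, 3], [4, 5, -6]], dtype=np.int8)
q2 = np.array([[10, -128], [20, 0], [-30, 127]], dtype=np.int8)

acc = quantized_matmul_sketch(q1, 0, q2, -128)
print(acc.tolist())  # [[136, 509], [704, -890]]

# Add an int32 bias (one value per output row here) and apply ReLU.
bias = np.array([[-200], [100]], dtype=np.int32)
activated = np.maximum(acc + bias, 0)
print(activated.tolist())  # [[0, 309], [804, 0]]
```

The activated values are still 32-bit at this point and would need to be scaled back to int8 before feeding the next layer.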