NumPy Inference

This is a software re-implementation of LiteRT Inference in NumPy and fxpmath. This implementation is useful for understanding quantization math before moving on to a hardware implementation in PyRTL Inference.

It produces nearly the same output tensors as LiteRT Inference and exactly the same output tensors as PyRTL Inference.

Implement quantized inference with NumPy and fxpmath.

This does not use the LiteRT Inference reference implementation.

This implements the equations in Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. All Equation references in documentation and code comments refer to equations in this paper.

The first layer is quantized per-axis, which is not described in the paper above. See per-axis quantization for details.

The numpy_inference demo uses NumPyInference to implement quantized inference with NumPy.

class pyrtlnet.numpy_inference.NumPyInference(tensor_path)

Run quantized inference on an input batch with NumPy and fxpmath.

__init__(tensor_path)

Collect weights, biases, and quantization metadata from a .npz file created by quantize_model().

Raises: FileNotFoundError – If the .npz file is not found.
Parameters: tensor_path (str) – Path to the .npz file created by quantize_model().

run(test_batch)

Run quantized inference on a batch.

All calculations are done with NumPy and fxpmath.

Parameters: test_batch (ndarray) – An image batch of shape (batch_size, 12, 12) to run through the NumPy inference implementation.
Return type: tuple[ndarray, ndarray, int]
Returns: (layer0_outputs, layer1_outputs, actuals), where layer0_outputs is the first layer’s raw tensor output, with shape (batch_size, 18), layer1_outputs is the second layer’s raw tensor output, with shape (batch_size, 10), and actuals is a numpy.ndarray of predicted digits with shape (batch_size,).

pyrtlnet.numpy_inference.normalize(product, m0, n, z3)

Convert a 32-bit layer output to a normalized 8-bit output.

This function effectively multiplies the layer’s output by its scale factor m and adds its zero point z3.

m is a floating-point number, which can also be represented by a 32-bit fixed-point multiplier m0 and a bitwise right shift n (see normalization_constants()). So instead of doing a floating-point multiplication, we do a fixed-point multiplication followed by a bitwise right shift. This multiplication and shift reduce 32-bit product values to 8-bit outputs, using the 8-bit output range as effectively as possible.

Layers can have per-axis scale factors, in which case m0 and n are vectors of fixed-point multipliers and shift amounts. See per-axis quantization for details.

Parameters:

product (ndarray) – Matrix to normalize, with dtype int32.
m0 (Fxp) – Vector of per-row fixed-point multipliers.
n (ndarray) – Vector of per-row shift amounts.
z3 (ndarray) – Vector of per-row zero-point adjustments.

Return type:

ndarray

Returns:

z3 + ((product * m0) >> n), where * is elementwise fixed-point multiplication, and >> is a rounding right shift. The shift is applied to the product before z3 is added. The return value has the same shape as product and dtype int8.
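
The multiply-and-shift described above can be illustrated with plain NumPy integers. This is a minimal sketch, not pyrtlnet's implementation: it uses int64 arithmetic instead of fxpmath, and the names split_multiplier, rounding_right_shift, and normalize_sketch are hypothetical. It also assumes that m0 has 31 fractional bits, that per-row vectors are stored as column vectors of shape (rows, 1), that ties round up, and that out-of-range results saturate to the int8 range.

```python
import math

import numpy as np

FRACTION_BITS = 31  # assumption: m0 is stored with 31 fractional bits


def split_multiplier(m: float) -> tuple[int, int]:
    """Split 0 < m < 1 into (m0_int, n), m ~= m0_int * 2**-(FRACTION_BITS + n)."""
    # math.frexp returns m == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(m)
    return round(mantissa * (1 << FRACTION_BITS)), -exponent


def rounding_right_shift(x: np.ndarray, shift: np.ndarray) -> np.ndarray:
    """Arithmetic right shift that rounds to nearest, with ties rounded up."""
    half = np.left_shift(np.int64(1), shift - 1)
    return (x + half) >> shift


def normalize_sketch(product, m0_int, n, z3):
    """Return z3 + ((product * m0) >> n), saturated to the int8 range."""
    # An int32 value times a 31-bit multiplier fits in int64.
    wide = product.astype(np.int64) * m0_int
    scaled = rounding_right_shift(wide, n + FRACTION_BITS)
    # Assumption: out-of-range values saturate instead of wrapping around.
    return np.clip(scaled + z3, -128, 127).astype(np.int8)


# Two output rows, each with its own scale factor and zero point.
m = np.array([[0.0025], [0.01]])
pairs = [split_multiplier(scale) for scale in m.ravel()]
m0_int = np.array([[p[0]] for p in pairs], dtype=np.int64)
n = np.array([[p[1]] for p in pairs], dtype=np.int64)
z3 = np.array([[-128], [3]], dtype=np.int64)
product = np.array([[1100, -1100, 40000], [260, -260, 100000]], dtype=np.int32)

out = normalize_sketch(product, m0_int, n, z3)

# Floating-point reference: multiply by m, round, add z3, saturate.
reference = np.clip(np.round(product * m) + z3, -128, 127).astype(np.int8)
print(out)
print(np.array_equal(out, reference))
```

The integer path never touches a floating-point multiplier at inference time. The float computation appears only as a reference to compare against, which is also why this kind of arithmetic maps naturally onto hardware.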

pyrtlnet.numpy_inference.quantized_matmul(q1, z1, q2, z2)

Quantized matrix multiplication of q1 and q2.

This function returns the un-normalized matrix multiplication output, which has dtype int32. See Sections 2.3 and 2.4 in Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. The layer’s int32 bias can be added to this function’s output, and the relu() activation function applied, if necessary. The output must then be normalized back to int8 with normalize() before proceeding to the next layer.

Parameters:

q1 (ndarray) – Left input to the matrix multiplication.
z1 (int) – Zero point for q1.
q2 (ndarray) – Right input to the matrix multiplication.
z2 (int) – Zero point for q2.

Return type:

ndarray

Returns:

Un-normalized matrix multiplication output, with dtype int32.
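
As a rough illustration of the un-normalized product, the sketch below subtracts each zero point and accumulates in int32, following the sum of (q1 - z1)(q2 - z2) terms in the paper's quantized matrix multiplication. The function name quantized_matmul_sketch and the scales, zero points, and matrices are hypothetical, and pyrtlnet may organize the computation differently.

```python
import numpy as np


def quantized_matmul_sketch(q1, z1, q2, z2):
    """Multiply zero-point-adjusted int8 matrices, accumulating in int32."""
    # Widen before subtracting so neither the subtraction nor the sums
    # overflow the int8 range.
    return (q1.astype(np.int32) - z1) @ (q2.astype(np.int32) - z2)


# Hypothetical scales and zero points, chosen only for illustration.
s1, z1 = 0.02, 0
s2, z2 = 0.5, -128
q1 = np.array([[10, -20, 30], [-40, 50, 60]], dtype=np.int8)
q2 = np.array([[-128, 0], [-28, 127], [72, -128]], dtype=np.int8)

acc = quantized_matmul_sketch(q1, z1, q2, z2)

# Scaled by s1 * s2, the int32 result matches the real-valued matrix product.
r1 = s1 * (q1.astype(np.float64) - z1)
r2 = s2 * (q2.astype(np.float64) - z2)
print(acc.dtype, np.allclose(s1 * s2 * acc, r1 @ r2))
```

The combined scale s1 * s2 is what normalization later folds into the output scale factor, which is why the accumulator can stay in integers until then.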

pyrtlnet.numpy_inference.relu(x)

ReLU activation function, which converts negative values to zero.

Parameters: x (ndarray) – Input to the activation function.
Return type: ndarray
Returns: Activation function’s output, where each element will be non-negative.
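
A minimal sketch of this activation on an int32 accumulator follows. The name relu_sketch and the bias values are hypothetical. The sketch assumes, as in the paper, that the int32 bias and accumulator have a zero point of 0, so clamping at integer zero corresponds to clamping at real zero.

```python
import numpy as np


def relu_sketch(x):
    """Replace negative elements with zero, keeping the input dtype."""
    return np.maximum(x, 0)


# Un-normalized int32 matmul output and a hypothetical int32 bias.
acc = np.array([[4000, -3820], [17000, 7630]], dtype=np.int32)
bias = np.array([-5000, 100], dtype=np.int32)

activated = relu_sketch(acc + bias)
print(activated)
```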