PyRTL Inference

This is a hardware re-implementation of NumPy Inference in PyRTL. It produces nearly the same output tensors as LiteRT Inference and exactly the same output tensors as NumPy Inference.

Implement quantized inference with PyRTL.

This does not invoke the LiteRT Inference or NumPy Inference implementations. Instead, it reimplements NumPy Inference in hardware with the PyRTL Matrix Library, and runs that hardware in a PyRTL Simulation.

The pyrtl_inference demo uses PyRTLInference to implement quantized inference with PyRTL.
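As a minimal sketch of how the class documented below might be driven, the helper here builds the hardware and simulates one batch. The helper name run_pyrtl_inference is hypothetical, and the bitwidths 8 and 32 are simply the values the parameter descriptions recommend. The tensor path and the test batch are supplied by the caller, because this page does not specify the file name or the batch layout. pyrtlnet is imported inside the function so that defining the helper does not require the package to be installed.

```python
def run_pyrtl_inference(tensor_path, test_batch, batch_size=1):
    """Hypothetical helper: build the inference hardware, then simulate one batch.

    tensor_path is assumed to point at the .npz file created by quantize_model(),
    and test_batch is assumed to already be an ndarray in the layout that
    simulate() expects.
    """
    # Imported lazily; requires the pyrtlnet package.
    from pyrtlnet.pyrtl_inference import PyRTLInference

    inference = PyRTLInference(
        tensor_path=tensor_path,
        input_bitwidth=8,          # recommended value from the parameter docs
        accumulator_bitwidth=32,   # recommended value from the parameter docs
        axi=False,                 # load inputs via the simulation memory map
        batch_size=batch_size,
    )
    # simulate() returns (layer0_output, layer1_output, argmax).
    return inference.simulate(test_batch)
```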

class pyrtlnet.pyrtl_inference.PyRTLInference(tensor_path, input_bitwidth, accumulator_bitwidth, axi, initial_delay_cycles=0, batch_size=1)

Convert a quantized model to hardware, and simulate the hardware with a PyRTL Simulation.

__init__(tensor_path, input_bitwidth, accumulator_bitwidth, axi, initial_delay_cycles=0, batch_size=1)

Convert the quantized model to PyRTL inference hardware.

This builds the necessary hardware for each layer’s matrix operations. The diagram below shows layer0’s data flow:

layer0 weight, shape: (18, 144), input_bitwidth
    │
    │   layer0 input image, shape: (144, batch_size), input_bitwidth
    │       │
    ▼       ▼
┌─────────────────────────────────────────────────────────┐
│ layer0 systolic_array (matrix multiplication by weight) │
└─────────────────────────────────────────────────────────┘
    │
    │ output, shape: (18, batch_size), accumulator_bitwidth
    │
    │   layer0 bias, shape: (18, batch_size), accumulator_bitwidth
    │       │
    ▼       ▼
┌───────────────────────────────────┐
│ layer0 elementwise_add (add bias) │
└───────────────────────────────────┘
    │
    │ output, shape: (18, batch_size), accumulator_bitwidth
    ▼
┌─────────────────────────┐
│ layer0 elementwise_relu │
└─────────────────────────┘
    │
    │ output, shape: (18, batch_size), accumulator_bitwidth
    ▼
┌────────────────────────────────────────────────┐
│ layer0 elementwise_normalize (reduce bitwidth) │
└────────────────────────────────────────────────┘
            │
            │
            ▼
        layer0 output, shape: (18, batch_size), input_bitwidth

This diagram shows layer1’s data flow, where layer0’s output is the second input to layer1’s systolic_array:

layer1 weight, shape: (10, 18), input_bitwidth
    │
    │   layer0 output, shape: (18, batch_size), input_bitwidth
    │       │
    ▼       ▼
┌─────────────────────────────────────────────────────────┐
│ layer1 systolic_array (matrix multiplication by weight) │
└─────────────────────────────────────────────────────────┘
    │
    │ output, shape: (10, batch_size), accumulator_bitwidth
    │
    │   layer1 bias, shape: (10, batch_size), accumulator_bitwidth
    │       │
    ▼       ▼
┌───────────────────────────────────┐
│ layer1 elementwise_add (add bias) │
└───────────────────────────────────┘
    │
    │ output, shape: (10, batch_size), accumulator_bitwidth
    ▼
┌────────────────────────────────────────────────┐
│ layer1 elementwise_normalize (reduce bitwidth) │
└────────────────────────────────────────────────┘
    │
    │
    ▼
layer1 output, shape: (10, batch_size), input_bitwidth

make_systolic_array() performs the matrix multiplication of each layer’s weight and input.
make_elementwise_add() performs the elementwise addition of each layer’s bias.
make_elementwise_relu() performs ReLU (only for layer0).
make_elementwise_normalize() performs normalization to convert from intermediate values with bitwidth accumulator_bitwidth to each layer’s output values with bitwidth input_bitwidth.
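The two data flows above can be mirrored in NumPy to check the shapes at each stage. This is only a sketch under stated assumptions: the weights, biases, and image are random placeholder data, and normalize() is a hypothetical stand-in that right-shifts and saturates. The real make_elementwise_normalize() may scale values differently.

```python
import numpy as np


def normalize(acc, shift, bitwidth):
    """Hypothetical normalization: arithmetic right shift, then saturate."""
    lo = -(1 << (bitwidth - 1))
    hi = (1 << (bitwidth - 1)) - 1
    return np.clip(acc >> shift, lo, hi)


def two_layer_flow(image, w0, b0, w1, b1, shift=8, input_bitwidth=8):
    # layer0: matrix multiply, add bias, ReLU, normalize.
    acc0 = w0 @ image                 # (18, 144) @ (144, batch) -> (18, batch)
    acc0 = acc0 + b0                  # elementwise bias add
    acc0 = np.maximum(acc0, 0)        # ReLU (layer0 only)
    out0 = normalize(acc0, shift, input_bitwidth)

    # layer1: matrix multiply, add bias, normalize (no ReLU).
    acc1 = w1 @ out0                  # (10, 18) @ (18, batch) -> (10, batch)
    acc1 = acc1 + b1
    out1 = normalize(acc1, shift, input_bitwidth)
    return out0, out1


# Placeholder data with the shapes from the diagrams; values are random.
batch_size = 4
rng = np.random.default_rng(0)
image = rng.integers(-128, 128, size=(144, batch_size)).astype(np.int32)
w0 = rng.integers(-128, 128, size=(18, 144)).astype(np.int32)
b0 = rng.integers(-1000, 1000, size=(18, batch_size)).astype(np.int32)
w1 = rng.integers(-128, 128, size=(10, 18)).astype(np.int32)
b1 = rng.integers(-1000, 1000, size=(10, batch_size)).astype(np.int32)

out0, out1 = two_layer_flow(image, w0, b0, w1, b1)
```

Intermediate values are held as 32-bit integers here, playing the role of accumulator_bitwidth, and the normalized outputs fit in the signed 8-bit range, playing the role of input_bitwidth.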

Parameters:

tensor_path (str) – Path to the .npz file created by quantize_model().
input_bitwidth (int) – Bitwidth of each element in the input matrix. This should generally be 8.
accumulator_bitwidth (int) – Bitwidth of accumulator registers in the systolic array. This should generally be 32, and larger than input_bitwidth.
axi (bool) – If True, receive input batch data via an AXI-Stream, and return the output’s argmax via AXI-Lite, at addresses 0 - batch_size. If False, the input batch data will be loaded in self.flat_batch_memblock via Simulation’s memory_value_map, and the output’s argmax will be inspected as a wire_matrix.
initial_delay_cycles (int, default: 0) – Number of cycles to wait before starting operation. This is a temporary workaround that’s currently required for correct synthesis with Vivado. In principle, no delay cycles should be needed.
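To see why accumulator_bitwidth should be larger than input_bitwidth, a rough worst-case bound helps. The sketch below assumes signed two's-complement inputs and counts only the products summed by the matrix multiplication; it ignores the bias addition. The function name min_accumulator_bitwidth is hypothetical and not part of pyrtlnet.

```python
def min_accumulator_bitwidth(num_terms, input_bitwidth):
    """Signed bits needed to hold a sum of num_terms worst-case products."""
    # The largest magnitude product is (-2^(n-1)) * (-2^(n-1)).
    worst_product = (1 << (input_bitwidth - 1)) ** 2
    worst_sum = num_terms * worst_product
    # bit_length() gives the magnitude bits; add one for the sign bit.
    return worst_sum.bit_length() + 1


# layer0 sums 144 products per output element (weight shape (18, 144)).
layer0_bits = min_accumulator_bitwidth(144, 8)
# layer1 sums 18 products per output element (weight shape (10, 18)).
layer1_bits = min_accumulator_bitwidth(18, 8)
```

With 8-bit inputs this bound gives 23 bits for layer0 and 20 bits for layer1, so a 32-bit accumulator leaves headroom for the bias.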

simulate(test_batch, verilog=False)

Simulate quantized inference on an image batch.

All calculations are done in PyRTL Simulation, using the hardware generated by __init__().

Parameters:

test_batch (ndarray) – A batch to run through the PyRTL inference implementation.
verilog (bool, default: False) – If True, export the inference implementation to Verilog. Two Verilog files will be generated, one for the pyrtlnet module itself, and another for its testbench. The pyrtlnet module will be named pyrtl_inference.v, or pyrtl_inference_axi.v when constructed with axi=True. The testbench will be named pyrtl_inference_test.v, or pyrtl_inference_axi_test.v when constructed with axi=True.

Return type:

tuple[ndarray, ndarray, int]

Returns:

(layer0_output, layer1_output, argmax), where layer0_output is the first layer’s raw tensor output, with shape (batch_size, 18); layer1_output is the second layer’s raw tensor output, with shape (batch_size, 10); and argmax is a list of predicted digits, one for each image in the batch, with length batch_size.
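The relationship between layer1_output and argmax can be sketched in NumPy: with shape (batch_size, 10), the predicted digit for each image is the index of the largest value in its row. The array below is made-up illustration data, not real model output.

```python
import numpy as np

# Made-up layer1_output for a batch of two images, shape (batch_size, 10).
layer1_output = np.array(
    [
        [-5, 0, 2, 90, 1, -7, 3, 4, 0, -1],   # largest value at index 3
        [1, 2, 3, 4, 5, 6, 7, 120, 8, 9],     # largest value at index 7
    ]
)

# One predicted digit per image: the column index of each row's maximum.
predicted_digits = layer1_output.argmax(axis=1).tolist()
```

Note that the data flow diagrams give layer1's output shape as (10, batch_size), while the returned tensor has shape (batch_size, 10), so the batch axis comes first in the return value.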