PyRTL Matrix Library

Hardware implementations of matrix operations, in the PyRTL hardware description language.

PyRTL Matrix Operations

PyRTL implementations of common linear algebra operations.

The operations in this module all use WireMatrix2D as their input and output, so they can be composed to implement arbitrary matrix calculations. See the pyrtl_matrix demo for an example that computes x · (y - y_zero) + a.

Warning

These implementations may not be completely general. They have only been tested in the context of dense neural networks.

pyrtlnet.pyrtl_matrix.make_argmax(a)

Combinationally compute the argmax of each column of matrix a, returning the index of the row containing the largest value in each column.

For example, given the matrix:

┌     ┐
│ 1 5 │
│ 3 4 │
└     ┘

make_argmax returns a wire_matrix() containing the values [1, 0], because row 1 holds the largest value in the first column (3), and row 0 holds the largest value in the second column (5).

This implementation is fully combinational (no registers).
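
The example above can be checked with a small software reference model. This is a plain-Python sketch for illustration only, not the PyRTL implementation: argmax_by_column is a hypothetical helper, and resolving ties toward the lowest row index is an assumption.

```python
def argmax_by_column(matrix):
    """Return, for each column, the index of the row holding the largest value."""
    num_rows = len(matrix)
    num_columns = len(matrix[0])
    return [
        # max() returns the first maximum, so ties go to the lowest row index.
        max(range(num_rows), key=lambda row: matrix[row][column])
        for column in range(num_columns)
    ]


result = argmax_by_column([[1, 5], [3, 4]])
print(result)  # [1, 0]
```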

Parameters:

a (WireMatrix2D) – Input matrix.

Return type:

wire_matrix

Returns:

A wire_matrix() containing the concatenation of the row indexes of the largest values in each column of a in unsigned binary.

pyrtlnet.pyrtl_matrix.make_elementwise_add(name, a, b, output_bitwidth)

Combinationally add matrices a and b elementwise.

This implementation is fully combinational (no registers). b may also be a column vector with the same number of rows as a, in which case each row's value in b is added to every element in the corresponding row of a.
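
The two accepted shapes for b can be illustrated with a plain-Python reference model. This is a sketch of the arithmetic only: elementwise_add is a hypothetical helper, it works on nested lists rather than WireMatrix2D, and it ignores output_bitwidth.

```python
def elementwise_add(a, b):
    """Add b to a elementwise; b may be a column vector with one value per row of a."""
    assert len(a) == len(b), "a and b must have the same number of rows"
    result = []
    for row_a, row_b in zip(a, b):
        if len(row_b) == 1:
            # Column vector: reuse the row's single value for every column.
            row_b = row_b * len(row_a)
        result.append([x + y for x, y in zip(row_a, row_b)])
    return result


same_shape = elementwise_add([[1, 2], [3, 4]], [[10, 20], [30, 40]])
column_vector = elementwise_add([[1, 2], [3, 4]], [[10], [20]])
print(same_shape)     # [[11, 22], [33, 44]]
print(column_vector)  # [[11, 12], [23, 24]]
```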

Parameters:
Return type:

WireMatrix2D

Returns:

WireMatrix2D containing a + b.

pyrtlnet.pyrtl_matrix.make_elementwise_normalize(name, a, m0, n, z3, output_bitwidth)

Convert an un-normalized layer output to a normalized output.

This function effectively multiplies the layer’s output by its scale factor m and adds its zero point z3.

m is a floating-point number, which is represented by a 32-bit fixed-point multiplier m0 and a bitwise rounding right shift n. See normalization_constants(). Instead of doing a floating-point multiplication by m, we do a fixed-point multiplication by m0, followed by a bitwise rounding right shift by n.

See Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference for more details. This implements the part of Equation 7 that’s outside the parentheses (addition of z3 and multiplication by m).

Layers can have per-axis scale factors, so m0 and n will be vectors of per-row scale factors and shift amounts. See per-axis quantization for details.

For example, if accumulator_bitwidth is 32, and output_bitwidth is 8, this function can multiply and shift 32-bit a values into 8-bit output values to most effectively utilize the limited 8-bit output range.

This implementation is fully combinational (no registers).

Parameters:
  • name (str) – The returned WireMatrix2D will be named {name}.output.

  • a (WireMatrix2D) – Matrix to normalize, with bitwidth accumulator_bitwidth.

  • m0 (Fxp) – Vector of per-row fixed-point multipliers.

  • n (ndarray) – Vector of per-row shift amounts.

  • z3 (ndarray) – Vector of per-row zero-point adjustments.

  • output_bitwidth (int) – Number of bits to output for each element. This should generally be 8.

Return type:

WireMatrix2D

Returns:

z3 + ((a * m0) >> n), where * is elementwise fixed-point multiplication, and >> is a rounding right shift. The shift is applied to the product before z3 is added. The return value has the same shape as a and bitwidth output_bitwidth.
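
The multiply-then-shift arithmetic can be sketched with plain Python integers. This is a rough model under stated assumptions, not the library's implementation: split_scale, rounding_right_shift and normalize are hypothetical helpers, m0 is assumed to be stored with 31 fractional bits, and saturation to output_bitwidth is omitted.

```python
import math

FRACTIONAL_BITS = 31  # Assumed fixed-point format for m0; not taken from the library.


def split_scale(m):
    """Split a positive float m into (m0_int, n) so that m ~= m0_int / 2**31 / 2**n."""
    mantissa, exponent = math.frexp(m)  # m == mantissa * 2**exponent, 0.5 <= mantissa < 1
    m0_int = round(mantissa * (1 << FRACTIONAL_BITS))
    return m0_int, -exponent


def rounding_right_shift(x, shift):
    """Shift right by `shift` bits (shift >= 1), rounding to nearest instead of truncating."""
    return (x + (1 << (shift - 1))) >> shift


def normalize(a, m0_int, n, z3):
    """Compute z3 + ((a * m0) >> n) using only integer arithmetic."""
    return z3 + rounding_right_shift(a * m0_int, FRACTIONAL_BITS + n)


m0_int, n = split_scale(0.0025)
outputs = [normalize(a, m0_int, n, z3=3) for a in (10000, 1234, -1234)]
print(n, outputs)  # 8 [28, 6, 0]
```

With m = 0.0025, the accumulator value 10000 scales to 25 before the zero point is added, which matches the floating-point product 10000 * 0.0025.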

pyrtlnet.pyrtl_matrix.make_elementwise_relu(name, a)

Combinationally apply ReLU to matrix a. This computes max(a, 0) elementwise.

This implementation is fully combinational (no registers).

Parameters:
Return type:

WireMatrix2D

Returns:

max(a, 0) computed elementwise.

pyrtlnet.pyrtl_matrix.make_input_memblock_data(a, input_bitwidth, addrwidth)

Convert an ndarray to MemBlock data for use with the systolic array.

When make_systolic_array() uses a WireMatrix2D with a MemBlock as input, the systolic array will read consecutive addresses of the MemBlock each cycle. The data at each address must contain all the values in a that will be consumed by the systolic array in the next cycle. All the values needed in a cycle are concatenated together, and stored at one address.

The returned memblock_data can be directly used as the romdata for a RomBlock, or enumerated and converted to a dict and used with a MemBlock, via memory_value_map in Simulation:

memblock_data = pyrtl_matrix.make_input_memblock_data(...)
memblock_dict = dict(enumerate(memblock_data))
sim = pyrtl.Simulation(memory_value_map={memblock: memblock_dict})

Parameters:
  • a (ndarray) – Input data to convert to MemBlock data.

  • input_bitwidth (int) – Bitwidth of each element in a.

  • addrwidth (int) – Number of MemBlock address bits. This must be large enough to hold the number of cycles needed to run the systolic array. See num_systolic_array_cycles().

Return type:

list[int]

Returns:

A list of integer values, ready for storage in a MemBlock. Each integer contains all the bits from a that the systolic array needs in one cycle.

pyrtlnet.pyrtl_matrix.make_systolic_array(name, a, b, b_zero, input_bitwidth, accumulator_bitwidth, initial_delay_cycles=0)

Generate an output-stationary systolic array, computing a · (b - b_zero).

Parameters:
  • name (str) – The returned WireMatrix2D will be named {name}.output.

  • a (WireMatrix2D | ndarray) – Left input to the systolic array. The types of a and b do not have to match.

  • b (WireMatrix2D | ndarray) – Right input to the systolic array. The types of a and b do not have to match.

  • b_zero (int) – Zero point for b. Useful for quantized neural network computations. Set it to zero for standard matrix multiplication.

  • input_bitwidth (int) – Bitwidth of each input element.

  • accumulator_bitwidth (int) – Bitwidth used when summing dot products. The systolic array multiplies and accumulates many input elements, so accumulator_bitwidth should be larger than input_bitwidth.

  • initial_delay_cycles (int, default: 0) – Number of cycles to wait before starting operation. This is a temporary workaround that is currently required for correct synthesis with Vivado. Ideally, no delay cycles would be needed.

Return type:

WireMatrix2D

Returns:

A WireMatrix2D representing a · (b - b_zero).

Systolic Array Architecture

The systolic array’s architecture is shown in the diagram below. l0' is l0, delayed by one cycle, and l0'' is l0, delayed by two cycles:

              t0                         t1
              │                          │
              ▼                          ▼
         ┌─────────┐    l0'         ┌─────────┐    l0''
l0 ─────▶│ reg_0_0 │─────┬─────────▶│ reg_0_1 │─────┬───── ...
         └─────────┘     │          └─────────┘     │
              │          │               │          │
              │          ▼               │          ▼
              │      ┌────────┐          │      ┌────────┐
          t0' ├─────▶│ pe_0_0 │      t1' ├─────▶│ pe_0_1 │
              │      └────────┘          │      └────────┘
              │                          │
              ▼                          ▼
         ┌─────────┐    l1'         ┌─────────┐    l1''
l1 ─────▶│ reg_1_0 │─────┬─────────▶│ reg_1_1 │─────┬───── ...
         └─────────┘     │          └─────────┘     │
              │          │               │          │
              │          ▼               │          ▼
              │      ┌────────┐          │      ┌────────┐
         t0'' ├─────▶│ pe_1_0 │     t1'' ├─────▶│ pe_1_1 │
              │      └────────┘          │      └────────┘
              │                          │
             ...                        ...

The systolic array multiplies matrices a and b, where a has shape (num_rows, num_inner) and b has shape (num_inner, num_columns).

The systolic array is a 2D array of Registers (reg) and processing elements (pe), arranged in num_rows rows and num_columns columns. Each Register is paired with a processing element to form a tile. For example, reg_0_0 and pe_0_0 form the tile at (0, 0). Multiple tiles are wired together to create the full systolic array.

Systolic Array Operation

Matrix a streams into the left inputs (l0, l1, ... ln), over (num_inner + num_rows - 1) cycles.

Matrix b streams into the top inputs (t0, t1, ... tn), over (num_inner + num_columns - 1) cycles.

Data streams from these left and top inputs, through registers (reg_0_0, reg_0_1, ...), to processing elements (pe_0_0, pe_0_1, ...). The processing elements store the matrix multiplication output in accumulator registers. The output does not move through the array, which makes this array “output-stationary.”

The left and top inputs change over time. If the matrix a is:

    ┌       ┐
a = │ 1 2 3 │
    │ 4 5 6 │
    └       ┘

then num_rows=2 and num_inner=3 because matrix a has shape (2, 3). There are two left inputs because num_rows=2. It will take 4 cycles to stream matrix a, because 3 + 2 - 1 = 4. The left inputs for each cycle are:

   │  cycle
   │ 0 1 2 3
───┼───────
l0 │ 1 2 3 0
l1 │ 0 4 5 6

Note how l1 is shifted forward one cycle, and the holes have been filled with zeroes.

If the matrix b is:

    ┌             ┐
b = │  7  8  9 10 │
    │ 11 12 13 14 │
    │ 15 16 17 18 │
    └             ┘

then num_inner=3 and num_columns=4 because matrix b has shape (3, 4). There are four top inputs because num_columns=4. It will take 6 cycles to stream matrix b, because 3 + 4 - 1 = 6. The top inputs for each cycle are:

   │        cycle
   │  0  1  2  3  4  5
───┼──────────────────
t0 │  7 11 15  0  0  0
t1 │  0  8 12 16  0  0
t2 │  0  0  9 13 17  0
t3 │  0  0  0 10 14 18

Note how matrix b has been transposed. t0 is [7 11 15] over the first three cycles, which corresponds to the leftmost column of matrix b. t1 is shifted forward one cycle, t2 is shifted forward two cycles, t3 is shifted forward three cycles, and the holes have been filled with zeroes.

Compare t0 and l0. l0 corresponds to the topmost row of matrix a, and t0 corresponds to the leftmost column of matrix b. t0 and l0 can be generated by following the same procedure, except matrix b is initially transposed, while matrix a is not.
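
The two input schedules shown above can be reproduced with a short plain-Python sketch. The skew helper is hypothetical and exists only for illustration; it delays row i by i cycles and fills the holes with zeroes, and matrix b is transposed before the same procedure is applied.

```python
def skew(rows):
    """Delay row i by i cycles and fill the holes with zeroes."""
    num_cycles = len(rows[0]) + len(rows) - 1
    return [
        [0] * i + list(row) + [0] * (num_cycles - i - len(row))
        for i, row in enumerate(rows)
    ]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8, 9, 10], [11, 12, 13, 14], [15, 16, 17, 18]]

left_inputs = skew(a)  # One schedule per left input (l0, l1).
top_inputs = skew([list(column) for column in zip(*b)])  # Transpose b first.

for name, schedule in (("l", left_inputs), ("t", top_inputs)):
    for i, values in enumerate(schedule):
        print(f"{name}{i}", values)
```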

When there is no more input to stream in to the left or top inputs, the corresponding input should be set to zero. The final result will be ready in (num_rows + num_inner + num_columns) cycles, and the matrix multiplication result can be read from the pe_{row}_{col} registers.

The pyrtl_matrix demo runs this example through the systolic array named mm0, and these parallelogram-shaped inputs can be seen propagating through the array’s mm0.left and mm0.top inputs in the output from render_trace():

                ▕0    ▕1    ▕2    ▕3    ▕4    ▕5    ▕6    ▕7    ▕8    ▕9

     mm0.left[0] ──────▏1   ▕ 2   ▕ 3   ▕────────────────────────────────────

     mm0.left[1] ────────────▏4   ▕ 5   ▕ 6   ▕──────────────────────────────

      mm0.top[0] ──────▏7   ▕ 11  ▕ 15  ▕────────────────────────────────────

      mm0.top[1] ────────────▏8   ▕ 12  ▕ 16  ▕──────────────────────────────

      mm0.top[2] ──────────────────▏9   ▕ 13  ▕ 17  ▕────────────────────────

      mm0.top[3] ────────────────────────▏10  ▕ 14  ▕ 18  ▕──────────────────

mm0.output[0][0] ──────────────────▏7   ▕ 29  ▕ 74

mm0.output[0][1] ────────────────────────▏8   ▕ 32  ▕ 80

mm0.output[0][2] ──────────────────────────────▏9   ▕ 35  ▕ 86

mm0.output[0][3] ────────────────────────────────────▏10  ▕ 38  ▕ 92

mm0.output[1][0] ────────────────────────▏28  ▕ 83  ▕ 173

mm0.output[1][1] ──────────────────────────────▏32  ▕ 92  ▕ 188

mm0.output[1][2] ────────────────────────────────────▏36  ▕ 101 ▕ 203

mm0.output[1][3] ──────────────────────────────────────────▏40  ▕ 110 ▕ 218

       mm0.state INIT ▕ BUSY                                          ▕ DONE
                                                                       ▁▁▁▁▁▁
mm0.output.valid ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▏

The mm0.output signals show the systolic array’s output matrix. For example, mm0.output[0][0] shows the output matrix’s final top left value is 74, which is 1 * 7 + 2 * 11 + 3 * 15.

The trace shows how the systolic array multiplies and accumulates to execute this matrix multiplication over time. For example, the trace for mm0.output[0][0] shows:

 7 in cycle 3, which is 1 * 7.
29 in cycle 4, which is 1 * 7 + 2 * 11.
74 in cycle 5, which is 1 * 7 + 2 * 11 + 3 * 15.

The inputs for computing mm0.output[0][0] can be found in the mm0.left[0] and mm0.top[0] traces.

The expected result of multiplying matrices a and b is:

         ┌                 ┐
output = │  74  80  86  92 │
         │ 173 188 203 218 │
         └                 ┘

And these values can be found on the right side of the mm0.output traces.
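
The multiply-and-accumulate behavior seen in the trace can be imitated with a cycle-by-cycle software model. This is a plain-Python sketch of the tile structure described under Systolic Array Architecture, not the PyRTL implementation: simulate_systolic_array and skew are hypothetical helpers, b_zero is taken to be zero, and the model has no INIT state, so its cycle numbers are offset from those in the trace.

```python
def skew(rows, num_cycles):
    """Delay row i by i cycles and pad with zeroes up to num_cycles."""
    return [
        [0] * i + list(row) + [0] * (num_cycles - i - len(row))
        for i, row in enumerate(rows)
    ]


def simulate_systolic_array(a, b):
    """Cycle-by-cycle model of an output-stationary systolic array computing a · b."""
    num_rows, num_inner, num_columns = len(a), len(b), len(b[0])
    num_cycles = num_rows + num_inner + num_columns
    left = skew(a, num_cycles)
    top = skew([list(column) for column in zip(*b)], num_cycles)

    def zeros():
        return [[0] * num_columns for _ in range(num_rows)]

    left_reg, top_reg, acc = zeros(), zeros(), zeros()
    top_left_history = []
    for cycle in range(num_cycles):
        # Compute every register's next value from the current state, then
        # update them all at once, as a clock edge would.
        next_left, next_top, next_acc = zeros(), zeros(), zeros()
        for r in range(num_rows):
            for c in range(num_columns):
                # Each tile registers its inputs, passing them right and down.
                next_left[r][c] = left[r][cycle] if c == 0 else left_reg[r][c - 1]
                next_top[r][c] = top[c][cycle] if r == 0 else top_reg[r - 1][c]
                # The processing element multiplies the registered values and accumulates.
                next_acc[r][c] = acc[r][c] + left_reg[r][c] * top_reg[r][c]
        left_reg, top_reg, acc = next_left, next_top, next_acc
        top_left_history.append(acc[0][0])
    return acc, top_left_history


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8, 9, 10], [11, 12, 13, 14], [15, 16, 17, 18]]
output, top_left_history = simulate_systolic_array(a, b)
print(output)            # [[74, 80, 86, 92], [173, 188, 203, 218]]
print(top_left_history)  # [0, 7, 29, 74, 74, 74, 74, 74, 74]
```

The accumulators never move between tiles. Only the left and top operands shift through the registers, which is what makes the design output-stationary.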

pyrtlnet.pyrtl_matrix.minimum_bitwidth(a)

Return the minimum number of bits needed to represent each element in a.

Parameters:

a (ndarray) – Array to process. a may contain negative numbers, so this ensures there are enough bits to represent both the largest and smallest values.

Return type:

int

Returns:

The number of bits needed to represent the largest or smallest element in a.
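
One way to compute such a bitwidth is sketched below in plain Python. It assumes a signed two's complement representation that always includes a sign bit. Whether minimum_bitwidth makes the same choice for arrays with no negative values is not stated here, so treat signed_bits_needed and minimum_bitwidth_sketch as hypothetical helpers.

```python
def signed_bits_needed(value):
    """Bits needed to hold value in two's complement, including the sign bit."""
    if value >= 0:
        return value.bit_length() + 1
    # For negative values, -2**(k-1) still fits in k bits.
    return (-value - 1).bit_length() + 1


def minimum_bitwidth_sketch(values):
    """Smallest signed bitwidth that can represent every element of a 2D list."""
    return max(signed_bits_needed(v) for row in values for v in row)


print(minimum_bitwidth_sketch([[127, -128], [0, 5]]))  # 8
print(minimum_bitwidth_sketch([[128, -1]]))            # 9
```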

pyrtlnet.pyrtl_matrix.num_systolic_array_cycles(a_shape, b_shape)

Return the cycles needed to multiply a and b with the systolic array.

When using make_systolic_array() with a MemBlock as input, this function is useful for calculating the MemBlock’s addrwidth.

Parameters:
Return type:

int

Returns:

The number of cycles needed to multiply a and b with the systolic array. See make_systolic_array().
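
As a rough sketch, the cycle count and a matching addrwidth could be derived as follows. This assumes the count follows the (num_rows + num_inner + num_columns) formula given under Systolic Array Operation; systolic_cycles is a hypothetical helper, and the addrwidth calculation is a conservative guess rather than the library's rule.

```python
def systolic_cycles(a_shape, b_shape):
    """Cycles needed to multiply matrices with shapes a_shape and b_shape."""
    num_rows, num_inner = a_shape
    inner_b, num_columns = b_shape
    assert num_inner == inner_b, "inner dimensions must match"
    return num_rows + num_inner + num_columns


cycles = systolic_cycles((2, 3), (3, 4))
# Conservative: enough address bits to represent the cycle count itself.
addrwidth = cycles.bit_length()
print(cycles, addrwidth)  # 9 4
```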

pyrtlnet.pyrtl_matrix.saturating_truncate(value, bitwidth)

Truncate a signed value to bitwidth, saturating at the largest and smallest representable values.

If value is too large to fit in bitwidth (overflow), the output WireVector will have the value 2 ** (bitwidth - 1) - 1.

If value is too small to fit in bitwidth (underflow), the output WireVector will have the value -2 ** (bitwidth - 1).

Otherwise, the output WireVector will have the same value as value.

Parameters:
  • value (WireVector) – Value to truncate.

  • bitwidth (int) – Bitwidth to truncate value to. Must be less than value.bitwidth.

Return type:

WireVector

Returns:

value truncated to bitwidth, saturating at the largest and smallest representable values if overflow or underflow occur.
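
The saturation rule can be expressed as a short software model. This plain-Python sketch operates on ordinary integers rather than WireVectors, and saturating_truncate_model is a hypothetical name used only for illustration.

```python
def saturating_truncate_model(value, bitwidth):
    """Clamp a signed integer to the range representable in bitwidth bits."""
    largest = 2 ** (bitwidth - 1) - 1
    smallest = -(2 ** (bitwidth - 1))
    return max(smallest, min(largest, value))


print(saturating_truncate_model(300, 8))   # 127 (overflow saturates)
print(saturating_truncate_model(-300, 8))  # -128 (underflow saturates)
print(saturating_truncate_model(42, 8))    # 42 (fits, unchanged)
```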

WireMatrix2D

class pyrtlnet.wire_matrix_2d.WireMatrix2D(values, shape=(), bitwidth=0, name='', ready=None, valid=None)

WireMatrix2D represents a 2D matrix of WireVector.

WireMatrix2D functions like a 2D wire_matrix(), with a NumPy-style shape tuple, and ready/valid signals. It serves as the input and output type for all operations in the PyRTL Matrix Library. These matrix operations can be composed. For example, when computing x · y + a, there is an intermediate WireMatrix2D that serves as both the output of the multiplication x · y, and the input to the addition _ + a.

WireMatrix2D supports two underlying representations:

  1. self.Matrix, which is a 2D wire_matrix(). wire_matrix supports any PyRTL WireVector type, so you could have a self.Matrix of Register for example. This representation is used when the WireMatrix2D is constructed without a MemBlock.

  2. MemBlock, where the matrix data is stored in a MemBlock or RomBlock. This representation is currently experimental and not completely supported. This representation is used when the WireMatrix2D is constructed with a MemBlock.

ready/valid protocol

WireMatrix2D serves as a shared buffer between an upstream producer that writes data into the WireMatrix2D, and a downstream consumer that reads data from the WireMatrix2D. The producer and consumer must coordinate their usage to avoid corrupting this shared resource. For example:

  1. While the producer is writing data to the WireMatrix2D, it is not safe for the consumer to read data from the WireMatrix2D.

  2. While the consumer is reading data from the WireMatrix2D, it is not safe for the producer to write new data into the WireMatrix2D.

WireMatrix2D provides ready and valid signals to help the producer and consumer coordinate. ready indicates when it is safe for the producer to write new data to the WireMatrix2D, and valid indicates when it is safe for the consumer to read data from the WireMatrix2D.

Note

ready and valid are undriven WireVectors provided as a convenience. WireMatrix2D itself does not assign any values to these wires or inspect the values on these wires. The producer and consumer must set and check these signals appropriately.

__getitem__(row)

Implements WireMatrix2D’s [] operator.

If this WireMatrix2D was not constructed with a MemBlock, its elements can be accessed with self[row][column]. This returns a WireVector with bitwidth self.bitwidth. This method only implements row-level indexing, and returns a wire_matrix(). Column-level indexing is implemented by the returned wire_matrix().

Warning

If this WireMatrix2D was constructed with a MemBlock, this method can currently only retrieve a full row of values as matrix[row]. Per-element access is currently not supported.

Parameters:

row (WireVector) – Row number to retrieve from the matrix.

Return type:

WireVector

Returns:

A WireVector containing all the data in the row concatenated together. If this WireMatrix2D was not constructed with a MemBlock, the returned WireVector is actually a wire_matrix(), which can be further indexed with its __getitem__ operator to retrieve data in a specific column.

__init__(values, shape=(), bitwidth=0, name='', ready=None, valid=None)

Construct a 2D wire_matrix() containing values.

Parameters:
  • values (ndarray | list[list[WireVector]] | WireVector | MemBlock) – Values for the WireMatrix2D. If None, creates a WireMatrix2D of Input. values can also be an ndarray, a list of lists of WireVector, one large concatenated WireVector containing all the values for the matrix, or a MemBlock.

  • shape (tuple[int, int], default: ()) – Shape of the WireMatrix2D. Must be two-dimensional. If values is an ndarray, the shape will be inferred from the ndarray and this shape argument can be omitted.

  • bitwidth (int, default: 0) – The bitwidth of each element.

  • name (str, default: '') – Names for all elements in the WireMatrix2D will be generated based on this prefix. For example, if name is foo then the top left element will be named foo[0][0].

  • ready (bool | WireVector, default: None) – A 1-bit signal indicating if the WireMatrix2D can be safely written by the upstream producer.

  • valid (bool | WireVector, default: None) – A 1-bit signal indicating if the WireMatrix2D can be safely read by the downstream consumer.

bitwidth: int

Bitwidth of each element in the matrix.

inspect(sim)

Collect and return Output values from a Simulation.

Retrieves Output values for self from a Simulation, and returns the retrieved values in an ndarray.

Use make_outputs() to create the retrieved Output values.

Parameters:

sim (Simulation) – Simulation to retrieve values from.

Return type:

ndarray

Returns:

Retrieved values as an ndarray.

make_outputs(output_name)

Create Output WireVectors for self.

Use inspect() to retrieve these Output values.

Parameters:

output_name (str) – The generated Outputs will have names of the form {output_name}_{row}_{column}.

make_provided_inputs(values)

Create provided_inputs for Simulation.

This should only be used with a WireMatrix2D of Input. This WireMatrix2D should have been constructed with values=None.

Parameters:

values (ndarray) – Values to pack into a provided_inputs dict.

Return type:

dict[str, int]

Returns:

A provided_inputs dict that contains matrix data from values. This dict can be passed to step(), and it will set the Input for the WireMatrix2D to values.

ready: WireVector

A 1-bit signal indicating if the WireMatrix2D can be safely written by the upstream producer.

See ready/valid protocol.

shape: tuple[int, int]

Matrix’s shape, as a tuple of integers (num_rows, num_columns).

transpose()

Return a transposed version of self, as another WireMatrix2D.

Warning

If self.memblock is not None, this does not reformat the MemBlock’s data. It only changes the shape. The MemBlock is assumed to already contain transposed data.

Return type:

WireMatrix2D

Returns:

A transposed version of self.

valid: WireVector

A 1-bit signal indicating if the WireMatrix2D can be safely read by the downstream consumer.

See ready/valid protocol.

pyrtlnet.wire_matrix_2d.make_concatenated_value(values, bitwidth)

Pack all elements of values into a large integer, in row-major order.

When using a WireMatrix2D with a MemBlock, this function is useful for setting the initial value of a MemBlock or RomBlock.

Parameters:
  • values (ndarray) – Values to concatenate.

  • bitwidth (int) – Bitwidth of each element.

Return type:

int

Returns:

A large integer containing all bits from values, concatenated together in row-major order. The total number of bits returned will be values.size * bitwidth.
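
A minimal plain-Python sketch of row-major packing is shown below. The bit order is an assumption: this version places the first element in the most significant bits, which may not match the library. concatenate_values is a hypothetical helper, and negative values are stored in two's complement.

```python
def concatenate_values(values, bitwidth):
    """Pack a 2D list into one integer, row by row, bitwidth bits per element."""
    mask = (1 << bitwidth) - 1
    packed = 0
    for row in values:
        for value in row:
            # Assumed ordering: earlier elements end up in more significant bits.
            packed = (packed << bitwidth) | (value & mask)
    return packed


packed = concatenate_values([[1, 2], [3, -1]], bitwidth=8)
print(hex(packed))  # 0x10203ff
```

Four 8-bit elements occupy 4 * 8 = 32 bits in total, matching the values.size * bitwidth rule above.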