Systolic Matrix MAC — Joshua Taylor

overview

Systolic arrays are the core compute primitive behind modern neural network accelerators — Google's TPU, Apple's Neural Engine, and virtually every edge AI chip use variants of this dataflow architecture. This project implements one from scratch in synthesizable SystemVerilog, designed to tape out on the Tiny Tapeout community multi-project wafer.

The design uses a weight-stationary dataflow with 4-bit inputs and 10-bit accumulators — sized deliberately for area efficiency on SkyWater 130nm. A 4-bit multiplier is ~6× smaller than an 8-bit one, bringing the full 4×4 array within Tiny Tapeout tile budget. The 10-bit accumulator is sufficient since the maximum dot product (15² × 4 = 900) fits cleanly with headroom. A control FSM handles the three-phase protocol: LOAD → COMPUTE → READBACK.

ARRAY SIZE

16

processing elements (4×4)

DATAPATH

4-bit in / 10-bit acc

4×4 multiply fits in 10 bits (15²×4 = 900)

LATENCY

N+3

cycles to first valid output

INTERFACE

SPI mode 0

up to 10 MHz, CPOL=0 CPHA=0

PROCESS NODE

SKY130

SkyWater 130nm open-source PDK

HOST MCU

RP2040

MicroPython firmware over USB

architecture

Each of the 16 PEs performs a MAC operation each clock cycle: acc += a_reg * b_reg — using registered inputs rather than direct inputs so data propagates with the correct one-cycle delay as it travels across the array. In weight-stationary mode, the weight is loaded once and held. The characteristic diagonal "wave" of valid data means output[i][j] is valid at cycle i + j + N, and comp_done asserts at counter cycle 7 (2N−1 = 7 for N=4).

4 × 4 Processing Element Array — animated data wavefront

System Pipeline

💻Python GUI

→

🔌USB / RP2040

→

📡SPI Slave

→

🧠Control FSM

→

⚙4×4 Array

→

📤SPI TX

→

📊Result Display

Control FSM States

The FSM drives the three-phase protocol that coordinates SPI, the array, and the readback engine. Input skewing — inserting cycle-offset delays so that row i of A and column j of B arrive at PE[i][j] simultaneously — is generated entirely in the COMPUTE state, avoiding any external buffering requirement.

STATE	DESCRIPTION
IDLE	Waiting for SPI load to complete (`load_done` assertion)
CLEAR	Synchronous reset of all 16 PE accumulators — one full cycle before any MAC activity
LOAD	Asserts `start` to PEs; boundary logic feeds skew-delayed A rows and B columns
COMPUTE	Counter increments each cycle; array accumulating; `comp_done` asserts at `cnt == 7`
DRAIN	Enables SPI TX to serialize the 16 × 10-bit results back to the RP2040

modules

The design is split into seven source files with a strict compile order: primitives.sv → pe.sv → systolic_array_4x4.sv → spi_slave.sv → spi_tx.sv → control_fsm.sv → tt_um_systolic_top.sv. Each module is independently testable.

⬡

Processing Element

Single MAC unit — acc += a_reg × b_reg per clock. 4-bit inputs, 10-bit accumulator. Registered pass-through on A and B creates the one-cycle propagation delay the systolic dataflow depends on.

pe.sv

⊞

Systolic Array 4×4

16 PE instances in a mesh connected by generate loops. Boundary logic skews A rows and B columns using the cycle counter: row i feeds at cnt >= i, col j feeds at cnt >= j.

systolic_array_4x4.sv

◈

Control FSM

5-state one-hot FSM: IDLE → CLEAR → LOAD → COMPUTE → DRAIN. The explicit CLEAR state before COMPUTE was the fix for the accumulator corruption bug found on Basys3.

control_fsm.sv

⇅

SPI Slave

Deserializes 33 SPI bytes (1 command + 16 A bytes + 16 B bytes) into 4×4 unpacked arrays. Asserts load_done when complete to trigger the FSM.

spi_slave.sv

⇈

SPI TX

Serializes 16 × 10-bit results back to the RP2040 on MISO after DRAIN is entered. Handles MSB-first transmission gated by comp_done.

spi_tx.sv

⊕

Top Wrapper

Tiny Tapeout top-level conforming to yaml_version: 6 port conventions. Maps all modules to the 8-bit ui_in / uo_out / uio_* bus. Includes a primitives.sv layer for DFFs and synchronizers.

tt_um_systolic_top.sv

hardware debugging

Before finalizing RTL for tapeout, the design was synthesized on a Basys3 (Artix-7) FPGA with a UART test harness for hardware-in-the-loop verification. Three bugs were caught that simulation hadn't surfaced:

Clear/start race. On back-to-back runs, PE accumulators occasionally held stale partial results. The FSM was transitioning CLEAR → COMPUTE on the same edge the synchronous clear applied — zero timing margin on hardware. Fix: an explicit CLEAR state in the FSM, registering acc_clear a full cycle before any MAC begins.

comp_done one cycle early. The last column of outputs was always zero because the readback started before the final MAC committed. comp_done was asserted at cnt == 6 but the PE pipeline register meant results weren't architecturally visible until cycle 7. Fix: assert at cnt == 7.

PE[0][0] accumulator corruption. PE[0][0]'s output was consistently wrong by exactly one partial product — every other PE was correct. The start gate on the PE wasn't high in time for the very first valid cycle at [0][0], causing the first multiply to be dropped. Fix: ensured start is asserted one cycle before data enters the array boundary.

FPGA Synthesis on Basys3

Vivado synthesis confirmed timing closure at the target clock frequency with comfortable slack. The 4-bit multipliers synthesize entirely to LUT logic (no DSP48E1 primitives consumed), and the full 4×4 array fits with low utilization — validating there are no synthesis-blocking constructs before committing to the shuttle.

UART Test Infrastructure

Rather than Vivado's ILA (which adds LUT overhead and complicates timing closure), a lightweight UART readback path was wired in for hardware testing: the Basys3 receives test vectors over UART at 115200 baud, runs the full CLEAR → LOAD → COMPUTE → DRAIN sequence through the actual synthesized RTL, and streams the 16 output values to a Python script that compares them against a NumPy golden model. Any mismatch prints PE coordinates, expected value, actual value, and cycle number.

verification

Constrained Random Simulation

The simulation testbench drives the array with constrained random stimulus across 1,000-cycle runs. Randomized 4-bit A and B matrix inputs are generated in SystemVerilog and mirrored to a Python golden model; every output is checked against the software reference and any mismatch asserts a failure flag with full diagnostic output — PE coordinates, expected value, actual value, and cycle number.

Constraint groups target corner cases that exercise accumulator boundaries: all-zeros, all-ones, max value (0xF), and mixed patterns. These ensure the 10-bit accumulator ceiling (15² × 4 = 900 max, well within 10-bit range) and the synchronous clear behavior are hit repeatedly alongside the random baseline. The pe_tb passes 1,000/1,000 directed random trials.

Timeline

PE + Array RTL

Wrote and simulated all 16 PE instances with generate loops. Verified MAC accumulation against Python numpy reference.

SPI + FSM integration

Wired SPI slave/TX to FSM; verified full LOAD → COMPUTE → READBACK sequence in simulation with SPI waveform stimulus.

Basys3 FPGA synthesis & hardware debug

Implemented on Artix-7. Discovered and fixed all three hardware bugs (clear/start race, comp_done timing, PE[0][0] corruption) using UART test harness.

Tiny Tapeout wrapper

Wrote tt_top.sv conforming to yaml_version 6 port conventions. Verified wrapper passes Tiny Tapeout GitHub Actions CI. Cleaned repo to seven-file compile order.

TTSKY26a shuttle submission

Design submitted to Tiny Tapeout TTSKY26a — SkyWater 130nm open-source PDK. Awaiting fab return.

Resource Utilization (Basys3 Artix-7)

RESOURCE	USED	AVAILABLE	NOTES
LUT	~180	20,800	4-bit multipliers synthesize entirely to LUT logic
FF	~160	41,600	PE registers + FSM state + SPI shift regs
DSP48E1	0	90	No DSP inference — 4×4 mults too small
Critical Path	—		timing closure confirmed, comfortable slack

code

The PE is the heart of the design — 20 lines of RTL that encode the entire systolic dataflow. The start gate prevents spurious accumulation before the FSM is ready; the registered a_reg / b_reg (not the raw inputs) are what get multiplied, creating the one-cycle propagation delay that makes the wavefront work.

pe.sv — processing element (4-bit × 4-bit → 10-bit acc)

module pe (
    input  logic        clk,
    input  logic        rst_n, clear, start,
    input  logic [3:0]  a_in, b_in,
    output logic [3:0]  a_out, b_out,
    output logic [9:0]  acc
);
    logic [3:0] a_reg, b_reg;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            a_reg <= 4'b0;
            b_reg <= 4'b0;
            acc   <= 10'b0;
        end else if (clear) begin
            acc   <= 10'b0;
            a_reg <= 4'b0;
            b_reg <= 4'b0;
        end else if (start) begin
            a_reg <= a_in;
            b_reg <= b_in;
            acc   <= acc + (a_reg * b_reg);  // registered inputs — one-cycle propagation delay
        end
    end

    assign a_out = a_reg;
    assign b_out = b_reg;
endmodule : pe

3d layout viewer