AXI-Stream Arbitration in SystemVerilog

Packet, beat, and weighted round-robin policies in verilaxi

Posted by Nelson Campos on March 22, 2026

AXI-Stream systems often need several producers to share a common downstream path. A video pipeline may merge several processing stages into one output. A packet switch may combine multiple sources before transmission. A verification environment may multiplex several traffic generators into a single sink. In all of these cases an arbiter decides who gets access to the shared stream.

At first glance this looks like a mux plus some priority logic. In practice the policy matters a great deal. Should the arbiter switch source every cycle, or only at packet boundaries? Should all sources receive equal service, or should one stream receive more bandwidth? These choices affect fairness, packet integrity, backpressure behavior, and how reusable the block is across designs.

In verilaxi, the AXI-Stream arbiter explores three policies in a single parameterized module:

  • packet-based round-robin
  • beat-based round-robin
  • weighted packet arbitration

Figure (1) shows the arbiter.

Figure (1): AXI-Stream arbiter in verilaxi.

Why the policy matters

If AXI-Stream beats are independent, beat-by-beat arbitration is attractive. The arbiter can rotate after every accepted transfer and share bandwidth finely across all sources.

However, AXI-Stream often carries packets, video lines, or frames. In those cases TLAST marks a meaningful boundary. If arbitration changes source mid-packet, the sink observes an interleaved stream where beats from different packets are mixed together. For most packet-oriented sinks that is the wrong behavior, and it is not recoverable without additional framing logic downstream.

This is why packet-based arbitration is a good default. Once a source wins, it keeps ownership until the last beat handshakes:

TVALID && TREADY && TLAST

Only then does the arbiter rotate to the next source.
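
The ownership rule above can also be checked from the outside. The following is a minimal simulation-only sketch, not part of verilaxi: the module name axis_pkt_owner_mon and the src_id input are hypothetical, and the sketch assumes the testbench can supply a tag that identifies which source drives each beat and that is stable at the clock edge (for instance a TID-style field or a few reserved data bits).

```systemverilog
// Hypothetical checker: counts accepted beats whose source differs from the
// source that sent the first beat of the same packet.
module axis_pkt_owner_mon #(
    parameter int SEL_W = 2
) (
    input logic             clk,
    input logic             rst_n,
    input logic             tvalid,
    input logic             tready,
    input logic             tlast,
    input logic [SEL_W-1:0] src_id   // assumed tag: source driving this beat
);
    logic             in_packet;  // a packet has started but not finished
    logic [SEL_W-1:0] owner;      // source that sent the packet's first beat
    int               errors;     // number of mid-packet source changes seen

    always_ff @(posedge clk) begin
        if (!rst_n) begin
            in_packet <= 1'b0;
            errors    <= 0;
        end else if (tvalid && tready) begin
            if (in_packet && (src_id != owner)) begin
                errors <= errors + 1;
                $display("[MON] source changed mid-packet: owner=%0d now=%0d",
                         owner, src_id);
            end
            if (!in_packet) owner <= src_id;
            in_packet <= !tlast;  // the TLAST handshake releases ownership
        end
    end
endmodule
```

In packet mode the error count should stay at zero. In beat mode the same monitor would be expected to report interleaving whenever multi-beat packets compete, which is exactly the behavior described above.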

Parameters and interface

The arbiter is a single module. All N source ports are packed into arrays rather than using SystemVerilog interfaces on the synthesizable boundary, which keeps the RTL tool-agnostic:

module snix_axis_arbiter #(
    parameter int  NUM_SRCS    = 4,
    parameter int  DATA_WIDTH  = 8,
    parameter int  USER_WIDTH  = 1,
    parameter bit  HOLD_PACKET = 1'b1,
    parameter int  WEIGHT_W    = 4,
    parameter logic [NUM_SRCS*WEIGHT_W-1:0] WEIGHTS = '0
) (
    input  logic clk, rst_n,
    input  logic [NUM_SRCS-1:0][DATA_WIDTH-1:0] s_axis_tdata,
    input  logic [NUM_SRCS-1:0]                 s_axis_tvalid,
    input  logic [NUM_SRCS-1:0]                 s_axis_tlast,
    output logic [NUM_SRCS-1:0]                 s_axis_tready,
    output logic [DATA_WIDTH-1:0] m_axis_tdata,
    output logic                  m_axis_tvalid,
    output logic                  m_axis_tlast,
    input  logic                  m_axis_tready);

Two parameters control the arbitration policy:

  • HOLD_PACKET: 1 gives packet mode (hold the grant until TLAST); 0 gives beat mode (release after every accepted beat). All three policies live in the same RTL — switching modes is a single parameter change.
  • WEIGHTS: a packed vector of per-source service weights with source 0 in the least-significant bits. The default '0 gives each source equal weight.

Packet-based round-robin

The packet arbiter keeps three main pieces of registered state:

  • sel: the currently selected source
  • locked: whether the current selection is being held, either because a beat is stalled or because a packet is still in progress
  • rr_ptr: the starting point for the next arbitration scan

The combinational block scans the sources starting at rr_ptr, picks the first valid eligible source, and immediately drives it to the output. There is no idle cycle between the moment a source becomes valid and the moment its first beat appears at the output.

The locked signal serves two distinct purposes, and understanding both is important for reading the RTL.

Purpose 1: stabilize the selection while a beat is stalled. The combinational scan runs every cycle. If the arbiter has presented a source on the output but TREADY is low, a source earlier in the scan order could become valid and the scan would switch to it on the next cycle. That would change the output payload while TVALID is high and no handshake has occurred, which AXI-Stream does not allow. To prevent this, the sequential logic locks onto the chosen source as soon as a selection is made, even before any handshake occurs:

// No handshake yet: hold the chosen source stable
if (!handshake) begin
    sel    <= arb_sel;
    locked <= 1'b1;
end

Purpose 2: hold the grant through a multi-beat packet. Once the first beat of a multi-beat packet is accepted, the arbiter stays locked until the beat carrying TLAST handshakes:

// First beat accepted, more beats remain: hold until TLAST
else if (HOLD_PACKET && !s_axis_tlast[arb_sel]) begin
    sel    <= arb_sel;
    locked <= 1'b1;
end

If the very first beat of a packet also carries TLAST (a single-beat packet), neither lock path fires. The arbiter completes the unit immediately, advances rr_ptr, and is ready to arbitrate again the next cycle without ever setting locked.

Only the selected source receives TREADY. All other inputs are held off regardless of how long the sink stalls. The TREADY assignment uses a constant-index loop to keep the generated logic straightforward across simulators and synthesizers:

for (int i = 0; i < NUM_SRCS; i++) begin
    s_axis_tready[i] = (eff_valid && (eff_sel == SEL_W'(i)))
                      ? m_axis_tready : 1'b0;
end

How rr_ptr works

The round-robin pointer determines where the next arbitration scan begins. Instead of always starting from source 0, the scan starts at rr_ptr and wraps around. If there are N sources, the candidate at scan position i is:

idx = (rr_ptr + i) mod N,  i = 0 ... N-1

For example, with N = 4 and rr_ptr = 2, the scan order is:

2, 3, 0, 1

In the RTL this is implemented without a modulo operator to keep synthesis clean:

idx = int'(rr_ptr) + i;
if (idx >= NUM_SRCS) idx -= NUM_SRCS;

Once a packet (or beat) completes, the pointer advances to the source immediately after the winner:

rr_ptr <= SEL_W'(int'(sel) == NUM_SRCS-1 ? 0 : int'(sel) + 1);

This ensures the source that just won is not the first candidate in the next scan, which is what gives the arbiter its fairness.
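
The scan and the pointer update can be tried in isolation. The sketch below wraps them in two functions; the package name rr_sketch_pkg, the function names rr_pick and rr_next, and the fixed N = 4 are illustrative assumptions, not names from verilaxi.

```systemverilog
package rr_sketch_pkg;
    localparam int N = 4;  // number of sources in this sketch

    // First requesting index at or after ptr, wrapping around; -1 if none.
    function automatic int rr_pick(input logic [N-1:0] req, input int ptr);
        int idx;
        for (int i = 0; i < N; i++) begin
            idx = ptr + i;
            if (idx >= N) idx -= N;   // wrap without a modulo operator
            if (req[idx]) return idx;
        end
        return -1;
    endfunction

    // Pointer value after the winner completes a unit: next source, wrapping.
    function automatic int rr_next(input int winner);
        return (winner == N-1) ? 0 : winner + 1;
    endfunction
endpackage

module rr_sketch_demo;
    import rr_sketch_pkg::*;
    initial begin
        // rr_ptr = 2 with sources 0, 1 and 3 requesting: scan order 2, 3, 0, 1.
        $display("winner=%0d", rr_pick(4'b1011, 2));  // source 2 is idle, so 3
        $display("next rr_ptr=%0d", rr_next(3));      // wraps to 0
    end
endmodule
```

With source 2 idle, the scan that starts at 2 falls through to source 3, and the pointer then wraps to 0.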

Packet-mode example

Assume:

  • NUM_SRCS = 4, initial rr_ptr = 0, all four sources valid
  • packet lengths: src0 = 3, src1 = 2, src2 = 1, src3 = 4 beats

Round  rr_ptr before scan  Scan order  Winner  Packet length  rr_ptr after completion
1      0                   0,1,2,3     0       3              1
2      1                   1,2,3,0     1       2              2
3      2                   2,3,0,1     2       1              3
4      3                   3,0,1,2     3       4              0

Table (1): Packet-based round-robin example.

Complete packets are forwarded one at a time. Each source gets its turn, and the next scan always starts after the previous winner.

Beat-based round-robin

Beat mode is enabled by setting HOLD_PACKET = 0. No other change to the RTL is needed. The beat test instantiates the DUT as:

snix_axis_arbiter #(
    .NUM_SRCS   (4),
    .HOLD_PACKET(1'b0)   // beat mode
) u_dut ( ... );

The entire behavioral difference between packet and beat mode reduces to two lines:

assign unit_last = HOLD_PACKET ? s_axis_tlast[eff_sel] : 1'b1;
assign unit_done = handshake && unit_last;

In packet mode, unit_last follows TLAST, so unit_done fires only on the final beat of a packet. In beat mode, unit_last is always 1, so unit_done fires on every accepted beat and rr_ptr advances immediately. The rest of the state machine is identical.

The same pre-handshake lock still applies. Beat mode does not mean "switch every cycle"; it means "switch after each accepted beat". While TREADY is low, the current source remains selected until the handshake completes.

Beat-mode example

Using the same source lengths as before (src0=3, src1=2, src2=1, src3=4 beats), the accepted beats evolve as shown in Table (2). When a source is exhausted it drops TVALID and the scan skips it.

Accepted beat  rr_ptr before scan  Winner  Remaining beats after transfer  rr_ptr after beat
0              0                   0       src0=2, src1=2, src2=1, src3=4  1
1              1                   1       src0=2, src1=1, src2=1, src3=4  2
2              2                   2       src0=2, src1=1, src2=0, src3=4  3
3              3                   3       src0=2, src1=1, src2=0, src3=3  0
4              0                   0       src0=1, src1=1, src2=0, src3=3  1
5              1                   1       src0=1, src1=0, src2=0, src3=3  2
6              2                   3       src0=1, src1=0, src2=0, src3=2  0
7              0                   0       src0=0, src1=0, src2=0, src3=2  1
8              1                   3       src0=0, src1=0, src2=0, src3=1  0
9              0                   3       src0=0, src1=0, src2=0, src3=0  0

Table (2): Beat-based arbitration example.

The difference is clear: beat mode interleaves traffic whenever several sources remain active. At beat 6 the scan starts at rr_ptr=2, skips the exhausted src2, and picks src3. The sink sees beats from multiple packets in alternating order, which is only safe when each beat is independently meaningful.
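
Both walk-throughs can be reproduced with a small behavioral model. This is a simulation-only sketch under stated assumptions, not the verilaxi RTL: the module name rr_order_model and the function run are hypothetical, the sink is always ready, and a source with no beats left is treated as having dropped TVALID.

```systemverilog
module rr_order_model;
    localparam int N = 4;

    // Returns the winner of each accepted beat as a string of digits.
    function automatic string run(input bit hold_packet);
        int    remaining[N] = '{3, 2, 1, 4};  // beats left per source
        int    rr_ptr = 0;
        int    owner  = -1;                   // source holding the grant, if any
        int    total  = 0;
        string order  = "";
        int    win, idx;
        foreach (remaining[i]) total += remaining[i];
        for (int b = 0; b < total; b++) begin
            win = owner;
            // No grant held: scan from rr_ptr, skipping exhausted sources.
            for (int i = 0; i < N && win < 0; i++) begin
                idx = rr_ptr + i;
                if (idx >= N) idx -= N;
                if (remaining[idx] > 0) win = idx;
            end
            remaining[win]--;
            order = {order, $sformatf("%0d", win)};
            if (!hold_packet || remaining[win] == 0) begin
                owner  = -1;                  // unit done: release and rotate
                rr_ptr = (win == N-1) ? 0 : win + 1;
            end else begin
                owner  = win;                 // packet mode: hold until last beat
            end
        end
        return order;
    endfunction

    initial begin
        $display("packet mode: %s", run(1'b1));
        $display("beat mode:   %s", run(1'b0));
    end
endmodule
```

Under these assumptions the model prints 0001123333 for packet mode and 0123013033 for beat mode, which matches the winner columns of Table (1) and Table (2).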

Weighted packet arbitration

Equal round-robin is fair, but sometimes one source should be serviced more often than others. A weighted arbiter solves this while still keeping packet boundaries intact.

Each source gets a credit counter initialized from its configured weight. A source is eligible only when it is both valid and has nonzero credit. When a packet completes, that source's credit is decremented. Once all active credits reach zero, the arbiter reloads them from WEIGHTS and starts a new weighted round.

Weights are packed into the parameter with source 0 in the least-significant bits:

// {W3, W2, W1, W0} = {1, 1, 2, 4} for a 4:2:1:1 ratio
localparam logic [NUM_SRCS*WEIGHT_W-1:0] WEIGHTS = {4'd1, 4'd1, 4'd2, 4'd4};

Credits are initialized by a helper function that also handles the zero-weight default:

function logic [WEIGHT_W-1:0] cfg_weight(input int src_idx);
    logic [WEIGHT_W-1:0] raw;
    raw = WEIGHTS[src_idx*WEIGHT_W +: WEIGHT_W];
    cfg_weight = (raw == '0) ? WEIGHT_W'(1) : raw;
endfunction

A weight of zero is treated as one, not as "never scheduled". This means the default WEIGHTS = '0 gives every source one credit per round — exactly equal round-robin — so no special case is needed for the unweighted instantiation.
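
One way the credit counters might be organized is sketched below. It is an illustration only: the module name credit_bank_sketch and the ports valid, unit_done, winner_1h, and eligible are hypothetical, the reset is assumed synchronous, and the reload here costs one clock cycle, whereas the arbiter described in this post is ready immediately after a reload, so the actual RTL may handle that step differently.

```systemverilog
module credit_bank_sketch #(
    parameter int NUM_SRCS = 4,
    parameter int WEIGHT_W = 4,
    parameter logic [NUM_SRCS*WEIGHT_W-1:0] WEIGHTS = '0
) (
    input  logic                clk,
    input  logic                rst_n,
    input  logic [NUM_SRCS-1:0] valid,      // per-source TVALID
    input  logic                unit_done,  // a packet just completed
    input  logic [NUM_SRCS-1:0] winner_1h,  // one-hot owner of that packet
    output logic [NUM_SRCS-1:0] eligible    // valid and still has credit
);
    // Same zero-means-one rule as the helper shown above.
    function automatic logic [WEIGHT_W-1:0] cfg_weight(input int src_idx);
        logic [WEIGHT_W-1:0] raw;
        raw = WEIGHTS[src_idx*WEIGHT_W +: WEIGHT_W];
        return (raw == '0) ? WEIGHT_W'(1) : raw;
    endfunction

    logic [NUM_SRCS-1:0][WEIGHT_W-1:0] credit;
    logic                              need_reload;

    always_comb begin
        for (int i = 0; i < NUM_SRCS; i++)
            eligible[i] = valid[i] && (credit[i] != '0);
        // Some source wants service, but none of them has credit left.
        need_reload = (|valid) && !(|eligible);
    end

    always_ff @(posedge clk) begin
        if (!rst_n || need_reload) begin
            for (int i = 0; i < NUM_SRCS; i++)
                credit[i] <= cfg_weight(i);
        end else if (unit_done) begin
            for (int i = 0; i < NUM_SRCS; i++) begin
                if (winner_1h[i] && (credit[i] != '0))
                    credit[i] <= credit[i] - 1'b1;
            end
        end
    end
endmodule
```

The eligible vector would replace the plain TVALID vector as the input to the round-robin scan, so a source without credit is skipped exactly like an idle one.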

Weighted example

With weights src0:src1:src2:src3 = 4:2:1:1, initial credits (4,2,1,1), all sources valid throughout:

Grant  rr_ptr  Credits before     Winner  Credits after
1      0       (4,2,1,1)          0       (3,2,1,1)
2      1       (3,2,1,1)          1       (3,1,1,1)
3      2       (3,1,1,1)          2       (3,1,0,1)
4      3       (3,1,0,1)          3       (3,1,0,0)
5      0       (3,1,0,0)          0       (2,1,0,0)
6      1       (2,1,0,0)          1       (2,0,0,0)
7      2       (2,0,0,0)          0       (1,0,0,0)
8      1       (1,0,0,0)          0       (0,0,0,0)
-      -       all zero → reload  -       (4,2,1,1)

Table (3): Weighted packet arbitration example.

At grant 7, rr_ptr = 2 and the scan order is 2, 3, 0, 1. Sources 2 and 3 have exhausted their credits, so src0 wins with its remaining credit of 2. At grant 8, only src0 still has credit and wins the final slot. After grant 8 all credits are zero: the arbiter detects that active sources exist but none have credit, reloads, and is immediately ready to start the next weighted round.

Weights control how often a source may win per round; they do not affect packet boundaries. Each grant still forwards a complete packet before the next scan.
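
The grant order in Table (3) can be reproduced with a short behavioral model. As before this is a hedged sketch rather than the verilaxi RTL: the module name weighted_order_model and the function run are hypothetical, every source is assumed valid at all times, and each grant stands for one complete packet.

```systemverilog
module weighted_order_model;
    localparam int N = 4;

    // Winners of the first `grants` packet grants, as a string of digits.
    function automatic string run(input int grants);
        int    weight[N] = '{4, 2, 1, 1};   // src0..src3
        int    credit[N] = '{4, 2, 1, 1};
        int    rr_ptr = 0;
        string order  = "";
        int    win, idx, left;
        for (int g = 0; g < grants; g++) begin
            // Reload when no source has credit left.
            left = 0;
            foreach (credit[i]) left += credit[i];
            if (left == 0) credit = weight;
            // Scan from rr_ptr, skipping sources without credit.
            win = -1;
            for (int i = 0; i < N && win < 0; i++) begin
                idx = rr_ptr + i;
                if (idx >= N) idx -= N;
                if (credit[idx] > 0) win = idx;
            end
            credit[win]--;
            order  = {order, $sformatf("%0d", win)};
            rr_ptr = (win == N-1) ? 0 : win + 1;
        end
        return order;
    endfunction

    initial $display("grant order: %s", run(8));
endmodule
```

For the first eight grants the model prints 01230100: src0 wins four times, src1 twice, and src2 and src3 once each, which is the 4:2:1:1 ratio.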

How the RTL is organized

The implementation uses one always_comb block and one always_ff block. Keeping them separate makes the design easy to read: anything that affects registers is in the sequential block and nothing else is.

The combinational block scans sources, picks arb_sel, drives the output mux, and computes TREADY. The effective selection is a two-way mux between the locked source and the current scan winner:

eff_sel   = locked ? sel : arb_sel;
eff_valid = locked ? s_axis_tvalid[sel] : arb_valid;

The unit_last and unit_done signals fold packet and beat mode into a single completion condition, so the part of the sequential block that advances rr_ptr and decrements the credit does not need a separate branch per mode:

assign unit_last = HOLD_PACKET ? s_axis_tlast[eff_sel] : 1'b1;
assign unit_done = handshake && unit_last;

The sequential block then acts on unit_done to advance rr_ptr and decrement the credit. Changing HOLD_PACKET is the only modification needed to switch between modes.
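
Putting the pieces together, the organization described in this section can be sketched as a compact module. This is not the verilaxi source: the name axis_arb_sketch is hypothetical, weights and TUSER are left out, the reset is assumed synchronous, and the pointer update uses eff_sel so that a single-beat winner that never sets locked is also covered.

```systemverilog
module axis_arb_sketch #(
    parameter int NUM_SRCS    = 4,
    parameter int DATA_WIDTH  = 8,
    parameter bit HOLD_PACKET = 1'b1
) (
    input  logic                                clk,
    input  logic                                rst_n,
    input  logic [NUM_SRCS-1:0][DATA_WIDTH-1:0] s_axis_tdata,
    input  logic [NUM_SRCS-1:0]                 s_axis_tvalid,
    input  logic [NUM_SRCS-1:0]                 s_axis_tlast,
    output logic [NUM_SRCS-1:0]                 s_axis_tready,
    output logic [DATA_WIDTH-1:0]               m_axis_tdata,
    output logic                                m_axis_tvalid,
    output logic                                m_axis_tlast,
    input  logic                                m_axis_tready
);
    localparam int SEL_W = (NUM_SRCS > 1) ? $clog2(NUM_SRCS) : 1;

    logic [SEL_W-1:0] sel, rr_ptr, arb_sel, eff_sel;
    logic             locked, arb_valid, eff_valid;
    logic             handshake, unit_last, unit_done;

    always_comb begin
        int idx;
        arb_sel   = rr_ptr;
        arb_valid = 1'b0;
        // Scan from rr_ptr with wrap-around; the first valid source wins.
        for (int i = 0; i < NUM_SRCS; i++) begin
            idx = int'(rr_ptr) + i;
            if (idx >= NUM_SRCS) idx -= NUM_SRCS;
            if (!arb_valid && s_axis_tvalid[idx]) begin
                arb_sel   = SEL_W'(idx);
                arb_valid = 1'b1;
            end
        end
        // A held selection overrides the scan result.
        eff_sel   = locked ? sel : arb_sel;
        eff_valid = locked ? s_axis_tvalid[sel] : arb_valid;

        m_axis_tdata  = s_axis_tdata[eff_sel];
        m_axis_tlast  = s_axis_tlast[eff_sel];
        m_axis_tvalid = eff_valid;
        for (int i = 0; i < NUM_SRCS; i++)
            s_axis_tready[i] = (eff_valid && (eff_sel == SEL_W'(i)))
                             ? m_axis_tready : 1'b0;

        handshake = eff_valid && m_axis_tready;
        unit_last = HOLD_PACKET ? s_axis_tlast[eff_sel] : 1'b1;
        unit_done = handshake && unit_last;
    end

    always_ff @(posedge clk) begin
        if (!rst_n) begin
            sel    <= '0;
            rr_ptr <= '0;
            locked <= 1'b0;
        end else if (unit_done) begin
            // Packet or beat finished: release the grant and rotate.
            locked <= 1'b0;
            rr_ptr <= (int'(eff_sel) == NUM_SRCS-1) ? '0 : eff_sel + 1'b1;
        end else if (eff_valid) begin
            // Stalled beat or unfinished packet: hold the selection.
            sel    <= eff_sel;
            locked <= 1'b1;
        end
    end
endmodule
```

The two non-reset branches of the sequential block correspond to the two purposes of locked discussed earlier, and neither branch tests HOLD_PACKET directly because unit_done already encodes the mode.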

How the tests are built

Each test instantiates the same topology:

  • four AXI-Stream source BFMs, each driving an input register slice
  • the arbiter DUT receiving packed signal arrays from the register slices
  • an output register slice feeding a sink BFM
  • protocol checkers on all four source interfaces and on the output

The register slices serve a practical purpose. They align BFM-driven signals to clock edges and eliminate direct combinational paths from the BFMs to the DUT, which prevents convergence issues in Verilator under backpressure. The arbitration algorithm does not require them; they are a testbench concern.

Source and sink backpressure are controlled at runtime via plusargs so each test runs in both configurations without recompilation:

void'($value$plusargs("SRC_BP=%d",  src_bp_en));
void'($value$plusargs("SINK_BP=%d", sink_bp_en));

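When enabled, source backpressure makes each source BFM randomly withhold TVALID, and sink backpressure makes the sink BFM randomly withhold TREADY, each approximately 20% of the time.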
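
A plusarg-controlled stall generator for the sink side might look like the sketch below. It is not the verilaxi BFM: the module name sink_ready_throttle and the STALL_PCT parameter are hypothetical, and the 20% figure is only approximated by a uniform random draw each cycle.

```systemverilog
module sink_ready_throttle #(
    parameter int STALL_PCT = 20   // approximate percentage of stalled cycles
) (
    input  logic clk,
    input  logic rst_n,
    output logic tready
);
    int sink_bp_en = 0;            // 0: always ready, nonzero: random stalls

    // Leaves sink_bp_en at 0 when +SINK_BP=<n> is not given on the command line.
    initial void'($value$plusargs("SINK_BP=%d", sink_bp_en));

    always_ff @(posedge clk) begin
        if (!rst_n)
            tready <= 1'b0;
        else if (sink_bp_en == 0)
            tready <= 1'b1;
        else
            // Draw 0..99 and stall when the draw falls below STALL_PCT.
            tready <= (int'($urandom_range(99)) >= STALL_PCT);
    end
endmodule
```

Running the same compiled simulation with and without +SINK_BP=1 then covers both configurations, since AXI-Stream allows a sink to change TREADY freely from cycle to cycle.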

The packet test (axis_arbiter) runs in two phases. Phase 1 is sequential: each source sends one packet alone with no competing traffic, confirming basic forwarding and that the arbiter does not corrupt or drop beats. Phase 2 is concurrent: all four sources send three packets each simultaneously, exercising round-robin rotation and correct behavior under backpressure. The expected totals are hard-coded:

// Phase 1: 4+5+3+6 = 18 beats, 4 packets
// Phase 2: 3*(4+5+3+6) = 54 beats, 12 packets
if (beats_recv !== 72)
    $error("[ARB] FAIL: expected 72 beats, got %0d", beats_recv);
else if (pkts_recv !== 16)
    $error("[ARB] FAIL: expected 16 packets, got %0d", pkts_recv);

The weighted test (axis_arbiter_weighted) sends exactly W0+W1+W2+W3 = 8 packets concurrently and checks that each source's grant count matches its configured weight exactly:

else if (src_grants[0] !== W0)
    $error("[WARB] FAIL: src0 grants expected %0d, got %0d", W0, src_grants[0]);

The beat test (axis_arbiter_beat) sends four equal 3-beat packets concurrently (12 beats total). When source backpressure is disabled, it checks that the accepted beat sequence follows strict round-robin ordering:

for (int i = 0; i < TOTAL_BEATS; i++) begin
    if (sel_hist[i] !== (i % NUM_SRCS))
        $error("[BARB] FAIL: expected beat %0d from src%0d, got src%0d",
               i, i % NUM_SRCS, sel_hist[i]);
end

With equal packet lengths and no backpressure, the expected sequence is exactly 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3. The per-source beat counts are also checked for balance regardless of backpressure mode.

A useful verification lesson

The beat test required a careful choice of observation point for its accepted-beat monitor, and the issue is worth explaining because it recurs in handshake verification.

The arbiter's eff_sel is combinational: it resolves in the same cycle as the handshake. Naively, one might sample eff_sel at the posedge alongside the handshake signals. The problem is that in a Verilator simulation, the posedge triggers multiple update passes: registered state changes, combinational re-evaluation, and BFM ready toggling can all happen in the same delta-cycle group. In edge cases the monitor could read an eff_sel that has already been updated for the next arbitration cycle rather than the source that actually drove the accepted beat.

The fix is to capture eff_sel at the negedge immediately before the posedge. At that point the registered state from the previous cycle is stable, no new handshakes are in flight, and eff_sel correctly reflects the source currently driving the bus:

forever begin
    @(negedge clk);
    beat_sel_pre = int'(u_dut.eff_sel); // stable: posedge not yet
    @(posedge clk);
    beat_sel_mon = beat_sel_pre;        // use the pre-edge value
    if (arb_tvalid_w && arb_tready_w) begin
        sel_hist[beats_recv] = beat_sel_mon;
        beats_recv++;
    end
end

This is a recurring pattern: the interesting signal (which source drove a beat) and the trigger that confirms it (the posedge handshake) are not the same event. Sampling the interesting signal at the negedge cleanly separates the two without needing any delta-cycle reasoning. The same technique applies to any combinational signal that you want to attribute to a specific clock edge.

When to use each mode

Packet round-robin is the safest default when TLAST marks a real packet, frame, or line boundary.

Beat round-robin is useful when beats are independent and finer bandwidth sharing matters more than packet integrity.

Weighted packet arbitration is useful when some streams should receive more service than others without breaking packet boundaries.

Where to go next

This arbiter is already a useful AXI-Stream primitive, but it also opens the door to several natural extensions:

  • weighted beat arbitration
  • TDEST-qualified arbitration
  • destination-based routing
  • full AXI-Stream switches
  • AXI memory-mapped arbiters for shared DMA and CDMA access

Even in its current form it is a good example of how a small RTL block becomes much more useful once the policy is explicit and the tests clearly prove the intended behavior.
