AXI-Stream systems often need several producers to share a common downstream path. A video pipeline may merge several processing stages into one output. A packet switch may combine multiple sources before transmission. A verification environment may multiplex several traffic generators into a single sink. In all of these cases an arbiter decides who gets access to the shared stream.
At first glance this looks like a mux plus some priority logic. In practice the policy matters a great deal. Should the arbiter switch source every cycle, or only at packet boundaries? Should all sources receive equal service, or should one stream receive more bandwidth? These choices affect fairness, packet integrity, backpressure behavior, and how reusable the block is across designs.
In verilaxi, the AXI-Stream arbiter explores three policies in a single parameterized module:
- packet-based round-robin
- beat-based round-robin
- weighted packet arbitration
Figure (1) shows the arbiter.
Figure (1): AXI-Stream arbiter in verilaxi.
Why the policy matters
If AXI-Stream beats are independent, beat-by-beat arbitration is attractive. The arbiter can rotate after every accepted transfer and share bandwidth finely across all sources.
However, AXI-Stream often carries packets, video lines, or frames. In those cases TLAST marks a meaningful boundary. If arbitration changes source mid-packet, the sink observes an interleaved stream where beats from different packets are mixed together. For most packet-oriented sinks that is the wrong behavior, and it is not recoverable without additional framing logic downstream.
This is why packet-based arbitration is a good default. Once a source wins, it keeps ownership until the last beat handshakes:
TVALID && TREADY && TLAST
Only then does the arbiter rotate to the next source.
Parameters and interface
The arbiter is a single module. All N source ports are packed into arrays rather than using SystemVerilog interfaces on the synthesizable boundary, which keeps the RTL tool-agnostic:
module snix_axis_arbiter #(
    parameter int NUM_SRCS = 4,
    parameter int DATA_WIDTH = 8,
    parameter int USER_WIDTH = 1,
    parameter bit HOLD_PACKET = 1'b1,
    parameter int WEIGHT_W = 4,
    parameter logic [NUM_SRCS*WEIGHT_W-1:0] WEIGHTS = '0
) (
    input  logic clk, rst_n,
    input  logic [NUM_SRCS-1:0][DATA_WIDTH-1:0] s_axis_tdata,
    input  logic [NUM_SRCS-1:0] s_axis_tvalid,
    input  logic [NUM_SRCS-1:0] s_axis_tlast,
    output logic [NUM_SRCS-1:0] s_axis_tready,
    output logic [DATA_WIDTH-1:0] m_axis_tdata,
    output logic m_axis_tvalid,
    output logic m_axis_tlast,
    input  logic m_axis_tready
);
Two parameters control the arbitration policy:
- HOLD_PACKET: 1 gives packet mode (hold the grant until TLAST); 0 gives beat mode (release after every accepted beat). All three policies live in the same RTL, so switching modes is a single parameter change.
- WEIGHTS: a packed vector of per-source service weights with source 0 in the least-significant bits. The default '0 gives each source equal weight.
Packet-based round-robin
The packet arbiter keeps three main pieces of registered state:
- sel: the currently selected source
- locked: whether a transfer is in progress
- rr_ptr: the starting point for the next arbitration scan
The combinational block scans the sources starting at rr_ptr, picks the first valid eligible source, and immediately drives it to the output. There is no idle cycle between the moment a source becomes valid and the moment its first beat appears at the output.
The locked signal serves two distinct purposes, and understanding both is important for reading the RTL.
Purpose 1 — stabilize the selection while a beat is stalled. The combinational scan runs every cycle. If the arbiter has chosen a source but TREADY is low, the scan would be free to switch to a different source on the next cycle. To prevent this, the sequential logic locks onto the chosen source as soon as a selection is made, even before any handshake occurs:
// No handshake yet: hold the chosen source stable
if (!handshake) begin
    sel    <= arb_sel;
    locked <= 1'b1;
end
Purpose 2 — hold the grant through a multi-beat packet. Once the first beat of a multi-beat packet is accepted, the arbiter stays locked until TLAST completes:
// First beat accepted, more beats remain: hold until TLAST
else if (HOLD_PACKET && !s_axis_tlast[arb_sel]) begin
    sel    <= arb_sel;
    locked <= 1'b1;
end
If the very first beat of a packet also carries TLAST (a single-beat packet), neither lock path fires. The arbiter completes the unit immediately, advances rr_ptr, and is ready to arbitrate again the next cycle without ever setting locked.
Only the selected source receives TREADY. All other inputs are held off regardless of how long the sink stalls. The TREADY assignment uses a constant-index loop to keep the generated logic straightforward across simulators and synthesizers:
for (int i = 0; i < NUM_SRCS; i++) begin
    s_axis_tready[i] = (eff_valid && (eff_sel == SEL_W'(i)))
                     ? m_axis_tready : 1'b0;
end
How rr_ptr works
The round-robin pointer determines where the next arbitration scan begins. Instead of always starting from source 0, the scan starts at rr_ptr and wraps around. If there are N sources, the candidate at scan position i is:
idx = (rr_ptr + i) mod N, i = 0 ... N-1
For example, with N = 4 and rr_ptr = 2, the scan order is:
2, 3, 0, 1
In the RTL this is implemented without a modulo operator to keep synthesis clean:
idx = int'(rr_ptr) + i;
if (idx >= NUM_SRCS) idx -= NUM_SRCS;
Once a packet (or beat) completes, the pointer advances to the source immediately after the winner:
rr_ptr <= SEL_W'(int'(sel) == NUM_SRCS-1 ? 0 : int'(sel) + 1);
This ensures the source that just won is not the first candidate in the next scan, which is what gives the arbiter its fairness.
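The scan and pointer update described above can be mirrored in a few lines of software. The following is a minimal Python reference sketch, not part of verilaxi; the function names scan_order, pick_winner, and next_rr_ptr are hypothetical and chosen only for illustration. It uses the same wrap-by-subtraction idea as the RTL snippet.

```python
def scan_order(rr_ptr, num_srcs):
    """Return the source indices in the order the scan visits them."""
    order = []
    for i in range(num_srcs):
        idx = rr_ptr + i
        if idx >= num_srcs:  # wrap by subtraction, no modulo needed
            idx -= num_srcs
        order.append(idx)
    return order


def pick_winner(rr_ptr, valid):
    """Return the first valid source in scan order, or None if none is valid."""
    for idx in scan_order(rr_ptr, len(valid)):
        if valid[idx]:
            return idx
    return None


def next_rr_ptr(winner, num_srcs):
    """The pointer moves to the source right after the winner, wrapping at the top."""
    return 0 if winner == num_srcs - 1 else winner + 1


print(scan_order(2, 4))                            # [2, 3, 0, 1]
print(pick_winner(2, [True, True, False, False]))  # 0 (sources 2 and 3 are skipped)
print(next_rr_ptr(3, 4))                           # 0
```

A model like this is handy as a scoreboard predictor: given the pointer and the valid vector, it states which source the arbiter is expected to grant.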
Packet-mode example
Assume:
- NUM_SRCS = 4, initial rr_ptr = 0, all four sources valid
- packet lengths: src0 = 3, src1 = 2, src2 = 1, src3 = 4 beats
| Round | rr_ptr before scan | Scan order | Winner | Packet length | rr_ptr after completion |
|---|---|---|---|---|---|
| 1 | 0 | 0,1,2,3 | 0 | 3 | 1 |
| 2 | 1 | 1,2,3,0 | 1 | 2 | 2 |
| 3 | 2 | 2,3,0,1 | 2 | 1 | 3 |
| 4 | 3 | 3,0,1,2 | 3 | 4 | 0 |
Table (1): Packet-based round-robin example.
Complete packets are forwarded one at a time. Each source gets its turn, and the next scan always starts after the previous winner.
Beat-based round-robin
Beat mode is enabled by setting HOLD_PACKET = 0. No other change to the RTL is needed. The beat test instantiates the DUT as:
snix_axis_arbiter #(
    .NUM_SRCS   (4),
    .HOLD_PACKET(1'b0)  // beat mode
) u_dut ( ... );
The entire behavioral difference between packet and beat mode reduces to two lines:
assign unit_last = HOLD_PACKET ? s_axis_tlast[eff_sel] : 1'b1;
assign unit_done = handshake && unit_last;
In packet mode, unit_last follows TLAST, so unit_done fires only on the final beat of a packet. In beat mode, unit_last is always 1, so unit_done fires on every accepted beat and rr_ptr advances immediately. The rest of the state machine is identical.
The same pre-handshake lock still applies. Beat mode does not mean "switch every cycle"; it means "switch after each accepted beat". While TREADY is low, the current source remains selected until the handshake completes.
Beat-mode example
Using the same source lengths as before (src0=3, src1=2, src2=1, src3=4 beats), the accepted beats evolve as shown in Table (2). When a source is exhausted it drops TVALID and the scan skips it.
| Accepted beat | rr_ptr before scan | Winner | Remaining beats after transfer | rr_ptr after beat |
|---|---|---|---|---|
| 0 | 0 | 0 | src0=2, src1=2, src2=1, src3=4 | 1 |
| 1 | 1 | 1 | src0=2, src1=1, src2=1, src3=4 | 2 |
| 2 | 2 | 2 | src0=2, src1=1, src2=0, src3=4 | 3 |
| 3 | 3 | 3 | src0=2, src1=1, src2=0, src3=3 | 0 |
| 4 | 0 | 0 | src0=1, src1=1, src2=0, src3=3 | 1 |
| 5 | 1 | 1 | src0=1, src1=0, src2=0, src3=3 | 2 |
| 6 | 2 | 3 | src0=1, src1=0, src2=0, src3=2 | 0 |
| 7 | 0 | 0 | src0=0, src1=0, src2=0, src3=2 | 1 |
| 8 | 1 | 3 | src0=0, src1=0, src2=0, src3=1 | 0 |
| 9 | 0 | 3 | src0=0, src1=0, src2=0, src3=0 | 0 |
Table (2): Beat-based arbitration example.
The difference is clear: beat mode interleaves traffic whenever several sources remain active. At beat 6 the scan starts at rr_ptr=2, skips the exhausted src2, and picks src3. The sink sees beats from multiple packets in alternating order, which is only safe when each beat is independently meaningful.
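To see both policies side by side, the examples in Table (1) and Table (2) can be replayed with a small transaction-level model. This is a Python sketch under simplifying assumptions, not the verilaxi RTL: the sink is always ready, each source sends exactly one packet, and a source drops TVALID once its packet is finished. The function name arbitrate is hypothetical.

```python
def arbitrate(lengths, hold_packet):
    """Return the source index of every accepted beat, in order.

    lengths[i] is the packet length of source i in beats. A source counts
    as valid while it still has beats left. Assumes an always-ready sink.
    """
    n = len(lengths)
    remaining = list(lengths)
    rr_ptr = 0
    beats = []
    while any(remaining):
        # Scan from rr_ptr and take the first source that is still valid.
        winner = next(
            (rr_ptr + i) % n for i in range(n) if remaining[(rr_ptr + i) % n] > 0
        )
        while True:
            remaining[winner] -= 1
            beats.append(winner)
            tlast = remaining[winner] == 0
            # Same idea as unit_last: follow TLAST in packet mode, else always 1.
            unit_last = tlast if hold_packet else True
            if unit_last:
                break
        rr_ptr = (winner + 1) % n  # next scan starts after the winner
    return beats


print(arbitrate([3, 2, 1, 4], hold_packet=True))   # [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
print(arbitrate([3, 2, 1, 4], hold_packet=False))  # [0, 1, 2, 3, 0, 1, 3, 0, 3, 3]
```

The first line keeps every packet contiguous, matching the winners in Table (1). The second reproduces the winner column of Table (2), including the skip over the exhausted src2 at accepted beat 6.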
Weighted packet arbitration
Equal round-robin is fair, but sometimes one source should be serviced more often than others. A weighted arbiter solves this while still keeping packet boundaries intact.
Each source gets a credit counter initialized from its configured weight. A source is eligible only when it is both valid and has nonzero credit. When a packet completes, that source's credit is decremented. Once all active credits reach zero, the arbiter reloads them from WEIGHTS and starts a new weighted round.
Weights are packed into the parameter with source 0 in the least-significant bits:
// {W3, W2, W1, W0} = {1, 1, 2, 4} for a 4:2:1:1 ratio
localparam logic [NUM_SRCS*WEIGHT_W-1:0] WEIGHTS = {4'd1, 4'd1, 4'd2, 4'd4};
Credits are initialized by a helper function that also handles the zero-weight default:
function logic [WEIGHT_W-1:0] cfg_weight(input int src_idx);
    logic [WEIGHT_W-1:0] raw;
    raw = WEIGHTS[src_idx*WEIGHT_W +: WEIGHT_W];
    cfg_weight = (raw == '0) ? WEIGHT_W'(1) : raw;
endfunction
A weight of zero is treated as one, not as "never scheduled". This means the default WEIGHTS = '0 gives every source one credit per round — exactly equal round-robin — so no special case is needed for the unweighted instantiation.
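The packing convention and the zero-weight rule are easy to check in software. The sketch below is a Python illustration of the same slicing that cfg_weight performs; the function name unpack_weights is hypothetical and the code is not taken from the verilaxi sources.

```python
def unpack_weights(weights, num_srcs, weight_w):
    """Split a packed weight vector into per-source weights, source 0 in the LSBs.

    A raw weight of zero is promoted to one, mirroring the cfg_weight helper.
    """
    mask = (1 << weight_w) - 1
    result = []
    for src in range(num_srcs):
        raw = (weights >> (src * weight_w)) & mask
        result.append(raw if raw != 0 else 1)
    return result


# Equivalent of {4'd1, 4'd1, 4'd2, 4'd4}: W3=1, W2=1, W1=2, W0=4
packed = (1 << 12) | (1 << 8) | (2 << 4) | 4
print(unpack_weights(packed, 4, 4))  # [4, 2, 1, 1]
print(unpack_weights(0, 4, 4))       # [1, 1, 1, 1]
```

The second call shows why the default WEIGHTS = '0 behaves as plain round-robin: every source ends up with one credit per round.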
Weighted example
With weights src0:src1:src2:src3 = 4:2:1:1, initial credits (4,2,1,1), all sources valid throughout:
| Grant | rr_ptr | Credits before | Winner | Credits after |
|---|---|---|---|---|
| 1 | 0 | (4,2,1,1) | 0 | (3,2,1,1) |
| 2 | 1 | (3,2,1,1) | 1 | (3,1,1,1) |
| 3 | 2 | (3,1,1,1) | 2 | (3,1,0,1) |
| 4 | 3 | (3,1,0,1) | 3 | (3,1,0,0) |
| 5 | 0 | (3,1,0,0) | 0 | (2,1,0,0) |
| 6 | 1 | (2,1,0,0) | 1 | (2,0,0,0) |
| 7 | 2 | (2,0,0,0) | 0 | (1,0,0,0) |
| 8 | 1 | (1,0,0,0) | 0 | (0,0,0,0) |
| — | — | all zero → reload | — | (4,2,1,1) |
Table (3): Weighted packet arbitration example.
At grant 7, rr_ptr = 2 and the scan order is 2, 3, 0, 1. Sources 2 and 3 have exhausted their credits, so src0 wins with its remaining credit of 2. At grant 8, only src0 still has credit and wins the final slot. After grant 8 all credits are zero: the arbiter detects that active sources exist but none have credit, reloads, and is immediately ready to start the next weighted round.
Weights control how often a source may win per round; they do not affect packet boundaries. Each grant still forwards a complete packet before the next scan.
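The credit scheme can be replayed with a short behavioral model. This is a Python sketch under stated assumptions, not the verilaxi RTL: all sources stay valid throughout, each grant stands for one complete packet, and the reload is modeled as happening just before the next grant (the exact cycle of the reload in hardware is not modeled). The function name weighted_grants is hypothetical.

```python
def weighted_grants(weights, num_grants):
    """Return one (rr_ptr, credits_before, winner, credits_after) row per grant."""
    n = len(weights)
    credits = list(weights)
    rr_ptr = 0
    rows = []
    for _ in range(num_grants):
        if not any(credits):
            credits = list(weights)  # every source is out of credit: new round
        # Eligible means "has credit"; validity is assumed for all sources.
        winner = next(
            (rr_ptr + i) % n for i in range(n) if credits[(rr_ptr + i) % n] > 0
        )
        before = tuple(credits)
        credits[winner] -= 1  # one credit is spent per completed packet
        rows.append((rr_ptr, before, winner, tuple(credits)))
        rr_ptr = (winner + 1) % n
    return rows


for row in weighted_grants([4, 2, 1, 1], 8):
    print(row)
# winners in order: 0, 1, 2, 3, 0, 1, 0, 0
```

With weights 4:2:1:1 the eight printed rows match Table (3) row for row, and running the model for sixteen grants yields 8, 4, 2, and 2 grants per source, which is the configured ratio over two rounds.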
How the RTL is organized
The implementation uses one always_comb block and one always_ff block. Keeping them separate makes the design easy to read: anything that affects registers is in the sequential block and nothing else is.
The combinational block scans sources, picks arb_sel, drives the output mux, and computes TREADY. The effective selection is a two-way mux between the locked source and the current scan winner:
eff_sel = locked ? sel : arb_sel;
eff_valid = locked ? s_axis_tvalid[sel] : arb_valid;
The unit_last and unit_done signals unify packet and beat mode into a single state machine path. Apart from the hold condition shown earlier, the sequential block does not branch on HOLD_PACKET:
assign unit_last = HOLD_PACKET ? s_axis_tlast[eff_sel] : 1'b1;
assign unit_done = handshake && unit_last;
The sequential block then acts on unit_done to advance rr_ptr and decrement the credit. Changing HOLD_PACKET is the only modification needed to switch between modes.
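The split between combinational and registered logic maps naturally onto a cycle-level software model. The following Python sketch is an illustration of the behavior described in this article, not a translation of the verilaxi RTL; the class name ArbiterModel and its step method are hypothetical, weights are left out, and the pointer update uses the effective selection so that the never-locked single-beat case is covered, which is an assumption of the sketch.

```python
class ArbiterModel:
    """Cycle-level sketch of the round-robin arbiter without weights."""

    def __init__(self, num_srcs, hold_packet=True):
        self.n = num_srcs
        self.hold_packet = hold_packet
        self.sel = 0          # registered: currently selected source
        self.locked = False   # registered: a transfer is in progress
        self.rr_ptr = 0       # registered: start of the next scan

    def step(self, tvalid, tlast, m_tready):
        """Advance one clock; return (eff_valid, eff_sel, handshake)."""
        # Combinational part: scan from rr_ptr for the first valid source.
        arb_sel, arb_valid = 0, False
        for i in range(self.n):
            idx = (self.rr_ptr + i) % self.n
            if tvalid[idx]:
                arb_sel, arb_valid = idx, True
                break
        eff_sel = self.sel if self.locked else arb_sel
        eff_valid = tvalid[self.sel] if self.locked else arb_valid
        handshake = eff_valid and m_tready
        unit_last = tlast[eff_sel] if self.hold_packet else True
        unit_done = handshake and unit_last

        # Sequential part: only registered state changes here.
        if unit_done:
            self.locked = False
            self.rr_ptr = (eff_sel + 1) % self.n  # assumption: uses eff_sel
        elif eff_valid:
            self.sel = eff_sel   # stalled beat or mid-packet: hold the grant
            self.locked = True
        return eff_valid, eff_sel, handshake


m = ArbiterModel(4)
# src1 is chosen while the sink stalls, so the model locks onto it.
print(m.step([False, True, False, False], [False, True, False, False], False))
# src0 becomes valid, but the lock keeps src1 selected.
print(m.step([True, True, False, False], [True, True, False, False], False))
# Sink ready: src1's single-beat packet completes and rr_ptr moves to 2.
print(m.step([True, True, False, False], [True, True, False, False], True))
```

The three printed tuples are (True, 1, False), (True, 1, False), and (True, 1, True). The second call is the interesting one: without the pre-handshake lock, a scan starting at rr_ptr = 0 would have switched to src0 while the beat from src1 was still waiting.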
How the tests are built
Each test instantiates the same topology:
- four AXI-Stream source BFMs, each driving an input register slice
- the arbiter DUT receiving packed signal arrays from the register slices
- an output register slice feeding a sink BFM
- protocol checkers on all four source interfaces and on the output
The register slices serve a practical purpose. They align BFM-driven signals to clock edges and eliminate direct combinational paths from the BFMs to the DUT, which prevents convergence issues in Verilator under backpressure. The arbitration algorithm does not require them; they are a testbench concern.
Source and sink backpressure are controlled at runtime via plusargs so each test runs in both configurations without recompilation:
void'($value$plusargs("SRC_BP=%d", src_bp_en));
void'($value$plusargs("SINK_BP=%d", sink_bp_en));
When enabled, the affected BFM stalls at random approximately 20% of the time: a source BFM withholds TVALID, and the sink BFM withholds TREADY.
The packet test (axis_arbiter) runs in two phases. Phase 1 is sequential: each source sends one packet alone with no competing traffic, confirming basic forwarding and that the arbiter does not corrupt or drop beats. Phase 2 is concurrent: all four sources send three packets each simultaneously, exercising round-robin rotation and correct behavior under backpressure. The expected totals are hard-coded:
// Phase 1: 4+5+3+6 = 18 beats, 4 packets
// Phase 2: 3*(4+5+3+6) = 54 beats, 12 packets
if (beats_recv !== 72)
    $error("[ARB] FAIL: expected 72 beats, got %0d", beats_recv);
else if (pkts_recv !== 16)
    $error("[ARB] FAIL: expected 16 packets, got %0d", pkts_recv);
The weighted test (axis_arbiter_weighted) sends exactly W0+W1+W2+W3 = 8 packets concurrently and checks that each source's grant count matches its configured weight exactly:
else if (src_grants[0] !== W0)
$error("[WARB] FAIL: src0 grants expected %0d, got %0d", W0, src_grants[0]);
The beat test (axis_arbiter_beat) sends four equal 3-beat packets concurrently (12 beats total). When source backpressure is disabled, it checks that the accepted beat sequence follows strict round-robin ordering:
for (int i = 0; i < TOTAL_BEATS; i++) begin
    if (sel_hist[i] !== (i % NUM_SRCS))
        $error("[BARB] FAIL: expected beat %0d from src%0d, got src%0d",
               i, i % NUM_SRCS, sel_hist[i]);
end
With equal packet lengths and source backpressure disabled, the expected sequence is exactly 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3. The per-source beat counts are also checked for balance regardless of backpressure mode.
A useful verification lesson
The beat test required a careful choice of observation point for its accepted-beat monitor, and the issue is worth explaining because it recurs in handshake verification.
The arbiter's eff_sel is combinational: it resolves in the same cycle as the handshake. Naively, one might sample eff_sel at the posedge alongside the handshake signals. The problem is that in a Verilator simulation, the posedge triggers multiple update passes: registered state changes, combinational re-evaluation, and BFM ready toggling can all happen in the same delta-cycle group. In edge cases the monitor could read an eff_sel that has already been updated for the next arbitration cycle rather than the source that actually drove the accepted beat.
The fix is to capture eff_sel at the negedge immediately before the posedge. At that point the registered state from the previous cycle is stable, no new handshakes are in flight, and eff_sel correctly reflects the source currently driving the bus:
forever begin
    @(negedge clk);
    beat_sel_pre = int'(u_dut.eff_sel);  // stable: posedge not yet
    @(posedge clk);
    beat_sel_mon = beat_sel_pre;         // use the pre-edge value
    if (arb_tvalid_w && arb_tready_w) begin
        sel_hist[beats_recv] = beat_sel_mon;
        beats_recv++;
    end
end
This is a recurring pattern: the interesting signal (which source drove a beat) and the trigger that confirms it (the posedge handshake) are not the same event. Sampling the interesting signal at the negedge cleanly separates the two without needing any delta-cycle reasoning. The same technique applies to any combinational signal that you want to attribute to a specific clock edge.
When to use each mode
Packet round-robin is the safest default when TLAST marks a real packet, frame, or line boundary.
Beat round-robin is useful when beats are independent and finer bandwidth sharing matters more than packet integrity.
Weighted packet arbitration is useful when some streams should receive more service than others without breaking packet boundaries.
Where to go next
This arbiter is already a useful AXI-Stream primitive, but it also opens the door to several natural extensions:
- weighted beat arbitration
- TDEST-qualified arbitration
- destination-based routing
- full AXI-Stream switches
- AXI memory-mapped arbiters for shared DMA and CDMA access
Even in its current form it is a good example of how a small RTL block becomes much more useful once the policy is explicit and the tests clearly prove the intended behavior.