sme-jit-core Benchmarks

ARM SME JIT vs Apple Accelerate on M4 · Criterion.rs · Source on GitHub

Tiled GEMM — SmeGemm API

Size | JIT (median) | Accelerate (median) | Speedup
--- | --- | --- | ---
16×16×16 | 58.2 ns | 218.1 ns | 3.7×
32×32×32 | 85.9 ns | 260.1 ns | 3.0×
48×48×48 | 339.2 ns | 495.3 ns | 1.5×
64×64×64 | 867.9 ns | 712.9 ns | 0.8×
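For reference, the operation being timed is a plain single-precision GEMM, C = A·B. The sketch below is a hypothetical naive stand-in (`sgemm_ref` is not part of the sme-jit-core API); the real JIT kernel computes the same result but tiles the loop nest onto SME ZA tiles.

```rust
/// Naive reference SGEMM: C = A * B for row-major f32 matrices.
/// Hypothetical stand-in for what the JIT-compiled tile kernel computes.
fn sgemm_ref(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // 2×2×2 sanity check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    sgemm_ref(2, 2, 2, &a, &b, &mut c);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
    println!("ok");
}
```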


Fused Activations — GEMM + ReLU / Bias+ReLU

Kernel | Median | Note
--- | --- | ---
sgemm_relu_16×16 | 39.9 ns | ReLU adds ~0 ns over raw GEMM
sgemm_bias_relu_16×16 | 41.5 ns | Bias+ReLU adds 1.6 ns
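Semantically, the fused kernels compute C = relu(A·B + bias). The sketch below is a hypothetical naive reference (`sgemm_bias_relu_ref` is not the crate's API, and the bias-per-column broadcast is an assumption); fusing the bias add and ReLU into the tile store is why the epilogue costs almost nothing over the raw GEMM.

```rust
/// Reference for the fused epilogue: C = relu(A * B + bias), row-major f32,
/// with bias broadcast across rows (one value per output column — assumed layout).
fn sgemm_bias_relu_ref(
    m: usize, n: usize, k: usize,
    a: &[f32], b: &[f32], bias: &[f32], c: &mut [f32],
) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = bias[j]; // start from the bias, accumulate the dot product
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc.max(0.0); // ReLU fused at the store
        }
    }
}

fn main() {
    // Raw product is [[19,22],[43,50]]; bias [-20,-30] then ReLU clamps row 0 to zero.
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let bias = [-20.0, -30.0];
    let mut c = [0.0f32; 4];
    sgemm_bias_relu_ref(2, 2, 2, &a, &b, &bias, &mut c);
    assert_eq!(c, [0.0, 0.0, 23.0, 20.0]);
    println!("ok");
}
```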


Core 16×16 — JIT Hot vs Cold vs Accelerate

Path | K=1 | K=4 | K=16 | K=64
--- | --- | --- | --- | ---
jit_hot | 58.1 ns | 58.3 ns | 58.5 ns | 56.7 ns
accelerate | 50.0 ns | 117.7 ns | 162.2 ns | 232.7 ns
speedup | 0.9× | 2.0× | 2.8× | 4.1×

jit_cold (fork-isolated) safety mode

Median ~44 ms, roughly 750,000× slower than jit_hot. This is the fork() + mmap + waitpid cost of fault-tolerant opcode probing. It is meant for exploring unknown instructions, not for production workloads.
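A back-of-envelope way to put the cold cost in perspective: treat the ~44 ms probe as a one-time cost amortized against the per-call savings of jit_hot over Accelerate. The numbers below come from the K=64 row of the table above; the break-even formula itself is the only thing this sketch adds.

```rust
/// One-time cost (ns) divided by per-call savings (ns) gives the number of
/// hot calls needed before the cold probe pays for itself.
fn break_even_calls(cold_ns: f64, hot_ns: f64, baseline_ns: f64) -> f64 {
    cold_ns / (baseline_ns - hot_ns)
}

fn main() {
    // jit_cold ~44 ms; jit_hot 56.7 ns vs accelerate 232.7 ns at K=64.
    let calls = break_even_calls(44_000_000.0, 56.7, 232.7);
    println!("{:.0}", calls); // → 250000: ~a quarter-million calls amortize one probe
}
```

At smaller savings (e.g. the K=4 gap of ~59 ns per call) the break-even point rises accordingly, which is consistent with reserving the fork-isolated path for exploration rather than the hot loop.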


Methodology

Criterion.rs — Statistical Benchmarking

300–500 samples per benchmark with a 3-second warmup. Outliers are detected via the modified Thompson tau test; confidence intervals are reported at 95%. All measurements were taken on an Apple M4 (macOS Sequoia 15.x, Rust nightly, --release profile). Benchmarks exercise the public SmeGemm API (tiled group) and the internal kernel builders (other groups).

Full Criterion Report: see the linked report for all benchmarks.