ARM SME JIT vs Apple Accelerate on M4 · Criterion.rs · Source on GitHub
| Size (M×N×K) | JIT (median) | Accelerate (median) | Speedup (Accelerate ÷ JIT) |
|---|---|---|---|
| 16×16×16 | 58.2 ns | 218.1 ns | 3.7× |
| 32×32×32 | 85.9 ns | 260.1 ns | 3.0× |
| 48×48×48 | 339.2 ns | 495.3 ns | 1.5× |
| 64×64×64 | 867.9 ns | 712.9 ns | 0.8× |
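Functionally, both backends compute the same row-major single-precision GEMM. A minimal scalar reference like the following (a hypothetical validation helper, not the SME kernel or the SmeGemm API itself) pins down exactly what is being timed:

```rust
// Naive row-major reference sgemm: C[i][j] = sum_p A[i][p] * B[p][j].
// Hypothetical correctness baseline, not the JIT-emitted kernel.
fn sgemm_ref(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // 16×16×16 with A and B all ones: every entry of C is 16.0.
    let (m, n, k) = (16, 16, 16);
    let a = vec![1.0f32; m * k];
    let b = vec![1.0f32; k * n];
    let mut c = vec![0.0f32; m * n];
    sgemm_ref(m, n, k, &a, &b, &mut c);
    assert!(c.iter().all(|&x| x == 16.0));
    println!("ok");
}
```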
| Kernel | Median | Note |
|---|---|---|
| sgemm_relu_16×16 | 39.9 ns | ReLU adds ~0 ns over raw GEMM |
| sgemm_bias_relu_16×16 | 41.5 ns | Bias+ReLU adds 1.6 ns |
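The near-zero cost of the fused epilogues makes sense: the activation folds into the accumulator store. A scalar sketch of the pattern (hypothetical helper name; it assumes the GEMM accumulators are already computed):

```rust
// Fused bias + ReLU epilogue applied as accumulators are written back:
// out[j] = max(acc[j] + bias[j], 0). Scalar sketch of the operation the
// sgemm_bias_relu kernel fuses into its store loop.
fn store_bias_relu(acc: &[f32], bias: &[f32], out: &mut [f32]) {
    for ((o, &a), &b) in out.iter_mut().zip(acc).zip(bias) {
        *o = (a + b).max(0.0);
    }
}

fn main() {
    let acc = [1.5f32, -2.0, 0.25, -0.1];
    let bias = [0.5f32, 1.0, -1.0, 0.05];
    let mut out = [0.0f32; 4];
    store_bias_relu(&acc, &bias, &mut out);
    // Only the first lane survives the ReLU clamp.
    assert_eq!(out, [2.0, 0.0, 0.0, 0.0]);
    println!("{:?}", out);
}
```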
| Path | K=1 | K=4 | K=16 | K=64 |
|---|---|---|---|---|
| jit_hot | 58.1 ns | 58.3 ns | 58.5 ns | 56.7 ns |
| accelerate | 50.0 ns | 117.7 ns | 162.2 ns | 232.7 ns |
| speedup | 0.9× | 2.0× | 2.8× | 4.1× |
The jit_cold path's median is ~44 ms, roughly 750,000× slower than jit_hot. That gap is the fork()+mmap+waitpid cost of fault-tolerant opcode probing; jit_cold exists for exploring unknown instructions, not for production workloads.
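The fault-isolation idea behind jit_cold can be approximated from safe Rust with std::process instead of a raw fork(): run the candidate in a disposable child and read the fatal signal, if any, out of its exit status, so a SIGILL kills only the child. This is a hedged sketch of the isolation pattern; the real path forks, mmaps the JIT-emitted opcode, and waitpid()s directly.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::Command;

// Probe a candidate in a throwaway child process. A crash (e.g. SIGILL
// from an unsupported opcode) terminates only the child; the parent
// observes the signal number in the exit status and carries on.
fn probe(cmd: &str) -> Option<i32> {
    let status = Command::new("sh").arg("-c").arg(cmd).status().ok()?;
    status.signal() // Some(signo) if the child was killed by a signal
}

fn main() {
    // A well-behaved candidate: clean exit, no signal.
    assert_eq!(probe("exit 0"), None);
    // A faulting candidate, simulated by raising SIGILL (signal 4).
    assert_eq!(probe("kill -4 $$"), Some(4));
    println!("probe ok");
}
```

The ~44 ms median above is the price of spinning up and tearing down that sacrificial child on every probe, which is why it is amortized away entirely on the hot path.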
300–500 samples per benchmark with a 3-second warmup. Outlier classification via Criterion's Tukey-based method.
Confidence intervals at 95%. All measurements on Apple M4, macOS Sequoia 15.x, Rust nightly,
--release profile. Benchmarks exercise the public SmeGemm API (tiled group)
and internal kernel builders (other groups).
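The 95% intervals come from Criterion's bootstrap analysis. The construction can be sketched in a few lines (standalone illustration with a tiny fixed-seed xorshift PRNG standing in for a real RNG; not Criterion's actual code):

```rust
// Percentile-bootstrap 95% CI for the median: resample with replacement,
// take each resample's median, then read off the 2.5th and 97.5th
// percentiles of those medians.
fn median(xs: &mut Vec<f64>) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 { xs[n / 2] } else { (xs[n / 2 - 1] + xs[n / 2]) / 2.0 }
}

fn bootstrap_ci(samples: &[f64], resamples: usize) -> (f64, f64) {
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15; // fixed seed, reproducible
    let mut meds = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        let mut r: Vec<f64> = (0..samples.len())
            .map(|_| {
                // xorshift64 step
                state ^= state << 13;
                state ^= state >> 7;
                state ^= state << 17;
                samples[(state % samples.len() as u64) as usize]
            })
            .collect();
        meds.push(median(&mut r));
    }
    meds.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = meds[(resamples as f64 * 0.025) as usize];
    let hi = meds[(resamples as f64 * 0.975) as usize - 1];
    (lo, hi)
}

fn main() {
    // Synthetic timings clustered around 58 ns, like the jit_hot rows.
    let samples: Vec<f64> = (0..300).map(|i| 58.0 + (i % 7) as f64 * 0.1).collect();
    let (lo, hi) = bootstrap_ci(&samples, 1000);
    assert!(lo <= hi && lo >= 58.0 && hi <= 58.6);
    println!("95% CI for the median: [{lo:.2}, {hi:.2}] ns");
}
```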