The ROCm toolboxes ship with ROCBLAS_USE_HIPBLASLT=1 by default. This directs rocBLAS to prefer
the hipBLASLt kernel library, which has historically delivered the best throughput on gfx1151 (Strix Halo).
Rows tagged with __hblt0 were re-run with ROCBLAS_USE_HIPBLASLT=0, which makes rocBLAS
fall back to its own Tensile-based kernel selection. These runs show how performance shifts when
the tuned hipBLASLt path is disabled.
hipBLASLt is AMD's lightweight, flexible GEMM library (the HIP counterpart of cuBLASLt), with kernels
tuned for transformer-style matmul workloads. Disabling it can expose regressions or improvements depending
on the ROCm and driver versions, so both configurations are published for comparison.
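As a minimal sketch, the two configurations differ only in one environment variable (the model path is a placeholder):

    # default toolbox configuration: rocBLAS prefers hipBLASLt
    ROCBLAS_USE_HIPBLASLT=1 llama-bench -m model.gguf

    # __hblt0 variant: hipBLASLt disabled, rocBLAS uses its own kernels
    ROCBLAS_USE_HIPBLASLT=0 llama-bench -m model.gguf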
RPC · dual server
These results were produced with two Strix Halo systems (Framework Desktop + HP G1a workstation, each 128 GB)
connected over 5 Gbps Ethernet. One runs rpc-server from llama.cpp; the other runs
llama-bench --rpc.
This setup enables distributed inference, splitting large GGUF models across both machines. The numbers show what
to expect when performance is bounded by the network link and the workload is balanced between two RPC participants.
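A minimal sketch of the setup, assuming the secondary machine is reachable at 192.168.1.2 (a placeholder address) and the default RPC port:

    # on the secondary host: expose its backend over RPC
    rpc-server --host 0.0.0.0 --port 50052

    # on the primary host: benchmark with the remote backend attached
    llama-bench -m model.gguf --rpc 192.168.1.2:50052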
rocWMMA variants
Backends labeled -rocwmma are rebuilt with AMD's rocWMMA library, which unlocks matrix multiply
pipelines accelerated via wave matrix multiply-accumulate (WMMA) instructions.
rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3-class GPUs, but may come at the
cost of stability or extra memory use; comparing the plain toolboxes against their -rocwmma counterparts
highlights the benefit or the cost.
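For reference, a -rocwmma style build can be configured roughly as below, assuming llama.cpp's HIP backend and its GGML_HIP_ROCWMMA_FATTN CMake option (which routes FlashAttention through rocWMMA). Treat this as a sketch, not the toolboxes' exact build recipe:

    # HIP build targeting gfx1151, with rocWMMA-accelerated FlashAttention
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=ON
    cmake --build build --config Release -j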
rocWMMA-improved builds
Toolboxes tagged -rocwmma-improved bake in an experimental llama.cpp patch that retunes rocWMMA
kernels for long-context throughput on Strix Halo.
Patch reference: 12bb5c371bd3. These builds are often faster at 32k+ context lengths, but
the changes are not upstream and may be unstable.
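To reproduce such a build, the patch would be applied on top of a llama.cpp checkout before compiling; a sketch, assuming 12bb5c371bd3 is available as a commit in your local tree or a fetched remote:

    # inside a llama.cpp checkout, after fetching the branch carrying the patch
    git cherry-pick 12bb5c371bd3   # experimental; not upstream
    # then rebuild as in the -rocwmma example above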