The ROCm toolboxes ship with ROCBLAS_USE_HIPBLASLT=1 by default. This forces rocBLAS to prefer the hipBLASLt kernel library, which historically delivered the best throughput on gfx1151 (Strix Halo). Rows tagged with __hblt0 were re-run with ROCBLAS_USE_HIPBLASLT=0, letting rocBLAS auto-select among hipBLASLt, Tensile, and other kernel providers. These runs show how performance shifts when the tuned hipBLASLt path is disabled.
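Reproducing both configurations only requires toggling the environment variable, roughly as sketched below (the model path and flags are placeholders, not the exact ones used for these tables):

    # Default: rocBLAS prefers hipBLASLt (baseline rows)
    ROCBLAS_USE_HIPBLASLT=1 llama-bench -m model.gguf

    # __hblt0 rows: let rocBLAS auto-select its kernel provider
    ROCBLAS_USE_HIPBLASLT=0 llama-bench -m model.gguf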
hipBLASLt is AMD's GEMM backend modeled on cuBLASLt, optimized for transformer workloads. Disabling it can expose regressions or improvements depending on the driver version, so both configurations are published for comparison.
RPC · dual server
These results were produced with two Strix Halo systems (Framework Desktops, 128 GB each) connected over 50 Gbps Ethernet; at that speed, latency rather than bandwidth is likely the limiting factor. One machine runs rpc-server from llama.cpp; the other runs llama-bench --rpc. This setup allows distributed inference, splitting large GGUF models across both machines. The metric shows what to expect when throughput is bounded by network latency and the workload is balanced between two RPC participants.
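A minimal two-node invocation looks roughly like this (a sketch; the address, port, and model path are placeholders for this setup, and flag spellings may vary by llama.cpp version):

    # On the remote node: expose this machine's backend over RPC
    rpc-server --host 0.0.0.0 --port 50052

    # On the benchmarking node: split the model across both machines
    llama-bench -m big-model.gguf --rpc 192.168.1.2:50052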
rocWMMA variants
Backends labeled -rocwmma are rebuilt with AMD's rocWMMA library, which unlocks matrix-multiply pipelines accelerated via wave matrix multiply-accumulate (WMMA) instructions. rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3 but may trade away stability or memory usage; comparing plain toolboxes against -rocwmma ones highlights the benefit or cost.
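The rebuild roughly corresponds to enabling llama.cpp's rocWMMA flash-attention path at configure time, sketched below (exact option names vary by llama.cpp version; GGML_HIP_ROCWMMA_FATTN is the name in recent trees, and the target architecture here assumes gfx1151):

    # Configure llama.cpp with HIP and the rocWMMA flash-attention kernels
    cmake -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON \
          -DAMDGPU_TARGETS=gfx1151
    cmake --build build --config Release -j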
rocWMMA-improved builds
Toolboxes tagged -rocwmma-improved bake in an experimental llama.cpp patch that retunes rocWMMA kernels for long-context throughput on Strix Halo. Patch reference: 12bb5c371bd3. These builds often run faster for 32k+ contexts, but the changes are not upstream and may be unstable.
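To see the long-context effect, running the same benchmark at a large prompt size in both toolboxes is enough (a sketch; the model path is a placeholder):

    # Measure prompt processing at a 32k prompt, where the retuned
    # kernels are expected to pull ahead
    llama-bench -m model.gguf -p 32768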