The ROCm toolboxes ship with ROCBLAS_USE_HIPBLASLT=1 by default. This forces rocBLAS to prefer the hipBLASLt kernel library, which historically delivered the best throughput on gfx1201 (R9700). Rows tagged with __hblt0 were re-run with ROCBLAS_USE_HIPBLASLT=0, letting rocBLAS auto-select among hipBLASLt, Tensile, and other kernel providers. These runs show how performance shifts when the tuned hipBLASLt path is disabled.
hipBLASLt is AMD's Lt-style GEMM library (the counterpart of NVIDIA's cuBLASLt), with kernels tuned for transformer workloads. Disabling it can expose regressions or improvements depending on the driver version, so both configurations are published for comparison.
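The toggle described above can be sketched as a pair of runs; the benchmark invocation itself is hypothetical (commented out), only the environment variable is from the text:

```shell
# Baseline rows: toolbox default, rocBLAS prefers hipBLASLt kernels.
export ROCBLAS_USE_HIPBLASLT=1
# ./run-benchmark ...   # hypothetical benchmark command

# __hblt0 rows: same benchmark re-run with hipBLASLt preference disabled,
# so rocBLAS auto-selects among its kernel providers (e.g. Tensile).
export ROCBLAS_USE_HIPBLASLT=0
# ./run-benchmark ...

echo "ROCBLAS_USE_HIPBLASLT=${ROCBLAS_USE_HIPBLASLT}"
```

Publishing both rows side by side makes the cost or benefit of the hipBLASLt path directly visible per model.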
Dual GPU (2x R9700)
These results were produced using two AMD Radeon AI PRO R9700 GPUs (32GB each, 64GB total). Models larger than ~30GB are automatically distributed across both GPUs using HIP_VISIBLE_DEVICES=0,1; smaller models run on a single GPU (HIP_VISIBLE_DEVICES=0).
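The selection rule above can be sketched as follows; the ~30GB threshold and the two device sets are from the text, while the MODEL_GB variable is a hypothetical stand-in for however the harness measures model size:

```shell
# Pick GPUs based on model size (threshold from the text: ~30GB per R9700).
MODEL_GB=45   # hypothetical example: a model too large for one 32GB GPU

if [ "${MODEL_GB}" -gt 30 ]; then
  export HIP_VISIBLE_DEVICES=0,1   # split the model across both R9700s
else
  export HIP_VISIBLE_DEVICES=0     # small models stay on a single GPU
fi

echo "HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES}"
```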
rocWMMA variants
Backends labeled -rocwmma are rebuilt with AMD's rocWMMA library, which unlocks matrix multiply pipelines accelerated via wave matrix multiply-accumulate (WMMA) instructions. rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3 and RDNA4 GPUs such as the R9700, but may trade off stability or memory usage; comparing the plain toolboxes against the -rocwmma ones highlights the benefit or cost.
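As a rough sketch of what such a rebuild looks like: assuming the backends are llama.cpp-style HIP builds (an assumption, not stated in the text), the rocWMMA path is typically opted into at configure time with a CMake flag. The flag name below is llama.cpp's rocWMMA flash-attention switch; if the backends are a different project, the mechanism will differ.

```shell
# Hypothetical rebuild sketch for a -rocwmma variant (llama.cpp assumed).
# rocWMMA is a header-only library, so enabling it is a compile-time choice.
ROCWMMA_FLAGS="-DGGML_HIP_ROCWMMA_FATTN=ON"

# The configure step would look roughly like this (echoed, not executed here):
echo "cmake -B build -DGGML_HIP=ON ${ROCWMMA_FLAGS}"
```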