These results were produced using two AMD Radeon AI PRO R9700 GPUs (32GB each, 64GB total).
Models larger than ~30GB are automatically split across both GPUs by setting
HIP_VISIBLE_DEVICES=0,1. Smaller models run on a single GPU
(HIP_VISIBLE_DEVICES=0).
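The device selection rule described above can be sketched as follows. This is a minimal illustration, not the actual benchmark harness: the function names are hypothetical, the cutoff is taken from the "~30GB" figure in the text, and treating it as 30 GiB of on-disk model size is an assumption.

```python
import os

# Assumption: "~30GB" is interpreted as 30 GiB of model file size.
# The exact cutoff used by the real benchmark scripts may differ.
SINGLE_GPU_LIMIT_BYTES = 30 * 1024**3


def visible_devices_for(model_size_bytes: int) -> str:
    """Return the HIP_VISIBLE_DEVICES value for a model of the given size."""
    if model_size_bytes > SINGLE_GPU_LIMIT_BYTES:
        return "0,1"  # expose both GPUs so the model can be split
    return "0"  # small enough for a single 32GB GPU


def env_for_model(model_path: str) -> dict:
    """Copy the current environment and set HIP_VISIBLE_DEVICES for this model."""
    env = dict(os.environ)
    env["HIP_VISIBLE_DEVICES"] = visible_devices_for(os.path.getsize(model_path))
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a benchmark process, so the choice stays local to that process instead of changing the shell's environment.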
rocWMMA variants
Backends labeled -rocwmma are rebuilt with AMD's rocWMMA library, which unlocks
matrix multiply pipelines accelerated via wave matrix multiply-accumulate (WMMA)
instructions.
rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3, but
the speedup may come at the cost of stability or higher memory usage. Comparing
a plain toolbox against its -rocwmma counterpart shows whether the trade is
worthwhile for a given model.
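One way to make that comparison concrete is to pair each -rocwmma backend with its plain counterpart and compute a throughput ratio. The sketch below assumes results are available as a mapping from backend name to tokens per second; the function name, the backend names, and the numbers in the example are placeholders for illustration, not measured results.

```python
ROCWMMA_SUFFIX = "-rocwmma"


def rocwmma_speedups(results: dict[str, float]) -> dict[str, float]:
    """Map each plain backend name to rocwmma_throughput / plain_throughput."""
    speedups = {}
    for name, tokens_per_s in results.items():
        if not name.endswith(ROCWMMA_SUFFIX):
            continue  # only -rocwmma entries start a comparison
        base = name[: -len(ROCWMMA_SUFFIX)]
        base_tokens_per_s = results.get(base)
        if base_tokens_per_s:  # skip missing or zero baselines
            speedups[base] = tokens_per_s / base_tokens_per_s
    return speedups


# Placeholder values only; substitute real benchmark numbers.
example = {"backend-a": 100.0, "backend-a-rocwmma": 125.0, "backend-b": 80.0}
print(rocwmma_speedups(example))
```

A ratio above 1.0 means the -rocwmma build was faster for that workload, and a ratio below 1.0 means it was slower. Throughput alone does not capture stability or memory usage, so those need to be checked separately.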