These results were produced with two Strix Halo systems (Framework Desktops, 128 GB each)
connected over 50 Gbps Ethernet; at that speed the limiting factor is likely latency rather than bandwidth.
One runs rpc-server from llama.cpp; the other runs
llama-bench --rpc.
This setup allows distributed inference, splitting large GGUF models across both machines. The metric
shows what you can expect when throughput is bounded by network latency and the workload is balanced
between two RPC participants.
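As a concrete sketch of the two-machine setup (the host address, port, and model path are placeholders; exact flags can vary between llama.cpp versions):

```shell
# Machine A (placeholder address 192.168.1.10): expose its backend over RPC.
# -H binds the listening address, -p the port.
rpc-server -H 0.0.0.0 -p 50052

# Machine B: benchmark a GGUF model, splitting it between the local
# backend and the RPC peer on machine A.
llama-bench -m model.gguf --rpc 192.168.1.10:50052
```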
rocWMMA variants
Backends labeled -rocwmma are rebuilt with AMD's rocWMMA library, which unlocks matrix-multiply
pipelines accelerated by wave matrix multiply-accumulate (WMMA) instructions.
rocWMMA kernels can significantly accelerate BF16/F16 workloads on RDNA3, but may trade away stability
or increase memory usage; comparing plain toolboxes against their -rocwmma counterparts highlights the
benefit or cost.
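For illustration, a rocWMMA-enabled build of llama.cpp can be configured roughly like this. The GGML_HIP_ROCWMMA_FATTN option enables rocWMMA-backed FlashAttention in the HIP backend; the gfx1151 target is an assumption for Strix Halo, and options may differ across llama.cpp versions:

```shell
# Configure llama.cpp's HIP backend with rocWMMA-accelerated FlashAttention.
# gfx1151 is the (assumed) GPU target for Strix Halo.
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build --config Release
```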
rocWMMA-improved builds
Toolboxes tagged -rocwmma-improved bake in an experimental llama.cpp patch (commit 12bb5c371bd3)
that retunes rocWMMA kernels for long-context throughput on Strix Halo.
These builds often run faster at 32k+ context, but the changes are not upstream and may be unstable.
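A rough sketch of how such a patched build could be reproduced locally, assuming the cited commit 12bb5c371bd3 is reachable from a fetched remote (the checkout path is a placeholder):

```shell
# Apply the experimental rocWMMA retuning patch on top of a llama.cpp
# checkout, then rebuild. Since the patch is not upstream, the commit
# must first be fetched from wherever it is published.
git -C llama.cpp cherry-pick 12bb5c371bd3
cmake --build llama.cpp/build --config Release
```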