Sparse models activate only a small number of parameters per forward pass, even though the total number of available parameters can be quite large.
This means they also run well on systems with limited compute resources (but lots of RAM).
So you can basically run them entirely without a high-end GPU, on the CPU of an off-the-shelf PC, and still easily reach double-digit tokens/s.
I’d add that memory bandwidth is still a relevant factor, so the faster the RAM, the faster the inference. I think this model would be a perfect fit for a Strix Halo or a >= 64 GB Apple Silicon machine when aiming for CPU-only inference. But mind that llama.cpp does not yet support the Qwen3-Next architecture.
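A rough way to see why bandwidth dominates: during decoding, every active weight has to be streamed from RAM once per token, so the tokens/s ceiling is roughly bandwidth divided by the bytes of active weights. A minimal back-of-envelope sketch, where the active parameter count, quantization width, and bandwidth are all assumptions you would replace with your own numbers:

```python
# Back-of-envelope ceiling for CPU-only decoding of a sparse (MoE) model.
# All figures below are illustrative assumptions, not measurements.

def decode_tps_upper_bound(active_params_b: float,
                           bytes_per_param: float,
                           ram_bandwidth_gb_s: float) -> float:
    """Tokens/s ceiling if every active weight is streamed from RAM once per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return ram_bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed example: ~3B active parameters, ~0.55 bytes/param (roughly 4-bit
# quantization plus overhead), dual-channel DDR5 at ~80 GB/s.
print(f"~{decode_tps_upper_bound(3.0, 0.55, 80.0):.0f} tokens/s theoretical ceiling")
```

With those assumed numbers the ceiling lands around 50 tokens/s, which is why faster RAM (or Apple Silicon / Strix Halo class bandwidth) translates almost directly into faster CPU-only inference.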
Can confirm that from my setup: increasing parallelization beyond 3-4 concurrent threads no longer yields any significant increase in inference speed.
That’s a telltale sign that some of the cores are starving because data no longer arrives from RAM fast enough…
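To make that saturation point concrete, here is an illustrative roofline-style sketch (the per-thread and ceiling figures are made-up assumptions, not measurements): throughput scales with thread count only until it hits the fixed memory-bandwidth ceiling, after which extra cores just wait on RAM.

```python
# Why adding threads stops helping: observed throughput is the minimum of
# what the cores can compute and what the memory bus can feed them.
# All numbers are illustrative assumptions.

def expected_tps(threads: int,
                 tps_per_thread: float = 4.0,       # assumed compute-bound tokens/s per core
                 bandwidth_ceiling_tps: float = 14.0) -> float:
    """Throughput grows with threads only until the RAM-bandwidth ceiling is reached."""
    return min(threads * tps_per_thread, bandwidth_ceiling_tps)

for t in (1, 2, 3, 4, 6, 8):
    print(f"{t} threads -> ~{expected_tps(t):.1f} tokens/s")
```

With these assumed values the curve flattens right around 3-4 threads, matching the behaviour described above.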