About the Lab
Four very different machines, all running the same model: Qwen3-Coder-Next, an 80B mixture-of-experts. Pick any of them in the Chat or Race demos and feel the difference. Same model, same prompt; the hardware is the variable. Here's what's under the hood. Hover any number to learn how to read it.
Blackwell Tower
Discrete data-center GPUs
The heavyweight. Dedicated GPUs, the fastest memory, built to serve a crowd.
Chip: 2× NVIDIA RTX PRO 6000 Blackwell. The processor doing the work. Discrete data-center GPUs.
Memory: 96 GB GDDR7 (per GPU). How much model + context can fit. Dedicated VRAM, separate from system RAM. More memory = bigger models / longer context.
Mem bandwidth: ~1.8 TB/s. GDDR7 on a discrete GPU, by far the fastest memory here. This is the main reason this box generates tokens several times faster than the others.
Serving stack: vLLM. A production serving engine with continuous batching. It stays fast even with 64+ people at once, where a laptop server would queue.
M5 MacBook Pro
Laptop SoC · unified memory
A laptop punching way above its weight thanks to fast unified memory.
Chip: Apple M5 Max. The processor doing the work. Laptop SoC · unified memory.
Memory: 128 GB unified. Unified memory: the CPU and GPU share one big pool. 128 GB is enough to hold an 80B model on a laptop.
Mem bandwidth: ~545 GB/s. Fast unified memory (LPDDR5x). The highest bandwidth of the unified-memory machines, which is why a laptop keeps up surprisingly well.
Serving stack: llama.cpp · Metal. llama.cpp running on Apple's Metal GPU API. Great for single users; fewer parallel slots than vLLM.
DGX Spark
Grace-Blackwell dev box · unified memory
A tiny Grace-Blackwell box — enormous memory, modest bandwidth.
Chip: NVIDIA GB10 Grace Blackwell. The processor doing the work. Grace-Blackwell dev box · unified memory.
Memory: 128 GB unified. Unified LPDDR5x shared by the Grace CPU and Blackwell GPU. Huge capacity for the size of the box.
Mem bandwidth: ~273 GB/s. Lots of memory, but modest bandwidth, so it fits the model easily yet decodes slower than its bigger sibling. Capacity ≠ speed.
Serving stack: llama.cpp · CUDA. llama.cpp on CUDA. Recent builds added support for this model's hybrid (Gated-DeltaNet) architecture on the GB10.
Strix Halo
x86 APU · unified memory
An x86 mini-PC APU running the very same 80B model.
Chip: AMD Ryzen AI MAX+ 395 (Radeon 8060S). The processor doing the work. x86 APU · unified memory.
Memory: 128 GB unified. Unified LPDDR5x shared by the Zen 5 CPU and RDNA 3.5 GPU. Same 80B model fits comfortably.
Mem bandwidth: ~256 GB/s. Similar bandwidth class to the Spark. Plenty of room for the model; bandwidth is what caps how fast it can talk.
Serving stack: llama.cpp · Vulkan. llama.cpp on the Vulkan backend, driving the integrated Radeon GPU on an x86 mini-PC.
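To make the "fits the model" claims on the cards above concrete, here is a rough sketch of the capacity arithmetic. The memory figures are copied from the cards; the precisions (4, 8 and 16 bits per weight) are hypothetical, since this page does not say which quantization each machine runs, and the helper names are made up for illustration. It counts weights only, so the headroom it reports is optimistic: context (KV cache), the runtime and, on unified-memory machines, the operating system all draw from the same budget.

```python
# Memory per machine in GB, copied from the spec cards above.
# The Tower figure is the total across both GPUs, which ASSUMES the
# model is split across them; a single GPU has 96 GB.
MEMORY_GB = {
    "Blackwell Tower": 2 * 96,
    "M5 MacBook Pro": 128,
    "DGX Spark": 128,
    "Strix Halo": 128,
}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def headroom_gb(machine: str, params_billion: float, bits_per_weight: float) -> float:
    """Memory left after loading the weights.

    A negative result means the weights alone do not fit.
    """
    return MEMORY_GB[machine] - weights_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    # Hypothetical precisions; the lab's actual quantization is not stated.
    for bits in (4, 8, 16):
        print(f"80B at {bits}-bit: {weights_gb(80, bits):.0f} GB of weights")
        for machine in MEMORY_GB:
            left = headroom_gb(machine, 80, bits)
            verdict = "fits" if left > 0 else "does not fit"
            print(f"  {machine}: {verdict} ({left:+.0f} GB)")
```

Under these assumptions the precision matters more than the machine: the same 80B parameters take a quarter of the space at 4 bits that they take at 16.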
How to read the numbers
- Time to first token (TTFT): How long from hitting send to the first word. It's mostly the queue plus reading your prompt (“prefill”). Low TTFT feels instant.
- Tokens / second: How fast it writes once it gets going. This is usually capped by memory bandwidth: the chip re-reads the model's active weights for every token, so faster memory = more tokens/sec.
- Memory bandwidth: The single best predictor of single-user speed here. The Blackwell Tower's ~1.8 TB/s vs the unified boxes' ~250–550 GB/s is why it's several times faster: same model, different memory.
- Unified vs discrete memory: The laptop, Spark and Strix share one memory pool between CPU and GPU (cheap, huge, runs an 80B model anywhere). The Tower has dedicated GPU VRAM, smaller per GPU (96 GB) but far faster.
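The definitions above can be sketched in a few lines. This is an illustration only, not how the Chat or Race demos compute their numbers: the function names and the timestamps are invented, and measuring writing speed from the first token onward is one reasonable convention among several. The bandwidth ratio uses the figures from the spec cards and assumes decoding is bandwidth-bound on both machines with the same bytes read per token.

```python
from typing import Optional, Sequence


def time_to_first_token(sent_at: float, token_times: Sequence[float]) -> float:
    """Seconds from sending the request to the first generated token."""
    if not token_times:
        raise ValueError("no tokens were generated")
    return token_times[0] - sent_at


def decode_tokens_per_second(token_times: Sequence[float]) -> Optional[float]:
    """Writing speed once generation is under way.

    Counts the tokens after the first and divides by the time since the
    first token, so queueing and prefill are excluded. Returns None when
    there is no interval to measure.
    """
    if len(token_times) < 2:
        return None
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        return None
    return (len(token_times) - 1) / elapsed


def bandwidth_ratio(fast_gb_per_s: float, slow_gb_per_s: float) -> float:
    """How many times more bytes per second the faster memory moves.

    If both machines are bandwidth-bound and read the same bytes per
    token, this is also the expected ratio of their tokens/second.
    """
    return fast_gb_per_s / slow_gb_per_s


if __name__ == "__main__":
    # Made-up arrival times in seconds, for illustration only.
    sent_at = 0.0
    token_times = [0.5, 0.75, 1.0, 1.25, 1.5]
    print("TTFT (s):", time_to_first_token(sent_at, token_times))
    print("tokens/s:", decode_tokens_per_second(token_times))

    # Bandwidth figures in GB/s, taken from the spec cards.
    for name, gb_per_s in [("M5 MacBook Pro", 545), ("DGX Spark", 273), ("Strix Halo", 256)]:
        print(f"Tower vs {name}: {bandwidth_ratio(1800, gb_per_s):.1f}x")
```

Real measurements will usually land below the bandwidth ratio's prediction or above it depending on batching, quantization and backend maturity, so treat the ratio as a guide to reading the numbers rather than a forecast.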