The original post: /r/localllama by /u/Ok-Perception2973 on 2025-01-06 14:41:04.

Hi Everyone,

I’m in the early stages of evaluating solutions for an on-prem deployment and have been looking at the NVIDIA GH200. I came across some benchmarks for Llama 3.1 70B using vLLM, published by Sam Stoelinga on Substratus, and they looked really promising.

Benchmark Results (default settings):
• Successful Requests: 1000
• Benchmark Duration: 169.46 seconds
• Request Throughput: 5.90 req/s
• Output Token Throughput: 1022.25 tok/s
• Total Token Throughput: 2393.86 tok/s
• Mean Time to First Token (TTFT): 34702.73 ms
• Median TTFT: 16933.34 ms
• Mean Time Per Output Token (TPOT): 164.05 ms

CPU Offload & Increased Context Length (120k tokens):
• Successful Requests: 1000
• Benchmark Duration: 439.96 seconds
• Request Throughput: 2.27 req/s
• Output Token Throughput: 393.61 tok/s
• Total Token Throughput: 921.91 tok/s
• Mean TTFT: 23549.66 ms
• Mean TPOT: 700.44 ms
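For anyone who wants to sanity-check numbers like these on their own GH200, something along these lines should be close to the CPU-offload configuration described above. This is just a minimal offline sketch, not the blog's exact setup: the checkpoint name, offload size, and context length are my assumptions (the published benchmark was a serving benchmark against the vLLM OpenAI server, whereas this uses the offline Python API).

```python
# Hypothetical reproduction sketch -- model name, cpu_offload_gb and
# max_model_len values are assumptions, not the blog's exact settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    tensor_parallel_size=1,        # single GH200
    cpu_offload_gb=120,            # spill part of the weights to LPDDR5X
    max_model_len=120_000,         # the "120k tokens" config from the post
    gpu_memory_utilization=0.95,
)

params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["Summarize the GH200 memory architecture."], params)
print(outputs[0].outputs[0].text)
```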

Full benchmarks are available here: Substratus Blog.

Given the GH200’s specs (624GB total memory: 144GB HBM3e plus 480GB LPDDR5X at 512GB/s), could it achieve reasonable generation speeds with DeepSeek V3 at Q4 quantization?
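For what it's worth, here's the rough napkin math I'm working from. The ~4.5 bits/weight figure for a Q4-style quant and the assumption that per-token expert reads end up bound by LPDDR5X bandwidth (once the 144GB of HBM3e is full) are my own assumptions, so correct me if they're off:

```python
# Back-of-envelope memory/throughput estimate for DeepSeek V3 at Q4 on a GH200.
# Assumptions (not measured): ~4.5 effective bits/weight for a Q4-style quant,
# 671B total / ~37B active (MoE) parameters, and that the active experts are
# read from LPDDR5X at ~512 GB/s because the weights overflow the 144GB HBM3e.

TOTAL_PARAMS = 671e9          # DeepSeek V3 total parameter count
ACTIVE_PARAMS = 37e9          # parameters activated per token (MoE)
BITS_PER_WEIGHT = 4.5         # assumed effective bits/weight at Q4
LPDDR5X_BW = 512e9            # bytes/s, GH200 CPU-side memory bandwidth

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
active_gb = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
# Crude single-stream decode ceiling: bandwidth / bytes touched per token.
tok_per_s_ceiling = LPDDR5X_BW / (ACTIVE_PARAMS * BITS_PER_WEIGHT / 8)

print(f"Quantized weights: ~{weights_gb:.0f} GB (vs 144 GB HBM3e, 624 GB total)")
print(f"Active weights per token: ~{active_gb:.1f} GB")
print(f"Naive decode ceiling from LPDDR5X bandwidth: ~{tok_per_s_ceiling:.0f} tok/s")
```

That works out to roughly 377GB of weights (fits in the 624GB combined memory but not in HBM alone) and a naive ceiling in the low tens of tok/s if decode is bound by LPDDR5X bandwidth, ignoring KV cache, routing locality, and batching, which is why I'm curious whether anyone has real numbers.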

Does anyone have experience benchmarking other models on the GH200, can anyone confirm the tok/s numbers from this benchmark, or know of additional resources?