The original post: /r/localllama by /u/EmilPi on 2025-01-06 16:55:49.

My company rig is described in https://www.reddit.com/r/LocalLLaMA/comments/1gjovjm/4x_rtx_3090_threadripper_3970x_256_gb_ram_llm/

0: set up CUDA 12.x
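Before building, it may help to confirm that the toolkit on your PATH really is a 12.x release. A minimal sketch, assuming nvcc --version prints its usual "release X.Y" line; the helper name cuda_major is made up for this example and is not part of CUDA:

```shell
# Hypothetical helper: read `nvcc --version` output on stdin, print the major version.
cuda_major() {
    sed -n 's/.*release \([0-9][0-9]*\)\.[0-9][0-9]*.*/\1/p'
}

# Demo on a captured sample line; on a real machine use: nvcc --version | cuda_major
echo "Cuda compilation tools, release 12.4, V12.4.131" | cuda_major   # prints 12
```

If this prints nothing, nvcc is either missing from PATH or reports its version in a format the sketch does not expect.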

1: set up llama.cpp:

git clone https://github.com/ggerganov/llama.cpp/
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
cmake --build build --config Release --parallel $(nproc)
Your llama.cpp with recently merged DeepSeek V3 support is ready! https://github.com/ggerganov/llama.cpp/

2: Now download the model:

cd ../
mkdir DeepSeek-V3-Q3_K_M
cd DeepSeek-V3-Q3_K_M
for i in {1..8} ; do wget "https://huggingface.co/bullerwins/DeepSeek-V3-GGUF/resolve/main/DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-0000$i-of-00008.gguf?download=true" -O DeepSeek-V3-Q3_K_M-0000$i-of-00008.gguf ; done
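The download is large, so it is worth checking that all eight shards actually arrived before starting the server. A rough sketch, run from inside the model directory; the helper names shard_name and missing_shards are invented for illustration, and the check only looks for files that are absent or empty (it does not verify checksums):

```shell
# Hypothetical helpers; the names are illustrative only.
shard_name() {
    # $1 = shard index, $2 = total shard count
    printf 'DeepSeek-V3-Q3_K_M-%05d-of-%05d.gguf\n' "$1" "$2"
}

missing_shards() {
    # $1 = directory, $2 = total shard count
    # Prints the path of every shard that is absent or zero bytes long.
    i=1
    while [ "$i" -le "$2" ]; do
        f="$1/$(shard_name "$i" "$2")"
        [ -s "$f" ] || echo "$f"
        i=$((i + 1))
    done
}

# No output means all 8 shards are present and non-empty.
missing_shards . 8
```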

3: Now run it on localhost on port 1234:

cd ../
./llama.cpp/build/bin/llama-server  --host localhost  --port 1234  --model ./DeepSeek-V3-Q3_K_M/DeepSeek-V3-Q3_K_M-00001-of-00008.gguf  --alias DeepSeek-V3-Q3-4k  --temp 0.1  -ngl 15  --split-mode layer -ts 3,4,4,4  -c 4096  --numa distribute
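In the command above, -ngl 15 offloads 15 layers to the GPUs and -ts 3,4,4,4 gives the proportions in which they are spread over the four cards. The ratios sum to 15, so each card ends up with roughly 3, 4, 4 and 4 layers, and the rest of the model stays in system RAM. The sketch below is plain proportional arithmetic for trying other values; it is not llama.cpp's actual assignment logic, and split_layers is a made-up name:

```shell
# Hypothetical helper: estimate how many offloaded layers land on each GPU.
split_layers() {
    # $1 = number of offloaded layers (-ngl), $2 = comma-separated ratios (-ts)
    awk -v ngl="$1" -v ts="$2" 'BEGIN {
        n = split(ts, r, ",")
        for (i = 1; i <= n; i++) total += r[i]
        for (i = 1; i <= n; i++)
            printf "GPU%d: %.1f layers\n", i - 1, ngl * r[i] / total
    }'
}

split_layers 15 3,4,4,4
```

If you raise -ngl because you have VRAM to spare, rerunning the helper with the new number shows about how the extra layers would be shared out.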

Done!

When you ask it something, e.g. using time curl ...:

time curl 'http://localhost:1234/v1/chat/completions' -X POST -H 'Content-Type: application/json' -d '{"model_name": "DeepSeek-V3-Q3-4k","messages":[{"role":"system","content":"You are an AI coding assistant. You explain as minimum as possible."},{"role":"user","content":"Write prime numbers from 1 to 100, no coding"}], "stream": false}'

you get output like

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97.","role":"assistant"}}],"created":1736179690,"model":"DeepSeek-V3-Q3-4k","system_fingerprint":"b4418-b56f079e","object":"chat.completion","usage":{"completion_tokens":75,"prompt_tokens":29,"total_tokens":104},"id":"chatcmpl-gYypY7Ysa1ludwppicuojr1anMTUSFV2","timings":{"prompt_n":28,"prompt_ms":2382.742,"prompt_per_token_ms":85.09792857142858,"prompt_per_second":11.751167352571112,"predicted_n":75,"predicted_ms":19975.822,"predicted_per_token_ms":266.3442933333333,"predicted_per_second":3.754538862030308}}
real    0m22.387s
user    0m0.003s
sys     0m0.008s
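The per-second figures in the "timings" object are just token count divided by elapsed time. A small sketch that redoes the arithmetic with the numbers from the response above; tokens_per_second is a made-up helper name:

```shell
# Hypothetical helper: tokens per second from a token count and a duration in ms.
tokens_per_second() {
    # $1 = token count, $2 = elapsed milliseconds
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n * 1000 / ms }'
}

tokens_per_second 75 19975.822   # generation: predicted_n, predicted_ms -> 3.75
tokens_per_second 28 2382.742    # prompt: prompt_n, prompt_ms -> 11.75
```

So on this rig, generation runs at about 3.75 tokens per second and prompt processing at about 11.75.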

In journalctl -f, the server logs something like

Jan 06 18:01:42 hostname llama-server[1753310]: slot      release: id  0 | task 5720 | stop processing: n_past = 331, truncated = 0
Jan 06 18:01:42 hostname llama-server[1753310]: slot print_timing: id  0 | task 5720 |
Jan 06 18:01:42 hostname llama-server[1753310]: prompt eval time =    1292.85 ms /    12 tokens (  107.74 ms per token,     9.28 tokens per second)
Jan 06 18:01:42 hostname llama-server[1753310]:        eval time =   89758.14 ms /   318 tokens (  282.26 ms per token,     3.54 tokens per second)
Jan 06 18:01:42 hostname llama-server[1753310]:       total time =   91050.99 ms /   330 tokens
Jan 06 18:01:42 hostname llama-server[1753310]: srv  update_slots: all slots are idle
Jan 06 18:01:42 hostname llama-server[1753310]: request: POST /v1/chat/completions  200 172.17.0.2
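If you only care about the speed figures, you can filter them out of the log. A minimal sketch, assuming the timing lines keep the "... tokens per second)" shape shown above; extract_tps is an invented name and the parsing is tied to this exact log format:

```shell
# Hypothetical helper: read llama-server log lines on stdin and print
# "prompt <rate>" or "generation <rate>" for each timing line.
extract_tps() {
    awk '/eval time =/ {
        kind = ($0 ~ /prompt eval time/) ? "prompt" : "generation"
        # The rate is the field right before the words "tokens per second)".
        for (i = 1; i <= NF; i++)
            if ($i == "tokens" && $(i + 1) == "per" && $(i + 2) == "second)")
                print kind, $(i - 1)
    }'
}

# Demo on one of the lines above; in practice, pipe your journalctl output in.
echo "eval time =   89758.14 ms /   318 tokens (  282.26 ms per token,     3.54 tokens per second)" | extract_tps
```

For the sample line this prints "generation 3.54".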

Good luck, fellow rig-builders!

Run DeepSeek-V3 with 96GB VRAM + 256 GB RAM under Linux
