DeepSeek V3 GGUF 2-bit surprisingly works! + BF16, other quants
Hey guys, we uploaded GGUFs including 2, 3, 4, 5, 6 and 8-bit quants for DeepSeek V3.
We've also dequantized DeepSeek-V3 and uploaded the bf16 version (1.3TB) so you guys can experiment with it.
Minimum hardware requirements to run DeepSeek-V3 in 2-bit: 48GB RAM + 250GB of disk space.
See how to run DeepSeek V3, with examples and our full collection, here: https://huggingface.co/collections/unsloth/deepseek-v3-all-versions-677cf5cfd7df8b7815fc723c
DeepSeek V3 version | Links |
---|---|
GGUF | 2-bit: Q2_K_XS and Q2_K_L |
GGUF | 3, 4, 5, 6 and 8-bit |
bf16 | dequantized 16-bit |
The Unsloth GGUF model details:
Quant Type | Disk Size | Details |
---|---|---|
Q2_K_XS | 207GB | Q2 everything, Q4 embed, Q6 lm_head |
Q2_K_L | 228GB | Q3 down_proj Q2 rest, Q4 embed, Q6 lm_head |
Q3_K_M | 298GB | Standard Q3_K_M |
Q4_K_M | 377GB | Standard Q4_K_M |
Q5_K_M | 443GB | Standard Q5_K_M |
Q6_K | 513GB | Standard Q6_K |
Q8_0 | 712GB | Standard Q8_0 |
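If you want to pick a quant by available disk space, here is a minimal sketch. The sizes are copied from the table above; the function name `largest_quant_that_fits` is just an illustration, and it only looks at disk size, not RAM or VRAM.

```python
# Disk sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q2_K_XS": 207,
    "Q2_K_L": 228,
    "Q3_K_M": 298,
    "Q4_K_M": 377,
    "Q5_K_M": 443,
    "Q6_K": 513,
    "Q8_0": 712,
}


def largest_quant_that_fits(free_disk_gb):
    """Return the name of the largest quant that fits in free_disk_gb, or None."""
    fitting = {
        name: size for name, size in QUANT_SIZES_GB.items() if size <= free_disk_gb
    }
    if not fitting:
        return None
    # The largest file size is used here as a rough proxy for the highest precision.
    return max(fitting, key=fitting.get)


print(largest_quant_that_fits(250))
```

To feed it the real free space, something like `shutil.disk_usage(".").free / 1e9` should give a rough figure in GB.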
- Q2_K_XS should run OK in ~40GB of CPU RAM / GPU VRAM with automatic llama.cpp offloading.
- Use K cache quantization (not V cache quantization).
- Do not forget about the `<|User|>` and `<|Assistant|>` tokens! Or use a chat template formatter.
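If you are building prompts by hand, a small helper keeps the tokens from being forgotten. This is only a sketch for a single-turn prompt: `build_prompt` is a made-up name, and the token strings are copied as written in this post, so check them against the model's own chat template before relying on them.

```python
# Token strings as written in this post (assumption: verify against the
# model's chat template, which is the authoritative source).
USER_TOKEN = "<|User|>"
ASSISTANT_TOKEN = "<|Assistant|>"


def build_prompt(user_message):
    """Wrap one user message so the model sees a user turn, then an open assistant turn."""
    return f"{USER_TOKEN}{user_message}{ASSISTANT_TOKEN}"


print(build_prompt("What is 1+1?"))
```

For multi-turn chats, a chat template formatter is probably the safer route, since this helper does not handle system prompts or end-of-turn tokens.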
Example with Q5_0 K quantized cache (V quantized cache doesn't work):
./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf \
    --cache-type-k q5_0 \
    --prompt '<|User|>What is 1+1?<|Assistant|>'
Running the above generates:
The sum of 1 and 1 is **2**. Here's a simple step-by-step breakdown:
1. **Start with the number 1.**
2. **Add another 1 to it.**
3. **The result is 2.**
So, **1 + 1 = 2**. [end of text]
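If you would rather script this than type it out, the same llama-cli invocation can be assembled as an argument list. A rough sketch, using only the flags shown in the command above; `build_llama_cli_command` is a hypothetical helper, and the default binary path assumes llama.cpp sits in the current directory as in the example.

```python
import shlex


def build_llama_cli_command(
    model_path,
    prompt,
    cache_type_k="q5_0",
    binary="./llama.cpp/llama-cli",
):
    """Build the argv list for llama-cli with a quantized K cache."""
    return [
        binary,
        "--model", model_path,
        "--cache-type-k", cache_type_k,  # K cache only; the post says V cache quantization doesn't work
        "--prompt", prompt,
    ]


cmd = build_llama_cli_command(
    "unsloth/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf",
    "<|User|>What is 1+1?<|Assistant|>",
)

# Print a shell-safe version of the command for copy-pasting.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` should launch it without any shell quoting issues around the `<|User|>` tokens.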