DeepSeek V3 GGUF 2-bit surprisingly works! + BF16, other quants

Hey guys we uploaded GGUF's including 2, 3 ,4, 5, 6 and 8-bit quants for Deepseek V3.

We've also de-quantized Deepseek-V3 to upload the bf16 version so you guys can experiment with it (1.3TB)

Minimum hardware requirements to run Deepseek-V3 in 2-bit: 48GB RAM + 250GB of disk space.

See how to run Deepseek V3 with examples and our full collection here: https://huggingface.co/collections/unsloth/deepseek-v3-all-versions-677cf5cfd7df8b7815fc723c

Deepseek V3 version Links
GGUF 2-bit: Q2_K_XS and Q2_K_L
GGUF 3456 and 8-bit
bf16 dequantized 16-bit

The Unsloth GGUF model details:

Quant Type Disk Size Details
Q2_K_XS 207GB Q2 everything, Q4 embed, Q6 lm_head
Q2_K_L 228GB Q3 down_proj Q2 rest, Q4 embed, Q6 lm_head
Q3_K_M 298GB Standard Q3_K_M
Q4_K_M 377GB Standard Q4_K_M
Q5_K_M 443GB Standard Q5_K_M
Q6_K 513GB Standard Q6_K
Q8_0 712GB Standard Q8_0
  • Q2_K_XS should run ok in ~40GB of CPU / GPU VRAM with automatic llama.cpp offloading.
  • Use K quantization (not V quantization)
  • Do not forget about <|User|> and <|Assistant|> tokens! - Or use a chat template formatter

Example with Q5_0 K quantized cache (V quantized cache doesn't work):

./llama.cpp/llama-cli
    --model unsloth/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf
    --cache-type-k q5_0
    --prompt '<|User|>What is 1+1?<|Assistant|>'

and running the above generates:

The sum of 1 and 1 is **2**. Here's a simple step-by-step breakdown:
 1. **Start with the number 1.**
 2. **Add another 1 to it.**
 3. **The result is 2.**
 So, **1 + 1 = 2**. [end of text]