Blogs

Key Highlights From Running DeepSeek R-1 671B on 14x RTX 3090s + Epyc 7713 & 512GB RAM

How KTransformers Dominated llama.cpp in Real-World Inference

Posted on Feb 14, 2025
2 Minutes

#inference #llms #models #llama.cpp #ktransformers #r1 #deepseek #livestream

Cover Image — 14x RTX 3090 Basement AI Server running DeepSeek R-1 671B with KTransformers and vLLM—Full 6-hour livestream, prompt eval benchmarks, and LLM experiments.

Table of Contents

This blogpost was published on my X/Twitter account on February 14th, 2025 .

Hello! If you don’t know me, I’m the guy with the 14x RTX 3090s Basement AI Server . Earlier this week, alongside @OsMo999 as my guest commentator, I livestreamed running DeepSeek R-1 671B-q4 using KTransformers on my AI Server and below are the key takeaways from that 6-hour session.

Why This Experiment?

I was inspired by an announcement from the KTransformers Team showcasing optimizations for running and offloading DeepSeek R-1 671B, 4-bit quantized, to the CPU. Instead of just benchmarking this on my own, I livestreamed everything from A to Z—during the livestream, I went beyond the initial plan and dived into how I use my AI server with vLLM, ExLlamaV2, and llama.cpp, and ran some comparisons.

Key Highlights from the Stream

KTransformers: 13GB GPU offloaded, ~390GB CPU offload → Stream Recording at Timestamp 1:39:59
llama.cpp vs. KTransformers eval comparison → Stream Recording at Timestamp 4:12:29
Cat makes an appearance → Stream Recording at Timestamp 2:49:00

Biggest Takeaway: KTransformers Crushed llama.cpp in Prompt Eval Speeds

KTransformers boosted prompt eval speeds by ~15x over llama.cpp! This matched the benchmarks they released in their release docs. Here’s the eval data for the run at the 1:39:59 mark of the stream:

Prompt Evaluation

14 tokens → 9.18 tokens/sec

Generation Evaluation

805 tokens → 8.24 tokens/sec

Funny enough, last week I wrote a blog post saying not to use llama.cpp for multi-GPU setups… and then I ended up livestreaming it this week!

Watch the Full Stream Recording

YouTube Video

I plan to stream regularly, so let me know what you’d like to see next! Maybe you’d even like to join as a guest? 😎

I have previously written a lot of in-depth blogposts on LLMs and AI, but never livestreamed on my own, so I’d love to hear your feedback! Drop your thoughts, ideas, or even suggestions for future AI server experiments and livestreams.

Find me below: