
Blogs
-
DeepSeek R-1 671B on 14x RTX 3090s: Real-World Results & Key Takeaways
Posted on
2 Minutes
How KTransformers Dominated llama.cpp in Real-World Inference
Hello! If you don’t know me, I’m the guy with the 14x RTX 3090s Basement AI Server. Earlier this week, alongside @OsMo999 as my guest commentator, I livestreamed running DeepSeek R-1 671B-q4 using KTransformers on my AI Server, and below are the key takeaways from that 6-hour session.
I was inspired by an announcement from the KTransformers Team showcasing optimizations for running DeepSeek R-1 671B, 4-bit quantized, with part of the model offloaded to the CPU. Instead of just benchmarking this on my own, I livestreamed everything from A to Z. During the stream I went beyond the initial plan, dove into how I use my AI server with vLLM, ExLlamaV2, and llama.cpp, and ran some comparisons.
KTransformers boosted prompt eval speeds by ~15x over llama.cpp! This matched the benchmarks published in their release docs. Here’s the eval data for the run at the 1:39:59 mark of the stream:
Prompt Evaluation and Generation Evaluation (tables shown on the stream).
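For anyone curious how figures like that ~15x are derived, here is a minimal sketch of the arithmetic behind prompt-eval throughput; the token count and timings are purely illustrative placeholders, not the stream’s actual measurements:

```python
# Minimal sketch of how prompt-eval throughput and a speedup ratio are derived.
# The timings below are illustrative placeholders, NOT the stream's actual numbers.

def tokens_per_second(num_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and elapsed time in milliseconds to tokens/second."""
    return num_tokens / (elapsed_ms / 1000.0)

prompt_tokens = 1024            # hypothetical prompt length
llama_cpp_ms = 120_000.0        # hypothetical llama.cpp prompt eval time
ktransformers_ms = 8_000.0      # hypothetical KTransformers prompt eval time

baseline = tokens_per_second(prompt_tokens, llama_cpp_ms)
optimized = tokens_per_second(prompt_tokens, ktransformers_ms)

print(f"llama.cpp:     {baseline:.2f} tok/s")
print(f"KTransformers: {optimized:.2f} tok/s")
print(f"speedup:       {optimized / baseline:.1f}x")
```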
Funny enough, last week I wrote a blog post saying not to use llama.cpp for multi-GPU setups… and then I ended up livestreaming it this week!
I plan to stream regularly, so let me know what you’d like to see next! Maybe you’d even like to join as a guest? 😎
I have previously written a lot of in-depth blogposts on LLMs and AI, but never livestreamed on my own, so I’d love to hear your feedback! Drop your thoughts, ideas, or even suggestions for future AI server experiments and livestreams.
Find me below:
-
Stop Wasting Your Multi-GPU Setup—Use vLLM or ExLlamaV2 for Tensor Parallelism
Posted on
7 Minutes
Use vLLM or ExLlamaV2 for Tensor Parallelism
Context: Yesterday, I watched @ThePrimeagen’s live stream (love his streams by the way) where he was stress testing his new Green Tinybox, a 6x RTX 4090 build. His plan was to have the LLMs send and receive concurrent messages and respond to each other, increasing the number and frequency of those messages over time, as a way to stress test those GPUs; and he was using llama.cpp for inference. The llama.cpp part got my attention, because with such a powerful setup, llama.cpp is pretty much a system crippler. Around the 26-minute mark of his stream, I commented on that, and after some back-and-forth, I figured it was best not to hijack his stream and just write this blogpost instead.
In one of his responses while live streaming, Michael (@ThePrimeagen) showed a GitHub thread about llama.cpp supporting concurrent requests, but that is more on the software threading side of things; llama.cpp is not optimized for Tensor Parallelism and Batch Inference. In this blogpost, we dive into the details of various inference engines and explain when each one makes sense depending on your setup. We’ll cover llama.cpp for CPU offloading when you don’t have enough GPU memory, how vLLM’s Tensor Parallelism gives a massive boost for multi-GPU systems with batch inference, and why ExLlamaV2’s EXL2 quantization is a great choice for Tensor Parallelism and Batch Inference when memory is limited, but not critically so.
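To preview the vLLM side of that argument, here is a minimal sketch of batched inference with Tensor Parallelism across multiple GPUs using vLLM’s Python API; the model name and GPU count are placeholders, so adjust them to whatever fits your VRAM:

```python
# Minimal sketch: batched inference with Tensor Parallelism in vLLM.
# Model name and tensor_parallel_size are placeholders; adjust to your setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    tensor_parallel_size=4,                        # shard weights across 4 GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts internally (continuous batching),
# which is where the multi-GPU throughput win comes from.
prompts = [
    "Explain tensor parallelism in one paragraph.",
    "Why is batch inference faster than sequential requests?",
    "Summarize the trade-offs of 4-bit quantization.",
]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```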
In Short: an Inference Engine is software that understands how to properly send human input to, and in turn show human-readable output from, these massive AI Models. In more detail, Large Language Models (LLMs) are Deep Learning Neural Network Models. The LLMs we use right now come from an architecture called the Transformer, which was introduced in the famous paper Attention Is All You Need. Inference Engines usually utilize the Transformers library implemented by the Hugging Face team, which at a lower level supports the PyTorch, TensorFlow, and JAX libraries, allowing for the wide variety of hardware support those libraries provide tooling for.
The short version above is really all you need to know, so feel free to skip to the next section. But in case you’re curious, Inference Engines also implement a variety of things that are not necessarily provided out of the box by the Transformers library, such as quantization formats and the model-architecture support for those quantizations.
Are you still with me? Good. There are several layers to how an Inference Engine works. It starts at the bottom level with the hardware you are running (CPU only, mixed CPU and GPU, GPU only, TPU, NPU, etc.), then it looks into the details of that hardware (Intel, AMD ROCm, Nvidia CUDA, etc.), then it goes one level higher and figures out whether you are using a quantization (GGUF, EXL2, AWQ, etc.) or the original safetensors weights, and finally the model architecture itself. The model architecture is the secret sauce (sometimes released in a training/white paper) for how the model does its magic: making meaningful connections from your input and then producing a, hopefully, meaningful output.
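As a rough illustration of those layers, here is a minimal sketch that uses the Hugging Face Transformers library directly: it picks a device, loads the original weights, and generates text, which is essentially the bare-bones path a full inference engine wraps and optimizes. The model name is a small placeholder so the sketch runs on modest hardware:

```python
# Minimal sketch: the bare-bones path an inference engine builds on top of.
# Model name is a placeholder; any causal LM from the Hugging Face Hub works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Layers 1-2: the hardware and its details (here just CUDA vs CPU).
device = "cuda" if torch.cuda.is_available() else "cpu"

# Layers 3-4: the weights format and model architecture, resolved from the Hub.
model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

prompt = "Explain what an inference engine does, in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```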
Imagine Large Language Models as complex Lego sets with Transformers as the basic bricks. However, each set has different instructions—some build from left to right, others can see the whole picture at once, and some might even shuffle the pieces around for a unique approach. Plus, each set might have its own special pieces or ways of snapping the Legos together, making each model unique in how it constructs or understands language.
There are many architectures out there; what we need to keep in mind is that each one handles input in its own way, which means the code implementation for how they’re understood, AKA Inference, is also different.
llama.cpp is an Inference Engine that supports a wide variety of model architectures and hardware platforms. It does not, however, support Batch Inference, making it less than ideal for more than one request at a time. It is mainly used with the GGUF quantization format, and the engine runs with okay performance for single-request runs but not much else. The only time I would actually recommend using llama.cpp is when you do not have enough GPU Memory (VRAM) and need to offload some of the model weights to CPU Memory (RAM).
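For that partial-offload scenario, here is a minimal sketch using the llama-cpp-python bindings; the GGUF path and layer count are placeholders, and the idea is simply to push as many layers as fit into VRAM while the rest stay in system RAM:

```python
# Minimal sketch: partial CPU/GPU offload with llama-cpp-python.
# The GGUF path and n_gpu_layers value are placeholders for your own setup.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/some-model-q4_k_m.gguf",  # placeholder GGUF file
    n_gpu_layers=20,   # layers that fit in VRAM; the rest stay in system RAM
    n_ctx=4096,        # context window
)

output = llm(
    "Explain why partial GPU offload helps when VRAM is limited.",
    max_tokens=128,
)
print(output["choices"][0]["text"].strip())
```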
-
Resources From X/Twitter Audio Space on LLMs & AI (2025-02-02)
Posted on
3 Minutes
Curated Links & Insights from the X/Twitter Space on LLMs, RAG, and AI Tools
Here are the resources that were shared and discussed during the Space on February 2nd, 2025. I’ve also included a few additional resources that I believe will enhance the collection. Space recording (in Arabic) can be found here.
AI, ML, and Neural Networks Visualization
LangChain and Node.js Tutorials
AI adoption in the Middle East
Three Takeaways From DeepSeek’s Big Week
Six Takeaways From a Monumental Week for AI
Shared by @Mishtar
Generative AI in Action
AI Agents in Action
-
Antifragile AI: Harnessing Uncertainty for a Resilient Future
Posted on
5 Minutes
The Evolution from Traditional Software to AI Agentic Systems
I came out of this experience with the following thesis: given enough data, the aggregate of anyone’s thoughts (or anything’s properties and/or behaviors) could be simulated; a completely new paradigm for engaging with ideas and thoughts. The potential of contextual synthesis at scale is beyond our wildest imaginations, but I will leave exploring that to future posts.
Building such a system wouldn’t be easy. It would require a significant amount of “bricolage” – piecing together various technologies and approaches in creative ways. But the potential rewards are immense.
Not long ago, I couldn’t grasp the urgency of ideas like Effective Accelerationism. Now, it feels like our only viable path forward. The rapid acceleration of AI capabilities demands we innovate, adapt, and build a future that uplifts everyone—no exceptions.
This isn’t just about keeping up; it’s about thriving in an era of unprecedented change. How do we stay antifragile in this relentless trajectory? By holding onto hope, pushing the boundaries of what’s possible, and shaping the future we want to see.
Now, let’s talk about Software.
Fundamentally, whether it’s a simple calculator or a sophisticated AI system, all computer programs can be thought of as agents that interact with their environment in some way. They perceive their environment through inputs and act upon it through their outputs and actions, even if their autonomy and adaptability vary widely.
The distinction between traditional software and the emerging paradigm of AI Agentic software is profound. This shift represents a fundamental change in how systems are designed, deployed, and interact with their environments, moving from robustness through iteration to antifragility, where systems not only withstand but thrive under uncertainty and stress.
Traditional software is characterized by its deterministic nature. It executes predefined algorithms and operates within a fixed framework, producing consistent outputs for given inputs. This approach is highly effective for tasks that are well-defined and require repeatability. However, this determinism also imposes limitations. Traditional software lacks the ability to adapt to new scenarios or learn from experience.
AI Agentic software, in contrast, embodies principles of adaptability and autonomy. These systems are designed to learn from their environment, make decisions based on available data, and adjust their behavior to achieve specified goals. They leverage machine learning algorithms, natural language processing, and other AI techniques to interpret complex inputs and generate contextually appropriate responses. In theory they can leverage anything and everything available to them, but we are not there yet!
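As a toy illustration of that contrast (entirely hypothetical names and logic, not any specific framework), here is a sketch of a deterministic routine next to a minimal agent loop that perceives its environment, acts on it, and adapts its behavior from feedback:

```python
# Toy sketch contrasting the two paradigms; names and logic are hypothetical.
import random

# Traditional software: deterministic, same output for the same input.
def convert_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32

# Agentic software: perceives, acts, and adapts based on feedback.
class ThermostatAgent:
    def __init__(self, target: float) -> None:
        self.target = target
        self.adjustment = 1.0  # step size the agent tunes over time

    def act(self, observed_temp: float) -> float:
        """Decide how much to heat or cool based on what it perceives."""
        error = self.target - observed_temp
        action = self.adjustment if error > 0 else -self.adjustment
        # Adapt: once it gets close, take smaller steps to avoid overshooting.
        if abs(error) < self.adjustment:
            self.adjustment *= 0.5
        return action

agent = ThermostatAgent(target=21.0)
temp = 17.0
for step in range(10):
    temp += agent.act(temp) + random.uniform(-0.2, 0.2)  # noisy environment
    print(f"step {step}: temp={temp:.2f}, step size={agent.adjustment:.2f}")
```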
-
All In — Stop Caring & Play The Game
Posted on
2 Minutes
Playing The Game Is The Only Way To Win
The truth is, it does not matter. I came to realize about a year ago that not a single thing I have achieved in life or done to completion was ever for a prize or a reward, and somehow these were the most fulfilling and most rewarding things. When I stopped caring and trusted my instincts, I won.
Life is a game. They always say that children’s curiosity is such an amazing thing, and in my opinion, it is more about them totally not caring about a single thing. Of course, that comes from a place of safety, as in having supportive and caring parents; nevertheless, the hypothesis holds.
With high stakes in life, you cannot always be risk-averse; however, when you have done your homework and you know the underlying logic is valid, and thus the risk might be highly rewarding, you just have to override your brain’s defaults.
Life is a game. I like playing the game. I have always liked to play the game. I enjoy nothing in life more than participating in it like a game. I am privileged to have been able to see it as a game. And the only way to honor that is by not letting my brain scare me away and to keep playing the game and winning.