DeepSeek 4 Flash: Local Inference Engine for Metal and CUDA
If you've been following the AI space, you know DeepSeek's models have been making waves for their performance per dollar. But running them locally has always been a bit of a hassle. That's where ds4 comes in.
It's a lightweight inference engine built specifically for DeepSeek 4 Flash, targeting both Metal (Apple Silicon) and CUDA (NVIDIA GPUs). No cloud dependencies, no bloated frameworks. Just a clean executable that runs the model on your own hardware.
What It Does
ds4 is a standalone inference engine for DeepSeek 4 Flash, the latest 16B-parameter model from DeepSeek. It loads the model weights, runs inference on the GPU (Metal or CUDA), and lets you interact with the model locally. The repo provides both C source for building the engine yourself and precompiled binaries for macOS and Linux.
Key features:
- Single-file model loading from Hugging Face (via huggingface-cli)
- Supports 4-bit and 8-bit quantization out of the box
- Uses Metal Performance Shaders (MPS) on Apple Silicon, CUDA on NVIDIA
- Minimal dependencies: just a modern C compiler + system GPU drivers
Why It’s Cool
This isn’t another high-level Python wrapper. It’s written in C, and it shows. The code is lean, focused, and easy to understand if you're comfortable with C. The author (antirez, the creator of Redis) clearly values speed and simplicity over abstraction layers.
A few things stand out:
- Quantization is built in – You don't need separate tools to quantize the model. Just download the weights and run ds4 with --q4 or --q8. This saves a lot of GPU memory (for a 16B-parameter model, 4-bit weights take roughly 8GB instead of roughly 32GB for FP16).
- Metal support is first-class – Many inference engines treat Metal as an afterthought. ds4 uses MPS kernels directly, so Apple Silicon Macs get true native performance.
- Minimal memory overhead – The engine itself uses <100MB of RAM. The model weights are the only real memory cost.
- No bloat – No Python runtime, no Docker, no pip installs. It's a single binary.
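To make the quantization point concrete, here is a minimal sketch of symmetric 4-bit block quantization in C. It is a generic illustration of the technique, not ds4's actual on-disk or in-memory format: the block size of 32 and the names q4_block, quantize_q4, and dequantize_q4 are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q4_BLOCK 32 /* weights per block (assumed for illustration, not taken from ds4) */

/* One quantized block: a float scale plus 32 four-bit values packed two per byte. */
typedef struct {
    float scale;
    uint8_t packed[Q4_BLOCK / 2];
} q4_block;

/* Round to the nearest integer, halves away from zero (avoids needing libm). */
static int round_to_int(float x) {
    return (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Quantize Q4_BLOCK floats into one block using a single symmetric scale. */
void quantize_q4(const float *src, q4_block *dst) {
    assert(src != NULL && dst != NULL);
    float amax = 0.0f;
    for (size_t i = 0; i < Q4_BLOCK; i++) {
        float a = src[i] < 0.0f ? -src[i] : src[i];
        if (a > amax) amax = a;
    }
    /* Map [-amax, amax] onto the integers -7..7. */
    dst->scale = amax / 7.0f;
    float inv = dst->scale > 0.0f ? 1.0f / dst->scale : 0.0f;
    for (size_t i = 0; i < Q4_BLOCK; i += 2) {
        /* Add 8 so each value is stored as an unsigned nibble in 1..15. */
        int lo = round_to_int(src[i] * inv) + 8;
        int hi = round_to_int(src[i + 1] * inv) + 8;
        dst->packed[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

/* Expand one block back into Q4_BLOCK approximate floats. */
void dequantize_q4(const q4_block *src, float *dst) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < Q4_BLOCK; i += 2) {
        uint8_t b = src->packed[i / 2];
        dst[i] = (float)((int)(b & 0x0F) - 8) * src->scale;
        dst[i + 1] = (float)((int)(b >> 4) - 8) * src->scale;
    }
}
```

On typical platforms where float is 4 bytes, each block in this sketch stores 32 weights in 20 bytes (5 bits per weight once the scale is counted), compared with 64 bytes for the same weights in FP16. Each reconstructed value differs from the original by at most half the block's scale.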
For a practical use case, imagine running a 16B model on a MacBook Air. With 4-bit quantization the weights come to roughly 8GB, so a 16GB machine could hold them without swapping to disk, while an 8GB machine would have almost no headroom. That's wild for a local model.
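The memory arithmetic behind that scenario is easy to check yourself. The sketch below is a back-of-the-envelope estimate only: the helper names weight_bytes and fits_in_ram are hypothetical, and the calculation ignores per-block scales, the KV cache, and whatever the operating system is already using.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store n_params weights at bits_per_weight bits each, rounded up. */
uint64_t weight_bytes(uint64_t n_params, unsigned bits_per_weight) {
    assert(bits_per_weight > 0 && bits_per_weight <= 32);
    return (n_params * bits_per_weight + 7) / 8;
}

/* Nonzero if the weights plus a fixed overhead fit within ram_bytes. */
int fits_in_ram(uint64_t n_params, unsigned bits_per_weight,
                uint64_t overhead_bytes, uint64_t ram_bytes) {
    return weight_bytes(n_params, bits_per_weight) + overhead_bytes <= ram_bytes;
}
```

For 16 billion parameters, weight_bytes returns 8,000,000,000 bytes at 4 bits and 32,000,000,000 bytes at 16 bits. Treat any result of fits_in_ram as optimistic, since real usage includes more than the weights and a fixed overhead.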
How to Try It
The repo has clear instructions. Here's the quick start:
- Download the binary – Precompiled versions are in the Releases tab. Grab the one for your OS.
- Get the model weights – Use huggingface-cli to download deepseek-ai/DeepSeek-4-Flash-4bit-gguf (or the 8-bit version). Place them in a models/ folder.
- Run it – ./ds4 --model ./models/deepseek-4-flash-4bit.gguf --q4
For CUDA on Linux, you'll need the CUDA toolkit (any recent version works). On macOS, just make sure your Mac has an M1 or newer chip.
If you want to build from source, running make with a C compiler installed is all you need. The repo includes a Makefile.
Final Thoughts
ds4 feels like the kind of tool a lot of us wanted: a small, fast, no-nonsense way to run large language models locally. The fact that it's written in C by someone who clearly understands performance makes it a breath of fresh air in a world of over-engineered AI stacks.
If you're a developer who wants to experiment with local LLMs without the overhead of Python, or you're on Apple Silicon and tired of slow Metal support, give ds4 a spin. It's still early days, but it already works, and it works well.
Found via @githubprojects
Repository: https://github.com/antirez/ds4