OpenAI-compatible · built on MLX · Apple Silicon

Run LLMs locally on your Mac. Fast.

A high-performance inference server engineered for speed — continuous batching, prompt caching, speculative decoding. Drop-in for Cursor, Claude Code, Aider, and anything that speaks the OpenAI API.

$ brew install raullenchai/rapid-mlx/rapid-mlx
Apache 2.0 · Open source · 100% local · 3300+ tests
261 tok/s aggregate throughput
0.08s time to first token, cached
72+ models across 13 families

Performance

Hardware: Mac Studio M3 Ultra
Concurrency: 4 concurrent streams
Method: median of 3 rounds
Metric: aggregate throughput

Model                     Throughput
Qwen3.5-4B                261 tok/s
GPT-OSS 20B               221 tok/s
Qwen3.5-9B                180 tok/s
Qwen3.6-35B-A3B           176 tok/s
Qwen3.5-35B-A3B 8bit      151 tok/s
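To make the metric concrete, here is a minimal sketch of how an aggregate-throughput figure could be computed from the stated method (tokens from all concurrent streams divided by wall-clock time, then the median over rounds). The function names and the sample numbers are illustrative assumptions, not the project's benchmark harness or its measured data.

```python
from statistics import median


def aggregate_throughput(tokens_per_stream, wall_seconds):
    """Tokens generated by all concurrent streams divided by wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(tokens_per_stream) / wall_seconds


def median_throughput(rounds):
    """Median aggregate throughput over (tokens_per_stream, wall_seconds) rounds."""
    return median(aggregate_throughput(tokens, secs) for tokens, secs in rounds)


# Made-up numbers for illustration only: 3 rounds, 4 concurrent streams each.
example_rounds = [
    ([500, 500, 500, 500], 8.0),   # 2000 tokens / 8.0 s = 250.0 tok/s
    ([520, 480, 510, 490], 10.0),  # 2000 tokens / 10.0 s = 200.0 tok/s
    ([500, 500, 500, 500], 5.0),   # 2000 tokens / 5.0 s = 400.0 tok/s
]
print(median_throughput(example_rounds))  # 250.0
```

Using the median rather than the mean keeps one unusually fast or slow round from skewing the reported number.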

Multi-turn stays instant: prompt caching with KV trimming gives sub-100ms time to first token on transformers, and RNN state snapshots bring the same to hybrid architectures, a first on MLX.
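Time to first token is easy to check for yourself. The sketch below times how long an iterable of streamed text pieces takes to yield its first non-empty piece. The helper name and the fake stream are assumptions made for illustration; in practice you would pass in the content deltas of a streamed response from the local server.

```python
import time


def time_to_first_token(chunks, clock=time.perf_counter):
    """Return (seconds until the first non-empty chunk, that chunk).

    `chunks` is any iterable of text pieces, e.g. the deltas of a streamed
    response. Returns (None, None) if the stream ends without content.
    """
    start = clock()
    for piece in chunks:
        if piece:  # skip empty keep-alive or role-only deltas
            return clock() - start, piece
    return None, None


def fake_stream():
    """Stand-in for a streamed response; a real check would stream from the server."""
    yield ""
    time.sleep(0.01)
    yield "Hello"
    yield " world"


ttft, first = time_to_first_token(fake_stream())
print(f"first token {first!r} after {ttft:.3f}s")
```

Sending the same conversation prefix twice and comparing the two timings is one way to see the effect of prompt caching.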

Quickstart

01
$ brew install raullenchai/rapid-mlx/rapid-mlx
02
# serve a model — auto-downloads on first run
$ rapid-mlx serve qwen3.5-4b-4bit
⚡ serving on http://localhost:8000/v1

# or just chat
$ rapid-mlx chat
03
example.py
# point any OpenAI client at it — no key needed
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
r = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Say hello"},
    ],
)
print(r.choices[0].message.content)
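
If you would rather not install the openai package, the same call can be sketched with the Python standard library. This assumes the server accepts the standard OpenAI chat completions path under the base URL and returns the usual response shape, which is what the client example above relies on. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint printed by `rapid-mlx serve`


def build_chat_request(prompt, model="default", base_url=BASE_URL):
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Say hello")
print(req.get_method(), req.full_url)
# With the server running: print(send(req))
```

Building the request separately from sending it makes the payload easy to inspect before anything goes over the wire.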

Pick your model

72+ model aliases across 13 families, including Qwen, DeepSeek, Llama, Mistral, Gemma, and GLM. Recommendations are by Mac RAM (single-user throughput):

Mac RAM: 16 GB
Recommended model: Qwen3.5-4B 4bit (chat, coding, tools)
On disk: 2.4 GB
Throughput: 147 tok/s

Built for real work


Tool calling that holds up

17 parsers with automatic recovery when quantized models degrade.
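
To give a feel for what "recovery" can mean, here is a small sketch of lenient tool-call parsing: try strict JSON first, then fall back to the outermost braces and strip trailing commas. It is an illustration of the general idea only, not one of Rapid-MLX's 17 parsers, and the function name and the {"name", "arguments"} shape are assumptions.

```python
import json
import re


def parse_tool_call(text):
    """Return {"name": ..., "arguments": {...}} or None if nothing usable is found."""
    candidates = [text]
    # Fallback: the outermost {...} span, in case the model wrapped the JSON in prose.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        # Second attempt per candidate: drop trailing commas before } or ].
        for attempt in (candidate, re.sub(r",\s*([}\]])", r"\1", candidate)):
            try:
                call = json.loads(attempt)
            except json.JSONDecodeError:
                continue
            if isinstance(call, dict) and "name" in call:
                args = call.get("arguments", {})
                if isinstance(args, str):  # arguments sometimes arrive as a JSON string
                    try:
                        args = json.loads(args)
                    except json.JSONDecodeError:
                        continue
                return {"name": call["name"], "arguments": args}
    return None


messy = 'Sure! {"name": "get_weather", "arguments": {"city": "Paris",}} done'
print(parse_tool_call(messy))
# {'name': 'get_weather', 'arguments': {'city': 'Paris'}}
```

The trailing-comma regex is deliberately crude and would also touch a ",}" inside a string value, so a production parser needs more care than this.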


Reasoning models

Chain-of-thought separation for DeepSeek-R1, Qwen3, and friends.
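
As a rough picture of chain-of-thought separation, the sketch below splits raw model output into reasoning and answer, assuming the model wraps its reasoning in <think>...</think> tags as DeepSeek-R1 and Qwen3 do. The function name is hypothetical, and how Rapid-MLX exposes the separated reasoning in its API responses is not shown here.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)  # 2 + 2 is 4.
print(answer)     # The answer is 4.
```

Output with no think block passes through unchanged as the answer, with an empty reasoning string.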

Vision & audio

Multimodal via rapid-mlx[vision], plus speech-to-text (STT) and text-to-speech (TTS).


Production-minded

3300+ unit tests and a rapid-mlx doctor self-check.

Works with

Anything that speaks OpenAI. Tested with:

Cursor
Claude Code
Aider
Continue.dev
Open WebUI
LibreChat
PydanticAI
LangChain
smolagents
Goose

Frequently asked questions

What is Rapid-MLX?

Rapid-MLX is a high-performance, OpenAI-compatible LLM server for Apple Silicon Macs, built on Apple's MLX framework. It runs models like Qwen, DeepSeek, Llama, and Gemma locally and exposes a drop-in OpenAI API at localhost:8000/v1.

How fast is Rapid-MLX?

Rapid-MLX reaches up to 261 tokens per second of aggregate throughput (Qwen3.5-4B, 4 concurrent streams on a Mac Studio M3 Ultra) with a 0.08-second time to first token (cached), using continuous batching, prompt caching, and speculative decoding.

Does Rapid-MLX run on Intel Macs?

No. Rapid-MLX is built on Apple's MLX framework and requires an Apple Silicon Mac (M1 or newer) running macOS 14 or later. Intel Macs are not supported.

Is Rapid-MLX free and open source?

Yes. Rapid-MLX is free and open source under the Apache 2.0 license. The source is on GitHub.

Which models does Rapid-MLX support?

Rapid-MLX supports 72+ models across 13 families, including Qwen, DeepSeek, Llama, Gemma, and GLM — covering text, vision, and audio models.

Is Rapid-MLX OpenAI-compatible?

Yes. Rapid-MLX exposes a drop-in OpenAI-compatible API, so tools like Cursor, Claude Code, Aider, Continue, LangChain, and any OpenAI client work unchanged by pointing them at localhost:8000/v1.

How do I install Rapid-MLX?

Install with Homebrew: brew install raullenchai/rapid-mlx/rapid-mlx, then run rapid-mlx serve <model> to start a local OpenAI-compatible server.