A high-performance inference server engineered for speed — continuous batching, prompt caching, speculative decoding. Drop-in for Cursor, Claude Code, Aider, and anything that speaks the OpenAI API.
$ brew install raullenchai/rapid-mlx/rapid-mlx
261 tok/s aggregate throughput
0.08 s time to first token, cached
72+ models across 13 families
Multi-turn stays instant: prompt caching with KV trimming gives sub-100ms time to first token on transformers, and RNN state snapshots bring the same to hybrid architectures, a first on MLX. Full methodology and reproduction scripts are available.
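As a rough illustration of the prefix-reuse idea behind prompt caching, the sketch below compares a new prompt's token IDs with the tokens already held in a cache, trims the cache back to the shared prefix, and returns only the tokens that still need a forward pass. `PrefixCache` and `common_prefix_len` are hypothetical names for this example. The sketch tracks token IDs only and is not Rapid-MLX's actual implementation, which would operate on real KV tensors.

```python
from dataclasses import dataclass, field


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading token IDs shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class PrefixCache:
    """Toy stand-in for a KV cache: one entry per cached token position."""
    tokens: list[int] = field(default_factory=list)

    def prepare(self, prompt: list[int]) -> list[int]:
        """Trim the cache to the prefix shared with `prompt`.

        Returns the suffix of `prompt` that still has to be processed.
        """
        keep = common_prefix_len(self.tokens, prompt)
        # "KV trimming": drop cached positions that diverge from the new prompt.
        del self.tokens[keep:]
        to_process = prompt[keep:]
        # After the forward pass the cache would cover the whole prompt.
        self.tokens.extend(to_process)
        return to_process


cache = PrefixCache()
turn1 = [1, 2, 3, 4]
turn2 = [1, 2, 3, 4, 5, 6]  # same conversation, one more message
turn3 = [1, 2, 9]           # edited history: diverges after two tokens

first = cache.prepare(turn1)   # cold cache: all four tokens are processed
second = cache.prepare(turn2)  # warm cache: only the two new tokens
third = cache.prepare(turn3)   # cache trimmed to two tokens, one new token
```

On a follow-up turn only the new suffix is processed, which is why time to first token can stay low as a conversation grows.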
# serve a model (auto-downloads on first run)
$ rapid-mlx serve qwen3.5-4b-4bit
⚡ serving on http://localhost:8000/v1

# or just chat
$ rapid-mlx chat
# point any OpenAI client at it; no key needed
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

r = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Say hello"},
    ],
)
print(r.choices[0].message.content)
72+ model aliases across 13 families — Qwen, DeepSeek, Llama, Mistral, Gemma, GLM. Recommended by Mac spec (single-user throughput):
17 parsers with automatic recovery when quantized models degrade.
Chain-of-thought separation for DeepSeek-R1, Qwen3, and friends.
Multimodal via rapid-mlx[vision], plus STT and TTS.
3300+ unit tests and a rapid-mlx doctor self-check.
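To make the chain-of-thought separation concrete: reasoning models such as DeepSeek-R1 and Qwen3 typically wrap their reasoning in `<think>...</think>` tags. The sketch below splits a complete (non-streaming) response into reasoning and answer. `split_reasoning` is a hypothetical helper written for this example and is not Rapid-MLX's own parser.

```python
import re

# Non-greedy match so several <think> blocks are handled separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a complete model response."""
    reasoning = "\n".join(m.strip() for m in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
```

A response without tags passes through unchanged as the answer, with an empty reasoning string.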
Anything that speaks OpenAI. Tested with:
Cursor
Claude Code
Aider
Continue.dev
Open WebUI
LibreChat
PydanticAI
LangChain
smolagents
Goose
Rapid-MLX is a high-performance, OpenAI-compatible LLM server for Apple Silicon Macs, built on Apple's MLX framework. It runs models like Qwen, DeepSeek, Llama, and Gemma locally and exposes a drop-in OpenAI API at localhost:8000/v1.
Rapid-MLX reaches up to 261 tokens per second of aggregate throughput with a 0.08-second time to first token (cached) on Apple Silicon, using continuous batching, prompt caching, and speculative decoding.
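For reference, the two metrics quoted here can be computed from per-request timestamps roughly as sketched below. The `RequestTrace` record and its field names are assumptions for illustration, and the sample values are made up to show the arithmetic; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    sent_at: float         # seconds, when the request was sent
    first_token_at: float  # seconds, when the first token arrived
    done_at: float         # seconds, when the last token arrived
    tokens: int            # completion tokens generated


def time_to_first_token(t: RequestTrace) -> float:
    """Latency between sending the request and receiving the first token."""
    return t.first_token_at - t.sent_at


def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Total tokens across all requests divided by the wall-clock span."""
    start = min(t.sent_at for t in traces)
    end = max(t.done_at for t in traces)
    return sum(t.tokens for t in traces) / (end - start)


# Two overlapping requests served in the same 2-second window (made-up values).
traces = [
    RequestTrace(sent_at=0.0, first_token_at=0.25, done_at=2.0, tokens=100),
    RequestTrace(sent_at=0.0, first_token_at=0.5, done_at=2.0, tokens=100),
]
ttfts = [time_to_first_token(t) for t in traces]
tok_per_s = aggregate_throughput(traces)
```

Aggregate throughput counts tokens from all concurrent requests, so it is expected to exceed the single-user rate that any one request sees.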
Rapid-MLX is built on Apple's MLX framework and requires an Apple Silicon Mac (M1 or newer) running macOS 14 or later. Intel Macs are not supported.
Rapid-MLX is free and open source under the Apache 2.0 license. The source is on GitHub.
Rapid-MLX supports 72+ models across 13 families, including Qwen, DeepSeek, Llama, Gemma, and GLM — covering text, vision, and audio models.
Rapid-MLX exposes a drop-in OpenAI-compatible API, so tools like Cursor, Claude Code, Aider, Continue, LangChain, and any OpenAI client work unchanged once they are pointed at localhost:8000/v1.
Install with Homebrew: brew install raullenchai/rapid-mlx/rapid-mlx, then run rapid-mlx serve <model> to start a local OpenAI-compatible server.
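Once the server is running, any HTTP client can talk to it. The sketch below uses only the Python standard library. It assumes the default address `http://localhost:8000/v1`, the standard OpenAI `/chat/completions` route, and the `"default"` model name used in the client example earlier on this page. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed default server address


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request; requires `rapid-mlx serve <model>` to be running."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Say hello")
```

Calling `chat("Say hello")` should return the model's reply as a string when a server is listening. Building the request separately makes it possible to inspect the payload without a server running.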