OpenAI-compatible · built on MLX · Apple Silicon

Run LLMs locally
on your Mac. Fast.

Q: How do I install Rapid-MLX?

Install with one curl command: curl -fsSL https://rapidmlx.com/install.sh | bash. The script probes for Python 3.10+ (auto-installs python-build-standalone if missing), creates a venv at ~/.rapid-mlx, and symlinks rapid-mlx into ~/.local/bin. No sudo required. Prefer Homebrew? It's in homebrew/core — just brew install rapid-mlx (no tap, no trust). Then run rapid-mlx serve to start a local OpenAI-compatible server.

A high-performance inference server engineered for speed — continuous batching, prompt caching, and PFlash + DFlash acceleration. Drop-in for Cursor, Claude Code, Aider, and anything that speaks the OpenAI API.

$ curl -fsSL https://rapidmlx.com/install.sh | bash

All install options → Read the docs → View on GitHub →

100% local·Apple Silicon·191 models·3300+ tests

rapid-mlx · local inference

http://localhost:8000/v1

Qwen3.5-4B· Apple Silicon

261tok/s

local only · no cloud

261

tok/s aggregate throughput

0.08s

time to first token, cached

191

models across 15 families

2.9k★

stars on GitHub

Performance

HardwareMac Studio M3 Ultra

Concurrency4 concurrent streams

Methodmedian of 3 rounds

Metricaggregate throughput

Qwen3.5-4B261 tok/s

GPT-OSS 20B221 tok/s

Qwen3.5-9B180 tok/s

Qwen3.6-35B-A3B176 tok/s

Qwen3.5-35B-A3B 8bit151 tok/s

Multi-turn stays instant: prompt caching with KV trimming gives sub-100ms time to first token on transformers, and RNN state snapshots bring the same to hybrid architectures — a first on MLX. Full methodology and reproduction scripts →

Got a different Mac? See — and contribute — community-submitted numbers across Apple Silicon →

Fast is only half of it. We scored 17 MLX models on coding, reasoning, and tool calling on an M3 Ultra — see which ones are actually smart →

Which model fits your Mac?

From a 16 GB MacBook Air to frontier 158B-MoE models on a Mac Studio — there's a model sized for your machine.

Qwen3.5-122B DeepSeek V4 Flash 158B · day-0 GPT-OSS 120B up to 1M context all on one Mac Studio

Your Mac	Best model	Speed · M3 Ultra ref	What you get
16 GB	Qwen3.5-4B	147 tok/s	Chat, coding, tool calling
24 GB	Qwen3.5-9B	101 tok/s	Great all-rounder
32 GB	GPT-OSS 20B	119 tok/s	Harmony-native · 100% tool calling
48 GB	Qwen3.5-35B-A3B 8bit	80 tok/s	Sweet spot — smart + fast
96 GB	Qwen3.5-122B	43 tok/s	Frontier-level intelligence
128 GB	DeepSeek V4 Flash 158B	31–56 tok/s	Day-0 frontier MoE · 1M context

Single-stream throughput on the listed Mac (M3 Ultra reference, rapid-mlx). Browse all 191 models — including day-0 support for our Tier-1 families — Qwen 3.6, Gemma 4, DeepSeek, and gpt-oss 20B/120B — with a live RAM picker at models.rapidmlx.com →

Quickstart

$ curl -fsSL https://rapidmlx.com/install.sh | bash

zsh

# serve a model — auto-downloads on first run
$ rapid-mlx serve qwen3.5-4b-4bit
⚡ serving on http://localhost:8000/v1

# or just chat
$ rapid-mlx chat

example.pyserving · localhost:8000/v1

# point any OpenAI client at it — no key needed
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
r = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Say hello"},
    ],
)
print(r.choices[0].message.content)

Built for real work

tool_call→ recovered

Tool calling that holds up

17 parsers with automatic recovery when quantized models degrade.

<think>answer

Reasoning models

Chain-of-thought separation for DeepSeek-R1, Qwen3, and friends.

Vision & audio

Vision via rapid-mlx[vision]. Audio: 26 aliases — Kokoro, Chatterbox, VibeVoice, Whisper, Parakeet — through /v1/audio/speech & /v1/audio/transcriptions.

3300+ passingrapid-mlx doctor ✓

Production-minded

3300+ unit tests and a rapid-mlx doctor self-check.

PFlash prefill · 3.87–8.5× cold TTFT Prompt cache · 2–5× faster TTFT DFlash speculative decoding · 2.6× Audio TTS/STT · 26 aliases Continuous batching Smart cloud routing KV-cache quantization Structured JSON output

Works with

Anything that speaks OpenAI. Tested with:

Cursor

Claude Code

Aider

Continue.dev

Open WebUI

localhost:8000/v1

OpenAI-compatible

LibreChat

PydanticAI

LangChain

smolagents

Goose

Also verified end-to-end: Codex CLI · OpenCode · Hermes Agent · OpenClaude · Claw Code · Goose

Star on GitHub ↗ Read the docs →

Frequently asked questions

What is Rapid-MLX?

Rapid-MLX is a high-performance, OpenAI-compatible LLM server for Apple Silicon Macs, built on Apple's MLX framework. It runs the latest open models — Qwen 3.6, Gemma 4, DeepSeek, GPT-OSS, and more — locally and exposes a drop-in OpenAI API at localhost:8000/v1.

How fast is Rapid-MLX?

Rapid-MLX reaches up to 261 tokens per second of aggregate throughput with a 0.08-second time to first token (cached) on Apple Silicon, using continuous batching, prompt caching, and speculative decoding.

Does Rapid-MLX run on Intel Macs?

No. Rapid-MLX is built on Apple's MLX framework and requires an Apple Silicon Mac (M1 or newer) running macOS 14 or later. Intel Macs are not supported.

Is Rapid-MLX free and open source?

Yes. Rapid-MLX is free and open source under the Apache 2.0 license. The source is on GitHub.

Which models does Rapid-MLX support?

Rapid-MLX supports 191 models across 15 families, including our Tier-1 families Qwen 3.6, Gemma 4, DeepSeek, and GPT-OSS — covering text, vision, and audio. Audio ships 26 aliases — 13 TTS (Kokoro, Chatterbox, VibeVoice, VoxCPM, Dia) and 13 STT (Whisper, Parakeet) — via /v1/audio/speech and /v1/audio/transcriptions. Install with pip install 'rapid-mlx[audio]'.

Is Rapid-MLX OpenAI-compatible?

Yes. Rapid-MLX exposes a drop-in OpenAI-compatible API, so tools like Cursor, Claude Code, Aider, Continue, LangChain, and any OpenAI client work unchanged by pointing them at localhost:8000/v1.

How do I install Rapid-MLX?

One curl command: curl -fsSL https://rapidmlx.com/install.sh | bash. The script probes for Python 3.10+ (auto-installs python-build-standalone if missing), creates a venv at ~/.rapid-mlx, and symlinks rapid-mlx into ~/.local/bin. No sudo required. Prefer Homebrew? It's in homebrew/core — just brew install rapid-mlx (no tap, no trust). Then run rapid-mlx serve <model> to start a local OpenAI-compatible server.

Run LLMs locally
on your Mac. Fast.

Performance

Which model fits your Mac?

Quickstart

Install

Serve a model

Point any OpenAI client at it