MANUAL · 11

Models & tokens.

The AI DJ can run on a small model on your own hardware or a large hosted one. This page tells you which models actually hold up — measured, not guessed — and which settings match the station to whichever you pick.

THE ROOT CHOICE

Which model writes the show.

Every word the DJ speaks and every track it picks comes from one language model, chosen under Admin → LLM. The default is Ollama on your own hardware (no API key, no per-token bill), but you can point the station at a hosted provider (Anthropic, OpenAI, Google, OpenRouter and others) instead. Switching reroutes every call immediately, with no redeploy.

“On your own hardware” isn’t only Ollama — there are three local paths, all keyless and all private to your box:

Ollama — the default. One install, pull a model, done. Ollama’s cloud models (the :cloud tags) also ride this path: same setup, the heavy lifting happens on their hardware.
locca — a first-class, one-command local model server (locca serve <model>) built on llama.cpp. No key, a sensible host default, and the onboarding wizard can detect it for you. locca on GitHub ↗
OpenAI-compatible — any self-hosted server that speaks the OpenAI API (llama.cpp, vLLM, LM Studio); you supply its URL.

One thing to internalise before choosing: the provider is part of the choice. The same model can behave differently through different routes, because each provider translates tools and structured output its own way — a model that fails through one route can be flawless through another. When you evaluate a model, evaluate it through the provider you’ll actually run.

MEASURED, NOT GUESSED

Which models hold up.

SUB/WAVE ships a benchmark that drives every kind of call the DJ makes — track picks, talk segments, listener requests, scripts, banter, programme plans — against any model, in both picker modes, and scores the output against the station’s own rules. The table below is the running record — it grows as more models and providers get benched (station configured lean unless noted):

Model	Verdict	Overall	Picks · pool	Picks · agent	Segments · pool	Segments · agent	Requests	Scripts	Shows	Pick p50	Benched
gemma4:31b-cloudOllama	agent-capable	99%	100%	100%	100%	100%	100%	100%	96%	1.2s	2026-07-10
gemini-3.1-flash-litegoogle	agent-capable	99%	100%	100%	100%	100%	100%	100%	96%	0.7s	2026-07-10
gemini-3.5-flashgoogle	agent-capable	98%	100%	100%	100%	100%	93%	100%	96%	1.2s	2026-07-10
openai/gpt-4o-miniOpenRouter	agent-capable	97%	92%	100%	100%	100%	100%	100%	92%	1.3s	2026-07-10
openai/gpt-5-miniOpenRouter	agent-capable	96%	75%	100%	100%	100%	93%	100%	100%	1.8s	2026-07-10
google/gemma-4-26b-a4b-itopenai-compatible	agent-capable	96%	75%	100%	100%	100%	100%	100%	96%	3.6s	2026-07-10
google/gemma-4-26b-a4b-itOpenRouter	pool-mode	95%	75%	100%	100%	100%	100%	100%	92%	4.6s	2026-07-10
qwen/qwen3.5-9bOpenRouter	pool-mode	90%	75%	50%	92%	100%	93%	100%	92%	3.8s	2026-07-10
deepseek-v4-flashdeepseek	pool-mode	89%	100%	0%	100%	100%	93%	95%	88%	1.5s	2026-07-10
nemotron-3-super:cloudOllama cloud	agent-capable	86%	75%	100%	100%	83%	100%	95%	67%	1.8s	2026-07-10
gemma-4-12b-it-Q4_K_Mlocca (llama.cpp)	pool-mode	86%	75%	—	—	—	100%	100%	75%	16.1s	2026-07-09
qwen/qwen3.5-9bopenai-compatible	prefer native route	84%	75%	50%	100%	100%	87%	90%	79%	1.2s	2026-07-10
anthropic/claude-haiku-4.5OpenRouter	agent-capable	83%	100%	100%	100%	100%	100%	100%	33%	2.3s	2026-07-10
glm-5.2:cloudOllama cloud	pool-mode	83%	92%	83%	100%	100%	80%	100%	54%	5.9s	2026-07-10
kimi-k2.6:cloudOllama cloud	avoid	80%	100%	0%	100%	100%	93%	100%	50%	2.1s	2026-07-10
deepseek-v4-flash:cloudOllama cloud	pool-mode	78%	42%	17%	100%	100%	87%	95%	75%	3.6s	2026-07-10
minimax-m2.7:cloudOllama cloud	avoid (re-test)	63%	8%	17%	100%	83%	80%	86%	46%	18s	2026-07-10

Pass rate per call family (share of bench runs producing valid, rule-clean output; bar length = pass rate). “—” means that family wasn’t benched for the model — the local 12B ran the pool/structured kinds only, on a CPU host, so its latency is excluded. Pick p50 is the median pool-pick round trip as served that day. All rows measured with reasoning off. Last updated 2026-07-10.

gemma4:31b-cloud — Best on record. Only weakness: 3-hour programme plans (picks unoffered feature kinds).
gemini-3.1-flash-lite — The value pick: matches the 31B-class overall, agent 18/18, and the fastest picks benched (0.7 s median).
gemini-3.5-flash — Best score on record; agent cells 18/18 at 1–2 s. Single miss: one "coming up next" slip.
openai/gpt-4o-mini — Zero structural failures; misses are editorial (variety traps, unoffered feature kinds).
openai/gpt-5-mini — Repeats artists under shortlist pressure; sometimes serves the wrong track on exact-title requests.
google/gemma-4-26b-a4b-it — Routing A/B vs native openrouter (91/96 same day): identical score through the openai-compatible transport pointed at openrouter.ai/api/v1 — the body-injection path is transport-clean for cloud backends with non-thinking models.
google/gemma-4-26b-a4b-it — Structurally clean in pool mode but ~38 s median picks as currently served — too slow to prefer over Qwen.
qwen/qwen3.5-9b — The small floor: fast and flawless on pool cells. Wordy request replies; needs thinking disabled (handled automatically).
deepseek-v4-flash — First direct-provider row (early testing was 0/4 direct vs 4/4 via openrouter). Pool cells clean; agent picks 0/6 — emits hallucinated ids instead of exploring. Segments/requests/scripts solid.
nemotron-3-super:cloud — First bench. Agent picks 6/6 and fast (2-4s); score dragged by editorial misses — stage-direction asterisks x3 (TTS-audible), same-artist x3 — and 6 thrown. Fine pick engine, watch the script quality.
gemma-4-12b-it-Q4_K_M — Structured/pool kinds only (×2) on a CPU host — latency excluded. Same family habits as the 31B.
qwen/qwen3.5-9b — Routing A/B, after the aggregator no-think fix: 57 -> 81/96 through the same endpoint (reasoning:{enabled:false} now injected alongside the llama.cpp knobs; leaks 16 -> 5, thrown 22 -> 3). Native openrouter provider still edges it (86/96, zero leaks) via better response-side reasoning handling — use openai-compatible for self-hosted servers, native providers for aggregators.
anthropic/claude-haiku-4.5 — Dependable tool discipline; runs wordy (over-length lines are most of its misses). Early runs measured 15/96 against a suppression bug on our side, not the model.
glm-5.2:cloud — Re-benched on the AI SDK 7 reasoning param (think:false wire-verified; 1 residual server-side think in 96 calls vs 24 pre-fix). Agent cells solid (11/12). Drop vs 07-09 is programme verbosity — plan introNote >240ch (0/6) + over-length exchange lines — model-side drift, not routing.
kimi-k2.6:cloud — Re-benched on AI SDK 7 branch: 77/96 (was 79) — same profile, agent picks 0/6 hallucinated ids, verbose scripts. Verdict unchanged.
deepseek-v4-flash:cloud — Re-benched on AI SDK 7 branch: 75/96 (was 82). think:false honored on a raw probe but 7 cells still leaked thinking on complex prompts — intermittent relay/model-side, same class as the minimax regression, milder. Plus 3 hallucinated ids. Direct deepseek provider (85/96) is the better route for this model now.
minimax-m2.7:cloud — Collapsed 90->86->60: Ollama cloud now IGNORES think:false for this family (raw-API verified: 594ch of thinking on a one-word ask) - thinks on every call, 27 leak cells, plans run 105-178s. Server-side, not routing; re-test after a relay/model update.

Reading it for a recommendation: pick an agent-capable row if you want the full conversational picker — Gemini 3.5 Flash leads the table outright, its Flash-Lite sibling matches the 31B class at the fastest picks benched, and Gemma 4 31B on Ollama cloud remains the best keyless option. Pick any healthy pool-mode row for a lean station (Qwen3.5 9B is the small floor, and a local Gemma 4 12B — locca serve gemma4 — does the same job keylessly on your own box). Remember the route in the second line of each model cell is part of the result — the same model through a different provider can score differently, and two of the scores above changed by 20+ points once bugs in our own thinking-suppression plumbing were found and fixed. The bench checks that too, now.

Two patterns worth knowing whatever you run: the Gemma family at every size shares the same habits (it can repeat an artist when the shortlist pressures it to, and it fumbles feature choices on three-hour programme plans), and the multi-hour programme plan is the hardest single call in the system — the only one that dented every model tested. If a show misbehaves, suspect the plan before the model.

Running from a clone? You can put any candidate model through the same battery before trusting it on air: npm run llm-bench in controller/ benchmarks it across every call kind and prints a comparison table. The DJ Doctor’s LLM checks cover the everyday health of whatever you’ve picked.

RUNNING LEAN

For small models & saving tokens.

If you’re on a modest local model, or paying per token and want the bill low, these are the dials to turn down. None of them take the DJ off the air. They just make it do less work per moment.

Picker agent off (Admin → LLM) — the big one. With it off, the station runs everything in one-call-per-moment style: track picks come from a short pre-built shortlist, and the talk segments (weather, news, curiosities) fetch their data first and make a single call too, instead of running a tool-using agent. Far fewer tokens, and a task shape small models get right — this is the setting that makes the 9B–12B class reliable.
Reasoning off (Admin → LLM) — stops “thinking” models from writing a long internal monologue before they answer. The DJ writes short scripts that don’t need it, and an unbounded thinking step can balloon a call from one second to minutes — or eat the whole reply. Off is the safe default; the station knows how to genuinely switch thinking off per provider, including the model families that ignore the polite version of the request.
Pause when empty on (Admin → LLM) — when nobody is listening, the DJ stops picking, talking and writing IDs entirely; the stream coasts on the fallback playlist and the DJ wakes up the moment someone tunes in. This one is a pure saving: there’s no quality cost, since there’s no one there to hear it.
Concise scripts (Admin → Personas) — each persona’s script length runs from one-liner through concise and extended to storyteller. Concise keeps spoken breaks to a line or two; the longer stops double or triple them.
Quiet frequency (Admin → Personas) — a persona’s frequency sets how often it talks, IDs the station and reads the time and weather. Quiet makes all of that rarer, so there are simply fewer AI calls per hour.
Sound FX off (Admin → Sound FX) — with the effects library disabled, the DJ is no longer shown the catalogue of stingers when it plans a segment, which trims that prompt.

With the lean profile in place, Qwen3.5 9B or a local Gemma 4 12B runs the whole station comfortably — picks, requests, talk breaks, even programme shows — while paying nothing per token.

RUNNING RICH

For large, capable models.

On a capable model the same dials go the other way: spend the capability on a station with more personality and a smarter DJ.

Picker agent on (Admin → LLM) — the full conversational DJ: it remembers the session, reasons about what it has already played, and uses tools to dig through the library. Richer and more coherent — but it’s a genuinely harder job, and the bench is blunt about who can do it: Gemma 4 31B (Ollama cloud), MiniMax M2.7 and hosted models of GPT-5-Mini’s class run it reliably; the 9B–12B locals do not. You don’t need to guess — turn it on and watch the booth log; every agent miss falls back to the pool picker anyway.
Reasoning on (Admin → LLM) — let a thinking model work through its choice before answering. Worth trying only on a model built for it and a generous token budget; the picker and other structured calls suppress thinking regardless, so this mainly buys more considered scripts.
Extended scripts (Admin → Personas) — a storytelling DJ that lingers, with longer links between tracks.
Aggressive frequency (Admin → Personas) — a busy station: frequent IDs, time checks and weather updates.

THE DJ NEVER GOES SILENT

The picker agent has a built-in safety net: if it ever fails or runs too slow, the station quietly falls back to the simple pool picker for that track, the same path you’d get with the agent switched off. Turning it off just makes that lighter path the default rather than the exception.

A SECOND, SMALLER MODEL

How the DJ knows each track’s mood.

The DJ picks partly by mood — mellow mornings, brighter afternoons, a wind-down late at night. To know each track’s mood it leans on the library tagger, which uses a second, much smaller embedding model — not the chat model that writes the show.

Rather than ask the chat model about every track (slow and expensive on a big library), the tagger embeds each track once, has the chat model tag a small, representative seed set, then propagates moods and energy out to everything else by similarity. That’s roughly ten times fewer model calls than tagging track by track.

By default the embedding model follows your LLM provider, so there’s usually nothing extra to set up — an Ollama-local station gets nomic-embed-text for free. Two things are worth knowing if you stray from that:

Anthropic has no embedding model — if your DJ runs on Claude, point embeddings at Ollama or OpenAI instead.
Some providers do chat only — the deepseek and Vercel AI gateway providers have no embeddings endpoint at all. A DJ on one of those works fine, but the tagger can’t follow it, so the console only lists embedding-capable providers in the tagger dropdown (Ollama, OpenAI, Google, OpenRouter, locca, OpenAI-compatible). If you don’t see your chat provider there, that’s why — pick Ollama (local and free) for the embedding step and leave the DJ where it is.
Provider vs. model — mind the difference on a router. “DeepSeek” is a provider (no embeddings), but it’s also a model you can run through OpenRouter. Those aren’t the same: pick the OpenRouter provider with a DeepSeek chat model and your DJ speaks via DeepSeek while embeddings go through OpenRouter’s own embeddings endpoint — by default openai/text-embedding-3-small. OpenRouter, Requesty and the like carry everything (chat and embeddings); the bare provider named after a chat-only company does not.
locca and OpenAI-compatible need a dedicated embedding server — one llama.cpp process can’t serve chat and embeddings at once. With locca that’s a second command, locca embed, on its own port; the console can detect it for you.
Which one should I pick? Any embedding model at 768 dimensions or more is fine for mood similarity — favour a fast, cheap one over a big “best-in-class” model. Good baselines: nomic-embed-text (local, free, 768-d) if you run Ollama, or text-embedding-3-small (cloud, cheap, 1536-d) otherwise. The exact model matters far less than picking one and sticking with it — see the next note.

One catch worth internalising: the vector index is built at your embedding model’s dimension, so changing the embedding model means re-embedding the whole library (Admin → Library tagger → Re-scan → “Re-embed all tracks”). Changing the chat model never needs this — but if embeddings are set to “follow the LLM,” switching your DJ provider quietly changes the embedding model too. The console pins embeddings to your library’s model and warns you before that happens, so the safe move is to pin an embedding provider once and leave it.

It all lives under Admin → Library tagger, and you can see the tagged library laid out in Library Observatory.

WHERE TO SET THEM

All of this lives in the console.

Every setting here is in the admin console and takes effect without a redeploy; most apply to the next thing the DJ does. The full tour of the console is in Admin & Settings; how the DJ actually picks and talks is in How the DJ Works.