We benchmarked Gemma 4, Qwen 3.5 and our 8B default for on-device meeting summaries. The winner was the smaller model.
Four 4-bit LLMs, four real meeting transcripts (including a 15k-token Dutch one), all running locally on an M4 Pro via Apple MLX. We measured speed, format compliance, language-lock and faithfulness — and the result overturned our default.
MeetMemo writes your meeting summary on your Mac. No transcript ever leaves the device: Parakeet on the Neural Engine turns audio into text, and a quantised language model running on Apple MLX turns that text into a title, a summary, decisions, action items and follow-ups — in the same language the meeting was held in.
Picking that language model is the single biggest quality lever we have. So when Gemma 4 and Qwen 3.5 landed within weeks of each other, we did the obvious thing: we benchmarked them against what we already ship, on real meetings, and let the numbers decide.
The result was not what we expected. The biggest model we tested, our own 8B default, was the slowest and produced the thinnest summaries. We've already shipped the change.
The setup
Everything ran locally on an Apple M4 Pro (64 GB) through MLX (mlx-lm 0.31.3), every model at 4-bit. We fed each model the exact system and user prompt the app ships in production — same language-lock rules, same "output exactly this structure" instruction — so this measures the real task, not a toy.
The four contenders:
| Model | Role | Size (4-bit) |
|---|---|---|
| Qwen 3 8B | our current default | ~4.6 GB |
| Qwen 3 4B Instruct (2507) | a lighter option we already shipped | ~2.3 GB |
| Qwen 3.5 4B | brand new | ~2.5 GB |
| Gemma 4 E4B (QAT) | brand new | ~4.2 GB |
The four transcripts: a real ~36-minute Dutch sales/demo call (~15,000 tokens — the hard one, and our canary for European-language quality), plus three English meetings (a sales discovery call, a hiring screen, an engineering stand-up).
What we measured:
- Speed: generation throughput (tokens/second) and wall-clock per summary.
- Format compliance: did it produce the title + four sections, every time?
- Language-lock (the one that mattered most): are the section headers in the language of the meeting? A Dutch meeting must produce `## Samenvatting`, not `## Summary`.
- Faithfulness & coverage: did it capture the real decisions, names and numbers without inventing any? Scored by hand against the transcript.
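The format and language-lock checks above are mechanical enough to automate. The following is a minimal sketch of how such a check could look, not the harness behind this benchmark. It assumes the title is a level-1 Markdown header and the four sections are level-2 headers. `EXPECTED_HEADERS` and `check_summary` are hypothetical names, and the Dutch section names other than "Samenvatting" are illustrative guesses.

```python
import re

# Assumed section names. "Summary"/"Samenvatting", "Key decisions", "Action items"
# and "Follow-ups" appear in the article; the other Dutch names are illustrative.
EXPECTED_HEADERS = {
    "en": ["Summary", "Key decisions", "Action items", "Follow-ups"],
    "nl": ["Samenvatting", "Belangrijke beslissingen", "Actiepunten", "Vervolgstappen"],
}


def section_headers(markdown: str) -> list[str]:
    """Return the text of every level-2 (##) header, in document order."""
    return [
        m.group(1).strip()
        for m in re.finditer(r"^##[ \t]+(.+)$", markdown, flags=re.MULTILINE)
    ]


def has_title(markdown: str) -> bool:
    """True if a level-1 (#) header exists; assumes the title is rendered that way."""
    return re.search(r"^#[ \t]+\S", markdown, flags=re.MULTILINE) is not None


def check_summary(markdown: str, language: str) -> dict[str, bool]:
    """Check structure (title + four sections) and header language separately."""
    headers = [h.casefold() for h in section_headers(markdown)]
    expected = [h.casefold() for h in EXPECTED_HEADERS[language]]
    return {
        "format_ok": has_title(markdown) and len(headers) == 4,
        "language_lock_ok": headers == expected,
    }
```

Keeping the two results separate matters: a Dutch summary under English headers passes the format check and fails only language-lock, which is the failure mode this benchmark was looking for.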
The headline numbers
| Model | Language-lock | Dutch speed | English speed | Notes |
|---|---|---|---|---|
| Qwen 3 4B Instruct (2507) | 4 / 4 ✅ | 15.4 tok/s | 54.6 tok/s | Best coverage + faithfulness |
| Gemma 4 E4B (QAT) | 2 / 4 | 15.9 tok/s | 37.1 tok/s | Cleanest output, but terse |
| Qwen 3 8B (old default) | 2 / 4 | 5.0 tok/s | 21.9 tok/s | Slowest, shallowest |
| Qwen 3.5 4B | 1 / 4 | 22.1 tok/s | 54.0 tok/s | Fast, but repetition loops |
Every model produced valid Markdown structure on every run (4/4 format). And with thinking mode disabled (more on that below), none of them leaked reasoning into the output. After that, they diverged hard.
The 8B default was the worst summary, and by far the slowest
This was the surprise. On the 15k-token Dutch transcript, Qwen 3 8B ran at 5 tokens/second — it took nearly a minute to produce a summary that the 4B models produced in 20 seconds. Worse, the summary it produced was the thinnest: it marked both Key decisions and Action items as "(none)", dumped everything vaguely into Follow-ups, and missed the concrete pricing (€5,000 setup, €3,000/month) and the October go-live date that were stated plainly in the call. It also wrote the Dutch summary under English headers.
This isn't 8B being a bad model — it's 8B being the wrong tool at 4-bit, on long multilingual context. Aggressive quantisation plus a 15k-token non-English window is exactly where a bigger model's advantage evaporates and its slowness doesn't. Our own code comments had hinted at this ("Phi-4 drifted hard at long context"); the benchmark made it undeniable.
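Generation throughput and wall-clock time are different numbers, because wall-clock also includes the time spent reading a 15k-token prompt before the first output token appears. A small timing wrapper can separate the two. This is an illustrative sketch, not the harness behind the figures above: `stream_tokens` is a hypothetical zero-argument callable that yields output tokens one at a time from whatever runtime is in use, and `clock` is injectable only so the function can be tested deterministically.

```python
import time
from typing import Callable, Iterable


def measure_generation(
    stream_tokens: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> dict[str, float]:
    """Time one summary: total wall-clock plus decode-only tokens/second."""
    start = clock()
    first = last = None
    count = 0
    for _ in stream_tokens():
        last = clock()
        if first is None:
            first = last  # time of the first output token (prompt processing ends here)
        count += 1
    if count == 0:
        return {"tokens": 0, "wall_clock_s": clock() - start, "tok_per_s": 0.0}
    decode_s = last - first
    # Throughput counts only tokens produced after the first one arrived.
    tok_per_s = (count - 1) / decode_s if decode_s > 0 else 0.0
    return {"tokens": count, "wall_clock_s": last - start, "tok_per_s": tok_per_s}
```

With this split, a model can show a respectable tokens/second figure and still have a long wall-clock time on a long transcript, so both are worth reporting.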
The smaller model we already had quietly won
Qwen 3 4B Instruct (2507) got language-lock right on all four meetings (Dutch headers for the Dutch call, English for the English ones) and it produced the richest, most accurate summary: it captured the pricing, the 15-minute ANVA sync interval, the React/PWA build and the October timeline, and attributed action items to the actual people named in the transcript. We spot-checked its most "suspicious" claims (a client name, the prices) against the raw transcript, and all were genuinely there. It runs about 3× faster than the 8B on the Dutch transcript (15.4 vs 5.0 tok/s) and 2.5× faster on the English ones (54.6 vs 21.9 tok/s), and downloads half the bytes (~2.3 GB vs ~4.6 GB).
The new models are good — but came with sharp edges
Qwen 3.5 4B was the fastest and its content was on-topic, but it produced repetition loops (the same follow-up bullet three times) and garbled tokens ("Myinmiddag" for "mijnomgeving"). Promising, not shippable as-is.
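Repetition loops of this kind are easy to flag automatically. Below is a minimal sketch, assuming the summary uses `-` or `*` Markdown bullets. `repeated_bullets` is a hypothetical helper, not part of the app, and it only catches near-verbatim repeats (case and whitespace differences), not paraphrased ones.

```python
import re
from collections import Counter


def repeated_bullets(markdown: str, min_repeats: int = 2) -> dict[str, int]:
    """Return normalised bullet texts that occur at least `min_repeats` times."""
    bullets = []
    for line in markdown.splitlines():
        match = re.match(r"^\s*[-*]\s+(.*\S)\s*$", line)
        if match:
            # Normalise case and collapse whitespace so trivial variants compare equal.
            bullets.append(" ".join(match.group(1).lower().split()))
    counts = Counter(bullets)
    return {text: n for text, n in counts.items() if n >= min_repeats}
```

A non-empty result is a cheap signal to retry the generation or reject the output before it reaches the note.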
Gemma 4 E4B produced the cleanest, most disciplined output of any model — no repetition, no invention — but it was terse, under-extracting decisions and follow-ups. And it had the most operational friction (below).
The most useful finding: your prompt can sabotage a good model
Here's the part worth stealing. Our production prompt, to enforce language-lock, gives examples: Dutch → "## Samenvatting", French → "## Résumé". Helpful, right?
For the new models, those examples were poison. Qwen 3.5 4B and Gemma 4 E4B latched onto the first, most salient example (Dutch) and emitted Dutch headers on English meetings — Qwen 3.5 did it on all three English transcripts.
So we re-ran the whole benchmark with a neutral prompt — no language examples, just "detect the transcript's language and mirror it, headers included." The effect:
| Model | Lang-lock with examples | Lang-lock with neutral prompt |
|---|---|---|
| Gemma 4 E4B | 2 / 4 | 4 / 4 ✅ |
| Qwen 3.5 4B | 1 / 4 | 3 / 4 |
| Qwen 3 8B | 2 / 4 | 3 / 4 |
| Qwen 3 4B Instruct (2507) | 4 / 4 ✅ | 3 / 4 |
The neutral prompt fixed Gemma 4 completely and lifted both Qwen 3.5 4B and Qwen 3 8B, but it cost Qwen 3 4B Instruct (2507) one transcript: that model actually leans on the explicit examples. The lesson generalises: few-shot examples in a system prompt are a strong prior, and a strong prior is only good when it points the right way. Stronger instruction-followers want the rule; weaker ones over-fit the example. Match the prompt to the model.
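One way to act on "match the prompt to the model" is to keep both language-lock rules and select one per model, using the better column from the table above. This is a sketch under stated assumptions: the model identifiers are hypothetical, the wording of the example-based rule is a paraphrase rather than the production prompt, and falling back to the neutral rule for unknown models is a design guess, not a measured result.

```python
# Paraphrased for illustration; not the production prompt text.
LANGUAGE_RULE_WITH_EXAMPLES = (
    "Write every section header in the language of the transcript. "
    'For example: Dutch -> "## Samenvatting", French -> "## Résumé".'
)
LANGUAGE_RULE_NEUTRAL = (
    "Detect the transcript's language and mirror it, headers included."
)

# Hypothetical model identifiers mapped to the variant that scored best above.
PROMPT_VARIANT = {
    "qwen3-4b-instruct-2507": "examples",  # 4/4 with examples, 3/4 neutral
    "gemma-4-e4b-qat": "neutral",          # 2/4 with examples, 4/4 neutral
    "qwen3.5-4b": "neutral",               # 1/4 with examples, 3/4 neutral
    "qwen3-8b": "neutral",                 # 2/4 with examples, 3/4 neutral
}


def language_rule(model_id: str) -> str:
    """Return the language-lock rule to splice into the system prompt."""
    variant = PROMPT_VARIANT.get(model_id, "neutral")  # assumed default
    if variant == "examples":
        return LANGUAGE_RULE_WITH_EXAMPLES
    return LANGUAGE_RULE_NEUTRAL
```

Any new model added to the table should be benchmarked with both variants before it gets an entry, since the right choice was not predictable from model size or age here.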
Operating Gemma 4 on Apple Silicon: three gotchas we hit
Gemma 4 is genuinely impressive at this size, but getting it to run well took knowing three things:
- Use a PLE-safe quant. Gemma 4's new Per-Layer-Embedding / ScaledLinear layers blow up under naïve 4-bit — the scalar multiplications amplify quantisation error catastrophically. Use a QAT (quantisation-aware-trained) or Unsloth-Dynamic build, not a plain post-training 4-bit. We tested the QAT build; quality held.
- Disable thinking explicitly. Out of the box, Gemma 4 E4B emitted a full `<|channel>thought…<channel|>` reasoning block before the answer, and it isn't `<think>`, so a stripper built for Qwen would miss it entirely and dump chain-of-thought into your note. Passing `enable_thinking=False` to the chat template killed it cleanly and made generation faster (no wasted reasoning tokens).
- The download stalls. The default Hugging Face XET transfer hung for 17 hours on the second shard. Setting `HF_HUB_DISABLE_XET=1` fixed it instantly.
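Disabling thinking at the template level is the real fix, but a defensive stripper that knows both delimiter styles is a reasonable safety net. The sketch below is an assumption-laden example: the Gemma-style markers are copied from the description above and should be checked against the model's actual tokenizer output, and `strip_reasoning` is a hypothetical helper. It handles only complete blocks; an unterminated block is left untouched.

```python
import re

# Delimiters as described in the text; verify them against real model output.
_REASONING_BLOCKS = [
    re.compile(r"<think>.*?</think>", re.DOTALL),               # Qwen-style
    re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL),  # Gemma 4 style
]


def strip_reasoning(text: str) -> str:
    """Remove complete reasoning blocks and any leading whitespace left behind."""
    for pattern in _REASONING_BLOCKS:
        text = pattern.sub("", text)
    return text.lstrip()
```

For example, `strip_reasoning("<think>plan</think>\n# Title")` returns `"# Title"`, and text without any reasoning block passes through unchanged.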
What we shipped
We changed MeetMemo's default summarisation model from Qwen 3 8B to Qwen 3 4B Instruct (2507). For our task it is faster, smaller, more faithful, and better at language-lock — and because it was already an integrated option, it ships with zero new model-loading risk. Users who explicitly picked another model keep their choice.
We did not ship Gemma 4 yet, despite it being the cleanest writer and a perfect 4/4 with the right prompt. The blocker is purely plumbing: our Swift MLX path doesn't yet expose the enable_thinking=False chat-template control, and without it Gemma 4 leaks reasoning. That's our next piece of work — and when it lands, Gemma 4 E4B is first in line for an A/B, because a ~4 GB model that summarises a 36-minute meeting in 20 seconds with zero hallucination is exactly the trajectory on-device AI should be on.
Caveats, honestly
This is four models on four transcripts on one machine — directional, not a leaderboard. Faithfulness was scored by hand, not by an automated judge. We benchmarked our production prompt, so results are entangled with our prompt (which is the point — but it means "Gemma 4 is worse at language-lock" really means "Gemma 4 is worse at language-lock with our example-heavy prompt"). The raw outputs and per-run numbers are checked into the repo under docs/benchmarks/ so you can argue with us.
The meta-lesson is the one we keep relearning: bigger isn't automatically better on-device, and newer isn't automatically better for your task. Measure on your own data, with your own prompt. Ours told us to ship the smaller model, and to write a different prompt for every model we'd consider next.
