| Rank | Model | Provider | Generalist | Creative | Intelligence | Total /75 |
|---|---|---|---|---|---|---|
| 1 | DeepSeek-V4-Pro | ollama-cloud | 23 | 24 | 25 | 72 |
| 2 | GLM-5.2 | ollama-cloud | 22 | 23 | 25 | 70 |
| 3 | DeepSeek-V4-Flash | ollama-cloud | 21 | 24 | 24 | 69 |
| 4 | Kimi-K2.7-Code | ollama-cloud | 24 | 21 | 23 | 68 |
| 5 | MiMo-V2.5-Pro | Nous Portal | 23 | 20 | 24 | 67 |
| 6 | MiMo-V2.5 | Nous Portal | 20 | 19 | 24 | 63 |
| 7 | MiniMax-M3 | ollama-cloud | 14 | 18 | 23 | 55 |
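
The Total /75 column is the sum of the three category scores, each out of 25. A minimal sketch that recomputes the totals and the ranking from the numbers in the table above; the `scores` dictionary and the `rank_models` helper are illustrative names, not part of the original test harness:

```python
# Category scores copied from the leaderboard table:
# (Generalist, Creative, Intelligence), each out of 25.
scores = {
    "DeepSeek-V4-Pro": (23, 24, 25),
    "GLM-5.2": (22, 23, 25),
    "DeepSeek-V4-Flash": (21, 24, 24),
    "Kimi-K2.7-Code": (24, 21, 23),
    "MiMo-V2.5-Pro": (23, 20, 24),
    "MiMo-V2.5": (20, 19, 24),
    "MiniMax-M3": (14, 18, 23),
}


def rank_models(category_scores):
    """Return (model, total) pairs sorted by total score, highest first."""
    totals = {model: sum(parts) for model, parts in category_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


leaderboard = rank_models(scores)
for rank, (model, total) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {total}/75")
```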

Generalist tests (each scored out of 5; column totals out of 25):

| Test | GLM-5.2 | MiniMax-M3 | DS-V4-Pro | DS-V4-Flash | Kimi-K2.7 🆕 | MiMo-V2.5-Pro 🆕 | MiMo-V2.5 🆕 |
|---|---|---|---|---|---|---|---|
| Reasoning (math optimization) | 5 | 4 | 4 | 4 | 5 | 5 | 5 |
| Code Debug (bug + 3 sentences) | 3 | 2 | 5 | 4 | 4 | 3 | 4 |
| Format Follow (exactly 3 bullets) | 5 | 4 | 5 | 4 | 5 | 5 | 5 |
| Knowledge (CAP theorem, <150w) | 5 | 3 | 5 | 4 | 5 | 5 | 4 |
| Code Gen (chunk_list, ONLY code) | 4 | 1 | 4 | 5 | 5 | 5 | 2 |
| TOTAL | 22 | 14 | 23 | 21 | 24 | 23 | 20 |

Creative tests (each scored out of 5; column totals out of 25):

| Test | GLM-5.2 | MiniMax-M3 | DS-V4-Pro | DS-V4-Flash | Kimi-K2.7 🆕 | MiMo-V2.5-Pro 🆕 | MiMo-V2.5 🆕 |
|---|---|---|---|---|---|---|---|
| Planning (6-step migration) | 5 | 4 | 5 | 5 | 5 | 4 | 4 |
| Summarization (2 sentences) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Creative (5-7-5 haiku ONLY) | 3 | 2 | 4 | 4 | 1 | 3 | 2 |
| Judgment (DB pick, <100w) | 5 | 4 | 5 | 5 | 5 | 4 | 4 |
| Refactor (code + 2 sentences) | 5 | 3 | 5 | 5 | 5 | 4 | 4 |
| TOTAL | 23 | 18 | 24 | 24 | 21 | 20 | 19 |

Intelligence tests (each scored out of 5; column totals out of 25):

| Test | GLM-5.2 | MiniMax-M3 | DS-V4-Pro | DS-V4-Flash | Kimi-K2.7 🆕 | MiMo-V2.5-Pro 🆕 | MiMo-V2.5 🆕 |
|---|---|---|---|---|---|---|---|
| Logic Puzzle (5-constraint seating) | 5 | 4 | 5 | 4 | 3 | 4 | 4 |
| Lateral Thinking (elevator riddle) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Water Jugs (8L/5L/3L → 4L) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Pattern Sequence (pronic numbers) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Hard Problem (100 prisoners) | 5 | 4 | 5 | 5 | 5 | 5 | 5 |
| TOTAL | 25 | 23 | 25 | 24 | 23 | 24 | 24 |

DeepSeek-V4-Pro scored 72/75 across all categories, with the most consistent constraint-following, the cleanest output, and no hallucinations. It is the model you can trust to follow instructions precisely.

Kimi-K2.7-Code beat DeepSeek-V4-Pro by 1 point on the Generalist tests (24 vs. 23), with perfect 5s on reasoning, format following, knowledge, and code generation. Its only deduction there was on code debug (4/5) for minor reasoning leakage. However, it dropped to 4th overall because of a catastrophic creative test (1/5: 84 lines of internal monologue and no actual haiku) and broken logic puzzle output (3/5: correct reasoning, but the output was labeled "[No content, only reasoning]").

MiMo-V2.5-Pro earned perfect 5s on reasoning, format following, knowledge, code gen, summarization, and all intelligence tests except the logic puzzle. Main deductions: code debug had garbled reasoning text, creative had heavy leakage, refactor used an import (a constraint violation), and judgment was slightly verbose. It also had reliability issues during testing: many tests timed out or returned empty on the first run.

MiMo-V2.5's code gen used `from typing import List, TypeVar`, which violates the "no imports" constraint (2/5). Its knowledge answer ran over 150 words (4/5), and its logic puzzle response contained 43KB / 923 lines of repetitive reasoning. Intelligence was still strong (24/25), with all answers correct.
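
For reference, a minimal sketch of a `chunk_list` that needs no imports. The exact prompt is not reproduced in this report, so the signature and the behavior on a non-positive size are assumptions made for illustration only:

```python
# Hypothetical signature: the report only gives the function name and
# the "no imports" / "ONLY code" constraints.
def chunk_list(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # Slicing handles the final, possibly shorter, chunk automatically.
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Plain slicing covers the typical cases here, so type hints from `typing` are optional and can be dropped to satisfy the constraint.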

GLM-5.2 (70/75) has the best knowledge density and judgment. DeepSeek-V4-Flash (69/75) is the best value, nearly matching Pro at lower cost. They scored 25/25 and 24/25 on intelligence, respectively.

MiniMax-M3 scored 55/75. It hallucinated a bug, produced code with syntax errors, and leaked 212 lines on a haiku. It shows real intelligence (23/25) but is unreliable for technical work.

Every model solved the water jug puzzle, lateral thinking riddle, pattern sequence, and 100 prisoners problem correctly. In this run, the differentiator was output format quality, not reasoning ability.
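
As a reference point for the water jug test, the puzzle can be solved mechanically with a breadth-first search over jug states. This is a sketch under assumptions: the report does not give the exact prompt, so the starting state (8L jug full, others empty) and the goal (any jug holding exactly 4L) are guesses at the classic formulation, and `solve_jugs` is an illustrative name:

```python
from collections import deque

CAPACITIES = (8, 5, 3)  # jug sizes in litres, from the test label


def solve_jugs(start=(8, 0, 0), target=4):
    """Breadth-first search over jug states; returns a shortest list of states."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if target in state:
            # Walk back through parents to rebuild the path.
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        for src in range(3):
            for dst in range(3):
                if src == dst:
                    continue
                # Pour until the source is empty or the destination is full.
                amount = min(state[src], CAPACITIES[dst] - state[dst])
                if amount == 0:
                    continue
                nxt = list(state)
                nxt[src] -= amount
                nxt[dst] += amount
                nxt = tuple(nxt)
                if nxt not in parents:
                    parents[nxt] = state
                    queue.append(nxt)
    return None


for step in solve_jugs():
    print(step)
```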

None of the 7 models fully respected "ONLY the haiku": all leaked chain-of-thought. Kimi was worst (84 lines, no haiku at all). Based on these results, this looks like a general tendency of reasoning-mode models rather than a quirk of any one model.
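
A constraint like "ONLY the haiku" can be checked mechanically before any manual scoring. The sketch below is an assumption about how such a check might look, not the grading method used in this report; `is_bare_haiku` is a hypothetical helper, and it only checks the output shape, not the 5-7-5 syllable count:

```python
def is_bare_haiku(output):
    """True if the output is exactly three non-empty lines and nothing else."""
    lines = output.strip().splitlines()
    return len(lines) == 3 and all(line.strip() for line in lines)


clean = "An old silent pond\nA frog jumps into the pond\nSplash! Silence again"
leaky = "Let me count syllables first.\n" + clean

print(is_bare_haiku(clean))
print(is_bare_haiku(leaky))
```

Any leaked reasoning adds lines, so a response with chain-of-thought around the poem fails the check even when the haiku itself is present.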

Many tests timed out or returned empty or error responses on the first run. They required re-runs with longer timeouts (120s) and sequential execution to avoid rate limiting, which is a practical operational concern for production use.
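
The re-run strategy described above can be sketched as a small sequential runner. Everything here is illustrative: `call_model` is a hypothetical callable taking `(prompt, timeout_s)` and raising `TimeoutError` on timeout, and the attempt count and pause are assumptions. Only the 120s timeout and the sequential execution come from the report:

```python
import time


def run_sequentially(prompts, call_model, timeout_s=120, max_attempts=3, pause_s=0.0):
    """Run prompts one at a time, retrying on timeouts or empty responses."""
    results = {}
    for name, prompt in prompts.items():
        results[name] = None  # stays None if every attempt fails
        for attempt in range(max_attempts):
            try:
                reply = call_model(prompt, timeout_s)
            except TimeoutError:
                reply = None
            if reply and reply.strip():
                results[name] = reply
                break
            if attempt < max_attempts - 1:
                time.sleep(pause_s)  # pause before retrying to ease rate limits
    return results
```

Running one request at a time trades wall-clock speed for fewer rate-limit errors, which matches the trade-off noted above.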

| Role | Model | Rationale |
|---|---|---|
| Default generalist | DeepSeek-V4-Pro or GLM-5.2 | Pro most consistent; GLM best for knowledge-heavy |
| Code-heavy tasks | Kimi-K2.7-Code 🆕 or DeepSeek-V4-Flash | Kimi best Generalist score (24/25); Flash best value |
| Orchestrator | DeepSeek-V4-Flash | Best value, high quality, fast |
| Knowledge / explanation | GLM-5.2 | Best density and precision, verdict-first instinct |
| Researcher (alt) | MiMo-V2.5-Pro 🆕 | Strong knowledge + intelligence (67/75), good citations |
| Writer / creative | MiniMax-M3 | Keep for prose only — reasoning leakage less harmful |