LLM Generalist Benchmark

DeepSeek-V4-Pro · GLM-5.2 · DeepSeek-V4-Flash · Kimi-K2.7-Code · MiMo-V2.5-Pro · MiMo-V2.5 · MiniMax-M3
2026-06-21 · 3 Categories · 15 Tests · 105 One-Shot Queries · 7 Models
👑

DeepSeek-V4-Pro

72 / 75
Most consistently excellent across all three categories
5 — Excellent
4 — Good
3 — Minor issues
2 — Problems
1 — Failed

Grand Total All 3 Categories

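| Rank | Model | Provider | Generalist | Creative | Intelligence | Total /75 |
|---|---|---|---|---|---|---|
| 1 | DeepSeek-V4-Pro | ollama-cloud | 23 | 24 | 25 | 72 |
| 2 | GLM-5.2 | ollama-cloud | 22 | 23 | 25 | 70 |
| 3 | DeepSeek-V4-Flash | ollama-cloud | 21 | 24 | 24 | 69 |
| 4 | Kimi-K2.7-Code | ollama-cloud | 24 | 21 | 23 | 68 |
| 5 | MiMo-V2.5-Pro | Nous Portal | 23 | 20 | 24 | 67 |
| 6 | MiMo-V2.5 | Nous Portal | 20 | 19 | 24 | 63 |
| 7 | MiniMax-M3 | ollama-cloud | 14 | 18 | 23 | 55 |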

Category Overview

Generalist Core

Math reasoning · Code debugging · Format following · Technical knowledge · Code generation
Kimi-K2.7-Code 🆕: 24
DeepSeek-V4-Pro: 23
MiMo-V2.5-Pro 🆕: 23
GLM-5.2: 22
DeepSeek-V4-Flash: 21
MiMo-V2.5 🆕: 20
MiniMax-M3: 14

Creative & Strategic

Multi-step planning · Summarization · Creative writing · Decision-making · Code refactoring
DeepSeek-V4-Pro: 24
DeepSeek-V4-Flash: 24
GLM-5.2: 23
Kimi-K2.7-Code 🆕: 21
MiMo-V2.5-Pro 🆕: 20
MiMo-V2.5 🆕: 19
MiniMax-M3: 18

Intelligence Test

Logic puzzle · Lateral thinking · Water jugs · Pattern recognition · 100 prisoners problem
DeepSeek-V4-Pro: 25
GLM-5.2: 25
DeepSeek-V4-Flash: 24
MiMo-V2.5-Pro 🆕: 24
MiMo-V2.5 🆕: 24
Kimi-K2.7-Code 🆕: 23
MiniMax-M3: 23

Detailed Scores Per Test

Generalist Core

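| Test | GLM-5.2 | MiniMax-M3 | DS-V4-Pro | DS-V4-Flash | Kimi-K2.7 🆕 | MiMo-V2.5-Pro 🆕 | MiMo-V2.5 🆕 |
|---|---|---|---|---|---|---|---|
| Reasoning (Math optimization) | 5 | 4 | 4 | 4 | 5 | 5 | 5 |
| Code Debug (Bug + 3 sentences) | 3 | 2 | 5 | 4 | 4 | 3 | 4 |
| Format Follow (Exactly 3 bullets) | 5 | 4 | 5 | 4 | 5 | 5 | 5 |
| Knowledge (CAP theorem <150w) | 5 | 3 | 5 | 4 | 5 | 5 | 4 |
| Code Gen (chunk_list, ONLY code) | 4 | 1 | 4 | 5 | 5 | 5 | 2 |
| TOTAL | 22 | 14 | 23 | 21 | 24 | 23 | 20 |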

Creative & Strategic

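| Test | GLM-5.2 | MiniMax-M3 | DS-V4-Pro | DS-V4-Flash | Kimi-K2.7 🆕 | MiMo-V2.5-Pro 🆕 | MiMo-V2.5 🆕 |
|---|---|---|---|---|---|---|---|
| Planning (6-step migration) | 5 | 4 | 5 | 5 | 5 | 4 | 4 |
| Summarization (2 sentences) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Creative (5-7-5 haiku ONLY) | 3 | 2 | 4 | 4 | 1 | 3 | 2 |
| Judgment (DB pick <100w) | 5 | 4 | 5 | 5 | 5 | 4 | 4 |
| Refactor (Code + 2 sentences) | 5 | 3 | 5 | 5 | 5 | 4 | 4 |
| TOTAL | 23 | 18 | 24 | 24 | 21 | 20 | 19 |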

Intelligence Test

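| Test | GLM-5.2 | MiniMax-M3 | DS-V4-Pro | DS-V4-Flash | Kimi-K2.7 🆕 | MiMo-V2.5-Pro 🆕 | MiMo-V2.5 🆕 |
|---|---|---|---|---|---|---|---|
| Logic Puzzle (5-constraint seating) | 5 | 4 | 5 | 4 | 3 | 4 | 4 |
| Lateral Thinking (Elevator riddle) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Water Jugs (8L/5L/3L → 4L) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Pattern Sequence (Pronic numbers) | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Hard Problem (100 prisoners) | 5 | 4 | 5 | 5 | 5 | 5 | 5 |
| TOTAL | 25 | 23 | 25 | 24 | 23 | 24 | 24 |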
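
The category totals and the grand total follow directly from the per-test scores above. A minimal Python sketch that re-derives them is shown here; the data layout and the function names (`category_totals`, `grand_totals`, `ranking`) are illustrative assumptions, not part of the benchmark harness.

```python
# Column order used in the detailed score tables.
MODELS = ["GLM-5.2", "MiniMax-M3", "DeepSeek-V4-Pro", "DeepSeek-V4-Flash",
          "Kimi-K2.7-Code", "MiMo-V2.5-Pro", "MiMo-V2.5"]

# Per-test scores (1-5), one row per test, transcribed from the tables above.
SCORES = {
    "Generalist": [
        [5, 4, 4, 4, 5, 5, 5],  # Reasoning
        [3, 2, 5, 4, 4, 3, 4],  # Code Debug
        [5, 4, 5, 4, 5, 5, 5],  # Format Follow
        [5, 3, 5, 4, 5, 5, 4],  # Knowledge
        [4, 1, 4, 5, 5, 5, 2],  # Code Gen
    ],
    "Creative": [
        [5, 4, 5, 5, 5, 4, 4],  # Planning
        [5, 5, 5, 5, 5, 5, 5],  # Summarization
        [3, 2, 4, 4, 1, 3, 2],  # Creative
        [5, 4, 5, 5, 5, 4, 4],  # Judgment
        [5, 3, 5, 5, 5, 4, 4],  # Refactor
    ],
    "Intelligence": [
        [5, 4, 5, 4, 3, 4, 4],  # Logic Puzzle
        [5, 5, 5, 5, 5, 5, 5],  # Lateral Thinking
        [5, 5, 5, 5, 5, 5, 5],  # Water Jugs
        [5, 5, 5, 5, 5, 5, 5],  # Pattern Sequence
        [5, 4, 5, 5, 5, 5, 5],  # Hard Problem
    ],
}


def category_totals(rows):
    """Sum each model's column across the test rows of one category."""
    return [sum(column) for column in zip(*rows)]


def grand_totals(scores):
    """Return {model: total out of 75} summed over all categories."""
    totals = dict.fromkeys(MODELS, 0)
    for rows in scores.values():
        for model, subtotal in zip(MODELS, category_totals(rows)):
            totals[model] += subtotal
    return totals


def ranking(scores):
    """Return (model, total) pairs sorted by grand total, highest first."""
    return sorted(grand_totals(scores).items(), key=lambda item: item[1], reverse=True)
```

Calling `ranking(SCORES)` reproduces the order of the Grand Total table, which is a quick way to catch transcription errors when a new model column is added.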

Model Profiles

DeepSeek-V4-Pro

🏆 Best Overall — 72/75
  • Best constraint following across all categories
  • Perfect 25/25 on intelligence test
  • No hallucinations or fabricated content
  • Best "Heisenbug" haiku (creative + technical)
  • Slightly verbose on math reasoning

GLM-5.2

🧠 Best Knowledge & Judgment — 70/75
  • Perfect 25/25 on intelligence test
  • Best knowledge density and precision
  • Best judgment answer (zero leakage)
  • Self-corrected arithmetic mid-reasoning
  • Failed 3-sentence constraint in Generalist
  • Heavy reasoning leakage on creative task

DeepSeek-V4-Flash

⚡ Best Value — 69/75
  • Tied for best code generation in Generalist (5/5)
  • Tied for best in Creative (24/25)
  • Most defensive code (isinstance checks)
  • Nearly matches Pro at lower cost
  • 503 lines on logic puzzle (very verbose)

Kimi-K2.7-Code 🆕

💻 Best Generalist Score — 68/75
  • HIGHEST Generalist score: 24/25 (beat Pro!)
  • Perfect 5s on reasoning, format, knowledge, code gen
  • Excellent hard problem explanation (cites the ≈31.2% success rate)
  • Best code_debug answer (exactly 3 sentences), scored 4/5 for minor reasoning leakage
  • Creative: 84 lines of monologue, NO haiku output (1/5)
  • Logic puzzle: correct but "[No content]" label (3/5)
  • Reasoning tokens consume entire response on some tests

MiMo-V2.5-Pro 🆕

🔬 Strong Newcomer — 67/75
  • Perfect 5s on reasoning, format, knowledge, code gen
  • Strong intelligence (24/25) — all correct
  • Excellent hard problem with concrete example
  • Cites Brewer/Gilbert-Lynch + PACELC
  • Code debug had garbled reasoning text
  • Creative: heavy leakage, 3 lines not 4
  • Refactor used `from typing import Hashable` (import violation)
  • Reliability issues during testing (timeouts, empty responses)

MiMo-V2.5 🆕

📊 Solid Mid-Tier — 63/75
  • Strong intelligence (24/25), matching MiMo-V2.5-Pro
  • Perfect format following (5/5)
  • Correct on all reasoning tests
  • Good judgment (mentions TimescaleDB)
  • Code gen used `from typing import` (import violation → 2/5)
  • Knowledge over 150 words (4/5)
  • Logic puzzle: 43KB / 923 lines of repetitive reasoning
  • Reliability issues — many tests needed re-runs

MiniMax-M3

✍️ Writer Only — 55/75
  • Good summarization (5/5)
  • Solid reasoning (23/25 on the Intelligence Test)
  • Hallucinated a non-existent bug (RuntimeError)
  • Syntax error in code gen (missing bracket)
  • 212 lines of reasoning leakage on haiku
  • Failed sentence/word constraints repeatedly

Key Findings

1. DeepSeek-V4-Pro remains the overall winner

72/75 across all categories. Most consistent constraint-following, cleanest output, no hallucinations. The model you can trust to follow instructions precisely.

2. 🆕 Kimi-K2.7-Code scored HIGHEST on Generalist Core (24/25)

Beat DeepSeek-V4-Pro by 1 point. Perfect 5s on reasoning, format following, knowledge, and code generation. The only deduction was code_debug (4/5) for minor reasoning leakage. However, it dropped to 4th overall because of a catastrophic creative test (1/5: 84 lines of internal monologue, no actual haiku) and broken logic puzzle output (3/5: correct reasoning, but the output was labeled "[No content, only reasoning]").

3. 🆕 MiMo-V2.5-Pro is a strong newcomer (67/75)

Perfect 5s on reasoning, format following, knowledge, code gen, summarization, and all intelligence tests except logic puzzle. Main deductions: code debug had garbled reasoning text, creative had heavy leakage, refactor used an import (violation), and judgment was slightly verbose. Had reliability issues during testing — many tests timed out or returned empty on first run.

4. 🆕 MiMo-V2.5 is solid but weaker than MiMo-V2.5-Pro (63/75)

Code gen used `from typing import List, TypeVar` which violates the "no imports" constraint (2/5). Knowledge was over 150 words (4/5). Logic puzzle produced 43KB / 923 lines of repetitive reasoning. But intelligence was strong (24/25) — all answers correct.
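
Violations like the import and the word limit described here can be checked mechanically. The sketch below shows one possible way to do that in Python; the function names and the exact rules (for example, counting whitespace-separated words) are assumptions for illustration and not the grader actually used in this benchmark.

```python
import ast


def uses_imports(source: str) -> bool:
    """Return True if the Python source contains any import statement.

    Note: ast.parse raises SyntaxError on invalid code, which a grader
    would want to treat as a separate failure.
    """
    tree = ast.parse(source)
    return any(isinstance(node, (ast.Import, ast.ImportFrom))
               for node in ast.walk(tree))


def within_word_limit(text: str, limit: int) -> bool:
    """Return True if the whitespace-separated word count is below limit."""
    return len(text.split()) < limit


def bullet_count(text: str) -> int:
    """Count lines that begin with a common bullet marker."""
    markers = ("- ", "* ", "• ")
    return sum(1 for line in text.splitlines()
               if line.lstrip().startswith(markers))


# Hypothetical model output for the chunk_list task (not taken from the test logs).
sample = (
    "from typing import List, TypeVar\n"
    "\n"
    "def chunk_list(items, size):\n"
    "    return [items[i:i + size] for i in range(0, len(items), size)]\n"
)
violates_no_imports = uses_imports(sample)
```

Parsing with `ast` rather than searching for the string "import" avoids false positives from comments or string literals.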

5. GLM-5.2 and DeepSeek-V4-Flash remain strong picks

GLM-5.2 (70/75) has the best knowledge density and judgment. Flash (69/75) is the best value: it nearly matches Pro at lower cost. They scored 25/25 and 24/25 on intelligence, respectively.

6. MiniMax-M3 should stay on writer/creative only

55/75. Hallucinated a bug, produced code with syntax errors, leaked 212 lines on a haiku. Solid reasoning (23/25 on intelligence) but unreliable for technical work.

7. All seven models are genuinely intelligent

Every model solved the water jug puzzle, lateral thinking riddle, pattern sequence, and 100 prisoners problem correctly. The differentiator is output format quality, not reasoning ability.
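
For reference, the 31.2% figure quoted in the Kimi profile matches the known success probability of the cycle-following strategy for the classic 100 prisoners problem. The sketch below computes it exactly; it assumes the test used the standard formulation (100 prisoners, each allowed to open 50 of 100 drawers), which the report does not spell out, and the function name is illustrative.

```python
from fractions import Fraction


def loop_strategy_success(n_prisoners: int = 100) -> Fraction:
    """Exact success probability of the cycle-following strategy.

    The strategy fails exactly when the random permutation of slips has a
    cycle longer than n/2. A cycle of length k > n/2 occurs with
    probability 1/k, and at most one such cycle can exist, so the failure
    probability is the sum of 1/k for k from n/2 + 1 to n.
    """
    half = n_prisoners // 2
    failure = sum(Fraction(1, k) for k in range(half + 1, n_prisoners + 1))
    return 1 - failure


success_100 = float(loop_strategy_success(100))
print(f"{success_100:.1%}")
```

Using `Fraction` keeps the harmonic-sum arithmetic exact until the final conversion to a float.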

8. Reasoning leakage is systemic across all models

None of the 7 models fully respected "ONLY the haiku": all leaked chain-of-thought, and none scored above 4/5 on that test. Kimi was worst (84 lines, no haiku at all). This appears to be characteristic of reasoning-mode models rather than a quirk of any single one.

9. 🆕 MiMo models had significant reliability issues via Nous Portal

Many tests timed out or returned empty/error responses on first run. Required re-runs with longer timeouts (120s) and sequential execution to avoid rate limiting. This is a practical operational concern for production use.
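
The re-run procedure described here (one request at a time, a 120s timeout, retry on empty or failed responses) could look roughly like the following. This is a sketch under stated assumptions: `call_model` is a hypothetical client wrapper supplied by the caller, and the retry count and backoff are illustrative values, not the settings used in the benchmark.

```python
import time
from typing import Callable, Optional


def run_sequentially(
    prompts: list[str],
    call_model: Callable[[str, float], Optional[str]],
    timeout_s: float = 120.0,
    max_attempts: int = 3,
    pause_s: float = 1.0,
) -> dict[str, Optional[str]]:
    """Run prompts one at a time, retrying timeouts, errors, and empty replies.

    call_model(prompt, timeout_s) is a hypothetical wrapper around the
    provider API; it is expected to raise on timeout or HTTP errors and
    to return the response text otherwise.
    """
    results: dict[str, Optional[str]] = {}
    for prompt in prompts:  # strictly sequential to stay under rate limits
        results[prompt] = None
        for attempt in range(1, max_attempts + 1):
            try:
                reply = call_model(prompt, timeout_s)
            except Exception:
                reply = None  # treat any client error like an empty response
            if reply and reply.strip():
                results[prompt] = reply
                break
            if attempt < max_attempts:
                time.sleep(pause_s * attempt)  # linear backoff before retrying
    return results
```

Prompts that still have no response after `max_attempts` map to `None`, so failed tests stay visible instead of being silently dropped.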

Recommended Routing

| Role | Model | Rationale |
|---|---|---|
| Default generalist | DeepSeek-V4-Pro or GLM-5.2 | Pro most consistent; GLM best for knowledge-heavy |
| Code-heavy tasks | Kimi-K2.7-Code 🆕 or DeepSeek-V4-Flash | Kimi best Generalist score (24/25); Flash best value |
| Orchestrator | DeepSeek-V4-Flash | Best value, high quality, fast |
| Knowledge / explanation | GLM-5.2 | Best density and precision, verdict-first instinct |
| Researcher (alt) | MiMo-V2.5-Pro 🆕 | Strong knowledge + intelligence (67/75), good citations |
| Writer / creative | MiniMax-M3 | Keep for prose only; reasoning leakage less harmful |
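
The routing recommendations above can be expressed as a small lookup table. The sketch below is one possible encoding in Python; the role keys, the `pick_model` helper, and the fallback to the default generalist are assumptions for illustration, not an existing router configuration.

```python
# Role -> ordered model preferences, transcribed from the routing table.
# The role keys are hypothetical identifiers chosen for this sketch.
ROUTING = {
    "default_generalist": ["DeepSeek-V4-Pro", "GLM-5.2"],
    "code_heavy": ["Kimi-K2.7-Code", "DeepSeek-V4-Flash"],
    "orchestrator": ["DeepSeek-V4-Flash"],
    "knowledge": ["GLM-5.2"],
    "researcher_alt": ["MiMo-V2.5-Pro"],
    "writer": ["MiniMax-M3"],
}


def pick_model(role: str, unavailable: frozenset = frozenset()) -> str:
    """Return the first preferred model for a role that is still available.

    Unknown roles, and roles whose listed models are all unavailable, fall
    back to the default generalist list (a design assumption of this
    sketch). Raises LookupError if no candidate remains.
    """
    candidates = ROUTING.get(role, []) + ROUTING["default_generalist"]
    for candidate in candidates:
        if candidate not in unavailable:
            return candidate
    raise LookupError(f"no available model for role {role!r}")
```

For example, `pick_model("code_heavy", frozenset({"Kimi-K2.7-Code"}))` returns `"DeepSeek-V4-Flash"`. Passing an `unavailable` set is one way to route around the Nous Portal reliability issues noted in the findings.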