LLM Generalist Benchmark

DeepSeek-V4-Pro · GLM-5.2 · DeepSeek-V4-Flash · Kimi-K2.7-Code · MiMo-V2.5-Pro · MiMo-V2.5 · MiniMax-M3

2026-06-21 · 3 Categories · 15 Tests · 105 One-Shot Queries · 7 Models

👑 DeepSeek-V4-Pro: 72 / 75

Most consistently excellent across all three categories

Scoring scale (per test):

5: Excellent
4: Good
3: Minor issues
2: Problems
1: Failed

Grand Total: All 3 Categories

Rank	Model	Provider	Generalist	Creative	Intelligence	Total /75
1	DeepSeek-V4-Pro	ollama-cloud	23	24	25	72
2	GLM-5.2	ollama-cloud	22	23	25	70
3	DeepSeek-V4-Flash	ollama-cloud	21	24	24	69
4	Kimi-K2.7-Code	ollama-cloud	24	21	23	68
5	MiMo-V2.5-Pro	Nous Portal	23	20	24	67
6	MiMo-V2.5	Nous Portal	20	19	24	63
7	MiniMax-M3	ollama-cloud	14	18	23	55
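The totals in the table above are plain sums of the three category scores, and the 75-point ceiling follows from 3 categories × 5 tests × 5 points. A minimal sketch that reproduces the ranking from those numbers (the names `SCORES` and `rank_models` are illustrative, not part of the benchmark harness):

```python
# Per-category scores copied from the table: (Generalist, Creative, Intelligence).
SCORES = {
    "DeepSeek-V4-Pro": (23, 24, 25),
    "GLM-5.2": (22, 23, 25),
    "DeepSeek-V4-Flash": (21, 24, 24),
    "Kimi-K2.7-Code": (24, 21, 23),
    "MiMo-V2.5-Pro": (23, 20, 24),
    "MiMo-V2.5": (20, 19, 24),
    "MiniMax-M3": (14, 18, 23),
}

CATEGORIES = 3
TESTS_PER_CATEGORY = 5
MAX_SCORE_PER_TEST = 5


def rank_models(scores):
    """Return (model, total) pairs sorted by total, highest first."""
    totals = {model: sum(parts) for model, parts in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(SCORES)
max_total = CATEGORIES * TESTS_PER_CATEGORY * MAX_SCORE_PER_TEST  # 75
queries = CATEGORIES * TESTS_PER_CATEGORY * len(SCORES)           # 105

for rank, (model, total) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {total} / {max_total}")
```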

Category Overview

Generalist Core

Math reasoning · Code debugging · Format following · Technical knowledge · Code generation

Kimi-K2.7-Code 🆕: 24 / 25
DeepSeek-V4-Pro: 23 / 25
MiMo-V2.5-Pro 🆕: 23 / 25
GLM-5.2: 22 / 25
DeepSeek-V4-Flash: 21 / 25
MiMo-V2.5 🆕: 20 / 25
MiniMax-M3: 14 / 25

Creative & Strategic

Multi-step planning · Summarization · Creative writing · Decision-making · Code refactoring

DeepSeek-V4-Pro: 24 / 25
DeepSeek-V4-Flash: 24 / 25
GLM-5.2: 23 / 25
Kimi-K2.7-Code 🆕: 21 / 25
MiMo-V2.5-Pro 🆕: 20 / 25
MiMo-V2.5 🆕: 19 / 25
MiniMax-M3: 18 / 25

Intelligence Test

Logic puzzle · Lateral thinking · Water jugs · Pattern recognition · 100 prisoners problem

DeepSeek-V4-Pro: 25 / 25
GLM-5.2: 25 / 25
DeepSeek-V4-Flash: 24 / 25
MiMo-V2.5-Pro 🆕: 24 / 25
MiMo-V2.5 🆕: 24 / 25
Kimi-K2.7-Code 🆕: 23 / 25
MiniMax-M3: 23 / 25

Detailed Scores Per Test

Generalist Core

Test	GLM-5.2	MiniMax-M3	DS-V4-Pro	DS-V4-Flash	Kimi-K2.7 🆕	MiMo-V2.5-Pro 🆕	MiMo-V2.5 🆕
Reasoning: math optimization	5	4	4	4	5	5	5
Code Debug: bug + 3 sentences	3	2	5	4	4	3	4
Format Follow: exactly 3 bullets	5	4	5	4	5	5	5
Knowledge: CAP theorem, <150 words	5	3	5	4	5	5	4
Code Gen: chunk_list, ONLY code	4	1	4	5	5	5	2
TOTAL	22	14	23	21	24	23	20
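The exact Code Gen prompt is not reproduced in this report, so the following is only a sketch of the kind of answer the `chunk_list` test seems to ask for. The signature `chunk_list(items, size)` and the error behavior are assumptions. It uses no imports, in line with the constraint mentioned later in the findings, and includes an `isinstance` check of the defensive kind credited to DeepSeek-V4-Flash.

```python
def chunk_list(items, size):
    """Split `items` into consecutive chunks of at most `size` elements."""
    # Defensive check; bool is excluded because it is a subclass of int.
    if not isinstance(size, int) or isinstance(size, bool) or size <= 0:
        raise ValueError("size must be a positive integer")
    # Slicing past the end of a list is safe, so the last chunk may be shorter.
    return [items[i:i + size] for i in range(0, len(items), size)]


print(chunk_list([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```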

Creative & Strategic

Test	GLM-5.2	MiniMax-M3	DS-V4-Pro	DS-V4-Flash	Kimi-K2.7 🆕	MiMo-V2.5-Pro 🆕	MiMo-V2.5 🆕
Planning: 6-step migration	5	4	5	5	5	4	4
Summarization: 2 sentences	5	5	5	5	5	5	5
Creative: 5-7-5 haiku ONLY	3	2	4	4	1	3	2
Judgment: DB pick, <100 words	5	4	5	5	5	4	4
Refactor: code + 2 sentences	5	3	5	5	5	4	4
TOTAL	23	18	24	24	21	20	19

Intelligence Test

Test	GLM-5.2	MiniMax-M3	DS-V4-Pro	DS-V4-Flash	Kimi-K2.7 🆕	MiMo-V2.5-Pro 🆕	MiMo-V2.5 🆕
Logic Puzzle: 5-constraint seating	5	4	5	4	3	4	4
Lateral Thinking: elevator riddle	5	5	5	5	5	5	5
Water Jugs: 8L/5L/3L → 4L	5	5	5	5	5	5	5
Pattern Sequence: pronic numbers	5	5	5	5	5	5	5
Hard Problem: 100 prisoners	5	4	5	5	5	5	5
TOTAL	25	23	25	24	23	24	24
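For reference, the water jugs test can be checked mechanically. The sketch below assumes the classic formulation (the 8L jug starts full, the other two empty, and the only legal move is pouring one jug into another); the report does not spell out the exact prompt. It runs a breadth-first search over jug states, so the path it returns is a shortest one.

```python
from collections import deque

CAPACITIES = (8, 5, 3)   # jug sizes in liters
START = (8, 0, 0)        # assumption: the 8L jug starts full, the others empty
TARGET = 4               # liters that must end up in any one jug


def pour(state, src, dst):
    """Pour from jug `src` into jug `dst` until src is empty or dst is full."""
    amount = min(state[src], CAPACITIES[dst] - state[dst])
    new_state = list(state)
    new_state[src] -= amount
    new_state[dst] += amount
    return tuple(new_state)


def solve(start=START, target=TARGET):
    """Breadth-first search; returns a shortest list of states, or None."""
    queue = deque([start])
    parent = {start: None}
    while queue:
        state = queue.popleft()
        if target in state:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for src in range(3):
            for dst in range(3):
                if src == dst:
                    continue
                nxt = pour(state, src, dst)
                if nxt not in parent:  # also skips no-op pours
                    parent[nxt] = state
                    queue.append(nxt)
    return None


for step in solve():
    print(step)
```

Under these assumptions the search reaches a state with 4 liters in one jug after six pours.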

Model Profiles

DeepSeek-V4-Pro

🏆 Best Overall — 72/75

Best constraint following across all categories
Perfect 25/25 on intelligence test
No hallucinations or fabricated content
Best "Heisenbug" haiku (creative + technical)

Slightly verbose on math reasoning

GLM-5.2

🧠 Best Knowledge & Judgment — 70/75

Perfect 25/25 on intelligence test
Best knowledge density and precision
Best judgment answer (zero leakage)
Self-corrected arithmetic mid-reasoning

Failed 3-sentence constraint in Generalist
Heavy reasoning leakage on creative task

DeepSeek-V4-Flash

⚡ Best Value — 69/75

Top-scoring code generation in Generalist (5/5)
Tied for best in Creative (24/25)
Most defensive code (isinstance checks)
Nearly matches DeepSeek-V4-Pro at lower cost

503 lines on logic puzzle (verbose, though MiMo-V2.5 produced 923)

Kimi-K2.7-Code 🆕

💻 Best Generalist Score — 68/75

Highest Generalist score: 24/25 (beat DeepSeek-V4-Pro by 1 point)
Perfect 5s on reasoning, format, knowledge, code gen
Excellent hard problem explanation (cites the ≈31.2% success rate)
Code debug answer used exactly 3 sentences (4/5)

Creative: 84 lines of monologue, NO haiku output (1/5)
Logic puzzle: correct but "[No content]" label (3/5)
Reasoning tokens consume entire response on some tests
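The 31.2% cited in the profile above is consistent with the well-known cycle-following strategy for the 100 prisoners problem, assuming the standard setup (100 prisoners, 100 drawers, 50 openings each). The strategy succeeds exactly when a random permutation has no cycle longer than 50, which gives a success probability of 1 − (1/51 + 1/52 + … + 1/100). A small sketch of that arithmetic (the function name is illustrative):

```python
from fractions import Fraction


def loop_strategy_success(n_prisoners=100):
    """Probability that a random permutation of n has no cycle longer than n/2."""
    half = n_prisoners // 2
    # For k > n/2, a cycle of length exactly k exists with probability 1/k,
    # and at most one such cycle can exist, so the failure cases just add up.
    p_fail = sum(Fraction(1, k) for k in range(half + 1, n_prisoners + 1))
    return 1 - p_fail


p = loop_strategy_success()
print(f"{float(p):.4f}")
```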

MiMo-V2.5-Pro 🆕

🔬 Strong Newcomer — 67/75

Perfect 5s on reasoning, format, knowledge, code gen
Strong intelligence (24/25) — all correct
Excellent hard problem with concrete example
Cites Brewer/Gilbert-Lynch + PACELC

Code debug had garbled reasoning text
Creative: heavy leakage, 3 lines not 4
Refactor used `from typing import Hashable` (import violation)
Reliability issues during testing (timeouts, empty responses)

MiMo-V2.5 🆕

📊 Solid Mid-Tier — 63/75

Strong intelligence (24/25), matching MiMo-V2.5-Pro
Perfect format following (5/5)
Correct on all reasoning tests
Good judgment (mentions TimescaleDB)

Code gen used `from typing import` (import violation → 2/5)
Knowledge over 150 words (4/5)
Logic puzzle: 43KB / 923 lines of repetitive reasoning
Reliability issues — many tests needed re-runs

MiniMax-M3

✍️ Writer Only — 55/75

Good summarization (5/5)
Solid reasoning (23/25 on the Intelligence Test)

Hallucinated a non-existent bug (RuntimeError)
Syntax error in code gen (missing bracket)
212 lines of reasoning leakage on haiku
Failed sentence/word constraints repeatedly

Key Findings

1. DeepSeek-V4-Pro remains the clear overall winner

72/75 across all categories. Most consistent constraint-following, cleanest output, no hallucinations. The model you can trust to follow instructions precisely.

2. 🆕 Kimi-K2.7-Code scored HIGHEST on Generalist Core (24/25)

Beat DeepSeek-V4-Pro by 1 point. Perfect 5s on reasoning, format following, knowledge, and code generation. The only deduction was code_debug (4) for minor reasoning leakage. However, it dropped to 4th overall because of a catastrophic creative test (1/5: 84 lines of internal monologue, no actual haiku) and a broken logic puzzle output (3/5: correct reasoning, but the output was labeled "[No content, only reasoning]").

3. 🆕 MiMo-V2.5-Pro is a strong newcomer (67/75)

Perfect 5s on reasoning, format following, knowledge, code gen, summarization, and all intelligence tests except logic puzzle. Main deductions: code debug had garbled reasoning text, creative had heavy leakage, refactor used an import (violation), and judgment was slightly verbose. Had reliability issues during testing — many tests timed out or returned empty on first run.

4. 🆕 MiMo-V2.5 is solid but weaker than MiMo-V2.5-Pro (63/75)

Code gen used `from typing import List, TypeVar` which violates the "no imports" constraint (2/5). Knowledge was over 150 words (4/5). Logic puzzle produced 43KB / 923 lines of repetitive reasoning. But intelligence was strong (24/25) — all answers correct.
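A "no imports" constraint like the one above can be checked automatically rather than by eye. The sketch below is one possible approach using Python's standard `ast` module; it is not the grader used for this benchmark. The `from typing import List, TypeVar` line comes from the finding above, while the function bodies in the samples are hypothetical stand-ins.

```python
import ast


def import_lines(source):
    """Return the line numbers of all import statements in `source`."""
    tree = ast.parse(source)  # raises SyntaxError if the code does not parse
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, (ast.Import, ast.ImportFrom))
    )


# Hypothetical model outputs for illustration only.
violating = (
    "from typing import List, TypeVar\n"
    "\n"
    "def chunk_list(items, size):\n"
    "    return [items[i:i + size] for i in range(0, len(items), size)]\n"
)
clean = (
    "def chunk_list(items, size):\n"
    "    return [items[i:i + size] for i in range(0, len(items), size)]\n"
)

print(import_lines(violating))  # [1]
print(import_lines(clean))      # []
```

Because `ast.parse` raises `SyntaxError` on unparseable code, the same call would also catch a syntax error such as a missing bracket.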

5. GLM-5.2 and DeepSeek-V4-Flash remain strong picks

GLM-5.2 (70/75) has the best knowledge density and judgment. Flash (69/75) is the best value, nearly matching DeepSeek-V4-Pro at lower cost. They scored 25/25 and 24/25 on intelligence, respectively.

6. MiniMax-M3 should stay on writer/creative tasks only

55/75. Hallucinated a bug, produced code with syntax errors, leaked 212 lines on a haiku. Real intelligence (23/25) but unreliable for technical work.

7. All seven models are genuinely intelligent

Every model solved the water jug puzzle, lateral thinking riddle, pattern sequence, and 100 prisoners problem correctly. The differentiator is output format quality, not reasoning ability.

8. Reasoning leakage is systemic across all models

None of the 7 models fully respected "ONLY the haiku": all leaked chain-of-thought. Kimi was worst (84 lines, no haiku at all). On this evidence, leakage looks like a general tendency of reasoning-mode models rather than a flaw of any single one.
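The structural half of the "ONLY the haiku" constraint is easy to test in code: a compliant response is exactly three non-empty lines and nothing else. The sketch below checks only that shape; it does not count syllables, which would need a pronunciation dictionary or a heuristic. The function names and sample texts are hypothetical.

```python
def is_bare_haiku(response):
    """True if the response is exactly three non-empty lines and nothing else."""
    lines = [line.strip() for line in response.strip().splitlines()]
    return len(lines) == 3 and all(lines)


def extra_lines(response):
    """Number of non-empty lines beyond the three a haiku needs (0 if none)."""
    non_empty = [line for line in response.splitlines() if line.strip()]
    return max(0, len(non_empty) - 3)


# Hypothetical outputs for illustration only.
haiku = "Tests pass on my box\nThe bug hides when I step through\nProd logs say nothing"
leaked = "Let me count the syllables first...\n\n" + haiku

print(is_bare_haiku(haiku), extra_lines(haiku))    # True 0
print(is_bare_haiku(leaked), extra_lines(leaked))  # False 1
```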

9. 🆕 MiMo models had significant reliability issues via Nous Portal

Many tests timed out or returned empty/error responses on first run. Required re-runs with longer timeouts (120s) and sequential execution to avoid rate limiting. This is a practical operational concern for production use.
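The re-run procedure described above (retry with a longer timeout, run tests one at a time) can be sketched as follows. The `call(prompt, timeout)` interface, the initial 60-second timeout, and the helper names are assumptions for illustration; only the 120-second retry and the sequential execution come from the text. `fake_call` stands in for a real API client.

```python
import time


def run_with_retries(call, prompt, timeouts=(60, 120), pause_s=0.0):
    """Try `call(prompt, timeout)` once per timeout until it returns non-empty text."""
    last_error = None
    for timeout in timeouts:
        try:
            text = call(prompt, timeout)
            if text and text.strip():
                return text
            last_error = ValueError("empty response")
        except TimeoutError as exc:
            last_error = exc
        time.sleep(pause_s)  # optional back-off between attempts
    raise RuntimeError(f"all attempts failed: {last_error}")


def run_sequentially(call, prompts, **kwargs):
    """Run prompts one at a time (no concurrency) to stay under rate limits."""
    return [run_with_retries(call, prompt, **kwargs) for prompt in prompts]


def fake_call(prompt, timeout):
    # Stand-in for a real client: fails unless given the longer timeout.
    if timeout < 120:
        raise TimeoutError("simulated timeout")
    return f"answer to {prompt!r}"


print(run_sequentially(fake_call, ["q1", "q2"]))
```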

Recommended Routing

Role	Model	Rationale
Default generalist	DeepSeek-V4-Pro or GLM-5.2	DeepSeek-V4-Pro is most consistent; GLM-5.2 is best for knowledge-heavy tasks
Code-heavy tasks	Kimi-K2.7-Code 🆕 or DeepSeek-V4-Flash	Kimi best Generalist score (24/25); Flash best value
Orchestrator	DeepSeek-V4-Flash	Best value, high quality, fast
Knowledge / explanation	GLM-5.2	Best density and precision, verdict-first instinct
Researcher (alt)	MiMo-V2.5-Pro 🆕	Strong knowledge + intelligence (67/75), good citations
Writer / creative	MiniMax-M3	Keep for prose only — reasoning leakage less harmful
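The routing table above could be encoded as a simple lookup with fallbacks. This is a minimal sketch, not a description of any real router: the role keys, the `pick_model` helper, and the fallback-to-default behavior are all assumptions, while the model names and their order come from the table.

```python
# Role -> models in order of preference, taken from the routing table.
# The role keys themselves are hypothetical identifiers.
ROUTES = {
    "default": ["DeepSeek-V4-Pro", "GLM-5.2"],
    "code": ["Kimi-K2.7-Code", "DeepSeek-V4-Flash"],
    "orchestrator": ["DeepSeek-V4-Flash"],
    "knowledge": ["GLM-5.2"],
    "research": ["MiMo-V2.5-Pro"],
    "writing": ["MiniMax-M3"],
}


def pick_model(role, unavailable=()):
    """Return the first available model for `role`, falling back to the default list."""
    candidates = ROUTES.get(role, []) + ROUTES["default"]
    for model in candidates:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for role {role!r}")


print(pick_model("code"))                                  # Kimi-K2.7-Code
print(pick_model("code", unavailable={"Kimi-K2.7-Code"}))  # DeepSeek-V4-Flash
print(pick_model("research", unavailable={"MiMo-V2.5-Pro"}))  # DeepSeek-V4-Pro
```

Falling back to the default generalist when a specialist is unavailable is one reasonable choice given the reliability issues noted for the Nous Portal models; a stricter router could raise an error instead.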