The Optimal Architecture for Small Language Models

Publish Date: Jan 2, 2026

Summary: The Optimal Architecture for Small Language Models

Introduction

In our previous article, we found that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is the optimal dataset mix for training GPT-2, achieving 38.50% average accuracy. But that used the standard 12-layer architecture.

What if we could do even better by changing the model itself?

We ran 19 experiments to find out:

7 GPT-2 variants with radically different depth-width ratios (4→64 layers)
12 architecture families including LLaMA3, Gemma3, Qwen3, MoE, diffusion models, and novel hybrids
Here's what surprised us:

| Finding | Why It Matters |
| --- | --- |
| Models cluster into exactly two performance tiers | ~38% vs ~32%, with almost nothing in between |
| Hidden dimension ≥512 is a hard threshold | Below it, even 64 layers can't compensate |
| 32 layers beats 12 layers | 38.50% vs 38.15% with comparable parameter budgets |
| All 12 architectures perform within ~2% | LLaMA3, Qwen3, GPT-2 are all nearly identical at 70M |
| Diffusion models are 3.8x faster | 183 tok/s vs 48 tok/s with parallel token generation |
| Diffusion models have the best factuality | 49.27% TruthfulQA, the highest of any architecture |
| AR→Diffusion conversion needs only 100M tokens | 10x more efficient than training from scratch |
The result: Dhara-70M, a diffusion model that sacrifices 1.33% accuracy for 3.8x throughput and superior factuality.

The Problem: What's the Optimal Architecture for Small Models?
Our previous work established that 50% FinePDFs + 30% DCLM + 20% FineWeb-Edu is optimal for training small models. With that dataset recipe fixed, we asked: Does model architecture matter as much as data composition?

The standard GPT-2 uses 12 layers with 768 hidden dimensions. But this was designed in 2019 for ~124M parameters. For a 70M model trained on 1B tokens, is this still optimal? And what about newer architectures like LLaMA, Gemma, MoE, or even diffusion language models?
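
A back-of-the-envelope parameter count shows how depth and width trade off under a fixed budget (a rough sketch assuming tied embeddings and the usual ~12·d² weights per transformer block; exact counts vary with vocabulary size and head layout):

```python
VOCAB = 50257  # GPT-2 BPE vocabulary size

def approx_params(layers: int, hidden: int, vocab: int = VOCAB) -> int:
    """Rough transformer parameter count: tied token embeddings plus
    roughly 12*hidden^2 per layer (4h^2 attention + 8h^2 MLP)."""
    return vocab * hidden + layers * 12 * hidden * hidden

print(approx_params(12, 768))  # standard GPT-2: ~124M
print(approx_params(32, 384))  # 32-layer, hidden 384: ~76M
```

Halving the hidden size cuts per-layer cost by 4x, so depth can roughly quadruple at the same budget; that is the kind of trade explored in the depth-width sweep below.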

Discovery #1: The Two-Tier Performance Pattern
Our first finding was completely unexpected. We expected a smooth trade-off curve—more layers for less width, or vice versa. Instead, we found a hard binary split:

Figure: two-tier split in average score across depth-width configurations

| Configuration | Average Score | Tier | Gap from High |
| --- | --- | --- | --- |
| 4L Ultra-Wide | 31.98% | Low | -6.52% |
| 12L Wide | 38.15% | High | — |
| 16L Intermediate | 32.61% | Low | -5.89% |
| 24L Medium | 31.79% | Low | -6.71% |
| 32L Goldilocks | 38.50% | High | — |
| 48L Deep | 32.45% | Low | -6.05% |
| 64L Deep-Narrow | 38.21% | High | — |
The gap between tiers is substantial: 6+ percentage points separating them, while variance within each tier is only ~0.5%.

This bimodal distribution is notable: configurations either achieve the high tier (38%) or fall to the low tier (32%), with no intermediate performance levels observed.

Discovery #2: The Hidden Dimension Threshold
Why do some configurations succeed while others fail? We identified the critical factor: hidden_size >= 512.

Figure: average score vs. hidden dimension

| Config | Hidden | Score | Explanation |
| --- | --- | --- | --- |
| 12L | 512 | 38.15% | Meets threshold |
| 16L | 448 | 32.61% | Below threshold, depth doesn't compensate |
| 24L | 384 | 31.79% | Below threshold, depth doesn't compensate |
| 32L | 384 | 38.50% | Below threshold, but OPTIMAL depth compensates |
| 48L | 320 | 32.45% | Below threshold, suboptimal depth |
| 64L | 256 | 38.21% | Below threshold, but EXTREME depth compensates |
The rule emerges: Models need either:

hidden_size >= 512, OR
Exactly 32 layers (the "Goldilocks" depth), OR
Extremely deep (64+ layers) to compensate
The 16L, 24L, and 48L configurations fall into a dead zone: their hidden dimensions are too narrow, and their depths miss the sweet spots that could compensate.
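
As a toy encoding of this empirical rule, fitted to the six configurations with listed hidden sizes (descriptive of this sweep only, not a general law):

```python
def predicted_high_tier(layers: int, hidden: int) -> bool:
    """Tier rule from the depth-width sweep: wide enough,
    exactly the Goldilocks depth, or extremely deep."""
    return hidden >= 512 or layers == 32 or layers >= 64

# Reproduces the observed high/low split from the table above
sweep = {(12, 512): True, (16, 448): False, (24, 384): False,
         (32, 384): True, (48, 320): False, (64, 256): True}
for (layers, hidden), is_high in sweep.items():
    assert predicted_high_tier(layers, hidden) == is_high
```
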

Discovery #3: 32 Layers is the Goldilocks Depth
With hidden=384, the 32-layer configuration achieves the best overall score (38.50%), slightly beating even the standard 12-layer design.

Figure: per-benchmark performance vs. depth

| Benchmark | 12L Wide | 32L Goldilocks | Difference |
| --- | --- | --- | --- |
| MMLU | 24.11% | 25.77% | +1.66% |
| HellaSwag | 27.03% | 26.46% | -0.57% |
| ARC-Challenge | 21.67% | 22.27% | +0.60% |
| PIQA | 57.29% | 58.05% | +0.76% |
| WinoGrande | 51.46% | 52.64% | +1.18% |
| TruthfulQA | 47.31% | 45.83% | -1.48% |
| GSM8K | 0.99% | 1.21% | +0.22% |
| Average (excl. GSM8K) | 38.15% | 38.50% | +0.35% |
The 32-layer model wins on 5 out of 7 benchmarks, with particular strengths in:

WinoGrande (+1.18%): Better pronoun resolution suggests deeper compositional reasoning
MMLU (+1.66%): More layers help with academic knowledge retention
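
One caveat when reading these numbers: the ~38% averages in the depth-width sweep are computed over the six benchmarks excluding GSM8K, whereas the per-model averages in the architecture comparison below include all seven tasks (which is why the same 32L model appears there as 33.18%). A quick check:

```python
# 12L Wide and 32L Goldilocks per-benchmark scores, excluding GSM8K
scores_12l = [24.11, 27.03, 21.67, 57.29, 51.46, 47.31]
scores_32l = [25.77, 26.46, 22.27, 58.05, 52.64, 45.83]

print(sum(scores_12l) / len(scores_12l))  # ~38.15
print(sum(scores_32l) / len(scores_32l))  # ~38.50
```
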
Part 2: Architecture Family Comparison
Armed with the optimal 32-layer depth, we compared 12 different architecture families:

Architectures Tested

| Architecture | Type | Parameters | Special Features |
| --- | --- | --- | --- |
| GPT-2 | Classic Transformer | 76.48M | Learned positional embeddings, LayerNorm |
| LLaMA3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Qwen3 | Modern Transformer | 71.25M | RoPE, RMSNorm, GQA, SiLU |
| Gemma3 | Modern Transformer | 71.27M | Sliding window attention (1024), logit capping |
| LFM2 | Hybrid Conv+Attn | ~80M | Conv-Conv-Attn pattern |
| dLLM | Diffusion LM | 71.25M | Bidirectional, masked diffusion (MDLM) |
| MoE | Mixture of Experts | 327M (67M active) | 16 experts, 2 active per token |
| Titans-MAC | Memory-Augmented | 67.76M | Neural memory modules at layers [0, 7, 14, 21] |
| dLLM-Recursive | Diffusion LM | 76.11M | Recursive refinement module |
| LLaMA3-Canon | LLaMA3 + Canon | 71.34M | Depthwise causal convolutions |
| dLLM-Canon | Diffusion + Canon | 76.05M | Canon layers + bidirectional diffusion |
| Dhara | AR→Diffusion (WSD) | 71.34M | WSD-converted from LLaMA3-Canon |
Complete Benchmark Results

| Model | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg (all 7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-2 (32L) | 26.46 | 58.05 | 52.64 | 22.27 | 25.77 | 45.83 | 1.21 | 33.18 |
| LLaMA3 | 27.17 | 59.47 | 50.99 | 23.21 | 26.16 | 43.82 | 0.00 | 32.97 |
| Qwen3 | 26.85 | 59.41 | 50.91 | 18.26 | 26.62 | 44.35 | 0.15 | 32.36 |
| Gemma3 | 26.90 | 59.74 | 51.54 | 17.15 | 26.19 | 44.20 | 1.59 | 32.47 |
| LFM2 | 26.27 | 56.96 | 50.12 | 17.83 | 25.95 | 47.40 | 0.61 | 32.16 |
| LLaMA3-Canon | 26.72 | 58.81 | 51.46 | 22.27 | 26.79 | 44.82 | 1.67 | 33.22 |
| MoE | 27.30 | 59.74 | 50.20 | 19.62 | 25.69 | 47.51 | 1.06 | 33.02 |
| Titans-MAC | 26.18 | 57.02 | 48.78 | 17.24 | 25.67 | 46.26 | 1.36 | 31.79 |
| dLLM | 25.55 | 49.67 | 51.07 | 21.16 | 23.96 | 47.08 | 0.00 | 31.21 |
| dLLM-Recursive | 24.74 | 50.44 | 51.46 | 22.27 | 24.04 | 47.68 | 0.23 | 31.55 |
| dLLM-Canon | 24.67 | 50.16 | 51.46 | 22.70 | 24.02 | 49.27 | 0.38 | 31.81 |
| Dhara | 25.58 | 51.58 | 49.64 | 24.83 | 23.85 | 47.50 | 0.00 | 31.85 |
Discovery #4: Architecture Choice Has Minimal Impact at 70M Scale
Surprisingly, all 12 architecture families achieve similar benchmark accuracy:

AR models: 32-33% average
Diffusion models: 31-32% average
The differences are within noise at this scale. Modern architectural improvements (RMSNorm, RoPE, GQA) are designed for 7B+ models and don't provide measurable benefits at 70M parameters.

Winner: LLaMA3-Canon (33.22%) slightly edges out GPT-2 (33.18%), but the difference is not statistically significant.
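
For context on what those "modern" components actually change, here is a minimal NumPy sketch contrasting GPT-2's LayerNorm with the RMSNorm used by LLaMA3 and Qwen3 (illustrative only, not the training code used in these experiments):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # GPT-2 style: center by the mean, scale by the standard deviation
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-5):
    # LLaMA/Qwen style: skip mean-centering, normalize by RMS alone
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps) * gain

x = np.random.randn(4, 384)
y = rms_norm(x, gain=np.ones(384))  # rows now have unit RMS
```

RMSNorm drops the mean-subtraction and bias term, which saves a little compute and is commonly credited with stabilizing training at 7B+ scale; as the table shows, it buys nothing measurable at 70M.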

Discovery #5: dLLMs Trade Accuracy for 3.8x Throughput
The real differentiation comes from inference characteristics, not accuracy:

Figure: throughput vs. accuracy across architectures

| Model | Throughput | Accuracy | Memory | TTFT (time to first token) |
| --- | --- | --- | --- | --- |
| LLaMA3 | 50 tok/s | 32.97% | 0.15 GB | 24 ms |
| GPT-2 (32L) | 48 tok/s | 33.18% | 0.15 GB | ~25 ms |
| MoE | 49 tok/s | 33.02% | 0.62 GB | 51 ms |
| dLLM | 289 tok/s | 31.21% | 0.31 GB | 34 ms |
| Dhara | 183 tok/s | 31.85% | 0.24 GB | 35 ms |
The trade-off is clear:

-1.33% accuracy (31.85% vs 33.18% average)
3.8x throughput (183 vs 48 tok/s)
1.6x memory (0.24 GB vs 0.15 GB, from bidirectional attention overhead)
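
The throughput gap comes down to forward-pass counts: an AR decoder runs one pass per generated token, while a masked-diffusion decoder commits several positions per denoising step. A toy confidence-based unmasking loop makes the mechanism concrete (random numbers stand in for a real model; `fake_denoiser` is a hypothetical placeholder, not Dhara's actual sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, VOCAB = -1, 50257

def fake_denoiser(tokens):
    # Placeholder: a real model returns per-position confidences and tokens
    return rng.random(tokens.shape), rng.integers(0, VOCAB, tokens.shape)

def diffusion_decode(seq_len=64, steps=8):
    tokens = np.full(seq_len, MASK)
    per_step = seq_len // steps             # positions committed per pass
    for _ in range(steps):                  # 8 forward passes total...
        conf, preds = fake_denoiser(tokens)
        conf[tokens != MASK] = -np.inf      # keep already-decoded positions
        top = np.argsort(conf)[-per_step:]  # most confident masked slots
        tokens[top] = preds[top]
    return tokens

out = diffusion_decode()  # ...vs 64 passes for AR decoding of 64 tokens
```
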
Discovery #6: dLLMs Excel at Factuality
One of our most surprising findings: diffusion models achieve the highest TruthfulQA scores among all architectures tested.

| Model | TruthfulQA | Rank |
| --- | --- | --- |
| dLLM-Canon | 49.27% | #1 |
| dLLM-Recursive | 47.68% | #2 |
| MoE | 47.51% | #3 |
| Dhara | 47.50% | #4 |
| LFM2 | 47.40% | #5 |
| dLLM | 47.08% | #6 |
| GPT-2 (32L) | 45.83% | #7 |
Figure: per-task score breakdown by architecture

Why might dLLMs excel at factuality? We hypothesize three contributing factors:

