LLM Translation & Schema-Validation Benchmark — Korean Culinary

EN / KO × Clean / Noisy — 8 scenarios — Jun 13 2026 — Run 2 (Judge Active)

Run 2 — LLM-as-Judge active. The metric inversion bug has been patched and the cultural subtlety judge (anthropic/claude-sonnet-4.6) is now live. Composite scores are absolute: 0.40 × schema + 0.35 × loanword + 0.25 × (cultural/5) — directly comparable across runs, unlike the min-max normalized score that ModelRanker outputs internally. See the Bug Log for the original ranking error.
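As a quick check on the formula above, the following minimal sketch recomputes the absolute composite from the rounded averages listed under Rankings. The function name `composite_score` and the `models` dictionary are illustrative and are not taken from `src/benchmark.py`.

```python
def composite_score(schema: float, loanword: float, cultural: float) -> float:
    """Absolute composite: 0.40*schema + 0.35*loanword + 0.25*(cultural/5)."""
    return 0.40 * schema + 0.35 * loanword + 0.25 * (cultural / 5.0)


# Rounded averages copied from the Rankings section of this report:
# (schema validity, loanword preservation, cultural score on a 1-5 scale).
models = {
    "Claude Sonnet 4.6": (0.9945, 0.8854, 4.88),
    "Gemini 2.5 Pro": (0.9341, 0.8733, 4.25),
    "Qwen3 Max Thinking": (0.9340, 0.8233, 4.25),
}

composites = {name: composite_score(*vals) for name, vals in models.items()}
for name, score in composites.items():
    print(f"{name}: {score:.4f}")
```

Because the inputs are the rounded values shown in the report, the recomputed composites can differ from the published ones in the fourth decimal place; Claude's comes out about 0.0003 higher, presumably because the displayed cultural score is itself rounded.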

8 scenarios

3 models tested

Weights: Schema 40 · Loanword 35 · Cultural 25

Judge active: claude-sonnet-4.6

Rankings

anthropic/claude-sonnet-4.6

Claude Sonnet 4.6

0.9514 composite (absolute)

Schema validity 0.9945

Loanword preservation 0.8854

Cultural score (1–5) 4.88

google/gemini-2.5-pro

Gemini 2.5 Pro

0.8918 composite (absolute)

Schema validity 0.9341

Loanword preservation 0.8733

Cultural score (1–5) 4.25

qwen/qwen3-max-thinking

Qwen3 Max Thinking

0.8743 composite (absolute)

Schema validity 0.9340

Loanword preservation 0.8233

Cultural score (1–5) 4.25

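All three models are strong. The 0.0771 spread between first and third (0.9514 → 0.8743) reflects real differences between the models rather than a failure by any one of them. Claude leads on all three individual metrics. Gemini and Qwen are effectively tied on schema validity (0.9341 vs. 0.9340) and identical on cultural score (4.25), so the 0.0175 gap between their composites comes almost entirely from loanword preservation (0.8733 vs. 0.8233).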

Cross-Model Summary

Metric	Claude Sonnet 4.6	Gemini 2.5 Pro	Qwen3 Max Thinking
Composite (absolute)	0.9514	0.8918	0.8743
Schema validity (avg)	0.9945	0.9341	0.9340
Loanword preservation (avg)	0.8854	0.8733	0.8233
Cultural subtlety (1–5, judge)	4.88	4.25	4.25
Schema — EN scenarios	0.9952	0.9058	0.8797
Schema — KO scenarios	0.9938	0.9624	0.9884
Loanword — EN scenarios	0.7709	0.7465	0.6466
Loanword — KO scenarios	1.0000	1.0000	1.0000
Noise delta (schema clean→noisy)	−0.0111	+0.0271	+0.0324
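The noise delta row can be reproduced from the per-scenario schema scores in this report. The sketch below assumes the delta is the mean schema validity over the four noisy scenarios minus the mean over the four clean ones; that definition is inferred from the reported values, not quoted from the benchmark code, and `noise_delta` is an illustrative name.

```python
from statistics import mean

# Per-scenario schema validity, copied from the per-scenario tables in this report.
schema_scores = {
    "Claude Sonnet 4.6": {
        "en-a-clean": 1.0000, "en-a-noise": 0.9808,
        "en-b-clean": 1.0000, "en-b-noise": 1.0000,
        "ko-a-clean": 1.0000, "ko-a-noise": 0.9750,
        "ko-b-clean": 1.0000, "ko-b-noise": 1.0000,
    },
    "Gemini 2.5 Pro": {
        "en-a-clean": 0.9623, "en-a-noise": 0.9565,
        "en-b-clean": 0.8182, "en-b-noise": 0.8864,
        "ko-a-clean": 0.9750, "ko-a-noise": 0.9722,
        "ko-b-clean": 0.9268, "ko-b-noise": 0.9756,
    },
    "Qwen3 Max Thinking": {
        "en-a-clean": 0.8222, "en-a-noise": 0.8667,
        "en-b-clean": 0.8723, "en-b-noise": 0.9574,
        "ko-a-clean": 0.9767, "ko-a-noise": 0.9767,
        "ko-b-clean": 1.0000, "ko-b-noise": 1.0000,
    },
}


def noise_delta(scores: dict) -> float:
    """Mean schema validity on noisy scenarios minus mean on clean scenarios."""
    clean = mean(v for k, v in scores.items() if k.endswith("-clean"))
    noisy = mean(v for k, v in scores.items() if k.endswith("-noise"))
    return noisy - clean


deltas = {model: noise_delta(scores) for model, scores in schema_scores.items()}
for model, delta in deltas.items():
    print(f"{model}: {delta:+.4f}")
```

A negative delta means the model's schema validity dropped on noisy input; a positive delta means it rose.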

Per-Scenario: Claude Sonnet 4.6

Avg schema validity 0.9945

Avg loanword preservation 0.8854

Schema (EN) 0.9952

Schema (KO) 0.9938

Scenario	Schema	Loanword	loanwords_detected note
en-a-clean	1.0000	0.8155	`세서미 오일` ✓ (valid Konglish form; `참기름` is more common in traditional Korean, but both are correct)
en-a-noise	0.9808	0.7767	`세서미 오일` ✓ (same)
en-b-clean	1.0000	0.7368	`세서미 오일` ✓ (Konglish form; Gemini/Qwen used English "sesame oil")
en-b-noise	1.0000	0.7544	`세서미 오일` ✓ (same)
ko-a-clean	1.0000	1.0000	`프라이팬` ✓
ko-a-noise	0.9750	1.0000	`프라이팬` ✓
ko-b-clean	1.0000	1.0000	`레시피` ✓
ko-b-noise	1.0000	1.0000	`레시피` ✓
Average	0.9945	0.8854	Consistently uses Konglish register: `세서미 오일` on EN, `프라이팬`/`레시피` on KO (all valid forms)

Per-Scenario: Gemini 2.5 Pro

Avg schema validity 0.9341

Avg loanword preservation 0.8733

Schema (EN) 0.9058

Schema (KO) 0.9624

Scenario	Schema	Loanword	loanwords_detected note
en-a-clean	0.9623	0.8447	`sesame oil`: English form, not Konglish
en-a-noise	0.9565	0.7379	English forms; body text rescues loanword score
en-b-clean	0.8182	0.7193	`sesame oil`, `recipe`: English forms
en-b-noise	0.8864	0.6842	English forms; some optional fields null
ko-a-clean	0.9750	1.0000	`프라이팬` ✓
ko-a-noise	0.9722	1.0000	`프라이팬` ✓
ko-b-clean	0.9268	1.0000	`레시피` ✓
ko-b-noise	0.9756	1.0000	`레시피` ✓
Average	0.9341	0.8733	EN loanword in English form; body text rescues score. KO perfect.

Per-Scenario: Qwen3 Max Thinking

Avg schema validity 0.9340

Avg loanword preservation 0.8233

Schema (EN) 0.8797

Schema (KO) 0.9884

Scenario	Schema	Loanword	loanwords_detected note
en-a-clean	0.8222	0.7087	English forms; optional fields sparse
en-a-noise	0.8667	0.6408	English forms; more null fields vs. clean
en-b-clean	0.8723	0.5965	Over-detected: `미역국`, `miyeokguk`, `의미` (native terms, not loanwords)
en-b-noise	0.9574	0.6404	Same over-detection pattern
ko-a-clean	0.9767	1.0000	`프라이팬` in recipe body ✓
ko-a-noise	0.9767	1.0000	`프라이팬` ✓
ko-b-clean	1.0000	1.0000	`레시피` ✓
ko-b-noise	1.0000	1.0000	`레시피` ✓
Average	0.9340	0.8233	Largest EN–KO schema gap (0.109). EN-B over-detection hurts loanword score.

Evaluation Notes

Dimension	Finding
Schema completeness	Claude leads on EN schema (0.9952 vs. 0.9058 Gemini, 0.8797 Qwen). KO schemas are competitive across all models — EN is harder due to more optional fields being left null.
Loanword register	Claude consistently used `세서미 오일` (Konglish phonetic form) across all EN scenarios; Gemini and Qwen used English "sesame oil". Both are valid — `참기름` is the more common traditional Korean term, but `세서미 오일` is an accepted Konglish form in modern and Westernized Korean cooking. The more interesting story is register consistency: Claude committed to Korean-script Konglish form even when the source language was English. The loanword scorer gives equal credit for either form (body-text search); a future sub-score for Korean-script preservation would better capture this distinction.
KO loanword unanimity	All models score 1.000 on KO loanword preservation. Korean-script loanwords (`프라이팬`, `레시피`) are visually distinct and consistently captured.
Noise robustness	Claude is the only model to degrade under noise (Δ −0.011). Gemini (+0.027) and Qwen (+0.032) marginally improve — likely because noisy variants trigger more conservative, complete schema fills rather than creative embellishment of optional fields.
Cultural score (judge active)	Claude 4.88/5, Gemini and Qwen 4.25/5 — all models perform well. Claude's higher score reflects stronger use of `hidden_intent` and more substantive `cultural_notes`. Judge model: `claude-sonnet-4.6` — potential self-favoritism bias; consider a third-party judge for production use.
Qwen language gap	Qwen's EN schema (0.8797) is 0.109 below its KO schema (0.9884), the largest language gap of the three models. A KO-only benchmark would significantly overstate Qwen's multilingual capability.
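The Loanword register note above suggests a future sub-score for Korean-script preservation. One possible shape for it, as a rough sketch only: take the terms a model reports in `loanwords_detected` and measure what fraction contain Hangul. The function name `korean_script_ratio` and the sample lists are hypothetical, and the real scorer may need a different input.

```python
import re
from typing import List, Optional

# Hangul Syllables block, U+AC00 through U+D7A3.
HANGUL = re.compile(r"[\uac00-\ud7a3]")


def korean_script_ratio(detected: List[str]) -> Optional[float]:
    """Fraction of detected loanwords written in Hangul.

    Returns None when nothing was detected, so that "no loanwords"
    is not confused with "no Korean-script loanwords".
    """
    if not detected:
        return None
    in_hangul = sum(1 for term in detected if HANGUL.search(term))
    return in_hangul / len(detected)


# Hypothetical `loanwords_detected` values, modeled on the notes in the tables above.
konglish_style = ["세서미 오일"]
english_style = ["sesame oil", "recipe"]

print(korean_script_ratio(konglish_style))  # every term is in Hangul
print(korean_script_ratio(english_style))   # no term is in Hangul
```

A ratio like this would separate the two registers that the current body-text search scores identically. It says nothing about whether a Hangul term is a loanword or a native word, so it would not catch the over-detection seen in Qwen's EN-B scenarios.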

Bug Log — Metric Inversion in `src/benchmark.py` (Run 1, now patched)

Root cause

The original BenchmarkResult.to_dict() passed raw avg_schema_validity (higher-is-better) directly into the "cer" slot of ModelRanker:

# Bug — code that produced the original results.csv (Run 1)
"cer": self.avg_schema_validity,        # ← no inversion
"wer": self.avg_cultural_score / 5.0,   # ← no inversion

ModelRanker.normalize_metric() at ranking.py:166 normalizes cer with lower_is_better=True:

norm = (v - min_val) / (max_val - min_val)
if lower_is_better:
    norm = 1.0 - norm   # lowest raw value → 1.0 (best rank)

Effect on each model (Run 1)

Schema validity inputs (raw, no inversion):
  Claude  = 0.9886  →  cer_norm = 0.000  ← penalized for highest schema
  Gemini  = 0.9350  →  cer_norm = 0.727
  Qwen    = 0.9149  →  cer_norm = 1.000  ← rewarded for lowest schema

Cultural WER = 0.0 for all (judge stubbed) → wer_norm = 0.5 for all

Loanword (correctly normalized, lower_is_better=False):
  Gemini  = 0.8882  →  loanword_norm = 1.000
  Claude  = 0.8859  →  loanword_norm = 0.957
  Qwen    = 0.8352  →  loanword_norm = 0.000

Buggy composite (0.40·cer_norm + 0.25·wer_norm + 0.35·loanword_norm):
  Gemini  → 0.291 + 0.125 + 0.350 = 0.7658  rank 1
  Qwen    → 0.400 + 0.125 + 0.000 = 0.5250  rank 2
  Claude  → 0.000 + 0.125 + 0.335 = 0.4600  rank 3
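The Run 1 ranking can be reproduced with a standalone re-implementation of the min-max step. This is a sketch, not the code in `ranking.py`: the tie rule (all-equal inputs map to 0.5) is an assumption inferred from `wer_norm = 0.5` above, and the inputs are the rounded Run 1 values from this log.

```python
def normalize_metric(values: dict, lower_is_better: bool) -> dict:
    """Min-max normalize to [0, 1]; invert when lower raw values are better."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumed tie handling: every model gets 0.5.
        return {name: 0.5 for name in values}
    result = {}
    for name, v in values.items():
        norm = (v - lo) / (hi - lo)
        result[name] = 1.0 - norm if lower_is_better else norm
    return result


# Run 1 inputs as listed in the bug log.
schema = {"Claude": 0.9886, "Gemini": 0.9350, "Qwen": 0.9149}
cultural = {"Claude": 0.0, "Gemini": 0.0, "Qwen": 0.0}  # judge stubbed
loanword = {"Claude": 0.8859, "Gemini": 0.8882, "Qwen": 0.8352}

# The bug: raw schema validity (higher is better) is normalized as an error rate.
cer_norm = normalize_metric(schema, lower_is_better=True)
wer_norm = normalize_metric(cultural, lower_is_better=True)
loanword_norm = normalize_metric(loanword, lower_is_better=False)

buggy = {
    name: 0.40 * cer_norm[name] + 0.25 * wer_norm[name] + 0.35 * loanword_norm[name]
    for name in schema
}
ranking = sorted(buggy, key=buggy.get, reverse=True)
print(ranking)
```

With the inversion applied to a higher-is-better input, the model with the best schema validity receives a schema term of zero, which is enough to push it to the bottom of the ranking.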

Fix (current `src/benchmark.py`)

"cer": 1.0 - self.avg_schema_validity,          # lower = more valid ✓
"wer": 1.0 - (self.avg_cultural_score / 5.0),   # lower = higher judge score ✓
"loanword_accuracy": self.avg_loanword_score,    # already higher-is-better ✓

Run 2 (this report) was generated after applying this fix with the judge active.
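A small direction check can guard against this class of bug returning. The sketch below wraps the three patched expressions in a hypothetical helper, `to_ranker_metrics` (not a function in `src/benchmark.py`), and confirms that the model with the best raw scores also has the lowest error-style values. Inputs are the rounded Run 2 averages from this report.

```python
def to_ranker_metrics(avg_schema_validity: float,
                      avg_cultural_score: float,
                      avg_loanword_score: float) -> dict:
    """Convert higher-is-better averages into the slots ModelRanker expects."""
    return {
        "cer": 1.0 - avg_schema_validity,           # lower = more valid
        "wer": 1.0 - (avg_cultural_score / 5.0),    # lower = higher judge score
        "loanword_accuracy": avg_loanword_score,    # already higher-is-better
    }


# Run 2 averages: (schema validity, cultural score, loanword preservation).
run2 = {
    "Claude": to_ranker_metrics(0.9945, 4.88, 0.8854),
    "Gemini": to_ranker_metrics(0.9341, 4.25, 0.8733),
    "Qwen": to_ranker_metrics(0.9340, 4.25, 0.8233),
}

# After the fix, "best" means the minimum for cer and wer.
best_by_cer = min(run2, key=lambda m: run2[m]["cer"])
best_by_wer = min(run2, key=lambda m: run2[m]["wer"])
print(best_by_cer, best_by_wer)
```

A check like this could run as a unit test so that any future metric wired into a lower-is-better slot without inversion fails fast.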

Metrics Reference

Metric	Weight	What it measures
Schema Validity	40%	Fraction of `BilingualRecipe` fields populated including optional fields. Requires ≥1 ingredient and ≥1 step.
Loanword Preservation	35%	Fraction of source Konglish terms found anywhere in the full recipe output. Scores 1.0 when the source has no detectable loanwords.
Cultural Subtlety	25%	LLM-as-Judge 1–5 scale. Judge: `anthropic/claude-sonnet-4.6`. Set `judge_model` in `config.yaml` to change judge model.
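To make the first two definitions concrete, here is a minimal sketch of both scorers. It is an illustration under stated assumptions, not the benchmark's implementation: the `BilingualRecipe` fields other than `hidden_intent`, `cultural_notes` and `loanwords_detected` are invented for the example, a recipe missing ingredients or steps is assumed to score 0.0, and matching is assumed to be a case-insensitive substring search.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class BilingualRecipe:
    # title_en, title_ko, ingredients and steps are hypothetical field names.
    title_en: Optional[str] = None
    title_ko: Optional[str] = None
    ingredients: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    hidden_intent: Optional[str] = None
    cultural_notes: Optional[str] = None
    loanwords_detected: List[str] = field(default_factory=list)


def schema_validity(recipe: BilingualRecipe) -> float:
    """Fraction of fields populated, counting optional fields."""
    if not recipe.ingredients or not recipe.steps:
        return 0.0  # assumed handling of the >=1 ingredient / >=1 step rule
    values = [getattr(recipe, f.name) for f in fields(recipe)]
    populated = sum(1 for v in values if v)  # None, "" and [] count as empty
    return populated / len(values)


def loanword_preservation(source_terms: List[str], output_text: str) -> float:
    """Fraction of source loanwords found anywhere in the output text."""
    if not source_terms:
        return 1.0  # no detectable loanwords in the source
    text = output_text.casefold()
    found = sum(1 for term in source_terms if term.casefold() in text)
    return found / len(source_terms)


recipe = BilingualRecipe(
    title_en="Seaweed soup",
    title_ko="미역국",
    ingredients=["미역", "참기름"],
    steps=["프라이팬을 달군다"],
    cultural_notes="Traditionally served on birthdays.",
)
print(schema_validity(recipe))
print(loanword_preservation(["프라이팬", "레시피"], "프라이팬을 달군다"))
```

In this example five of the seven fields are populated and one of the two source loanwords appears in the output.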

Jun 13 2026 • Korean Culinary Translation Benchmark v3.0.0 • Models: anthropic/claude-sonnet-4.6 · google/gemini-2.5-pro · qwen/qwen3-max-thinking • Raw data: results/predictions/*.csv
