LLM Translation & Schema-Validation Benchmark — Korean Culinary

EN / KO × Clean / Noisy — 8 scenarios — Jun 13 2026 — Run 2 (Judge Active)

Run 2 — LLM-as-Judge active. The metric inversion bug has been patched and the cultural subtlety judge (anthropic/claude-sonnet-4.6) is now live. Composite scores are absolute: 0.40 × schema + 0.35 × loanword + 0.25 × (cultural/5) — directly comparable across runs, unlike the min-max normalized score that ModelRanker outputs internally. See the Bug Log for the original ranking error.
Scenarios: 8
Models tested: 3
Weights (schema · loanword · cultural): 40 · 35 · 25
Judge: active (claude-sonnet-4.6)
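The absolute composite described above is a plain weighted sum, so it can be checked by hand. The following is a minimal sketch, assuming the displayed averages are the inputs; the function and variable names are illustrative and are not taken from src/benchmark.py.

```python
# Weights as stated in the report header (schema, loanword, cultural).
WEIGHTS = {"schema": 0.40, "loanword": 0.35, "cultural": 0.25}


def absolute_composite(schema: float, loanword: float, cultural_1_to_5: float) -> float:
    """Weighted sum on fixed scales, so scores stay comparable across runs."""
    return (
        WEIGHTS["schema"] * schema
        + WEIGHTS["loanword"] * loanword
        + WEIGHTS["cultural"] * (cultural_1_to_5 / 5.0)  # rescale 1-5 judge score to 0-1
    )


# Averages as displayed in the Rankings section (rounded values).
reported = {
    "claude": (0.9945, 0.8854, 4.88),
    "gemini": (0.9341, 0.8733, 4.25),
    "qwen": (0.9340, 0.8233, 4.25),
}

composites = {name: absolute_composite(*vals) for name, vals in reported.items()}
for name, score in composites.items():
    print(f"{name}: {score:.4f}")
```

With these rounded inputs, Gemini and Qwen reproduce the reported 0.8918 and 0.8743. Claude comes out at about 0.9517 rather than 0.9514, which is consistent with the displayed 4.88 being a rounded value: an unrounded 4.875 would give 0.9514, though the report does not state the unrounded figure.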

Rankings

| Rank | Model | Model ID | Composite (absolute) | Schema validity | Loanword preservation | Cultural score (1–5) |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | 0.9514 | 0.9945 | 0.8854 | 4.88 |
| 2 | Gemini 2.5 Pro | google/gemini-2.5-pro | 0.8918 | 0.9341 | 0.8733 | 4.25 |
| 3 | Qwen3 Max Thinking | qwen/qwen3-max-thinking | 0.8743 | 0.9340 | 0.8233 | 4.25 |

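All three models are strong. The 0.0771 spread between first and third (0.9514 → 0.8743) reflects genuine differentiation rather than a failure by any model. Claude leads on all three individual metrics. Gemini and Qwen are effectively tied on schema validity (0.9341 vs. 0.9340) and identical on cultural score (4.25), so the gap between them comes from loanword preservation (0.8733 vs. 0.8233).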

Cross-Model Summary

| Metric | Claude Sonnet 4.6 | Gemini 2.5 Pro | Qwen3 Max Thinking |
|---|---|---|---|
| Composite (absolute) | 0.9514 | 0.8918 | 0.8743 |
| Schema validity (avg) | 0.9945 | 0.9341 | 0.9340 |
| Loanword preservation (avg) | 0.8854 | 0.8733 | 0.8233 |
| Cultural subtlety (1–5, judge) | 4.88 | 4.25 | 4.25 |
| Schema, EN scenarios | 0.9952 | 0.9058 | 0.8797 |
| Schema, KO scenarios | 0.9938 | 0.9624 | 0.9884 |
| Loanword, EN scenarios | 0.7709 | 0.7465 | 0.6466 |
| Loanword, KO scenarios | 1.0000 | 1.0000 | 1.0000 |
| Noise delta (schema clean→noisy) | −0.0111 | +0.0271 | +0.0324 |
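The noise delta row can be reproduced from the per-scenario schema scores. The sketch below assumes the delta is the mean schema validity over the four noisy scenarios minus the mean over the four clean ones; the report does not spell out the formula, but this definition matches the published figures. The helper name is illustrative.

```python
from statistics import mean

# Per-scenario schema validity for Gemini 2.5 Pro, copied from its per-scenario table.
gemini_schema = {
    "en-a-clean": 0.9623, "en-a-noise": 0.9565,
    "en-b-clean": 0.8182, "en-b-noise": 0.8864,
    "ko-a-clean": 0.9750, "ko-a-noise": 0.9722,
    "ko-b-clean": 0.9268, "ko-b-noise": 0.9756,
}


def noise_delta(scores: dict[str, float]) -> float:
    """Mean over noisy scenarios minus mean over clean scenarios (assumed definition)."""
    clean = [v for k, v in scores.items() if k.endswith("-clean")]
    noisy = [v for k, v in scores.items() if k.endswith("-noise")]
    return mean(noisy) - mean(clean)


delta = noise_delta(gemini_schema)
print(f"{delta:+.4f}")
```

A positive value means the model's schema validity was higher on the noisy variants than on the clean ones.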

Per-Scenario: Claude Sonnet 4.6

Avg schema validity: 0.9945
Avg loanword preservation: 0.8854
Schema (EN): 0.9952
Schema (KO): 0.9938

| Scenario | Tags | Schema | Loanword | loanwords_detected / note |
|---|---|---|---|---|
| en-a-clean | | 1.0000 | 0.8155 | 세서미 오일 ✓ (valid Konglish form; 참기름 is more common in traditional Korean, but both are correct) |
| en-a-noise | noisy | 0.9808 | 0.7767 | 세서미 오일 ✓ (same) |
| en-b-clean | | 1.0000 | 0.7368 | 세서미 오일 ✓ (Konglish form; Gemini/Qwen used English "sesame oil") |
| en-b-noise | noisy | 1.0000 | 0.7544 | 세서미 오일 ✓ (same) |
| ko-a-clean | ko | 1.0000 | 1.0000 | 프라이팬 |
| ko-a-noise | ko, noisy | 0.9750 | 1.0000 | 프라이팬 |
| ko-b-clean | ko | 1.0000 | 1.0000 | 레시피 |
| ko-b-noise | ko, noisy | 1.0000 | 1.0000 | 레시피 |
| Average | | 0.9945 | 0.8854 | Consistently uses Konglish register: 세서미 오일 on EN, 프라이팬/레시피 on KO (all valid forms) |

Per-Scenario: Gemini 2.5 Pro

Avg schema validity: 0.9341
Avg loanword preservation: 0.8733
Schema (EN): 0.9058
Schema (KO): 0.9624

| Scenario | Tags | Schema | Loanword | loanwords_detected / note |
|---|---|---|---|---|
| en-a-clean | | 0.9623 | 0.8447 | sesame oil (English form, not Konglish) |
| en-a-noise | noisy | 0.9565 | 0.7379 | English forms; body text rescues loanword score |
| en-b-clean | | 0.8182 | 0.7193 | sesame oil, recipe (English forms) |
| en-b-noise | noisy | 0.8864 | 0.6842 | English forms; some optional fields null |
| ko-a-clean | ko | 0.9750 | 1.0000 | 프라이팬 |
| ko-a-noise | ko, noisy | 0.9722 | 1.0000 | 프라이팬 |
| ko-b-clean | ko | 0.9268 | 1.0000 | 레시피 |
| ko-b-noise | ko, noisy | 0.9756 | 1.0000 | 레시피 |
| Average | | 0.9341 | 0.8733 | EN loanword in English form; body text rescues score. KO perfect. |

Per-Scenario: Qwen3 Max Thinking

Avg schema validity: 0.9340
Avg loanword preservation: 0.8233
Schema (EN): 0.8797
Schema (KO): 0.9884

| Scenario | Tags | Schema | Loanword | loanwords_detected / note |
|---|---|---|---|---|
| en-a-clean | | 0.8222 | 0.7087 | English forms; optional fields sparse |
| en-a-noise | noisy | 0.8667 | 0.6408 | English forms; more null fields vs. clean |
| en-b-clean | | 0.8723 | 0.5965 | Over-detected: 미역국, miyeokguk, 의미 (native terms, not loanwords) |
| en-b-noise | noisy | 0.9574 | 0.6404 | Same over-detection pattern |
| ko-a-clean | ko | 0.9767 | 1.0000 | 프라이팬 in recipe body ✓ |
| ko-a-noise | ko, noisy | 0.9767 | 1.0000 | 프라이팬 |
| ko-b-clean | ko | 1.0000 | 1.0000 | 레시피 |
| ko-b-noise | ko, noisy | 1.0000 | 1.0000 | 레시피 |
| Average | | 0.9340 | 0.8233 | Largest EN–KO schema gap (0.109). EN-B over-detection hurts loanword score. |
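The EN–KO schema gap quoted for Qwen follows directly from the eight per-scenario scores. The following is a small sketch, assuming the per-language figures are unweighted means over the four scenarios whose IDs start with the language code; the helper name is illustrative.

```python
from statistics import mean

# Per-scenario schema validity for Qwen3 Max Thinking, copied from the table above.
qwen_schema = {
    "en-a-clean": 0.8222, "en-a-noise": 0.8667,
    "en-b-clean": 0.8723, "en-b-noise": 0.9574,
    "ko-a-clean": 0.9767, "ko-a-noise": 0.9767,
    "ko-b-clean": 1.0000, "ko-b-noise": 1.0000,
}


def language_average(scores: dict[str, float], lang: str) -> float:
    """Unweighted mean over scenarios whose ID starts with '<lang>-'."""
    return mean(v for k, v in scores.items() if k.startswith(f"{lang}-"))


en_avg = language_average(qwen_schema, "en")
ko_avg = language_average(qwen_schema, "ko")
gap = ko_avg - en_avg
print(f"EN {en_avg:.4f}  KO {ko_avg:.4f}  gap {gap:.3f}")
```

The same helper applied to the other two models' tables gives their gaps for comparison.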

Evaluation Notes

| Dimension | Finding |
|---|---|
| Schema completeness | Claude leads on EN schema (0.9952 vs. 0.9058 Gemini, 0.8797 Qwen). KO schemas are competitive across all models; EN is harder because more optional fields are left null. |
| Loanword register | Claude consistently used 세서미 오일 (Konglish phonetic form) across all EN scenarios; Gemini and Qwen used English "sesame oil". Both are valid: 참기름 is the more common traditional Korean term, but 세서미 오일 is an accepted Konglish form in modern and Westernized Korean cooking. The more interesting story is register consistency: Claude committed to the Korean-script Konglish form even when the source language was English. The loanword scorer gives equal credit for either form (body-text search); a future sub-score for Korean-script preservation would better capture this distinction. |
| KO loanword unanimity | All models score 1.000 on KO loanword preservation. Korean-script loanwords (프라이팬, 레시피) are visually distinct and consistently captured. |
| Noise robustness | Claude is the only model to degrade under noise (Δ −0.011). Gemini (+0.027) and Qwen (+0.032) marginally improve, likely because noisy variants trigger more conservative, complete schema fills rather than creative embellishment of optional fields. |
| Cultural score (judge active) | Claude 4.88/5, Gemini and Qwen 4.25/5; all models perform well. Claude's higher score reflects stronger use of hidden_intent and more substantive cultural_notes. The judge model is claude-sonnet-4.6, so self-favoritism bias is possible; consider a third-party judge for production use. |
| Qwen language gap | Qwen's EN schema (0.880) is 0.109 below its KO schema (0.988), the largest language gap of the three models. A KO-only benchmark would significantly overstate Qwen's multilingual capability. |
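The Korean-script sub-score suggested under "Loanword register" does not exist in the benchmark yet. One possible shape for it, offered only as a hypothetical sketch, is the share of detected loanwords written entirely in precomposed Hangul syllables (U+AC00 to U+D7A3). All function names below are assumptions, as is the choice to return 1.0 when nothing was detected (mirroring the loanword scorer's behavior for sources without loanwords).

```python
import unicodedata


def is_hangul_syllable(ch: str) -> bool:
    """True for precomposed Hangul syllables (Unicode block U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def in_korean_script(term: str) -> bool:
    """True when every non-space character of the term is a Hangul syllable."""
    # NFC composes decomposed jamo sequences into single syllable code points.
    normalized = unicodedata.normalize("NFC", term)
    letters = [ch for ch in normalized if not ch.isspace()]
    return bool(letters) and all(is_hangul_syllable(ch) for ch in letters)


def korean_script_rate(detected_terms: list[str]) -> float:
    """Fraction of detected loanwords rendered in Korean script (hypothetical metric)."""
    if not detected_terms:
        return 1.0  # assumption: nothing detected is treated as nothing to penalize
    return sum(in_korean_script(t) for t in detected_terms) / len(detected_terms)


print(korean_script_rate(["세서미 오일"]))            # Konglish form
print(korean_script_rate(["sesame oil", "recipe"]))  # English forms
```

Under this sketch, an output that keeps 세서미 오일 would score higher than one that falls back to "sesame oil", which is the distinction the current body-text search cannot express.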

Bug Log — Metric Inversion in src/benchmark.py (Run 1, now patched)

Root cause

The original BenchmarkResult.to_dict() passed the raw avg_schema_validity and avg_cultural_score / 5.0 (both higher-is-better) directly into the "cer" and "wer" slots of ModelRanker, which treats both slots as error rates:

# Bug — code that produced the original results.csv (Run 1)
"cer": self.avg_schema_validity,        # ← no inversion
"wer": self.avg_cultural_score / 5.0,   # ← no inversion

ModelRanker.normalize_metric() at ranking.py:166 normalizes cer with lower_is_better=True:

norm = (v - min_val) / (max_val - min_val)
if lower_is_better:
    norm = 1.0 - norm   # lowest raw value → 1.0 (best rank)
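To see the inversion in isolation, the normalization step can be run on the three Run 1 schema averages listed under "Effect on each model". This is a self-contained sketch: `min_max_normalize` is a stand-in written for illustration, since the real signature of ModelRanker.normalize_metric() is not shown here, and the 0.5 fallback for identical values is inferred from the reported wer_norm = 0.5.

```python
def min_max_normalize(values: dict[str, float], lower_is_better: bool) -> dict[str, float]:
    """Min-max normalize across models, flipping the scale for error-style metrics."""
    min_val, max_val = min(values.values()), max(values.values())
    if max_val == min_val:
        # No spread between models: assume a neutral 0.5 for everyone.
        return {name: 0.5 for name in values}
    result = {}
    for name, v in values.items():
        norm = (v - min_val) / (max_val - min_val)
        if lower_is_better:
            norm = 1.0 - norm  # lowest raw value becomes 1.0 (best)
        result[name] = norm
    return result


# Run 1 schema validity, passed in raw (higher is better) as in the buggy code path.
run1_schema = {"claude": 0.9886, "gemini": 0.9350, "qwen": 0.9149}
cer_norm = min_max_normalize(run1_schema, lower_is_better=True)
for name, value in cer_norm.items():
    print(f"{name}: {value:.3f}")
```

Because the values are treated as error rates, the model with the highest schema validity receives 0.0 and the model with the lowest receives 1.0.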

Effect on each model (Run 1)

Schema validity inputs (raw, no inversion):
  Claude  = 0.9886  →  cer_norm = 0.000  ← penalized for highest schema
  Gemini  = 0.9350  →  cer_norm = 0.727
  Qwen    = 0.9149  →  cer_norm = 1.000  ← rewarded for lowest schema

Cultural WER = 0.0 for all (judge stubbed) → wer_norm = 0.5 for all

Loanword (correctly normalized, lower_is_better=False):
  Gemini  = 0.8882  →  loanword_norm = 1.000
  Claude  = 0.8859  →  loanword_norm = 0.957
  Qwen    = 0.8352  →  loanword_norm = 0.000

Buggy composite (0.40·cer_norm + 0.25·wer_norm + 0.35·loanword_norm):
  Gemini  → 0.291 + 0.125 + 0.350 = 0.7658  rank 1
  Qwen    → 0.400 + 0.125 + 0.000 = 0.5250  rank 2
  Claude  → 0.000 + 0.125 + 0.335 = 0.4600  rank 3

Fix (current src/benchmark.py)

"cer": 1.0 - self.avg_schema_validity,          # lower = more valid ✓
"wer": 1.0 - (self.avg_cultural_score / 5.0),   # lower = higher judge score ✓
"loanword_accuracy": self.avg_loanword_score,    # already higher-is-better ✓

Run 2 (this report) was generated after applying this fix with the judge active.
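The effect of the fix on ranking order can be illustrated by feeding the Run 1 averages through both mappings. This is a sketch under stated assumptions, not the project's code: `min_max`, `rank`, and `to_dict` are illustrative helpers, the stubbed judge is assumed to yield a cultural score of 0.0, and ties are assumed to normalize to 0.5. The "fixed" numbers are what this sketch computes from Run 1 inputs; they are not reported results.

```python
WEIGHTS = {"cer": 0.40, "wer": 0.25, "loanword_accuracy": 0.35}
LOWER_IS_BETTER = {"cer": True, "wer": True, "loanword_accuracy": False}

# Run 1 averages from the bug log; cultural = 0.0 because the judge was stubbed.
RUN1 = {
    "claude": {"schema": 0.9886, "cultural": 0.0, "loanword": 0.8859},
    "gemini": {"schema": 0.9350, "cultural": 0.0, "loanword": 0.8882},
    "qwen": {"schema": 0.9149, "cultural": 0.0, "loanword": 0.8352},
}


def min_max(column, lower_is_better):
    """Min-max normalize one metric across models; ties fall back to 0.5."""
    lo, hi = min(column.values()), max(column.values())
    if hi == lo:
        return {m: 0.5 for m in column}
    norm = {m: (v - lo) / (hi - lo) for m, v in column.items()}
    return {m: 1.0 - n for m, n in norm.items()} if lower_is_better else norm


def rank(rows):
    """Weighted sum of normalized metrics, sorted best first."""
    composite = dict.fromkeys(rows, 0.0)
    for metric, weight in WEIGHTS.items():
        column = {m: row[metric] for m, row in rows.items()}
        for m, n in min_max(column, LOWER_IS_BETTER[metric]).items():
            composite[m] += weight * n
    return sorted(composite.items(), key=lambda kv: kv[1], reverse=True)


def to_dict(r, invert):
    """Map averages to ranker slots, with or without the 1.0 - x inversion."""
    schema, cultural = r["schema"], r["cultural"] / 5.0
    return {
        "cer": 1.0 - schema if invert else schema,
        "wer": 1.0 - cultural if invert else cultural,
        "loanword_accuracy": r["loanword"],
    }


buggy_rank = rank({m: to_dict(r, invert=False) for m, r in RUN1.items()})
fixed_rank = rank({m: to_dict(r, invert=True) for m, r in RUN1.items()})
print("buggy:", [(m, round(s, 4)) for m, s in buggy_rank])
print("fixed:", [(m, round(s, 4)) for m, s in fixed_rank])
```

Without the inversion the order is Gemini, Qwen, Claude, matching the Run 1 ranking; with it, the same inputs order as Claude, Gemini, Qwen.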

Metrics Reference

| Metric | Weight | What it measures |
|---|---|---|
| Schema Validity | 40% | Fraction of BilingualRecipe fields populated, including optional fields. Requires ≥1 ingredient and ≥1 step. |
| Loanword Preservation | 35% | Fraction of source Konglish terms found anywhere in the full recipe output. Scores 1.0 when the source has no detectable loanwords. |
| Cultural Subtlety | 25% | LLM-as-Judge 1–5 scale. Judge: anthropic/claude-sonnet-4.6. Set judge_model in config.yaml to change the judge model. |
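The two deterministic metrics above can be sketched in a few lines. Everything here is an assumption for illustration: the BilingualRecipe field set is hypothetical apart from cultural_notes and hidden_intent (which the report mentions), returning 0.0 when the ingredient/step requirement is unmet is a guess at how the requirement is enforced, and substring search stands in for the scorer's body-text search.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class BilingualRecipe:
    # Hypothetical field set; the real schema is not shown in this report.
    title_en: Optional[str] = None
    title_ko: Optional[str] = None
    ingredients: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    cultural_notes: Optional[str] = None
    hidden_intent: Optional[str] = None


def schema_validity(recipe: BilingualRecipe) -> float:
    """Fraction of fields populated, optional fields included."""
    if not recipe.ingredients or not recipe.steps:
        return 0.0  # assumption: missing required content invalidates the recipe
    values = [getattr(recipe, f.name) for f in fields(recipe)]
    populated = sum(1 for v in values if v not in (None, "", []))
    return populated / len(values)


def loanword_preservation(source_terms: list[str], output_text: str) -> float:
    """Fraction of source loanwords found anywhere in the output text."""
    if not source_terms:
        return 1.0  # per the metric definition: no detectable loanwords scores 1.0
    found = sum(1 for term in source_terms if term in output_text)
    return found / len(source_terms)


recipe = BilingualRecipe(title_en="Seaweed soup", ingredients=["미역"], steps=["끓인다"])
print(schema_validity(recipe))
print(loanword_preservation(["프라이팬", "레시피"], "프라이팬에 기름을 두른다"))
```

In this sketch a recipe that fills only the title, ingredients, and steps scores 3 of 6 fields, which illustrates why sparse optional fields pull EN schema scores down.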