LLM Translation & Schema-Validation Benchmark — Korean Culinary
EN / KO × Clean / Noisy, 8 scenarios, Jun 13 2026, Run 2 (Judge Active)
Run 2: LLM-as-Judge active. The metric inversion bug has been patched, and the
cultural subtlety judge (anthropic/claude-sonnet-4.6) is now live.
Composite scores are absolute: 0.40 × schema + 0.35 × loanword + 0.25 × (cultural/5).
They are directly comparable across runs, unlike the min-max normalized score that
ModelRanker outputs internally, which depends on the best and worst model in each run.
See the Bug Log for the original ranking error.
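The composite formula above can be reproduced in a few lines. This is a minimal sketch rather than the benchmark's own code: the function name `absolute_composite` is hypothetical, and the inputs are the rounded averages published in this report, so a recomputed value may differ slightly from a published composite in the last decimal places.

```python
# Weights taken from the composite formula stated in this report.
W_SCHEMA, W_LOANWORD, W_CULTURAL = 0.40, 0.35, 0.25


def absolute_composite(schema: float, loanword: float, cultural: float) -> float:
    """Absolute composite in [0, 1]; `cultural` is the judge score on the 1-5 scale."""
    return W_SCHEMA * schema + W_LOANWORD * loanword + W_CULTURAL * (cultural / 5.0)


# Published averages for Gemini 2.5 Pro and Qwen3 Max Thinking.
gemini = absolute_composite(0.9341, 0.8733, 4.25)
qwen = absolute_composite(0.9340, 0.8233, 4.25)
print(f"gemini={gemini:.4f} qwen={qwen:.4f}")  # gemini=0.8918 qwen=0.8743
```

Because no term depends on the other models in the run, adding or removing a model leaves every existing composite unchanged.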
Rankings
1. Claude Sonnet 4.6 (anthropic/claude-sonnet-4.6): 0.9514 composite (absolute)
   Schema validity: 0.9945
   Loanword preservation: 0.8854
   Cultural score (1–5): 4.88
2. Gemini 2.5 Pro (google/gemini-2.5-pro): 0.8918 composite (absolute)
   Schema validity: 0.9341
   Loanword preservation: 0.8733
   Cultural score (1–5): 4.25
3. Qwen3 Max Thinking (qwen/qwen3-max-thinking): 0.8743 composite (absolute)
   Schema validity: 0.9340
   Loanword preservation: 0.8233
   Cultural score (1–5): 4.25
All three models are strong. The 0.0771 spread between first and third
(0.9514 → 0.8743) reflects genuine differentiation rather than a failure by any one model.
Claude leads on all three individual metrics. Gemini and Qwen are effectively tied on schema
validity (0.9341 vs. 0.9340) and cultural score (4.25 each); loanword preservation
(0.8733 vs. 0.8233) is what separates them.
Cross-Model Summary
| Metric | Claude Sonnet 4.6 | Gemini 2.5 Pro | Qwen3 Max Thinking |
| --- | --- | --- | --- |
| Composite (absolute) | 0.9514 | 0.8918 | 0.8743 |
| Schema validity (avg) | 0.9945 | 0.9341 | 0.9340 |
| Loanword preservation (avg) | 0.8854 | 0.8733 | 0.8233 |
| Cultural subtlety (1–5, judge) | 4.88 | 4.25 | 4.25 |
| Schema (EN scenarios) | 0.9952 | 0.9058 | 0.8797 |
| Schema (KO scenarios) | 0.9938 | 0.9624 | 0.9884 |
| Loanword (EN scenarios) | 0.7709 | 0.7465 | 0.6466 |
| Loanword (KO scenarios) | 1.0000 | 1.0000 | 1.0000 |
| Noise delta (schema clean→noisy) | −0.0111 | +0.0271 | +0.0324 |
Per-Scenario: Claude Sonnet 4.6
Avg schema validity: 0.9945
Avg loanword preservation: 0.8854

| Scenario | Schema | Loanword | loanwords_detected note |
| --- | --- | --- | --- |
| en-a-clean | 1.0000 | 0.8155 | 세서미 오일 ✓ (valid Konglish form; 참기름 is more common in traditional Korean cooking, but both are correct) |
| en-a-noise | 0.9808 | 0.7767 | 세서미 오일 ✓ (same) |
| en-b-clean | 1.0000 | 0.7368 | 세서미 오일 ✓ (Konglish form; Gemini and Qwen used English "sesame oil") |
| en-b-noise | 1.0000 | 0.7544 | 세서미 오일 ✓ (same) |
| ko-a-clean | 1.0000 | 1.0000 | 프라이팬 ✓ |
| ko-a-noise | 0.9750 | 1.0000 | 프라이팬 ✓ |
| ko-b-clean | 1.0000 | 1.0000 | 레시피 ✓ |
| ko-b-noise | 1.0000 | 1.0000 | 레시피 ✓ |
| Average | 0.9945 | 0.8854 | Consistently uses Konglish register: 세서미 오일 on EN, 프라이팬/레시피 on KO (all valid forms) |
Per-Scenario: Gemini 2.5 Pro
Avg schema validity: 0.9341
Avg loanword preservation: 0.8733

| Scenario | Schema | Loanword | loanwords_detected note |
| --- | --- | --- | --- |
| en-a-clean | 0.9623 | 0.8447 | sesame oil (English form, not Konglish) |
| en-a-noise | 0.9565 | 0.7379 | English forms; body text rescues loanword score |
| en-b-clean | 0.8182 | 0.7193 | sesame oil, recipe (English forms) |
| en-b-noise | 0.8864 | 0.6842 | English forms; some optional fields null |
| ko-a-clean | 0.9750 | 1.0000 | 프라이팬 ✓ |
| ko-a-noise | 0.9722 | 1.0000 | 프라이팬 ✓ |
| ko-b-clean | 0.9268 | 1.0000 | 레시피 ✓ |
| ko-b-noise | 0.9756 | 1.0000 | 레시피 ✓ |
| Average | 0.9341 | 0.8733 | EN loanwords appear in English form; body text rescues the score. KO is perfect. |
Per-Scenario: Qwen3 Max Thinking
Avg schema validity: 0.9340
Avg loanword preservation: 0.8233

| Scenario | Schema | Loanword | loanwords_detected note |
| --- | --- | --- | --- |
| en-a-clean | 0.8222 | 0.7087 | English forms; optional fields sparse |
| en-a-noise | 0.8667 | 0.6408 | English forms; more null fields vs. clean |
| en-b-clean | 0.8723 | 0.5965 | Over-detected: 미역국, miyeokguk, 의미 (native terms, not loanwords) |
| en-b-noise | 0.9574 | 0.6404 | Same over-detection pattern |
| ko-a-clean | 0.9767 | 1.0000 | 프라이팬 in recipe body ✓ |
| ko-a-noise | 0.9767 | 1.0000 | 프라이팬 ✓ |
| ko-b-clean | 1.0000 | 1.0000 | 레시피 ✓ |
| ko-b-noise | 1.0000 | 1.0000 | 레시피 ✓ |
| Average | 0.9340 | 0.8233 | Largest EN–KO schema gap (0.109). EN-B over-detection hurts loanword score. |
Evaluation Notes
| Dimension | Finding |
| --- | --- |
| Schema completeness | Claude leads on EN schema (0.9952 vs. 0.9058 for Gemini and 0.8797 for Qwen). KO schema scores are close across all models (0.9624 to 0.9938); EN is harder for Gemini and Qwen because more optional fields are left null. |
| Loanword register | Claude consistently used 세서미 오일 (Konglish phonetic form) across all EN scenarios; Gemini and Qwen used English "sesame oil". Both are valid: 참기름 is the more common traditional Korean term, but 세서미 오일 is an accepted Konglish form in modern and Westernized Korean cooking. The more interesting story is register consistency: Claude committed to the Korean-script Konglish form even when the source language was English. The loanword scorer gives equal credit for either form (body-text search); a future sub-score for Korean-script preservation would better capture this distinction. |
| KO loanword unanimity | All models score 1.0000 on KO loanword preservation. Korean-script loanwords (프라이팬, 레시피) are visually distinct and consistently captured. |
| Noise robustness | Claude is the only model whose schema validity degrades under noise (Δ −0.011). Gemini (+0.027) and Qwen (+0.032) marginally improve, likely because noisy variants trigger more conservative, complete schema fills rather than creative embellishment of optional fields. |
| Cultural score (judge active) | Claude 4.88/5, Gemini and Qwen 4.25/5; all models perform well. Claude's higher score reflects stronger use of hidden_intent and more substantive cultural_notes. The judge model (claude-sonnet-4.6) is the same model as the top-ranked entrant, so self-favoritism bias is possible; consider a third-party judge for production use. |
| Qwen language gap | Qwen's EN schema (0.8797) is 0.109 below its KO schema (0.9884), the largest language gap of the three models. A KO-only benchmark would significantly overstate Qwen's multilingual capability. |
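The noise delta and the EN–KO gap quoted above are both differences of group means over the per-scenario scores. The sketch below recomputes them for Qwen3 Max Thinking from the schema column of its per-scenario table. The helper names (`mean_where`, `noise_delta`, `language_gap`) are hypothetical, and grouping by the `en-`/`ko-` prefix and `-clean`/`-noise` suffix of the scenario ID is an assumption about how the benchmark slices scenarios.

```python
from statistics import mean

# Qwen3 Max Thinking schema validity, copied from its per-scenario table.
qwen_schema = {
    "en-a-clean": 0.8222, "en-a-noise": 0.8667,
    "en-b-clean": 0.8723, "en-b-noise": 0.9574,
    "ko-a-clean": 0.9767, "ko-a-noise": 0.9767,
    "ko-b-clean": 1.0000, "ko-b-noise": 1.0000,
}


def mean_where(scores, predicate):
    """Mean of the scores whose scenario ID satisfies `predicate`."""
    return mean(v for k, v in scores.items() if predicate(k))


def noise_delta(scores):
    """Noisy mean minus clean mean; positive means the model improved under noise."""
    noisy = mean_where(scores, lambda k: k.endswith("-noise"))
    clean = mean_where(scores, lambda k: k.endswith("-clean"))
    return noisy - clean


def language_gap(scores):
    """KO mean minus EN mean; positive means KO scenarios score higher."""
    ko = mean_where(scores, lambda k: k.startswith("ko-"))
    en = mean_where(scores, lambda k: k.startswith("en-"))
    return ko - en


print(f"noise delta = {noise_delta(qwen_schema):+.4f}")  # +0.0324
print(f"EN-KO gap   = {language_gap(qwen_schema):.3f}")  # 0.109
```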
Bug Log — Metric Inversion in src/benchmark.py (Run 1, now patched)
Root cause
The original BenchmarkResult.to_dict() passed raw avg_schema_validity (higher-is-better)
directly into the "cer" slot of ModelRanker:
# Bug: code that produced the original results.csv (Run 1)
"cer": self.avg_schema_validity, # ← no inversion
"wer": self.avg_cultural_score / 5.0, # ← no inversion
ModelRanker.normalize_metric() at ranking.py:166 normalizes cer with lower_is_better=True:
norm = (v - min_val) / (max_val - min_val)
if lower_is_better:
    norm = 1.0 - norm # lowest raw value → 1.0 (best rank)
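To see the inversion in isolation, here is a self-contained sketch of min-max normalization with a `lower_is_better` flag, applied to the Run 1 schema averages from this Bug Log. It is an illustration under assumptions, not the code in ranking.py: the dict-based signature is invented for the example, and the 0.5 fallback for identical values is inferred from the wer_norm = 0.5 reported for Run 1.

```python
def normalize_metric(values: dict[str, float], lower_is_better: bool) -> dict[str, float]:
    """Min-max normalize to [0, 1], where 1.0 always means 'best'."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: identical values carry no ranking signal, so each maps to 0.5.
        return {name: 0.5 for name in values}
    out = {}
    for name, v in values.items():
        norm = (v - lo) / (hi - lo)
        if lower_is_better:
            norm = 1.0 - norm  # lowest raw value becomes 1.0
        out[name] = norm
    return out


# Run 1 schema validity (higher is better) fed into a lower-is-better slot.
run1_schema = {"claude": 0.9886, "gemini": 0.9350, "qwen": 0.9149}
buggy = normalize_metric(run1_schema, lower_is_better=True)
for name, score in buggy.items():
    print(f"{name}: {score:.3f}")
```

The model with the highest schema validity comes out at 0.0 and the one with the lowest at 1.0, which is the reversal that produced the Run 1 ranking.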
Effect on each model (Run 1)
Schema validity inputs (raw, no inversion):
  Claude = 0.9886 → cer_norm = 0.000 ← penalized for highest schema
  Gemini = 0.9350 → cer_norm = 0.727
  Qwen = 0.9149 → cer_norm = 1.000 ← rewarded for lowest schema
Cultural WER = 0.0 for all (judge stubbed) → wer_norm = 0.5 for all
Loanword (correctly normalized, lower_is_better=False):
  Gemini = 0.8882 → loanword_norm = 1.000
  Claude = 0.8859 → loanword_norm = 0.957
  Qwen = 0.8352 → loanword_norm = 0.000
Buggy composite (0.40·cer_norm + 0.25·wer_norm + 0.35·loanword_norm):
  Gemini → 0.291 + 0.125 + 0.350 = 0.7658 rank 1
  Qwen → 0.400 + 0.125 + 0.000 = 0.5250 rank 2
  Claude → 0.000 + 0.125 + 0.335 = 0.4600 rank 3
Fix (current src/benchmark.py)
"cer": 1.0 - self.avg_schema_validity, # lower = more valid ✓
"wer": 1.0 - (self.avg_cultural_score / 5.0), # lower = higher judge score ✓
"loanword_accuracy": self.avg_loanword_score, # already higher-is-better ✓
Run 2 (this report) was generated after applying this fix with the judge active.
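A cheap guard against this class of bug is to check, after building the ranker inputs, that the model with the best raw score also has the lowest value in each lower-is-better slot. The sketch below applies the same three inversions to the Run 2 averages from this report. The standalone function `to_ranker_metrics` is a hypothetical stand-in for BenchmarkResult.to_dict(), whose full body is not shown here.

```python
def to_ranker_metrics(avg_schema_validity: float,
                      avg_cultural_score: float,
                      avg_loanword_score: float) -> dict[str, float]:
    """Map higher-is-better averages onto the ranker's slots (hypothetical helper)."""
    return {
        "cer": 1.0 - avg_schema_validity,          # lower = more valid
        "wer": 1.0 - (avg_cultural_score / 5.0),   # lower = higher judge score
        "loanword_accuracy": avg_loanword_score,   # already higher-is-better
    }


# Run 2 averages as published in this report.
run2 = {
    "claude": to_ranker_metrics(0.9945, 4.88, 0.8854),
    "gemini": to_ranker_metrics(0.9341, 4.25, 0.8733),
    "qwen": to_ranker_metrics(0.9340, 4.25, 0.8233),
}

# The lowest "cer" and "wer" should belong to the model with the highest raw scores.
best_by_cer = min(run2, key=lambda m: run2[m]["cer"])
best_by_wer = min(run2, key=lambda m: run2[m]["wer"])
print(best_by_cer, best_by_wer)  # claude claude
```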
Metrics Reference
| Metric | Weight | What it measures |
| --- | --- | --- |
| Schema Validity | 40% | Fraction of BilingualRecipe fields populated, including optional fields. Requires ≥1 ingredient and ≥1 step. |
| Loanword Preservation | 35% | Fraction of source Konglish terms found anywhere in the full recipe output. Scores 1.0 when the source has no detectable loanwords. |
| Cultural Subtlety | 25% | LLM-as-Judge score on a 1–5 scale. Judge: anthropic/claude-sonnet-4.6. Set judge_model in config.yaml to use a different judge. |
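The two rule-based metrics can be approximated as follows. This is a rough sketch built only from the one-line descriptions in the table, so several details are assumptions: loanwords are matched by exact substring, a record missing ingredients or steps scores 0.0, and the field names `title`, `ingredients`, and `steps` are invented for the example (only hidden_intent and cultural_notes are named in this report). The real scorers in src/benchmark.py may differ.

```python
def loanword_preservation(source_terms: list[str], output_text: str) -> float:
    """Fraction of source loanwords that appear anywhere in the output text."""
    if not source_terms:
        return 1.0  # per the table: no detectable loanwords scores 1.0
    found = sum(1 for term in source_terms if term in output_text)
    return found / len(source_terms)


def schema_validity(recipe: dict, fields: list[str]) -> float:
    """Fraction of `fields` that are populated in `recipe`."""
    # Assumption: failing the >=1 ingredient / >=1 step requirement scores 0.0.
    if not recipe.get("ingredients") or not recipe.get("steps"):
        return 0.0
    populated = sum(1 for f in fields if recipe.get(f) not in (None, "", [], {}))
    return populated / len(fields)


fields = ["title", "ingredients", "steps", "hidden_intent", "cultural_notes"]
recipe = {
    "title": "미역국",
    "ingredients": ["미역"],
    "steps": ["프라이팬에 참기름을 두른다"],
    "hidden_intent": None,  # optional field left null lowers the score
    "cultural_notes": "Traditionally served on birthdays.",
}

print(schema_validity(recipe, fields))                                # 0.8
print(loanword_preservation(["프라이팬", "레시피"], recipe["steps"][0]))  # 0.5
```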