ASR Benchmark Report

OpenAI vs Deepgram
Multilingual Kitchen Audio

Date: May 31, 2026 · Clips: 8 · Languages: EN, KO · Models: 2

Summary

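openai-gpt4o-transcribe wins on accuracy, with the lowest CER (0.0528 vs 0.0773). deepgram-nova-3 is cheapest per minute ($0.0052 vs $0.0060) and has the lowest average latency (1.97s vs 4.01s), although it is not the faster model on every clip.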

Model Results


Rank  Model                    CER     WER     Loanword Acc  Composite  Avg latency  Cost      $/min
#1    openai-gpt4o-transcribe  0.0528  0.1135  0.9742        1.0000     4.01s        $0.04728  $0.0060
#2    deepgram-nova-3          0.0773  0.1784  0.9710        0.0000     1.97s        $0.04098  $0.0052

CER bands: Excellent (CER ≤ 0.05) · Good (0.05 < CER ≤ 0.10) · Needs improvement (CER > 0.10)
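
The bands in the legend can be read as a simple threshold lookup. The sketch below is illustrative only; the function name `cer_band` is hypothetical and not taken from the benchmark code.

```python
def cer_band(cer: float) -> str:
    """Map a CER value to the legend bands used in this report."""
    if cer <= 0.05:
        return "Excellent"
    if cer <= 0.10:
        return "Good"
    return "Needs improvement"


# Overall CER values from the Model Results section.
for model, cer in [("openai-gpt4o-transcribe", 0.0528), ("deepgram-nova-3", 0.0773)]:
    print(f"{model}: {cer_band(cer)}")
```

Under these thresholds both models land in the Good band overall; only individual clips fall into Excellent or Needs improvement.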

Per-Sample Breakdown


Rows marked "noisy" are noisy clips. Lat = API latency in seconds.

                       openai-gpt4o-transcribe   deepgram-nova-3
Sample      Condition  CER     WER     Lat       CER     WER     Lat
en-a-clean  clean      0.0401  0.0473  4.12s     0.0518  0.0676  1.5s
en-a-noise  noisy      0.0518  0.0878  3.33s     0.0968  0.1622  4.45s
en-b-clean  clean      0.0472  0.0818  4.04s     0.0730  0.1321  2.91s
en-b-noise  noisy      0.0572  0.1006  4.24s     0.0730  0.1195  0.99s
ko-a-clean  clean      0.0548  0.2079  4.04s     0.0685  0.2376  1.99s
ko-a-noise  noisy      0.0651  0.1980  4.03s     0.0753  0.3168  0.92s
ko-b-clean  clean      0.0337  0.0690  4.1s      0.0599  0.1638  2.23s
ko-b-noise  noisy      0.0899  0.1810  4.2s      0.1423  0.3276  0.76s

Noise Impact


Average CER on clean vs noisy clips. Lower Δ = more noise-robust.

Model                    Clean avg CER  Noisy avg CER  Degradation Δ
openai-gpt4o-transcribe  0.0440         0.0660         +0.0221
deepgram-nova-3          0.0633         0.0969         +0.0336
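
As a cross-check, the averages in this table can be recomputed from the per-sample CER values. This is a minimal sketch that assumes a plain unweighted mean over the four clean and four noisy clips; the helper name `noise_impact` is hypothetical. Because the inputs are already rounded to four decimals, the last digit may differ slightly from the table.

```python
from statistics import mean

# Per-sample CER values copied from the Per-Sample Breakdown table.
CER = {
    "openai-gpt4o-transcribe": {
        "en-a-clean": 0.0401, "en-a-noise": 0.0518,
        "en-b-clean": 0.0472, "en-b-noise": 0.0572,
        "ko-a-clean": 0.0548, "ko-a-noise": 0.0651,
        "ko-b-clean": 0.0337, "ko-b-noise": 0.0899,
    },
    "deepgram-nova-3": {
        "en-a-clean": 0.0518, "en-a-noise": 0.0968,
        "en-b-clean": 0.0730, "en-b-noise": 0.0730,
        "ko-a-clean": 0.0685, "ko-a-noise": 0.0753,
        "ko-b-clean": 0.0599, "ko-b-noise": 0.1423,
    },
}


def noise_impact(samples: dict[str, float]) -> tuple[float, float, float]:
    """Return (clean average, noisy average, noisy minus clean)."""
    clean = mean(v for name, v in samples.items() if name.endswith("-clean"))
    noisy = mean(v for name, v in samples.items() if name.endswith("-noise"))
    return clean, noisy, noisy - clean


for model, samples in CER.items():
    clean, noisy, delta = noise_impact(samples)
    print(f"{model}: clean={clean:.4f} noisy={noisy:.4f} delta={delta:+.4f}")
```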

Cost & Latency


Cost = audio duration × price/min. Latency = API response time only — rate-limit pauses excluded.

Model                    $/min    Audio     Est. cost  Avg latency  Total latency
openai-gpt4o-transcribe  $0.0060  7.88 min  $0.047279  4.01s        32.10s
deepgram-nova-3          $0.0052  7.88 min  $0.040975  1.97s        15.75s
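
The cost formula and latency totals above can be reproduced with a few lines of arithmetic. This sketch uses the rounded audio duration (7.88 min) shown in the table, so the estimated cost can differ from the report in the sixth decimal place; the variable names are illustrative.

```python
AUDIO_MINUTES = 7.88  # total audio duration from the table (rounded)

PRICE_PER_MIN = {
    "openai-gpt4o-transcribe": 0.0060,
    "deepgram-nova-3": 0.0052,
}

# Per-clip API latencies in seconds, copied from the Per-Sample Breakdown table.
LATENCY_S = {
    "openai-gpt4o-transcribe": [4.12, 3.33, 4.04, 4.24, 4.04, 4.03, 4.1, 4.2],
    "deepgram-nova-3": [1.5, 4.45, 2.91, 0.99, 1.99, 0.92, 2.23, 0.76],
}

for model, price in PRICE_PER_MIN.items():
    est_cost = AUDIO_MINUTES * price          # cost = audio duration x price/min
    total_latency = sum(LATENCY_S[model])     # API response time only
    avg_latency = total_latency / len(LATENCY_S[model])
    print(f"{model}: cost=${est_cost:.6f} avg={avg_latency:.2f}s total={total_latency:.2f}s")
```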

Methodology


CER — Character Error Rate

Primary metric for Korean. Spaces stripped before comparison — Korean spacing is inconsistent across models. Follows KsponSpeech evaluation standard.
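
A minimal sketch of CER with spaces stripped, assuming a standard Levenshtein distance over Unicode code points (so each precomposed Hangul syllable counts as one character). It is not the report's scoring script, and the function names are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate after removing all whitespace from both sides."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference is empty after stripping spaces")
    return edit_distance(ref, hyp) / len(ref)


print(cer("간 맞추기", "간맞추기"))   # spacing differs only
print(cer("오븐 예열", "오븐 예얼"))   # one substituted syllable out of four
```

With spaces removed, a pure spacing difference contributes no errors, which is the motivation given above for stripping them.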

WER — Word Error Rate

Secondary metric. Less reliable for Korean due to ambiguous word boundaries. Use CER as primary for Korean content.
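
To illustrate why WER is less reliable for Korean, the sketch below computes WER over whitespace-delimited tokens, an assumed tokenization that may differ from the one used in this benchmark. A transcript that differs from the reference only in spacing is then scored as fully wrong.

```python
def token_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate using whitespace tokenization (an assumption)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference has no tokens")
    return token_edit_distance(ref, hyp) / len(ref)


# Same syllables, different spacing: two reference words vs one merged word.
print(wer("간 맞추기", "간맞추기"))
# An English pair with one wrong word out of four, for comparison.
print(wer("preheat the oven now", "preheat the other now"))
```

The Korean pair scores a WER of 1.0 even though every syllable is correct, while the same pair has no character errors once spaces are stripped.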

Loanword Accuracy

Accuracy on English loanwords and code-switched terms (오븐, 레시피, 간 맞추기). Critical for kitchen use case.
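
The report does not spell out how loanword accuracy is scored. One plausible reading, sketched here purely as an assumption, is the fraction of target terms present in the reference that also appear in the transcript, compared with spaces stripped. The term list and the function name `loanword_accuracy` are illustrative.

```python
# Example terms taken from the description above; a real list would be longer.
LOANWORD_TERMS = ["오븐", "레시피", "간 맞추기"]


def strip_spaces(text: str) -> str:
    return "".join(text.split())


def loanword_accuracy(reference: str, hypothesis: str, terms=LOANWORD_TERMS):
    """Share of terms found in the reference that also occur in the hypothesis.

    Returns None when the reference contains none of the terms.
    """
    ref, hyp = strip_spaces(reference), strip_spaces(hypothesis)
    expected = [strip_spaces(t) for t in terms if strip_spaces(t) in ref]
    if not expected:
        return None
    hits = sum(term in hyp for term in expected)
    return hits / len(expected)


reference = "오븐을 예열하고 레시피대로 간 맞추기"
print(loanword_accuracy(reference, "오븐을 예열하고 레시피대로 간맞추기"))  # all terms kept
print(loanword_accuracy(reference, "오분을 예열하고 레시피대로 간 맞추기"))  # 오븐 misrecognized
```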

Composite Score

Weighted: CER 55% + WER 30% + Loanword 15%. Scores are relative between the models in this run, not absolute. Speed is excluded because it measures API latency, not model quality.
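
One way to read "relative between models" is min-max scaling of each metric across the models before applying the weights. That scaling is an assumption, not something the report confirms, but it does reproduce the 1.0000 and 0.0000 composites shown in Model Results. The function name `composite_scores` is hypothetical.

```python
WEIGHTS = {"cer": 0.55, "wer": 0.30, "loanword": 0.15}
LOWER_IS_BETTER = {"cer", "wer"}  # loanword accuracy is higher-is-better

# Overall metrics from the Model Results section.
RESULTS = {
    "openai-gpt4o-transcribe": {"cer": 0.0528, "wer": 0.1135, "loanword": 0.9742},
    "deepgram-nova-3": {"cer": 0.0773, "wer": 0.1784, "loanword": 0.9710},
}


def composite_scores(results: dict) -> dict:
    """Weighted sum of per-metric scores, each min-max scaled across models."""
    scores = {model: 0.0 for model in results}
    for metric, weight in WEIGHTS.items():
        values = [r[metric] for r in results.values()]
        lo, hi = min(values), max(values)
        for model, r in results.items():
            if hi == lo:
                scaled = 1.0  # all models tie on this metric
            else:
                scaled = (r[metric] - lo) / (hi - lo)
                if metric in LOWER_IS_BETTER:
                    scaled = 1.0 - scaled
            scores[model] += weight * scaled
    return scores


for model, score in composite_scores(RESULTS).items():
    print(f"{model}: {score:.4f}")
```

With only two models, this scaling gives the better model on each metric the full weight and the other zero, so the composite mainly records which model wins more of the weighted metrics rather than by how much.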