When working with Speech-to-Text (STT) systems, the most common way to quantify transcription quality is through CER (Character Error Rate) and WER (Word Error Rate).
This post explains what CER and WER mean, how each is computed, and when each metric is useful.
✅ Goal
By the end of this post, you’ll be able to:
- Understand what CER and WER measure
- Know how they are computed (intuition + formula)
- Decide when to use CER vs WER
🧠 What Are CER and WER?
Both CER and WER are based on edit distance (Levenshtein distance).
They measure how many edits are required to transform the STT output into the ground-truth transcript.
Edits include:
- Substitution (S): wrong character/word
- Deletion (D): missing character/word
- Insertion (I): extra character/word
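The three edit operations above can be counted with a standard dynamic-programming alignment. The sketch below (the `edit_counts` name is my own, not from any particular toolkit) works on any sequence, so the same function can drive CER (pass strings) and WER (pass word lists):

```python
def edit_counts(ref, hyp):
    """Return (S, D, I): the minimal substitutions, deletions, and
    insertions needed to turn `hyp` into `ref`."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = (total_edits, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):                      # hyp exhausted: deletions
        c, s, d, ins = dp[i - 1][0]
        dp[i][0] = (c + 1, s, d + 1, ins)
    for j in range(1, n + 1):                      # ref exhausted: insertions
        c, s, d, ins = dp[0][j - 1]
        dp[0][j] = (c + 1, s, d, ins + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            c, s, d, ins = dp[i - 1][j - 1]
            best = (c + sub, s + sub, d, ins)          # match / substitution
            c, s, d, ins = dp[i - 1][j]
            best = min(best, (c + 1, s, d + 1, ins))   # deletion
            c, s, d, ins = dp[i][j - 1]
            best = min(best, (c + 1, s, d, ins + 1))   # insertion
            dp[i][j] = best
    return dp[m][n][1:]
```

Because tuples compare element-by-element, ties on total edits are broken toward fewer substitutions, which keeps the counts deterministic.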
🔤 CER (Character Error Rate)
CER evaluates errors at the character level.
✅ Formula
$$CER = \frac{S + D + I}{N}$$
- S = substitutions (characters)
- D = deletions (characters)
- I = insertions (characters)
- N = number of characters in the reference (ground truth)
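Since S + D + I under the optimal alignment is exactly the Levenshtein distance, CER can be computed without tracking the individual counts. A minimal sketch (function names are illustrative):

```python
def levenshtein(a, b):
    """Minimal edit distance between two sequences (iterative DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[-1] + 1,               # insertion
                           prev[j - 1] + (x != y)))   # substitution / match
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate = edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Usage: `cer("hello", "hallo")` → 0.2 (one substituted character out of five).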
🟢 When CER is especially useful
- Languages where spacing/word segmentation is tricky (e.g., Korean)
- You want sensitivity to small spelling/particle changes
- You care about fine-grained transcription quality
📊 How to interpret CER values
- 0–2%: Excellent on clean data.
- 2–10%: Good; errors may be noticeable.
- 10–20%: Usable with post-editing.
- 20%+: Significant quality issues.

These ranges are context-dependent (domain, language, background noise, audio conditions). Always compare against a baseline on the same test set. (Guidance based on common practice and tool documentation.)
🧾 WER (Word Error Rate)
WER evaluates errors at the word level.
✅ Formula
$$WER = \frac{S + D + I}{N}$$
Same structure as CER, but counts are based on words, and N is the number of words in the reference.
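In code, WER is the same computation with words as the unit: split both transcripts on whitespace and run the edit distance over the token lists. A minimal sketch with an inline Levenshtein helper (names are illustrative):

```python
def levenshtein(a, b):
    """Minimal edit distance between two sequences (works on lists too)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[-1] + 1,               # insertion
                           prev[j - 1] + (x != y)))   # substitution / match
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: edit distance over word tokens / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)
```

Note that `split()` is the simplest possible tokenizer; whatever tokenization you choose must be applied identically to reference and hypothesis.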
🟢 When WER is especially useful
- English-like languages where words are clearly separated by spaces
- You want a metric aligned with “how readable the sentence is”
- Your downstream system treats STT output as word tokens
🆚 CER vs WER: Which One Should I Use?
Here’s a practical rule of thumb:
- If your language/tokenization is ambiguous → use CER
- If your language is space-delimited and word-based → use WER
- For Korean STT evaluation, many teams report both, but often rely more on CER.
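The difference is easy to see on a toy pair of sentences (hypothetical example): one wrong character inside a word costs a single character edit but a whole word substitution, so WER reacts much more strongly than CER:

```python
def levenshtein(a, b):
    """Minimal edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

ref, hyp = "the cat sat", "the cats sat"
# Spaces count as characters here; some setups strip them first.
cer_value = levenshtein(ref, hyp) / len(ref)                      # 1 edit / 11 chars
wer_value = levenshtein(ref.split(), hyp.split()) / len(ref.split())  # 1 edit / 3 words
print(f"CER = {cer_value:.3f}, WER = {wer_value:.3f}")  # CER = 0.091, WER = 0.333
```

One inserted character moves CER by about 9% but WER by 33%, which is exactly why character-level scoring is preferred when word boundaries are unreliable.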
📌 Notes That Make Your Metrics More Trustworthy
A few practical tips that improve interpretability:
- Always define normalization rules (numbers, punctuation, spacing, special tokens)
- Ensure the same tokenization policy is used consistently across time
- Report:
- Average CER/WER
- And if possible, breakdowns by scenario (e.g., consent recordings vs. general consultation calls)
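As an illustration of the normalization point, a shared preprocessing function might look like the sketch below. The exact rules (punctuation, numbers, spacing) are project-specific assumptions; what matters is that reference and hypothesis both pass through the *same* function before scoring:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Example normalization applied to both reference and hypothesis.
    Rules here are illustrative, not a standard."""
    text = unicodedata.normalize("NFC", text)   # canonical Unicode form (matters for Korean)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)         # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text
```

Usage: `normalize("  Hello,   World! ")` → `"hello world"`. Freezing a function like this (and versioning it) is what makes CER/WER numbers comparable across time.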
📝 Conclusion
CER and WER are simple but powerful metrics for tracking STT quality. Choose the one that matches your language and tokenization (or report both), fix your normalization rules up front, and compare results on the same test set over time.