Greek final sigma (ς) maps to <unk> and causes systematic transcription errors!

#26
by GeorgeArGR - opened

In Greek, when the letter sigma ('σ') occurs at the end of a word, it is written as final sigma ('ς').
Canary-1b-v2 appears to recognise that a final sigma is there, but it returns token 0 (<unk>) in place of every final sigma!
What can I do?

We have identified a reproducible issue affecting Greek transcription in Canary-1B-v2 (and Nemotron 3.5 Streaming ASR).

The problem is related to the Greek final sigma character "ς" (U+03C2).

Tokenizer inspection shows:

  • "σ" is present in the vocabulary.
  • "ς" is not present and maps to <unk>.
  • Common Greek endings such as "ος", "ης", "ας", "ός" also fail because they contain the final sigma.

Diagnostic example:

piece/token for "σ" -> valid token
piece/token for "ς" -> <unk>
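
To check the whole Greek alphabet rather than one character at a time, something like the following could be run over the vocabulary pieces. This is only a sketch: `missing_greek_letters` and `toy_vocab` are names made up for illustration, the toy list is not the real Canary vocabulary, and it assumes the vocabulary can be exported as a plain list of piece strings.

```python
import unicodedata


def missing_greek_letters(vocab_pieces):
    """Return lowercase Greek letters that appear in no vocabulary piece."""
    covered = set()
    for piece in vocab_pieces:
        covered.update(piece)  # collect every character used by any piece

    missing = []
    # Scan the Unicode "Greek and Coptic" block, U+0370..U+03FF.
    for cp in range(0x0370, 0x0400):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if name.startswith("GREEK SMALL LETTER") and ch not in covered:
            missing.append((ch, f"U+{cp:04X}", name))
    return missing


# Toy vocabulary for illustration only (hypothetical, NOT the real one).
toy_vocab = ["▁σ", "▁ο", "▁κύ", "ριο", "τών", "η"]
for ch, code, name in missing_greek_letters(toy_vocab):
    print(ch, code, name)
```

With a real vocabulary, any letter printed here would be a candidate for the same <unk> behaviour as "ς". The scan also covers archaic letters in that block, so some hits may be irrelevant for modern Greek.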

As a result, during ASR decoding the acoustic evidence for the final /s/ sound is often lost or attached to the following word. This produces errors such as:

"ο κύριος Αντώνης" -> "ο κύριο Σαντώνη"

"ο κύριος Αντώνης" -> "ο κύριο ?? αντώνη ??"

or similar word-boundary shifts.

Interestingly, Greek recognition quality is otherwise quite good, suggesting that the issue is not primarily acoustic but tokenizer/vocabulary related.

Could NVIDIA comment on whether:

  1. The absence of the Greek final sigma (ς) from the tokenizer vocabulary is intentional?
  2. A future tokenizer update is planned?
  3. Fine-tuning on adaptation-ready Greek data is expected to address this issue, or whether tokenizer modifications are required?
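
If tokenizer changes are not an option, one possible workaround would be to map "ς" to "σ" in the fine-tuning transcripts and restore the final sigma in the model output with a simple rule. The sketch below is untested against the model. The helper names are hypothetical, and it assumes a fine-tuned model would learn to emit "σ" at word ends; it would not by itself recover an /s/ that the released model drops.

```python
import re

# A "σ" that follows a letter and is not followed by another letter is
# treated as word-final. [^\W\d_] matches any Unicode letter.
_WORD_FINAL_SIGMA = re.compile(r"(?<=[^\W\d_])σ(?![^\W\d_])")


def to_medial_sigma(text):
    """Normalise transcripts so that only in-vocabulary 'σ' is used."""
    return text.replace("ς", "σ")


def restore_final_sigma(text):
    """Rewrite word-final 'σ' as 'ς' in decoded output.

    A standalone 'σ' is left alone. Elided forms such as "μεσ'" would be
    rewritten incorrectly and need separate handling.
    """
    return _WORD_FINAL_SIGMA.sub("ς", text)


normalised = to_medial_sigma("ο κύριος Αντώνης")
print(normalised)
print(restore_final_sigma(normalised))
```

The round trip is lossless for ordinary words, since the position of "σ" in the word fully determines which form is written.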

We would be happy to provide additional examples and diagnostics.

(Example:)

TEXT: σ
IDS : [1357]
TOK : ['▁σ']
DEC : σ

TEXT: ς
IDS : [16053, 0]
TOK : ['▁', '<unk>']
DEC : ⁇

TEXT: ος
IDS : [1747, 0]
TOK : ['▁ο', '<unk>']
DEC : ο ⁇

TEXT: ός
IDS : [1632, 0]
TOK : ['▁ό', '<unk>']
DEC : ό ⁇

TEXT: ης
IDS : [2231, 0]
TOK : ['▁η', '<unk>']
DEC : η ⁇

TEXT: ας
IDS : [1345, 0]
TOK : ['▁α', '<unk>']
DEC : α ⁇

TEXT: κύριος
IDS : [11411, 7089, 0]
TOK : ['▁κύ', 'ριο', '<unk>']
DEC : κύριο ⁇

TEXT: αντώνης
IDS : [2355, 8661, 16135, 0]
TOK : ['▁αν', 'τών', 'η', '<unk>']
DEC : αντώνη ⁇

TEXT: ο κύριος αντώνης
IDS : [1747, 11411, 7089, 0, 2355, 8661, 16135, 0]
TOK : ['▁ο', '▁κύ', 'ριο', '<unk>', '▁αν', 'τών', 'η', '<unk>']
DEC : ο κύριο ⁇ αντώνη ⁇
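
For anyone who wants to reproduce a TEXT/IDS/TOK/DEC listing like the one above on their own tokenizer, a minimal sketch follows. `tokenizer_report` and the toy tokenizer are hypothetical names, and the three callables are an assumption about what the tokenizer exposes. With the `sentencepiece` package they could be bound to `sp.encode`, `sp.id_to_piece`, and `sp.decode`; we have not confirmed how Canary's tokenizer wrapper exposes them.

```python
def tokenizer_report(text, encode, id_to_piece, decode, unk_id=0):
    """Collect ids, pieces, decoded text, and the number of unknown tokens."""
    ids = list(encode(text))
    return {
        "text": text,
        "ids": ids,
        "pieces": [id_to_piece(i) for i in ids],
        "decoded": decode(ids),
        "unk_count": ids.count(unk_id),
    }


# Toy character-level tokenizer for illustration only (NOT Canary's).
TOY_VOCAB = {"<unk>": 0, "ο": 1, "σ": 2}
TOY_PIECES = {i: p for p, i in TOY_VOCAB.items()}


def toy_encode(text):
    # Characters outside the toy vocabulary fall back to id 0 (<unk>).
    return [TOY_VOCAB.get(ch, 0) for ch in text]


def toy_decode(ids):
    return "".join(TOY_PIECES[i] for i in ids)


r = tokenizer_report("ος", toy_encode, TOY_PIECES.__getitem__, toy_decode)
print("TEXT:", r["text"])
print("IDS :", r["ids"])
print("TOK :", r["pieces"])
print("DEC :", r["decoded"])
print("UNK :", r["unk_count"])
```

A non-zero `unk_count` on plain Greek text is the symptom reported in this thread.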

GeorgeArGR changed discussion title from No 'final sigma' ('ς') in Greek Language: I get <UNK0> (token 0?) in place of every 'final sigma' ('-ς') to Greek final sigma (ς) maps to <unk> and causes systematic transcription errors!
