Adding '\n' to this model (using CTranslate2)
#7, opened by Geremia23
How do I add special tokens (like \n) to this model?
tokenizer.add_tokens('\n') seems to work, but CTranslate2 drops the \n when translating:
import ctranslate2
import transformers
translator = ctranslate2.Translator("opus-mt-de-en", device='cuda')
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
# add special token
tokenizer.add_tokens('\n') # output: 1
tokenizer.added_tokens_decoder # output: {58101: '\n'}
tokenizer.added_tokens_encoder # output: {'\n': 58101}
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Guten\ntag!"))
# source == ['▁Guten', '\n', '▁', 'tag', '!', '</s>']
results = translator.translate_batch([source], beam_size=5)
# results == [TranslationResult(hypotheses=[['▁Good', '▁day', '!']], scores=[], attention=[])]
# Notice that the `\n` is dropped!
How do I get CTranslate2 to map token ID #58101 to \n?
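A possible workaround, offered as a sketch rather than a confirmed answer: tokenizer.add_tokens only extends the tokenizer, so the converted CTranslate2 model presumably still has its original vocabulary and no trained embedding for the new token. Splitting the text on \n, translating each line separately, and re-joining the results sidesteps the vocabulary entirely. The names translate_preserving_newlines and make_ct2_line_translator below are hypothetical helpers, not part of CTranslate2 or transformers; the second one assumes the translator and tokenizer objects created in the snippet above.

```python
from typing import Callable


def translate_preserving_newlines(text: str, translate_line: Callable[[str], str]) -> str:
    """Translate each line on its own and re-join with the original newlines."""
    lines = text.split("\n")
    # Empty or whitespace-only lines are passed through so blank lines survive.
    translated = [translate_line(line) if line.strip() else line for line in lines]
    return "\n".join(translated)


def make_ct2_line_translator(translator, tokenizer, beam_size: int = 5) -> Callable[[str], str]:
    """Hypothetical helper: wrap a ctranslate2.Translator and its tokenizer
    (assumed to be the objects from the original post) as a str -> str function."""

    def translate_line(line: str) -> str:
        # Same encode / translate / decode steps as in the post, for one line.
        source = tokenizer.convert_ids_to_tokens(tokenizer.encode(line))
        results = translator.translate_batch([source], beam_size=beam_size)
        target = results[0].hypotheses[0]
        return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))

    return translate_line
```

Usage would look like translate_preserving_newlines("Guten\ntag!", make_ct2_line_translator(translator, tokenizer)). The trade-off is that each line is translated without the context of its neighbours, which may hurt quality when a sentence spans a line break.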
Geremia23 changed discussion title from "Add '\n' to this model?" to "Adding '\n' to this model (using CTranslate2)"