Adding '\n' to this model (using CTranslate2)

#7
by Geremia23 - opened

How do I add special tokens (like \n) to this model?

tokenizer.add_tokens('\n') seems to work, but CTranslate2 drops the \n when translating:

import ctranslate2
import transformers

translator = ctranslate2.Translator("opus-mt-de-en", device='cuda')
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")

# add special token
tokenizer.add_tokens('\n')     # output:  1

tokenizer.added_tokens_decoder     # output: {58101: '\n'}
tokenizer.added_tokens_encoder     # output: {'\n': 58101}

# tokenize the input; the added '\n' token survives tokenization
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Guten\ntag!"))
# source == ['▁Guten', '\n', '▁', 'tag', '!', '</s>']

results = translator.translate_batch([source], beam_size=5)
# results == [TranslationResult(hypotheses=[['▁Good', '▁day', '!']], scores=[], attention=[])]
# NOTICE THE `\n` IS DROPPED!

How do I get CTranslate2 to map token ID #58101 to \n?
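
One possible workaround, sketched here under assumptions rather than as a confirmed fix: as far as I understand, the converted CTranslate2 model carries its own fixed vocabulary, so a token added only on the tokenizer side is not something the model was trained to emit. Instead of adding a token, the text can be split on line breaks, each line translated separately, and the breaks re-inserted afterwards. The helper name `translate_preserving_newlines` and the `translate_lines` callback are hypothetical names introduced for illustration. The commented-out wiring assumes the `translator` and `tokenizer` objects from the snippet above.

```python
from typing import Callable, List


def translate_preserving_newlines(
    text: str,
    translate_lines: Callable[[List[str]], List[str]],
) -> str:
    """Translate `text` line by line and re-insert the original line breaks.

    `translate_lines` is a hypothetical callback: it takes a list of
    source lines and returns one translated string per line.
    """
    lines = text.split("\n")
    # Only non-empty lines are sent to the model; blank lines are kept as-is.
    to_translate = [line for line in lines if line.strip()]
    translated = iter(translate_lines(to_translate) if to_translate else [])
    return "\n".join(next(translated) if line.strip() else line for line in lines)


# Hypothetical wiring with the objects from the post (assumption, not run here):
#
# def translate_lines(lines):
#     batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(l)) for l in lines]
#     results = translator.translate_batch(batch, beam_size=5)
#     return [
#         tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0]))
#         for r in results
#     ]


def fake_translate_lines(lines: List[str]) -> List[str]:
    """Stand-in translator for demonstration: upper-cases each line."""
    return [line.upper() for line in lines]


demo = translate_preserving_newlines("Guten\ntag!", fake_translate_lines)
```

With the stand-in translator, the line break in `"Guten\ntag!"` stays in place. The trade-off is that each line is translated without the context of its neighbours, which may hurt quality when a sentence spans several lines.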

Geremia23 changed discussion title from Add '\n' to this model? to Adding '\n' to this model (using CTranslate2)
