Specializing BERTweet for different languages.

#2
by fernandopc - opened

Hi, I have trained an LM for tweets in Portuguese, following most of your work here. I am now stuck on an issue that I think you, the authors of BERTweet, might be able to help with.

Well, to adapt the BERTweetTokenizer for Portuguese we cloned the entire BERTweetTokenizer code and changed only one simple line of it:

From:

return self.demojizer(token)

To:

return self.demojizer(token, language='pt')
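
In other words, the same effect could be obtained by subclassing instead of copying the whole file. A rough sketch of the idea (BERTweetBRTokenizer is our class name, and we assume a version of the emoji package whose demojize accepts a language argument, as recent versions do):

```python
from functools import partial

from emoji import demojize
from transformers import BertweetTokenizer


class BERTweetBRTokenizer(BertweetTokenizer):
    """BertweetTokenizer whose demojizer emits Portuguese emoji descriptions."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The only functional change: every call to self.demojizer(token)
        # inside normalizeToken now demojizes into Portuguese.
        self.demojizer = partial(demojize, language="pt")
```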

We were able to train the tokenizer and the LM from scratch with no problem. The only inconvenience was that we could not instantiate it from AutoTokenizer, although it worked via the BERTweetBRTokenizer.from_pretrained method while we conducted experiments locally. However, we now want to share the model and the tokenizer with the world through Hugging Face and publish the work, allowing users to instantiate them from the AutoModel and AutoTokenizer classes.
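
Concretely, this is what works and what we want (paths and module names are just illustrative):

```python
# Works locally, as long as BERTweetBRTokenizer is importable:
from tokenization_bertweetbr import BERTweetBRTokenizer

tokenizer = BERTweetBRTokenizer.from_pretrained("path/to/bertweetbr")

# What we want users to be able to do once the model is on the Hub:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/bertweetbr")
```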

We created the repo and followed the documented procedure to upload the model and tokenizer to Hugging Face. We were able to instantiate the model using AutoModel, but did not manage to do the same for the tokenizer.

(screenshot: the AutoTokenizer error, showing that BERTweetBRTokenizer could not be found)

It is clear that BERTweetBRTokenizer was not found. I was able to work around this locally, but in order to make the model truly available to the community I want it to work exactly like every other model, such as BERTweet, through the Auto* classes. As you use a very similar approach, and your work is the basis of ours for Portuguese, I was wondering what you did to make this work from AutoTokenizer. Can you give me a tip on that?

Additionally, what would you do if you needed to extend BERTweet to a different language? Would you do the same as we did here, creating a whole new .py file for a language-specific tokenizer while actually changing only a single line (in fact, only passing a different parameter value)?

I appreciate your help in advance.
Fernando.

VinAI Research org

Did you include BertweetBRTokenizer into "tokenization_auto.py"?
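
(For BERTweet itself, BertweetTokenizer is registered in that mapping because it was merged into the transformers library. If forking transformers is not an option, recent transformers versions can instead ship the custom class inside the Hub repo itself; a sketch, with hypothetical module and repo names:)

```python
# The class must live in a standalone .py module (not a notebook) so its
# source can be copied into the repo.
from tokenization_bertweetbr import BERTweetBRTokenizer

tokenizer = BERTweetBRTokenizer.from_pretrained("path/to/bertweetbr")

# Mark the class so that save_pretrained/push_to_hub copies its code into the
# repo and records it under "auto_map" in tokenizer_config.json:
BERTweetBRTokenizer.register_for_auto_class("AutoTokenizer")
tokenizer.push_to_hub("fernandopc/bertweetbr")  # hypothetical repo id

# Users then opt in to running the custom code:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "fernandopc/bertweetbr", trust_remote_code=True
)
```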
For a different language, e.g. Portuguese, I would create a new tokenizer for that language, i.e. including a new vocabulary of BPE subword tokens. At the moment, as far as I understand, you're reusing the BertweetTokenizer's vocabulary, which is specific to English.
OR, you can simply reuse the Tokenizer from "XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond" without changing anything.
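
For example, the XLM-T tokenizer can be loaded directly (repo id as published, if I recall correctly, by the XLM-T authors on the Hugging Face Hub):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base")
```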

dqnguyen changed discussion status to closed
