Data Preparation Method

#4 opened by amanpatkar

Hi there, I'm curious about the tokenizer you used for tokenizing text data. I've seen examples where text is formatted as ["a p p l e", "d o g"], with characters separated by spaces. Did you employ a similar format and utilize the standard BERT tokenizer from the BERT model family? I'm trying to understand the approach taken for tokenization in your context. Thank you!

Hello.
Yes, I used the standard BertTokenizer. Specifically, I defined a character-level vocabulary in a vocab.txt file and loaded it by passing the path of that file to the vocab_file parameter of BertTokenizer.
Please refer to the vocab.txt uploaded in the root directory.
Thank you.
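
For reference, here is a minimal sketch of that setup, assuming a toy character-level vocabulary; in practice the vocab.txt from the repository root should be used rather than the one written here:

```python
from transformers import BertTokenizer

# Toy character-level vocab for illustration only;
# the actual vocab.txt in the repo root should be used instead.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
chars = [chr(c) for c in range(ord("a"), ord("z") + 1)]

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(special_tokens + chars))

# Load the character-level vocabulary into a standard BertTokenizer.
tokenizer = BertTokenizer(vocab_file="vocab.txt")

# Input text with characters separated by spaces, as in the question above.
print(tokenizer.tokenize("a p p l e"))  # -> ['a', 'p', 'p', 'l', 'e']
```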

Thanks.
Actually, I am trying to predict multiple [MASK] letters in a word. I have a list of words; one will pick any combination of these words, and I need to predict the [MASK]ed letters in them.

Can you share your approach? How would you handle this?
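
For context, a minimal sketch of masked-letter prediction with a character-level BERT is shown below. The checkpoint path "path/to/char-level-bert" is a placeholder, not a confirmed name from this repository, and the vocab.txt is assumed to be the character-level vocabulary mentioned above:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Assumptions: "path/to/char-level-bert" is a placeholder checkpoint path,
# and vocab.txt is the character-level vocabulary from the repo root.
tokenizer = BertTokenizer(vocab_file="vocab.txt")
model = BertForMaskedLM.from_pretrained("path/to/char-level-bert")
model.eval()

word = "apple"
masked_positions = [1, 3]  # hide the 2nd and 4th letters

chars = list(word)
for i in masked_positions:
    chars[i] = tokenizer.mask_token  # "[MASK]"
text = " ".join(chars)               # "a [MASK] p [MASK] e"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely character at each masked position.
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```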

First, I have the following questions:

  1. When predicting the [MASK] letters within each word, is only one selected word input into the language model?
  2. What is the ratio of [MASK] tokens within the words?
