Data Preparation Method

#4 opened by amanpatkar

Hi there, I'm curious about the tokenizer you used for tokenizing text data. I've seen examples where text is formatted as ["a p p l e", "d o g"], with characters separated by spaces. Did you employ a similar format and utilize the standard BERT tokenizer from the BERT model family? I'm trying to understand the approach taken for tokenization in your context. Thank you!

Hello.
Yes, I used the standard BertTokenizer. Specifically, I defined a character-level vocabulary in a vocab.txt file and loaded it by passing the path of that file to the vocab_file parameter of BertTokenizer.
Please refer to the vocab.txt uploaded in the root directory.
Thank you.
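
For reference, here is a minimal sketch of that setup, assuming a toy character-level vocabulary; in practice the vocab.txt from the repository root should be used rather than the one written here:

```python
from transformers import BertTokenizer

# Toy character-level vocab for illustration only;
# the actual vocab.txt in the repo root should be used instead.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
chars = [chr(c) for c in range(ord("a"), ord("z") + 1)]

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(special_tokens + chars))

# Load the character-level vocabulary into a standard BertTokenizer.
tokenizer = BertTokenizer(vocab_file="vocab.txt")

# Input text with characters separated by spaces, as in the question above.
print(tokenizer.tokenize("a p p l e"))  # -> ['a', 'p', 'p', 'l', 'e']
```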

Thanks.
Actually, I am trying to predict multiple [MASK] letters in a word. I have a list of words; one will pick any combination of these words, and I need to predict the [MASK]ed letters in them.

Can you share your approach? How would you handle this?
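
For context, a minimal sketch of masked-letter prediction with a character-level BERT is shown below. The checkpoint path "path/to/char-level-bert" is a placeholder, not a confirmed name from this repository, and the vocab.txt is assumed to be the character-level vocabulary mentioned above:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Assumptions: "path/to/char-level-bert" is a placeholder checkpoint path,
# and vocab.txt is the character-level vocabulary from the repo root.
tokenizer = BertTokenizer(vocab_file="vocab.txt")
model = BertForMaskedLM.from_pretrained("path/to/char-level-bert")
model.eval()

word = "apple"
masked_positions = [1, 3]  # hide the 2nd and 4th letters

chars = list(word)
for i in masked_positions:
    chars[i] = tokenizer.mask_token  # "[MASK]"
text = " ".join(chars)               # "a [MASK] p [MASK] e"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely character at each masked position.
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```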

First, I have the following questions:

  1. When predicting the [MASK] letters within each word, is only one selected word input into the language model?
  2. What is the ratio of [MASK] tokens within the words?
