GPT-3 small model for Korean (based on GPT-2)

  • Trained on a 70GB Korean text dataset with a vocabulary of 42,000 lower-cased subwords
  • Model performance and other Korean language models are listed on GitHub

from transformers import BertTokenizerFast, GPT2LMHeadModel

# The tokenizer is BERT-style, so encode() prepends [CLS] and appends [SEP]
tokenizer_gpt3 = BertTokenizerFast.from_pretrained("kykim/gpt3-kor-small_based_on_gpt2")
input_ids = tokenizer_gpt3.encode("text to tokenize")[1:]  # remove cls token

model_gpt3 = GPT2LMHeadModel.from_pretrained("kykim/gpt3-kor-small_based_on_gpt2")
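
The loading snippet stops before any text is produced. A minimal sketch of greedy generation might look like the following. The helper names `drop_cls` and `generate_korean`, the `max_new_tokens` default, and the choice of greedy decoding are assumptions made for illustration and are not part of the model card. As in the snippet above, only the leading [CLS] token is dropped; the source does not say whether the trailing [SEP] should also be removed for prompting.

```python
from typing import List


def drop_cls(ids: List[int], cls_id: int) -> List[int]:
    """Return ids without a leading [CLS] token, if one is present."""
    return ids[1:] if ids and ids[0] == cls_id else ids


def generate_korean(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy continuation of `prompt` (hypothetical helper; downloads the model)."""
    # Heavy dependencies are imported here so drop_cls stays usable without them.
    import torch
    from transformers import BertTokenizerFast, GPT2LMHeadModel

    name = "kykim/gpt3-kor-small_based_on_gpt2"
    tokenizer = BertTokenizerFast.from_pretrained(name)
    model = GPT2LMHeadModel.from_pretrained(name)
    model.eval()

    # Same preprocessing as the model card: encode, then strip the [CLS] token.
    ids = drop_cls(tokenizer.encode(prompt), tokenizer.cls_token_id)
    input_ids = torch.tensor([ids])  # add a batch dimension

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,  # assumed value, tune as needed
            do_sample=False,                # greedy decoding (an assumption)
            pad_token_id=tokenizer.pad_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `generate_korean("안녕하세요")` would download the checkpoint on first use and return the prompt followed by the model's continuation.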