ku-accms/roberta-base-japanese-ssuw

Model description

This is a Japanese RoBERTa base model pre-trained on text segmented into super short unit words (SSUW).

Pre-processing

The input text should be converted to full-width (zenkaku) characters and segmented into super short unit words in advance (e.g., by KyTea).
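
As a minimal sketch of the first step (assuming the zenhan package is installed; the segmentation step with KyTea itself is shown in the PyTorch example below):

import zenhan

# zenhan.h2z maps half-width ASCII, digits, and katakana to their
# full-width (zenkaku) equivalents, which is the form this model expects.
raw = "ｷｮｳﾄ大学でNLPを学ぶ。"
zenkaku = zenhan.h2z(raw)
# zenkaku now contains only full-width characters; pass it to KyTea to
# obtain space-separated super short unit words.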

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='ku-accms/roberta-base-japanese-ssuw')
>>> unmasker("京都 大学 で [MASK] を 専攻 する 。")
[{'sequence': '京都 大学 で 文学 を 専攻 する 。',
  'score': 0.1479644924402237,
  'token': 17907,
  'token_str': '文学'},
 {'sequence': '京都 大学 で 哲学 を 専攻 する 。',
  'score': 0.07658644765615463,
  'token': 19302,
  'token_str': '哲学'},
 {'sequence': '京都 大学 で デザイン を 専攻 する 。',
  'score': 0.06302948296070099,
  'token': 14411,
  'token_str': 'デザイン'},
 {'sequence': '京都 大学 で 建築 を 専攻 する 。',
  'score': 0.060596249997615814,
  'token': 15478,
  'token_str': '建築'},
 {'sequence': '京都 大学 で 工学 を 専攻 する 。',
  'score': 0.0574776753783226,
  'token': 18632,
  'token_str': '工学'}]

Here is how to use this model to get the features of a given text in PyTorch:

import zenhan
import Mykytea

# Path to a KyTea model file ("somewhere" is a placeholder).
kytea_model_path = "somewhere"
kytea = Mykytea.Mykytea("-model {} -notags".format(kytea_model_path))

def preprocess(text):
    # Convert half-width characters to full-width (zenkaku), then segment
    # the text into space-separated super short unit words with KyTea.
    return " ".join(kytea.getWS(zenhan.h2z(text)))

from transformers import BertTokenizer, RobertaModel

# The vocabulary is WordPiece-based, so the BERT tokenizer class is used
# even though the model itself is a RoBERTa model.
tokenizer = BertTokenizer.from_pretrained('ku-accms/roberta-base-japanese-ssuw')
model = RobertaModel.from_pretrained("ku-accms/roberta-base-japanese-ssuw")

text = "京都大学で自然言語処理を専攻する。"
encoded_input = tokenizer(preprocess(text), return_tensors='pt')
output = model(**encoded_input)
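
The contextual embeddings are then available as output.last_hidden_state, a tensor of shape (batch_size, sequence_length, hidden_size); for a base-sized model the hidden size is 768:

# One vector per subword token, including the [CLS] and [SEP] tokens
# added by the tokenizer.
print(output.last_hidden_state.shape)  # e.g. torch.Size([1, n_tokens, 768])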

Training data

We used a Japanese Wikipedia dump (as of 20230101, 3.3GB) and the Japanese portion of CC100 (70GB).

Training procedure

We first segmented the texts into words with KyTea and then tokenized the words into subwords using WordPiece with a vocabulary size of 32,000. We pre-trained the RoBERTa model using the transformers library. The training took about 7 days on 4 NVIDIA A100-SXM4-80GB GPUs.
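
As an illustration of the vocabulary-building step (not the actual training script; file names and special-token choices are assumptions), a 32,000-subword WordPiece vocabulary can be trained on the SSUW-segmented corpus with the tokenizers library:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Corpus files are assumed to hold one KyTea-segmented sentence per line.
corpus_files = ["wikipedia_ja_ssuw.txt", "cc100_ja_ssuw.txt"]  # hypothetical names

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
# Words are already separated by single spaces, so splitting on whitespace
# is all that is needed before the subword vocabulary is learned.
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.WordPieceTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("wordpiece-ja-ssuw.json")  # hypothetical output path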

The following hyperparameters were used for the pre-training; a configuration sketch in code follows the list.

  • learning_rate: 1e-4
  • weight_decay: 1e-2
  • per_device_train_batch_size: 80
  • num_devices: 4
  • gradient_accumulation_steps: 3
  • total_train_batch_size: 960
  • max_seq_length: 512
  • optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-06
  • lr_scheduler_type: linear schedule with warmup
  • training_steps: 500,000
  • warmup_steps: 10,000
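
The sketch below mirrors these hyperparameters, assuming a standard transformers Trainer setup with a masked-LM data collator; the paths, the model configuration details, and the collator settings are assumptions, not the actual training script.

from transformers import (
    BertTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("ku-accms/roberta-base-japanese-ssuw")

# 512 tokens plus the two position slots RoBERTa reserves for its padding offset;
# other architecture settings are left at the base-model defaults.
config = RobertaConfig(vocab_size=32000, max_position_embeddings=514)
model = RobertaForMaskedLM(config)

# Masked-LM collator (masking rate left at the library default of 15%).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

training_args = TrainingArguments(
    output_dir="roberta-base-japanese-ssuw",  # hypothetical output path
    learning_rate=1e-4,
    weight_decay=1e-2,
    per_device_train_batch_size=80,   # x 4 GPUs x 3 accumulation steps = 960
    gradient_accumulation_steps=3,
    max_steps=500_000,
    warmup_steps=10_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
)

# trainer = Trainer(model=model, args=training_args, data_collator=data_collator,
#                   train_dataset=train_dataset)  # train_dataset: pre-tokenized corpus
# trainer.train()  # launched across 4 GPUs, e.g. with torchrun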