	---
	license: cc-by-sa-4.0
	language:
	- ja
	library_name: transformers
	datasets:
	- wikipedia
	---

	# Model Card for Japanese BART base

	## Model description

	This is a Japanese BART base model pre-trained on Japanese Wikipedia.

	## How to use

	You can use this model as follows:

	```python
	from transformers import AutoTokenizer, MBartForConditionalGeneration

	# Load the tokenizer and the pre-trained model from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-base-japanese')
	model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-base-japanese')

	# The input must already be segmented into words by Juman++ (words separated by spaces).
	sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'
	encoding = tokenizer(sentence, return_tensors='pt')
	...
	```

	You can fine-tune this model on downstream tasks.

	## Tokenization

	The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
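The pre-segmentation step can be scripted. The following is a minimal sketch, not part of the original training code: it assumes a `jumanpp` binary on `PATH` that accepts the `--segment` option (segmentation-only output, words separated by spaces). The helper names `normalize_segmented` and `segment_with_jumanpp` are hypothetical.

```python
import shutil
import subprocess


def normalize_segmented(raw: str) -> str:
    """Join the words of one segmented line with single ASCII spaces."""
    return " ".join(word for word in raw.strip("\n").split(" ") if word)


def segment_with_jumanpp(text: str, jumanpp_bin: str = "jumanpp") -> str:
    """Segment one sentence into space-separated words via the Juman++ CLI.

    Assumption: `jumanpp_bin` is on PATH and supports `--segment`.
    """
    if shutil.which(jumanpp_bin) is None:
        raise FileNotFoundError(f"{jumanpp_bin!r} was not found on PATH")
    result = subprocess.run(
        [jumanpp_bin, "--segment"],
        input=text + "\n",      # Juman++ reads one sentence per line from stdin
        capture_output=True,
        text=True,
        check=True,             # raise CalledProcessError on a non-zero exit code
    )
    return normalize_segmented(result.stdout)
```

The returned string can be passed directly to the tokenizer, for example `tokenizer(segment_with_jumanpp(raw_sentence), return_tensors='pt')`. Since Juman++ 2.0.0-rc3 was used for pre-training, a different Juman++ version may produce slightly different word boundaries.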

	## Training data

	We used the following corpora for pre-training:

	- Japanese Wikipedia (18M sentences)

	## Training procedure

	We first segmented texts in the corpora into words using [Juman++](https://github.com/ku-nlp/jumanpp).
	Then, we built a sentencepiece model with a vocabulary of 32,000 tokens, consisting of words from [JumanDIC](https://github.com/ku-nlp/JumanDIC) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).

	We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese BART model using the [fairseq](https://github.com/facebookresearch/fairseq) library.
	Training took 2 weeks on 4 Tesla V100 GPUs.

	The following hyperparameters were used during pre-training:

	- distributed_type: multi-GPU
	- num_devices: 4
	- batch_size: 512
	- training_steps: 500,000
	- encoder_layers: 6
	- decoder_layers: 6
	- hidden_size: 768
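To put these numbers in perspective, the short calculation below derives a few rough quantities from the figures stated in this card. It is a back-of-the-envelope sketch under two assumptions that the card does not confirm: `batch_size` is the global batch size counted in sequences, and one sequence corresponds to one sentence. "2 weeks" is treated as exactly 14 days.

```python
# Figures stated in the model card.
num_sentences = 18_000_000   # Japanese Wikipedia sentences
batch_size = 512             # assumed: global batch size, in sequences
training_steps = 500_000
training_days = 14           # "2 weeks", treated as exactly 14 days
num_gpus = 4

# Total sequences processed during pre-training.
sequences_seen = batch_size * training_steps

# Approximate passes over the corpus (assumes one sequence per sentence).
approx_epochs = sequences_seen / num_sentences

# Average wall-clock time per optimizer step, and total GPU-hours.
seconds_per_step = training_days * 24 * 3600 / training_steps
gpu_hours = training_days * 24 * num_gpus

print(f"sequences seen:   {sequences_seen:,}")
print(f"approx. epochs:   {approx_epochs:.1f}")
print(f"seconds per step: {seconds_per_step:.2f}")
print(f"GPU-hours:        {gpu_hours:,}")
```

Under these assumptions, the model saw about 256M sequences (roughly 14 passes over the corpus) at about 2.4 seconds per step, for a total of 1,344 V100 GPU-hours.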