---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---

# RetrievaBERT Model

**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM. It is designed for use in Japanese.

## Model Details

### Model Description

**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM. It is designed for use in Japanese.

This model offers several advanced features compared to traditional BERT models:

- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Masked Language Modeling (MLM) only; Next Sentence Prediction (NSP) is not used.
- **Token Type IDs**: Not used in this model.

### Model Sources

- **Developed by:** Retrieva, Inc.
- **Model type:** Based on the MegatronBERT architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0

## Uses

This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.

### Direct Use

This model is pre-trained using Masked Language Modeling; the mask token can be obtained from the tokenizer as `tokenizer.mask_token`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.

Example code for direct use:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Insert the model's mask token at the position to be predicted.
# The example sentence means: "Hello! My name is <mask>!"
text = f"こんにちは!私の名前は{tokenizer.mask_token}です!"
print(pipe(text))
```

### Downstream Use

RetrievaBERT is compatible with Hugging Face's AutoModel classes.
To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class.
For detailed configuration, refer to the config.json file.
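For example, the following is a minimal fine-tuning sketch for a binary text-classification task using the `Trainer` API. The toy dataset, `num_labels=2`, and the training hyperparameters are illustrative placeholders rather than recommended settings.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "retrieva-jp/bert-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is a placeholder; set it to match your task.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# Toy dataset used only to illustrate the API; replace it with your own data.
train_dataset = Dataset.from_dict(
    {"text": ["とても良い映画だった。", "退屈で最後まで見られなかった。"], "label": [1, 0]}
)

def tokenize(batch):
    # Truncate to the model's maximum sequence length (2048 tokens).
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_dataset = train_dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="retrievabert-finetuned",
    learning_rate=2e-5,  # typical BERT-style fine-tuning value; tune per task
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```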
## Training Details

### Training Data

The RetrievaBERT model was pre-trained on a combination of five datasets:

- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- Chinese Wikipedia (dump of 2024-01-20)
- Korean Wikipedia (dump of 2024-01-20)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)

The model was trained on 180 billion tokens drawn from these datasets.

### Training Procedure

The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We adopted a curriculum learning approach similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:

- Sequence length 128: 31,000 steps
- Sequence length 256: 219,000 steps
- Sequence length 512: 192,000 steps
- Sequence length 2048: 12,000 steps

#### Training Hyperparameters

The model was trained with the following hyperparameters:

- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16

## Evaluation

We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

| Model | MARC-ja/acc | JSTS/Pearson | JSTS/Spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.952 | 0.916 | 0.877 | 0.896 | 0.916 | 0.879 | 0.815 |

## Technical Specifications

### Model Architecture

The RetrievaBERT model is based on BERT with the following hyperparameters:

- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048

As mentioned earlier, the main differences from the original BERT are:

- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.

### Compute Infrastructure

[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)

This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).

#### Software

The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

## More Information

https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)

## Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

## Model Card Contact

pr@retrieva.jp