---
base_model: ce-lery/japanese-mistral-300m-base
tags:
- generated_from_trainer
model-index:
- name: checkpoints-finetuning
  results: []
---


# japanese-mistral-300m-instruction

## Overview

Welcome to my model card!

This model's features are:

- Suppression of unknown-word generation, by using byte fallback in the SentencePiece tokenizer and converting it to the Hugging Face Tokenizers format (see the sketch below)
- Pretrained on the Wikipedia and CC-100 datasets
- Uses the [Mistral 300M](https://huggingface.co/ce-lery/japanese-mistral-300m-base/blob/main/config.json) architecture
- Fine-tuned from [ce-lery/japanese-mistral-300m-base](https://huggingface.co/ce-lery/japanese-mistral-300m-base) on [kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja)
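
As a quick illustration of the byte-fallback behavior above, here is a minimal sketch (the rare character and the printed byte pieces are illustrative, not taken from this tokenizer's actual vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ce-lery/japanese-mistral-300m-instruction",
    use_fast=False,
    trust_remote_code=True,
)

# A rare kanji that is unlikely to exist in the vocabulary as a whole piece.
rare = "𩸽"
tokens = tokenizer.tokenize(rare)
print(tokens)  # expected: byte pieces such as ['<0xF0>', '<0xA9>', ...], not <unk>
print(tokenizer.unk_token in tokens)  # expected: False, thanks to byte fallback
```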

Take it easy!

## How to use the model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "ce-lery/japanese-mistral-300m-instruction"
torch.set_float32_matmul_precision('high')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True).to(device)

MAX_ASSISTANT_LENGTH = 100
MAX_INPUT_LENGTH = 128
# The prompts are raw strings, so "\n" stays a literal backslash-n sequence;
# format_output() converts it back to real newlines after generation.
INPUT_PROMPT = r'<s>\n以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n入力:\n{input}\n[SEP]\n応答:\n'
NO_INPUT_PROMPT = r'<s>\n以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n応答:\n'

def prepare_input(instruction, input_text):
    if input_text != "":
        prompt = INPUT_PROMPT.format(instruction=instruction, input=input_text)
    else:
        prompt = NO_INPUT_PROMPT.format(instruction=instruction)
    return prompt

def format_output(output):
    output = output.lstrip("<s>").rstrip("</s>").replace("[SEP]", "").replace("\\n", "\n")
    return output

def generate_response(instruction, input_text):
    prompt = prepare_input(instruction, input_text)
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    n = len(token_ids[0])  # prompt length in tokens

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            min_length=n,
            max_length=min(MAX_INPUT_LENGTH, n + MAX_ASSISTANT_LENGTH),
            top_p=0.95,
            top_k=50,
            temperature=0.4,
            do_sample=True,
            no_repeat_ngram_size=2,
            num_beams=3,
            pad_token_id=tokenizer.pad_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            bad_words_ids=[[tokenizer.unk_token_id]]
        )

    output = tokenizer.decode(output_ids.tolist()[0])
    formatted_output_all = format_output(output)
    response = f"Assistant:{formatted_output_all.split('応答:')[-1].strip()}"

    return formatted_output_all, response 

instruction = "あなたは何でも正確に答えられるAIです。"
questions = [
    "日本で一番高い山は?",
    "日本で一番広い湖は?",
    "世界で一番高い山は?",
    "世界で一番広い湖は?",
    "冗談を言ってください。",
]

# Generate and print a response for each question
for question in questions:
    formatted_output_all, response = generate_response(instruction, question)
    print(response)

```

## Recipe

If you want to reproduce this model, please refer to [this GitHub repository](https://github.com/ce-lery/japanese-mistral-300m-recipe).

I wrote the recipe for building this model there. It covers, for example:

- Preprocessing with SentencePiece (sketched below)
- Pretraining with FlashAttention-2, torch.compile, and DeepSpeed
- Fine-tuning with databricks-dolly-15k-ja
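
As a rough sketch of the preprocessing step, training a SentencePiece model with byte fallback might look like the following. The file names and `vocab_size` here are assumptions for illustration; the exact settings live in the repository above.

```python
import sentencepiece as spm

# Train a unigram SentencePiece model with byte fallback enabled, so that
# characters missing from the vocabulary decompose into byte pieces
# instead of <unk>.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # assumed path to the pretraining text
    model_prefix="tokenizer",  # writes tokenizer.model / tokenizer.vocab
    vocab_size=32000,          # assumed value; check the recipe for the real one
    model_type="unigram",
    byte_fallback=True,
)
```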

If you find any mistakes or errors, please open an issue.
Pull requests are also very welcome!

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 64
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.95) and epsilon=0.0001
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 1000
- num_epochs: 200
- mixed_precision_training: Native AMP
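
As a hedged sketch, these hyperparameters would map onto Hugging Face `TrainingArguments` roughly as follows (dataset and model wiring omitted; `output_dir` is taken from the model-index name above):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints-finetuning",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=64,  # 4 per device * 64 steps = 256 effective batch
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    num_train_epochs=200,
    fp16=True,  # "Native AMP" mixed-precision training
)
```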

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 3.595         | 3.51   | 40   | 3.5299          |
| 3.4769        | 7.02   | 80   | 3.3722          |
| 3.3037        | 10.53  | 120  | 3.1871          |
| 3.1255        | 14.05  | 160  | 3.0088          |
| 2.9615        | 17.56  | 200  | 2.8684          |
| 2.8468        | 21.07  | 240  | 2.7808          |
| 2.7699        | 24.58  | 280  | 2.7205          |
| 2.7139        | 28.09  | 320  | 2.6793          |
| 2.6712        | 31.6   | 360  | 2.6509          |
| 2.6356        | 35.12  | 400  | 2.6294          |
| 2.6048        | 38.63  | 440  | 2.6120          |
| 2.5823        | 42.14  | 480  | 2.5974          |
| 2.5536        | 45.65  | 520  | 2.5849          |
| 2.5293        | 49.16  | 560  | 2.5740          |
| 2.5058        | 52.67  | 600  | 2.5644          |
| 2.482         | 56.19  | 640  | 2.5556          |
| 2.4575        | 59.7   | 680  | 2.5477          |
| 2.4339        | 63.21  | 720  | 2.5405          |
| 2.4073        | 66.72  | 760  | 2.5350          |
| 2.3845        | 70.23  | 800  | 2.5303          |
| 2.3606        | 73.74  | 840  | 2.5253          |
| 2.329         | 77.26  | 880  | 2.5215          |
| 2.3071        | 80.77  | 920  | 2.5185          |
| 2.2768        | 84.28  | 960  | 2.5155          |
| 2.2479        | 87.79  | 1000 | 2.5144          |
| 2.2181        | 91.3   | 1040 | 2.5151          |
| 2.1901        | 94.81  | 1080 | 2.5139          |
| 2.1571        | 98.33  | 1120 | 2.5148          |
| 2.1308        | 101.84 | 1160 | 2.5166          |
| 2.1032        | 105.35 | 1200 | 2.5193          |
| 2.0761        | 108.86 | 1240 | 2.5204          |
| 2.0495        | 112.37 | 1280 | 2.5269          |
| 2.0231        | 115.88 | 1320 | 2.5285          |
| 2.0021        | 119.4  | 1360 | 2.5328          |
| 1.9793        | 122.91 | 1400 | 2.5383          |
| 1.9575        | 126.42 | 1440 | 2.5442          |
| 1.9368        | 129.93 | 1480 | 2.5488          |
| 1.9216        | 133.44 | 1520 | 2.5534          |
| 1.902         | 136.95 | 1560 | 2.5584          |
| 1.8885        | 140.47 | 1600 | 2.5609          |
| 1.8728        | 143.98 | 1640 | 2.5657          |
| 1.8605        | 147.49 | 1680 | 2.5697          |
| 1.8476        | 151.0  | 1720 | 2.5741          |
| 1.8402        | 154.51 | 1760 | 2.5770          |
| 1.8274        | 158.02 | 1800 | 2.5803          |
| 1.8218        | 161.54 | 1840 | 2.5829          |
| 1.8144        | 165.05 | 1880 | 2.5847          |
| 1.8097        | 168.56 | 1920 | 2.5867          |
| 1.8076        | 172.07 | 1960 | 2.5883          |
| 1.8014        | 175.58 | 2000 | 2.5892          |
| 1.8001        | 179.09 | 2040 | 2.5899          |
| 1.7987        | 182.61 | 2080 | 2.5903          |
| 1.7971        | 186.12 | 2120 | 2.5906          |
| 1.7979        | 189.63 | 2160 | 2.5907          |
| 1.7975        | 193.14 | 2200 | 2.5907          |


### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.14.5
- Tokenizers 0.14.1