---
base_model:
- werty1248/Mistral-Nemo-NT-Ko-12B-sft
datasets:
- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
- sionic/ko-dpo-mix-7k-trl-style
- kuotient/orca-math-korean-dpo-pairs
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
- ko
- ja
- zh
license: apache-2.0
---
# Mistral-Nemo-NT-Ko-12B-dpo-test

## Description
Mistral-Nemo-NT-Ko-12B-dpo-test is a lightly DPO-trained version of werty1248/Mistral-Nemo-NT-Ko-12B-sft.

According to the Hermes 3 technical report, DPO brought only negligible performance improvements to their model. I therefore followed the same lightweight approach described in the report and applied DPO with LoRA (a minimal sketch follows the hyperparameter list below).
- LoRA r = 32
- LoRA alpha = 16
- lr = 3e-6
- NEFTune alpha = 5
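For reference, here is a minimal sketch of an equivalent setup in `trl` + `peft`. The actual run used axolotl (full config under Training Details); the dataset path comes from that config, and argument names vary across `trl` versions (e.g. `processing_class` was `tokenizer` in older releases).

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "werty1248/Mistral-Nemo-NT-Ko-12B-sft"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(
    r=32,                         # LoRA r = 32
    lora_alpha=16,                # LoRA alpha = 16
    lora_dropout=0.05,
    target_modules="all-linear",  # mirrors lora_target_linear: true
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="nt-ko-dpo-test",
    beta=0.1,                     # dpo_beta from the axolotl config
    learning_rate=3e-6,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    neftune_noise_alpha=5,
    bf16=True,
)

# prompt/chosen/rejected preference pairs
train_dataset = load_dataset("werty1248/NT-dpo", split="train")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # the reference model is derived by disabling the adapter
)
trainer.train()
```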
The datasets used are as follows:
- (En) HuggingFaceH4/ultrafeedback_binarized
- (Ko, translated from En) sionic/ko-dpo-mix-7k-translation-exclude
- (Ko, translated from En) kuotient/orca-math-korean-dpo-pairs
- (Zh) zake7749/kyara-chinese-preference-rl-dpo-s0-30K
I've been looking for native Korean and Japanese DPO datasets, but haven't found any that satisfy me in terms of quantity and quality.
From each dataset, I sampled a subset based on the scores assigned by a reward model. In the end, I used about 13K training samples per language.
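The reward model and the exact threshold are not specified here, so the following is only a hypothetical sketch of this kind of score-based subsampling, using the `score_chosen`/`score_rejected` columns that ship with `ultrafeedback_binarized` (the margin and sample budget are illustrative):

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

MARGIN = 2.0  # assumed: keep pairs where chosen clearly beats rejected
filtered = ds.filter(lambda ex: ex["score_chosen"] - ex["score_rejected"] >= MARGIN)

# Downsample to roughly the per-language budget described above.
subset = filtered.shuffle(seed=42).select(range(min(13_000, len(filtered))))
```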
## Features
The base model supports a context length of 128K, but I fine-tuned this model with an 8K context size.

This model works well in multi-turn conversations and tends to strongly reflect the preceding turns.
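A minimal multi-turn generation sketch with `transformers` (the sampling settings are illustrative, not tuned recommendations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "werty1248/Mistral-Nemo-NT-Ko-12B-dpo-test"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Keep the running conversation in `messages`; earlier turns strongly
# influence later replies.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
reply = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# Append the reply (and your next user turn) to continue the conversation.
messages.append({"role": "assistant", "content": reply})
```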
## Evaluation

### LogicKor

#### CoT-1-shot
| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-sft | cot-1-shot | 7.36 | 6.57 | 8.71 | 8.57 | 9.57 | 6.43 | 7.81 | 7.93 | 7.87 |
| Mistral-Nemo-NT-Ko-12B-dpo-test | cot-1-shot | 6.79 | 6.43 | 9.43 | 9.79 | 9.43 | 5.29 | 7.71 | 8.00 | 7.86 |
| Mistral Nemo | cot-1-shot | 5.43 | 6.86 | 6.07 | 7.57 | 5.86 | 7.57 | 7.50 | 5.62 | 6.56 |
#### 1-shot
| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-dpo-test | 1-shot | 8.14 | 5.50 | 9.36 | 8.57 | 9.50 | 4.71 | 7.38 | 7.88 | 7.63 |
| Mistral-Nemo-NT-Ko-12B-sft | 1-shot | 9.00 | 5.71 | 7.93 | 8.29 | 7.93 | 5.21 | 7.29 | 7.40 | 7.35 |
| Mistral Nemo | 1-shot | 5.00 | 6.50 | 6.86 | 8.07 | 7.64 | 8.43 | 7.60 | 6.57 | 7.08 |
#### Default
| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-dpo-test | default | 6.21 | 5.79 | 8.00 | 8.36 | 9.43 | 5.43 | 7.17 | 7.24 | 7.20 |
| Mistral-Nemo-NT-Ko-12B-sft | default | 6.00 | 4.93 | 5.43 | 7.14 | 9.71 | 4.00 | 6.45 | 5.95 | 6.20 |
| Mistral Nemo | default | 0.43 | 7.64 | 6.21 | 7.14 | 6.79 | 7.21 | 6.26 | 5.55 | 5.90 |
### Language Confusion
| Model | Language | Monolingual-LPR | Monolingual-WPR | Crosslingual-LPR | Crosslingual-WPR |
|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-dpo-test | ko | 100.00% | 97.96% | 85.63% | 96.93% |
| Mistral-Nemo-NT-Ko-12B-sft | ko | 100.00% | 99.00% | 87.51% | 96.96% |
| Mistral-Nemo-Instruct-2407 | ko | 90.72% | 93.18% | 46.75% | 92.84% |
| Meta-Llama-3.1-8B-Instruct | ko | 99.00% | 96.97% | 91.45% | 93.01% |
| gemma-2-9b-it | ko | 100.00% | 98.00% | 87.93% | 95.58% |
| Mistral-Nemo-NT-Ko-12B-dpo-test | zh | 99.00% | 99.50% | 80.52% | 97.51% |
| Mistral-Nemo-Instruct-2407 | zh | 97.50% | 98.98% | 53.43% | 93.58% |
| Mistral-Nemo-NT-Ko-12B-dpo-test | ja | 100.00% | 100.00% | 86.89% | 95.41% |
| Mistral-Nemo-Instruct-2407 | ja | 94.00% | 98.94% | 50.27% | 96.05% |
## Template
```
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
I trained Mistral-Nemo-NT-Ko-12B with a wide variety of system prompts drawn from dozens of datasets, so you can chat with or without a system prompt of your own.
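Assuming the tokenizer ships this ChatML template as its chat template, you can render it with `apply_chat_template`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("werty1248/Mistral-Nemo-NT-Ko-12B-dpo-test")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "{prompt}"},
]
# tokenize=False returns the raw string; add_generation_prompt=True
# appends the opening assistant tag, matching the template above.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```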
## Dataset
- zake7749/kyara-chinese-preference-rl-dpo-s0-30K
- sionic/ko-dpo-mix-7k-trl-style
- kuotient/orca-math-korean-dpo-pairs
- HuggingFaceH4/ultrafeedback_binarized
## Training Details
- GPU: 2xA100
- epoch: 1
- total batch size: 32 (2 GPUs × micro batch size 1 × 16 gradient accumulation steps)
- learning rate: 3e-6
- neftune_noise_alpha: 5
<details><summary>See axolotl config</summary>

axolotl version: `0.4.1`

```yaml
base_model: werty1248/Mistral-Nemo-NT-Ko-12B-sft
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

dpo_beta: 0.1
rl: dpo
datasets:
  - path: werty1248/NT-dpo
    split: train
    type: chatml.prompt_pairs

dataset_prepared_path: /workspace/data/prepared_datasets
output_dir: /workspace/data
save_steps: 500

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: rmsprop
weight_decay: 0.0
learning_rate: 0.000003
lr_scheduler: linear
neftune_noise_alpha: 5

train_on_inputs: false
group_by_length: false

#wandb_project:
#wandb_entity:
#wandb_watch:
#wandb_name:
#wandb_log_model:

bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
flash_attention: true

warmup_steps: 9
eval_steps:
val_set_size: 0
early_stopping_patience:
logging_steps: 1

special_tokens:
  pad_token: <pad>
```

</details>
- Training loss