This is incredible
It's truly unbelievable how training on such a small and diverse dataset could give such a good model. I think this deserves a deeper look at why this dataset mixture surpassed hundreds of other finetunes.
I am working on reproducing this model and then running some ablation experiments. Is it possible for you to share the axolotl config or more details about the training? Also, did you start from base Mistral or some other finetune?
Thanks for your interest! I'm happy to share how I produced it - I'd love to get to the bottom of what made it work so well.
Here's the axolotl config I used:
base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
  - path: chargoddard/PIPPA-Judged
    name: adequately_rated
    type: pippa
  - path: chargoddard/rpguild
    name: pruned
    type: rp_forum
    shards: 20
  - path: pankajmathur/orca_mini_v1_dataset
    type: orca_mini
    shards: 10
  - path: chargoddard/summarize_from_feedback_alpaca
    type: alpaca
    shards: 20
  - path: json
    data_files: /workspace/limaerp-8192.jsonl
    type: rp_forum
    prompt_format: rpinstruct
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./mistral-rp-out
save_safetensors: true
adapter: lora
lora_model_dir:
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
total_num_tokens: 30637024
sample_packing_eff_est: 0.98
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
wandb_project: mistral-rp
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 4
eval_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
logging_steps: 1
flash_attention: true
warmup_steps: 10
eval_steps: 0.05
save_steps: 0.05
weight_decay: 0.0
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
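For a sense of scale, it really is a tiny run; a quick back-of-the-envelope from the numbers in the config (just arithmetic, nothing model-specific):

```python
# Rough size of the run, using values straight from the config above.
total_tokens = 30_637_024   # total_num_tokens
seq_len = 8192              # sequence_len
packing_eff = 0.98          # sample_packing_eff_est

packed_sequences = total_tokens / (seq_len * packing_eff)
print(f"~{packed_sequences:,.0f} packed sequences in the single epoch")  # ~3,816
```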
It does look like I goofed a bit on the dataset split table - the proportion of summarize_from_feedback used is even lower than listed.
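The shards: values are what shrink those proportions - each one keeps only a single shard of the dataset, so the effect is roughly the following (an illustrative sketch with a dummy dataset, not axolotl's actual loading code):

```python
from datasets import Dataset

# `shards: 20` amounts to keeping one of 20 shards, i.e. roughly 5% of the rows.
ds = Dataset.from_dict({"text": [f"example {i}" for i in range(1000)]})
subset = ds.shard(num_shards=20, index=0)
print(len(ds), len(subset))  # 1000 50
```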
The limaerp-8192.jsonl mentioned is just a lightly preprocessed version of lemonilia's LimaRP dataset. I would upload it to Hugging Face but it's, uh, way too spicy for my tastes. You can download it here: https://files.catbox.moe/jj9srp.jsonl
I used my fork of axolotl with custom prompt handling. Specifically, this commit was used to train the model. The way train_on_inputs, labels, and EOS tokens are handled is different, so it won't reproduce exactly on mainline axolotl. I could probably throw a pre-tokenized version of the dataset up if that's useful, though.
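For context, the standard behavior behind train_on_inputs: false is masking prompt tokens out of the loss with -100 labels, and the fork differs in exactly those details (label boundaries and EOS placement). A rough sketch of the general pattern - illustrative only, not the fork's code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def build_example(prompt: str, response: str) -> dict:
    """Tokenize a (prompt, response) pair with the prompt masked out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = (
        [tokenizer.bos_token_id]
        + prompt_ids
        + response_ids
        + [tokenizer.eos_token_id]  # whether/where EOS lands is one of the details that differs
    )
    # train_on_inputs: false -> BOS and prompt positions get label -100, so only
    # the response tokens (and EOS) contribute to the loss.
    labels = [-100] * (1 + len(prompt_ids)) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```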
Thanks. I will post the results of my ablation experiments here once I run them. Also, I am using the feature/rp branch of https://github.com/cg123/rathe/. Is that correct?
I am not able to reproduce the results. I checked out the commit ID you mentioned, installed rathe from the given branch, and used the same axolotl config. The evaluation code is the same for both your model and the reproduced one.
Your model benchmarks:
ARC: 66.72
TruthfulQA: 59.86
Winogrande: 79.16
Reproduced model benchmarks:
ARC: 61.00
TruthfulQA: 43.00
Winogrande: 78.53
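For anyone else wanting to check, these correspond to the usual Open LLM Leaderboard settings (ARC-Challenge 25-shot, TruthfulQA mc2 0-shot, Winogrande 5-shot). Roughly this kind of lm-evaluation-harness call reproduces that style of evaluation - illustrative only, with a placeholder model path, and task names depend on the harness version:

```python
import lm_eval  # EleutherAI lm-evaluation-harness (0.4.x API)

# Leaderboard-style few-shot settings; "path/to/merged-model" is a placeholder.
TASKS = [("arc_challenge", 25), ("truthfulqa_mc2", 0), ("winogrande", 5)]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=path/to/merged-model,dtype=bfloat16",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, results["results"][task])
```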