
Built with Axolotl

See axolotl config

axolotl version: 0.3.0

base_model: ./yi-34b-200k-llamafied
base_model_config: ./yi-34b-200k-llamafied
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true

bnb_config_kwargs:
    llm_int8_has_fp16_weight: false
    bnb_4bit_quant_type: nf4
    bnb_4bit_use_double_quant: true


bnb_4bit_compute_dtype: torch.bfloat16
torch_dtype: bf16
strict: false

rl: true
datasets:
  - path: /run/media/.../axolotl/datasets/rawrr_v1/
    split: train
    type: apply_chatml
    
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
adapter: qlora
lora_model_dir:
sequence_len: 200
sample_packing: false
lora_r: 4
lora_alpha: 8
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_fan_in_fan_out:
wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: ./qlora-yi-34b-200k-rawrr-2
pad_to_sequence_len: true
micro_batch_size: 1
gradient_accumulation_steps: 16
num_epochs: 1
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.00003
cosine_min_lr_ratio: 0.2
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
bfloat16: true
flash_optimum: false
gradient_checkpointing: true
early_stopping_patience:
save_safetensors:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
deepspeed: 
seed: 42
warmup_steps: 100
eval_steps: 5000000
save_steps: 80
save_total_limit: 10
eval_table_size: 
eval_table_max_new_tokens:
debug:
weight_decay:
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<|startoftext|>"
  eos_token: "<|endoftext|>"
  unk_token: "<unk>"

qlora-yi-34b-200k-rawrr-2

Yi-34B-200K trained via DPO on the rawrr_v1 dataset. Sequence length of just 200; that's the max I could fit with axolotl on an RTX 3090 Ti.
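
A minimal inference sketch with transformers + PEFT, assuming you load the llamafied base in 4-bit (mirroring the training setup) and apply this adapter on top. The local paths, the prompt, and the generation settings are placeholders, not files shipped with this repo.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_path = "./yi-34b-200k-llamafied"         # placeholder: llamafied Yi-34B-200K base
adapter_path = "./qlora-yi-34b-200k-rawrr-2"  # placeholder: this adapter (output_dir from the config)

# Load the base in 4-bit NF-ish quantization so a single 24GB card can hold it.
model = AutoModelForCausalLM.from_pretrained(
    base_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_path)  # apply the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_path)

inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")  # single-GPU assumption
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))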

Model description

Looks like DPO worked, even with a tiny rank, sequence length, and learning rate.
That's great. I was preparing to start fine-tuning in the cloud, but then I saw this pull request and hoped it would let me squeeze a 34B QLoRA DPO run into 24GB of VRAM. And it did!

Intended uses & limitations

Merge it with the llamafied Yi-34B-200K base to get a more raw-feeling Yi 34B. In my initial comparison between the exl2 4.65bpw base and my DPO fine-tune, my model has no "As an AI language model" (AALM) refusals and feels more like a true base model.
Some instruct training still shines through, though, when you ask it to make an itinerary etc. for you.
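
A minimal merging sketch, assuming local copies of the llamafied base and this adapter; the paths and the merged output directory are placeholders. Merging folds the LoRA deltas into the full bf16 weights, so roughly 70 GB of RAM is needed if you merge on CPU.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "./yi-34b-200k-llamafied"         # placeholder: llamafied Yi-34B-200K base
adapter_path = "./qlora-yi-34b-200k-rawrr-2"  # placeholder: this adapter

# Load the base unquantized in bf16 on CPU, apply the adapter, then merge it in.
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16, device_map="cpu")
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()  # returns a plain model with the LoRA weights folded in

merged.save_pretrained("./yi-34b-200k-rawrr-merged")
AutoTokenizer.from_pretrained(base_path).save_pretrained("./yi-34b-200k-rawrr-merged")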

Training procedure

The following bitsandbytes quantization config was used during training:

  • quant_method: bitsandbytes
  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16
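
For reference, a sketch of how the same settings would be expressed as a transformers BitsAndBytesConfig; this is an illustration of the values listed above, not code taken from the training run.

import torch
from transformers import BitsAndBytesConfig

# Mirrors the quantization config above, field for field.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None,
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)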

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 100
  • training_steps: 517
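
(The effective batch size follows from the two values above: train_batch_size 1 × gradient_accumulation_steps 16 = total_train_batch_size 16.)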

Training results

Framework versions

  • PEFT 0.7.0
  • Transformers 4.37.0.dev0
  • Pytorch 2.0.1+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0