See axolotl config
axolotl version: 0.3.0
```yaml
base_model: ./yi-34b-200k-llamafied
base_model_config: ./yi-34b-200k-llamafied
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true

bnb_config_kwargs:
  llm_int8_has_fp16_weight: false
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true
  bnb_4bit_compute_dtype: torch.bfloat16

torch_dtype: bf16
strict: false
rl: true

datasets:
  - path: /run/media/.../axolotl/datasets/rawrr_v1/
    split: train
    type: apply_chatml

dataset_prepared_path: last_run_prepared
val_set_size: 0.01
adapter: qlora
lora_model_dir:
sequence_len: 200
sample_packing: false

lora_r: 4
lora_alpha: 8
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_fan_in_fan_out:

wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model:

output_dir: ./qlora-yi-34b-200k-rawrr-2
pad_to_sequence_len: true
micro_batch_size: 1
gradient_accumulation_steps: 16
num_epochs: 1
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.00003
cosine_min_lr_ratio: 0.2
train_on_inputs: false
group_by_length: false

bf16: true
fp16: false
tf32: false
bfloat16: true
flash_optimum: false
gradient_checkpointing: true

early_stopping_patience:
save_safetensors:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
deepspeed:

seed: 42
warmup_steps: 100
eval_steps: 5000000
save_steps: 80
save_total_limit: 10
eval_table_size:
eval_table_max_new_tokens:
debug:
weight_decay:
fsdp:
fsdp_config:

special_tokens:
  bos_token: "<|startoftext|>"
  eos_token: "<|endoftext|>"
  unk_token: "<unk>"
```
qlora-yi-34b-200k-rawrr-2
Yi-34B-200K trained via DPO on the rawrr_v1 dataset. Sequence length is just 200; that's the maximum I could fit with axolotl on an RTX 3090 Ti.
Model description
Looks like DPO worked, even despite the tiny LoRA rank, short sequence length, and low learning rate.
That's great. I was preparing to start fine-tuning in the cloud, but then I saw this pull request and hoped it would let me squeeze a 34B QLoRA DPO run into 24 GB of VRAM. And it did!
Intended uses & limitations
Merge it with the llama-fied Yi-34B-200K base to get a more raw-feeling Yi 34B. In my initial comparison between the exl2 4.65bpw base and my DPO fine-tune, my model has no AALM ("As an AI language model") refusals and feels more like a true base model.
Some instruct training still shines through, though, when you ask it to make an itinerary etc. for you.
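If you want to merge the adapter yourself, a minimal sketch with PEFT follows. The paths are placeholders taken from the config above (the llama-fied base and this adapter's output directory); adjust them to your local layout.

```python
# Sketch: fold this QLoRA adapter into the llama-fied Yi-34B-200K base.
# Paths are assumptions based on the axolotl config above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "./yi-34b-200k-llamafied",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./qlora-yi-34b-200k-rawrr-2")
merged = model.merge_and_unload()  # bake the LoRA weights into the base model
merged.save_pretrained("./yi-34b-200k-rawrr-merged")

tokenizer = AutoTokenizer.from_pretrained("./yi-34b-200k-llamafied")
tokenizer.save_pretrained("./yi-34b-200k-rawrr-merged")
```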
Training procedure
The following bitsandbytes quantization config was used during training:
- quant_method: bitsandbytes
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: bfloat16
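For reference, roughly the same setup expressed as a transformers BitsAndBytesConfig, in case you want to load the base model in 4-bit outside of axolotl; the model path is a placeholder.

```python
# Approximate equivalent of the quantization config above; not the exact
# object axolotl builds internally. Model path is an assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "./yi-34b-200k-llamafied",
    quantization_config=bnb_config,
    device_map="auto",
)
```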
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- training_steps: 517
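The adapter itself used rank 4 with alpha 8 and dropout 0.05 across all linear projections (see the axolotl config above). A minimal PEFT sketch of that LoRA setup follows; task_type and bias are assumptions, since axolotl sets those internally.

```python
# Rough PEFT equivalent of the adapter settings in the axolotl config above.
# bias and task_type are assumptions, not copied from the config.
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```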
Training results
Framework versions
- PEFT 0.7.0
- Transformers 4.37.0.dev0
- Pytorch 2.0.1+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0