Training procedure
Trained for 1 epoch on 1,024 rows of Turkish Q&A data. 150 of those rows are synthetic medical Q&A pairs; the rest are general-purpose Q&A.
LoRA attention dimension
lora_r = 16
Alpha parameter for LoRA scaling
lora_alpha = 16
Dropout probability for LoRA layers
lora_dropout = 0.1
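A minimal sketch of how these three values map onto a PEFT `LoraConfig`; the `bias` and `task_type` settings are assumptions, as they are not stated in this card:

```python
from peft import LoraConfig

# Only r, lora_alpha, and lora_dropout come from this card;
# bias and task_type are assumed defaults for causal-LM fine-tuning.
peft_config = LoraConfig(
    r=16,               # LoRA attention dimension
    lora_alpha=16,      # scaling parameter
    lora_dropout=0.1,   # dropout on the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
)
```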
Number of training epochs
num_train_epochs = 1
Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False
Batch size per GPU for training
per_device_train_batch_size = 2
Batch size per GPU for evaluation
per_device_eval_batch_size = 2
Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 4
Enable gradient checkpointing
gradient_checkpointing = True
Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
Optimizer to use
optim = "paged_adamw_32bit"
Learning rate schedule
lr_scheduler_type = "cosine"
Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
Group sequences into batches with the same length
(saves memory and speeds up training considerably)
group_by_length = True
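Put together, the values above correspond to a `transformers.TrainingArguments` object roughly like the sketch below; `output_dir` is a hypothetical path not given in this card, and `logging_steps` is taken from the SFT parameters section that follows:

```python
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./results",          # hypothetical; not stated in this card
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    learning_rate=2e-4,
    weight_decay=0.001,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    group_by_length=True,
    fp16=False,
    bf16=False,
    logging_steps=2,                 # from the SFT parameters below
)
```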
################################################################################
SFT parameters
################################################################################
Maximum sequence length to use
max_seq_length = None
Pack multiple short examples in the same input sequence to increase efficiency
packing = False
Load the entire model on GPU 0
device_map = {"": 0}
Log every X update steps
logging_steps = 2
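These SFT parameters feed into TRL's `SFTTrainer`. A sketch under the assumption that the training examples live in a `"text"` column and that `model` and `tokenizer` have already been loaded (the dataset field name and variable names are not given in this card):

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,                  # the 4-bit base model (see quantization config below)
    train_dataset=dataset,        # the 1,024-row Turkish Q&A dataset
    peft_config=peft_config,
    dataset_text_field="text",    # assumed column name
    max_seq_length=None,          # fall back to the tokenizer/model default
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
trainer.train()
```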
The following bitsandbytes quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: False
- bnb_4bit_compute_dtype: float16
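In `transformers`, that config corresponds to a `BitsAndBytesConfig` like the sketch below; the model name is a placeholder, and `device_map={"": 0}` comes from the SFT parameters above. The `llm_int8_*` fields listed are the library defaults and do not need to be set explicitly:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "base-model-name",                # placeholder; base model not stated in this card
    quantization_config=bnb_config,
    device_map={"": 0},               # load the entire model on GPU 0
)
```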
Framework versions
- PEFT 0.4.0