Seq2Seq Parameters
--batch-size BATCH_SIZE
Training batch size to use
--seed SEED Random seed for reproducibility
--epochs EPOCHS Number of training epochs
--gradient_accumulation GRADIENT_ACCUMULATION
Gradient accumulation steps
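For reference, gradient accumulation sums gradients over several small batches before each optimizer step, so the effective batch size is batch_size × gradient_accumulation (e.g. 8 × 4 = 32). A minimal PyTorch sketch of the idea (illustrative only, not AutoTrain's internal training loop):

```python
import torch

# Stand-in model and data just to show the accumulation pattern.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

batch_size, accumulation_steps = 8, 4  # effective batch size = 32
data = [(torch.randn(batch_size, 16), torch.randint(0, 2, (batch_size,))) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data, start=1):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so accumulated gradients average out
    loss.backward()
    if step % accumulation_steps == 0:  # update weights every `accumulation_steps` micro-batches
        optimizer.step()
        optimizer.zero_grad()
```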
--disable_gradient_checkpointing
Disable gradient checkpointing
--lr LR Learning rate
--log {none,wandb,tensorboard}
Select an experiment tracking backend ('wandb' or 'tensorboard'); 'none' disables tracking.
--text-column TEXT_COLUMN
Specify the column name in the dataset that contains the text data. Useful for distinguishing between multiple text fields.
Default is 'text'.
--target-column TARGET_COLUMN
Specify the column name that holds the target data for training. Helps in distinguishing different potential outputs.
Default is 'target'.
--max-seq-length MAX_SEQ_LENGTH
Set the maximum sequence length (number of tokens) that the model should handle in a single input. Longer sequences are
truncated. Affects both memory usage and computational requirements. Default is 128 tokens.
--max-target-length MAX_TARGET_LENGTH
Define the maximum number of tokens for the target sequence of each example. Useful for models that generate outputs, ensuring
uniformity in sequence length. Default is 128 tokens.
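As an illustration of how these two limits are typically applied when preparing seq2seq data with a Hugging Face tokenizer (a general sketch, not AutoTrain's internal preprocessing; the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # example seq2seq checkpoint

inputs = tokenizer(
    "summarize: The quick brown fox jumps over the lazy dog.",
    max_length=128,   # --max-seq-length: source tokens beyond this are truncated
    truncation=True,
)
labels = tokenizer(
    text_target="A fox jumps over a dog.",
    max_length=128,   # --max-target-length: target tokens beyond this are truncated
    truncation=True,
)
```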
--warmup-ratio WARMUP_RATIO
Define the proportion of training dedicated to a linear warmup, during which the learning rate gradually increases. This can help
in stabilizing the training process early on. Default ratio is 0.1.
--optimizer OPTIMIZER
Choose the optimizer algorithm for training the model. Different optimizers can affect the training speed and model
performance. 'adamw_torch' is used by default.
--scheduler SCHEDULER
Select the learning rate scheduler to adjust the learning rate over the course of training. 'linear' decreases the
learning rate linearly from the initial learning rate. Default is 'linear'. Try 'cosine' for a cosine annealing schedule.
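To make the warmup ratio and scheduler concrete: with 1,000 optimizer steps and a ratio of 0.1, the first 100 steps ramp the learning rate up linearly before the scheduler decays it. A rough equivalent using the transformers helpers (illustrative, not AutoTrain's code):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(16, 2)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

num_training_steps = 1000
warmup_ratio = 0.1  # --warmup-ratio
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_ratio * num_training_steps),  # 100 warmup steps
    num_training_steps=num_training_steps,
)
# For --scheduler cosine, transformers provides get_cosine_schedule_with_warmup instead.
```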
--weight-decay WEIGHT_DECAY
Set the weight decay rate to apply for regularization. Helps in preventing the model from overfitting by penalizing large
weights. Default is 0.0, meaning no weight decay is applied.
--max-grad-norm MAX_GRAD_NORM
Specify the maximum norm of the gradients for gradient clipping. Gradient clipping is used to prevent the exploding gradient
problem in deep neural networks. Default is 1.0.
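Gradient clipping rescales the combined gradient vector whenever its global norm exceeds the threshold; in plain PyTorch it corresponds to something like this (illustrative):

```python
import torch

model = torch.nn.Linear(16, 2)
loss = model(torch.randn(4, 16)).sum()
loss.backward()

# Rescale gradients in place if their global L2 norm exceeds 1.0 (--max-grad-norm).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```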
--logging-steps LOGGING_STEPS
Determine how often to log training progress. Set this to the number of steps between each log output. A value of -1 chooses the
logging interval automatically. Default is -1.
--evaluation-strategy EVALUATION_STRATEGY
Specify how often to evaluate model performance. Options are 'no', 'steps', and 'epoch'. The default, 'epoch', evaluates at the end of
each training epoch.
--save-total-limit SAVE_TOTAL_LIMIT
Limit the total number of model checkpoints to save. Helps manage disk space by retaining only the most recent checkpoints.
Default is to save only the latest one.
--auto-find-batch-size
Enable automatic batch size determination based on your hardware capabilities. When set, it tries to find the largest batch
size that fits in memory.
--mixed-precision {fp16,bf16,None}
Choose the precision mode for training to optimize performance and memory usage. Options are 'fp16', 'bf16', or None for
default precision. Default is None.
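Conceptually, mixed precision runs the forward pass in fp16 or bf16 while keeping master weights in fp32. A bare-bones PyTorch autocast sketch, assuming a CUDA device (not AutoTrain's training loop):

```python
import torch

model = torch.nn.Linear(16, 2).cuda()
x = torch.randn(8, 16, device="cuda")

# bf16 autocast corresponds to --mixed-precision bf16; use torch.float16 for fp16
# (fp16 training typically also pairs this with a torch.cuda.amp.GradScaler).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
```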
--peft Enable parameter-efficient fine-tuning (PEFT) with LoRA
--quantization {int8,None}
Select the quantization mode to reduce model size and potentially increase inference speed. Options include 'int8' for 8-bit
integer quantization or None for no quantization. Default is None.
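For context, 8-bit loading in the Hugging Face stack is usually configured through bitsandbytes; the following is a general sketch rather than AutoTrain's exact code path (the checkpoint name is an example, and a GPU plus the bitsandbytes package are assumed):

```python
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Corresponds to --quantization int8.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "t5-small",  # example checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```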
--lora-r LORA_R Set the rank 'R' for the LoRA (Low-Rank Adaptation) technique. Default is 16.
--lora-alpha LORA_ALPHA
Specify the 'Alpha' parameter for LoRA. Default is 32.
--lora-dropout LORA_DROPOUT
Determine the dropout rate to apply in the LoRA layers, which can help in preventing overfitting by randomly disabling a
fraction of neurons during training. Default rate is 0.05.
--target-modules TARGET_MODULES
List the modules within the model architecture that LoRA adaptation should target.
Useful for fine-tuning particular components of large models. By default, all linear layers are targeted.
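Putting the LoRA options together, they map onto a peft LoraConfig roughly as follows. This is a sketch using the defaults listed above; the checkpoint name and the target_modules list are examples only, since by default all linear layers are targeted:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # example checkpoint

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # --lora-r
    lora_alpha=32,              # --lora-alpha
    lora_dropout=0.05,          # --lora-dropout
    target_modules=["q", "v"],  # --target-modules (example; default targets all linear layers)
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```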