
Memory Optimization Strategies

1. 4-Bit Quantization (QLoRA)

Used BitsAndBytesConfig from the bitsandbytes library to load the model in 4-bit precision:

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

This reduces the GPU memory footprint of the full model weights. Storing weights in 4-bit rather than 16-bit cuts memory usage by roughly 4x for those parameters. This approach allows us to hold large language models in limited GPU memory while still keeping them mostly “frozen” for efficient adaptation.
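As a rough sanity check of the savings, here is some back-of-the-envelope arithmetic for the weight storage of a 3B-parameter model (the 3,000,000,000 parameter count is an illustrative round number; quantization constants, activations, and optimizer state are ignored):

```python
# Approximate weight-storage footprint of a 3B-parameter model.
# Illustrative round numbers only; ignores quantization constants,
# activations, optimizer state, and the KV cache.
params = 3_000_000_000

fp16_gib = params * 2 / 1024**3    # 16-bit: 2 bytes per parameter
int4_gib = params * 0.5 / 1024**3  # 4-bit: half a byte per parameter

print(f"fp16 weights: {fp16_gib:.2f} GiB")   # ~5.59 GiB
print(f"4-bit weights: {int4_gib:.2f} GiB")  # ~1.40 GiB
```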

2. LoRA

Used LoRA from the peft library for parameter-efficient finetuning:

from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
lora_model = get_peft_model(base_model, lora_config)

Instead of updating all of the model’s parameters, we use LoRA to inject small low-rank matrices (r=8 in this script) into the attention layers’ query and value projections. Only these adapter weights are trainable, which massively reduces the number of trainable parameters (and thus memory usage for gradients).
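To see the scale of the reduction, here is a quick count of the adapter parameters. The hidden size of 3200 and the 26 layers below are hypothetical illustrative values, not figures read from the actual checkpoint:

```python
# Count trainable LoRA parameters for q_proj and v_proj adapters.
# hidden=3200 and n_layers=26 are hypothetical illustrative values.
def lora_params(hidden, r, n_layers, n_projections):
    # Each adapted projection adds A (hidden x r) and B (r x hidden).
    per_projection = hidden * r + r * hidden
    return per_projection * n_projections * n_layers

trainable = lora_params(hidden=3200, r=8, n_layers=26, n_projections=2)
print(trainable)  # 2662400 -- well under 0.1% of a ~3B-parameter base model
```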

3. Gradient Accumulation

Used gradient accumulation to simulate a very large effective batch size without requiring a correspondingly large per-device batch in memory. By setting:

gradient_accumulation_steps = 2048
per_device_train_batch_size = 7

we process micro-batches sequentially, accumulating their gradients, and perform an optimizer step only after 2,048 micro-batches. This allows us to achieve a high effective batch size without running out of memory on a single GPU.
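The mechanics can be sketched in a few lines of plain Python on a toy one-parameter model (loss = (w*x − y)², so dL/dw = 2(w*x − y)x; the data and learning rate are made up for illustration):

```python
# Toy sketch of gradient accumulation on a 1-parameter linear model.
# Hypothetical data and learning rate, for illustration only.
def grad(w, x, y):
    # Gradient of the squared error (w*x - y)^2 with respect to w.
    return 2 * (w * x - y) * x

def train(data, micro_batches_per_step, lr=0.01):
    w, accum, seen = 1.0, 0.0, 0
    for x, y in data:
        accum += grad(w, x, y)      # accumulate instead of stepping
        seen += 1
        if seen == micro_batches_per_step:
            w -= lr * accum / seen  # one optimizer step per N micro-batches
            accum, seen = 0.0, 0
    return w

print(train([(1.0, 2.0)] * 4, micro_batches_per_step=4))  # 1.02
```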

4. Gradient Checkpointing

Used gradient checkpointing on the model:

base_model.gradient_checkpointing_enable()

With gradient checkpointing, intermediate activations are not stored in memory during the forward pass; they are recomputed on the fly during backpropagation, which significantly reduces GPU memory usage in exchange for extra compute time.
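The tradeoff can be illustrated with a toy pure-Python model (a sketch of the idea only, not the torch.utils.checkpoint API): keep only every fourth activation during the forward pass, and rebuild the others from the nearest kept one when they are needed:

```python
# Toy illustration of the store-vs-recompute tradeoff behind gradient
# checkpointing. Eight "layers" that each add a constant, for illustration.
def forward_with_checkpoints(x, layers, segment):
    """Run all layers, keeping only every `segment`-th activation."""
    saved = {0: x}  # layer index -> activation kept in memory
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        if i % segment == 0:
            saved[i] = x
    return x, saved

def recompute(idx, layers, saved, segment):
    """Recover the activation after layer `idx` from the nearest checkpoint."""
    start = (idx // segment) * segment
    x = saved[start]
    for layer in layers[start:idx]:
        x = layer(x)
    return x

layers = [lambda v, k=k: v + k for k in range(1, 9)]  # 8 toy "layers"
out, saved = forward_with_checkpoints(0, layers, segment=4)
print(len(saved))  # 3 activations stored instead of 9
print(recompute(6, layers, saved, segment=4))  # 21, same as a full forward
```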

5. Mixed Precision (FP16)

In the TrainingArguments, we set:

fp16=True

This instructs PyTorch to perform many operations in half precision. While the quantized weights are already stored in 4-bit, computations on activations and gradients can use FP16, further reducing memory usage and improving throughput.
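Half precision stores each value in 2 bytes at the cost of precision; the standard-library struct module can demonstrate the round-trip (a standalone illustration, unrelated to the training script itself):

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision ('e' format).
    return struct.unpack('e', struct.pack('e', x))[0]

print(struct.calcsize('e'))  # 2 bytes per value, vs 4 for fp32
half = to_fp16(0.1)
print(half)  # 0.1 is not exactly representable; a nearby fp16 value comes back
```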

6. Dynamic Padding to Multiple of 16

Used:

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=16
)

Padding sequences to a multiple of 16 can yield better memory alignment on many GPUs, improving efficiency (and sometimes speed). This ensures that any leftover partial chunks are still batched optimally without forcing all examples to be max-length.
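The rounding itself is simple; pad_to_multiple_of effectively applies the following to each batch's longest sequence length (a sketch, assuming right-padding):

```python
def padded_length(seq_len, multiple=16):
    # Round seq_len up to the next multiple; aligned lengths are unchanged.
    return -(-seq_len // multiple) * multiple  # ceiling division

print(padded_length(100))  # 112
print(padded_length(96))   # 96 (already aligned)
```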

7. Automatic Device Placement

By specifying:

device_map="auto"

in AutoModelForCausalLM.from_pretrained(), the model layers are automatically distributed across the available devices, filling GPU memory first and offloading to CPU if necessary.

8. Monitoring Memory Usage

We implemented a custom MemoryUsageCallback:

class MemoryUsageCallback(TrainerCallback):

This callback periodically logs CPU and GPU memory usage every log_frequency steps. By monitoring memory usage in real-time, we can gauge whether we still have memory to increase the micro-batch size or whether we risk running out of memory.
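The cadence logic looks roughly like this (a minimal pure-Python sketch: the real class subclasses transformers.TrainerCallback, and get_memory_mb is a hypothetical stand-in for the psutil / torch.cuda memory queries):

```python
# Minimal sketch of the every-N-steps logging cadence. The real callback
# subclasses transformers.TrainerCallback; get_memory_mb is a hypothetical
# stand-in for psutil / torch.cuda.memory_allocated() queries.
class MemoryUsageCallback:
    def __init__(self, log_frequency, get_memory_mb):
        self.log_frequency = log_frequency
        self.get_memory_mb = get_memory_mb
        self.logged = []  # (step, MB) pairs

    def on_step_end(self, step):
        if step % self.log_frequency == 0:
            self.logged.append((step, self.get_memory_mb()))

cb = MemoryUsageCallback(log_frequency=5, get_memory_mb=lambda: 1024.0)
for step in range(1, 16):
    cb.on_step_end(step)
print(len(cb.logged))  # 3 entries: steps 5, 10, and 15
```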

Training Performance and Evaluation Results

Training Performance

Batch Size Configuration

  • per_device_train_batch_size = 7
  • gradient_accumulation_steps = 2048

Since we trained on a single GPU, the effective training batch size is computed as:

7 * 2048 = 14,336

This large effective batch size was made possible by:

  • Gradient Accumulation (simulating large batches via multiple micro-steps).
  • 4-bit Quantization + LoRA (reducing memory usage).
  • Mixed Precision + Gradient Checkpointing (further memory optimization).

Evaluation Configuration

  • per_device_eval_batch_size = 10
  • eval_accumulation_steps = 2048

Note that eval_accumulation_steps does not multiply the batch size the way gradient accumulation does during training; in the Trainer it controls how many prediction steps are accumulated on the GPU before the output tensors are moved to the CPU. Evaluation memory usage is in any case much lower than training, because no gradients or backward pass are needed.


Evaluation Results

  1. Initial (Baseline) Evaluation
     • eval_loss: 2.1278
     • Perplexity: ~8.3963

This represents the perplexity on the 10% evaluation set before fine-tuning, confirming that even the base model has low perplexity on this dataset.

  2. Final Evaluation (After Fine-Tuning)
     • eval_loss: 2.1261
     • Perplexity: ~8.0820

After training for up to 3 epochs (the run effectively completed in about 1 epoch, since early stopping was not fully removed), the perplexity dropped slightly from ~8.40 to ~8.08, indicating a modest improvement on the evaluation data.

The memory optimization strategies allowed us to handle an effective batch size of 14,336 examples per update step on a single GPU, demonstrating the combined effect of 4-bit quantization, LoRA, gradient checkpointing, and gradient accumulation.

Guide to Train and Evaluate

Both training and evaluation are included in the train.py script at /prog-assignment-1/climate-optimized-Llama-3B/train.py. The script first evaluates baseline perplexity, then fine-tunes the model and evaluates the perplexity of the fine-tuned model, writing outputs to the llama-finetune-model directory.
