Model Card for TinyLlama-DPO-Orca

This model is the result of a two-stage alignment pipeline applied to TinyLlama-1.1B: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to align outputs with preference data.

Model Details

Model Description

  • Developed by: Hadeeqa Al Islam
  • Model type: Causal Language Model
  • Language(s) (NLP): English
  • Finetuned from model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Training data: argilla/distilabel-intel-orca-dpo-pairs (5,922 filtered samples)
  • LoRA config: r=16, lora_alpha=16, target_modules=[q_proj, k_proj, v_proj, o_proj]
  • Training: lr=1e-07, batch=4 (effective batch size 16), epochs=2, bf16=True, beta=0.1

Uses

Direct Use

This model is intended for text generation and conversational question answering in the style of the Orca preference dataset.

Out-of-Scope Use

Due to known issues with token collapse, this model is not suitable for production deployment or long-form reliable generation without further prompt engineering or tokenizer alignment.

Bias, Risks, and Limitations

Known Issue: Token Collapse

During the SFT phase, sequence packing was implemented using add_special_tokens=False to strictly prevent cross-contamination across VRAM blocks. While this optimized memory and isolated sequences, it caused a severe distribution shift away from TinyLlama's pre-trained chat template.

Consequently, during standard inference with the tokenizer's chat template applied, the DPO model experiences catastrophic forgetting and token collapse (frequently outputting loops of |> or < < <). The extremely low BLEU score reflects this formatting mismatch rather than an inability to learn the underlying linguistic representations.
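A downstream user may want to screen generations for the looping output described above before using them. The following is a minimal heuristic sketch, not part of the original pipeline: the helper names `repetition_ratio` and `looks_collapsed`, the character bigram approach, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def repetition_ratio(text, n=2, min_chars=12):
    """Share of character n-grams taken up by the single most common one."""
    # Drop whitespace so "< < <" and "<<<" are treated the same way.
    chars = "".join(text.split())
    if len(chars) < min_chars:
        return 0.0  # too short to judge reliably
    ngrams = [chars[i:i + n] for i in range(len(chars) - n + 1)]
    top_count = Counter(ngrams).most_common(1)[0][1]
    return top_count / len(ngrams)


def looks_collapsed(text, threshold=0.3):
    """Heuristic flag: True when one n-gram dominates the output."""
    return repetition_ratio(text) >= threshold
```

Ordinary English sentences score far below the threshold, while loops of `|>` or `<` score above 0.5. Very short outputs are never flagged, so this check complements rather than replaces manual inspection.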

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Future iterations should harmonize the sequence packing tokenization with the base model's inherent chat structure.
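One way to read the recommendation on harmonizing tokenization: render every SFT example through the tokenizer's own chat template before packing, so that training text matches what the template produces at inference time. This is a sketch of the idea, not the pipeline used for this model; the helper names `render_training_text` and `encode_for_packing` are hypothetical.

```python
def render_training_text(tokenizer, prompt, response):
    """Render one prompt/response pair with the tokenizer's chat template.

    `tokenizer` is expected to be a Hugging Face tokenizer that defines a
    chat template (for example the TinyLlama chat tokenizer).
    """
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    # tokenize=False returns the formatted string, role markers included.
    return tokenizer.apply_chat_template(messages, tokenize=False)


def encode_for_packing(tokenizer, text):
    """Tokenize already-templated text for a packed training sequence."""
    # The rendered text already carries the template's role and end-of-turn
    # markers; add_special_tokens=False keeps the tokenizer from adding
    # further special tokens on top of them.
    return tokenizer(text, add_special_tokens=False)["input_ids"]
```

With this arrangement `add_special_tokens=False` can stay in place for packing, because the chat structure comes from the template rather than from the tokenizer's automatic special tokens.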

How to Get Started with the Model

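from transformers import AutoTokenizer

model_id = "Haldi247/TinyLlama-DPO-Orca"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "What is the capital of France?"  # placeholder prompt
messages = [{"role": "user", "content": prompt}]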
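For completeness, a fuller generation sketch is given here. It assumes the repository holds weights that `AutoModelForCausalLM` can load directly, which the card does not state; `build_messages`, `generate_reply`, and the `max_new_tokens` default are illustrative choices. Given the token collapse described under Bias, Risks, and Limitations, the output may be degenerate.

```python
def build_messages(prompt):
    """Wrap a user prompt in the chat message format."""
    return [{"role": "user", "content": prompt}]


def generate_reply(prompt, model_id="Haldi247/TinyLlama-DPO-Orca",
                   max_new_tokens=128):
    """Generate one greedy reply; needs torch and transformers installed."""
    # Imported lazily so build_messages works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Assumption: the repo contains full (merged) weights, not only an adapter.
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    # Apply the tokenizer's chat template and tokenize in one call.
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("What is the capital of France?")` downloads the model on first use. Greedy decoding (`do_sample=False`) is used so that repeated runs are comparable.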

Training Details

Training Data

The DPO phase utilized the argilla/distilabel-intel-orca-dpo-pairs dataset. To ensure high-quality preference alignment, the data was rigorously filtered:

  • Removed tied responses (status != tie).
  • Required a high chosen score (chosen_score >= 8).
  • Excluded GSM8K training data to prevent contamination (not in_gsm8k_train).

Final Dataset Size: 5,922 preference pairs.
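The three filters above can be expressed as a single predicate. This is a minimal sketch assuming the column names `status`, `chosen_score`, and `in_gsm8k_train` quoted in the list, and that the data lives in a `train` split; the helper names are hypothetical.

```python
def keep_pair(row):
    """True when a preference pair passes all three filters."""
    return (
        row["status"] != "tie"           # drop tied responses
        and row["chosen_score"] >= 8     # require a high chosen score
        and not row["in_gsm8k_train"]    # avoid GSM8K contamination
    )


def load_filtered_pairs(name="argilla/distilabel-intel-orca-dpo-pairs"):
    """Load the dataset and apply the filter; needs `datasets` installed."""
    from datasets import load_dataset

    dataset = load_dataset(name, split="train")  # split name is an assumption
    return dataset.filter(keep_pair)
```

Keeping the predicate separate from the loading step makes it easy to check against a few hand-written rows before running it on the full dataset.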

Training Procedure

The model was trained using the trl and peft libraries for Parameter-Efficient Fine-Tuning (PEFT) via LoRA.

Training Hyperparameters

  • Training regime: bf16 mixed precision (bf16=True; bf16 keeps the fp32 exponent range, which avoids the gradient underflow fp16 is prone to)
  • LoRA Rank (r): 16
  • LoRA Alpha: 16
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Beta (DPO Temperature): 0.1
  • Learning Rate: 1e-07
  • Batch Size: 4 (with gradient_accumulation_steps=4, resulting in an effective batch size of 16)
  • Epochs: 2
  • Max Length: 512
  • Attention Implementation: Scaled Dot-Product Attention (sdpa)
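The hyperparameters above could be written as `peft` and `trl` configuration objects roughly as follows. This is a sketch under assumptions, not the author's training script: the `output_dir` value and `task_type` are illustrative, a single GPU is assumed for the batch arithmetic, and argument names can vary between `trl` versions.

```python
# Values copied from the hyperparameter list.
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
NUM_PAIRS = 5922  # filtered preference pairs
LORA_R = 16
LORA_ALPHA = 16

# Each optimizer step sees batch * accumulation examples (single GPU assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Rough optimizer steps per epoch; the trainer's rounding may differ by one.
steps_per_epoch = NUM_PAIRS / effective_batch
# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = LORA_ALPHA / LORA_R


def build_configs(output_dir="tinyllama-dpo-orca"):
    """Return (LoraConfig, DPOConfig); needs peft and trl installed."""
    from peft import LoraConfig
    from trl import DPOConfig

    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",  # assumption: not stated in the card
    )
    dpo_config = DPOConfig(
        output_dir=output_dir,  # hypothetical path
        beta=0.1,
        learning_rate=1e-7,
        per_device_train_batch_size=PER_DEVICE_BATCH,
        gradient_accumulation_steps=GRAD_ACCUM_STEPS,
        num_train_epochs=2,
        max_length=512,
        bf16=True,  # requires hardware with bf16 support
    )
    return lora_config, dpo_config
```

With these values the effective batch size is 16, the LoRA scaling factor is 1.0, and one epoch is about 370 optimizer steps. The attention implementation is not part of either config; it would typically be passed as `attn_implementation="sdpa"` when loading the model with `from_pretrained`.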

Speeds, Sizes, Times

  • Training Time: 30.1 minutes

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated against a custom set of 10 QA prompts using reference answers.

Metrics

Generation quality was measured with a lexical-overlap metric (BLEU) and a semantic-similarity metric (BERTScore); the final training loss is reported alongside them:

  • BLEU
  • BERTScore
  • Training Loss

Results

  • Average BLEU: 0.0282
  • Average BERTScore: 0.7870
  • Final Train Loss: 0.6928
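For context on the final train loss: assuming the standard sigmoid DPO loss was used (the card does not say), the per-pair loss is -log(sigmoid(beta * margin)), where the margin is the difference between the policy-versus-reference log-ratios of the chosen and rejected responses. At a margin of zero, which is where training starts, this equals ln 2. The sketch below compares that baseline with the reported value; the function name `dpo_loss` is illustrative.

```python
import math


def dpo_loss(beta, margin):
    """Sigmoid DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


BETA = 0.1
REPORTED_LOSS = 0.6928  # final train loss from the results list

# Loss when the policy and the reference model agree exactly (margin 0).
baseline = dpo_loss(BETA, 0.0)
gap = baseline - REPORTED_LOSS

print(f"baseline (ln 2): {baseline:.4f}")
print(f"gap to reported loss: {gap:.4f}")
```

The reported loss lies within about 0.0004 of the ln 2 baseline, which suggests the policy's preference margins moved very little from the reference model. That would be consistent with the small learning rate of 1e-07, though the card does not report reward margins or accuracies to confirm it.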

Technical Specifications

Compute Infrastructure

Hardware

  • Hardware Type: NVIDIA GeForce RTX 5070 Ti (16GB VRAM)
  • Compute Region: Local deployment via WSL2


Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA GeForce RTX 5070 Ti (16GB VRAM)
  • Hours used: approximately 0.5 (30.1 minutes of reported training time)
  • Cloud Provider: None (local machine)
  • Compute Region: Local deployment via WSL2
  • Carbon Emitted: [More Information Needed]

