Pharma TinyLlama DPO LoRA Adapter

This repository contains a LoRA adapter trained using Direct Preference Optimization (DPO) on pharma-domain preference data.

This adapter was trained on top of the Stage 2 merged instruction-tuned model:

Base model for this stage: ssuvetha/pharma-tinyllama-instruction-merged

This is the final stage of a three-stage pipeline:

Stage 1: non-instruction domain adaptation
Stage 2: instruction fine-tuning
Stage 3: preference tuning with DPO

Model Type

Stage: 3
Training type: Preference tuning
Method: DPO (Direct Preference Optimization)
Adapter type: LoRA
Task: Preference-aligned pharma instruction response generation

What this stage adds

Stage 1 taught domain language.
Stage 2 taught instruction following.
Stage 3 teaches the model to prefer better responses over weaker ones.

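Each record in the DPO dataset has three fields:

prompt: the instruction or question
chosen: the preferred response
rejected: the weaker response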

This encourages the model to generate answers closer to the preferred response style.
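
To make the preference objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It is illustrative only: `dpo_pair_loss` is a hypothetical helper name, and the log-probabilities are made-up numbers rather than outputs of this model. The default `beta=0.1` matches the value listed in the training configuration summary.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each argument is the summed log-probability of a full response given
    the prompt, under the trained policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
untrained = dpo_pair_loss(-42.0, -40.0, -42.0, -40.0)  # policy equals reference
improved = dpo_pair_loss(-38.0, -45.0, -42.0, -40.0)   # chosen up, rejected down
print(f"untrained={untrained:.4f} improved={improved:.4f}")
```

When the policy and the reference agree, the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one.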

Preference data format

Examples follow this structure:

{
  "prompt": "Explain the mechanism of action of metformin.",
  "chosen": "Metformin primarily activates AMPK and reduces hepatic gluconeogenesis...",
  "rejected": "Metformin is a strong antibiotic used to lower infection..."
}
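
Because DPO quality depends on well-formed pairs, a small check before training can help. The sketch below assumes the records are stored as JSON Lines, one object per line, which this card does not state. `validate_preference_record` and `load_preference_jsonl` are hypothetical helper names.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one record (empty if it is usable)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' must be a non-empty string")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("'chosen' and 'rejected' are identical")
    return problems


def load_preference_jsonl(lines):
    """Parse JSON Lines text and keep only the records that pass validation."""
    valid = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        problems = validate_preference_record(record)
        if problems:
            print(f"line {line_number}: skipped ({'; '.join(problems)})")
        else:
            valid.append(record)
    return valid


# Two sample lines: one complete pair and one missing the rejected response.
sample_lines = [
    '{"prompt": "Explain the mechanism of action of metformin.", '
    '"chosen": "Metformin primarily activates AMPK...", '
    '"rejected": "Metformin is a strong antibiotic..."}',
    '{"prompt": "Missing rejected field", "chosen": "Only one answer"}',
]
records = load_preference_jsonl(sample_lines)
print(len(records))
```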

Intended use

This adapter is intended for:

preference tuning research
alignment demonstrations
educational LLM fine-tuning projects
Stage 3 of a multi-stage pharma fine-tuning workflow

Out-of-scope use

This adapter is not intended for:

clinical decision making
diagnosis or prescription use
safety-critical medical deployment
regulatory or production healthcare systems

Training pipeline summary

The high-level Stage 3 pipeline was:

Load Stage 2 merged instruction-tuned model
Load preference dataset with prompt/chosen/rejected
Add a fresh LoRA adapter
Train with DPO using TRL
Save and upload DPO adapter
Optionally merge into a final standalone model
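
The steps above could be sketched roughly as follows. This is not the author's training script. It assumes a recent TRL release in which `DPOConfig` carries `beta` and `DPOTrainer` accepts `processing_class` (older releases name that argument `tokenizer`). It also assumes the `datasets` library and a JSON Lines preference file. The LoRA rank, alpha, dropout, and target modules are hypothetical, since this card does not list them. The function names, `preference_file`, and `output_dir` are placeholders.

```python
def stage3_hyperparameters():
    """Hyperparameters copied from this card's training configuration summary."""
    return {
        "beta": 0.1,
        "learning_rate": 5e-5,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 8,
        "max_steps": 5,
    }


def train_dpo_adapter(preference_file, output_dir):
    """Run the Stage 3 steps end to end (needs a GPU and the listed libraries)."""
    # Heavy imports live inside the function so defining it costs nothing.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import DPOConfig, DPOTrainer

    base_model_name = "ssuvetha/pharma-tinyllama-instruction-merged"

    # Step 1: load the Stage 2 merged model in 4-bit NF4.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name, quantization_config=bnb_config, device_map="auto"
    )

    # Step 2: load prompt/chosen/rejected records (JSON Lines file is assumed).
    dataset = load_dataset("json", data_files=preference_file, split="train")

    # Step 3: describe a fresh LoRA adapter. These values are hypothetical;
    # the card does not state the rank, alpha, dropout, or target modules.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

    # Step 4: train with DPO. With ref_model=None and a peft_config, TRL scores
    # the reference policy by running the same model with the adapter disabled.
    args = DPOConfig(
        output_dir=output_dir,
        fp16=True,  # assumption: fp16 mixed precision, a common choice on a T4
        **stage3_hyperparameters(),
    )
    trainer = DPOTrainer(
        model=model,
        ref_model=None,
        args=args,
        train_dataset=dataset,
        processing_class=tokenizer,  # named `tokenizer` in older TRL releases
        peft_config=peft_config,
    )
    trainer.train()

    # Step 5: save the adapter weights; uploading to the Hub is a separate step.
    trainer.save_model(output_dir)
    return trainer
```

Calling `train_dpo_adapter("preferences.jsonl", "dpo-adapter")` would run the five optimizer steps. Both arguments here are placeholder names.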

Training configuration summary

Base model: ssuvetha/pharma-tinyllama-instruction-merged
Adapter type: LoRA
DPO beta: 0.1
Learning rate: 5e-5
Batch size per device: 1
Gradient accumulation steps: 8
Max steps: 5
Quantization: 4-bit NF4
Hardware: Google Colab T4 GPU
Libraries: transformers, peft, trl, bitsandbytes
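
As a quick worked example of what these settings imply, the arithmetic below computes the effective batch size and the number of preference pairs processed over the whole run. It assumes a single GPU, which is consistent with the Colab T4 listed above but not stated explicitly. `effective_batch_size` is a hypothetical helper name.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of preference pairs contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the configuration summary; num_devices=1 is an assumption.
pairs_per_step = effective_batch_size(1, 8)
total_pairs = pairs_per_step * 5  # max steps: 5
print(pairs_per_step, total_pairs)  # prints: 8 40
```

Each optimizer step therefore averages over 8 pairs, and the full run processes 40 pairs. That makes this a short demonstration run, not an extended training job.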

How to use

Load this adapter on top of the merged Stage 2 instruction model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "ssuvetha/pharma-tinyllama-instruction-merged"
adapter_name = "ssuvetha/pharma-tinyllama-dpo-lora-adapter"

# 4-bit NF4 quantization, as listed in the training configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Fall back to the EOS token for padding if no pad token is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the Stage 2 merged model first.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the Stage 3 DPO adapter and switch to inference mode.
model = PeftModel.from_pretrained(base_model, adapter_name)
model.eval()

Example inference

prompt = """### Instruction:
Explain the primary mechanism of action of metformin.

### Response:
"""

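# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a response; gradients are not needed at inference time.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))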

Prompt format

Use instruction-style prompting:

### Instruction:
<question>

### Response:
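
A small helper can keep the template consistent and strip the echoed prompt from decoded output. This is a sketch; `build_prompt` and `extract_response` are hypothetical names, not part of this repository.

```python
RESPONSE_MARKER = "### Response:"


def build_prompt(question):
    """Wrap a question in the instruction-style template shown above."""
    return f"### Instruction:\n{question.strip()}\n\n{RESPONSE_MARKER}\n"


def extract_response(generated_text):
    """Return the text after the first response marker, or all text if absent."""
    _, marker, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if marker else generated_text.strip()


prompt_text = build_prompt("Explain the primary mechanism of action of metformin.")
print(prompt_text)
```

Passing the decoded generation through `extract_response` leaves only the model's answer, since decoding the full output sequence also returns the prompt tokens.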

Limitations

preference tuning does not guarantee factual correctness
may still hallucinate medical claims
quality depends heavily on chosen/rejected data quality
not validated for healthcare use
not a medical safety model

Project pipeline context

This repository contains only the Stage 3 DPO adapter from a three-stage pharma fine-tuning project:

ssuvetha/pharma-tinyllama-dpo-lora-adapter