Pharma TinyLlama DPO LoRA Adapter

This repository contains a LoRA adapter trained using Direct Preference Optimization (DPO) on pharma-domain preference data.

This adapter was trained on top of the Stage 2 merged instruction-tuned model:

  • Base model for this stage: ssuvetha/pharma-tinyllama-instruction-merged

This is the third stage of a three-stage pipeline:

  • Stage 1: non-instruction domain adaptation
  • Stage 2: instruction fine-tuning
  • Stage 3: preference tuning with DPO

Model Type

  • Stage: 3
  • Training type: Preference tuning
  • Method: DPO (Direct Preference Optimization)
  • Adapter type: LoRA
  • Task: Preference-aligned pharma instruction response generation

What this stage adds

Stage 1 taught domain language.
Stage 2 taught instruction following.
Stage 3 teaches the model to prefer better responses over weaker ones.

The DPO dataset uses:

  • prompt
  • chosen
  • rejected

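Each example pairs a prompt with a preferred (chosen) and a dispreferred (rejected) response. DPO trains the model to assign higher likelihood to the chosen response than to the rejected one, relative to a reference model, which pushes generations toward the preferred response style.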
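
To make the objective concrete, the following is a minimal pure-Python sketch of the per-example DPO loss. The function name and its log-probability arguments are illustrative and are not taken from this project's training code; in practice TRL computes these quantities from the sequence log-probabilities of the trained model and a reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    All argument names are hypothetical; beta=0.1 matches the value
    listed in the training configuration summary.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2. The loss falls as the policy favours the chosen response more strongly than the reference does.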


Preference data format

Examples follow this structure:

{
  "prompt": "Explain the mechanism of action of metformin.",
  "chosen": "Metformin primarily activates AMPK and reduces hepatic gluconeogenesis...",
  "rejected": "Metformin is a strong antibiotic used to lower infection..."
}
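
Because DPO quality depends on well-formed pairs, it can help to check records before training. The sketch below is one possible check, not part of the original pipeline; the function name and the specific rules (non-empty strings, chosen differing from rejected) are assumptions.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one preference example.

    An empty list means the record has the expected structure.
    This helper is hypothetical and only checks shape, not content quality.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair carries no preference signal if both responses are the same.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


example = {
    "prompt": "Explain the mechanism of action of metformin.",
    "chosen": "Metformin primarily activates AMPK and reduces hepatic gluconeogenesis...",
    "rejected": "Metformin is a strong antibiotic used to lower infection...",
}
print(validate_preference_record(example))  # prints []
```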

Intended use

This adapter is intended for:

  • preference tuning research
  • alignment demonstrations
  • educational LLM fine-tuning projects
  • Stage 3 of a multi-stage pharma fine-tuning workflow

Out-of-scope use

This adapter is not intended for:

  • clinical decision making
  • diagnosis or prescription use
  • safety-critical medical deployment
  • regulatory or production healthcare systems

Training pipeline summary

The high-level Stage 3 pipeline was:

  1. Load Stage 2 merged instruction-tuned model
  2. Load preference dataset with prompt/chosen/rejected
  3. Add a fresh LoRA adapter
  4. Train with DPO using TRL
  5. Save and upload DPO adapter
  6. Optionally merge into a final standalone model
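
The steps above could be sketched with TRL and PEFT roughly as follows. This is not the author's training script: the LoRA settings, output paths, and function names are assumptions, and the `DPOTrainer` keyword for the tokenizer varies by TRL version (`processing_class` in recent releases, `tokenizer` in older ones). Only the hyperparameters in `DPO_HPARAMS` come from this model card.

```python
BASE_MODEL = "ssuvetha/pharma-tinyllama-instruction-merged"

# Values taken from the training configuration summary in this model card.
DPO_HPARAMS = dict(
    beta=0.1,
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=5,
)


def train_dpo_adapter(train_dataset, output_dir="dpo-lora-adapter"):
    """Steps 1-5: train a fresh LoRA adapter with DPO.

    `train_dataset` is assumed to be a `datasets.Dataset` with
    prompt/chosen/rejected columns; `output_dir` is a hypothetical path.
    """
    # Heavy imports live inside the function so importing this file stays cheap.
    import torch
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import DPOConfig, DPOTrainer

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, quantization_config=bnb_config, device_map="auto"
    )

    # Assumed LoRA settings; the model card does not list rank or alpha.
    peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

    # Precision flags (fp16/bf16) are omitted; choose them for your hardware.
    args = DPOConfig(output_dir=output_dir, **DPO_HPARAMS)

    trainer = DPOTrainer(
        model=model,
        ref_model=None,  # with a PEFT config, TRL uses the adapter-free model as reference
        args=args,
        train_dataset=train_dataset,
        processing_class=tokenizer,  # older TRL versions expect tokenizer=tokenizer
        peft_config=peft_config,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir


def merge_adapter(adapter_dir, merged_dir="dpo-merged"):
    """Step 6 (optional): fold the adapter into a standalone fp16 model."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Merge into an unquantized copy of the base model.
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(merged_dir)
    return merged_dir
```

Uploading the saved adapter (step 5) is left out here; it depends on your Hugging Face Hub credentials and repository name.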

Training configuration summary

  • Base model: ssuvetha/pharma-tinyllama-instruction-merged
  • Adapter type: LoRA
  • DPO beta: 0.1
  • Learning rate: 5e-5
  • Batch size per device: 1
  • Gradient accumulation steps: 8
  • Max steps: 5
  • Quantization: 4-bit NF4
  • Hardware: Google Colab T4 GPU
  • Libraries: transformers, peft, trl, bitsandbytes
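
As a quick check on what these settings imply, the arithmetic below computes the effective batch size and the maximum number of preference pairs the run can see. The single-device count is an assumption based on the Colab T4 hardware listed above.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 8
max_steps = 5
num_devices = 1  # assumption: one Colab T4 GPU

# Pairs contributing to each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Upper bound on preference pairs processed during the whole run.
pairs_seen = effective_batch_size * max_steps

print(effective_batch_size, pairs_seen)  # prints: 8 40
```

With at most 40 pairs processed, this configuration is a short demonstration run and not a full preference-tuning pass.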

How to use

Load this adapter on top of the merged Stage 2 instruction model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "ssuvetha/pharma-tinyllama-instruction-merged"
adapter_name = "ssuvetha/pharma-tinyllama-dpo-lora-adapter"

# 4-bit NF4 quantization (the same quantization type used in training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Fall back to EOS as the padding token if none is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the quantized Stage 2 merged model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the DPO LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(base_model, adapter_name)
model.eval()

Example inference

prompt = """### Instruction:
Explain the primary mechanism of action of metformin.

### Response:
"""

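# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a response without tracking gradients
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# The decoded text contains the prompt followed by the generated response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))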

Prompt format

Use instruction-style prompting:

### Instruction:
<question>

### Response:
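
The template can be wrapped in small helpers so prompts stay consistent with the example above. The following is a sketch; `build_prompt` and `extract_response` are hypothetical names and are not part of this repository.

```python
RESPONSE_MARKER = "### Response:"


def build_prompt(question):
    """Format a question using the instruction-style template."""
    return f"### Instruction:\n{question.strip()}\n\n{RESPONSE_MARKER}\n"


def extract_response(generated_text):
    """Return the text after the first response marker.

    If the marker is absent, the whole text is returned stripped.
    """
    if RESPONSE_MARKER not in generated_text:
        return generated_text.strip()
    return generated_text.split(RESPONSE_MARKER, 1)[1].strip()


prompt = build_prompt("Explain the primary mechanism of action of metformin.")
print(prompt)
```

`extract_response` is useful because decoding the full output sequence returns the prompt together with the generated answer.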

Limitations

  • preference tuning does not guarantee factual correctness
  • may still hallucinate medical claims
  • quality depends heavily on chosen/rejected data quality
  • not validated for healthcare use
  • not a medical safety model

Project pipeline context

This repository contains only the Stage 3 DPO adapter, the final stage of a three-stage pharma fine-tuning project:

  1. Stage 1: non-instruction LoRA adapter
  2. Stage 2: instruction LoRA adapter
  3. Stage 3: DPO LoRA adapter

Citation

If you use this model, please cite:

  • TinyLlama
  • PEFT / LoRA / QLoRA
  • TRL / DPO
  • this project's repository or notebook