Pharma TinyLlama DPO LoRA Adapter
This repository contains a LoRA adapter trained using Direct Preference Optimization (DPO) on pharma-domain preference data.
This adapter was trained on top of the Stage 2 merged instruction-tuned model:
- Base model for this stage: ssuvetha/pharma-tinyllama-instruction-merged
It is the third stage of a three-stage pipeline:
- Stage 1: non-instruction domain adaptation
- Stage 2: instruction fine-tuning
- Stage 3: preference tuning with DPO
Model Type
- Stage: 3
- Training type: Preference tuning
- Method: DPO (Direct Preference Optimization)
- Adapter type: LoRA
- Task: Preference-aligned pharma instruction response generation
What this stage adds
Stage 1 taught domain language.
Stage 2 taught instruction following.
Stage 3 teaches the model to prefer better responses over weaker ones.
The DPO dataset uses:
- prompt
- chosen
- rejected
This encourages the model to generate answers closer to the preferred response style.
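To make "prefer better responses over weaker ones" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The log-probability values are made up for illustration and do not come from this model; `beta=0.1` matches the value listed in the training configuration. The reference model is the frozen starting point, which for this stage would typically be the Stage 2 model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (prompt, chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Preference margin, scaled by beta.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical log-probabilities, for illustration only.
before = dpo_loss(-42.0, -40.0, -42.0, -40.0)  # policy identical to reference
after = dpo_loss(-38.0, -45.0, -42.0, -40.0)   # policy shifted toward "chosen"
print(before, after)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model rather than in absolute terms.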
Preference data format
Examples follow this structure:
{
  "prompt": "Explain the mechanism of action of metformin.",
  "chosen": "Metformin primarily activates AMPK and reduces hepatic gluconeogenesis...",
  "rejected": "Metformin is a strong antibiotic used to lower infection..."
}
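Since result quality depends on the preference pairs, it can help to check records before training. The sketch below assumes the data is stored as JSON Lines (one record per line), which this card does not state; the helper names `validate_record` and `load_preference_lines` are hypothetical.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_record(record):
    """Return a list of problems found in one preference record."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

def load_preference_lines(lines):
    """Parse JSON Lines text and keep only the records that pass validation."""
    valid = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not validate_record(record):
            valid.append(record)
    return valid

sample = [
    json.dumps({
        "prompt": "Explain the mechanism of action of metformin.",
        "chosen": "Metformin primarily activates AMPK and reduces hepatic gluconeogenesis...",
        "rejected": "Metformin is a strong antibiotic used to lower infection...",
    }),
    json.dumps({"prompt": "Incomplete example", "chosen": "Only one response"}),
]
records = load_preference_lines(sample)
print(len(records))
```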
Intended use
This adapter is intended for:
- preference tuning research
- alignment demonstrations
- educational LLM fine-tuning projects
- Stage 3 of a multi-stage pharma fine-tuning workflow
Not intended use
This model is not intended for:
- clinical decision making
- diagnosis or prescription use
- safety-critical medical deployment
- regulatory or production healthcare systems
Training pipeline summary
The high-level Stage 3 pipeline was:
- Load Stage 2 merged instruction-tuned model
- Load preference dataset with prompt/chosen/rejected
- Add a fresh LoRA adapter
- Train with DPO using TRL
- Save and upload DPO adapter
- Optionally merge into a final standalone model
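The steps above could be sketched with TRL roughly as follows. This is not the author's training script: the LoRA rank, alpha, dropout, target modules, data file name, and output directory are hypothetical placeholders, and the sketch assumes a recent TRL release (older versions pass the tokenizer to `DPOTrainer` as `tokenizer` rather than `processing_class`). Imports sit inside the function so the hyperparameter dictionary can be inspected without the training stack installed. The optional merge step is not shown.

```python
# Values taken from the training configuration summary in this card.
DPO_HYPERPARAMS = {
    "beta": 0.1,
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "max_steps": 5,
}

def train_dpo_adapter(data_file="preferences.jsonl",   # hypothetical path
                      output_dir="dpo-lora-adapter"):   # hypothetical path
    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)
    from trl import DPOConfig, DPOTrainer

    base = "ssuvetha/pharma-tinyllama-instruction-merged"

    # 1. Load the Stage 2 merged model in 4-bit NF4.
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(base)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        base, quantization_config=bnb, device_map="auto")

    # 2. Load records with prompt / chosen / rejected columns.
    dataset = load_dataset("json", data_files=data_file, split="train")

    # 3. Configure a fresh LoRA adapter (every value here is an assumption).
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

    # 4. Train with DPO. No separate ref_model is passed; with a peft_config,
    #    TRL is expected to use the adapter-free base weights as the reference.
    args = DPOConfig(output_dir=output_dir, **DPO_HYPERPARAMS)
    trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
                         processing_class=tokenizer, peft_config=lora)
    trainer.train()

    # 5. Save the trained adapter.
    trainer.save_model(output_dir)
    return output_dir
```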
Training configuration summary
- Base model: ssuvetha/pharma-tinyllama-instruction-merged
- Adapter type: LoRA
- DPO beta: 0.1
- Learning rate: 5e-5
- Batch size per device: 1
- Gradient accumulation steps: 8
- Max steps: 5
- Quantization: 4-bit NF4
- Hardware: Google Colab T4 GPU
- Libraries: transformers, peft, trl, bitsandbytes
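As a quick sanity check on scale, the batch settings above imply the following amount of training data per run. The single-device count is an assumption based on the Colab T4 entry.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 8
max_steps = 5
num_devices = 1  # assumption: a single Colab T4 GPU

# Preference pairs contributing to each optimizer step.
effective_batch_size = (per_device_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
# Upper bound on preference pairs processed over the whole run.
pairs_seen = effective_batch_size * max_steps
print(effective_batch_size, pairs_seen)
```

Under these assumptions each optimizer step averages over 8 pairs and the run processes at most 40 pairs in total, which fits the demonstration and educational scope described under intended use.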
How to use
Load this adapter on top of the merged Stage 2 instruction model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_name = "ssuvetha/pharma-tinyllama-instruction-merged"
adapter_name = "ssuvetha/pharma-tinyllama-dpo-lora-adapter"

# 4-bit NF4 quantization, matching the training configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Fall back to the EOS token when the tokenizer defines no pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the Stage 2 merged model, then attach the DPO adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_name)
model.eval()
Example inference
prompt = """### Instruction:
Explain the primary mechanism of action of metformin.
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Prompt format
Use instruction-style prompting:
### Instruction:
<question>
### Response:
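A small pair of helpers can keep this format consistent and strip the echoed prompt from decoded output, since decoding the full output sequence returns the prompt followed by the completion. The function names `build_prompt` and `extract_response` are hypothetical and not part of any library; this is a sketch that assumes the model reproduces the markers exactly as written above.

```python
INSTRUCTION_MARKER = "### Instruction:"
RESPONSE_MARKER = "### Response:"

def build_prompt(question):
    """Wrap a question in the instruction-style template used by this card."""
    return f"{INSTRUCTION_MARKER}\n{question.strip()}\n{RESPONSE_MARKER}\n"

def extract_response(decoded_text):
    """Return only the model's answer from decoded prompt-plus-completion text."""
    if RESPONSE_MARKER not in decoded_text:
        return decoded_text.strip()
    # Keep the text after the first response marker.
    _, _, tail = decoded_text.partition(RESPONSE_MARKER)
    # Drop anything after a new instruction block the model may have started.
    answer, _, _ = tail.partition(INSTRUCTION_MARKER)
    return answer.strip()

prompt = build_prompt("Explain the primary mechanism of action of metformin.")
decoded = prompt + "It lowers hepatic glucose output.\n### Instruction:\nNext question"
print(extract_response(decoded))
```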
Limitations
- preference tuning does not guarantee factual correctness
- may still hallucinate medical claims
- quality depends heavily on chosen/rejected data quality
- not validated for healthcare use
- not a medical safety model
Project pipeline context
This repository contains only the Stage 3 DPO adapter from a three-stage pharma fine-tuning project:
- Stage 1: non-instruction LoRA adapter
- Stage 2: instruction LoRA adapter
- Stage 3: DPO LoRA adapter
Citation
If you use this model, please cite:
- TinyLlama
- PEFT / LoRA / QLoRA
- TRL / DPO
- your project repository or notebook