gemma2b-dolly-qlora-merged
A fully merged, standalone fine-tuned version of google/gemma-2b-it on the databricks/databricks-dolly-15k instruction-following dataset.
This is the merged model — the QLoRA adapter weights have been fused directly into the base model using merge_and_unload(). No PEFT or adapter loading required. Just load and run.
LoRA adapter-only version also available (lighter, ~50 MB):
adithash/gemma2b-dolly-qlora
What is a Merged Model?
During QLoRA fine-tuning, only a small set of adapter weights (~13M params) are trained on top of the frozen base model (2B params). After training there are two ways to distribute the result:
| | Adapter repo | This repo (merged) |
|---|---|---|
| Size | ~50 MB | ~3 GB |
| Requires base model | Yes | No |
| Requires PEFT | Yes | No |
| Load complexity | 2 steps | 1 step |
| Use case | Efficient sharing | Simple deployment |
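The "2 steps" versus "1 step" rows can be made concrete. The sketch below shows how an adapter repo is typically loaded on top of its base model and then fused with `merge_and_unload()`, the merge method this card names. The function name `merge_adapter`, the default output directory, and the decision to save the base model's tokenizer alongside the weights are assumptions made for illustration; this is not the author's actual merge script.

```python
BASE_ID = "google/gemma-2b-it"
ADAPTER_ID = "adithash/gemma2b-dolly-qlora"


def merge_adapter(base_id=BASE_ID, adapter_id=ADAPTER_ID, out_dir="gemma2b-merged"):
    """Fuse LoRA adapter weights into the base model and save a standalone copy.

    out_dir is a hypothetical local path chosen for this example.
    """
    # Heavy dependencies are imported lazily so this file stays cheap to import.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 1: load the base model unquantized in fp16, so the adapter can be
    # folded into ordinary weight matrices.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    # Step 2: attach the trained adapter weights on top of the base model.
    model = PeftModel.from_pretrained(base, adapter_id)
    # Fold the low-rank updates into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(out_dir)
    # Assumption: the tokenizer is unchanged from the base model.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
    return out_dir
```

Calling `merge_adapter()` downloads both repos and writes a directory that loads with a single `from_pretrained` call, which is the "1 step" path in the table.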
Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-2b-it |
| Fine-tuning method | QLoRA (4-bit NF4 quantized base + LoRA adapters) |
| Merge method | PeftModel.merge_and_unload() |
| Dataset | databricks/databricks-dolly-15k |
| Training samples | 14,911 |
| Training steps | 500 (capped for free Colab T4) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 2e-4 |
| Batch size | 2 (effective: 8 with grad accum ×4) |
| Sequence length | 256 tokens |
| Hardware | Google Colab T4 (16 GB VRAM) |
| Training time | ~2.5 hours |
| Model size | ~3 GB (fp16) |
| Framework | transformers + peft + trl (SFTTrainer) |
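A few of the numbers in this table follow from one another. The sketch below works through that arithmetic. It assumes that each of the 500 training steps is one optimizer step (four accumulated micro-batches of 2), that no sequence packing was used, and that the LoRA update is scaled by the standard `alpha / r` factor; none of these is stated explicitly in the card.

```python
import math

# Values taken from the Model Details table.
TRAIN_SAMPLES = 14_911
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
TRAIN_STEPS = 500
LORA_R = 16
LORA_ALPHA = 32

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
# Samples seen over the whole run (assumes one optimizer step per training step).
samples_seen = TRAIN_STEPS * effective_batch
# Share of one pass over the dataset covered by the run.
epoch_fraction = samples_seen / TRAIN_SAMPLES
# One epoch counted in micro-batches of 2 and in optimizer steps of 8.
micro_batches_per_epoch = math.ceil(TRAIN_SAMPLES / PER_DEVICE_BATCH)
optimizer_steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
# Standard LoRA scaling factor applied to the low-rank update.
lora_scaling = LORA_ALPHA / LORA_R

print(f"effective batch size:      {effective_batch}")
print(f"samples seen in training:  {samples_seen}")
print(f"fraction of one epoch:     {epoch_fraction:.1%}")
print(f"micro-batches per epoch:   {micro_batches_per_epoch}")
print(f"optimizer steps per epoch: {optimizer_steps_per_epoch}")
print(f"LoRA scaling (alpha / r):  {lora_scaling}")
```

Under these assumptions the run covers roughly a quarter of one epoch.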
Training Loss
| Step | Training Loss |
|---|---|
| 25 | 3.60 |
| 50 | 2.65 |
| 75 | 2.32 |
| 100 | 2.22 |
Prompt Format
This model uses the Gemma chat template format. Wrap each input in the user and model turn markers:
<start_of_turn>user
Your instruction here<end_of_turn>
<start_of_turn>model
If your prompt includes context (e.g. a passage to summarise), append it to the instruction:
<start_of_turn>user
Summarise the following text.
Context: <your context here><end_of_turn>
<start_of_turn>model
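The two templates above can be produced by one small helper. This is a minimal sketch: the name `build_gemma_prompt` is hypothetical, and the blank line before `Context:` is an assumption that follows the `chat()` helper in the usage section of this card.

```python
def build_gemma_prompt(instruction: str, context: str = "") -> str:
    """Wrap an instruction (and optional context) in Gemma turn markers."""
    user_msg = instruction.strip()
    if context.strip():
        # Append the context to the instruction, separated by a blank line.
        user_msg = f"{user_msg}\n\nContext: {context.strip()}"
    # End with an open model turn so generation continues as the model's reply.
    return f"<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n"


print(build_gemma_prompt("Summarise the following text.", "<your context here>"))
```

Leaving the context empty, or passing only whitespace, yields the first template without a `Context:` line.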
How to Use
Basic usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model in fp16; device_map="auto" places it on the GPU if one is available
model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")

def chat(instruction, context="", max_new_tokens=200, temperature=0.7):
    # Append the optional context to the instruction, then wrap it in Gemma turn markers
    user_msg = f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    prompt = (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    # Send inputs to wherever device_map placed the model instead of hard-coding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

# Examples
print(chat("Explain what overfitting is in machine learning and how to prevent it."))
print(chat("What is the difference between SQL and NoSQL databases?"))
print(chat("Write a Python function to reverse a string."))
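As an alternative to building the prompt string by hand, the tokenizer's own chat template can insert the turn markers. The sketch below assumes that this repo's tokenizer ships a Gemma chat template, which is usual for Gemma instruct tokenizers but not confirmed in this card. The helper names `to_messages` and `generate_reply` are hypothetical, and `model` and `tokenizer` are the objects loaded in the basic usage code above.

```python
def to_messages(instruction, context=""):
    """Build the chat-style message list that apply_chat_template expects."""
    user_msg = f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    return [{"role": "user", "content": user_msg}]


def generate_reply(model, tokenizer, instruction, context="", max_new_tokens=200):
    """Generate a reply, letting the tokenizer add the Gemma turn markers."""
    inputs = tokenizer.apply_chat_template(
        to_messages(instruction, context),
        add_generation_prompt=True,  # open the model turn for generation
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the tokens generated after the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
```

A call such as `generate_reply(model, tokenizer, "Who are you?")` uses greedy decoding by default; sampling arguments like `do_sample=True` and `temperature` can be passed to `model.generate` in the same way as in `chat()`.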
Memory-efficient usage (4-bit)
If you're on a GPU with limited VRAM, load in 4-bit:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")
CPU-only usage (slow but works)
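import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Without device_map, the model loads on the CPU in full float32 precision
model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    torch_dtype=torch.float32,
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")
# When generating, move inputs with .to("cpu") (or .to(model.device)) rather than .to("cuda")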
Intended Use
- ✅ Learning and experimentation with fine-tuned LLMs
- ✅ Portfolio demonstration of end-to-end QLoRA fine-tuning
- ✅ Simple single-step model loading without PEFT dependency
- ✅ Starting point for domain-specific instruction tuning
- ❌ Not intended for production or commercial use
- ❌ Not suitable for safety-critical applications
Limitations
- Fine-tuned for only 500 steps as a proof of concept on a free Colab T4; a full epoch over the 14,911 samples would be ~7,400 batches of 2 (about 1,860 optimizer steps at the effective batch size of 8)
- Gemma 2B is a small model — complex multi-step reasoning will be limited
- Training sequence length was capped at 256 tokens, so longer training examples were truncated and quality on very long prompts may degrade
- A newer base model exists: google/gemma-2-2b-it
- Subject to Google's Gemma Terms of Use
Training Infrastructure
| Component | Detail |
|---|---|
| Notebook | Google Colab (free tier) |
| GPU | NVIDIA T4 — 16 GB VRAM |
| Libraries | transformers 4.x, peft, trl 1.3+, bitsandbytes, accelerate |
| Trainer | SFTTrainer with SFTConfig |
| Gradient checkpointing | Enabled |
| Mixed precision | bfloat16 |
| Merge | PeftModel.merge_and_unload() in fp16 |
Related Repositories
| Repo | Description |
|---|---|
| adithash/gemma2b-dolly-qlora | LoRA adapter only (~50 MB); requires base model + PEFT |
| adithash/gemma2b-dolly-qlora-merged | This repo: fully merged standalone model (~3 GB) |
Citation
If you use this model in your work, please credit the base model:
@article{gemma_2024,
  title  = {Gemma: Open Models Based on Gemini Research and Technology},
  author = {Gemma Team, Google DeepMind},
  year   = {2024},
  url    = {https://arxiv.org/abs/2403.08295}
}
Author
Aditya Dey — ML Engineer
🤗 HuggingFace · GitHub