gemma2b-dolly-qlora-merged

A fully merged, standalone fine-tuned version of google/gemma-2b-it on the databricks/databricks-dolly-15k instruction-following dataset.

This is the merged model — the QLoRA adapter weights have been fused directly into the base model using merge_and_unload(). No PEFT or adapter loading required. Just load and run.

A LoRA adapter-only version is also available (lighter, ~50 MB): adithash/gemma2b-dolly-qlora


What is a Merged Model?

During QLoRA fine-tuning, only a small set of adapter weights (~13M params) is trained on top of the frozen base model (2B params). After training, there are two ways to distribute the result:

| Property | Adapter repo | This repo (merged) |
|---|---|---|
| Size | ~50 MB | ~3 GB |
| Requires base model | Yes | No |
| Requires PEFT | Yes | No |
| Load complexity | 2 steps | 1 step |
| Use case | Efficient sharing | Simple deployment |
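
To make the idea of "fusing" concrete, here is a minimal pure-Python sketch of what merging a LoRA adapter into a weight matrix amounts to. It assumes the standard LoRA formulation, in which the low-rank update B·A is scaled by alpha / r before being added to the frozen weight. The toy matrices and the helper names (`matmul`, `merge_lora`) are illustrative only and are not taken from this model's training code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * (B @ A), i.e. the fused weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: a 2x2 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]        # shape (out=2, r=1)
A = [[0.5, -0.5]]         # shape (r=1, in=2)
x = [[3.0], [4.0]]        # input as a column vector

# alpha / r = 2, the same ratio as alpha=32, r=16 in this card.
ALPHA, R = 2, 1

# Adapter path: base output plus the scaled low-rank correction.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + (ALPHA / R) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: a single matrix multiply with the fused weight.
W_merged = merge_lora(W, B, A, ALPHA, R)
y_merged = matmul(W_merged, x)

print(y_adapter == y_merged)
```

Both paths give the same output, which is why the merged model needs no adapter at inference time: the correction is already baked into the weights.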

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-2b-it |
| Fine-tuning method | QLoRA (4-bit NF4 quantized base + LoRA adapters) |
| Merge method | PeftModel.merge_and_unload() |
| Dataset | databricks/databricks-dolly-15k |
| Training samples | 14,911 |
| Training steps | 500 (capped for free Colab T4) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 2e-4 |
| Batch size | 2 (effective: 8 with grad accum ×4) |
| Sequence length | 256 tokens |
| Hardware | Google Colab T4 (16 GB VRAM) |
| Training time | ~2.5 hours |
| Model size | ~3 GB (fp16) |
| Framework | transformers + peft + trl (SFTTrainer) |
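
For readers who want to reproduce the merge step from the adapter repo, the following is a rough sketch rather than the author's exact script. It assumes the adapter at adithash/gemma2b-dolly-qlora loads against google/gemma-2b-it, that the base model is reloaded unquantized in fp16 before merging, and that `output_dir` is a hypothetical local path. The function name `merge_adapter` is also illustrative.

```python
BASE_ID = "google/gemma-2b-it"
ADAPTER_ID = "adithash/gemma2b-dolly-qlora"


def merge_adapter(base_id=BASE_ID, adapter_id=ADAPTER_ID, output_dir="merged-model"):
    """Fuse a LoRA adapter into its base model and save a standalone copy.

    Heavy dependencies are imported lazily so the module can be imported
    without torch, peft, or transformers installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 1: load the base model in fp16 (assumption: not 4-bit quantized).
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

    # Step 2: attach the adapter weights on top of the base model.
    model = PeftModel.from_pretrained(base, adapter_id)

    # Fold the adapter into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()

    # Save model and tokenizer together so the folder loads in one step.
    merged.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)
    return output_dir


# Usage (downloads several GB and needs access to the Gemma base model):
# merge_adapter(output_dir="gemma2b-dolly-qlora-merged")
```

The two loading steps inside the function are the same two steps an adapter-repo user would perform at inference time; merging simply performs them once, ahead of distribution.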

Training Loss

| Step | Training Loss |
|---|---|
| 25 | 3.60 |
| 50 | 2.65 |
| 75 | 2.32 |
| 100 | 2.22 |
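
One way to read these numbers is to convert them to perplexity. The short sketch below assumes the logged value is the mean per-token cross-entropy in nats, which is how causal language model loss is normally reported by transformers; if the logging differs, the conversion does not apply.

```python
import math

# Training loss values copied from the table above.
losses = {25: 3.60, 50: 2.65, 75: 2.32, 100: 2.22}

for step, loss in losses.items():
    # Perplexity is the exponential of the mean cross-entropy.
    print(f"step {step:>3}: loss {loss:.2f} -> perplexity {math.exp(loss):.1f}")

# Relative reduction in loss between the first and last logged steps.
relative_drop = (losses[25] - losses[100]) / losses[25]
print(f"relative loss reduction: {relative_drop:.1%}")
```

Under that assumption, perplexity on the training data falls from roughly 36.6 at step 25 to roughly 9.2 at step 100. This is training loss only, so it says nothing about held-out performance.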

Prompt Format

This model uses the Gemma chat template. Wrap every input in the following turn markers:

<start_of_turn>user
Your instruction here<end_of_turn>
<start_of_turn>model

If your prompt includes context (e.g. a passage to summarise), append it to the instruction:

<start_of_turn>user
Summarise the following text.

Context: <your context here><end_of_turn>
<start_of_turn>model
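
The two layouts above can be produced by one small helper. This is a minimal sketch; the function name `build_prompt` and the empty-instruction check are illustrative additions, not part of the original card.

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Wrap an instruction (and optional context) in Gemma turn markers."""
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")

    context = context.strip()
    # Append the context to the instruction only when one is supplied.
    user_msg = f"{instruction}\n\nContext: {context}" if context else instruction

    # The prompt ends with the opening of the model turn so that
    # generation continues from there.
    return (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


print(build_prompt("Summarise the following text.", "Gemma is a family of open models."))
```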

How to Use

Basic usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")

def chat(instruction, context="", max_new_tokens=200, temperature=0.7):
    user_msg = f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    prompt = (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    # Send inputs to wherever device_map placed the model (GPU or CPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

# Examples
print(chat("Explain what overfitting is in machine learning and how to prevent it."))
print(chat("What is the difference between SQL and NoSQL databases?"))
print(chat("Write a Python function to reverse a string."))
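
As an alternative to building the prompt string by hand, the tokenizer's built-in chat template can render it from a list of messages. The sketch below assumes the tokenizer shipped with this repo carries the Gemma chat template; the helper names `build_messages` and `render_prompt` are illustrative.

```python
def build_messages(instruction, context=""):
    """Return a single-turn message list in the format chat templates expect."""
    user_msg = f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    return [{"role": "user", "content": user_msg}]


def render_prompt(tokenizer, instruction, context=""):
    """Render the prompt text with the tokenizer's own chat template."""
    return tokenizer.apply_chat_template(
        build_messages(instruction, context),
        tokenize=False,              # return a string, not token ids
        add_generation_prompt=True,  # end with the opening of the model turn
    )


# Usage with the tokenizer loaded above:
# prompt = render_prompt(tokenizer, "What is overfitting?")
# The rendered Gemma template is expected to start with the BOS token already,
# so tokenize it with add_special_tokens=False to avoid adding a second one:
# inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
```

Using the template keeps the turn markers in sync with the tokenizer configuration instead of hard-coding them in application code.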

Memory-efficient usage (4-bit)

If you're on a GPU with limited VRAM, load in 4-bit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")

CPU-only usage (slow but works)

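import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# No device_map: the model loads on the CPU in full precision
model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    torch_dtype=torch.float32,
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")
# When reusing chat() from above, keep the inputs on the CPU too:
# replace any .to("cuda") with .to("cpu") or .to(model.device)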
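
The three loading modes above (fp16 on GPU, 4-bit on a small GPU, fp32 on CPU) can be summarised as a small decision helper. This is only a sketch: `pick_load_settings` and the dictionary keys are hypothetical names, and the caller is expected to translate the result into the `from_pretrained` arguments shown in the snippets above.

```python
def pick_load_settings(has_cuda: bool, low_vram: bool = False) -> dict:
    """Choose loading settings that mirror the three usage snippets above."""
    if not has_cuda:
        # CPU-only: full precision, no device map.
        return {"dtype_name": "float32", "device_map": None, "four_bit": False}
    if low_vram:
        # Small GPU: 4-bit NF4 weights with a bfloat16 compute dtype.
        return {"dtype_name": "bfloat16", "device_map": "auto", "four_bit": True}
    # Default GPU path: half precision.
    return {"dtype_name": "float16", "device_map": "auto", "four_bit": False}


# Usage (requires torch):
# import torch
# settings = pick_load_settings(torch.cuda.is_available())
# dtype = getattr(torch, settings["dtype_name"])
print(pick_load_settings(has_cuda=False))
```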

Intended Use

  • ✅ Learning and experimentation with fine-tuned LLMs
  • ✅ Portfolio demonstration of end-to-end QLoRA fine-tuning
  • ✅ Simple single-step model loading without PEFT dependency
  • ✅ Starting point for domain-specific instruction tuning
  • ❌ Not intended for production or commercial use
  • ❌ Not suitable for safety-critical applications

Limitations

  • Fine-tuned for only 500 steps as a proof of concept on a free Colab T4; a full epoch would be ~7,400 micro-batches of 2 samples (roughly 1,860 optimizer steps at the effective batch size of 8)
  • Gemma 2B is a small model — complex multi-step reasoning will be limited
  • Training sequences were capped at 256 tokens, so longer training examples were truncated and quality on long prompts may degrade
  • A newer base model exists: google/gemma-2-2b-it
  • Subject to Google's Gemma Terms of Use
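
The training-coverage figures in the first limitation can be checked with a few lines of arithmetic. All inputs come from the Model Details table; what a "step" means (one micro-batch or one optimizer update) is not stated in this card, so the sketch computes both readings.

```python
import math

TRAIN_SAMPLES = 14_911
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
STEPS_RUN = 500

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# One epoch counted in micro-batches vs. optimizer updates.
micro_batches_per_epoch = math.ceil(TRAIN_SAMPLES / PER_DEVICE_BATCH)
optimizer_steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)

# Share of the dataset seen in 500 steps under each reading of "step".
fraction_if_optimizer_steps = STEPS_RUN * effective_batch / TRAIN_SAMPLES
fraction_if_micro_batches = STEPS_RUN * PER_DEVICE_BATCH / TRAIN_SAMPLES

print(f"effective batch size:       {effective_batch}")
print(f"micro-batches per epoch:    {micro_batches_per_epoch}")
print(f"optimizer steps per epoch:  {optimizer_steps_per_epoch}")
print(f"data seen (optimizer steps): {fraction_if_optimizer_steps:.1%}")
print(f"data seen (micro-batches):   {fraction_if_micro_batches:.1%}")
```

Either way, the run covers well under a third of one epoch, which is consistent with describing it as a proof of concept.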

Training Infrastructure

| Component | Detail |
|---|---|
| Notebook | Google Colab (free tier) |
| GPU | NVIDIA T4, 16 GB VRAM |
| Libraries | transformers 4.x, peft, trl 1.3+, bitsandbytes, accelerate |
| Trainer | SFTTrainer with SFTConfig |
| Gradient checkpointing | Enabled |
| Mixed precision | bfloat16 |
| Merge | PeftModel.merge_and_unload() in fp16 |
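
The hyperparameters listed in this card map onto library configuration objects roughly as sketched below. This is not the author's training script. The LoRA target modules are not stated in the card, so `target_modules` must be supplied by the caller; the sequence-length argument is left out because its name differs between trl versions; `make_configs` and `output_dir` are hypothetical names.

```python
# Values taken from the Model Details and Training Infrastructure tables.
LORA_KWARGS = {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "task_type": "CAUSAL_LM"}
BNB_KWARGS = {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
TRAIN_KWARGS = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,   # effective batch size of 8
    "max_steps": 500,
    "learning_rate": 2e-4,
    "gradient_checkpointing": True,
    "bf16": True,                       # mirrors the "Mixed precision" row
}


def make_configs(target_modules, output_dir="outputs"):
    """Build the LoRA, quantization, and trainer configs (lazy imports)."""
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig
    from trl import SFTConfig

    lora_config = LoraConfig(target_modules=target_modules, **LORA_KWARGS)
    bnb_config = BitsAndBytesConfig(**BNB_KWARGS)
    train_config = SFTConfig(output_dir=output_dir, **TRAIN_KWARGS)
    return lora_config, bnb_config, train_config


# LoRA scaling factor applied to the low-rank update.
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scale (alpha / r): {lora_scale}")
```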

Related Repositories

| Repo | Description |
|---|---|
| adithash/gemma2b-dolly-qlora | LoRA adapter only (~50 MB); requires base model + PEFT |
| adithash/gemma2b-dolly-qlora-merged | This repo: fully merged standalone model (~3 GB) |

Citation

If you use this model in your work, please credit the base model:

@article{gemma_2024,
  title  = {Gemma: Open Models Based on Gemini Research and Technology},
  author = {Gemma Team, Google DeepMind},
  year   = {2024},
  url    = {https://arxiv.org/abs/2403.08295}
}

Author

Aditya Dey — ML Engineer
🤗 HuggingFace · GitHub
