Lily-1.5b-v0.3

Lily-1.5b-v0.3 is a distilled, instruction-tuned language model built by continuing training from abhinav0231/Lily-1.5b-v0.1 on the chatml configuration of the abhinav0231/Sarvam-105b-Distill-100k dataset.

This version was trained as an offline supervised fine-tuning run focused on high-quality long-form assistant responses in ChatML format, with many examples following an explicit <think> and <answer> structure.

The model was trained and merged in a single-GPU Modal workflow on an NVIDIA A100-SXM4-40GB system using BF16, QLoRA, and Unsloth.

Model summary

This checkpoint starts from abhinav0231/Lily-1.5b-v0.1 and applies a distillation-style supervised fine-tuning stage rather than training from scratch.

The base architecture loaded during training is a Qwen2-style causal language model with:

28 layers
hidden size 1536
12 attention heads
2 key-value heads
vocabulary size 151,936
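
As a quick illustration of what these numbers imply, the sketch below derives the per-head dimension and the grouped-query attention ratio. It assumes the usual Qwen2 convention that the head dimension equals hidden size divided by the number of attention heads; the class name `ArchSummary` is hypothetical and exists only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSummary:
    """Architecture values as listed in this model card."""
    num_layers: int = 28
    hidden_size: int = 1536
    num_attention_heads: int = 12
    num_key_value_heads: int = 2
    vocab_size: int = 151_936

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads (Qwen2 default).
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: how many query heads share one KV head.
        return self.num_attention_heads // self.num_key_value_heads

    @property
    def kv_proj_dim(self) -> int:
        # Output width of k_proj and v_proj under the assumption above.
        return self.num_key_value_heads * self.head_dim


arch = ArchSummary()
print(arch.head_dim, arch.queries_per_kv_head, arch.kv_proj_dim)
```

Under that assumption, each key-value head is shared by several query heads, which keeps the KV cache small relative to a full multi-head layout.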

The training setup targets:

instruction following
structured response generation
distilled reasoning-flavored outputs

rather than pure base-model continuation pretraining.

Training objective

The goal of v0.3 was to improve the model through offline SFT distillation from a synthetic/teacher-style dataset while preserving the usability and compact size of the 1.5B-class base model.

The dataset examples are preformatted as ChatML conversations and frequently instruct the assistant to reason in a <think> block before producing a final <answer> block.

Because of that training distribution, the model may naturally produce more structured, tutor-like, stepwise outputs than the earlier checkpoint depending on the prompt style.

Base model

Base model: abhinav0231/Lily-1.5b-v0.1
Final merged model repo: abhinav0231/Lily-1.5b-v0.3
GGUF repo: abhinav0231/Lily-1.5b-v0.3-GGUF

Benchmarks

v0.3 was evaluated using lm-evaluation-harness.

Dataset

The main training dataset is:

abhinav0231/Sarvam-105b-Distill-100k

using the chatml configuration, stored as a single text column of preformatted conversations.

The final training notebook loaded:

91,457 training examples
1,908 validation examples
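
A minimal sketch of loading this configuration with the Hugging Face `datasets` library follows. The helper names `load_chatml_splits` and `describe_splits` are hypothetical, and the split names returned by the Hub are not confirmed by this card, so the code reports whatever splits it finds rather than assuming them.

```python
DATASET_ID = "abhinav0231/Sarvam-105b-Distill-100k"
CONFIG_NAME = "chatml"


def load_chatml_splits():
    """Download the chatml configuration and return all of its splits."""
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, CONFIG_NAME)


def describe_splits(splits) -> dict:
    """Map each split name to its number of rows."""
    return {name: len(rows) for name, rows in splits.items()}


if __name__ == "__main__":
    # Requires network access and `pip install datasets`.
    print(describe_splits(load_chatml_splits()))
```

Row counts printed this way may differ from the figures above if the notebook filtered or deduplicated rows before training.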

A separate sanity-check pass over the dataset family showed a very similar distribution, including:

92,040 training examples
1,917 validation examples
1,918 test examples

confirming the same overall ChatML reasoning-style format.

Dataset style

The dataset uses ChatML with:

<|im_start|>
<|im_end|>

delimiters and includes a chat template in the tokenizer setup.

Many examples use a system prompt that explicitly asks the assistant to think through the problem in a <think> block and then give the final response in an <answer> block.

This means the model was not trained on plain raw instruction-response text alone; it was trained on a formatted conversational distribution with strong structural priors.
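
To make the format concrete, the sketch below renders a message list into ChatML text by hand. It is an approximation for illustration only: the tokenizer's own chat template is the authoritative source, the function name `render_chatml` is hypothetical, and the system prompt wording is invented for the example rather than taken from the dataset.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical system prompt in the spirit of the dataset's instructions.
example = render_chatml(
    [
        {
            "role": "system",
            "content": "Think inside a <think> block, then reply inside an <answer> block.",
        },
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    add_generation_prompt=True,
)
print(example)
```

In practice, `tokenizer.apply_chat_template` should be preferred, since it reflects the exact template shipped with the model.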

Length characteristics

A 5,000-sample sanity slice of the training set had:

mean length = 1640.72 tokens
p50 = 1219
p90 = 3221
p95 = 4096.15
p99 = 6883.35

About:

5.00% of sampled training examples
4.33% of sampled validation examples

exceeded 4096 tokens.

These numbers matter because the training run used a 4096-token maximum sequence length, so roughly the longest 5% of examples are subject to truncation or packing effects, depending on preprocessing behavior.
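
Statistics of this kind can be computed from a list of per-example token counts, as sketched below. The percentile uses linear interpolation between neighbouring ranks, which is an assumption about how the original figures were produced; the function names and the toy input are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def length_report(token_counts, max_len=4096):
    """Summarise token lengths and the share of examples longer than max_len."""
    ordered = sorted(token_counts)
    return {
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "frac_over_max": sum(1 for n in ordered if n > max_len) / len(ordered),
    }


# Toy token counts for illustration only; these are not dataset values.
report = length_report([100, 200, 300, 400, 5000])
print(report)
```

In a real check, the token counts would come from tokenizing each row of the text column with the model's tokenizer.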

Training setup

Training was run on a single NVIDIA A100-SXM4-40GB GPU in Modal, without:

DDP
accelerate launch
multi-process orchestration

The environment used:

Unsloth 2026.5.2
TRL 0.22.2
PyTorch 2.8.0+cu129
CUDA 12.9
Triton 3.4.0
BF16 mixed precision

Flash Attention 2 was auto-enabled by Unsloth because the A100 supports it.

Core hyperparameters

Parameter	Value
Max sequence length	4096
Num epochs	2
Learning rate	2e-5
Warmup steps	100
Warmup ratio	0.03
Batch size	24
Gradient accumulation	1
Effective batch size	24
Seed	42
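
The table lists both a warmup step count and a warmup ratio. The sketch below shows how the effective batch size follows from the values above and how the two warmup settings would typically be reconciled. It assumes the precedence used by Hugging Face `TrainingArguments` in Transformers 4.x, where a positive `warmup_steps` overrides `warmup_ratio`; whether the notebook relied on that behavior is not stated in this card. The helper `resolve_warmup` is hypothetical.

```python
import math

per_device_batch_size = 24
gradient_accumulation_steps = 1
num_devices = 1  # single-GPU run
warmup_steps = 100
warmup_ratio = 0.03
total_steps = 2776  # optimization steps reported in the trainer log

effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)


def resolve_warmup(total_steps, warmup_steps, warmup_ratio):
    """Assumed precedence: an explicit positive step count wins over the ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)


print(effective_batch_size)
print(resolve_warmup(total_steps, warmup_steps, warmup_ratio))
print(resolve_warmup(total_steps, 0, warmup_ratio))  # ratio-only alternative
```

If that precedence applies, the ratio in the table had no effect on the schedule and the 100-step value was the one in force.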

Optimization stack

The model was loaded with QLoRA 4-bit weights during training, while the final merged checkpoint was saved in 16-bit merged form for deployment and inference use.

The W&B config logged the optimizer as adamw_8bit, while the trainer config used fused AdamW (adamw_torch_fused) in the notebook training arguments.

Sequence packing was enabled, dataset preprocessing used multiprocessing, and periodic evaluation/checkpoint saving was configured during the run.

LoRA / PEFT details

The fine-tuning used:

LoRA rank = 32
LoRA alpha = 64

Target modules:

q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj

The run reported approximately:

36.9M trainable parameters

which corresponded to around 2.34%–4.0% of total parameters depending on counting conventions.
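
The trainable-parameter figure can be sanity-checked by hand, since a bias-free LoRA adapter of rank r on a linear layer adds r × (in_features + out_features) parameters. The sketch below does that arithmetic. The MLP intermediate size of 8960 is an assumption taken from the public Qwen2.5-1.5B configuration and is not stated in this card, as is the head dimension of hidden size divided by attention heads.

```python
RANK = 32
NUM_LAYERS = 28
HIDDEN = 1536
NUM_HEADS = 12
NUM_KV_HEADS = 2
INTERMEDIATE = 8960  # assumption: Qwen2.5-1.5B MLP width, not given in this card

HEAD_DIM = HIDDEN // NUM_HEADS    # assumption: standard Qwen2 head size
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # width of the key and value projections

# (in_features, out_features) for each targeted module in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters in the A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total_lora_params = per_layer * NUM_LAYERS
print(f"{total_lora_params:,}")
```

Under these assumptions the total comes to 36,929,536, in line with the approximately 36.9M reported by the run.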

Hardware and runtime

Training hardware:

NVIDIA A100-SXM4-40GB
~42.4 GB VRAM exposed
Compute capability 8.0
BF16 support
Flash Attention 2 support

The run specifically targeted A100-native BF16 and Flash Attention 2 optimizations.

Total training runtime was approximately:

5 hours 14 minutes

Checkpointing and merge

Intermediate checkpoints were pushed to:

abhinav0231/Lily-1.5b-distill-v3-checkpoints

during training.

The workflow included auto-resume logic from the latest Hugging Face checkpoint.
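
The notebook's resume code is not shown in this card. As a rough sketch of the general idea, the helper below picks the most recent checkpoint from a list of folder names, assuming the `checkpoint-<step>` naming that the Hugging Face Trainer uses by default. The function name `latest_checkpoint` and the sample folder names are hypothetical.

```python
import re

_CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(folder_names):
    """Return the checkpoint folder with the highest step, or None if absent."""
    best_name = None
    best_step = -1
    for name in folder_names:
        match = _CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_name = name
    return best_name


# Hypothetical folder listing, e.g. from a checkpoint repository.
folders = ["checkpoint-500", "checkpoint-2500", "README.md", "checkpoint-1000"]
print(latest_checkpoint(folders))
```

The selected folder would then be downloaded and passed to the trainer's resume option; a `None` result means training starts from the base model.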

After training, the LoRA adapter was merged back into the base model in BF16/16-bit form and pushed as:

abhinav0231/Lily-1.5b-v0.3

The notebook also included GGUF export paths for quantized deployment variants.

Training logs

The trainer log reported:

33,297 packed training examples
2 epochs
2,776 optimization steps
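
These three figures are consistent with the batch size of 24 listed under Core hyperparameters, as the short calculation below shows. It assumes the final partial batch of each epoch is kept rather than dropped.

```python
import math

packed_examples = 33_297
batch_size = 24  # effective batch size from the hyperparameter table
num_epochs = 2

# Assumption: the last, smaller batch in each epoch still counts as a step.
steps_per_epoch = math.ceil(packed_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

The result matches the 2,776 optimization steps in the log. Packing also explains why the packed example count is much lower than the 91,457 raw training rows.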

Validation loss decreased from:

9.100862 at step 500 to
8.973075 at step 2500

These values should be interpreted as internal training diagnostics rather than direct end-user quality metrics.

Intended use

This model is intended for:

instruction-following chat experiments
structured answer generation
research on distilled reasoning-style outputs
lightweight local or hosted inference in the 1.5B parameter class

It is especially suited to prompts where:

a user asks for explanations or breakdowns
the desired answer format is structured
the prompt resembles the ChatML style used during training

Prompting notes

Because the training data is ChatML-formatted, best results usually come from chat-style prompting rather than plain raw completion prompting.

The model may respond in a more verbose tutor-like style because many training prompts encouraged detailed reasoning followed by a final answer.

If a cleaner direct-answer style is preferred, using a concise system prompt and explicitly requesting short outputs can help steer generation.

Limitations

This model was trained on synthetic/distilled instruction data rather than broad raw web-scale pretraining data.

As a result:

outputs may reflect teacher-style formatting biases
responses may become over-structured
reasoning markup may occasionally appear in generations

The dataset sanity checks also flagged formatting irregularities in sampled rows, including repeated markers and malformed counts, so downstream behavior may inherit some formatting artifacts from the source corpus.
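
Where reasoning markup does appear in a generation, it can be separated from the final answer in post-processing. The sketch below is one way to do that; it assumes the blocks are closed with `</think>` and `</answer>` tags, which this card does not confirm, and the function name `split_reasoning` is hypothetical.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, final_answer) extracted from a model generation."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    reasoning = think.group(1).strip() if think else ""
    if answer:
        final = answer.group(1).strip()
    else:
        # No answer block: drop any think block and keep the remaining text.
        final = _THINK.sub("", text).strip()
    return reasoning, final


print(split_reasoning("<think>2 + 2 is 4</think><answer>4</answer>"))
print(split_reasoning("Plain reply with no markup."))
```

Because the source data contains some malformed markers, unclosed tags will not match these patterns, and the text is then returned unchanged as the final answer.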

Safety

This model is not designed for fully autonomous use in high-stakes domains such as:

legal
medical
financial
safety-critical systems

Outputs can still be:

incorrect
incomplete
overconfident

Human review is recommended for consequential use cases.

Usage

Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "abhinav0231/Lily-1.5b-v0.3"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain overfitting in simple terms."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Suggested prompting

For best results:

use chat-style prompts,
keep instructions explicit,
specify desired format,
request concise output if you do not want long reasoning-style responses.

Provenance

Base model: abhinav0231/Lily-1.5b-v0.1
Training dataset: abhinav0231/Sarvam-105b-Distill-100k (chatml)
Training framework: Unsloth + TRL
Hardware: 1x NVIDIA A100-SXM4-40GB
Final merged repo: abhinav0231/Lily-1.5b-v0.3

Acknowledgements

This model was trained with Unsloth, Hugging Face Transformers, TRL, PEFT/LoRA-style fine-tuning, and W&B logging in a Modal-hosted workflow.

This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Model tree for abhinav0231/Lily-1.5b-v0.3

Base model: Qwen/Qwen2.5-1.5B
Finetuned: Qwen/Qwen2.5-1.5B-Instruct
Finetuned: abhinav0231/Lily-1.5b-v0.1
Adapter: this model
