Lily-1.5b-v0.3

Lily-1.5b-v0.3 is a distilled instruction-tuned language model built by continuing training from abhinav0231/Lily-1.5b-v0.1 on the abhinav0231/Sarvam-105b-Distill-100k dataset using the chatml split/configuration.

This version was trained as an offline supervised fine-tuning run focused on high-quality long-form assistant responses in ChatML format, with many examples following an explicit <think> and <answer> structure.

The model was trained and merged in a single-GPU Modal workflow on an NVIDIA A100-SXM4-40GB system using BF16, QLoRA, and Unsloth.


Model summary

This checkpoint starts from abhinav0231/Lily-1.5b-v0.1 and applies a distillation-style supervised fine-tuning stage rather than training from scratch.

The base architecture loaded during training is a Qwen2-style causal language model with:

  • 28 layers
  • hidden size 1536
  • 12 attention heads
  • 2 key-value heads
  • vocabulary size 151,936
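
The attention shape implied by these values can be worked out with a few lines of arithmetic. This is a minimal sketch, assuming the head dimension equals hidden size divided by the number of attention heads (the usual Qwen2 convention; the card does not state it explicitly).

```python
# Values taken from the architecture list above.
HIDDEN_SIZE = 1536
NUM_ATTENTION_HEADS = 12
NUM_KEY_VALUE_HEADS = 2

# ASSUMPTION: head_dim = hidden_size / num_attention_heads (not stated in the card).
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS

# Grouped-query attention: several query heads share one key-value head.
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS

# Output width of the k_proj and v_proj layers under the assumption above.
kv_proj_width = NUM_KEY_VALUE_HEADS * head_dim

print(f"head_dim={head_dim}")                        # head_dim=128
print(f"queries_per_kv_head={queries_per_kv_head}")  # queries_per_kv_head=6
print(f"kv_proj_width={kv_proj_width}")              # kv_proj_width=256
```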

The training setup targets:

  • instruction following
  • structured response generation
  • distilled reasoning-flavored outputs

rather than pure base-model continuation pretraining.


Training objective

The goal of v0.3 was to improve the model through offline SFT distillation from a synthetic/teacher-style dataset while preserving the usability and compact size of the 1.5B-class base model.

The dataset examples are preformatted as ChatML conversations and frequently instruct the assistant to reason in a <think> block before producing a final <answer> block.

Because of that training distribution, the model may naturally produce more structured, tutor-like, stepwise outputs than the earlier checkpoint depending on the prompt style.
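
Downstream code may want to separate the reasoning from the final response. The helper below is an illustrative sketch, not part of the model's tooling: `split_think_answer` is a hypothetical name, and it assumes the blocks are closed with `</think>` and `</answer>` tags, which the card does not spell out.

```python
import re

# ASSUMPTION: blocks are delimited by <think>...</think> and <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_think_answer(text: str) -> dict:
    """Return the first think and answer blocks found in a generation.

    If no answer block is present, the whole stripped text is returned as
    the answer so that unstructured outputs still yield something usable.
    """
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return {
        "think": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else text.strip(),
    }


example = "<think>2 + 2 is 4.</think>\n<answer>4</answer>"
print(split_think_answer(example))
```

Falling back to the raw text keeps the helper usable when the model answers without any markup, which can happen depending on the prompt style.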


Base model

  • Base model: abhinav0231/Lily-1.5b-v0.1
  • Final merged model repo: abhinav0231/Lily-1.5b-v0.3
  • GGUF repo: abhinav0231/Lily-1.5b-v0.3-GGUF

Benchmarks

Evaluated with lm-evaluation-harness, v0.3 achieved:

[Image: v0.3 benchmark results]


Dataset

The main training dataset is:

abhinav0231/Sarvam-105b-Distill-100k

using the chatml configuration, stored as a single text column of preformatted conversations.

The final training notebook loaded:

  • 91,457 training examples
  • 1,908 validation examples

A separate sanity-check pass over the dataset family found similar split sizes:

  • 92,040 training examples
  • 1,917 validation examples
  • 1,918 test examples

and the same overall ChatML reasoning-style format.


Dataset style

The dataset uses ChatML with:

  • <|im_start|>
  • <|im_end|>

delimiters and includes a chat template in the tokenizer setup.

Many examples use a system prompt that explicitly asks the assistant to think through the problem in a <think> block and then give the final response in an <answer> block.

This means the model was not trained on plain raw instruction-response text alone; it was trained on a formatted conversational distribution with strong structural priors.
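
For readers unfamiliar with the format, the sketch below renders a message list into ChatML text by hand. It assumes the common ChatML layout (role on the first line, a newline after each `<|im_end|>`); `to_chatml` is a hypothetical helper, and the tokenizer's own chat template remains the authoritative source for the exact formatting.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML text.

    ASSUMPTION: the standard ChatML layout; the tokenizer's chat template
    may differ in details such as default system prompts or whitespace.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]
print(to_chatml(chat, add_generation_prompt=True))
```

In practice `tokenizer.apply_chat_template` should be preferred; the manual version is only meant to make the delimiters visible.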


Length characteristics

A 5,000-sample sanity slice of the training set had:

  • mean length = 1640.72 tokens
  • p50 = 1219
  • p90 = 3221
  • p95 = 4096.15
  • p99 = 6883.35

About:

  • 5.00% of sampled training examples
  • 4.33% of sampled validation examples

exceeded 4096 tokens.

These numbers matter because the training run used a 4096 token max sequence length, so the longest examples are subject to truncation or packing effects depending on preprocessing behavior.
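
Statistics of this kind can be reproduced from a list of per-example token counts. The sketch below is an assumption about method, not the notebook's actual code: it uses linear interpolation between ranks for percentiles, and `length_summary` and `percentile` are hypothetical names.

```python
import math


def percentile(sorted_values, q):
    """Percentile with linear interpolation between neighbouring ranks."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def length_summary(token_lengths, max_len=4096):
    """Summarise token lengths and the share of examples above max_len."""
    values = sorted(token_lengths)
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "frac_over_max": sum(v > max_len for v in values) / len(values),
    }


# Toy input; real use would pass the token counts of the sampled examples.
print(length_summary([1000, 2000, 3000, 4000, 5000]))
```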


Training setup

Training was run on a single NVIDIA A100-SXM4-40GB GPU in Modal, without:

  • DDP
  • accelerate launch
  • multi-process orchestration

The environment used:

  • Unsloth 2026.5.2
  • TRL 0.22.2
  • PyTorch 2.8.0+cu129
  • CUDA 12.9
  • Triton 3.4.0
  • BF16 mixed precision

Flash Attention 2 was auto-enabled by Unsloth because the A100 supports it.


Core hyperparameters

Parameter                Value
Max sequence length      4096
Num epochs               2
Learning rate            2e-5
Warmup steps             100
Warmup ratio             0.03
Batch size               24
Gradient accumulation    1
Effective batch size     24
Seed                     42
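
The effective batch size follows directly from the other values. The sketch below assumes that "Batch size" means the per-device batch size and that one device was used, as described in the training setup; the token figure is an upper bound that only holds when every sequence is filled to the maximum length.

```python
# Values from the hyperparameter table above.
PER_DEVICE_BATCH_SIZE = 24        # ASSUMPTION: "Batch size" is per device
GRADIENT_ACCUMULATION_STEPS = 1
NUM_DEVICES = 1                   # single-GPU run
MAX_SEQ_LEN = 4096

effective_batch_size = (
    PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
)

# Upper bound on tokens processed per optimizer step.
max_tokens_per_step = effective_batch_size * MAX_SEQ_LEN

print(effective_batch_size)   # 24
print(max_tokens_per_step)    # 98304
```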

Optimization stack

The model was loaded with QLoRA 4-bit weights during training, while the final merged checkpoint was saved in 16-bit merged form for deployment and inference use.

The optimizer is recorded inconsistently: the W&B config logged adamw_8bit, while the notebook training arguments set fused AdamW (adamw_torch_fused).

Sequence packing was enabled, dataset preprocessing used multiprocessing, and periodic evaluation/checkpoint saving was configured during the run.


LoRA / PEFT details

The fine-tuning used:

  • LoRA rank = 32
  • LoRA alpha = 64

Target modules:

  • q_proj
  • k_proj
  • v_proj
  • o_proj
  • gate_proj
  • up_proj
  • down_proj

The run reported approximately:

  • 36.9M trainable parameters

which corresponded to around 2.34%–4.0% of total parameters depending on counting conventions.
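
The trainable-parameter figure can be cross-checked from the architecture and LoRA settings. This is a back-of-the-envelope sketch under stated assumptions: each adapted linear layer adds rank × (in_features + out_features) parameters, the head dimension is hidden size divided by attention heads, and the MLP intermediate size is 8960. That last value is not given in this card; it is assumed from the Qwen2.5-1.5B family configuration.

```python
# Values from the model summary and LoRA settings above.
NUM_LAYERS = 28
HIDDEN = 1536
NUM_HEADS = 12
NUM_KV_HEADS = 2
RANK = 32

# ASSUMPTION: MLP intermediate size, not stated in this card.
INTERMEDIATE = 8960

head_dim = HIDDEN // NUM_HEADS      # assumed convention
kv_dim = NUM_KV_HEADS * head_dim    # width of k_proj / v_proj outputs

# (in_features, out_features) for each targeted module.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, kv_dim),
    "v_proj": (HIDDEN, kv_dim),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features, out_features, rank):
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in module_shapes.values())
total_trainable = per_layer * NUM_LAYERS

print(per_layer)         # 1318912
print(total_trainable)   # 36929536
```

Under these assumptions the total comes to about 36.9M, which matches the figure reported by the run.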


Hardware and runtime

Training hardware:

  • NVIDIA A100-SXM4-40GB
  • ~42.4 GB VRAM exposed
  • Compute capability 8.0
  • BF16 support
  • Flash Attention 2 support

The run specifically targeted A100-native BF16 and Flash Attention 2 optimizations.

Total training runtime was approximately:

  • 5 hours 14 minutes

Checkpointing and merge

Intermediate checkpoints were pushed to:

abhinav0231/Lily-1.5b-distill-v3-checkpoints

during training.

The workflow included auto-resume logic from the latest Hugging Face checkpoint.

After training, the LoRA adapter was merged back into the base model in BF16/16-bit form and pushed as:

abhinav0231/Lily-1.5b-v0.3

The notebook also included GGUF export paths for quantized deployment variants.
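
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix. The toy sketch below illustrates that arithmetic in plain Python, assuming the standard LoRA scaling of alpha / rank (not a rank-stabilized variant). It is not the Unsloth or PEFT merge code, and `merge_lora` is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * B @ A.

    weight: (out x in), lora_a: (rank x in), lora_b: (out x rank).
    ASSUMPTION: standard LoRA scaling alpha / rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 example with rank 1 and alpha 2, so the scale is 2.0.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [1.0]]
print(merge_lora(W, A, B, rank=1, alpha=2))   # [[2.0, 2.0], [2.0, 5.0]]

# With the settings of this run (rank 32, alpha 64) the scale is also 2.0.
print(64 / 32)   # 2.0
```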


Training logs

The trainer log reported:

  • 33,297 packed training examples
  • 2 epochs
  • 2,776 optimization steps

Validation loss decreased from 9.100862 at step 500 to 8.973075 at step 2500.

These values should be interpreted as internal training diagnostics rather than direct end-user quality metrics.
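
The reported step count is consistent with the packed example count and batch size. The check below is a sketch that assumes the final partial batch of each epoch is kept rather than dropped. It also shows what the listed warmup ratio would imply in steps, for comparison with the explicit warmup steps value; which of the two the scheduler actually used is not stated in this card.

```python
import math

# Values from the trainer log and hyperparameter table.
PACKED_EXAMPLES = 33_297
EFFECTIVE_BATCH_SIZE = 24
NUM_EPOCHS = 2
WARMUP_RATIO = 0.03
WARMUP_STEPS = 100

# ASSUMPTION: the last, partially filled batch of each epoch is not dropped.
steps_per_epoch = math.ceil(PACKED_EXAMPLES / EFFECTIVE_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

# Warmup length if the ratio, rather than the explicit step count, applied.
ratio_based_warmup = math.ceil(total_steps * WARMUP_RATIO)

print(steps_per_epoch)      # 1388
print(total_steps)          # 2776
print(ratio_based_warmup)   # 84
```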


Intended use

This model is intended for:

  • instruction-following chat experiments
  • structured answer generation
  • research on distilled reasoning-style outputs
  • lightweight local or hosted inference in the 1.5B parameter class

It is especially suited to prompts where:

  • a user asks for explanations or breakdowns
  • the desired answer format is structured
  • the prompt resembles the ChatML style used during training

Prompting notes

Because the training data is ChatML-formatted, best results usually come from chat-style prompting rather than plain raw completion prompting.

The model may respond in a more verbose tutor-like style because many training prompts encouraged detailed reasoning followed by a final answer.

If a cleaner direct-answer style is preferred, using a concise system prompt and explicitly requesting short outputs can help steer generation.


Limitations

This model was trained on synthetic/distilled instruction data rather than broad raw web-scale pretraining data.

As a result:

  • outputs may reflect teacher-style formatting biases
  • responses may become over-structured
  • reasoning markup may occasionally appear in generations

The dataset sanity checks also flagged formatting irregularities in sampled rows, including repeated markers and malformed counts, so downstream behavior may inherit some formatting artifacts from the source corpus.
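
A lightweight check can flag generations that show such artifacts before they reach downstream consumers. The heuristic below is only an illustration: `check_response` is a hypothetical helper, the closing tags `</think>` and `</answer>` are assumed, and the rules (balanced tags, at most one block of each kind per response) are one possible convention rather than the checks used on the dataset.

```python
def check_response(text):
    """Count structural markers in one response and list simple problems.

    ASSUMPTION: a well-formed response has balanced tags and at most one
    think block and one answer block.
    """
    tags = ("<think>", "</think>", "<answer>", "</answer>")
    counts = {tag: text.count(tag) for tag in tags}
    issues = []
    for name in ("think", "answer"):
        opened = counts[f"<{name}>"]
        closed = counts[f"</{name}>"]
        if opened != closed:
            issues.append(f"unbalanced {name} tags")
        if opened > 1:
            issues.append(f"repeated {name} block")
    return counts, issues


print(check_response("<think>a</think><answer>b</answer>")[1])   # []
print(check_response("<think>a<think>b</think><answer>c")[1])
```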


Safety

This model is not designed for fully autonomous use in high-stakes domains such as:

  • legal
  • medical
  • financial
  • safety-critical systems

Outputs can still be:

  • incorrect
  • incomplete
  • overconfident

Human review is recommended for consequential use cases.


Usage

Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "abhinav0231/Lily-1.5b-v0.3"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain overfitting in simple terms."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Suggested prompting

For best results:

  • use chat-style prompts,
  • keep instructions explicit,
  • specify desired format,
  • request concise output if you do not want long reasoning-style responses.

Provenance

  • Base model: abhinav0231/Lily-1.5b-v0.1
  • Training dataset: abhinav0231/Sarvam-105b-Distill-100k (chatml)
  • Training framework: Unsloth + TRL
  • Hardware: 1x NVIDIA A100-SXM4-40GB
  • Final merged repo: abhinav0231/Lily-1.5b-v0.3

Acknowledgements

This model was trained with Unsloth, Hugging Face Transformers, TRL, PEFT/LoRA-style fine-tuning, and W&B logging in a Modal-hosted workflow.

This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
