Lily-1.5b-v0.3
Lily-1.5b-v0.3 is a distilled instruction-tuned language model built by continuing training from abhinav0231/Lily-1.5b-v0.1 on the abhinav0231/Sarvam-105b-Distill-100k dataset using the chatml split/configuration.
This version was trained as an offline supervised fine-tuning run focused on high-quality long-form assistant responses in ChatML format, with many examples following an explicit <think> and <answer> structure.
The model was trained and merged in a single-GPU Modal workflow on an NVIDIA A100-SXM4-40GB system using BF16, QLoRA, and Unsloth.
Model summary
This checkpoint starts from abhinav0231/Lily-1.5b-v0.1 and applies a distillation-style supervised fine-tuning stage rather than training from scratch.
The base architecture loaded during training is a Qwen2-style causal language model with:
- 28 layers
- hidden size 1536
- 12 attention heads
- 2 key-value heads
- vocabulary size 151,936
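As a rough illustration of what these numbers imply, the sketch below derives the per-head dimension and the grouped-query attention layout. It assumes the usual Qwen2 convention that the head dimension equals the hidden size divided by the number of attention heads. The class and property names are illustrative and are not taken from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Illustrative container for the architecture values listed above."""

    hidden_size: int = 1536
    num_attention_heads: int = 12
    num_key_value_heads: int = 2

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: several query heads share one key-value head.
        return self.num_attention_heads // self.num_key_value_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of the key and value projections.
        return self.num_key_value_heads * self.head_dim


shape = AttentionShape()
print(shape.head_dim, shape.queries_per_kv_head, shape.kv_proj_width)  # 128 6 256
```

Under that assumption, each key-value head serves six query heads, which is what keeps the KV cache small relative to a full multi-head layout.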
The training setup targets:
- instruction following
- structured response generation
- distilled reasoning-flavored outputs
rather than pure base-model continuation pretraining.
Training objective
The goal of v0.3 was to improve the model through offline SFT distillation from a synthetic/teacher-style dataset while preserving the usability and compact size of the 1.5B-class base model.
The dataset examples are preformatted as ChatML conversations and frequently instruct the assistant to reason in a <think> block before producing a final <answer> block.
Because of that training distribution, the model may naturally produce more structured, tutor-like, stepwise outputs than the earlier checkpoint depending on the prompt style.
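Downstream code that only wants the final answer may need to separate it from the reasoning. The following is a minimal sketch of one way to do that. It assumes the blocks are closed with `</think>` and `</answer>` tags, which the card does not state explicitly, and the helper name `split_reply` is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedReply(NamedTuple):
    think: Optional[str]
    answer: str


# Assumption: blocks are delimited as <think>...</think> and <answer>...</answer>.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_reply(text: str) -> ParsedReply:
    """Separate reasoning from the final answer, tolerating missing tags."""
    think_match = _THINK.search(text)
    answer_match = _ANSWER.search(text)
    think = think_match.group(1).strip() if think_match else None
    if answer_match:
        answer = answer_match.group(1).strip()
    else:
        # No <answer> block: fall back to the text with any <think> block removed.
        answer = _THINK.sub("", text).strip()
    return ParsedReply(think, answer)


reply = split_reply("<think>2 + 2 is 4.</think><answer>4</answer>")
print(reply.answer)  # 4
```

The fallback branch matters because the model does not always emit the structure, so a parser that requires both tags would discard otherwise usable replies.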
Base model
- Base model: abhinav0231/Lily-1.5b-v0.1
- Final merged model repo: abhinav0231/Lily-1.5b-v0.3
- GGUF repo: abhinav0231/Lily-1.5b-v0.3-GGUF
Benchmarks
Evaluation setup using lm-evaluation-harness, v0.3 achieved:
Dataset
The main training dataset is:
abhinav0231/Sarvam-105b-Distill-100k
using the chatml configuration, stored as a single text column of preformatted conversations.
The final training notebook loaded:
- 91,457 training examples
- 1,908 validation examples
A separate sanity-check pass over the dataset family showed a very similar distribution, including:
- 92,040 training examples
- 1,917 validation examples
- 1,918 test examples
confirming the same overall ChatML reasoning-style format.
Dataset style
The dataset uses ChatML with <|im_start|> and <|im_end|> delimiters and includes a chat template in the tokenizer setup.
Many examples use a system prompt that explicitly asks the assistant to think through the problem in a <think> block and then give the final response in an <answer> block.
This means the model was not trained on plain raw instruction-response text alone; it was trained on a formatted conversational distribution with strong structural priors.
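To make the format concrete, here is a hand-rolled sketch of the generic ChatML layout. It is an approximation for illustration only: the tokenizer's own chat template is authoritative and may differ in details such as default system prompts. The function name `render_chatml` and the example system prompt are hypothetical.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in the generic ChatML layout (illustrative, not the tokenizer template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "Be brief."},  # hypothetical system prompt
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

In practice, `tokenizer.apply_chat_template` is the safer route because it uses the template shipped with the model.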
Length characteristics
A 5,000-sample sanity slice of the training set had:
- mean length = 1640.72 tokens
- p50 = 1219
- p90 = 3221
- p95 = 4096.15
- p99 = 6883.35
About 5.00% of sampled training examples and 4.33% of sampled validation examples exceeded 4096 tokens.
These numbers matter because the training run used a 4096-token max sequence length, so the longest examples are subject to truncation or packing effects depending on preprocessing behavior.
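Statistics of this kind can be reproduced from a list of per-example token counts. The sketch below is one possible way to compute them and is not the notebook's actual code. It assumes linear interpolation between ranks, which would be consistent with fractional values such as 4096.15, and the toy `lengths` list is made up for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile for q in [0, 100] using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def fraction_over(values: Sequence[float], limit: float) -> float:
    """Share of examples strictly longer than the limit."""
    return sum(1 for v in values if v > limit) / len(values)


# Toy token counts, not real dataset statistics.
lengths = [100, 200, 300, 400, 5000]
print(percentile(lengths, 50))       # 300.0
print(fraction_over(lengths, 4096))  # 0.2
```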
Training setup
Training was run on a single NVIDIA A100-SXM4-40GB GPU in Modal, without:
- DDP
- accelerate launch
- multi-process orchestration
The environment used:
- Unsloth 2026.5.2
- TRL 0.22.2
- PyTorch 2.8.0+cu129
- CUDA 12.9
- Triton 3.4.0
- BF16 mixed precision
Flash Attention 2 was auto-enabled by Unsloth because the A100 supports it.
Core hyperparameters
| Parameter | Value |
|---|---|
| Max sequence length | 4096 |
| Num epochs | 2 |
| Learning rate | 2e-5 |
| Warmup steps | 100 |
| Warmup ratio | 0.03 |
| Batch size | 24 |
| Gradient accumulation | 1 |
| Effective batch size | 24 |
| Seed | 42 |
Optimization stack
The model was loaded with QLoRA 4-bit weights during training, while the final merged checkpoint was saved in 16-bit merged form for deployment and inference use.
The W&B config logged the optimizer as adamw_8bit, while the trainer config used fused AdamW (adamw_torch_fused) in the notebook training arguments.
Sequence packing was enabled, dataset preprocessing used multiprocessing, and periodic evaluation/checkpoint saving was configured during the run.
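For readers unfamiliar with packing, the idea is to concatenate several short tokenized examples into one sequence of at most the max length so that fewer padding tokens are wasted. The sketch below shows a simple greedy variant under stated assumptions. It is not the TRL or Unsloth implementation, which may split examples across boundaries and also has to handle attention masks and position IDs. The function name `pack_sequences` is hypothetical.

```python
from typing import Iterable, List


def pack_sequences(examples: Iterable[List[int]], max_len: int) -> List[List[int]]:
    """Greedily concatenate token lists into bins of at most max_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        tokens = tokens[:max_len]  # over-long examples are truncated in this sketch
        if current and len(current) + len(tokens) > max_len:
            # The next example does not fit: close the current bin.
            packed.append(current)
            current = []
        current.extend(tokens)
    if current:
        packed.append(current)
    return packed


bins = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10] * 7], max_len=5)
print(bins)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9], [10, 10, 10, 10, 10]]
```

Packing of this kind is why the number of sequences the trainer sees can be much smaller than the number of raw examples.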
LoRA / PEFT details
The fine-tuning used:
- LoRA rank = 32
- LoRA alpha = 64
Target modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
The run reported approximately:
- 36.9M trainable parameters
which corresponded to around 2.34%–4.0% of total parameters depending on counting conventions.
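The trainable-parameter figure can be sanity-checked by hand, since a LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below does that arithmetic. It assumes an MLP intermediate size of 8960, taken from the public Qwen2.5-1.5B configuration rather than from this card, and a head dimension of hidden size divided by attention heads.

```python
HIDDEN = 1536
NUM_LAYERS = 28
NUM_HEADS = 12
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS      # 128, assuming the usual Qwen2 convention
KV_WIDTH = NUM_KV_HEADS * HEAD_DIM  # 256
INTERMEDIATE = 8960                 # ASSUMPTION: from the public Qwen2.5-1.5B config
RANK = 32

# (in_features, out_features) for each targeted projection in one layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the A (rank x in) and B (out x rank) matrices of one adapter."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters (~{total / 1e6:.1f}M)")
# 36,929,536 trainable LoRA parameters (~36.9M)
```

Under these assumptions the count lands on the reported 36.9M. The percentage then depends on the denominator, for example whether adapter weights are included in the total or whether 4-bit packed weights are counted as stored.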
Hardware and runtime
Training hardware:
- NVIDIA A100-SXM4-40GB
- ~42.4 GB VRAM exposed
- Compute capability 8.0
- BF16 support
- Flash Attention 2 support
The run specifically targeted A100-native BF16 and Flash Attention 2 optimizations.
Total training runtime was approximately:
- 5 hours 14 minutes
Checkpointing and merge
Intermediate checkpoints were pushed to:
abhinav0231/Lily-1.5b-distill-v3-checkpoints
during training.
The workflow included auto-resume logic from the latest Hugging Face checkpoint.
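The notebook's resume logic is not shown in this card. As a hedged sketch of the core step, the function below picks the most recent checkpoint from a list of folder names, assuming they follow the Trainer's `checkpoint-<step>` naming. The function name `latest_checkpoint` and the example names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumption: checkpoints are named "checkpoint-<step>", as the HF Trainer does by default.
_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint name with the highest step, or None if there is none."""
    best_step, best_name = -1, None
    for name in names:
        match = _CHECKPOINT.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


print(latest_checkpoint(["checkpoint-500", "checkpoint-2500", "checkpoint-1000", "README.md"]))
# checkpoint-2500
```

Comparing the parsed step as an integer avoids the lexicographic trap where `checkpoint-500` would sort after `checkpoint-2500`.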
After training, the LoRA adapter was merged back into the base model in BF16/16-bit form and pushed as:
abhinav0231/Lily-1.5b-v0.3
The notebook also included GGUF export paths for quantized deployment variants.
Training logs
The trainer log reported:
- 33,297 packed training examples
- 2 epochs
- 2,776 optimization steps
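These figures are consistent with the hyperparameters reported earlier. The arithmetic below is a quick check, assuming the last partial batch of each epoch is kept rather than dropped; the variable names are illustrative.

```python
import math

packed_examples = 33_297
per_device_batch = 24
grad_accum = 1
num_gpus = 1
epochs = 2

effective_batch = per_device_batch * grad_accum * num_gpus
# Assumption: the final partial batch counts as a step (no drop_last).
steps_per_epoch = math.ceil(packed_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 24 1388 2776
```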
Validation loss decreased from 9.100862 at step 500 to 8.973075 at step 2500.
These values should be interpreted as internal training diagnostics rather than direct end-user quality metrics.
Intended use
This model is intended for:
- instruction-following chat experiments
- structured answer generation
- research on distilled reasoning-style outputs
- lightweight local or hosted inference in the 1.5B parameter class
It is especially suited to prompts where:
- a user asks for explanations or breakdowns
- the desired answer format is structured
- the prompt resembles the ChatML style used during training
Prompting notes
Because the training data is ChatML-formatted, best results usually come from chat-style prompting rather than plain raw completion prompting.
The model may respond in a more verbose tutor-like style because many training prompts encouraged detailed reasoning followed by a final answer.
If a cleaner direct-answer style is preferred, using a concise system prompt and explicitly requesting short outputs can help steer generation.
Limitations
This model was trained on synthetic/distilled instruction data rather than broad raw web-scale pretraining data.
As a result:
- outputs may reflect teacher-style formatting biases
- responses may become over-structured
- reasoning markup may occasionally appear in generations
The dataset sanity checks also flagged formatting irregularities in sampled rows, including repeated markers and malformed counts, so downstream behavior may inherit some formatting artifacts from the source corpus.
Safety
This model is not designed for fully autonomous use in high-stakes domains such as:
- legal
- medical
- financial
- safety-critical systems
Outputs can still be:
- incorrect
- incomplete
- overconfident
Human review is recommended for consequential use cases.
Usage
Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "abhinav0231/Lily-1.5b-v0.3"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain overfitting in simple terms."},
]

# Render the conversation with the tokenizer's ChatML chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
    )

# generate() returns a batch of sequences; decode the first (and only) one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Suggested prompting
For best results:
- use chat-style prompts,
- keep instructions explicit,
- specify desired format,
- request concise output if you do not want long reasoning-style responses.
Provenance
- Base model: abhinav0231/Lily-1.5b-v0.1
- Training dataset: abhinav0231/Sarvam-105b-Distill-100k (chatml)
- Training framework: Unsloth + TRL
- Hardware: 1x NVIDIA A100-SXM4-40GB
- Final merged repo: abhinav0231/Lily-1.5b-v0.3
Acknowledgements
This model was trained with Unsloth, Hugging Face Transformers, TRL, PEFT/LoRA-style fine-tuning, and W&B logging in a Modal-hosted workflow.
This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
Model tree for abhinav0231/Lily-1.5b-v0.3
Base model
Qwen/Qwen2.5-1.5B
