LumiChats v1.1

A Fine-tuned Conversational AI Model Based on Llama 3.2 3B


📖 Overview

LumiChats v1.1 is a specialized conversational AI model built on top of Meta's Llama 3.2 3B Instruct foundation. This model has been fine-tuned using LoRA (Low-Rank Adaptation) with the Unsloth framework to deliver enhanced conversational capabilities while maintaining exceptional efficiency and performance.

Base Model: unsloth/Llama-3.2-3B-Instruct
Model Type: Conversational AI / Instruction-tuned Language Model
Parameters: 3.21 billion in the base model (3,237,063,680 total with the LoRA adapters attached)
Trainable Parameters: 24,313,856 (~0.75% via LoRA)
Architecture: Optimized Transformer with Auto-regressive Language Modeling


✨ Key Features

  • 💬 Enhanced Conversational Abilities: Fine-tuned on FineTome-100k for natural, engaging dialogue
  • 🚀 Efficient & Fast:
    • 2x faster training and inference with Unsloth optimizations
    • 4-bit quantization for reduced memory footprint
    • Only 0.75% of parameters trained via LoRA
  • 🌍 Multilingual Support: Supports 8+ languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)
  • 📱 Edge-Ready: Optimized for deployment on edge devices and mobile platforms
  • 🎯 Superior Instruction Following: Specialized training on response-only objectives
  • 🔒 Privacy-Focused: Can run entirely on-device without cloud dependencies
  • ⚡ Memory Efficient: Fine-tuning added only 2.35 GB of peak GPU memory on top of the loaded 4-bit model, using gradient checkpointing

🏗️ Architecture Details

LumiChats v1.1 inherits the robust architecture of Llama 3.2 3B:

  • Model Type: Auto-regressive transformer language model (LlamaForCausalLM)
  • Training Approach:
    • Base: Supervised Fine-Tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)
    • Fine-tuning: LoRA adapters with response-only training
  • Context Length: Up to 128,000 tokens in the base model (LoRA fine-tuning used max_seq_length: 2048)
  • Vocabulary Size: Llama 3 tokenizer with a 128,256-token multilingual vocabulary
  • Optimization: Structured pruning and knowledge distillation (applied by Meta when building the base model); 4-bit quantization during fine-tuning

LoRA Configuration Details

  • LoRA Rank (r): 16
  • LoRA Alpha: 16
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • LoRA Dropout: 0
  • Trainable Parameters: 24,313,856 (0.75% of total 3.2B parameters)
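
The trainable-parameter figure can be reproduced from the LoRA rank and the shapes of the targeted projections. The sketch below is illustrative only: the layer dimensions (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128) are assumptions taken from the public Llama 3.2 3B configuration, not from this card. Each adapted linear layer adds two low-rank matrices, so it contributes r × (in_features + out_features) parameters. With LoRA Alpha equal to the rank, the usual alpha / r scaling factor is 1.

```python
# Illustrative check of the LoRA trainable-parameter count.
# ASSUMPTION: layer dimensions come from the public Llama 3.2 3B config,
# not from this model card.
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 28
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128 (grouped-query attention)
RANK = 16

# (in_features, out_features) for each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * NUM_LAYERS
total = 3_237_063_680  # total parameter count reported in the Overview

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of total: {100 * trainable / total:.2f}%")
```

Under those assumed dimensions the script prints 24,313,856 trainable parameters and a 0.75% share, matching the numbers listed above.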

🎯 Intended Use Cases

LumiChats v1.1 excels at:

  • Conversational AI: Natural dialogue and chat applications
  • Personal Assistants: Task management and information retrieval
  • Content Generation: Writing assistance and creative text generation
  • Summarization: Document and conversation summarization
  • Question Answering: Knowledge retrieval and Q&A systems
  • Code Assistance: Basic coding help and explanations
  • On-Device Applications: Mobile AI assistants and offline chatbots

🚀 Quick Start

Using Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "adityakum667388/lumichats-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

# Generate response
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Using Unsloth for Inference (Fastest)

from unsloth import FastLanguageModel

# Load model with Unsloth (2x faster inference)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="adityakum667388/lumichats-v1.1",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # Memory efficient
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Chat template
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

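outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=128,
    do_sample=True,  # temperature and min_p only take effect when sampling
    temperature=1.5,
    min_p=0.1
)

# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))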

Chat Template Format

LumiChats v1.1 uses the Llama 3.1 chat template format:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Special Tokens:

  • <|begin_of_text|> - Beginning of sequence
  • <|start_header_id|> - Start of role header
  • <|end_header_id|> - End of role header
  • <|eot_id|> - End of turn
  • <|finetune_right_pad_id|> - Padding token
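
For runtimes that take a raw prompt string (such as the llama.cpp example in this card), the layout above can be assembled by hand. This is a minimal sketch that mirrors only the format shown in this section; `format_llama3_prompt` is a hypothetical helper name, and `tokenizer.apply_chat_template` remains the authoritative renderer, since the shipped template may add content (for example extra system-header lines) that this sketch does not.

```python
def format_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Render messages in the layout shown above and open an assistant turn.

    Hypothetical helper for illustration; not part of any library.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Generation prompt: an empty assistant header for the model to complete
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Generation should stop at `<|eot_id|>`, which marks the end of the assistant's turn.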

Using GGUF Format (llama.cpp)

from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="lumichats-v1.1-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1  # Use GPU acceleration
)

# Format prompt with chat template
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is machine learning?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

# Generate response
output = llm(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    stop=["<|eot_id|>", "<|end_of_text|>"]  # Llama 3 end-of-turn and end-of-text tokens
)

print(output['choices'][0]['text'])

Using Ollama

# Pull the model (if available on Ollama)
ollama pull lumichats-v1.1

# Run inference
ollama run lumichats-v1.1 "Explain quantum computing in simple terms"

📦 Available Model Formats

| Format | Size | Precision | Use Case |
|---|---|---|---|
| SafeTensors (FP16) | ~6.5 GB | Full precision | Training, fine-tuning, highest quality |
| GGUF (Q4_K_M) | ~2.0 GB | 4-bit quantized | Recommended - Best balance of size/quality |
| GGUF (Q5_K_M) | ~2.3 GB | 5-bit quantized | Higher quality, slightly larger |
| GGUF (Q8_0) | ~3.5 GB | 8-bit quantized | Near-full quality |
| GGUF (F16) | ~6.4 GB | Full precision GGUF | Maximum compatibility |
| LoRA Adapters | ~100 MB | Adapter weights only | For merging with base model |

Recommendation: For most users, Q4_K_M offers the best tradeoff between model size and output quality.
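
The listed sizes follow roughly from parameter count times average bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement: the bits-per-weight values for the quantized formats are approximate assumptions, file metadata is ignored, and the LoRA estimate assumes the adapters are stored as 32-bit floats.

```python
BASE_PARAMS = 3_212_749_824  # total reported in this card minus the LoRA parameters
LORA_PARAMS = 24_313_856

# ASSUMPTION: approximate average bits per weight for each format
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def size_gb(params: int, bits_per_weight: float) -> float:
    """Approximate size in decimal gigabytes, ignoring metadata overhead."""
    return params * bits_per_weight / 8 / 1e9


estimates = {
    name: round(size_gb(BASE_PARAMS, bpw), 2)
    for name, bpw in BITS_PER_WEIGHT.items()
}
lora_mb = LORA_PARAMS * 4 / 1e6  # ASSUMPTION: adapters stored as 32-bit floats

for name, gb in estimates.items():
    print(f"{name}: ~{gb} GB")
print(f"LoRA adapters: ~{lora_mb:.0f} MB")
```

The estimates land close to the sizes listed for each format, which is a useful sanity check when choosing a quantization for a given memory budget.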


💻 Hardware Requirements

Minimum Requirements

  • RAM: 4 GB (for Q4_K_M quantized version)
  • GPU: Optional, but recommended (4GB+ VRAM)
  • Storage: 2-7 GB depending on format

Recommended Setup

  • RAM: 8 GB or more
  • GPU: NVIDIA GPU with 6GB+ VRAM (RTX 3060, T4, or better)
  • CPU: Modern multi-core processor (for CPU inference)

Performance Estimates

  • GPU (T4): 20-40 tokens/second
  • GPU (T4 with Unsloth): 40-80 tokens/second (2x faster)
  • GPU (RTX 4090): 60-100+ tokens/second
  • CPU (High-end): 5-15 tokens/second

🎨 Training Details

Training Configuration

LumiChats v1.1 was fine-tuned with the following setup:

Framework & Optimization:

  • Base Model: unsloth/Llama-3.2-3B-Instruct
  • Training Framework: Unsloth 2026.1.4 (optimized fine-tuning)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Quantization: 4-bit during training (load_in_4bit=True)
  • Gradient Checkpointing: Unsloth-optimized for memory efficiency

Dataset & Preprocessing:

  • Dataset: mlabonne/FineTome-100k
  • Format: ShareGPT → HuggingFace chat format
  • Chat Template: Llama 3.1 template
  • Training Objective: Response-only training (masks user inputs)
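
Response-only training means the loss is computed on assistant tokens only; every other position gets a label the loss function ignores. The toy sketch below shows the masking idea in plain Python. It is not Unsloth's implementation: `mask_non_assistant` and the token ids are hypothetical, and a real pipeline would derive the role of each token from the chat template markers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_assistant(token_ids: list[int], roles: list[str]) -> list[int]:
    """Keep labels only where the token belongs to an assistant turn."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy sequence: hypothetical token ids tagged with the role each came from
token_ids = [11, 12, 13, 21, 22, 31, 32, 33]
roles = ["system"] * 3 + ["user"] * 2 + ["assistant"] * 3

labels = mask_non_assistant(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33]
```

Only the last three positions contribute to the loss, so gradient updates are driven by how well the model predicts its own replies rather than the prompts.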

Hardware & Performance:

  • GPU: Tesla T4 (Max memory: 14.741 GB)
  • Peak Memory Usage: 2.35 GB additional for training
  • Training Time: 8.54 minutes (512 seconds) for 60 steps
  • Speed: 2x faster than standard PyTorch training

Training Hyperparameters

training_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "effective_batch_size": 8,
    "warmup_steps": 5,
    "max_steps": 60,
    "learning_rate": 2e-4,
    "optimizer": "adamw_8bit",
    "weight_decay": 0.001,
    "lr_scheduler_type": "linear",
    "max_seq_length": 2048,
    "dtype": "float16",
    "seed": 3407
}
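
A few derived quantities make the scale of this run easier to judge. The arithmetic below uses the hyperparameters and timing reported in this card; the dataset size of about 100,000 rows is an assumption based on the dataset name, and a single GPU is assumed.

```python
per_device_batch = 2
grad_accum_steps = 4
max_steps = 60
train_seconds = 512
dataset_rows = 100_000  # ASSUMPTION: approximate size implied by "FineTome-100k"

effective_batch = per_device_batch * grad_accum_steps  # single GPU assumed
examples_seen = effective_batch * max_steps
fraction_of_dataset = examples_seen / dataset_rows
seconds_per_step = train_seconds / max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen}")
print(f"share of dataset: {fraction_of_dataset:.2%}")
print(f"seconds per optimizer step: {seconds_per_step:.1f}")
```

Under these assumptions the 60-step run sees 480 examples, a small fraction of the dataset, so raising `max_steps` or training for a full epoch is the obvious lever if more adaptation is wanted.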

Why This Approach is Superior

  1. Efficiency: Only 0.75% of parameters are trained, which sharply reduces gradient and optimizer-state memory (the forward and backward passes still run through the full model)
  2. Speed: Unsloth optimizations provide 2x faster training and inference
  3. Memory: 4-bit quantization + gradient checkpointing enables training on consumer GPUs
  4. Quality: Response-only training focuses learning on generating high-quality outputs
  5. Versatility: Multiple export formats (HuggingFace, GGUF) for diverse deployment scenarios

The model builds upon Llama 3.2's foundation, which was pretrained on up to 9 trillion tokens from publicly available sources and further refined through supervised fine-tuning and RLHF alignment.


📊 Performance & Benchmarks

LumiChats v1.1 inherits the strong performance characteristics of Llama 3.2 3B, with enhanced conversational abilities:

  • MMLU (Massive Multitask Language Understanding): Competitive performance
  • AGIEval (General AI evaluation): Strong reasoning capabilities
  • ARC-Challenge (Abstract reasoning): Improved over base model
  • Instruction Following: Superior response quality on FineTome-100k
  • Multilingual dialogue tasks: Consistent across 8+ languages
  • Conversational Quality: Enhanced coherence and context awareness

According to Meta's published evaluations, the base Llama 3.2 3B Instruct model outperforms similar-sized models like Gemma 2 2.6B and Phi 3.5-mini on tasks such as instruction following and summarization. LumiChats v1.1 builds on that base while keeping the efficiency advantages of LoRA and quantization.


🌐 Supported Languages

Official support for 8 languages:

  • 🇬🇧 English
  • 🇩🇪 German
  • 🇫🇷 French
  • 🇮🇹 Italian
  • 🇵🇹 Portuguese
  • 🇮🇳 Hindi
  • 🇪🇸 Spanish
  • 🇹🇭 Thai

Note: The base Llama 3.2 model was trained on a broader set of languages than these 8, and the model can be fine-tuned for other languages as needed.


⚖️ Limitations & Considerations

  • Context Understanding: May struggle with very long contexts despite 128k token capacity
  • Factual Accuracy: Can occasionally generate plausible but incorrect information
  • Bias: May reflect biases present in training data
  • Specialized Knowledge: Not optimized for highly technical or domain-specific tasks
  • Real-time Information: No access to current events (knowledge cutoff applies)
  • Safety: Should be deployed with appropriate content filtering and monitoring
  • LoRA Constraints: Trained parameters limited to attention and MLP layers

🔒 Responsible AI & Safety

LumiChats v1.1 is built on Llama 3.2's safety foundations:

  • Trained with safety alignment through RLHF (base model)
  • Designed to decline harmful requests
  • Tested for bias and fairness across languages
  • Implements content filtering guidelines
  • Response-only training computes loss on assistant turns only; it does not by itself protect against prompt injection

Developers should:

  • Implement additional safety layers for production use
  • Test thoroughly for their specific use case
  • Monitor outputs for quality and appropriateness
  • Follow the Llama 3.2 Acceptable Use Policy
  • Be aware that fine-tuning may affect safety properties

📜 License

This model is released under the Llama 3.2 Community License.

  • ✅ Commercial use permitted
  • ✅ Modification and derivative works allowed
  • ✅ Distribution allowed with attribution
  • ⚠️ Subject to Llama 3.2 Acceptable Use Policy

Please review the full license at: Llama 3.2 License


🙏 Acknowledgments

  • Meta AI for developing and releasing Llama 3.2
  • Unsloth AI for the efficient fine-tuning framework and optimizations
  • Maxime Labonne for the FineTome-100k dataset
  • Hugging Face for model hosting and transformers library
  • The open-source AI community for tools and support

🔄 Version History

v1.1 (Current)

  • Initial release
  • Fine-tuned on Llama 3.2 3B Instruct with LoRA
  • Trained on FineTome-100k dataset
  • Optimized for conversational tasks
  • Multiple export formats available (SafeTensors, GGUF, LoRA adapters)
  • 2x faster inference with Unsloth
  • Peak additional training memory: 2.35 GB on Tesla T4

📚 Citation

If you use LumiChats v1.1 in your research or applications, please cite:

@misc{lumichats2025,
  author = {Aditya Kumar},
  title = {LumiChats v1.1: A Fine-tuned Conversational AI Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/adityakum667388/lumichats-v1.1}},
  note = {Fine-tuned using Unsloth and LoRA on FineTome-100k}
}

And the base model:

@article{llama32,
  title={Llama 3.2: Advancing Efficient and Accessible AI},
  author={Meta AI},
  year={2024},
  url={https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}

And Unsloth:

@software{unsloth2024,
  author = {Unsloth AI},
  title = {Unsloth: Fast and Memory-Efficient Finetuning},
  year = {2024},
  url = {https://github.com/unslothai/unsloth}
}

Built with ❤️ using Llama 3.2 3B | Powered by Unsloth | Trained on FineTome-100k

⭐ If you find this model useful, please consider giving it a star!
