# Saanvi-C0-12B 🤖⚡

A next-generation 12B LLM optimized for speed, efficiency, and contextual accuracy.

Powered by RAG-based enhancements • 4-bit quantization • Flash Attention 2 • bfloat16 • 128k context window
## 🚀 Why Upgrade to Saanvi-C0-12B?
Saanvi-C0-12B delivers a substantial leap in capability over smaller models, maintaining efficiency while significantly improving reasoning, fluency, task completion, and mathematics.
| Feature | Benefit |
|---|---|
| ⚡ Flash Attention 2 | Up to 2.7× faster inference |
| 🧠 4-bit Quantization | Runs on GPUs with 8 GB of VRAM |
| 🎯 Instruction-Tuned | Better task performance |
| 🔥 RAG-Enhanced | More precise contextual retrieval |
| ✅ Math Expert | Precise mathematical knowledge |
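The Flash Attention 2 row above corresponds to the generic `attn_implementation` option in `transformers`. A minimal loading sketch, assuming the `flash-attn` package is installed and your GPU supports it (this is an illustration, not the card's official loading recipe):

```python
import torch
from transformers import AutoModelForCausalLM

# Enable Flash Attention 2 via the standard transformers flag.
# Requires the flash-attn package and a compatible CUDA GPU; loading fails if it is missing.
model = AutoModelForCausalLM.from_pretrained(
    "riple-saanvi-lab/Saanvi-C0-12B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```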
## 🖥️ Optimized for Mid-Tier GPUs
- Runs on mid-range GPUs with 8 GB+ of VRAM (e.g., RTX 3050, RTX 2060).
- More robust than our 3B model, with better contextual retention and instruction following.
- 4-bit quantization minimizes VRAM usage without sacrificing quality (see the 4-bit loading sketch below).
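The 8 GB figure above is typically reached by loading the model in 4-bit with bitsandbytes. A minimal sketch using the generic `BitsAndBytesConfig` API from `transformers`; the exact quantization settings are assumptions, not settings published with this checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "riple-saanvi-lab/Saanvi-C0-12B"

# NF4 4-bit quantization (assumes bitsandbytes and a CUDA GPU are available).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)
```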
## ⚡ Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "riple-saanvi-lab/Saanvi-C0-12B"

# Load the tokenizer and the model in bfloat16; device_map="auto" places weights automatically.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

# Simple interactive loop: type "exit" to quit.
while True:
    user_input = input("\n🤖 You: ").strip()
    if user_input.lower() == "exit":
        break
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)
    # max_length counts the prompt plus generated tokens.
    output = model.generate(**inputs, max_length=2048, do_sample=True)
    print("🤖 AI:", tokenizer.decode(output[0], skip_special_tokens=True))
```
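Since the model is instruction-tuned, prompts formatted with the tokenizer's chat template usually behave better than raw strings. A short sketch, assuming the tokenizer ships a chat template (not verified for this checkpoint) and reusing `model` and `tokenizer` from above:

```python
# Format a single user turn with the tokenizer's chat template, if one is provided.
messages = [{"role": "user", "content": "Summarize Flash Attention in two sentences."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate new tokens only, then strip the prompt from the decoded output.
output = model.generate(input_ids, max_new_tokens=256, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```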
## 📦 Installation

```bash
pip install torch transformers
```
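Depending on how you load the model, a few optional dependencies may also be needed; this list is inferred from the features and flags shown on this page, not an official requirements file:

```bash
pip install accelerate     # needed for device_map="auto"
pip install bitsandbytes   # needed for 4-bit loading
pip install flash-attn     # needed for Flash Attention 2 (requires a CUDA build toolchain)
```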
## 📊 Benchmarks

### A100-40GB Performance
| Batch Size | Throughput | Latency | VRAM Usage |
|---|---|---|---|
| 1 | 42 tok/s | 85 ms | 8.2 GB |
| 8 | 218 tok/s | 430 ms | 12.5 GB |
### 🚀 On Mid-Tier GPUs (RTX 3050, RTX 2060, RTX 3060 12GB)
- VRAM usage: ~8.2 GB (single batch)
- Speed: ~10-15 tok/s
- Best practice: keep batch sizes small for the best performance.
## 📜 License
Licensed under the Apache 2.0 License. See the LICENSE file for details.
💡 **Pro Tip:** For maximum efficiency, use `torch.compile()` and CUDA graphs on high-end GPUs.
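A minimal `torch.compile` sketch along the lines of that tip, assuming PyTorch 2.x and the `model` loaded in the Quick Start; `mode="reduce-overhead"` is the setting that turns on CUDA graphs:

```python
import torch

# Compile the forward pass; the first few generations are slower while compilation warms up.
model.forward = torch.compile(model.forward, mode="reduce-overhead")
```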