---
tags:
  - text-generation
  - transformer
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# Saanvi-C0-12B 🤖⚡

License: Apache 2.0 · Python 3.8+ · Available on Hugging Face

A next-generation 12B LLM optimized for speed, efficiency, and contextual accuracy.
Powered by RAG-based enhancements • 4-bit quantization • Flash Attention 2 • bfloat16 • 128k context window


## 🚀 Why Upgrade to Saanvi-C0-12B?

Saanvi-C0-12B delivers a substantial leap in capability over smaller models, staying efficient while significantly improving reasoning, fluency, task completion, and math.

| Feature | Benefit |
| --- | --- |
| ⚡ Flash Attention 2 | Up to 2.7× faster inference |
| 🧠 4-bit Quantization | Runs on GPUs with 8GB VRAM |
| 🎯 Instruction-Tuned | Better task performance |
| 🔥 RAG-Enhanced | More precise contextual retrieval |
| ➗ Math Expert | Precise mathematical knowledge |

## 🖥️ Optimized for Mid-Tier GPUs

- Runs on mid-range GPUs with 8GB+ VRAM (RTX 3050, RTX 2060, etc.).
- More robust than our 3B model, with better contextual retention and instruction-following.
- 4-bit quantization minimizes VRAM usage without sacrificing quality (see the loading sketch below).
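
The 8GB VRAM figure is typically reached by loading the checkpoint in 4-bit with bitsandbytes. A minimal sketch, assuming `bitsandbytes` and `accelerate` are installed and the checkpoint loads cleanly under NF4 (exact VRAM use may differ on your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "riple-saanvi-lab/Saanvi-C0-12B"

# NF4 4-bit quantization with bfloat16 compute, a common recipe for 8GB-class GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)
```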

## ⚡ Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "riple-saanvi-lab/Saanvi-C0-12B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Simple interactive loop; type "exit" to quit.
while True:
    user_input = input("\n👤 You: ").strip()
    if user_input.lower() == "exit":
        break
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=True)
    print("🤖 AI:", tokenizer.decode(output[0], skip_special_tokens=True))
```

## 📦 Installation

```bash
pip install torch transformers
```
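
For the optional paths shown above you will likely also want `accelerate` (needed for `device_map="auto"`), `bitsandbytes` (4-bit quantization), and `flash-attn` (Flash Attention 2); exact requirements depend on your CUDA setup, so treat this as a starting point:

```bash
pip install accelerate bitsandbytes
pip install flash-attn --no-build-isolation  # optional; builds against your local CUDA toolkit
```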

## 📊 Benchmarks

### A100-40GB Performance

| Batch Size | Throughput | Latency | VRAM Usage |
| --- | --- | --- | --- |
| 1 | 42 tok/sec | 85 ms | 8.2 GB |
| 8 | 218 tok/sec | 430 ms | 12.5 GB |

### 🚀 On Mid-Tier GPUs (RTX 3050, RTX 2060, RTX 3060 12GB)

- VRAM usage: ~8.2 GB (single batch)
- Speed: ~10–15 tok/sec
- Best practice: stick to smaller batch sizes for best performance (a minimal batched-generation sketch follows).
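
If you do batch prompts, left padding keeps generation aligned across the batch. A minimal sketch, assuming the `model` and `tokenizer` from the Quick Start are already loaded (the pad-token fallback is an assumption in case the tokenizer does not define one):

```python
# Batched generation: pad on the left so new tokens line up at the end of each row.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as padding

prompts = ["Summarize RAG in one sentence.", "What is 17 * 24?"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=128, do_sample=True)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```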

## 📜 License

Licensed under the Apache 2.0 License. See the LICENSE file for details.

💡 **Pro Tip:** For maximum efficiency, use `torch.compile()` and CUDA graphs on high-end GPUs!
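
A minimal sketch of that tip, assuming PyTorch 2.x; `mode="reduce-overhead"` enables CUDA-graph capture where possible, and real-world gains with `generate()` vary by model and `transformers` version:

```python
import torch

# Compile the forward pass; "reduce-overhead" uses CUDA graphs to cut per-step launch overhead.
model.forward = torch.compile(model.forward, mode="reduce-overhead")
```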