---
tags:
- text-generation
- transformer
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
# Saanvi-C0-12B 🤖⚡
![License](https://img.shields.io/badge/License-Apache%202.0-blue)
![Python 3.8+](https://img.shields.io/badge/Python-3.8%2B-green)
![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model%20Hub-yellow)
**A next-generation 12B LLM optimized for speed, efficiency, and contextual accuracy.**
_Powered by RAG-based enhancements • 4-bit quantization • Flash Attention 2 • bfloat16 • 128k context window_
---
## 🚀 Why Upgrade to Saanvi-C0-12B?
Saanvi-C0-12B delivers a **major leap in capability** over smaller models while staying efficient, with significantly improved reasoning, fluency, task completion, and mathematics.
| Feature               | Benefit                               |
| --------------------- | ------------------------------------- |
| ⚡ Flash Attention 2   | Up to **2.7× faster** inference       |
| 🧠 4-bit Quantization | **Runs on 8GB VRAM** GPUs             |
| 🎯 Instruction-Tuned  | **Better task performance**           |
| 🔥 RAG-Enhanced       | **More precise contextual retrieval** |
| ➗ Math-Expert         | **Precise mathematical knowledge**    |
### 🖥️ Optimized for Mid-Tier GPUs
- **Runs on mid-range GPUs with 8GB+ VRAM** (RTX 3050, RTX 2060, etc.).
- **More robust than our 3B model** with better contextual retention and instruction-following.
- **4-bit quantization** minimizes VRAM usage without sacrificing quality (see the loading sketch below).
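The low-VRAM path referenced above can be reproduced with `bitsandbytes` 4-bit loading. This is a minimal sketch, not the checkpoint's official recipe: the NF4 settings and the `attn_implementation="flash_attention_2"` flag are assumptions, and they require the `bitsandbytes` and `flash-attn` packages respectively.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "riple-saanvi-lab/Saanvi-C0-12B"

# Assumed NF4 4-bit settings; adjust if the repository ships its own quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional; needs flash-attn and a supported GPU
)
```

Note that the plain `bfloat16` Quick Start below needs far more than 8GB of VRAM for a 12B model; the 4-bit path is the one aimed at mid-range cards.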
---
## ⚡ Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "riple-saanvi-lab/Saanvi-C0-12B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

# Simple interactive loop; type "exit" to quit
while True:
    user_input = input("\n👤 You: ").strip()
    if user_input.lower() == "exit":
        break
    inputs = tokenizer(user_input, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=True)
    print("🤖 AI:", tokenizer.decode(output[0], skip_special_tokens=True))
```
---
## 📦 Installation
```bash
pip install torch transformers accelerate
pip install bitsandbytes  # optional: only needed for the 4-bit loading sketch above
```
---
## 📊 Benchmarks
**A100-40GB Performance**
| Batch Size | Throughput | Latency | VRAM Usage |
| ---------- | ----------- | ------- | ---------- |
| 1 | 42 tok/sec | 85ms | 8.2GB |
| 8 | 218 tok/sec | 430ms | 12.5GB |
**🚀 On Mid-Tier GPUs (RTX 3050, RTX 2060, RTX 3060 12GB)**
- **VRAM Usage**: ~8.2GB (single batch)
- **Speed**: ~10-15 tok/sec
- **Best Practices**: Stick to **smaller batch sizes** for the best performance; a batched-generation sketch follows below.
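To run multi-prompt batches like the batch-size-8 row above, prompts can be tokenized together with padding. A minimal sketch, assuming `model` and `tokenizer` were loaded as in the Quick Start and that the example prompts are placeholders:

```python
prompts = [
    "Summarize the benefits of 4-bit quantization.",
    "Explain Flash Attention in one sentence.",
]

# Many causal-LM tokenizers ship without a pad token; reuse EOS for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Left-padding keeps each prompt's last token adjacent to its generated continuation
tokenizer.padding_side = "left"

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=256, do_sample=True)

for prompt, text in zip(prompts, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"Prompt: {prompt}\nReply: {text}\n")
```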
---
## 📜 License
Licensed under the [Apache 2.0 License](LICENSE). See the [LICENSE](LICENSE) file for details.
💡 **Pro Tip**: For **maximum efficiency**, use `torch.compile()` and CUDA graphs on high-end GPUs!
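A minimal sketch of that tip, assuming PyTorch 2.x and a `model` loaded as in the Quick Start; `mode="reduce-overhead"` enables CUDA-graph capture, so the first few generations are slower while kernels compile:

```python
import torch

# Compile the forward pass; "reduce-overhead" uses CUDA graphs to cut per-step launch overhead
model = torch.compile(model, mode="reduce-overhead")
```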
---