# Saanvi-C0-3B

A production-ready LLM designed to enhance user expression and improve contextual accuracy.

RAG-ready • 4-bit quantized • Flash Attention 2 • bfloat16 • 2K context
## Features
| Feature | Benefit |
|---|---|
| Flash Attention 2 | 2.7x faster inference |
| 4-bit Quantization | 6.2GB VRAM usage |
| Instruction-Tuned | Better task performance |
| RAG-Enhanced | Contextual precision |
### What sets it apart?
Saanvi-C0-3B can be used ahead of Retrieval-Augmented Generation (RAG) to refine user intent and sharpen contextual matching, improving the precision of responses when paired with a retrieval pipeline. Thanks to its 4-bit quantization, it's optimized to run efficiently even on low-end GPUs with as little as 6.2GB of VRAM, making it accessible for a wide range of hardware.
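The Quick Start below loads the model in bfloat16; as a rough illustration of the 4-bit path mentioned above, here is a minimal sketch using `BitsAndBytesConfig` together with Flash Attention 2. The NF4 settings are illustrative assumptions, and `bitsandbytes`, `accelerate`, and `flash-attn` must be installed for this to work.

```python
# Minimal sketch: loading in 4-bit with Flash Attention 2 enabled.
# Assumes `bitsandbytes`, `accelerate`, and `flash-attn` are installed and that
# the GPU supports bfloat16 + FlashAttention; the NF4 settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "riple-saanvi-lab/Saanvi-C0-3B"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to fit in ~6 GB of VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```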
## Quick Start
```python
import argparse
import sys

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer


def parse_args():
    """
    Parse command-line arguments for the chat application.

    Returns:
        argparse.Namespace: Parsed arguments including model path and generation parameters.
    """
    parser = argparse.ArgumentParser(description="Streaming Terminal Chat")
    parser.add_argument("--model_path", type=str, default="riple-saanvi-lab/Saanvi-C0-3B",
                        help="Path to the pre-trained model")
    parser.add_argument("--max_length", type=int, default=512,
                        help="Maximum total length (prompt + response) in tokens")
    # argparse's `type=bool` treats any non-empty string as True, so use a
    # real boolean flag instead (requires Python 3.9+).
    parser.add_argument("--do_sample", action=argparse.BooleanOptionalAction, default=True,
                        help="Whether to use sampling during generation")
    return parser.parse_args()


def load_model_and_tokenizer(model_path: str):
    """
    Load the model and tokenizer from the specified path.

    Args:
        model_path (str): Path to the pre-trained model.

    Returns:
        tuple: The loaded model (AutoModelForCausalLM) and tokenizer (AutoTokenizer).

    Raises:
        SystemExit: If loading fails, exits with an error message.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, device_map="auto"
        )
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model or tokenizer: {e}")
        sys.exit(1)


def chat_loop(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, max_length: int, do_sample: bool):
    """
    Run the interactive chat loop with streaming responses.

    Args:
        model (AutoModelForCausalLM): The pre-trained language model.
        tokenizer (AutoTokenizer): The tokenizer associated with the model.
        max_length (int): Maximum total length (prompt + response) in tokens.
        do_sample (bool): Whether to use sampling during generation.
    """
    print("Streaming Terminal Chat - Type 'exit' to quit")
    while True:
        user_input = input("\nYou: ").strip()
        if user_input.lower() == "exit":
            print("Exiting chat...")
            break
        # Ensure inputs are on the same device as the model
        device = next(model.parameters()).device
        inputs = tokenizer(user_input, return_tensors="pt").to(device)
        # Generate and stream the response; note that max_length counts the
        # prompt tokens as well as the generated ones.
        print("AI: ", end="", flush=True)
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        _ = model.generate(**inputs, max_length=max_length, do_sample=do_sample, streamer=streamer)
        print()  # Add a newline after the response


def main():
    """
    Main entry point for the chat application.
    """
    args = parse_args()
    model, tokenizer = load_model_and_tokenizer(args.model_path)
    chat_loop(model, tokenizer, args.max_length, args.do_sample)


if __name__ == "__main__":
    main()
```
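Since the checkpoint is instruction-tuned, prompts often behave better when wrapped in the tokenizer's chat template rather than tokenized as raw text. The snippet below is a minimal sketch of how the generation step inside the loop above could be adapted, assuming the tokenizer actually bundles a chat template (`user_input`, `device`, `max_length`, `do_sample`, and `streamer` come from `chat_loop`):

```python
# Optional tweak to the loop above: wrap the prompt in the tokenizer's chat
# template (assuming the checkpoint bundles one) instead of tokenizing the
# raw string. Variables come from chat_loop in the Quick Start script.
messages = [{"role": "user", "content": user_input}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
_ = model.generate(input_ids, max_length=max_length, do_sample=do_sample, streamer=streamer)
```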
## Installation
```bash
# Install dependencies (CUDA 11+ required)
pip install torch transformers accelerate

# Optional: 4-bit quantization and Flash Attention 2 support
pip install bitsandbytes flash-attn
```
## Benchmarks
### A100-40GB Performance
| Batch Size | Throughput | Latency | VRAM Usage |
|---|---|---|---|
| 1 | 42 tok/sec | 85ms | 6.2GB |
| 8 | 218 tok/sec | 430ms | 10.8GB |
### Low-End GPU Compatibility
With its 4-bit quantization, Saanvi-C0-3B runs smoothly on GPUs with limited VRAM (e.g., NVIDIA GTX 1660 Ti or similar with 6GB), maintaining reasonable performance for single-batch inference.
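As a rough sanity check, single-batch throughput and peak VRAM can be measured with a short timing script. The sketch below is illustrative only; the prompt, token budget, and resulting numbers will vary with GPU, drivers, and generation settings.

```python
# Rough single-batch throughput check (illustrative; numbers vary with GPU,
# drivers, and generation settings).
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "riple-saanvi-lab/Saanvi-C0-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain retrieval-augmented generation in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up pass so CUDA kernels are initialized before timing.
model.generate(**inputs, max_new_tokens=16)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/sec, peak VRAM "
      f"{torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```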
## License
Licensed under the Apache 2.0 License. See the LICENSE file for details.
> **Pro Tip:** For optimal performance on high-end GPUs, pair the model with `torch.compile()` and CUDA graphs. On low-end GPUs, stick to smaller batch sizes for best results!
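A minimal sketch of that pairing with PyTorch 2.x, assuming `model` is the checkpoint already loaded as in the Quick Start; `mode="reduce-overhead"` turns on CUDA graphs, and actual gains depend on GPU, batch size, and sequence length.

```python
# Illustrative only: compile the forward pass with CUDA graphs enabled.
# Requires PyTorch 2.x; `model` is assumed to be loaded as in the Quick Start.
import torch

model.forward = torch.compile(model.forward, mode="reduce-overhead")

# Subsequent generate() calls reuse the compiled forward pass, e.g.:
# output = model.generate(**inputs, max_new_tokens=128)
```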