Saanvi-C0-3B πŸ€–βš‘

Apache-2.0 License • Python 3.8+ • Hugging Face

A production-ready LLM designed to refine user intent and improve contextual accuracy.
Built for RAG pipelines • 4-bit quantized • Flash Attention 2 • bfloat16 • 2K context


πŸš€ Features

| Feature | Benefit |
|---|---|
| ⚡ Flash Attention 2 | 2.7x faster inference |
| 🧠 4-bit Quantization | 6.2GB VRAM usage |
| 🎯 Instruction-Tuned | Better task performance |
| πŸ”₯ RAG-Enhanced | Contextual precision |

What sets it apart?
Saanvi-C0-3B is designed to sit in front of a Retrieval-Augmented Generation (RAG) pipeline: it refines user intent and sharpens contextual matching, so the downstream retrieval and generation steps produce more precise responses. Thanks to 4-bit quantization, it also runs efficiently on low-end GPUs with as little as 6.2GB of VRAM, making it accessible on a wide range of hardware.
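As a rough sketch of this pre-RAG pattern, the snippet below asks the model to rewrite a raw user question into a sharper retrieval query before it reaches your retriever. The prompt wording and the retriever call are placeholders for illustration, not part of the model's API; the loading code simply mirrors the Quick Start below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "riple-saanvi-lab/Saanvi-C0-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

def refine_query(raw_query: str) -> str:
    """Rewrite a raw user question into a sharper retrieval query (illustrative prompt, not a fixed API)."""
    prompt = f"Rewrite the following question as a precise search query:\n{raw_query}\nSearch query:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

refined = refine_query("why is my gpu slow with big models")
# docs = my_retriever.search(refined)  # hypothetical retriever call; plug in your own RAG stack
```

The refined query, rather than the raw input, is then what you embed or send to your search index.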


⚑ Quick Start

```python
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

def parse_args():
    """
    Parse command-line arguments for the chat application.

    Returns:
        argparse.Namespace: Parsed arguments including model path and generation parameters.
    """
    parser = argparse.ArgumentParser(description="Streaming Terminal Chat")
    parser.add_argument("--model_path", type=str, default="riple-saanvi-lab/Saanvi-C1-3B",
                        help="Path to the pre-trained model")
    parser.add_argument("--max_length", type=int, default=512,
                        help="Maximum length for generated responses")
    parser.add_argument("--do_sample", type=bool, default=True,
                        help="Whether to use sampling during generation")
    return parser.parse_args()

def load_model_and_tokenizer(model_path: str):
    """
    Load the model and tokenizer from the specified path.

    Args:
        model_path (str): Path to the pre-trained model.

    Returns:
        tuple: A tuple containing the loaded model (AutoModelForCausalLM) and tokenizer (AutoTokenizer).

    Raises:
        SystemExit: If loading fails, exits with an error message.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model or tokenizer: {e}")
        exit(1)

def chat_loop(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, max_length: int, do_sample: bool):
    """
    Run the interactive chat loop with streaming responses.

    Args:
        model (AutoModelForCausalLM): The pre-trained language model.
        tokenizer (AutoTokenizer): The tokenizer associated with the model.
        max_length (int): Maximum length of the generated responses.
        do_sample (bool): Whether to use sampling during generation.
    """
    print("πŸ’¬ Streaming Terminal Chat - Type 'exit' to quit")
    while True:
        user_input = input("\nπŸ‘€ You: ").strip()
        if user_input.lower() == "exit":
            print("πŸ‘‹ Exiting chat...")
            break

        # Ensure inputs are on the same device as the model
        device = next(model.parameters()).device
        inputs = tokenizer(user_input, return_tensors="pt").to(device)

        # Generate and stream the response
        print("πŸ€– AI: ", end="", flush=True)
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        _ = model.generate(**inputs, max_length=max_length, do_sample=do_sample, streamer=streamer)
        print()  # Add a newline after the response

def main():
    """
    Main entry point for the chat application.
    """
    args = parse_args()
    model, tokenizer = load_model_and_tokenizer(args.model_path)
    chat_loop(model, tokenizer, args.max_length, args.do_sample)

if __name__ == "__main__":
    main()
```
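To try it out, save the script as, say, `chat.py` (the filename is arbitrary) and run `python chat.py`, optionally overriding the defaults, e.g. `python chat.py --max_length 256 --do_sample false`.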

πŸ“¦ Installation

```bash
# Core dependencies (accelerate is needed for device_map="auto"; use a CUDA 11+ build of torch)
pip install torch transformers accelerate
# Optional: bitsandbytes (4-bit quantization) and flash-attn (Flash Attention 2)
pip install bitsandbytes flash-attn
```

πŸ“Š Benchmarks

A100-40GB Performance

| Batch Size | Throughput | Latency | VRAM Usage |
|---|---|---|---|
| 1 | 42 tok/sec | 85ms | 6.2GB |
| 8 | 218 tok/sec | 430ms | 10.8GB |
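If you want to sanity-check throughput on your own hardware, a minimal single-batch timing loop along these lines is enough (a sketch, not the exact script behind the table above; it assumes a CUDA GPU):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "riple-saanvi-lab/Saanvi-C0-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain retrieval-augmented generation in one paragraph.", return_tensors="pt").to(model.device)
new_tokens = 128

# Warm-up run so CUDA kernels and caches are initialized before timing
model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
print(f"{new_tokens / (time.perf_counter() - start):.1f} tok/sec")
```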

Low-End GPU Compatibility
With its 4-bit quantization, Saanvi-C0-3B runs smoothly on GPUs with limited VRAM (e.g., NVIDIA GTX 1660 Ti or similar with 6GB), maintaining reasonable performance for single-batch inference.
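A minimal sketch of loading the model in 4-bit via transformers' bitsandbytes integration for such cards: float16 compute is used because older GPUs lack bfloat16 support, and the Flash Attention 2 line is left commented out since flash-attn requires an Ampere-or-newer GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "riple-saanvi-lab/Saanvi-C0-3B"

# NF4 4-bit weights; float16 compute also works on older GPUs without bfloat16 support
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # uncomment on Ampere+ GPUs with flash-attn installed
)
```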


πŸ“œ License

Licensed under the Apache 2.0 License. See the LICENSE file for details.


πŸ’‘ Pro Tip: For optimal performance on high-end GPUs, pair with torch.compile() and CUDA graphs. On low-end GPUs, stick to smaller batch sizes for best results!
