---
language:
- hi
pipeline_tag: text-generation
tags:
- hindi
- quantization
- shuvom/yuj-v1
license: apache-2.0
quantized_by: shuvom
---
# yuj-v1-GGUF
- Model creator: shuvom_
- Original model: shuvom/yuj-v1
## Description
This repo contains GGUF format model files for shuvom/yuj-v1.
## About GGUF
GGUF and GGML are file formats for storing models for inference, used especially for large language models. GGUF is the newer format, introduced by the llama.cpp team as a replacement for GGML. It lets you run inference on consumer-grade CPUs and GPUs.
## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| yuj-v1.Q4_K_M.gguf | Q4_K_M | 4 | 4.17 GB | 6.87 GB | medium, balanced quality - recommended |
## Usage
- Installing the llama.cpp Python client (`llama-cpp-python`) and `huggingface-hub`:

```shell
pip install llama-cpp-python huggingface-hub
```
- Downloading the GGUF-formatted model:

```shell
huggingface-cli download shuvom/yuj-v1-GGUF yuj-v1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
- Set `n_gpu_layers` to the number of layers to offload to the GPU; set it to 0 if no GPU acceleration is available on your system.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./yuj-v1.Q4_K_M.gguf",  # Download the model file first
    n_ctx=2048,       # Max sequence length; longer sequences require much more memory
    n_threads=8,      # Number of CPU threads to use; tailor to your system
    n_gpu_layers=35,  # Number of layers to offload to GPU, if GPU acceleration is available
)
```
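If you are unsure what to pass for `n_threads`, one portable heuristic is to derive it from the machine's CPU count (a minimal sketch using only the standard library; halving the logical CPU count is a common rule of thumb to avoid oversubscribing hyper-threads, not a llama.cpp requirement):

```python
import os

# os.cpu_count() reports logical CPUs (hyper-threads included); using
# roughly half of them often avoids oversubscription during inference.
logical_cpus = os.cpu_count() or 1
n_threads = max(1, logical_cpus // 2)
print(n_threads)
```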
- Chat Completion API

```python
llm = Llama(model_path="/content/yuj-v1.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            # Hindi: "yuj is one of the top bilingual models"
            "content": "युज शीर्ष द्विभाषी मॉडल में से एक है",
        },
    ]
)
```
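`create_chat_completion` returns an OpenAI-style completion dict. A minimal sketch of pulling out the assistant's reply (the sample `response` below is illustrative, not real model output — actual ids, text, and metadata come from the model):

```python
# Illustrative response in the shape returned by create_chat_completion.
response = {
    "id": "chatcmpl-xxxx",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Yes - yuj is a Hindi-English bilingual model.",
            },
            "finish_reason": "stop",
        }
    ],
}

# The generated text lives under choices[0]["message"]["content"].
reply = response["choices"][0]["message"]["content"]
print(reply)
```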