---
language:
  - hi
pipeline_tag: text-generation
tags:
  - hindi
  - quantization
  - shuvom/yuj-v1
license: apache-2.0
quantized_by: shuvom
---

# yuj-v1-GGUF

## Description

This repo contains GGUF-format model files for [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1).

## About GGUF

GGUF and GGML are file formats for storing models for inference, especially in the context of language models like GPT (Generative Pre-trained Transformer). GGUF, introduced by the llama.cpp team as the successor to GGML, lets you run inference on consumer-grade GPUs and CPUs.
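
To see what a GGUF file actually stores, here is a minimal inspection sketch. It assumes the `gguf` Python package (published from the llama.cpp repository, `pip install gguf`) and the quantized file provided below; neither is required for normal use of this repo:

```python
# Minimal sketch: inspect the metadata and tensors stored in a GGUF file.
# Assumes `pip install gguf` and that yuj-v1.Q4_K_M.gguf has been downloaded.
from gguf import GGUFReader

reader = GGUFReader("yuj-v1.Q4_K_M.gguf")

# GGUF embeds key-value metadata (architecture, context length, tokenizer, ...)
for name in reader.fields:
    print(name)

# ...alongside the quantized weight tensors.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```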

## Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ------------ | ---- | ---- | ---------------- | -------- |
| yuj-v1.Q4_K_M.gguf | Q4_K_M | 4 | 4.17 GB | 6.87 GB | medium, balanced quality - recommended |
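
The "Max RAM required" figure presumably assumes CPU-only inference with no layers offloaded to the GPU; offloading layers reduces RAM use correspondingly. As a rough sanity check (the ~2.70 GB overhead constant below is inferred from this table, not an official llama.cpp formula):

```python
# Back-of-the-envelope RAM estimate for CPU-only GGUF inference.
# ASSUMPTION: overhead_gb (~2.70 GB for context and buffers) is inferred
# from the table above (6.87 - 4.17), not a published formula.
def estimate_max_ram_gb(file_size_gb: float, overhead_gb: float = 2.70) -> float:
    return file_size_gb + overhead_gb

print(estimate_max_ram_gb(4.17))  # -> 6.87, matching the table entry
```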

## Usage

1. Install the llama.cpp Python client (`llama-cpp-python`) and `huggingface-hub`:

```shell
pip install llama-cpp-python huggingface-hub
```

2. Download the GGUF-format model file:

```shell
huggingface-cli download shuvom/yuj-v1-GGUF yuj-v1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```

3. Load the model. Set `n_gpu_layers` to the number of layers to offload to the GPU, or to 0 if no GPU acceleration is available on your system:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./yuj-v1.Q4_K_M.gguf",  # Download the model file first
    n_ctx=2048,       # Max sequence length to use; longer sequences require much more resources
    n_threads=8,      # Number of CPU threads to use; tune to your system
    n_gpu_layers=35   # Number of layers to offload to GPU, if acceleration is available
)
```

4. Use the chat completion API:

```python
llm = Llama(model_path="/content/yuj-v1.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            # "Yuj is one of the top bilingual models"
            "content": "युज शीर्ष द्विभाषी मॉडल में से एक है"
        }
    ]
)
```
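
The call above returns an OpenAI-style completion dict. As a minimal follow-up sketch (reusing the `llm` object from step 4; the variable names here are illustrative, not part of this repo), you can extract the reply or stream it token by token:

```python
# Extract the assistant's reply from the OpenAI-style response dict.
messages = [
    # "Yuj is one of the top bilingual models"
    {"role": "user", "content": "युज शीर्ष द्विभाषी मॉडल में से एक है"},
]
response = llm.create_chat_completion(messages=messages)
print(response["choices"][0]["message"]["content"])

# Streaming variant: with stream=True, each chunk carries an incremental
# "delta" payload instead of a full message.
for chunk in llm.create_chat_completion(messages=messages, stream=True):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```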