---
language:
- hi
pipeline_tag: text-generation
tags:
- hindi
- quantization
- shuvom/yuj-v1
license: apache-2.0
quantized_by: shuvom
---
# yuj-v1-GGUF
- Model creator: shuvom_
- Original model: shuvom/yuj-v1
## Description
This repo contains GGUF format model files for shuvom/yuj-v1.
## About GGUF
GGUF and GGML are file formats for storing models for inference, used especially for large language models. GGUF is the newer format, introduced by the llama.cpp team as a replacement for GGML. It lets you run inference on consumer-grade CPUs and GPUs.
## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| yuj-v1.Q4_K_M.gguf | Q4_K_M | 4 | 4.17 GB | 6.87 GB | medium, balanced quality - recommended |
## Usage
- Installing the llama.cpp Python client (`llama-cpp-python`) and `huggingface-hub`:

```shell
pip install llama-cpp-python huggingface-hub
```
- Downloading the GGUF-formatted model:

```shell
huggingface-cli download shuvom/yuj-v1-GGUF yuj-v1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
- Set `n_gpu_layers` to the number of layers to offload to the GPU; set it to 0 if no GPU acceleration is available on your system.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./yuj-v1.Q4_K_M.gguf",  # Download the model file first
    n_ctx=2048,       # Max sequence length; longer sequences require much more memory
    n_threads=8,      # Number of CPU threads to use; tailor to your system
    n_gpu_layers=35,  # Number of layers to offload to GPU, if GPU acceleration is available
)
```
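If you are unsure what to pass for `n_threads`, one portable heuristic is to derive it from the machine's CPU count (a minimal sketch using only the standard library; halving the logical CPU count is a common rule of thumb to avoid oversubscribing hyper-threads, not a llama.cpp requirement):

```python
import os

# os.cpu_count() reports logical CPUs (hyper-threads included); using
# roughly half of them often avoids oversubscription during inference.
logical_cpus = os.cpu_count() or 1
n_threads = max(1, logical_cpus // 2)
print(n_threads)
```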
- Chat Completion API

```python
llm = Llama(model_path="/content/yuj-v1.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            # Hindi: "yuj is one of the top bilingual models"
            "content": "युज शीर्ष द्विभाषी मॉडल में से एक है",
        },
    ]
)
```
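`create_chat_completion` returns an OpenAI-style completion dict. A minimal sketch of pulling out the assistant's reply (the sample `response` below is illustrative, not real model output — actual ids, text, and metadata come from the model):

```python
# Illustrative response in the shape returned by create_chat_completion.
response = {
    "id": "chatcmpl-xxxx",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Yes - yuj is a Hindi-English bilingual model.",
            },
            "finish_reason": "stop",
        }
    ],
}

# The generated text lives under choices[0]["message"]["content"].
reply = response["choices"][0]["message"]["content"]
print(reply)
```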