swahili LLaMA 7B v0.1 - GGUF
Description
This repo contains GGUF format model files for swahili LLaMA 7B v0.1.
Provided files
Name | Quant method | Bits | Size | Max RAM required | Use case |
---|---|---|---|---|---|
swahili_llama-7b-v0.1.gguf | Q8_0 | 8 | 6.81 GB | 12.46 GB | very large, extremely low quality loss - not recommended |
Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
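As a rough illustration of the note above, the sketch below shows how a layer-offload setting could be passed to llama-cpp-python. The helper names `build_llama_kwargs` and `load_model` are hypothetical and not part of this repo; `model_path`, `n_gpu_layers`, and `n_ctx` are constructor parameters of `llama_cpp.Llama`. The default context size of 2000 is an assumption borrowed from the examples in this README.

```python
def build_llama_kwargs(model_path, n_gpu_layers=0, n_ctx=2000):
    """Collect constructor arguments for llama_cpp.Llama (hypothetical helper).

    n_gpu_layers=0 keeps every layer in system RAM (CPU only);
    -1 asks llama.cpp to offload all layers to the GPU.
    """
    return {
        "model_path": model_path,
        "n_gpu_layers": n_gpu_layers,
        "n_ctx": n_ctx,
    }


def load_model(model_path, n_gpu_layers=0):
    # Imported lazily so build_llama_kwargs works without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama(**build_llama_kwargs(model_path, n_gpu_layers))


if __name__ == "__main__":
    # Only prints the settings; call load_model(...) to actually load the file.
    print(build_llama_kwargs("swahili_llama-7b-v0.1.gguf", n_gpu_layers=-1))
```

Offloading only takes effect if llama-cpp-python was built with GPU support; otherwise the setting is ignored and the model stays in RAM.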
Simple Text Generation with llama-cpp-python, llama-index example code
from llama_index.core import Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt
model_path = "swahili_llama-7b-v0.1.gguf"
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=None,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_path,
    temperature=0.7,
    max_new_tokens=300,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=2000,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU; 0 runs on the CPU only
    model_kwargs={"n_gpu_layers": 0},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
Settings.llm = llm
response = llm.complete("Mfumo wa elimu Tanzania ni ")
print(response.text)
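The settings above leave a fixed share of the context window for the prompt, since the prompt and the generated tokens have to fit in it together. A minimal sketch of that arithmetic, using a hypothetical helper name `prompt_token_budget` that is not part of llama-index:

```python
def prompt_token_budget(context_window, max_new_tokens):
    """Tokens left for the prompt once room for the completion is reserved."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return budget


# Values from the example above: context_window=2000, max_new_tokens=300
print(prompt_token_budget(2000, 300))  # 1700
```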
Naive RAG with swahili_llama llama-cpp-python, llama-index example code
import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
import gradio as gr
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=None,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path="swahili_llama-7b-v0.1.gguf",
    temperature=0.1,
    max_new_tokens=200,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=2000,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers=-1 offloads all layers to the GPU; use 0 to run on the CPU only
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="./embeddings/bge-small-en-v1.5/"
)
Settings.llm = llm
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(streaming=True)


def main(question):
    response = query_engine.query(question)
    # convert the streaming response to plain text so the textbox can display it
    return str(response)


# Gradio interface
ui = gr.Interface(
    fn=main,
    inputs="textbox",
    outputs="textbox"
)
ui.launch(share=True)
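Because the query engine above is created with `streaming=True`, the answer can also be shown in the textbox as it is generated. The sketch below is one way to do that, not the author's method: `accumulate_stream` and `stream_answer` are hypothetical helpers, and the code assumes the response object exposes a `response_gen` iterator of text chunks, as llama-index streaming responses do.

```python
def accumulate_stream(token_gen):
    """Yield the text generated so far after each new chunk arrives."""
    text = ""
    for chunk in token_gen:
        text += chunk
        yield text


def stream_answer(question, query_engine):
    # Assumes query_engine was built with streaming=True, so the response
    # exposes a `response_gen` iterator of text chunks.
    response = query_engine.query(question)
    yield from accumulate_stream(response.response_gen)


if __name__ == "__main__":
    # Stand-in chunks; a real run would take them from response.response_gen.
    for partial in accumulate_stream(["Habari", " ", "yako"]):
        print(partial)
```

Gradio accepts generator functions as `fn`, so a wrapper such as `lambda q: stream_answer(q, query_engine)` could replace `main` in the interface. Depending on the Gradio version, the queue may need to be enabled for streaming output.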
Notes:
- Swahili_LLaMA is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing these models in their applications.
- Direct adoption for production tasks is out of the scope of this research project. As a result, the swahili_llama model has not been tested to ensure that it performs adequately for any production-level application. Please refer to the Limitations section of this document for more details.
- Any use of this model is at your own risk.
Limitations of Swahili LLaMA
Inaccurate Facts: Like its base model, Swahili LLaMA can generate inaccurate facts.
Limited Scope for Code: It performs poorly on code.
Unreliable Responses to Instruction: The model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users.
Language Limitations: The model is primarily designed to understand standard Swahili. The current checkpoint also contributes to inaccurate responses. Informal Swahili, slang, or any other language might challenge its comprehension, leading to potential misinterpretations or errors in response.
Potential Societal Biases: Because it was trained on a limited amount of text, it may be biased.
Toxicity: The model may produce toxic output; however, most of its Swahili training data comes from newspapers, which should make this less likely.
Verbosity: Swahili LLaMA, being a base model, often produces irrelevant or extra text after its first answer to a user prompt within a single turn. This is because its training dataset consists primarily of news and Blogspot blogs, which can lead to seemingly random continuations.
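One possible way to reduce the extra text described under Verbosity is to trim each completion after it is generated. The sketch below is an illustration only and was not part of the original project: `trim_completion` is a hypothetical helper, and the sentence split is deliberately naive (it would also split on decimals or abbreviations).

```python
def trim_completion(text, max_sentences=2):
    """Keep the first paragraph of a completion, capped at max_sentences sentences."""
    # Drop anything after the first blank line, where unrelated text often starts.
    first_paragraph = text.strip().split("\n\n", 1)[0]
    sentences = []
    current = ""
    for char in first_paragraph:
        current += char
        if char in ".!?":
            sentences.append(current.strip())
            current = ""
            if len(sentences) == max_sentences:
                break
    else:
        # The cap was not reached: keep any unfinished trailing text.
        if current.strip():
            sentences.append(current.strip())
    return " ".join(sentences)


if __name__ == "__main__":
    raw = "Elimu ni muhimu. Shule ni nyingi. Habari nyingine.\n\nMakala mpya"
    print(trim_completion(raw))
```

Lowering `max_new_tokens` in the `LlamaCPP` settings is a complementary option, since it limits how much trailing text the model can produce in the first place.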
Training
Model
Architecture: LLaMA-2 (Transformer-based model with next-word prediction objective)
Context length: LLaMA-2 (2048 tokens)
Dataset size: 600M tokens (LLaMA-2) from C100 Swahili and other crawls of Swahili newspapers and Blogspot blogs.
Training tokens: 1.4T tokens
GPUs: 2xA6000-48G
Training time: Expected 13 days