Swahili LLaMA 7B v0.1 - GGUF

Model creator: Mollel
Original model: LLaMA-2

Description

This repo contains GGUF format model files for Swahili LLaMA 7B v0.1.

Provided files

Name	Quant method	Bits	Size	Max RAM required	Use case
swahili_llama-7b-v0.1.gguf	Q8_0	8	6.81 GB	12.46 GB	very large, extremely low quality loss - not recommended

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
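To get a rough feel for how offloading shifts memory from RAM to VRAM, the sketch below splits the figures from the table by layer count. It is a back-of-envelope estimate, not a measurement: it assumes the 7B model has 32 transformer layers, that the file's weights are spread evenly across them, and that all remaining memory stays in system RAM. The function name `estimate_memory_split` is hypothetical.

```python
N_LAYERS = 32          # assumption: transformer layers in a LLaMA-2 7B model
FILE_SIZE_GB = 6.81    # "Size" column in the table above
MAX_RAM_GB = 12.46     # "Max RAM required" column in the table above


def estimate_memory_split(n_gpu_layers, n_layers=N_LAYERS):
    """Return a rough (ram_gb, vram_gb) estimate for a number of offloaded layers.

    Assumes weights are spread evenly across layers and that all non-weight
    memory stays in system RAM. A negative value means "offload every layer".
    """
    if n_gpu_layers < 0:
        n_gpu_layers = n_layers
    offloaded = min(n_gpu_layers, n_layers)
    vram_gb = FILE_SIZE_GB * offloaded / n_layers
    ram_gb = MAX_RAM_GB - vram_gb
    return round(ram_gb, 2), round(vram_gb, 2)


for layers in (0, 16, -1):
    ram, vram = estimate_memory_split(layers)
    print(f"n_gpu_layers={layers:>3}: ~{ram} GB RAM, ~{vram} GB VRAM")
```

Actual usage depends on context size, batch size, and the llama.cpp build, so treat these numbers only as a starting point when choosing `n_gpu_layers`.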

Simple Text Generation: llama-cpp-python and llama-index example code

from llama_index.core import Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    completion_to_prompt,
    messages_to_prompt,
)

model_path = "swahili_llama-7b-v0.1.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=None,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_path,
    temperature=0.7,
    max_new_tokens=300,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=2000,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 0},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Settings.llm = llm

response = llm.complete("Mfumo wa elimu Tanzania ni ")
print(response.text)

Naive RAG with swahili_llama: llama-cpp-python and llama-index example code

import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
import gradio as gr

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=None,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path="swahili_llama-7b-v0.1.gguf",
    temperature=0.1,
    max_new_tokens=200,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=2000,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # -1 offloads all layers to the GPU; set to 0 to run on CPU only
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Settings.embed_model = HuggingFaceEmbedding(
    model_name="./embeddings/bge-small-en-v1.5/"
)
Settings.llm = llm

PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

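query_engine = index.as_query_engine(streaming=True)

def main(question):
    # With streaming=True the query returns a streaming response;
    # yield the accumulated text so the Gradio textbox updates as tokens arrive
    response = query_engine.query(question)
    text = ""
    for token in response.response_gen:
        text += token
        yield text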

# Gradio interface
ui = gr.Interface(
    fn=main,
    inputs="textbox",
    outputs="textbox"
)
ui.launch(share=True)
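For intuition about what the query engine does, here is a toy, standard-library-only sketch of the retrieve-then-prompt step behind naive RAG. It is not how llama-index is implemented: word counts stand in for the embedding model, and the names `embed`, `cosine`, `retrieve`, and `build_prompt`, the sample documents, and the prompt wording are all made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Place the retrieved chunks ahead of the question in one completion prompt."""
    context = "\n".join(context_chunks)
    return f"Muktadha:\n{context}\n\nSwali: {question}\nJibu:"


# Illustrative sample data (hypothetical; stands in for the files in "data")
chunks = [
    "Elimu ya msingi Tanzania huchukua miaka saba.",
    "Mlima Kilimanjaro ni mlima mrefu zaidi Afrika.",
    "Dodoma ni mji mkuu wa Tanzania.",
]
question = "Elimu ya msingi huchukua miaka mingapi?"
top = retrieve(question, chunks, top_k=1)
print(build_prompt(question, top))
```

Ending the prompt with an open "Jibu:" suits a base model, which continues text rather than following instructions.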

Notes:

Swahili_LLaMA is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing these models in their applications.
Direct adoption for production tasks is out of the scope of this research project. As a result, the Swahili LLaMA model has not been tested to ensure that it performs adequately for any production-level application. Please refer to the Limitations section of this document for more details.
Any use of this model is at your own risk.

Limitations of Swahili LLaMA

Inaccurate Facts: Like its base model, Swahili LLaMA may generate inaccurate facts.
Limited Scope for Code: The model performs poorly on code.
Unreliable Responses to Instruction: The model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users.
Language Limitations: The model is primarily designed to understand standard Swahili. Informal Swahili, slang, or any other language may challenge its comprehension, leading to misinterpretations or errors in its responses. This checkpoint may also produce more inaccurate responses.
Potential Societal Biases: Because the model was trained on a limited amount of text, it may be biased.
Toxicity: The model may produce toxic output; however, most of the Swahili training data comes from newspapers, which should make this less likely.
Verbosity: Swahili LLaMA, being a base model, often produces irrelevant or extra text after its first answer to a user prompt within a single turn. This is because its training data is primarily news and blog text, which can lead to seemingly random continuations.
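
One way to cope with the verbosity described above is to cut the completion off at the first point where the model starts a new paragraph or a new question. The helper below is a minimal sketch of that idea; the name `trim_completion` and the default stop strings are assumptions chosen for illustration, not part of the model or of llama-index.

```python
def trim_completion(text, stop_sequences=("\n\n", "\nSwali:")):
    """Cut text at the earliest stop sequence, then strip surrounding whitespace."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# Hypothetical raw completion: a first answer followed by unrelated text
raw = " elimu ya msingi, sekondari na elimu ya juu.\n\nHabari nyingine: ..."
print(trim_completion(raw))
```

With the text generation example, this could be applied as `trim_completion(response.text)`. The stop strings are worth tuning to the prompts in use.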

Training

Model

Architecture: LLaMA-2 (Transformer-based model with next-word prediction objective)
Context length: LLaMA-2 (2048 tokens)
Dataset size: 600M tokens (LLaMA-2) from C100 Swahili and other crawls of Swahili newspapers and blogs.
Training tokens: 1.4T tokens
GPUs: 2xA6000-48G
Training time: Expected 13 days