Local configuration with LlamaCpp

#116
by alessandervs - opened

Hello HF community!
I installed Llama 3 locally, but I am running into a problem.

The two code listings below (1. main.py and 2. QdrantChatDocument) run Llama 3 on the following setup:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
$ nvidia-smi
Wed May 22 17:11:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     Off | 00000000:2A:00.0 Off |                    0 |
| N/A   37C    P0              36W / 165W |   7305MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     Off | 00000000:BD:00.0 Off |                    0 |
| N/A   36C    P0              31W / 165W |   8797MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     58504      C   ...cc/projetos/dev-ai/.venv/bin/python     7296MiB |
|    1   N/A  N/A     58504      C   ...cc/projetos/dev-ai/.venv/bin/python     8788MiB |
+---------------------------------------------------------------------------------------+
MODEL_PATH=/data/llm-data/meta/Meta-Llama-3-8B-Instruct
LLM_MODEL=/data/llm-data/meta/Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf

I used the command below to produce "llama-3-8B-Instruct.gguf":

python3 llama.cpp/convert-hf-to-gguf.py Meta-Llama-3-8B-Instruct --outfile Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf --outtype q8_0
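For reference, the converted GGUF can also be exercised directly with llama-cpp-python, which applies the chat template stored in the GGUF metadata. This is only a minimal sanity-check sketch, assuming llama-cpp-python is installed; the prompt text is just an example:

# Minimal sanity check of the converted GGUF with llama-cpp-python directly.
from llama_cpp import Llama

llm = Llama(
    model_path="/data/llm-data/meta/Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf",
    n_gpu_layers=28,
    n_ctx=4096,
    verbose=False,
)

# create_chat_completion applies the chat template read from the GGUF metadata.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
    temperature=0.1,
)
print(out["choices"][0]["message"]["content"])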
==================== main.py ====================
import gradio as gr
from modules.qdrant_chat.chat_document import QdrantChatDocument


chat = QdrantChatDocument()

with gr.Blocks() as demo:

    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def user(user_message, chat_history):
        # Run the RetrievalQA chain and append the (question, answer) pair
        # to the history shown in the Chatbot component.
        result = chat.qa({"query": user_message})
        chat_history.append((user_message, result["result"]))
        # Clear the textbox and return the updated history.
        return gr.update(value=""), chat_history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

if __name__ == "__main__":
    demo.launch(debug=True)
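For comparison, the same chat wiring could also be expressed with gr.ChatInterface (available from Gradio 3.37 onwards), which manages the history list itself. This is only a sketch, not the author's code; the handler name respond is hypothetical:

import gradio as gr
from modules.qdrant_chat.chat_document import QdrantChatDocument

chat = QdrantChatDocument()

# gr.ChatInterface keeps the (user, bot) history itself; the handler only
# needs to return the answer string for the latest message.
def respond(message, history):  # hypothetical handler name
    result = chat.qa({"query": message})
    return result["result"]

demo = gr.ChatInterface(respond)

if __name__ == "__main__":
    demo.launch(debug=True)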
==================== QdrantChatDocument ====================
from dotenv import load_dotenv
# QDRANT
## Import the client
from qdrant_client import QdrantClient
## Create our collection
from qdrant_client.http.models import Distance, VectorParams


# LANGCHAIN
## Import Qdrant as the vector store
from langchain_community.vectorstores import Qdrant
## Import the SentenceTransformer embeddings
from langchain_community.embeddings import SentenceTransformerEmbeddings
## Helper for splitting the text into chunks
from langchain.text_splitter import CharacterTextSplitter
## Module that makes it easy to use vector stores for QA (question answering)
from langchain.chains import RetrievalQA
## Import the LLM
from langchain_community.llms import LlamaCpp


#from langchain.llms import OpenAI
from transformers import AutoTokenizer, AutoModelForCausalLM
import os
import transformers
import torch

load_dotenv()

MODEL_PATH = os.getenv('MODEL_PATH')
LLM_MODEL = os.getenv('LLM_MODEL')
absolute_path = os.path.dirname(os.path.abspath(__file__))


class QdrantChatDocument:

    # Note: everything below runs once, at class-definition time (import time).

    # Recreate the Qdrant collection from scratch on every start.
    client = QdrantClient(host="localhost", port=6333)
    client.delete_collection(
        collection_name="pdf_collection"
    )
    client.create_collection(
        collection_name="pdf_collection",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    # 384-dimensional embeddings, matching the vector size configured above.
    embeddings = SentenceTransformerEmbeddings(
        model_name='sentence-transformers/all-MiniLM-L6-v2',
        model_kwargs={'device': 'cuda'},
        encode_kwargs={'normalize_embeddings': False},
    )

    vectorstore = Qdrant(
            client=client,
            collection_name="pdf_collection",
            embeddings=embeddings
        )

    # Split the raw text into overlapping chunks for indexing.
    # Called below at class-definition time, so it takes no `self`.
    def get_chunks(text):
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=2048,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)
        return chunks

    # Load the source document and index its chunks in Qdrant.
    with open("docs/teste.txt", "r") as f:
        raw_text = f.read()

    texts = get_chunks(raw_text)
    vectorstore.add_texts(texts)


    # Build the RetrievalQA chain on top of the local GGUF model.
    qa = RetrievalQA.from_llm(
        llm=LlamaCpp(model_path=LLM_MODEL, n_gpu_layers=28, n_threads=6,
                     n_ctx=4096, n_batch=521, verbose=True, temperature=0.1),
        return_source_documents=True,
        retriever=vectorstore.as_retriever(),
    )

========================================================

But the chat in Gradio is returning heavy hallucinations, as shown in the screenshot.

[screenshot: hallucinated chat output]
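As an aside: the server log below shows llama-cpp-python guessing the llama-3 chat format from the GGUF metadata, while LangChain's LlamaCpp wrapper sends a plain completion prompt to the model. Below is a minimal sketch, assuming the same model and the Qdrant vector store built above, of how the chain could be given an explicit Llama-3-style prompt and stop token; the template text and stop value are assumptions, not tested code:

import os
from langchain_community.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

LLM_MODEL = os.getenv('LLM_MODEL')

# Hypothetical Llama-3-style prompt; {context} and {question} are the input
# variables the RetrievalQA QA chain fills in.
LLAMA3_QA_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Answer the question using only the context below.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Context:\n{context}\n\nQuestion: {question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

llm = LlamaCpp(
    model_path=LLM_MODEL,
    n_gpu_layers=28,
    n_ctx=4096,
    temperature=0.1,
    stop=["<|eot_id|>"],  # stop at Llama-3's end-of-turn token
)

# `vectorstore` is the Qdrant store created in the class above.
qa = RetrievalQA.from_llm(
    llm=llm,
    prompt=PromptTemplate.from_template(LLAMA3_QA_TEMPLATE),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)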

============== server running ==============

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.54it/s]
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /data/llm-data/meta/Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 521
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'Meta-Llama-3-8B-Instruct', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Available chat formats from metadata: chat_template.default
Guessed chat format: llama-3
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
IMPORTANT: You are using gradio version 3.41.2, however version 4.29.0 is available, please upgrade.
--------
/home/hmdcc/projetos/dev-ai/.venv/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 0.3.0. Use invoke instead.
  warn_deprecated(

llama_print_timings:        load time =    3569.78 ms
llama_print_timings:      sample time =     189.22 ms /   256 runs   (    0.74 ms per token,  1352.89 tokens per second)
llama_print_timings: prompt eval time =    7216.23 ms /  1108 tokens (    6.51 ms per token,   153.54 tokens per second)
llama_print_timings:        eval time =   31871.99 ms /   255 runs   (  124.99 ms per token,     8.00 tokens per second)
llama_print_timings:       total time =   40884.20 ms /  1363 tokens
