---
license: apache-2.0
language:
- en
---

# Meta-Llama-3-8B-GGUF

This is a GGUF quantized version of [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B).

## Model Details

* Meta developed and released the Meta Llama 3 family of large language models (LLMs), which are based on the Transformer architecture.
* Model Architecture: auto-regressive Transformer with 8 billion parameters.
* GGUF Quantization: currently available only in the F16 (16-bit float) version.

## Intended Use

* This model is intended for research and experimentation in understanding and advancing language model capabilities.
* Sample use cases (a short summarization sketch follows this list):
  * Text generation of various kinds (creative, factual, etc.)
  * Summarization
  * Question-answering
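
For example, a minimal summarization sketch with llama-cpp-python. The model path assumes the download step under "How to Use" below, and the article text is a placeholder:

```python
from llama_cpp import Llama

# Illustrative path: matches the hf_hub_download() call in "How to Use" below.
llm = Llama(model_path="models/Meta-Llama-3-8B_F_16.gguf", n_ctx=8192, verbose=False)

# This is a base (non-instruct) model, so completion-style prompts work best.
article = "The James Webb Space Telescope, launched in December 2021, ..."  # placeholder
prompt = f"Article: {article}\n\nOne-sentence summary:"

output = llm(prompt, max_tokens=64, temperature=0.2, stop=["\n"])
print(output["choices"][0]["text"])
```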

## Performance & Limitations

* Performance: speed and output quality relative to the original model have not been benchmarked here; a simple throughput check is sketched after this list.
* Limitations:
  * May still generate unsafe or biased outputs; use with caution.
  * Outputs may differ from the original model due to the GGUF conversion/quantization.
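
A simple way to measure throughput on your own hardware (a sketch; the model path again assumes the download step below, and numbers vary widely by machine):

```python
import time
from llama_cpp import Llama

# Illustrative path: matches the hf_hub_download() call in "How to Use" below.
llm = Llama(model_path="models/Meta-Llama-3-8B_F_16.gguf", n_ctx=8192, verbose=False)

start = time.time()
out = llm("Llamas are", max_tokens=128)
elapsed = time.time() - start

# create_completion() reports token counts under "usage".
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.1f} tokens/s)")
```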

## How to Use

1. **Load the model** (the tokenizer is embedded in the GGUF file):

```python
##### llama.cpp inference #####
from huggingface_hub import hf_hub_download
from llama_cpp import Llama, LlamaGrammar
import httpx
import multiprocessing as mp

number_of_cpu = mp.cpu_count()

# Optional: a GBNF grammar that constrains generation to valid JSON.
grammar_text = httpx.get("https://raw.githubusercontent.com/ggerganov/llama.cpp/master/grammars/json.gbnf").text
grammar = LlamaGrammar.from_string(grammar_text)

model_name_or_path = "Orneyfish/Meta-Llama-3-8B_F_16.gguf"
model_basename = "Meta-Llama-3-8B_F_16.gguf"  # the model is in GGUF format

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename, local_dir='models/')

n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 1024  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.

# Load the model. Sampling options such as `stop` and `temperature` are
# per-call arguments (see step 2 below), not constructor arguments.
llm = Llama(
    model_path=model_path,
    n_threads=number_of_cpu,
    n_threads_batch=number_of_cpu,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    # use_mlock=True,
    verbose=True,
    n_ctx=8192,  # Context window (Llama 3 supports up to 8192 tokens)
    # use_mmap=False,
)
```
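
2. **Generate:**

A minimal generation sketch: `stop` and `temperature` are per-call sampling arguments in llama-cpp-python, and passing the `grammar` object fetched above constrains the output to valid JSON. For token-wise streaming, pass `stream=True` and iterate over the returned chunks. The prompt text is illustrative:

```python
# Run a single completion with the model loaded in step 1.
output = llm(
    "USER: List three facts about llamas as a JSON array.\nASSISTANT:",
    max_tokens=256,
    temperature=0.2,  # low temperature for more focused output
    stop=["USER:"],   # stop when the model starts a new conversation turn
    grammar=grammar,  # constrain the completion to valid JSON
)
print(output["choices"][0]["text"])
```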