---
license: apache-2.0
language:
- en
---

# Meta-Llama-3-8B-GGUF

This is a GGUF quantized version of [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B).

## Model Details

* Meta developed and released the Meta Llama 3 family of large language models (LLMs), which are based on the Transformer architecture.
* Model Architecture: auto-regressive Transformer with 8 billion parameters.
* GGUF Quantization: currently available only in the F16 (16-bit float) version.

## Intended Use

* This model is intended for research and experimentation in understanding and advancing language model capabilities.
* Sample use cases (a short summarization sketch follows this list):
  * Text generation of various kinds (creative, factual, etc.)
  * Summarization
  * Question-answering
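
For example, a minimal summarization sketch with llama-cpp-python. The model path assumes the download step under "How to Use" below, and the article text is a placeholder:

```python
from llama_cpp import Llama

# Illustrative path: matches the hf_hub_download() call in "How to Use" below.
llm = Llama(model_path="models/Meta-Llama-3-8B_F_16.gguf", n_ctx=8192, verbose=False)

# This is a base (non-instruct) model, so completion-style prompts work best.
article = "The James Webb Space Telescope, launched in December 2021, ..."  # placeholder
prompt = f"Article: {article}\n\nOne-sentence summary:"

output = llm(prompt, max_tokens=64, temperature=0.2, stop=["\n"])
print(output["choices"][0]["text"])
```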

## Performance & Limitations

* Performance: speed and output quality relative to the original model have not been benchmarked here; a simple throughput check is sketched after this list.
* Limitations:
  * May still generate unsafe or biased outputs; use with caution.
  * Outputs may differ from the original model due to the GGUF conversion/quantization.
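
A simple way to measure throughput on your own hardware (a sketch; the model path again assumes the download step below, and numbers vary widely by machine):

```python
import time
from llama_cpp import Llama

# Illustrative path: matches the hf_hub_download() call in "How to Use" below.
llm = Llama(model_path="models/Meta-Llama-3-8B_F_16.gguf", n_ctx=8192, verbose=False)

start = time.time()
out = llm("Llamas are", max_tokens=128)
elapsed = time.time() - start

# create_completion() reports token counts under "usage".
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.1f} tokens/s)")
```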

## How to Use

1. **Load the model** (the tokenizer is embedded in the GGUF file):

```python
##### llama.cpp inference #####
from huggingface_hub import hf_hub_download
from llama_cpp import Llama, LlamaGrammar
import httpx
import multiprocessing as mp

number_of_cpu = mp.cpu_count()

# Optional: a GBNF grammar that constrains generation to valid JSON.
grammar_text = httpx.get("https://raw.githubusercontent.com/ggerganov/llama.cpp/master/grammars/json.gbnf").text
grammar = LlamaGrammar.from_string(grammar_text)

model_name_or_path = "Orneyfish/Meta-Llama-3-8B_F_16.gguf"
model_basename = "Meta-Llama-3-8B_F_16.gguf"  # the model is in GGUF format

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename, local_dir='models/')

n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 1024  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.

# Load the model. Sampling options such as `stop` and `temperature` are
# per-call arguments (see step 2 below), not constructor arguments.
llm = Llama(
    model_path=model_path,
    n_threads=number_of_cpu,
    n_threads_batch=number_of_cpu,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    # use_mlock=True,
    verbose=True,
    n_ctx=8192,  # Context window (Llama 3 supports up to 8192 tokens)
    # use_mmap=False,
)
```
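
2. **Generate:**

A minimal generation sketch: `stop` and `temperature` are per-call sampling arguments in llama-cpp-python, and passing the `grammar` object fetched above constrains the output to valid JSON. For token-wise streaming, pass `stream=True` and iterate over the returned chunks. The prompt text is illustrative:

```python
# Run a single completion with the model loaded in step 1.
output = llm(
    "USER: List three facts about llamas as a JSON array.\nASSISTANT:",
    max_tokens=256,
    temperature=0.2,  # low temperature for more focused output
    stop=["USER:"],   # stop when the model starts a new conversation turn
    grammar=grammar,  # constrain the completion to valid JSON
)
print(output["choices"][0]["text"])
```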