---
license: apache-2.0
language:
- en
---
# Meta-Llama-3-8B-GGUF

- This is a GGUF version of [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B).
## Model Details

* Meta developed and released the Meta Llama 3 family of large language models (LLMs), which are based on the Transformer architecture.
* Model Architecture: Transformer-based with 8 billion parameters.
* GGUF Format: Currently only available as an F16 (16-bit float) conversion; the sketch below shows how to check which variants the repo ships.
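Since additional quantizations may be added over time, you can list the repository's files to see which GGUF variants are currently available. A minimal sketch using `huggingface_hub` (the repo id is the one used in the How to Use section below):

```python
from huggingface_hub import list_repo_files

# List the files in the repository and keep only the GGUF variants.
for filename in list_repo_files("Orneyfish/Meta-Llama-3-8B_F_16.gguf"):
    if filename.endswith(".gguf"):
        print(filename)  # e.g. Meta-Llama-3-8B_F_16.gguf
```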
## Intended Use

* This model is intended for research and experimentation in understanding and advancing language model capabilities.
* Sample use cases (a summarization sketch follows this list):
  * Text generation of various kinds (creative, factual, etc.)
  * Summarization
  * Question answering
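Each of these reduces to a plain completion call once the model is loaded. A minimal summarization sketch, assuming the `llm` object constructed in the How to Use section below (the prompt wording is illustrative, not part of this model card):

```python
article = "..."  # text to summarize goes here

# Llama-3-8B is a base model with no chat template, so use a plain completion prompt.
prompt = f"Summarize the following text in two sentences:\n\n{article}\n\nSummary:"

result = llm(prompt, max_tokens=128, temperature=0.2)
print(result["choices"][0]["text"])
```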
## Performance & Limitations

* Performance metrics: inference speed and output quality relative to the original model have not been formally benchmarked for this conversion.
* Limitations:
  * May still generate unsafe or biased outputs; use with caution.
  * Output quality may differ from the original model due to quantization.
31
+ ## How to Use
32
+
33
+ 1. **Load the model and tokenizer:**
34
+ ```python
35
+ ##### Mistral Inference #############
36
+ from huggingface_hub import hf_hub_download
37
+ import time
38
+ from llama_cpp.llama import Llama, LlamaGrammar
39
+ import httpx
40
+ import json
41
+ import torch
42
+ import multiprocessing as mp
43
+
44
+ number_of_cpu = mp.cpu_count()
45
+ grammar_text = httpx.get("https://raw.githubusercontent.com/ggerganov/llama.cpp/master/grammars/json.gbnf").text
46
+ grammar = LlamaGrammar.from_string(grammar_text)
47
+ model_name_or_path = "Orneyfish/Meta-Llama-3-8B_F_16.gguf"
48
+ model_basename = "Meta-Llama-3-8B_F_16.gguf" # the model is in bin format
49
+
50
+
51
+ model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename, local_dir='models/')
52
+ from langchain import PromptTemplate, LLMChain
53
+ from langchain.callbacks.manager import CallbackManager
54
+ from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
55
+
56
+ # Callbacks support token-wise streaming
57
+ callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
58
+ # Verbose is required to pass to the callback manager
59
+ n_gpu_layers = 40 # Change this value based on your model and your GPU VRAM pool.
60
+ n_batch = 1024 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
61
+
62
+ # Loading model,
63
+ llm = Llama(
64
+ model_path=model_path,
65
+ n_threads=number_of_cpu,
66
+ n_gpu_layers=n_gpu_layers,
67
+ n_batch=n_batch,
68
+ n_threads_batch = 512,
69
+ # use_mlock =True,
70
+ callback_manager=callback_manager,
71
+ verbose=True,
72
+ n_ctx=8196, # Context window
73
+ stop = ['USER:'], # Dynamic stopping when such token is detected.
74
+ temperature = 0.2,
75
+ # use_mmap = False,
76
+ )
77
+ ```
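2. **Run generation:** The grammar, temperature, and stop sequence apply per call rather than at load time. A minimal sketch of a JSON-constrained completion, reusing the `llm` and `grammar` objects from step 1 (the prompt text is illustrative):

```python
import json

# Generate a completion; the GBNF grammar constrains sampling to valid JSON.
output = llm(
    "Describe the Llama 3 model family as a JSON object.",
    grammar=grammar,   # constrain output to valid JSON
    max_tokens=256,
    temperature=0.2,   # low temperature for more deterministic output
    stop=["USER:"],    # stop generation when this sequence is produced
)

text = output["choices"][0]["text"]
print(json.loads(text))  # parseable as long as generation wasn't cut off by max_tokens
```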