RichardErkhov committed on
Commit
9a73710
•
1 Parent(s): 7f9430a

uploaded readme

  1. README.md +245 -0
README.md ADDED
@@ -0,0 +1,245 @@
1
+ Quantization made by Richard Erkhov.
2
+
3
+ [Github](https://github.com/RichardErkhov)
4
+
5
+ [Discord](https://discord.gg/pvy7H8DZMG)
6
+
7
+ [Request more models](https://github.com/RichardErkhov/quant_request)
8
+
9
+
10
+ buddhi-128k-chat-7b - GGUF
11
+ - Model creator: https://huggingface.co/aiplanet/
12
+ - Original model: https://huggingface.co/aiplanet/buddhi-128k-chat-7b/
13
+
14
+
15
+ | Name | Quant method | Size |
16
+ | ---- | ---- | ---- |
17
+ | [buddhi-128k-chat-7b.Q2_K.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q2_K.gguf) | Q2_K | 2.53GB |
18
+ | [buddhi-128k-chat-7b.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.IQ3_XS.gguf) | IQ3_XS | 2.81GB |
19
+ | [buddhi-128k-chat-7b.IQ3_S.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.IQ3_S.gguf) | IQ3_S | 2.96GB |
20
+ | [buddhi-128k-chat-7b.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q3_K_S.gguf) | Q3_K_S | 2.95GB |
21
+ | [buddhi-128k-chat-7b.IQ3_M.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.IQ3_M.gguf) | IQ3_M | 3.06GB |
22
+ | [buddhi-128k-chat-7b.Q3_K.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q3_K.gguf) | Q3_K | 3.28GB |
23
+ | [buddhi-128k-chat-7b.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q3_K_M.gguf) | Q3_K_M | 3.28GB |
24
+ | [buddhi-128k-chat-7b.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q3_K_L.gguf) | Q3_K_L | 3.56GB |
25
+ | [buddhi-128k-chat-7b.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.IQ4_XS.gguf) | IQ4_XS | 3.67GB |
26
+ | [buddhi-128k-chat-7b.Q4_0.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q4_0.gguf) | Q4_0 | 3.83GB |
27
+ | [buddhi-128k-chat-7b.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.IQ4_NL.gguf) | IQ4_NL | 3.87GB |
28
+ | [buddhi-128k-chat-7b.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q4_K_S.gguf) | Q4_K_S | 3.86GB |
29
+ | [buddhi-128k-chat-7b.Q4_K.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q4_K.gguf) | Q4_K | 4.07GB |
30
+ | [buddhi-128k-chat-7b.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q4_K_M.gguf) | Q4_K_M | 4.07GB |
31
+ | [buddhi-128k-chat-7b.Q4_1.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q4_1.gguf) | Q4_1 | 4.24GB |
32
+ | [buddhi-128k-chat-7b.Q5_0.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q5_0.gguf) | Q5_0 | 4.65GB |
33
+ | [buddhi-128k-chat-7b.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q5_K_S.gguf) | Q5_K_S | 4.65GB |
34
+ | [buddhi-128k-chat-7b.Q5_K.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q5_K.gguf) | Q5_K | 4.78GB |
35
+ | [buddhi-128k-chat-7b.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q5_K_M.gguf) | Q5_K_M | 4.78GB |
36
+ | [buddhi-128k-chat-7b.Q5_1.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q5_1.gguf) | Q5_1 | 5.07GB |
37
+ | [buddhi-128k-chat-7b.Q6_K.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q6_K.gguf) | Q6_K | 5.53GB |
38
+ | [buddhi-128k-chat-7b.Q8_0.gguf](https://huggingface.co/RichardErkhov/aiplanet_-_buddhi-128k-chat-7b-gguf/blob/main/buddhi-128k-chat-7b.Q8_0.gguf) | Q8_0 | 7.17GB |
39
+
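As a rough aid for choosing a file from the table above, the sketch below picks the largest quantization that fits a given memory budget. The sizes are copied from the table (a subset of rows); the `headroom_gb` parameter and the helper names are hypothetical, and the real memory needed at run time also depends on context length and the runtime you use.

```python
# File sizes in GB, copied from a subset of rows in the table above.
QUANT_SIZES_GB = {
    "Q2_K": 2.53,
    "Q3_K_M": 3.28,
    "Q4_K_M": 4.07,
    "Q5_K_M": 4.78,
    "Q6_K": 5.53,
    "Q8_0": 7.17,
}


def pick_quant(budget_gb, headroom_gb=1.0):
    """Return the largest listed quant whose file fits in the budget.

    headroom_gb is an assumed allowance for the KV cache and runtime
    overhead; it is illustrative, not a measured value.
    Returns None when nothing fits.
    """
    usable = budget_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def file_name(quant):
    """Build the GGUF file name following the naming pattern in the table."""
    return f"buddhi-128k-chat-7b.{quant}.gguf"


if __name__ == "__main__":
    choice = pick_quant(budget_gb=6.0)
    print(choice, file_name(choice) if choice else "no quant fits")
```

Smaller quants trade output quality for memory, so the largest file that fits is usually a sensible starting point rather than a guarantee.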
40
+
41
+
42
+
43
+ Original model description:
44
+ ---
45
+ license: apache-2.0
46
+ pipeline_tag: text-generation
47
+ ---
48
+
49
+ <p align="center" style="font-size:34px;"><b>Buddhi-128K-Chat</b></p>
50
+
51
+ # Buddhi-128K-Chat (7B) vLLM Inference: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11_8W8FpKK-856QdRVJLyzbu9g-DMxNfg?usp=sharing)
52
+
53
+ # Read release article: [🔗 Introducing Buddhi: Open-Source Chat Model with a 128K Context Window 🔗 ](https://medium.aiplanet.com/introducing-buddhi-open-source-chat-model-with-a-128k-context-window-06a1848121d0)
54
+
55
+ ![4.png](https://cdn-uploads.huggingface.co/production/uploads/630f3058236215d0b7078806/VUY0c4xOGpH9jTNmf6XNU.png)
56
+
57
+ ## Model Description
58
+
59
+ Buddhi-128K-Chat is a general-purpose chat model with a 128K-token context window. It is fine-tuned from Mistral 7B Instruct and optimised to handle an extended context of up to 128,000 tokens using the YaRN (Yet another RoPE extensioN) technique. This lets Buddhi keep track of context across long documents or conversations, which suits tasks that depend on long-range context, such as comprehensive document summarization, detailed narrative generation, and intricate question-answering.
60
+
61
+ ## Architecture
62
+ The Buddhi-128K-Chat model is fine-tuned from the Mistral-7B Instruct base model. We selected Mistral 7B Instruct v0.2 as the parent model for its reasoning capabilities. The Mistral-7B architecture includes Grouped-Query Attention and a byte-fallback BPE tokenizer. The parent model has a maximum of 32,768 position embeddings, so increasing the context size to 128K required modifying the positional embeddings, which is where YaRN comes into play.
63
+
64
+ In our approach, we used the NTK-aware technique, which proposes an alternative to plain positional interpolation. One experiment involved Dynamic-YaRN, in which the scale factor 's' is computed dynamically, because during inference the sequence length grows by 1 after every token prediction. By integrating these position embeddings with the Mistral-7B Instruct base model, we obtained the 128K model.
65
+
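To make the scale factor 's' concrete, here is a minimal sketch of the static and dynamic scale computations and of the NTK-aware base adjustment, using the formulas as they are commonly stated for these methods. It is not the authors' training configuration: the 131,072-token target, the RoPE base of 10,000, and the head dimension of 128 are assumptions for illustration.

```python
ORIGINAL_CONTEXT = 32_768   # maximum position embeddings of the parent model (from the text)
TARGET_CONTEXT = 131_072    # assumption: "128K" taken as 128 * 1024 tokens


def static_scale(target_len=TARGET_CONTEXT, original_len=ORIGINAL_CONTEXT):
    """Fixed scale factor s: how far the context is stretched."""
    return target_len / original_len


def dynamic_scale(current_len, original_len=ORIGINAL_CONTEXT):
    """Dynamic scale factor: 1 inside the original window, then growing
    with the current sequence length as tokens are generated."""
    return max(1.0, current_len / original_len)


def ntk_aware_base(base, scale, head_dim):
    """NTK-aware adjustment of the RoPE base: base * s^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    print(static_scale())
    print(dynamic_scale(40_000))
    # base=10_000 and head_dim=128 are illustrative values only.
    print(ntk_aware_base(10_000.0, static_scale(), 128))
```

With dynamic scaling, short prompts are left untouched (s stays at 1) and the embeddings are only stretched once the sequence grows past the original window.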
66
+ Additionally, we fine-tuned the model on our dataset to contribute one of the very few 128K chat models available in the open-source community, with greater reasoning capabilities than the rest of them.
67
+
68
+ ### Hardware requirements:
69
+ > For 128k Context Length
70
+ > - 80GB VRAM - A100 Preferred
71
+
72
+ > For 32k Context Length
73
+ > - 40GB VRAM - A100 Preferred
74
+
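Part of the reason long contexts need so much VRAM is the key/value cache, which grows linearly with sequence length. The sketch below estimates its size for a single sequence, assuming the standard Mistral 7B layout (32 layers, 8 key/value heads under Grouped-Query Attention, head dimension 128) and 2-byte bfloat16 values. These figures are assumptions, not taken from this README, and the estimate leaves out model weights, activations, and framework overhead, so it will not reproduce the totals quoted above.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size in bytes for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    Default shapes are the assumed Mistral 7B configuration.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    for tokens in (32_768, 75_000, 131_072):
        print(f"{tokens:>7} tokens: {to_gib(kv_cache_bytes(tokens)):.2f} GiB of KV cache")
```

Under these assumptions the cache alone costs 128 KiB per token, which is why capping the context length is an effective way to fit the model on smaller GPUs.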
75
+ ### vLLM - For Faster Inference
76
+
77
+ #### Installation
78
+
79
+ ```
80
+ !pip install vllm
81
+ !pip install flash_attn # If Flash Attention 2 is supported by your System
82
+ ```
83
+ Please check out the [Flash Attention 2](https://github.com/Dao-AILab/flash-attention) GitHub repository for more instructions on how to install it.
84
+
85
+ **Implementation**:
86
+
87
+ > Note: The actual hardware requirement to run the model is roughly 70GB of VRAM. For experimentation, we limit the context length to 75K instead of 128K, which makes it suitable for testing the model with 30-35 GB of VRAM.
88
+
89
+ ```python
90
+ from vllm import LLM, SamplingParams
91
+
92
+ llm = LLM(
93
+     model='aiplanet/buddhi-128k-chat-7b',
94
+     trust_remote_code=True,
95
+     dtype='bfloat16',
96
+     gpu_memory_utilization=1,
97
+     max_model_len=75000  # cap the context at 75K tokens to reduce VRAM use
98
+ )
99
+
100
+ prompts = [
101
+     """<s> [INST] Please tell me a joke. [/INST] """,
102
+     """<s> [INST] What is Machine Learning? [/INST] """
103
+ ]
104
+
105
+ sampling_params = SamplingParams(
106
+     temperature=0.8,
107
+     top_p=0.95,
108
+     max_tokens=1000
109
+ )
110
+
111
+ outputs = llm.generate(prompts, sampling_params)
112
+
113
+ for output in outputs:
114
+     prompt = output.prompt
115
+     generated_text = output.outputs[0].text
116
+     print(generated_text)
117
+     print("\n\n")
118
+
119
+ # We have also attached a Colab notebook with two more experiments: Long Essay and Entire Book.
120
+ ```
121
+
122
+ For the output, see the Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11_8W8FpKK-856QdRVJLyzbu9g-DMxNfg?usp=sharing)
123
+
124
+ ### Transformers - Basic Implementation
125
+
126
+ ```python
127
+ import torch
128
+ import transformers
129
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
130
+
131
+ bnb_config = BitsAndBytesConfig(
132
+     load_in_4bit=True,
133
+     bnb_4bit_use_double_quant=True,
134
+     bnb_4bit_quant_type="nf4",
135
+     bnb_4bit_compute_dtype=torch.bfloat16
136
+ )
137
+
138
+ model_name = "aiplanet/buddhi-128k-chat-7b"
139
+
140
+ model = AutoModelForCausalLM.from_pretrained(
141
+     model_name,
142
+     quantization_config=bnb_config,
143
+     device_map="sequential",
144
+     trust_remote_code=True
145
+ )
146
+
147
+ tokenizer = AutoTokenizer.from_pretrained(
148
+     model_name,  # pass the repository name, not the loaded model object
149
+     trust_remote_code=True
150
+ )
151
+
152
+ prompt = "<s> [INST] Please tell me a small joke. [/INST] "
153
+
154
+ tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
155
+ outputs = model.generate(
156
+     **tokens,
157
+     max_new_tokens=100,
158
+     do_sample=True,
159
+     top_p=0.95,
160
+     temperature=0.8,
161
+ )
162
+
163
+ decoded_output = tokenizer.decode(outputs[0][tokens["input_ids"].shape[1]:], skip_special_tokens=True)  # decode only the newly generated tokens
164
+ print(f"Output:\n{decoded_output}")
165
+ ```
166
+
167
+ Output
168
+
169
+ ```
170
+ Output:
171
+ Why don't scientists trust atoms?
172
+
173
+ Because they make up everything.
174
+ ```
175
+
176
+
177
+ ## Prompt Template for Buddhi-128K-Chat
178
+
179
+ In order to leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with a beginning-of-sentence (BOS) token id; subsequent instructions should not. The assistant's generation is terminated by the end-of-sentence (EOS) token id.
180
+
181
+ ```
182
+ "<s>[INST] What is your favourite condiment? [/INST]"
183
+ "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
184
+ "[INST] Do you have mayonnaise recipes? [/INST]"
185
+
186
+ ```
187
+
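The template above can be assembled programmatically. The sketch below is one way to do it, assuming the pieces are concatenated in the usual Mistral Instruct layout; `build_prompt` is a hypothetical helper, not part of any library, and the exact whitespace between turns is an assumption based on the example strings.

```python
def build_prompt(turns):
    """Build a multi-turn prompt string.

    turns is a list of (user_message, assistant_reply) pairs; use None as
    the reply of the final pair to leave the prompt open for generation.
    Only the very first instruction is preceded by the BOS marker "<s>",
    and every completed assistant reply is closed with "</s>".
    """
    parts = ["<s>"]
    for user_message, assistant_reply in turns:
        parts.append(f"[INST] {user_message} [/INST]")
        if assistant_reply is not None:
            parts.append(f" {assistant_reply}</s> ")
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt([
        ("What is your favourite condiment?", "Fresh lemon juice."),
        ("Do you have mayonnaise recipes?", None),
    ]))
```

Some tokenizers prepend the BOS token on their own, in which case the literal `<s>` may need to be dropped from the string to avoid a duplicate.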
188
+ # Benchmarks
189
+
190
+ ### Long Context Benchmark
191
+
192
+ <strong>LongICLBench Banking77</strong>
193
+ <div>
194
+
195
+ | Model | 1R/2k | 2R/4K | 3R/7K | 4R/9K | 5R/14K |
196
+ |-----------------------------------------|-------|-------|-------|-------|--------|
197
+ | aiplanet/buddhi-128k-chat-7b | 47.8 | 60.8 | 57.8 | 62.4 | 57.2 |
198
+ | NousResearch/Yarn-Mistral-7b-128k | 31.6 | 68.6 | 68 | 47 | 65.6 |
199
+ | CallComply/zephyr-7b-beta-128k | 40.2 | 41.2 | 33.6 | 03 | 0 |
200
+ | Eric111/Yarn-Mistral-7b-128k-DPO | 28.6 | 62.8 | 58 | 41.6 | 59.8 |
201
+
202
+ </div>
203
+
204
+ <strong>Short Context Benchmark</strong>
205
+ <div>
206
+
207
+ | Model | # Params | Average | ARC (25-shot) | HellaSwag (10-shot) | Winogrande (5-shot) | TruthfulQA (0-shot) | MMLU (5-shot) |
208
+ |-----------------------------------|----------|---------|---------------|---------------------|---------------------|---------------------|---------------|
209
+ | aiplanet/buddhi-128k-chat-7b | 7B | 64.42 | 60.84 | 84 | 77.27 | 65.72 | 60.42 |
210
+ | migtissera/Tess-XS-vl-3-yarn-128K | 7B | 62.66 | 61.09 | 82.95 | 74.43 | 50.13 | 62.15 |
211
+ | migtissera/Tess-XS-v1-3-yarn-128K | 7B | 62.49 | 61.6 | 82.96 | 74.74 | 50.2 | 62.1 |
212
+ | Eric111/Yarn-Mistral-7b-128k-DPO | 7B | 60.15 | 60.84 | 82.99 | 78.3 | 43.55 | 63.09 |
213
+ | NousResearch/Yarn-Mistral-7b-128k | 7B | 59.42 | 59.64 | 82.5 | 76.95 | 41.78 | 63.02 |
214
+ | CallComply/openchat-3.5-0106-128k | 7B | 59.38 | 64.25 | 77.31 | 77.66 | 46.5 | 57.58 |
215
+ | CallComply/zephyr-7b-beta-128k | 7B | 54.45 | 58.28 | 81 | 74.74 | 46.1 | 53.57 |
216
+
217
+ </div>
218
+
219
+ ## Get in Touch
220
+
221
+ You can schedule a 1:1 meeting with our DevRel & Community Team to get started with AI Planet Open Source LLMs and GenAI Stack. Schedule the call here: [https://calendly.com/jaintarun](https://calendly.com/jaintarun)
222
+
223
+ Stay tuned for more updates and be a part of the coding evolution. Join us on this exciting journey as we make AI accessible to all at AI Planet!
224
+
225
+
226
+ ### Framework versions
227
+
228
+ - Transformers 4.39.2
229
+ - Pytorch 2.2.1+cu121
230
+ - Datasets 2.18.0
231
+ - Accelerate 0.27.2
232
+ - flash_attn 2.5.6
233
+
234
+ ### Citation
235
+
236
+ ```
237
+ @misc {Chaitanya890, lucifertrj ,
238
+ author = { Chaitanya Singhal and Tarun Jain },
239
+ title = { Buddhi-128k-Chat by AI Planet},
240
+ year = 2024,
241
+ url = { https://huggingface.co/aiplanet/Buddhi-128K-Chat },
242
+ publisher = { Hugging Face }
243
+ }
244
+ ```
245
+