Commit be6e21a by lumaticai (parent: 2609e24)

Rename lumaticai_BongLlama-1.1B-Chat-alpha-v0_model_card.md to Readme.md

---

---

# lumaticai/BongLlama-1.1B-Chat-alpha-v0

Introducing BongLlama by LumaticAI: a finetuned version of TinyLlama 1.1B Chat on a Bengali dataset.

<img class="custom-image" src="llama.png" alt="BongLlama">

# Model Details

## Model Description

BongLlama is a sub-part of our company's initiative for developing Indic and regional large language models. At LumaticAI, we continuously work on helping our clients build custom AI solutions for their organizations, and we have taken the initiative to launch open source models specific to regions and languages.

BongLlama is an LLM built for West Bengal on a Bengali dataset. It is a 1.1B-parameter model. We finetuned TinyLlama/TinyLlama-1.1B-Chat-v1.0 on a 10k-example Bengali dataset, lumatic-ai/BongChat-10k-v0, to produce our BongLlama 1.1B Chat Alpha v0 model.

We are continuously working on training and improving this model. We also plan to release versions based on various sizes of different LLMs and datasets.

- **Developed by:** LumaticAI
- **Shared by [Optional]:** LumaticAI
- **Model type:** Language model
- **Language(s) (NLP):** en, bn
- **License:** apache-2.0
- **Parent Model:** TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Uses

## Direct Use

- as a base model for further finetuning
- to get an overview of how an Indic LLM performs on a specific language
- for fun

## Downstream Use [Optional]

- can be deployed behind an API (see the sketch after this list)
- can be used to build a web app or mobile app for demos
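
As a quick illustration of the API option, here is a minimal sketch using FastAPI. The framework choice, endpoint name, and request shape are assumptions for illustration, not part of this card; it assumes `fastapi` and `uvicorn` are installed.

```python
# Hypothetical minimal API wrapper around the model (illustrative only).
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
pipe = pipeline("text-generation", model="lumatic-ai/BongLlama-1.1B-Chat-alpha-v0")

@app.post("/chat")
def chat(question: str):
    # Same chat template used elsewhere in this card
    prompt = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"
    result = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.1)
    return {"response": result[0]["generated_text"]}

# Run with: uvicorn main:app --port 8000  (assuming this file is main.py)
```
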
 
## Out-of-Scope Use

- not intended for production use
- not intended for generating text for research or academic purposes

# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

# Training Details

## Training Data

We used our dataset of 10k examples, each consisting of a question and a response. The dataset name is lumatic-ai/BongChat-v0-10k.

**Example Data**

- <|im_start|>user এটি কি ইতিবাচক পর্যালোচনা? এই ফ্র্যাঞ্চাইজি কখনও হার্ট স্টপ স্টান্টস, ফাইট সিকোয়েন্স এবং তারকা শক্তি বন্ধ করে হতাশ করে না।এবার মূল চারটি ছাড়াও নতুন মুখগুলি হলেন আবুয়েলিতা চরিত্রে রিতা মোরেনো এবং দান্তে জেসন মোমোয়া যিনি শোটি চুরি করেছেন।<|im_end|> <|im_start|>assistant এটি একটি খুব ইতিবাচক পর্যালোচনা, এটি ইঙ্গিত করে যে ফ্র্যাঞ্চাইজি শ্রোতাদের জন্য আরও পছন্দসই চরিত্রগুলি যুক্ত করার পাশাপাশি অতীতের প্রত্যাশার ভিত্তিতে বিতরণ করেছে।<|im_end|>
  (English gloss: the user asks whether a movie review is positive; the assistant answers that it is a very positive one.)
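
To inspect the data, the dataset can be loaded from the Hugging Face Hub with the `datasets` library. This is a minimal sketch: the `train` split name and the exact column layout are assumptions, so check the dataset card for the actual schema.

```python
# Minimal sketch: load the training data from the Hugging Face Hub.
# Assumptions: a "train" split exists; column names may differ, so inspect
# dataset.column_names before relying on them.
from datasets import load_dataset

dataset = load_dataset("lumatic-ai/BongChat-v0-10k", split="train")
print(dataset)      # row count and column names
print(dataset[0])   # one example record
```
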
## Training Procedure

### Preprocessing

- Dataset Format:
  `<|im_start|>user <question><|im_end|> <|im_start|>assistant <response><|im_end|>`
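
A small helper along these lines can apply that template to a record (a sketch; the `question` and `response` names follow the dataset description above and are assumptions, not a documented schema):

```python
# Sketch: render one question/response pair into the chat template above.
def to_chat_format(question: str, response: str) -> str:
    return (
        f"<|im_start|>user {question}<|im_end|> "
        f"<|im_start|>assistant {response}<|im_end|>"
    )

print(to_chat_format("এটি কি ইতিবাচক পর্যালোচনা? ...", "এটি একটি খুব ইতিবাচক পর্যালোচনা ..."))
```
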
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- num_epochs: 3
- mixed_precision_training: Native AMP
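
For reference, these values map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the original training script: only the hyperparameters listed above are filled in, and `output_dir` is a placeholder.

```python
# Sketch: the listed hyperparameters expressed as transformers.TrainingArguments.
# Everything not stated in this card is left at its default value.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bongllama-finetune",  # placeholder path
    learning_rate=2e-4,               # learning_rate: 0.0002
    num_train_epochs=3,               # num_epochs: 3
    fp16=True,                        # mixed_precision_training: Native AMP
)
```
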
### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu121

# Evaluation

### Metrics

Training was monitored by logging the training loss over optimizer steps:

- train/loss
- steps
 
# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1 x Tesla T4
- **Compute Region:** India
- **Carbon Emitted:** 0.14

# Technical Specifications

## Model Architecture and Objective

Finetuned from the TinyLlama 1.1B Chat model.

### Hardware

1 x Tesla T4

# Citation

**BibTeX:**

    @misc{BongLlama-1.1B-Chat-alpha-v0,
      url={https://huggingface.co/lumatic-ai/BongLlama-1.1B-Chat-alpha-v0},
      title={BongLlama 1.1B Chat Alpha V0},
      year={2024}, month={Jan}
    }

# Model Card Authors

lumatic-ai

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

### Pipeline

```python
import torch
from time import perf_counter
from transformers import AutoTokenizer, pipeline

def formatted_prompt(question) -> str:
    # Wrap the question in the chat template the model was trained on
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"

hub_model_name = "lumatic-ai/BongLlama-1.1B-Chat-alpha-v0"

tokenizer = AutoTokenizer.from_pretrained(hub_model_name)
pipe = pipeline(
    "text-generation",
    model=hub_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

start_time = perf_counter()

prompt = formatted_prompt('হ্যালো')
sequences = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=256,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time, 2)} seconds")
```

### Streaming Response (ChatGPT/Bard-style)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

def formatted_prompt(question) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"

hub_model_name = "lumatic-ai/BongLlama-1.1B-Chat-alpha-v0"

tokenizer = AutoTokenizer.from_pretrained(hub_model_name)
model = AutoModelForCausalLM.from_pretrained(hub_model_name)

prompt = formatted_prompt('prompt here')
inputs = tokenizer([prompt], return_tensors="pt")

# TextStreamer prints tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, eos_token_id=[tokenizer.eos_token_id], streamer=streamer, max_new_tokens=256)
```

### Using a Generation Config

```python
import torch
from time import perf_counter
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

def formatted_prompt(question) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"

hub_model_name = "lumatic-ai/BongLlama-1.1B-Chat-alpha-v0"

tokenizer = AutoTokenizer.from_pretrained(hub_model_name)
model = AutoModelForCausalLM.from_pretrained(hub_model_name)

prompt = formatted_prompt('হ্যালো')

# Move the model and inputs to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

generation_config = GenerationConfig(
    penalty_alpha=0.6,
    do_sample=True,
    top_k=5,
    temperature=0.5,
    repetition_penalty=1.2,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)

start_time = perf_counter()
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time, 2)} seconds")
```

 
303
  </details>