amezasor committed on
Commit
a37be0b
1 Parent(s): 8de579d

First commit granite-20b-code-base model card

  1. README.md +314 -0
README.md CHANGED
@@ -1,3 +1,317 @@
  ---
+ pipeline_tag: text-generation
+ inference: true
+ # widget:
+ # - text: 'Question: Please write a function in Python that performs bubble sort.\n\nAnswer:'
+ #   example_title: Bubble sort
+ #   group: Python
  license: apache-2.0
+ datasets:
+ # Mentioned in paper
+ - codeparrot/github-code-clean
+ - bigcode/starcoderdata
+ # - Stackexchange
+ # - CommonCrawl
+ - open-web-math/open-web-math
+ - math-ai/StackMathQA
+ # - Arxiv
+ # - Wikipedia
+ # - conceptofmind/FLAN_2022 # Original link is broken, we used IBM's filtered version | Phase 2
+ - nvidia/HelpSteer
+ metrics:
+ - code_eval
+ library_name: transformers
+ tags:
+ - code
+ model-index:
+ - name: granite-20b-code-base
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       type: openai_humaneval # https://arxiv.org/pdf/2107.03374
+       name: HumanEval
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 42.7
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: evalplus/humanevalplus # https://arxiv.org/pdf/2305.01210 https://github.com/evalplus/evalplus
+       name: HumanEval+
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: mbpp # https://arxiv.org/abs/2108.07732
+       name: MBPP
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 43.8
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: evalplus/mbppplus
+       name: MBPP+
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 51.6
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesis(Python)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesis(JavaScript)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesis(Java)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesis(Go)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 31.7
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesis(C++)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalSynthesis(Rust)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain(Python)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain(JavaScript)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 32.3
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain(Java)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain(Go)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain(C++)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalExplain(Rust)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 19.5
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix(Python)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix(JavaScript)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix(Java)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix(Go)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix(C++)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
+   - task:
+       type: text-generation
+     dataset:
+       type: bigcode/humanevalpack
+       name: HumanEvalFix(Rust)
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 0.0 # TODO: Update value
+       verified: false # Check
  ---
+
+ # Granite-20B-Code-Base
+
+ ## Model Summary
+ **Granite-20B-Code-Base** is a decoder-only code model designed for generative code tasks (e.g., code generation, code explanation, and code fixing). It is trained from scratch with a two-phase training strategy. In phase 1, the model is trained on 3 to 4 trillion tokens sourced from 116 programming languages, giving it broad coverage of programming languages and syntax. In phase 2, the model is trained on 500 billion tokens from a carefully designed mixture of high-quality code and natural language data to improve its ability to reason and follow instructions.
+
+ - **Developers:** IBM Research
+ - **GitHub Repository:** [ibm-granite/granite-code-models](https://github.com/ibm-granite/granite-code-models)
+ - **Paper:** [Granite Code Models: A Family of Open Foundation Models for Code Intelligence](https://)
+ - **Release Date:** May 6th, 2024
+ - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ## Usage
+ ### Intended use
+ Prominent enterprise use cases of LLMs in software engineering productivity include code generation, code explanation, code fixing, unit test generation, documentation generation, addressing technical debt, vulnerability detection, and code translation. All Granite Code Base models, including the **20B parameter model**, can handle these tasks because they were trained on a large amount of code data from 116 programming languages.
+
+ ### Generation
+ This is a simple example of how to use the **Granite-20B-Code-Base** model.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ device = "cuda" # or "cpu"
+ model_path = "ibm-granite/granite-20b-code-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ # drop device_map if running on CPU
+ model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
+ model.eval()
+ # change input text as desired
+ input_text = "def generate():"
+ # tokenize the text
+ input_tokens = tokenizer(input_text, return_tensors="pt")
+ # transfer tokenized inputs to the device
+ for i in input_tokens:
+     input_tokens[i] = input_tokens[i].to(device)
+ # generate output tokens
+ output = model.generate(**input_tokens)
+ # decode output tokens into text
+ output = tokenizer.batch_decode(output)
+ # loop over the batch to print, in this example the batch size is 1
+ for i in output:
+     print(i)
+ ```
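The decoded string printed above still begins with the prompt and may run past the function being completed. The following is a minimal sketch of one way to post-process it, assuming the prompt is echoed verbatim at the start of the output. The helper name `extract_completion` and the default stop sequences are illustrative assumptions, not part of this model card or of the `transformers` API.

```python
def extract_completion(decoded, prompt,
                       stop_sequences=("\ndef ", "\nclass ", "\nif __name__")):
    """Return only the newly generated text, cut at the first stop sequence.

    Hypothetical helper for illustration; the stop sequences are an assumption
    suited to completing a single top-level Python function.
    """
    # Drop the echoed prompt if the decoded text starts with it.
    if decoded.startswith(prompt):
        completion = decoded[len(prompt):]
    else:
        completion = decoded
    # Truncate at the earliest stop sequence that appears, if any.
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


if __name__ == "__main__":
    prompt = "def generate():"
    decoded = "def generate():\n    return 1\n\ndef other():\n    pass"
    print(repr(extract_completion(decoded, prompt)))
```

In practice the stop sequences depend on the target language and task, so they are best treated as a tunable parameter rather than a fixed list.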
+
+ ## Training Data
+ - **Data Collection and Filtering:** Pretraining code data is sourced from a combination of publicly available datasets (e.g., [GitHub Code Clean](https://huggingface.co/datasets/codeparrot/github-code-clean), [Starcoder data](https://huggingface.co/datasets/bigcode/starcoderdata)) and additional public code repositories and issues from GitHub. We filter the raw data to retain a list of 116 programming languages. After language filtering, we also filter out low-quality code.
+ - **Exact and Fuzzy Deduplication:** We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents with identical or near-identical code content.
+ - **HAP, PII, Malware Filtering:** We apply a HAP content filter that reduces the models' likelihood of generating hateful, abusive, or profane language. We also redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩). Moreover, we scan all datasets using [ClamAV](https://www.clamav.net/) to identify and remove instances of malware in the source code.
+ - **Natural Language Datasets:** In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets to improve the models' proficiency in language understanding and mathematical reasoning. Unlike the code data, we do not deduplicate these datasets.
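The deduplication and PII-redaction steps listed above can be illustrated with a small sketch. This is not the pipeline used to build the training data: it assumes SHA-256 hashing for exact duplicates, brute-force Jaccard similarity over token shingles for near duplicates (large-scale pipelines typically use approximate methods such as MinHash instead), and a deliberately simple email regex. All function names, the shingle size, and the 0.7 threshold are hypothetical choices.

```python
import hashlib
import re

# Simplified email pattern, for illustration only (assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=5):
    """Return the set of k-token windows (words and punctuation) in text."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedup(docs, threshold=0.7, k=5):
    """Drop a document if it is too similar to one already kept (O(n^2))."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def redact_emails(text):
    """Replace email addresses with the placeholder token ⟨EMAIL⟩."""
    return EMAIL_RE.sub("⟨EMAIL⟩", text)


if __name__ == "__main__":
    docs = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return a + b\n",         # exact duplicate
        "def add(a, b):\n    return a + b  # sum\n",  # near duplicate
        "print('hello world')\n",
    ]
    print(len(fuzzy_dedup(exact_dedup(docs))))
    print(redact_emails("contact: jane.doe@example.com"))
```

Running exact deduplication first keeps the quadratic fuzzy pass smaller; the threshold trades off recall of near duplicates against the risk of dropping distinct files.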
+
+ ## Infrastructure
+ We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.
+
+ ## Ethical Considerations and Limitations
+ The use of Large Language Models involves risks and ethical considerations that people must be aware of. Regarding code generation, we urge caution against relying completely on any specific code model for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The **Granite-20B-Code-Base** model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the **Granite-20B-Code-Base** model with ethical intentions and in a responsible way.
+
+ ## Citation
+ ```
+ @misc{granite-models,
+   author = {author 1, author2, ...},
+   title = {Granite Code Large Language Models: IBM Foundation Models for Code},
+   journal = {},
+   volume = {},
+   year = {2024},
+   url = {https://arxiv.org/abs/0000.00000},
+ }
+ ```