---
library_name: transformers
widget:
- messages:
  - role: user
    content: How does the brain work?
inference:
  parameters:
    max_new_tokens: 200
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: >-
  To access Gemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged-in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
datasets:
- yatharth97/10k_reports_gemma
---

# yatharth-gemma-7b-it-10k Model Card

**Reference Model Page**: [Gemma](https://ai.google.dev/gemma/docs)

This model card describes a version of Gemma 7B-IT that has been fine-tuned on a dataset of 10-K reports to improve performance on answering questions about those reports.


**Authors**: Yatharth Mahesh Sant

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

The model presented here is an advanced adaptation of the Gemma 7B-IT, a member of the Gemma family of lightweight yet state-of-the-art models developed by Google. Leveraging the breakthrough research and technology that brought forth the Gemini models, our fine-tuned iteration specializes in parsing and understanding financial texts, particularly those found in 10-K reports.

Dubbed "yatharth-gemma-7B-it-10k", this model retains the text-to-text, decoder-only architecture of its progenitors and functions optimally in English. What sets it apart is its refined focus on question-answering tasks specific to the intricate domain of 10-K reports, an invaluable resource for financial analysts, investors, and regulatory professionals seeking AI-driven insights.

Preserving the open-weights philosophy of the original Gemma models, this variant has been instruction-tuned with a curated dataset of 10-K reports. It not only demonstrates an enhanced proficiency in generating accurate, context-aware responses to user queries but also maintains the flexibility and efficiency that allow deployment in various settings, from personal computers to cloud-based environments.

The "yatharth-gemma-7B-it-10k" upholds the Gemma tradition of facilitating text generation tasks such as summarization and complex reasoning. Its unique optimization for financial reports exemplifies our commitment to pushing the boundaries of specialized AI, providing an unparalleled tool for dissecting and interpreting one of the business world's most information-dense documents.

By marrying the accessibility of the Gemma models with the niche expertise required to navigate 10-K reports, we extend the frontiers of what's possible with AI, democratizing cutting-edge technology to empower financial analysis and decision-making.

### Usage

Below we share some code snippets to help you get started quickly with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your use case.

#### Fine-tuning the model

You can find fine-tuning scripts and a notebook under the [`examples/` directory](https://huggingface.co/google/gemma-7b/tree/main/examples) of the [`google/gemma-7b`](https://huggingface.co/google/gemma-7b) repository. To adapt them to this model, simply change the model ID to `yatharth97/yatharth-gemma-7b-it-10k`.
That repository provides:

* A script to perform Supervised Fine-Tuning (SFT) on the UltraChat dataset using QLoRA
* A script to perform SFT using FSDP on TPU devices
* A notebook that you can run on a free-tier Google Colab instance to perform SFT on an English quotes dataset


#### Running the model on a CPU

As explained below, we recommend `torch.bfloat16` as the default dtype. You can use [a different precision](#precisions) if necessary.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    torch_dtype=torch.bfloat16
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```


#### Running the model on a single / multi GPU


```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, by specifying the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.

You can also use `float32` if you omit the dtype, but this does not increase precision (the model weights are simply upcast to `float32`). See examples below.

* _Using `torch.float16`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    device_map="auto",
    torch_dtype=torch.float16,
    revision="float16",
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

* _Using `torch.bfloat16`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch  # needed for torch.bfloat16 below

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", device_map="auto", torch_dtype=torch.bfloat16)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

* _Upcasting to `torch.float32`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    device_map="auto"
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

#### Quantized Versions through `bitsandbytes`

* _Using 8-bit precision (int8)_

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", quantization_config=quantization_config)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

* _Using 4-bit precision_

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", quantization_config=quantization_config)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
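
As a rough guide to choosing among the precisions and quantized variants above, the memory needed for the weights alone scales with the number of bytes stored per parameter. The sketch below illustrates that arithmetic. The helper `estimate_weight_memory_gib` is hypothetical and not part of `transformers`, and the parameter count in the usage example is an illustrative assumption; with a loaded model you can pass `model.num_parameters()` instead. Real usage is higher, because activations, the KV cache, and quantization metadata are not counted.

```python
# Bytes stored per parameter for each precision discussed above.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def estimate_weight_memory_gib(n_params, precision):
    """Return the approximate size of the weights in GiB (hypothetical helper)."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    # Assumed parameter count, for illustration only.
    n_params = 8.5e9
    for precision in BYTES_PER_PARAM:
        size = estimate_weight_memory_gib(n_params, precision)
        print(f"{precision:>8}: ~{size:.1f} GiB for weights")
```

The estimate makes the trade-off concrete: upcasting to `float32` doubles the weight memory relative to `bfloat16` without adding precision, while 8-bit and 4-bit loading reduce it by roughly a factor of two and four.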


#### Other optimizations

* _Flash Attention 2_

First make sure `flash-attn` is installed in your environment: `pip install flash-attn`

```diff
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
+   attn_implementation="flash_attention_2"
).to(0)
```
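
If the same script has to run on machines where `flash-attn` may not be installed, one option is to choose the attention implementation at runtime. The sketch below assumes the hypothetical helper name `pick_attn_implementation` and falls back to `"sdpa"`, PyTorch's scaled dot-product attention. It only checks that the package can be found; it does not verify that the GPU or dtype is supported by Flash Attention 2.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash-attn is importable, else "sdpa".

    Hypothetical helper: it only checks that the package is installed.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
```

The returned string can be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call shown above.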

### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "yatharth97/yatharth-gemma-7b-it-10k"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    { "role": "user", "content": "Can you tell me what the Total Debt was in 2023?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<bos><start_of_turn>user
Can you tell me what the Total Debt was in 2023?<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template.
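
As a minimal sketch of building the prompt by hand, the helper below formats a list of turns in the layout shown above. The name `build_gemma_prompt` is hypothetical and not part of `transformers`. Mapping the `assistant` role to `model` and trimming whitespace from each message are assumptions about how the built-in template behaves, so prefer `tokenizer.apply_chat_template` when it is available.

```python
def build_gemma_prompt(chat, add_generation_prompt=True):
    """Format chat turns with the turn delimiters (hypothetical helper)."""
    prompt = "<bos>"
    for turn in chat:
        # Assumption: "assistant" turns are rendered with the "model" role.
        role = "model" if turn["role"] == "assistant" else turn["role"]
        if role not in ("user", "model"):
            raise ValueError(f"Unsupported role: {turn['role']!r}")
        prompt += f"<start_of_turn>{role}\n{turn['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        prompt += "<start_of_turn>model\n"
    return prompt


if __name__ == "__main__":
    chat = [
        {"role": "user", "content": "Can you tell me what the Total Debt was in 2023?"},
    ]
    print(build_gemma_prompt(chat))
```

Because the string already starts with `<bos>`, tokenize it with `add_special_tokens=False` so the token is not added a second time.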

After the prompt is ready, generation can be performed like this:

```py
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```
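
The decoded text contains the prompt as well as the model's answer. As a sketch, the hypothetical helper `extract_model_reply` below keeps only the text after the last `<start_of_turn>model` marker and drops the closing tokens. Treating `<eos>` as the end-of-sequence string is an assumption; check `tokenizer.eos_token` for the actual value.

```python
def extract_model_reply(decoded):
    """Return only the model's last reply from decoded text (hypothetical helper)."""
    marker = "<start_of_turn>model\n"
    # Keep everything after the last model-turn marker.
    reply = decoded.rsplit(marker, 1)[-1]
    # Drop the closing delimiters (the "<eos>" string is an assumption).
    for token in ("<end_of_turn>", "<eos>"):
        reply = reply.replace(token, "")
    return reply.strip()


if __name__ == "__main__":
    decoded = (
        "<bos><start_of_turn>user\n"
        "Can you tell me what the Total Debt was in 2023?<end_of_turn>\n"
        "<start_of_turn>model\n"
        "An example answer.<end_of_turn><eos>"
    )
    print(extract_model_reply(decoded))
```

An alternative that avoids string handling is to slice off the prompt tokens before decoding, for example `outputs[0][inputs.shape[-1]:]`, and pass `skip_special_tokens=True` to `tokenizer.decode`.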

### Inputs and outputs

*   **Input:** Text string, such as a question, a prompt, or a 10-K document to be
    summarized.
*   **Output:** Generated English-language text in response to the input, such
    as an answer to a question or a summary of an uploaded 10-K document. Summarization is currently handled by a separate model.

## Model Data

Data used for model training and how the data was processed.

### Training Dataset

This model is fine-tuned on the dataset "yatharth97/10k_reports_gemma", which uses a conversational format that allows the user to ask questions about an uploaded 10-K report.