yatharth-gemma-2b-it-10k Model Card
Reference Model Page: Gemma
This model card describes a version of the Gemma model that has been fine-tuned on a dataset of 10-K reports to improve question answering about those reports.
Authors: Yatharth Mahesh Sant
Model Information
Summary description and brief definition of inputs and outputs.
Description
The model presented here is an advanced adaptation of the Gemma 2B-IT, a member of the Gemma family of lightweight yet state-of-the-art models developed by Google. Leveraging the breakthrough research and technology that brought forth the Gemini models, our fine-tuned iteration specializes in parsing and understanding financial texts, particularly those found in 10-K reports.
Dubbed "yatharth-gemma-2B-it-10k", this model retains the text-to-text, decoder-only architecture of its progenitors and works best in English. What sets it apart is its focus on question-answering tasks specific to 10-K reports, a resource used by financial analysts, investors, and regulatory professionals seeking AI-driven insights.
Preserving the open-weights philosophy of the original Gemma models, this variant has been instruction-tuned with a curated dataset of 10-K reports. It not only demonstrates an enhanced proficiency in generating accurate, context-aware responses to user queries but also maintains the flexibility and efficiency that allow deployment in various settings, from personal computers to cloud-based environments.
The "yatharth-gemma-2B-it-10k" upholds the Gemma tradition of facilitating text generation tasks such as summarization and complex reasoning. Its unique optimization for financial reports exemplifies our commitment to pushing the boundaries of specialized AI, providing an unparalleled tool for dissecting and interpreting one of the business world's most information-dense documents.
By marrying the accessibility of the Gemma models with the niche expertise required to navigate 10-K reports, we extend the frontiers of what's possible with AI, democratizing cutting-edge technology to empower financial analysis and decision-making.
Usage
Below we share some code snippets to help you get started quickly with running the model. First make sure to pip install -U transformers, then copy the snippet from the section that is relevant to your use case.
Fine-tuning the model
You can find fine-tuning scripts and a notebook under the examples/ directory of the google/gemma-2b repository. To adapt them to this model, simply change the model-id to yatharth97/yatharth-gemma-2b-it-10k.
In that repository, we provide:
- A script to perform Supervised Fine-Tuning (SFT) on the UltraChat dataset using QLoRA
- A script to perform SFT using FSDP on TPU devices
- A notebook that you can run on a free-tier Google Colab instance to perform SFT on an English quotes dataset
Running the model on a CPU
As explained below, we recommend torch.bfloat16 as the default dtype. You can use a different precision if necessary.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
"yatharth97/yatharth-gemma-2b-it-10k",
torch_dtype=torch.bfloat16
)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
Running the model on a single / multi GPU
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
"yatharth97/yatharth-gemma-2b-it-10k",
device_map="auto",
torch_dtype=torch.bfloat16
)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
Running the model on a GPU using different precisions
The native weights of this model were exported in bfloat16 precision. You can use float16, which may be faster on certain hardware, by specifying the torch_dtype when loading the model. For convenience, the float16 revision of the repo contains a copy of the weights already converted to that precision.
You can also use float32 if you skip the dtype, but no precision increase will occur (model weights will just be upcast to float32). See examples below.
- Using torch.float16
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
"yatharth97/yatharth-gemma-2b-it-10k",
device_map="auto",
torch_dtype=torch.float16,
revision="float16",
)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
- Using torch.bfloat16
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch  # needed for torch.bfloat16 below
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k", device_map="auto", torch_dtype=torch.bfloat16)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
- Upcasting to torch.float32
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
"yatharth97/yatharth-gemma-2b-it-10k",
device_map="auto"
)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
Quantized Versions through bitsandbytes
- Using 8-bit precision (int8)
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k", quantization_config=quantization_config)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
- Using 4-bit precision
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-2b-it-10k", quantization_config=quantization_config)
input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
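To get a rough feel for the trade-off between the precisions above, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. The sketch below is only an approximation: the parameter count of 2.5e9 is an assumed round figure for illustration (it is not stated in this card), the helper name is hypothetical, and real usage is higher because of activations, the KV cache, and layers that quantization backends keep in higher precision.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: 2.5e9 parameters is an illustrative round figure, not a value
# taken from this model card; pass the real count if you know it.

BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def estimate_weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)


if __name__ == "__main__":
    assumed_params = 2.5e9  # assumption, see note above
    for precision in BYTES_PER_PARAM:
        gib = estimate_weight_memory_gib(assumed_params, precision)
        print(f"{precision:>8}: ~{gib:.1f} GiB")
```

Halving the bytes per parameter halves the estimate, which is why the 8-bit and 4-bit options are attractive on memory-constrained GPUs.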
Other optimizations
- Flash Attention 2
First make sure to install flash-attn in your environment: pip install flash-attn
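import torch
from transformers import AutoModelForCausalLM
model_id = "yatharth97/yatharth-gemma-2b-it-10k"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"  # enable Flash Attention 2
).to(0)  # move the model to the first GPU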
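If the same script has to run on machines where flash-attn may not be installed, one option is to choose the attention backend at runtime. The following is a minimal sketch under that assumption; pick_attn_implementation is a hypothetical helper name, and the check only confirms that the package is importable, not that the GPU actually supports Flash Attention 2.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Fall back to PyTorch's scaled-dot-product attention backend.
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as the attn_implementation argument of AutoModelForCausalLM.from_pretrained in place of the hard-coded value.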
Chat Template
The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.
Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
model_id = "yatharth97/yatharth-gemma-2b-it-10k"
dtype = torch.bfloat16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
torch_dtype=dtype,
)
chat = [
{ "role": "user", "content": "Can you tell me what the Total Debt was in 2023?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
At this point, the prompt contains the following text:
<bos><start_of_turn>user
Can you tell me what the Total Debt was in 2023?<end_of_turn>
<start_of_turn>model
As you can see, each turn is preceded by a <start_of_turn> delimiter and then the role of the entity (either user, for content supplied by the user, or model, for LLM responses). Turns finish with the <end_of_turn> token.
You can follow this format to build the prompt manually, if you need to do it without the tokenizer's chat template.
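As a minimal sketch of building the prompt by hand, the helper below reproduces the layout shown above. The function name build_gemma_prompt is hypothetical, and the mapping of an "assistant" role to model as well as the newline after each turn are assumptions based on the printed prompt; prefer the tokenizer's own chat template when it is available.

```python
def build_gemma_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts with the turn delimiters."""
    parts = ["<bos>"]
    for message in messages:
        # The template calls the LLM role "model"; map the common
        # "assistant" alias to it (an assumption for convenience).
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you tell me what the Total Debt was in 2023?"},
]
prompt = build_gemma_prompt(chat)
print(prompt)
```

Because the string already starts with <bos>, it should be tokenized with add_special_tokens=False so that the token is not added twice.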
After the prompt is ready, generation can be performed like this:
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
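The decoded string contains the prompt as well as the newly generated text. If only the answer is needed, a small post-processing step can cut it out. This is a sketch under the assumption that the decoded text keeps the turn delimiters and that generation ends with <end_of_turn> or <eos>; extract_model_reply is a hypothetical helper, and the sample string is a placeholder, not real model output.

```python
def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn from a decoded generation."""
    marker = "<start_of_turn>model\n"
    # Everything after the last model-turn marker is newly generated text.
    _, found, reply = decoded.rpartition(marker)
    if not found:
        # No marker present: return the input unchanged apart from whitespace.
        return decoded.strip()
    # Cut at the first end marker, if generation produced one.
    for end_token in ("<end_of_turn>", "<eos>"):
        reply = reply.split(end_token, 1)[0]
    return reply.strip()


# Illustrative string only; this is NOT real model output.
decoded = (
    "<bos><start_of_turn>user\n"
    "Can you tell me what the Total Debt was in 2023?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "PLACEHOLDER ANSWER<end_of_turn>"
)
print(extract_model_reply(decoded))
```

An alternative is to slice the generated token IDs by the prompt length before decoding, which avoids string matching altogether.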
Inputs and outputs
- Input: Text string, such as a question, a prompt, or a 10-K document to be summarized.
- Output: Generated English-language text in response to the input, such as an answer to a question or a summary of an uploaded 10-K document. Summarization is currently handled by a separate model.
Model Data
Data used for model training and how the data was processed.
Training Dataset
This model is fine-tuned on the dataset "yatharth97/10k_reports_gemma", which uses a conversational format that lets the user ask questions about an uploaded 10-K report.