yatharth-gemma-7b-it-10k Model Card

Reference Model Page: Gemma

This model card describes a version of the Gemma model that has been fine-tuned on a dataset of 10-K reports to improve its performance at answering questions about those reports.

Authors: Yatharth Mahesh Sant

Model Information

Summary description and brief definition of inputs and outputs.

Description

The model presented here is a fine-tuned adaptation of Gemma 7B-IT, a member of the Gemma family of lightweight, state-of-the-art models developed by Google and built from the same research and technology used to create the Gemini models. This fine-tuned version specializes in parsing and understanding financial text, particularly the text found in 10-K reports.

Named "yatharth-gemma-7b-it-10k", this model retains the text-to-text, decoder-only architecture of its base model and works best in English. What sets it apart is its focus on question answering over 10-K reports, a key resource for financial analysts, investors, and regulatory professionals seeking AI-driven insights.

Preserving the open-weights philosophy of the original Gemma models, this variant has been instruction-tuned with a curated dataset of 10-K reports. It not only demonstrates an enhanced proficiency in generating accurate, context-aware responses to user queries but also maintains the flexibility and efficiency that allow deployment in various settings, from personal computers to cloud-based environments.

The "yatharth-gemma-7b-it-10k" model upholds the Gemma tradition of supporting text generation tasks such as summarization and complex reasoning. Its optimization for financial reports reflects our commitment to specialized AI, providing a tool for dissecting and interpreting one of the business world's most information-dense documents.

By marrying the accessibility of the Gemma models with the niche expertise required to navigate 10-K reports, we extend the frontiers of what's possible with AI, democratizing cutting-edge technology to empower financial analysis and decision-making.

Usage

Below are some code snippets to help you get started quickly with running the model. First run pip install -U transformers, then copy the snippet from the section that is relevant to your use case.

Fine-tuning the model

You can find fine-tuning scripts and a notebook under the examples/ directory of the google/gemma-7b repository. To adapt them to this model, change the model ID to yatharth97/yatharth-gemma-7b-it-10k. That repository provides:

A script to perform Supervised Fine-Tuning (SFT) on the UltraChat dataset using QLoRA
A script to perform SFT using FSDP on TPU devices
A notebook that you can run on a free-tier Google Colab instance to perform SFT on the English quotes dataset

Running the model on a CPU

As explained below, we recommend torch.bfloat16 as the default dtype. You can use a different precision if necessary.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    torch_dtype=torch.bfloat16
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Running the model on a single / multi GPU

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Running the model on a GPU using different precisions

The native weights of this model were exported in bfloat16 precision. You can use float16, which may be faster on certain hardware, by specifying torch_dtype when loading the model. For convenience, the float16 revision of the repo contains a copy of the weights already converted to that precision.

You can also use float32 if you skip the dtype, but this does not increase precision (the model weights are simply upcast to float32). See the examples below.

Using torch.float16

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    device_map="auto",
    torch_dtype=torch.float16,
    revision="float16",
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Using torch.bfloat16

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", device_map="auto", torch_dtype=torch.bfloat16)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Upcasting to torch.float32

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained(
    "yatharth97/yatharth-gemma-7b-it-10k",
    device_map="auto"
)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Quantized Versions through `bitsandbytes`

Using 8-bit precision (int8)

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", quantization_config=quantization_config)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Using 4-bit precision

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k")
model = AutoModelForCausalLM.from_pretrained("yatharth97/yatharth-gemma-7b-it-10k", quantization_config=quantization_config)

input_text = 'Can you tell me what the Total Debt was in 2023?'
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
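
The precision and quantization options above mainly change how much memory the weights occupy. As a rough sketch, weight memory can be estimated as parameter count times bytes per parameter. The parameter count used below is an assumption for a Gemma 7B-sized checkpoint, not a figure taken from this card, and the helper name weight_memory_gib is hypothetical. The estimate covers weights only: activations, the KV cache, and the extra metadata that quantized formats store are not included.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}

# ASSUMPTION: rough parameter count for a Gemma 7B-sized checkpoint.
# Replace it with the real count, e.g. sum(p.numel() for p in model.parameters()).
ASSUMED_PARAM_COUNT = 8_500_000_000


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for name in BYTES_PER_PARAM:
        gib = weight_memory_gib(ASSUMED_PARAM_COUNT, name)
        print(f"{name:>8}: {gib:6.1f} GiB")
```

Real usage will be somewhat higher than these numbers, so treat them as a lower bound when choosing between the snippets above.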

Other optimizations

Flash Attention 2

First make sure flash-attn is installed in your environment: pip install flash-attn

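from transformers import AutoModelForCausalLM
import torch

model_id = "yatharth97/yatharth-gemma-7b-it-10k"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to(0)  # move the model to GPU 0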
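
If the same script has to run on machines where flash-attn may not be installed, one option is to choose the attention backend at runtime. The sketch below is a hypothetical helper, not part of this model's code: it checks whether the flash_attn package is importable and otherwise falls back to "sdpa", assuming the installed transformers version accepts that value for attn_implementation.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # ASSUMPTION: the installed transformers version supports the "sdpa" backend.
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as attn_implementation when calling from_pretrained. Note that this only checks that the package is present, not that the GPU supports it.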

Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "yatharth97/yatharth-gemma-7b-it-10k"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    { "role": "user", "content": "Can you tell me what the Total Debt was in 2023?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

At this point, the prompt contains the following text:

<bos><start_of_turn>user
Can you tell me what the Total Debt was in 2023?<end_of_turn>
<start_of_turn>model

As you can see, each turn is preceded by a <start_of_turn> delimiter and then the role of the entity (either user, for content supplied by the user, or model for LLM responses). Turns finish with the <end_of_turn> token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's chat template.
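
A minimal sketch of building the prompt by hand, following the turn format described above. The function name build_gemma_prompt is hypothetical, and the trailing newline after the final model marker is an assumption based on the layout of the prompt shown above. Mapping the "assistant" role to "model" is a convenience added in this sketch.

```python
def build_gemma_prompt(chat, add_generation_prompt=True):
    """Build a Gemma-style prompt string from a list of role/content messages."""
    parts = ["<bos>"]
    for message in chat:
        # Convenience in this sketch: treat "assistant" as the "model" role.
        role = "model" if message["role"] == "assistant" else message["role"]
        if role not in ("user", "model"):
            raise ValueError(f"Unsupported role: {role!r}")
        # Each turn: start delimiter, role, newline, content, end delimiter.
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the model's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you tell me what the Total Debt was in 2023?"},
]
print(build_gemma_prompt(chat))
```

Because the string already starts with <bos>, it should be tokenized with add_special_tokens=False so that the token is not added twice.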

After the prompt is ready, generation can be performed like this:

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
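
The decoded output contains the whole prompt followed by the generated reply. One way to keep only the reply, sketched here as a plain string operation, is to cut the text after the last model turn marker and before the end-of-turn token. The helper name extract_model_reply and the sample string are hypothetical, and treating "<eos>" as the end-of-sequence token text is an assumption.

```python
def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded Gemma conversation."""
    marker = "<start_of_turn>model\n"
    start = decoded.rfind(marker)
    if start == -1:
        # No model turn found: return the text unchanged apart from whitespace.
        return decoded.strip()
    reply = decoded[start + len(marker):]
    # Cut at the first end-of-turn or end-of-sequence token, if present.
    for stop in ("<end_of_turn>", "<eos>"):
        reply = reply.split(stop, 1)[0]
    return reply.strip()


# Placeholder text standing in for real model output.
sample = (
    "<bos><start_of_turn>user\nA question?<end_of_turn>\n"
    "<start_of_turn>model\nA placeholder answer.<end_of_turn><eos>"
)
print(extract_model_reply(sample))
```

An alternative that avoids string parsing is to decode only the newly generated tokens, that is, the part of outputs[0] that comes after the prompt's token length.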

Inputs and outputs

Input: Text string, such as a question, a prompt, or a 10-K document to be summarized.
Output: Generated English-language text in response to the input, such as an answer to a question or a summary of an uploaded 10-K document. Summarization is currently handled by a separate model.

Model Data

Data used for model training and how the data was processed.

Training Dataset

This model is fine-tuned on the dataset "yatharth97/10k_reports_gemma", which uses a conversational format that lets the user ask questions about an uploaded 10-K report.