Introducing Lutece-Vision-Base 🚀

A Specialized Vision-Language Model (VLM) designed for financial document analysis and question answering

Model Description

Lutece-Vision-Base, named after the ancient name of Paris, is a Vision-Language Model (VLM) specialized for financial document analysis and question answering. It is a fine-tuned version of Microsoft's Florence-2-base-ft, tailored to interpret and answer questions about financial documents, reports, and images.

Model Details

Base Model: microsoft/Florence-2-base-ft
Fine-tuning Dataset: sujet-ai/Sujet-Finance-QA-Vision-100k
Training Data: 100,629 Q&A pairs (spanning 9,212 images)
Validation Data: 589 Q&A pairs (one pair per image, drawn from the 6,421 entries in the validation set)
Language: English
License: MIT

Training Specifications

Number of Epochs: 7
Learning Rate: 1e-6
Optimizer: AdamW
Architecture: Encoder parameters were frozen during fine-tuning
Hardware: One NVIDIA A100 GPU
Training Duration: Approximately 38 hours

Performance and Evaluation

We evaluated the model's performance using two approaches:

GPT-4o as an LLM judge
Cosine similarity measurement

GPT-4o Evaluation

This method compares the answers generated by both the vanilla Florence model and our fine-tuned Lutece-Vision-Base model.

Evaluation Process:

For each (image, question) pair in the validation set, we generate answers using both models.
GPT-4o acts as an impartial judge, evaluating the correctness of both answers without prior knowledge of the ground truth.
The evaluation considers factors such as numerical accuracy, spelling and minor wording differences, completeness of the answer, and relevance of information.

Evaluation Criteria:

Numerical Accuracy: Exact matches required for numbers, dates, and quantities.
Spelling and Minor Wording: Minor differences are acceptable if the core information is correct.
Completeness: Answers must fully address the question.
Relevance: Additional information is acceptable unless it contradicts the correct part of the answer.

GPT-4o Judge Prompt:

Analyze the image and the question, then evaluate the answers provided by the Vanilla Model and the Finetuned Model.

Question: {question}
Vanilla Model Answer: {vanilla_answer}
Finetuned Model Answer: {finetuned_answer}

Your task is to determine if each answer is correct or incorrect based on the image and question.
Consider the following guidelines:

1. Numerical Accuracy: 
   - For questions involving numbers (e.g., prices, dates, quantities), the answer must be exactly correct.
   - Example: If the correct price is $10.50, an answer of $10.49 or $10.51 is incorrect.
   - Example: If the correct date is June 15, 2023, an answer of June 14, 2023 or June 16, 2023 is incorrect.

2. Spelling and Minor Wording:
   - Minor spelling mistakes or slight wording differences should not be counted as incorrect if the core information is right.
   - Example: If the correct name is "John Smith", answers like "Jon Smith" or "John Smyth" should be considered correct.
   - Example: "The CEO of the company" instead of "The company's CEO" is acceptable.

3. Completeness:
   - The answer must fully address the question to be considered correct.
   - Partially correct answers or answers that miss key parts of the question should be marked as incorrect.

4. Irrelevant Information:
   - Additional irrelevant information does not make an otherwise correct answer incorrect.
   - However, if the irrelevant information contradicts the correct part of the answer, mark it as incorrect.

Respond using the following JSON format:
{
    "vanilla_correct": <boolean>,
    "finetuned_correct": <boolean>,
    "explanation": "Your explanation here"
}

Where:
- "vanilla_correct" is true if the Vanilla Model's answer is correct, false otherwise.
- "finetuned_correct" is true if the Finetuned Model's answer is correct, false otherwise.
- "explanation" briefly explains your evaluation for both answers, referencing the guidelines above.

Your response should contain ONLY the JSON output, and no text before or after to avoid output parsing errors.
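
A hedged sketch of how a template like the one above might be filled in and its reply checked. Because the template contains literal JSON braces, `str.format` would raise on it, so the sketch substitutes only the three named placeholders with a regular expression. The helper names `fill_prompt` and `parse_verdict` and the shortened `TEMPLATE` are illustrative assumptions, not code from the original evaluation.

```python
import json
import re

# Shortened stand-in for the full judge prompt; only the placeholders and
# the literal JSON braces matter for this illustration.
TEMPLATE = (
    "Question: {question}\n"
    "Vanilla Model Answer: {vanilla_answer}\n"
    "Finetuned Model Answer: {finetuned_answer}\n"
    "Respond using the following JSON format:\n"
    "{\n"
    '    "vanilla_correct": <boolean>,\n'
    '    "finetuned_correct": <boolean>,\n'
    '    "explanation": "Your explanation here"\n'
    "}"
)

# Matches exactly the three placeholders and nothing else in braces.
_PLACEHOLDER = re.compile(r"\{(question|vanilla_answer|finetuned_answer)\}")


def fill_prompt(template: str, question: str, vanilla_answer: str, finetuned_answer: str) -> str:
    """Substitute the three placeholders in one pass, leaving the JSON braces alone."""
    values = {
        "question": question,
        "vanilla_answer": vanilla_answer,
        "finetuned_answer": finetuned_answer,
    }
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply and check that both verdict fields are real booleans."""
    verdict = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(verdict, dict):
        raise ValueError("judge reply is not a JSON object")
    for key in ("vanilla_correct", "finetuned_correct"):
        if not isinstance(verdict.get(key), bool):
            raise ValueError(f"missing or non-boolean field: {key}")
    return verdict


prompt = fill_prompt(TEMPLATE, "What is the total revenue?", "$1.2M", "$1.25M")
verdict = parse_verdict(
    '{"vanilla_correct": false, "finetuned_correct": true, "explanation": "ok"}'
)
print(verdict["vanilla_correct"], verdict["finetuned_correct"])
```

Validating the field types up front means a reply such as `"vanilla_correct": "yes"` is rejected rather than silently counted as a correct answer.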

Cosine Similarity Measurement

We also measured the cosine similarity between each model's answers and the answers labeled as ground truth by GPT-4o. This gives a quantitative measure of how close the model outputs are to the expected answers in embedding space.

Process:

We used the BAAI/bge-base-en-v1.5 embedding model (https://huggingface.co/BAAI/bge-base-en-v1.5) to convert the answers into vector representations.
Cosine similarity was calculated between the embeddings of the model-generated answers and the ground truth answers.
This similarity score provides an additional metric for evaluating the models' performance, capturing semantic similarity beyond exact word matching.
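
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following is a minimal, dependency-free sketch of that computation. The short vectors are toy values chosen for illustration; in practice the inputs would be the embeddings produced by BAAI/bge-base-en-v1.5, and the function name `cosine_similarity` is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (real sentence embeddings are much longer).
prediction = [1.0, 2.0, 2.0]
ground_truth = [2.0, 4.0, 4.0]  # same direction as `prediction`, different length
unrelated = [2.0, -1.0, 0.0]    # orthogonal to `prediction`

same_direction = cosine_similarity(prediction, ground_truth)
orthogonal = cosine_similarity(prediction, unrelated)
print(round(same_direction, 6), round(orthogonal, 6))
```

A score near 1 means the two answers point in almost the same direction in embedding space, regardless of vector length, while a score near 0 means they are unrelated.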

Performance Comparison:

For a detailed overview of the metrics logged during training, please refer to our Weights & Biases report.

Usage

Here's a quick start guide to using Lutece-Vision-Base:

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
import torch 

# Load and configure the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
config.vision_config.model_type = "davit"
model = AutoModelForCausalLM.from_pretrained("sujet-ai/Lutece-Vision-Base", config=config, trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("sujet-ai/Lutece-Vision-Base", config=config, trust_remote_code=True)
task = "<FinanceQA>"

# Load input image and define the question
image = Image.open('test.png').convert('RGB')
prompt = "How much decrease in prepaid expenses was reported?"

# Process input and generate answer
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)

# Decode and parse the answer
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task=task, image_size=(image.width, image.height))
print(parsed_answer[task])

Demo and Further Resources

Interactive Demo: Try out Lutece-Vision-Base using our Hugging Face Space. Please note that this demo runs on CPU, so inference might be slower than on GPU.
Fine-tuning Tutorial: If you're interested in fine-tuning Florence 2 for your own tasks, we recommend this excellent tutorial on Hugging Face.

Limitations and Disclaimer

While Lutece-Vision-Base has been trained on a diverse set of financial documents, it may not cover all possible financial scenarios or document types. The model can make mistakes, especially in complex or ambiguous situations. Users should verify critical information and not rely solely on the model's output for making important financial decisions.

Disclaimer: Sujet AI provides Lutece-Vision-Base as-is, without any warranties, express or implied. We are not responsible for any consequences resulting from the use of this model. Users should exercise their own judgment and verify information when using the model for financial analysis or decision-making purposes.

The model may reflect biases present in its training data and should be used with this understanding. Continuous evaluation and updating of the model with diverse and representative data are recommended for maintaining its relevance and accuracy.

Citation and Contact

If you use Lutece-Vision-Base in your research or applications, please cite it as:

@software{Lutece-Vision-Base,
  author = {{Sujet AI} and Allaa Boutaleb and Hamed Rahimi},
  title = {Lutece-Vision-Base: A Fine-tuned VLM for Financial Document Analysis},
  year = {2024},
  url = {https://huggingface.co/sujet-ai/Lutece-Vision-Base}
}

For questions, feedback, or collaborations, please reach out to us on LinkedIn or visit our website https://sujet.ai.
