license: apache-2.0
datasets:
  - ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
  - it
base_model:
  - dbmdz/bert-base-italian-xxl-uncased

bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

Model Overview

This model is a fine-tuned version of dbmdz/bert-base-italian-xxl-uncased tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It is trained with a contrastive objective so that its embeddings can be used to match queries with relevant passages.
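As a rough illustration of what a contrastive objective optimizes, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives. This card does not state the exact loss, temperature, or negative-sampling scheme used in training, so the function name `info_nce_loss` and the default `temperature=0.05` are assumptions made for illustration only.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch acts as a negative. The temperature value is a
    hypothetical default, not a setting taken from the model card.
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)
```

The loss is close to zero when each query is most similar to its own passage and grows when a negative scores higher than the positive.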

Model Size

  • Size: Approximately 450 MB

Training Details

  • Base Model: dbmdz/bert-base-italian-xxl-uncased
  • Dataset: Italian-BERT-FineTuning-Embeddings
    • Derived from the C4 dataset using sliding window segmentation and in-document sampling.
    • Size: ~5 GB (4.5 GB train, 0.5 GB test)
  • Training Configuration:
    • Hardware: NVIDIA A40 GPU
    • Epochs: 3
    • Total Steps: 922,958
    • Training Time: Approximately 5 days, 2 hours, and 23 minutes
  • Training Objective: Contrastive Learning
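
The dataset notes above mention sliding window segmentation and in-document sampling without further detail. One plausible reading, sketched below, is that each document is cut into overlapping word windows and two windows from the same document are paired as a positive example. The window size, stride, and the helper names `sliding_windows` and `sample_positive_pair` are hypothetical and are not taken from the actual dataset pipeline.

```python
import random


def sliding_windows(words, window_size=128, stride=64):
    """Split a list of words into overlapping windows.

    Assumes stride <= window_size so that no word is skipped.
    The final window may be shorter than window_size.
    """
    if len(words) <= window_size:
        return [words]
    windows = []
    for start in range(0, len(words) - window_size + stride, stride):
        windows.append(words[start:start + window_size])
    return windows


def sample_positive_pair(windows, rng):
    """Pick two different windows of one document as (anchor, positive).

    Returns None when the document is too short to yield two windows.
    """
    if len(windows) < 2:
        return None
    anchor, positive = rng.sample(windows, 2)
    return " ".join(anchor), " ".join(positive)


# Example: a 10-word toy document, windows of 4 words with a stride of 2.
toy_words = [str(i) for i in range(10)]
toy_windows = sliding_windows(toy_words, window_size=4, stride=2)
toy_pair = sample_positive_pair(toy_windows, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.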

Evaluation Metrics

Evaluations were performed using the mMARCO dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.

Results Comparison

| Metric | Base Model (dbmdz/bert-base-italian-xxl-uncased) | facebook/mcontriever-msmarco | Fine-Tuned Model |
| --- | --- | --- | --- |
| Recall@1 | 0.0026 | 0.0828 | 0.2106 |
| Recall@100 | 0.0417 | 0.5028 | 0.8356 |
| Recall@1000 | 0.2061 | 0.8049 | 0.9719 |
| Average Precision | 0.0050 | 0.1397 | 0.3173 |
| NDCG@10 | 0.0043 | 0.1591 | 0.3601 |
| NDCG@100 | 0.0108 | 0.2086 | 0.4218 |
| NDCG@1000 | 0.0299 | 0.2454 | 0.4391 |
| MRR@10 | 0.0036 | 0.1299 | 0.3047 |
| MRR@100 | 0.0045 | 0.1385 | 0.3167 |
| MRR@1000 | 0.0050 | 0.1397 | 0.3173 |

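Note: The fine-tuned model outperforms both the base model and facebook/mcontriever-msmarco on every reported metric. For example, Recall@1 rises from 0.0026 (base) and 0.0828 (mcontriever) to 0.2106, and NDCG@10 from 0.0043 and 0.1591 to 0.3601.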
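
For readers unfamiliar with the metrics, the sketch below shows how Recall@k, MRR@k, and NDCG@k are commonly computed for a single query with binary relevance labels; the reported figures are averages of such per-query values. The evaluation toolkit used for this card is not stated, so these functions are illustrative definitions and not the actual evaluation code.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the only relevant passage, "d1", is ranked second.
ranked = ["d3", "d1", "d2"]
relevant = {"d1"}
```

In this toy example `recall_at_k(ranked, relevant, 1)` is 0.0 and `mrr_at_k(ranked, relevant, 10)` is 0.5, because the relevant passage first appears at rank 2.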

Usage

You can load and use the model directly with the Hugging Face Transformers library:

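# Load the encoder without the masked-LM head: embeddings come from hidden states, not vocabulary logits
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model.eval()  # disable dropout for inference

# Example usage
text = "Stanchi di non riuscire a trovare il partner perfetto?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # no gradients are needed at inference time
    outputs = model(**inputs)

# Per-token hidden states, shape (1, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state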
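
To use the model for retrieval, the per-token hidden states of the encoder (for example the `last_hidden_state` of a model loaded with `AutoModel`) have to be pooled into a single vector per text and then compared. This card does not say which pooling strategy was used in training, so the mean pooling below is an assumption, and the helper names `mean_pool`, `cosine`, and `rank_passages` are hypothetical. The sketch uses plain Python lists so the logic is easy to follow; in practice the same steps would run on tensors.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending cosine similarity."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy example: three tokens, the last one is padding and is ignored.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
order = rank_passages([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]])
```

With real model outputs, the token vectors and mask for one text could be obtained by calling `.tolist()` on the first row of the hidden-state tensor and of `inputs["attention_mask"]`.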

Intended Use

This model is intended for:

  • Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
  • Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.

Suitable for both industry applications and academic research.

Limitations

  • The model may inherit biases present in the C4 dataset.
  • Performance is primarily evaluated on mMARCO; results may vary with other datasets.

Contact

Archit Rastogi
📧 architrastogi20@gmail.com