---
language:
  - it
tags:
  - text2text-generation
  - summarization
  - legal-ai
  - italian-law
license: mit
datasets:
  - joelniklaus/Multi_Legal_Pile
library_name: transformers
pipeline_tag: text2text-generation
widget:
  - text: '<mask> 1234: Il contratto si intende concluso quando...'
base_model:
  - morenolq/bart-it
---

📌 Model Card: LEGIT-BART Series

🏛️ Model Overview

The LEGIT-BART models are a family of pre-trained transformer-based models for Italian legal text processing. They build upon BART-IT (morenolq/bart-it) and are further pre-trained on Italian legal corpora.

💡 Key features:

  • Extended context length with Local-Sparse-Global (LSG) Attention (up to 16,384 tokens) 📜
  • Trained on legal documents such as statutes, case law, and contracts 📑
  • Not fine-tuned for specific tasks (requires further adaptation)

📂 Available Models

| Model | Description | Link |
|---|---|---|
| LEGIT-BART | Continued pre-training of morenolq/bart-it on Italian legal texts | 🔗 Link |
| LEGIT-BART-LSG-4096 | Continued pre-training of morenolq/bart-it, supporting 4,096 tokens | 🔗 Link |
| LEGIT-BART-LSG-16384 | Continued pre-training of morenolq/bart-it, supporting 16,384 tokens | 🔗 Link |
| LEGIT-SCRATCH-BART | Trained from scratch on Italian legal texts | 🔗 Link |
| LEGIT-SCRATCH-BART-LSG-4096 | Trained from scratch with LSG attention, supporting 4,096 tokens | 🔗 Link |
| LEGIT-SCRATCH-BART-LSG-16384 | Trained from scratch with LSG attention, supporting 16,384 tokens | 🔗 Link |
| BART-IT-LSG-4096 | morenolq/bart-it with LSG attention, supporting 4,096 tokens (no legal adaptation) | 🔗 Link |
| BART-IT-LSG-16384 | morenolq/bart-it with LSG attention, supporting 16,384 tokens (no legal adaptation) | 🔗 Link |
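
Choosing among the LEGIT variants mostly comes down to input length. As a rough sketch, the helper below picks the smallest variant whose context window fits a document of a given token count. The 4,096 and 16,384 limits come from the table; the 1,024-token limit for the non-LSG models is an assumption (the usual BART default), and `pick_variant` is a hypothetical name, not part of any library.

```python
# Context window per variant. The LSG limits are taken from the model table;
# the 1,024 limit for the non-LSG models is an ASSUMPTION (typical BART default).
CONTEXT_LIMITS = {
    "LEGIT-BART": 1024,
    "LEGIT-BART-LSG-4096": 4096,
    "LEGIT-BART-LSG-16384": 16384,
    "LEGIT-SCRATCH-BART": 1024,
    "LEGIT-SCRATCH-BART-LSG-4096": 4096,
    "LEGIT-SCRATCH-BART-LSG-16384": 16384,
}


def pick_variant(n_tokens: int, from_scratch: bool = False) -> str:
    """Return the smallest variant whose context window holds n_tokens (hypothetical helper)."""
    prefix = "LEGIT-SCRATCH-BART" if from_scratch else "LEGIT-BART"
    # Keep only the chosen family, ordered from smallest to largest window
    candidates = sorted(
        (limit, name)
        for name, limit in CONTEXT_LIMITS.items()
        if name == prefix or name.startswith(prefix + "-LSG-")
    )
    for limit, name in candidates:
        if n_tokens <= limit:
            return name
    raise ValueError(
        f"{n_tokens} tokens exceeds the largest context window ({candidates[-1][0]})"
    )


print(pick_variant(800))
print(pick_variant(3000, from_scratch=True))
```

The token count would normally come from the tokenizer of the family you intend to use, since the from-scratch models have their own tokenizers and may split the same text differently.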

🛠️ Model Details

🔹 Architecture

  • Base Model: morenolq/bart-it
  • Transformer Encoder-Decoder
  • LSG Attention for long documents
  • Models trained from scratch use their own dedicated tokenizers; in our experiments, these models underperformed the continually pre-trained ones.
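
To give a feel for why LSG attention scales to long documents, here is a deliberately simplified sketch of a local-plus-global attention pattern: each token attends to a fixed window of neighbours plus a few global tokens. It is an illustration only, not the models' actual implementation. The sparse component of LSG is omitted, and the `window` and `n_global` values are arbitrary assumptions.

```python
def local_global_mask(seq_len: int, window: int, n_global: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_local = abs(i - j) <= window           # neighbours within the window
            is_global = i < n_global or j < n_global  # global tokens see, and are seen by, everyone
            row.append(is_local or is_global)
        mask.append(row)
    return mask


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


seq_len = 64
mask = local_global_mask(seq_len, window=4, n_global=2)
print("full attention pairs:", seq_len * seq_len)
print("local+global pairs:  ", attended_pairs(mask))
```

With a fixed window and a fixed number of global tokens, the number of allowed pairs grows roughly linearly with sequence length rather than quadratically, which is what makes inputs of several thousand tokens affordable.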

🔹 Training Data

  • Dataset: joelniklaus/Multi_Legal_Pile
  • Types of legal texts used:
    • Legislation (laws, codes, amendments)
    • Case law (judicial decisions)
    • Contracts (public legal agreements)

🚀 How to Use

from transformers import BartForConditionalGeneration, AutoTokenizer

# Load tokenizer and model
model_name = "morenolq/LEGIT-SCRATCH-BART"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Example input
input_text = "<mask> 1234: Il contratto si intende concluso quando..."
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Let the pre-trained model fill in the masked span
output_ids = model.generate(inputs.input_ids, max_length=150, num_beams=4, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("📝:", output_text)
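
The example above truncates the input to 512 tokens. For longer documents with a non-LSG checkpoint, one common workaround is to split the token ids into overlapping windows and process each window separately. The sketch below shows only the windowing step on a plain list of ids; `chunk_ids`, the window size, and the overlap are illustrative assumptions, not something this model card prescribes.

```python
def chunk_ids(ids: list[int], max_len: int = 512, overlap: int = 64) -> list[list[int]]:
    """Split ids into windows of at most max_len items, each sharing `overlap` items with the previous one."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break  # this window already reaches the end of the input
    return chunks


# Stand-in for tokenizer output: 1,200 fake token ids
fake_ids = list(range(1200))
for chunk in chunk_ids(fake_ids):
    print(len(chunk), chunk[0], chunk[-1])
```

This sketch ignores special tokens; in practice each window would need the tokenizer's start and end tokens added before being passed to the model. For documents that exceed the short context, the LSG variants listed above avoid chunking altogether.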

⚠️ Limitations & Ethical Considerations

  • Not fine-tuned for specific tasks: The models are pre-trained on legal texts and may require further adaptation for specific legal NLP tasks (e.g., summarization, question-answering).
  • Bias and fairness: Legal texts may contain biases present in the legal system. Care should be taken to ensure fairness and ethical use of the models.
  • Legal advice: The models are not a substitute for professional legal advice. Always consult a qualified legal professional for legal matters.

📚 Reference

The LEGIT-BART models are presented in the following paper, published in Artificial Intelligence and Law (Springer, 2025):

@article{benedetto2025legitbart,
    title        = {LegItBART: a summarization model for Italian legal documents},
    author       = {Benedetto, Irene and La Quatra, Moreno and Cagliero, Luca},
    year         = 2025,
    journal      = {Artificial Intelligence and Law},
    publisher    = {Springer},
    pages        = {1--31},
    doi          = {10.1007/s10506-025-09436-y},
    url          = {https://doi.org/10.1007/s10506-025-09436-y}
}