---
language: en
license: apache-2.0
tags:
- text-summarization
- t5
- t5-small
- legislative
- texas
- long-document-summarization
metrics:
- rouge
- bertscore
- compression_ratio
model-index:
- name: T5-small Texas Legislative Summarization
  model_id: your-huggingface-username/t5-small-texas-legislative-summarization
  results:
  - task:
      type: text-summarization
    dataset:
      name: Texas Legislative Bills
      type: custom
    metrics:
    - name: Rouge1
      value: N/A
      type: rouge1
    - name: Rouge2
      value: N/A
      type: rouge2
    - name: RougeL
      value: N/A
      type: rougeL
    - name: BERTScore F1
      value: N/A
      type: bertscore_f1
    - name: Compression Ratio
      value: N/A
      type: compression_ratio
---
# T5-small Texas Legislative Summarization
This model is a fine-tuned version of [google/t5-small](https://huggingface.co/google/t5-small) for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long document summarization.
## Model Details
- **Model Name:** T5-small Texas Legislative Summarization
- **Base Model:** [google/t5-small](https://huggingface.co/google/t5-small)
- **Model Type:** Seq2Seq Language Model
- **Architecture:** T5ForConditionalGeneration
- **Language:** English
- **License:** Apache 2.0
### Model Description
Note that this is not an optimally trained model; it is provided as an example use case. The model takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is designed to handle long-document summarization.
### Intended Use
This model can be used for:
- Summarizing Texas legislative bills for easier understanding.
- Providing a quick overview of bill content for researchers, journalists, and the general public.
- Automating the summarization process to save time and resources.
## Training Data
The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:
- **Source:** `cleaned_texas_leg_data.json` (This file is not publicly available and would need to be replaced with a public dataset or a description of how to create one).
- **Source Text Column:** `enrolled_text`
- **Target Text Column:** `summary_text`
- **Data Preprocessing:** Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer.
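For reference, here is a minimal preprocessing sketch consistent with the description above. The file name, column names, split ratio, sequence lengths, and prefix come from this card; the loading approach and helper names are illustrative assumptions, not the exact training script.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the bill/summary pairs and create an 80/20 train/test split (seed from this card).
raw = load_dataset("json", data_files="cleaned_texas_leg_data.json", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("google/t5-small")
prefix = "summarize: "

def preprocess(batch):
    # Prepend the task prefix to the source text and tokenize source and target.
    inputs = [prefix + text for text in batch["enrolled_text"]]
    model_inputs = tokenizer(inputs, max_length=4979, truncation=True)
    labels = tokenizer(text_target=batch["summary_text"], max_length=752, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = splits.map(preprocess, batched=True, remove_columns=splits["train"].column_names)
```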
## Training Procedure
The model was fine-tuned using the following parameters:
- **Model:** `google/t5-small`
- **Training Framework:** Transformers library using `Seq2SeqTrainer`
- **Optimizer:** Adafactor
- **Loss Function:** Cross-Entropy Loss
- **Epochs:** 5
- **Batch Size:** 1 (per device)
- **Gradient Accumulation Steps:** 4
- **Learning Rate:** 1e-05
- **Weight Decay:** 0.0
- **FP16 Training:** Enabled
- **Gradient Checkpointing:** Enabled
- **Evaluation Strategy:** Epoch
- **Save Strategy:** Epoch
- **Early Stopping:** Enabled (patience=3, threshold=0.01)
- **Random Seed:** 42
- **Max Source Length:** 4979
- **Max Target Length:** 752
- **Prefix:** `"summarize: "`
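A hedged sketch of how this configuration could be expressed with `Seq2SeqTrainingArguments` and `Seq2SeqTrainer` follows. The values mirror the list above; `tokenized` refers to the preprocessing sketch, and the output directory is an assumption.
```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-small")

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-texas-legislative-summarization",  # assumed output directory
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.0,
    optim="adafactor",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",   # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    seed=42,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
)
trainer.train()
```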
### Hyperparameter Tuning
A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:
- Learning Rates: `[1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]`
- Weight Decays: `[0.0, 0.01, 0.015, 0.001, 0.005]`
- Gradient Accumulation Steps: `[4]`
The best model was selected based on the lowest perplexity on the evaluation set.
**Best Parameters:**
- Learning Rate: 1e-05
- Weight Decay: 0.0
- Gradient Accumulation Steps: 4
- Perplexity: N/A
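Since the selection criterion was perplexity, which is the exponential of the evaluation cross-entropy loss, the search can be sketched as a simple grid loop over the values listed above. This continues the training sketch (`args`, `tokenized`, `tokenizer`) and is an assumed illustration, not the exact search script.
```python
import math

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer

best = {"perplexity": float("inf")}
for lr in [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]:
    for wd in [0.0, 0.01, 0.015, 0.001, 0.005]:
        args.learning_rate = lr
        args.weight_decay = wd
        # Re-initialize the model so every configuration starts from the same checkpoint.
        model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-small")
        trainer = Seq2SeqTrainer(
            model=model,
            args=args,
            train_dataset=tokenized["train"],
            eval_dataset=tokenized["test"],
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        )
        trainer.train()
        perplexity = math.exp(trainer.evaluate()["eval_loss"])  # perplexity = exp(loss)
        if perplexity < best["perplexity"]:
            best = {"learning_rate": lr, "weight_decay": wd, "perplexity": perplexity}
```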
## Evaluation
The model was evaluated on a held-out test set using the following metrics:
- **ROUGE (Rouge1, Rouge2, RougeL):** Measures the overlap of n-grams between the generated summaries and the reference summaries.
- **BERTScore (Precision, Recall, F1):** Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
- **Compression Ratio:** Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).
**Evaluation Results:**
Evaluation metrics were not calculated during training; results are therefore not available.
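For reference, a hedged sketch of how these metrics could be computed with the `evaluate` library; the compression-ratio helper and the placeholder lists are illustrative assumptions rather than the original evaluation script.
```python
import evaluate
import nltk

nltk.download("punkt")  # sentence tokenizer used for the compression ratio

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def compression_ratio(summary, document):
    # Sentence-based ratio: number of summary sentences over number of document sentences.
    return len(nltk.sent_tokenize(summary)) / max(len(nltk.sent_tokenize(document)), 1)

predictions = ["Generated summary ..."]   # model outputs
references = ["Reference summary ..."]    # ground-truth summaries
documents = ["Original bill text ..."]    # enrolled bill texts

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")
ratios = [compression_ratio(p, d) for p, d in zip(predictions, documents)]
```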
## Usage
Here's how to use the model for inference:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
def summarize(text):
    # Prepend the task prefix used during training.
    input_text = "summarize: " + text
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=4979, truncation=True)
    summary_ids = model.generate(
        input_ids,
        max_length=752,
        num_beams=4,
        early_stopping=True,
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)
```