---
language: en
license: apache-2.0
tags:
- text-summarization
- t5
- t5-small
- legislative
- texas
- long-document-summarization
metrics:
- rouge
- bertscore
- compression_ratio
model-index:
- name: T5-small Texas Legislative Summarization
  model_id: your-huggingface-username/t5-small-texas-legislative-summarization
  results:
  - task:
      type: text-summarization
    dataset:
      name: Texas Legislative Bills
      type: custom
    metrics:
      - name: Rouge1
        value: N/A
        type: rouge1
      - name: Rouge2
        value: N/A
        type: rouge2
      - name: RougeL
        value: N/A
        type: rougeL
      - name: BERTScore F1
        value: N/A
        type: bertscore_f1
      - name: Compression Ratio
        value: N/A
        type: compression_ratio
---

# T5-small Texas Legislative Summarization

This model is a fine-tuned version of [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long-document summarization.

## Model Details

-   **Model Name:** T5-small Texas Legislative Summarization
-   **Base Model:** [google-t5/t5-small](https://huggingface.co/google-t5/t5-small)
-   **Model Type:** Seq2Seq Language Model
-   **Architecture:** T5ForConditionalGeneration
-   **Language:** English
-   **License:** Apache 2.0

### Model Description

Note that this model is provided as an example use case rather than a fully optimized model. It takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is configured for long-document summarization.

### Intended Use

This model can be used for:

-   Summarizing Texas legislative bills for easier understanding.
-   Providing a quick overview of bill content for researchers, journalists, and the general public.
-   Automating the summarization process to save time and resources.

## Training Data

The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:

-   **Source:** `cleaned_texas_leg_data.json` (This file is not publicly available and would need to be replaced with a public dataset or a description of how to create one).
-   **Source Text Column:** `enrolled_text`
-   **Target Text Column:** `summary_text`
-   **Data Preprocessing:** Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer.
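
A minimal sketch of this preprocessing step, assuming `cleaned_texas_leg_data.json` is a JSON list of records with `enrolled_text` and `summary_text` keys (the exact file structure is an assumption):

```python
import json

from datasets import Dataset
from transformers import AutoTokenizer

MAX_SOURCE_LEN = 4979
MAX_TARGET_LEN = 752
PREFIX = "summarize: "

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

# Assumption: the JSON file is a list of dicts with the two text columns.
with open("cleaned_texas_leg_data.json") as f:
    records = json.load(f)

# 80/20 train/test split with the fixed seed used for training.
dataset = Dataset.from_list(records).train_test_split(test_size=0.2, seed=42)

def preprocess(batch):
    # Prepend the T5 task prefix and truncate to the training lengths.
    model_inputs = tokenizer(
        [PREFIX + text for text in batch["enrolled_text"]],
        max_length=MAX_SOURCE_LEN,
        truncation=True,
    )
    labels = tokenizer(
        text_target=batch["summary_text"],
        max_length=MAX_TARGET_LEN,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)
```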

## Training Procedure

The model was fine-tuned using the following parameters:

-   **Model:** `google-t5/t5-small`
-   **Training Framework:** Transformers library using `Seq2SeqTrainer`
-   **Optimizer:** Adafactor
-   **Loss Function:** Cross-Entropy Loss
-   **Epochs:** 5
-   **Batch Size:** 1 (per device)
-   **Gradient Accumulation Steps:** 4
-   **Learning Rate:** 1e-05
-   **Weight Decay:** 0.0
-   **FP16 Training:** Enabled
-   **Gradient Checkpointing:** Enabled
-   **Evaluation Strategy:** Epoch
-   **Save Strategy:** Epoch
-   **Early Stopping:** Enabled (patience=3, threshold=0.01)
-   **Random Seed:** 42
-   **Max Source Length:** 4979
-   **Max Target Length:** 752
-   **Prefix:** `"summarize: "`
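
Expressed as code, this configuration might look like the sketch below (assuming a recent version of `transformers`; `output_dir` is illustrative, and `tokenized` comes from the preprocessing sketch above):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-texas-legislative",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.0,
    optim="adafactor",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",        # `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    seed=42,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3,
                                     early_stopping_threshold=0.01)],
)
trainer.train()
```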

### Hyperparameter Tuning

A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:

-   Learning Rates: `[1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]`
-   Weight Decays: `[0.0, 0.01, 0.015, 0.001, 0.005]`
-   Gradient Accumulation Steps: `[4]`

The best model was selected based on the lowest perplexity on the evaluation set.
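
A sketch of that grid search, where `build_trainer` is a hypothetical helper wrapping the trainer setup shown above, and perplexity is derived from the evaluation cross-entropy loss:

```python
import math
from itertools import product

learning_rates = [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]
weight_decays = [0.0, 0.01, 0.015, 0.001, 0.005]

best = {"perplexity": float("inf")}
for lr, wd in product(learning_rates, weight_decays):
    # build_trainer is a hypothetical helper; see the trainer sketch above.
    trainer = build_trainer(learning_rate=lr, weight_decay=wd,
                            gradient_accumulation_steps=4)
    trainer.train()
    eval_loss = trainer.evaluate()["eval_loss"]
    perplexity = math.exp(eval_loss)  # perplexity = exp(cross-entropy)
    if perplexity < best["perplexity"]:
        best = {"learning_rate": lr, "weight_decay": wd,
                "perplexity": perplexity}

print(best)
```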

**Best Parameters:**

-   Learning Rate: 1e-05
-   Weight Decay: 0.0
-   Gradient Accumulation Steps: 4
-   Perplexity: N/A

## Evaluation

The model was evaluated on a held-out test set using the following metrics:

-   **ROUGE (Rouge1, Rouge2, RougeL):** Measures the overlap of n-grams between the generated summaries and the reference summaries.
-   **BERTScore (Precision, Recall, F1):** Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
-   **Compression Ratio:** Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).
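
Although no results were recorded, here is a minimal sketch of how these metrics could be computed (assuming the `evaluate`, `rouge_score`, `bert_score`, and `nltk` packages are installed; the placeholder lists stand in for real outputs):

```python
import evaluate
import nltk

nltk.download("punkt")  # sentence tokenizer used for the compression ratio

predictions = ["generated summary ..."]  # model outputs on the test set
references = ["reference summary ..."]   # gold summaries
sources = ["original bill text ..."]     # original documents

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions,
                                references=references, lang="en")

# Sentence-based compression ratio: summary sentences / source sentences.
ratios = [
    len(nltk.sent_tokenize(pred)) / len(nltk.sent_tokenize(src))
    for pred, src in zip(predictions, sources)
]
compression_ratio = sum(ratios) / len(ratios)
```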

**Evaluation Results:**

Evaluation metrics were not calculated during training, so results are not available.

## Usage

Here's how to use the model for inference:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")

def summarize(text):
    # T5 expects the same task prefix that was used during training.
    input_text = "summarize: " + text
    # Truncate to the maximum source length used during training.
    input_ids = tokenizer.encode(input_text, return_tensors="pt",
                                 max_length=4979, truncation=True)
    summary_ids = model.generate(input_ids,
                                 max_length=752,  # max target length from training
                                 num_beams=4,
                                 early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)
```
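
Beam search with `num_beams=4` generally produces more fluent summaries than greedy decoding at the cost of slower inference, and `early_stopping=True` ends generation once all beams have emitted an end-of-sequence token. For full-length bills, inference is considerably faster on a GPU (e.g., `model.to("cuda")`, moving `input_ids` to the same device), since encoding inputs of up to 4,979 tokens is slow on CPU.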