---
language: en
license: apache-2.0
tags:
- text-summarization
- t5
- t5-small
- legislative
- texas
- long-document-summarization
metrics:
- rouge
- bertscore
- compression_ratio
model-index:
- name: T5-small Texas Legislative Summarization
  model_id: your-huggingface-username/t5-small-texas-legislative-summarization
  results:
  - task:
      type: text-summarization
    dataset:
      name: Texas Legislative Bills
      type: custom
    metrics:
    - name: Rouge1
      value: N/A
      type: rouge1
    - name: Rouge2
      value: N/A
      type: rouge2
    - name: RougeL
      value: N/A
      type: rougeL
    - name: BERTScore F1
      value: N/A
      type: bertscore_f1
    - name: Compression Ratio
      value: N/A
      type: compression_ratio
---
# T5-small Texas Legislative Summarization
This model is a fine-tuned version of [google/t5-small](https://huggingface.co/google/t5-small) for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long document summarization.
## Model Details
- **Model Name:** T5-small Texas Legislative Summarization
- **Base Model:** [google/t5-small](https://huggingface.co/google/t5-small)
- **Model Type:** Seq2Seq Language Model
- **Architecture:** T5ForConditionalGeneration
- **Language:** English
- **License:** Apache 2.0
### Model Description
Note that this is not an optimally trained model; it is provided as an example use case. The model takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is designed to handle long-document summarization.
### Intended Use
This model can be used for:
- Summarizing Texas legislative bills for easier understanding.
- Providing a quick overview of bill content for researchers, journalists, and the general public.
- Automating the summarization process to save time and resources.
## Training Data
The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:
- **Source:** `cleaned_texas_leg_data.json` (This file is not publicly available and would need to be replaced with a public dataset or a description of how to create one).
- **Source Text Column:** `enrolled_text`
- **Target Text Column:** `summary_text`
- **Data Preprocessing:** Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer.
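For reference, here is a minimal preprocessing sketch consistent with the description above. The file name, column names, split ratio, sequence lengths, and prefix come from this card; the loading approach and helper names are illustrative assumptions, not the exact training script.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the bill/summary pairs and create an 80/20 train/test split (seed from this card).
raw = load_dataset("json", data_files="cleaned_texas_leg_data.json", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("google/t5-small")
prefix = "summarize: "

def preprocess(batch):
    # Prepend the task prefix to the source text and tokenize source and target.
    inputs = [prefix + text for text in batch["enrolled_text"]]
    model_inputs = tokenizer(inputs, max_length=4979, truncation=True)
    labels = tokenizer(text_target=batch["summary_text"], max_length=752, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = splits.map(preprocess, batched=True, remove_columns=splits["train"].column_names)
```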
## Training Procedure
The model was fine-tuned using the following parameters:
- **Model:** `google/t5-small`
- **Training Framework:** Transformers library using `Seq2SeqTrainer`
- **Optimizer:** Adafactor
- **Loss Function:** Cross-Entropy Loss
- **Epochs:** 5
- **Batch Size:** 1 (per device)
- **Gradient Accumulation Steps:** 4
- **Learning Rate:** 1e-05
- **Weight Decay:** 0.0
- **FP16 Training:** Enabled
- **Gradient Checkpointing:** Enabled
- **Evaluation Strategy:** Epoch
- **Save Strategy:** Epoch
- **Early Stopping:** Enabled (patience=3, threshold=0.01)
- **Random Seed:** 42
- **Max Source Length:** 4979
- **Max Target Length:** 752
- **Prefix:** `"summarize: "`
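A hedged sketch of how this configuration could be expressed with `Seq2SeqTrainingArguments` and `Seq2SeqTrainer` follows. The values mirror the list above; `tokenized` refers to the preprocessing sketch, and the output directory is an assumption.
```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-small")

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-texas-legislative-summarization",  # assumed output directory
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    weight_decay=0.0,
    optim="adafactor",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",   # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    seed=42,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.01)],
)
trainer.train()
```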
### Hyperparameter Tuning
A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:
- Learning Rates: `[1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]`
- Weight Decays: `[0.0, 0.01, 0.015, 0.001, 0.005]`
- Gradient Accumulation Steps: `[4]`
The best model was selected based on the lowest perplexity on the evaluation set.
**Best Parameters:**
- Learning Rate: 1e-05
- Weight Decay: 0.0
- Gradient Accumulation Steps: 4
- Perplexity: N/A
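Since the selection criterion was perplexity, which is the exponential of the evaluation cross-entropy loss, the search can be sketched as a simple grid loop over the values listed above. This continues the training sketch (`args`, `tokenized`, `tokenizer`) and is an assumed illustration, not the exact search script.
```python
import math

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer

best = {"perplexity": float("inf")}
for lr in [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]:
    for wd in [0.0, 0.01, 0.015, 0.001, 0.005]:
        args.learning_rate = lr
        args.weight_decay = wd
        # Re-initialize the model so every configuration starts from the same checkpoint.
        model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-small")
        trainer = Seq2SeqTrainer(
            model=model,
            args=args,
            train_dataset=tokenized["train"],
            eval_dataset=tokenized["test"],
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        )
        trainer.train()
        perplexity = math.exp(trainer.evaluate()["eval_loss"])  # perplexity = exp(loss)
        if perplexity < best["perplexity"]:
            best = {"learning_rate": lr, "weight_decay": wd, "perplexity": perplexity}
```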
## Evaluation
The model was evaluated on a held-out test set using the following metrics:
- **ROUGE (Rouge1, Rouge2, RougeL):** Measures the overlap of n-grams between the generated summaries and the reference summaries.
- **BERTScore (Precision, Recall, F1):** Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
- **Compression Ratio:** Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).
**Evaluation Results:**
Evaluation metrics were not calculated during training; results are therefore not available.
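For reference, a hedged sketch of how these metrics could be computed with the `evaluate` library; the compression-ratio helper and the placeholder lists are illustrative assumptions rather than the original evaluation script.
```python
import evaluate
import nltk

nltk.download("punkt")  # sentence tokenizer used for the compression ratio

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def compression_ratio(summary, document):
    # Sentence-based ratio: number of summary sentences over number of document sentences.
    return len(nltk.sent_tokenize(summary)) / max(len(nltk.sent_tokenize(document)), 1)

predictions = ["Generated summary ..."]   # model outputs
references = ["Reference summary ..."]    # ground-truth summaries
documents = ["Original bill text ..."]    # enrolled bill texts

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")
ratios = [compression_ratio(p, d) for p, d in zip(predictions, documents)]
```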
## Usage
Here's how to use the model for inference:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
def summarize(text):
    # Prepend the task prefix used during training.
    input_text = "summarize: " + text
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=4979, truncation=True)
    summary_ids = model.generate(
        input_ids,
        max_length=752,
        num_beams=4,
        early_stopping=True,
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)
```