---
language: en
license: apache-2.0
tags:
- text-summarization
- t5
- t5-small
- legislative
- texas
- long-document-summarization
metrics:
- rouge
- bertscore
- compression_ratio
model-index:
- name: T5-small Texas Legislative Summarization
  model_id: your-huggingface-username/t5-small-texas-legislative-summarization
  results:
  - task:
      type: text-summarization
    dataset:
      name: Texas Legislative Bills
      type: custom
    metrics:
    - name: Rouge1
      value: N/A
      type: rouge1
    - name: Rouge2
      value: N/A
      type: rouge2
    - name: RougeL
      value: N/A
      type: rougeL
    - name: BERTScore F1
      value: N/A
      type: bertscore_f1
    - name: Compression Ratio
      value: N/A
      type: compression_ratio
---

# T5-small Texas Legislative Summarization

This model is a fine-tuned version of [google/t5-small](https://huggingface.co/google/t5-small) for summarizing Texas legislative bills. It was trained on a dataset of Texas legislative bills and their corresponding summaries, with a focus on long-document summarization.

## Model Details

- **Model Name:** T5-small Texas Legislative Summarization
- **Base Model:** [google/t5-small](https://huggingface.co/google/t5-small)
- **Model Type:** Seq2Seq Language Model
- **Architecture:** T5ForConditionalGeneration
- **Language:** English
- **License:** Apache 2.0

### Model Description

Note that this model is intended as a worked example use case rather than a best-case, fully optimized model.

The model takes the enrolled text of a Texas legislative bill as input and generates a concise summary, aiming to capture the key points of the bill in a shorter, more accessible format. Because legislative documents are long and complex, the model is configured for long-document summarization.

### Intended Use

This model can be used for:

- Summarizing Texas legislative bills for easier understanding.
- Providing a quick overview of bill content for researchers, journalists, and the general public.
- Automating the summarization process to save time and resources.
## Training Data

The model was trained on a custom dataset of Texas legislative bills and their summaries. The dataset was created from:

- **Source:** `cleaned_texas_leg_data.json` (this file is not publicly available and would need to be replaced with a public dataset or a description of how to create one)
- **Source Text Column:** `enrolled_text`
- **Target Text Column:** `summary_text`
- **Data Preprocessing:** Data was loaded, split into training and testing sets (80/20 split), and tokenized using the T5 tokenizer.

## Training Procedure

The model was fine-tuned using the following parameters:

- **Model:** `google/t5-small`
- **Training Framework:** Transformers library using `Seq2SeqTrainer`
- **Optimizer:** Adafactor
- **Loss Function:** Cross-Entropy Loss
- **Epochs:** 5
- **Batch Size:** 1 (per device)
- **Gradient Accumulation Steps:** 4
- **Learning Rate:** 1e-05
- **Weight Decay:** 0.0
- **FP16 Training:** Enabled
- **Gradient Checkpointing:** Enabled
- **Evaluation Strategy:** Epoch
- **Save Strategy:** Epoch
- **Early Stopping:** Enabled (patience=3, threshold=0.01)
- **Random Seed:** 42
- **Max Source Length:** 4979
- **Max Target Length:** 752
- **Prefix:** `"summarize: "`

### Hyperparameter Tuning

A hyperparameter search was conducted to find the optimal training configuration. The following hyperparameters were explored:

- Learning Rates: `[1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5]`
- Weight Decays: `[0.0, 0.01, 0.015, 0.001, 0.005]`
- Gradient Accumulation Steps: `[4]`

The best model was selected based on the lowest perplexity on the evaluation set.

**Best Parameters:**

- Learning Rate: 1e-05
- Weight Decay: 0.0
- Gradient Accumulation Steps: 4
- Perplexity: N/A

## Evaluation

The model was evaluated on a held-out test set using the following metrics:

- **ROUGE (Rouge1, Rouge2, RougeL):** Measures the overlap of n-grams between the generated summaries and the reference summaries.
- **BERTScore (Precision, Recall, F1):** Calculates semantic similarity between the generated and reference summaries using BERT embeddings.
- **Compression Ratio:** Measures the ratio of the length of the generated summary to the length of the original document (sentence-based).

**Evaluation Results:** Evaluation metrics were not calculated during training; results are therefore not available.

## Usage

Here's how to use the model for inference:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("your-huggingface-username/t5-small-texas-legislative-summarization")

def summarize(text):
    # Prepend the task prefix the model was trained with.
    input_text = "summarize: " + text
    # Truncate to the max source length used during training.
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=4979, truncation=True)
    summary_ids = model.generate(input_ids, max_length=752, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage:
bill_text = "Your long Texas legislative bill text here..."
summary = summarize(bill_text)
print(summary)
```
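For reference, the sentence-based compression ratio described under Evaluation can be computed with a small helper along these lines. This is an illustrative sketch only: the `compression_ratio` function and its naive regex sentence splitter are assumptions for demonstration, not the exact code used in the original evaluation pipeline.

```python
import re

def compression_ratio(summary: str, document: str) -> float:
    """Sentence-based compression ratio: summary sentence count / document sentence count.

    Uses a naive split on ., !, or ? followed by whitespace; a real pipeline
    might use a proper sentence tokenizer (e.g. NLTK) instead.
    """
    def count_sentences(text: str) -> int:
        return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

    doc_sentences = count_sentences(document)
    if doc_sentences == 0:
        return 0.0
    return count_sentences(summary) / doc_sentences

# Hypothetical example: 1 summary sentence over 4 document sentences.
doc = "Section 1. The act applies statewide. Section 2. It takes effect September 1."
summ = "The act applies statewide and takes effect September 1."
print(compression_ratio(summ, doc))  # 0.25
```

Lower values indicate more aggressive compression; a ratio near 1.0 means the summary is barely shorter than the source.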