Model Card for flan-t5-summary-and-QA

Model Details

Model Description

This model is a fine-tuned version of google/flan-t5-small trained on two natural language generation tasks:

  • Summarization of news articles (CNN/DailyMail dataset)
  • Question-answer generation from context passages (SQuAD dataset)

The model accepts task-specific prompts:

  • "summarize: {article}" → generates a short summary

  • "question: {context}" → generates a question and answer pair in the format "Question: ... || Answer: ..."

  • Developed by: Rameen Jamshed (rameenj711)

  • Model type: Text-to-Text Transformer (Seq2Seq)

  • Language(s): English

  • License: Apache 2.0

  • Finetuned from model: google/flan-t5-small

Model Sources

  • Repository: https://huggingface.co/rameenj711/flan-t5-summary-and-QA

Uses

Direct Use

The model can be used directly for:

  • Generating a concise summary of a news article.
  • Generating a question and its answer from a given context paragraph.

Example usage:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rameenj711/flan-t5-summary-and-QA")
model = AutoModelForSeq2SeqLM.from_pretrained("rameenj711/flan-t5-summary-and-QA")

# Summarization
input_text = "summarize: " + "Your article text here..."
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Question generation
input_text = "question: " + "Your context paragraph..."
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
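Because question-generation output is expected in the "Question: ... || Answer: ..." format, downstream code will usually want the two fields separately. The helper below is a minimal sketch of such a parser; the name parse_qa_output is hypothetical and not part of the model or of transformers. It returns None for any field it cannot find, since a small model will not always follow the format exactly.

```python
def parse_qa_output(text: str) -> dict:
    """Split 'Question: ... || Answer: ...' into its two fields.

    Hypothetical helper: fields that cannot be located are returned as None.
    """
    question, answer = None, None
    left, sep, right = text.partition("||")
    left = left.strip()
    if left.lower().startswith("question:"):
        question = left[len("question:"):].strip() or None
    if sep:  # the "||" separator was present
        right = right.strip()
        if right.lower().startswith("answer:"):
            answer = right[len("answer:"):].strip() or None
    return {"question": question, "answer": answer}


example = "Question: Who wrote Hamlet? || Answer: William Shakespeare"
parsed = parse_qa_output(example)
print(parsed)
```

In practice, the string passed to parse_qa_output would be the decoded output of the question-generation call.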

Downstream Use

The model can be further fine-tuned on domain-specific summarization or QA datasets.

Out-of-Scope Use

The model is not intended for:

  • Extractive question answering (it generates answers, does not extract spans).
  • Factual verification or open‑domain QA without context.
  • Languages other than English.

Bias, Risks, and Limitations

The model inherits biases from its training data (CNN/DailyMail and SQuAD), which may reflect Western-centric news coverage and factual question styles. Summaries may omit nuanced information, and generated question-answer pairs may be factually inaccurate or unsupported by the context. Inputs longer than 512 tokens are truncated, so any content beyond that point is ignored and output quality may degrade.

Recommendations

Users should verify generated answers against the original context, especially for critical applications. The model works best with clear task prefixes and contexts of moderate length (<512 tokens).
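One way to stay under the 512-token limit is to split a long input into chunks and process each chunk separately. The sketch below packs whole sentences greedily into chunks; split_into_chunks and count_tokens are hypothetical names, and the whitespace-based counter is only a stand-in for a real tokenizer count such as `lambda s: len(tokenizer(s).input_ids)`. The task prefix also consumes a few tokens, so a slightly lower budget may be sensible.

```python
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own chunk
    and would still be truncated by the tokenizer.
    """
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = current + ". " + sentence if current else sentence
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the current chunk
            current = sentence      # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(s: str) -> int:
    """Rough stand-in for a tokenizer-based count (assumption)."""
    return len(s.split())


demo_text = "One two three. Four five six. Seven eight nine."
demo_chunks = split_into_chunks(demo_text, whitespace_count, max_tokens=6)
print(demo_chunks)
```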

How to Get Started with the Model

from transformers import pipeline

# One pipeline serves both tasks; the prompt prefix selects the task.
generator = pipeline("text2text-generation", model="rameenj711/flan-t5-summary-and-QA")

# Placeholder inputs; replace with your own text.
article_text = "Your article text here..."
context_text = "Your context paragraph..."

summary = generator("summarize: " + article_text, max_length=128)[0]["generated_text"]
qa_output = generator("question: " + context_text, max_length=64)[0]["generated_text"]
print(summary)
print(qa_output)

Training Details

Training Data

  • Summarization: 10,000 examples from the CNN/DailyMail dataset (version 3.0.0). Each example pairs an article with its highlights (summary).
  • Question Generation: 10,000 examples from the SQuAD dataset (plain_text configuration). Each example pairs a context paragraph with a question and its answer.
  • Combined dataset: 20,000 examples, split 90% train / 10% validation.
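The following sketch illustrates how examples from the two sources could be mapped to a shared input/target format and then split 90/10. It is an assumption about the preprocessing, not the author's actual script: the function names, the "input"/"target" field names, the shuffling, and the seed are all hypothetical. Only the prefixes and the target format come from this card.

```python
import random


def format_summarization(article: str, highlights: str) -> dict:
    """Map a CNN/DailyMail-style pair to the shared text-to-text format."""
    return {"input": "summarize: " + article, "target": highlights}


def format_question_generation(context: str, question: str, answer: str) -> dict:
    """Map a SQuAD-style triple to the shared text-to-text format."""
    return {
        "input": "question: " + context,
        "target": f"Question: {question} || Answer: {answer}",
    }


def train_validation_split(examples, validation_fraction=0.1, seed=42):
    """Shuffle and split; the seed and the shuffling are assumptions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Toy data standing in for the 10,000 + 10,000 real examples.
toy = [format_summarization(f"article {i}", f"summary {i}") for i in range(10)]
toy += [format_question_generation(f"context {i}", f"q{i}?", f"a{i}") for i in range(10)]
train, validation = train_validation_split(toy)
print(len(train), len(validation))
```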

Training Procedure

The model was trained using a multi‑task setup: both tasks share the same sequence‑to‑sequence architecture with task‑specific prefixes.

Preprocessing

  • Input texts (articles or contexts) were tokenized with a maximum length of 512 tokens.
  • Target texts (summaries or question+answer pairs) were tokenized with a maximum length of 128 tokens.
  • No padding was applied during tokenization; dynamic padding was applied per batch by the data collator.
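Dynamic padding means each batch is padded only to the length of its longest sequence rather than to the global maximum. The pure-Python sketch below illustrates the idea; it is not the collator used in training (in transformers this role is typically played by DataCollatorForSeq2Seq). The token ids are made up for illustration. The pad id 0 matches the T5 vocabulary, and -100 is the label value that PyTorch's cross-entropy loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # padded label positions are excluded from the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [8, 1]},
    {"input_ids": [9], "labels": [10, 11, 1]},
]
batch = pad_batch(features)
print(batch)
```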

Training Hyperparameters

  • Training regime: fp32 (no mixed precision)
  • Optimizer: AdamW
  • Learning rate: 5e-5
  • Learning rate scheduler: Linear with warmup (10% of steps)
  • Warmup steps: ~450 (10% of total steps)
  • Per‑device batch size: 4 training / 4 evaluation
  • Gradient accumulation steps: 2 (effective batch size: 8)
  • Number of epochs: 20 configured; training was stopped early at ~12 epochs after it plateaued
  • Total training steps: 27,000
  • Max gradient norm: 1.0
  • Label smoothing: 0.0
  • Generation during evaluation: Enabled with predict_with_generate=True
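As a sanity check, the step counts can be derived from the figures listed on this card, assuming a single GPU and 18,000 training examples (90% of 20,000). The arithmetic reproduces the reported 27,000 steps at 12 epochs. It also suggests that a 10% warmup would be about 2,700 steps relative to the steps actually run, or about 4,500 relative to the 20 configured epochs, so the "~450" figure above may be worth confirming against the training logs.

```python
import math

train_examples = int(20_000 * 0.9)               # 18,000 training examples
per_device_batch = 4
grad_accum = 2
effective_batch = per_device_batch * grad_accum  # 8, assuming one GPU

# One optimizer step consumes one effective batch.
steps_per_epoch = math.ceil(train_examples / effective_batch)

steps_at_stop = steps_per_epoch * 12   # training stopped at ~12 epochs
steps_planned = steps_per_epoch * 20   # 20 epochs were configured

warmup_if_10pct_of_run = int(0.10 * steps_at_stop)
warmup_if_10pct_of_plan = int(0.10 * steps_planned)

print(steps_per_epoch, steps_at_stop, steps_planned)
print(warmup_if_10pct_of_run, warmup_if_10pct_of_plan)
```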

Speeds, Sizes, Times

  • Model size: ~1.4 GB (safetensors format)
  • Training duration: ~4.5 hours on a single NVIDIA T4 GPU (Colab)
  • Total training steps: 27,000
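For reference, the size of the weights alone can be estimated from the parameter count. The sketch below assumes roughly 77 million parameters (the figure commonly listed for FLAN-T5-small checkpoints) stored in fp32 at 4 bytes each. That gives about 0.3 GB, so the ~1.4 GB figure above presumably includes more than a single fp32 weight file; this is an inference, not something stated by the author.

```python
def fp32_weights_size_gb(num_params: int) -> float:
    """Approximate size of fp32 weights in gigabytes (4 bytes per parameter)."""
    return num_params * 4 / 1e9


assumed_params = 77_000_000  # assumption: approximate FLAN-T5-small parameter count
size_gb = fp32_weights_size_gb(assumed_params)
print(f"{size_gb:.3f} GB")
```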

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Summarization: 500 examples from the CNN/DailyMail test split.
  • Question Generation: 500 examples from the SQuAD validation split.

Metrics

  • Summarization: ROUGE-1, ROUGE-2, ROUGE-L (F1 score)
  • Question Generation: BERTScore F1 and Exact Match (EM)
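To make the overlap-based metrics concrete, here is a simplified sketch of Exact Match and ROUGE-1 F1. It is not the scoring code used for the results on this card, and the card does not say which library produced them. Standard implementations apply additional normalization (for example, SQuAD-style EM strips punctuation and articles, and ROUGE packages can apply stemming), so numbers from this sketch will not match theirs exactly.

```python
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase and split on whitespace (simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between prediction and reference."""
    pred, ref = Counter(normalize(prediction)), Counter(normalize(reference))
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
em = exact_match("Paris", "paris")
print(round(score, 3), em)
```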

Results

Summarization (CNN/DailyMail test sample)

| Metric | Score |
|---|---|
| ROUGE-1 | 0.274 |
| ROUGE-2 | 0.097 |
| ROUGE-L | 0.204 |

These scores indicate that the model recovers some key content words (ROUGE-1) but has limited bigram overlap with the reference summaries (ROUGE-2). Performance is typical for a small model trained on only 10k summarization examples.

Question Generation (SQuAD validation sample)

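| Metric | Score |
|---|---|
| BERTScore F1 | 0.242 |
| Exact Match | 0.050 |

Note: The scores above come from an evaluation run that loaded the wrong model, so they understate this checkpoint's performance. The author expects a correct evaluation to yield a BERTScore F1 of ~0.65–0.75 and an EM of ~0.25–0.35, but these are estimates rather than measured results. Users should recompute both metrics with the correct model.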

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA T4 GPU (Google Colab)
  • Hours used: 4.5 hours training + 1 hour evaluation
  • Cloud Provider: Google Cloud Platform (Colab backend)
  • Compute Region: us-central1 (assumed)
  • Carbon Emitted: Approximately 0.15 kg CO₂eq (estimate)
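The estimate can be roughly reproduced as power × time × grid carbon intensity. The sketch below assumes the GPU draws its full 70 W rated power for all 5.5 hours. It uses an illustrative grid intensity of 0.4 kg CO₂eq per kWh, which is an assumption chosen for demonstration and not a value taken from the calculator or the cloud provider. Under those assumptions the result lands close to the 0.15 kg reported above.

```python
def estimate_emissions_kg(power_kw, hours, carbon_intensity_kg_per_kwh, pue=1.0):
    """Energy (kWh) times grid carbon intensity; pue scales for datacenter overhead."""
    energy_kwh = power_kw * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


hours = 4.5 + 1.0         # training + evaluation, from the list above
t4_power_kw = 0.070       # NVIDIA T4 rated board power (70 W)
assumed_intensity = 0.4   # kg CO2eq per kWh; illustrative assumption only

estimate = estimate_emissions_kg(t4_power_kw, hours, assumed_intensity)
print(f"{estimate:.3f} kg CO2eq")
```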

Technical Specifications

Model Architecture and Objective

Standard encoder‑decoder Transformer (FLAN‑T5‑small) with 8 encoder layers and 8 decoder layers, a hidden size of 512, 6 attention heads, and approximately 77 million parameters. The objective is cross‑entropy loss over the target tokens.
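The objective can be illustrated with a small pure-Python sketch of token-level cross-entropy, averaged over target positions and skipping padded labels. The ignore value of -100 follows the PyTorch convention; the function name and the toy logits are hypothetical. This is an illustration of the loss, not the training code.

```python
import math


def token_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood of the target tokens.

    logits: one list of vocabulary scores per target position.
    labels: one target token id per position; ignore_index marks padding.
    """
    total, count = 0.0, 0
    for step_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded positions do not contribute to the loss
        m = max(step_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[label]
        count += 1
    return total / count if count else 0.0


toy_logits = [[0.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
toy_labels = [0, 1, -100]  # the last position is padding
loss = token_cross_entropy(toy_logits, toy_labels)
print(loss)
```

With uniform scores over a two-token vocabulary, each counted position contributes log 2, so the mean loss equals log 2.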

Compute Infrastructure

Hardware

  • GPU: NVIDIA T4 (16 GB VRAM)
  • CPU: Intel Xeon (2 vCPUs)
  • RAM: 25 GB

Software

  • Transformers 4.46+
  • PyTorch 2.5+
  • Datasets 3.2+
  • Accelerate 1.2+

Citation

@misc{rameenj711_flan_t5_summary_qa,
  author = {Rameen Jamshed},
  title = {FLAN-T5-small fine-tuned for multi-task summarization and question generation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rameenj711/flan-t5-summary-and-QA}}
}

Model Card Authors

Rameen Jamshed

Model Card Contact

Please use the Hugging Face discussion tab for any questions.
