metadata
title: Czech GEC Punctuation Pipeline
emoji: 🔄
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0

🇨🇿 Czech Grammar & Punctuation Correction Pipeline

A two-stage pipeline that combines grammatical error correction (GEC) with punctuation restoration for Czech text.

🎯 Features

Pipeline Mode

  • Two-Stage Processing:
    1. Grammar correction using a ByT5 Czech GEC model
    2. Punctuation and capitalization restoration using an XLM-RoBERTa model
  • 9 Output Variants: Each stage produces 3 variants (Conservative, Balanced, Exploratory), giving 3 × 3 = 9 combinations
  • Visual Grid Layout: Easy comparison of all combinations

Benchmark Mode

  • Ground Truth Comparison: Test how well the pipeline recovers original text
  • Similarity Scoring: Percentage match for each variant
  • Comma Removal Tool: Quick test case generation
  • Best Match Identification: Automatically highlights the closest match

🔧 How It Works

  1. Input → Czech text with errors and missing punctuation
  2. Stage 1: GEC → 3 grammatical corrections
  3. Stage 2: Punctuation → 3 punctuation variants for each GEC output
  4. Output → 9 total combinations to choose from
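
The steps above can be sketched as a small 3 × 3 loop. This is a minimal illustration, not the app's actual code: `run_pipeline`, `correct`, and `punctuate` are hypothetical names, and the two callables stand in for the GEC and punctuation models.

```python
from typing import Callable, Dict, Tuple

STRATEGIES = ("conservative", "balanced", "exploratory")

# Hypothetical signature: a stage takes (text, strategy) and returns text.
Stage = Callable[[str, str], str]


def run_pipeline(text: str, correct: Stage, punctuate: Stage) -> Dict[Tuple[str, str], str]:
    """Return a grid keyed by (gec_strategy, punctuation_strategy)."""
    grid: Dict[Tuple[str, str], str] = {}
    for gec_strategy in STRATEGIES:
        # Stage 1 runs once per strategy: 3 grammatical corrections.
        corrected = correct(text, gec_strategy)
        for punct_strategy in STRATEGIES:
            # Stage 2 runs on every stage 1 output: 3 x 3 = 9 results.
            grid[(gec_strategy, punct_strategy)] = punctuate(corrected, punct_strategy)
    return grid
```

Keying the grid by the pair of strategy names keeps every combination addressable, which is convenient when laying the results out in a grid for comparison.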

📊 Benchmark Workflow

  1. Paste your original, correct Czech text
  2. Click "Remove Commas" to create a test case
  3. Manually introduce grammatical errors
  4. Run the benchmark to see how accurately the pipeline recovers the original text
  5. Review similarity scores and best match
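
As a rough sketch of this workflow, the helpers below strip commas to create a test case and score each variant against the original. The README does not say which similarity metric the app uses; `difflib.SequenceMatcher` is an assumption here, and `remove_commas`, `similarity`, and `best_match` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def remove_commas(text: str) -> str:
    """Create a test case by deleting every comma from the reference text."""
    return text.replace(",", "")


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity as a percentage (assumed metric)."""
    return 100.0 * SequenceMatcher(None, candidate, reference).ratio()


def best_match(variants: Dict[str, str], reference: str) -> Tuple[str, Dict[str, float]]:
    """Score every variant and return the name of the closest one plus all scores."""
    scores = {name: similarity(text, reference) for name, text in variants.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A character-level ratio rewards near misses, so a variant with one missing comma still scores close to 100%. A stricter benchmark could compare token by token instead.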

🚀 Models Used

  • Grammar Correction: ufal/byt5-large-geccc-mate

    • ByT5-large model fine-tuned on a Czech GEC corpus
    • Handles complex grammatical errors
  • Punctuation: kredor/punctuate-all

    • Token classification model for punctuation restoration
    • Supports Czech and 11 other languages
    • Adds punctuation marks: . , ? - :
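
To illustrate how token classification turns into punctuated text, the sketch below attaches one predicted label to each word. It assumes the classifier emits one label per word and that the label `"0"` means "no punctuation". Both points, and the helper name `apply_labels`, are assumptions rather than details confirmed by this README.

```python
from typing import List

# The five marks listed above; any other label (e.g. "0") adds nothing.
PUNCT_LABELS = {".", ",", "?", "-", ":"}


def apply_labels(words: List[str], labels: List[str]) -> str:
    """Append each predicted punctuation mark to its word and rejoin the text."""
    if len(words) != len(labels):
        raise ValueError("expected exactly one label per word")
    pieces = []
    for word, label in zip(words, labels):
        # Only known punctuation labels are attached; everything else is ignored.
        pieces.append(word + label if label in PUNCT_LABELS else word)
    return " ".join(pieces)
```

For example, the words `["ahoj", "jak", "se", "máš"]` with labels `[",", "0", "0", "?"]` are joined into `ahoj, jak se máš?`.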

πŸ’‘ Use Cases

  • Text Correction: Fix both grammar and punctuation in one pipeline
  • Quality Assessment: Benchmark correction models against known good text
  • Model Comparison: Compare different correction strategies
  • Educational Tool: Understand how different parameters affect output

🎨 Generation Strategies

Conservative

  • Minimal changes, high confidence
  • Fewer beams in beam search, no sampling
  • Best for minor corrections

Balanced

  • Moderate corrections
  • Decoding parameters between the Conservative and Exploratory settings
  • Good general-purpose option

Exploratory

  • More creative corrections
  • Higher diversity
  • Best for heavily corrupted text
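
One way to express the three strategies is as presets of decoding parameters. The parameter names below (`num_beams`, `do_sample`, `temperature`, `top_p`, `max_new_tokens`) are standard Hugging Face `generate` arguments, but the values are purely illustrative assumptions; the README does not state the app's actual settings.

```python
from typing import Any, Dict

# Hypothetical presets: values are illustrative, not taken from the app.
GENERATION_PRESETS: Dict[str, Dict[str, Any]] = {
    # Small beam, deterministic decoding: minimal, high-confidence edits.
    "conservative": {"num_beams": 2, "do_sample": False},
    # Wider beam with mild sampling: a middle ground.
    "balanced": {"num_beams": 4, "do_sample": True, "temperature": 0.7, "top_p": 0.9},
    # Pure sampling with a wider nucleus: more diverse corrections.
    "exploratory": {"num_beams": 1, "do_sample": True, "temperature": 1.0, "top_p": 0.95},
}


def generation_kwargs(strategy: str, max_new_tokens: int = 256) -> Dict[str, Any]:
    """Return a fresh dict of decoding arguments for the named strategy."""
    if strategy not in GENERATION_PRESETS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {**GENERATION_PRESETS[strategy], "max_new_tokens": max_new_tokens}
```

The returned dict can be unpacked into a generation call, for instance `model.generate(**inputs, **generation_kwargs("balanced"))`.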

📈 Performance

  • Processing time: 5-15 seconds for the full pipeline
  • Runs fastest on a GPU; also works on CPU, more slowly
  • Memory requirement: ~16GB with both models loaded

πŸ” Example Input

včera jsem šel do obchodu a koupil jsem si rohlíky máslo a mléko bylo to levné

The pipeline corrects the grammar of this input and restores its punctuation, producing 9 different variants.