metadata
title: Czech GEC Punctuation Pipeline
emoji: π
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
Czech Grammar & Punctuation Correction Pipeline
A comprehensive pipeline that combines grammatical error correction with punctuation restoration for Czech text.
Features
Pipeline Mode
- Two-Stage Processing:
  - Grammar correction using the ByT5 Czech GEC model
  - Punctuation and capitalization using XLM-RoBERTa
- 9 Output Variants: Each stage produces 3 variants (Conservative, Balanced, Exploratory), giving 3 × 3 = 9 combinations
- Visual Grid Layout: Easy comparison of all combinations
Benchmark Mode
- Ground Truth Comparison: Test how well the pipeline recovers original text
- Similarity Scoring: Percentage match for each variant
- Comma Removal Tool: Quick test case generation
- Best Match Identification: Automatically highlights the closest match
How It Works
- Input → Czech text with errors and missing punctuation
- Stage 1: GEC → 3 grammatical corrections
- Stage 2: Punctuation → 3 punctuation variants for each GEC output
- Output → 9 total combinations to choose from
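The two-stage fan-out described above can be sketched as follows. This is a minimal illustration, not the Space's actual `app.py`: `run_pipeline`, `fake_gec`, and `fake_punct` are hypothetical names, and the two stub functions stand in for the real model calls so the example runs without downloading anything.

```python
from itertools import product

STRATEGIES = ("Conservative", "Balanced", "Exploratory")


def run_pipeline(text, correct_grammar, restore_punctuation):
    """Return a dict keyed by (gec_strategy, punct_strategy) holding 9 outputs."""
    # Stage 1: one grammar correction per strategy (3 model calls).
    gec_outputs = {s: correct_grammar(text, s) for s in STRATEGIES}
    # Stage 2: punctuate every GEC output with every strategy (3 x 3 = 9 calls).
    grid = {}
    for gec_s, punct_s in product(STRATEGIES, STRATEGIES):
        grid[(gec_s, punct_s)] = restore_punctuation(gec_outputs[gec_s], punct_s)
    return grid


# Hypothetical stand-ins for the real models: they only tag the text.
def fake_gec(text, strategy):
    return f"{text} [gec:{strategy}]"


def fake_punct(text, strategy):
    return f"{text} [punct:{strategy}]"


grid = run_pipeline("vcera jsem sel do obchodu", fake_gec, fake_punct)
for (gec_s, punct_s), output in grid.items():
    print(f"{gec_s:>12} / {punct_s:<12} {output}")
```

Caching the three Stage 1 outputs before the inner loop keeps the expensive grammar model at 3 calls rather than 9.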
Benchmark Workflow
- Paste your original, correct Czech text
- Click "Remove Commas" to create a test case
- Manually introduce grammatical errors
- Run benchmark to see recovery accuracy
- Review similarity scores and best match
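The benchmark steps above can be approximated with the standard library. The README does not say which similarity metric the app uses, so this sketch assumes a character-level `difflib.SequenceMatcher` ratio; `remove_commas`, `similarity`, and `best_match` are hypothetical helper names.

```python
import difflib


def remove_commas(text):
    """Create a test case by stripping every comma from the reference text."""
    return text.replace(",", "")


def similarity(candidate, reference):
    """Percentage match between a pipeline output and the ground truth."""
    return 100.0 * difflib.SequenceMatcher(None, candidate, reference).ratio()


def best_match(variants, reference):
    """Score every variant and return (name_of_best, all_scores)."""
    scores = {name: similarity(text, reference) for name, text in variants.items()}
    best = max(scores, key=scores.get)
    return best, scores


reference = "Včera jsem šel do obchodu, koupil jsem si rohlíky, máslo a mléko."
test_case = remove_commas(reference)

# Pretend one variant recovered the text fully and another changed nothing.
variants = {"recovered": reference, "unchanged": test_case}
best, scores = best_match(variants, reference)
print(best, {name: round(score, 1) for name, score in scores.items()})
```

A character-level ratio rewards near misses; a stricter benchmark could compare tokens or count only punctuation positions instead.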
Models Used
Grammar Correction: ufal/byt5-large-geccc-mate
- ByT5-large model fine-tuned on a Czech GEC corpus
- Handles complex grammatical errors
Punctuation: kredor/punctuate-all
- Token classification model for punctuation restoration
- Supports Czech and 11 other languages
- Adds punctuation marks: . , ? - :
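One plausible way to load the two checkpoints is through the Hugging Face `transformers` library, sketched below. How the Space itself loads them is not stated here, so treat `load_gec` and `load_punctuator` as hypothetical helpers. The imports sit inside the functions so that defining them costs nothing; calling them downloads the model weights.

```python
GEC_MODEL_ID = "ufal/byt5-large-geccc-mate"
PUNCT_MODEL_ID = "kredor/punctuate-all"


def load_gec():
    """Load the seq2seq grammar-correction model and its tokenizer."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(GEC_MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(GEC_MODEL_ID)
    return tokenizer, model


def load_punctuator():
    """Load the token-classification model that predicts a mark after each token."""
    from transformers import pipeline

    return pipeline("token-classification", model=PUNCT_MODEL_ID)
```

The punctuation pipeline returns per-token label predictions; turning them back into punctuated text is a separate step that this sketch leaves out.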
Use Cases
- Text Correction: Fix both grammar and punctuation in one pipeline
- Quality Assessment: Benchmark correction models against known-good text
- Model Comparison: Compare different correction strategies
- Educational Tool: Understand how different parameters affect output
Generation Strategies
Conservative
- Minimal changes, high confidence
- Fewer beams, no sampling
- Best for minor corrections
Balanced
- Moderate corrections
- Mixed parameters
- Good general-purpose option
Exploratory
- More creative corrections
- Higher diversity
- Best for heavily corrupted text
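The three strategies could map onto decoding parameters roughly as shown below. The parameter names (`num_beams`, `do_sample`, `temperature`, `top_p`, `top_k`, `max_new_tokens`) are standard `generate()` arguments in `transformers`, but every value here is an illustrative assumption rather than the Space's actual configuration, and `generation_kwargs` is a hypothetical helper.

```python
# Assumed presets: the values are placeholders chosen to match the descriptions above.
GENERATION_PRESETS = {
    # Deterministic beam search with a small beam: minimal, high-confidence edits.
    "Conservative": {"num_beams": 2, "do_sample": False},
    # Beam search combined with mild sampling.
    "Balanced": {"num_beams": 4, "do_sample": True, "temperature": 0.7, "top_p": 0.9},
    # Pure sampling with a higher temperature for more diverse output.
    "Exploratory": {"num_beams": 1, "do_sample": True, "temperature": 1.0,
                    "top_p": 0.95, "top_k": 50},
}


def generation_kwargs(strategy, max_new_tokens=256):
    """Return a fresh kwargs dict for model.generate(**inputs, **kwargs)."""
    if strategy not in GENERATION_PRESETS:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    kwargs = dict(GENERATION_PRESETS[strategy])  # copy so the preset stays untouched
    kwargs["max_new_tokens"] = max_new_tokens
    return kwargs


print(generation_kwargs("Conservative"))
```

Keeping the presets in one table makes it easy to see how a single parameter change affects the output grid.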
Performance
- Processing time: 5-15 seconds for the full pipeline
- Runs best on a GPU, but also works on CPU
- Memory requirement: ~16 GB for both models combined
Example Input
včera jsem šel do obchodu a koupil jsem si rohlíky máslo a mléko bylo to levné
The pipeline corrects the grammar of this sentence and restores its punctuation, producing 9 different variants.