
title: Czech GEC Punctuation Pipeline
emoji: 🔄
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0

🇨🇿 Czech Grammar & Punctuation Correction Pipeline

A comprehensive pipeline that combines grammatical error correction with punctuation restoration for Czech text.

🎯 Features

Pipeline Mode

Two-Stage Processing:
1. Grammar correction using ByT5 Czech GEC model
2. Punctuation and capitalization using XLM-RoBERTa
9 Output Variants: Each of the two stages produces 3 variants (Conservative, Balanced, Exploratory), giving 3 × 3 = 9 combinations
Visual Grid Layout: Easy comparison of all combinations

Benchmark Mode

Ground Truth Comparison: Test how well the pipeline recovers original text
Similarity Scoring: Percentage match for each variant
Comma Removal Tool: Quick test case generation
Best Match Identification: Automatically highlights the closest match
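The similarity scoring and best-match selection can be sketched as follows. This is a minimal illustration, assuming a character-level ratio from Python's standard `difflib`; the metric the app actually uses is not stated in this README, and the names `similarity` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def similarity(candidate: str, reference: str) -> float:
    """Return a percentage match between a pipeline output and the ground truth.

    Assumption: character-level SequenceMatcher ratio, scaled to 0-100.
    """
    return 100.0 * SequenceMatcher(None, candidate, reference).ratio()


def best_match(variants: Dict[str, str], reference: str) -> Tuple[str, Dict[str, float]]:
    """Score every variant and return the name of the closest one plus all scores."""
    scores = {name: similarity(text, reference) for name, text in variants.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

A variant identical to the ground truth scores 100; any deviation, such as a missing comma, lowers the score.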

🔧 How It Works

Input → Czech text with errors and missing punctuation
Stage 1: GEC → 3 grammatical corrections
Stage 2: Punctuation → 3 punctuation variants for each GEC output
Output → 9 total combinations to choose from
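The flow above amounts to a nested loop over the two stages. The sketch below shows only that structure: `run_pipeline` and the stand-in stage functions are hypothetical, and in the real app each stage would call the corresponding model instead of tagging the string.

```python
from typing import Callable, Dict, Tuple

STRATEGIES = ("conservative", "balanced", "exploratory")

# A stage takes (text, strategy name) and returns the processed text.
Stage = Callable[[str, str], str]


def run_pipeline(text: str, gec: Stage, punctuate: Stage) -> Dict[Tuple[str, str], str]:
    """Run stage 1 once per strategy, then stage 2 on each result (3 x 3 = 9 outputs)."""
    grid: Dict[Tuple[str, str], str] = {}
    for gec_strategy in STRATEGIES:
        corrected = gec(text, gec_strategy)
        for punct_strategy in STRATEGIES:
            grid[(gec_strategy, punct_strategy)] = punctuate(corrected, punct_strategy)
    return grid


# Stand-in stages that only tag the text, so the sketch runs without any model.
def fake_gec(text: str, strategy: str) -> str:
    return f"{text} [gec:{strategy}]"


def fake_punctuate(text: str, strategy: str) -> str:
    return f"{text} [punct:{strategy}]"


grid = run_pipeline("včera jsem šel do obchodu", fake_gec, fake_punctuate)
```

Keying the result by `(gec_strategy, punct_strategy)` maps directly onto a 3 × 3 comparison grid.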

📊 Benchmark Workflow

1. Paste your original, correct Czech text
2. Click "Remove Commas" to create a test case
3. Manually introduce grammatical errors
4. Run benchmark to see recovery accuracy
5. Review similarity scores and best match
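The "Remove Commas" step can be approximated in a few lines. This is a sketch of one plausible behaviour, not the app's actual code; `remove_commas` is a hypothetical name.

```python
import re


def remove_commas(text: str) -> str:
    """Strip all commas and collapse any double spaces left behind."""
    without_commas = text.replace(",", "")
    return re.sub(r" {2,}", " ", without_commas)
```

For example, `remove_commas("Koupil jsem rohlíky, máslo a mléko.")` yields a test case in which the punctuation stage has to restore the comma.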

🚀 Models Used

Grammar Correction: ufal/byt5-large-geccc-mate
- ByT5-large model fine-tuned on a Czech GEC corpus
- Handles complex grammatical errors
Punctuation: kredor/punctuate-all
- Token classification model for punctuation restoration
- Supports Czech and 11 other languages
- Adds punctuation marks: . , ? - :
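To illustrate how token-classification output turns back into text, the sketch below attaches one predicted label to each word. It assumes the label `"0"` means "no punctuation" and that capitalization is a simple post-processing rule (uppercase the first letter after `.` or `?`); both are assumptions for illustration, and `apply_labels` is a hypothetical helper, not part of the model's API.

```python
from typing import List

# Labels after which the next word starts a new sentence.
SENTENCE_END = {".", "?"}


def apply_labels(words: List[str], labels: List[str]) -> str:
    """Append each word's predicted mark and capitalize sentence starts.

    Assumption: labels[i] is one of "0", ".", ",", "?", "-", ":" and
    applies to the end of words[i].
    """
    out = []
    capitalize_next = True
    for word, label in zip(words, labels):
        if capitalize_next:
            word = word[:1].upper() + word[1:]
        capitalize_next = label in SENTENCE_END
        if label != "0":
            word += label
        out.append(word)
    return " ".join(out)


words = "včera jsem šel do obchodu bylo to levné".split()
labels = ["0", "0", "0", "0", ".", "0", "0", "."]
restored = apply_labels(words, labels)
```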

💡 Use Cases

Text Correction: Fix both grammar and punctuation in one pipeline
Quality Assessment: Benchmark correction models against known-good text
Model Comparison: Compare different correction strategies
Educational Tool: Understand how different parameters affect output

🎨 Generation Strategies

Conservative

Minimal changes, high confidence
Narrower beam search, no sampling
Best for minor corrections

Balanced

Moderate corrections
Mixed parameters
Good general-purpose option

Exploratory

More creative corrections
Higher diversity
Best for heavily corrupted text
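One way to express the three strategies is as keyword-argument presets for the Transformers `generate` method. `num_beams`, `do_sample`, `temperature` and `top_p` are real `generate` parameters, but the specific values below are illustrative assumptions, not the settings this Space uses; `GENERATION_PRESETS` and `correct` are hypothetical names.

```python
from typing import Any, Dict

# Hypothetical values: the README does not state the real settings.
GENERATION_PRESETS: Dict[str, Dict[str, Any]] = {
    # Narrow beam search, deterministic output.
    "conservative": {"num_beams": 2, "do_sample": False},
    # Wider beam search combined with mild sampling.
    "balanced": {"num_beams": 4, "do_sample": True, "temperature": 0.7},
    # Pure sampling for more diverse rewrites.
    "exploratory": {"num_beams": 1, "do_sample": True, "temperature": 1.0, "top_p": 0.95},
}


def correct(text: str, strategy: str = "balanced",
            model_name: str = "ufal/byt5-large-geccc-mate") -> str:
    """Run the GEC model once with the chosen preset.

    transformers is imported lazily so the presets can be inspected without it.
    Reloading the model on every call is wasteful; a real app would cache it.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=512,
                                **GENERATION_PRESETS[strategy])
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Keeping the presets in a plain dictionary makes it easy to add or tune a strategy without touching the generation code.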

📈 Performance

Processing time: 5-15 seconds for the full pipeline
Runs fastest on a GPU, but also works on CPU
Memory requirement: ~16 GB for both models combined

🔍 Example Input

včera jsem šel do obchodu a koupil jsem si rohlíky máslo a mléko bylo to levné

The pipeline corrects the grammar of this input and restores its punctuation, producing 9 variants.