PCS (Punctuation + Capitalization + Segmentation): GGUF
GGUF conversion of 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase for use with CrispASR and CrisperWeaver.
Model
PCS performs three text post-processing tasks in a single pass:
- Punctuation restoration: adds commas, periods, question marks, exclamation marks, colons, semicolons, dashes, and 10 other punctuation types
- Truecasing: restores proper capitalization (per-character upper/lower classification)
- Sentence boundary detection: identifies sentence breaks
PCS is particularly useful for ASR backends that output unpunctuated lowercase text (wav2vec2, fastconformer-ctc, firered-asr, parakeet-ctc, omniasr-ctc).
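To make the three tasks concrete, here is a minimal sketch of how per-word predictions could be merged back into text. It is not the CrispASR implementation: the `WordPrediction` type and `apply_predictions` function are hypothetical names introduced for illustration, and the sketch works on whole words, whereas the model itself predicts on subword tokens.

```python
from dataclasses import dataclass


@dataclass
class WordPrediction:
    """Hypothetical container for the predictions attached to one word."""
    word: str           # lowercase input word
    pre_punc: str       # punctuation placed before the word ("" if none)
    post_punc: str      # punctuation placed after the word ("" if none)
    upper_mask: list    # one bool per character: True means uppercase
    sentence_end: bool  # True if a sentence boundary follows this word


def apply_predictions(preds):
    """Merge punctuation, casing and boundary predictions into sentences."""
    sentences, current = [], []
    for p in preds:
        # Truecasing: flip each character according to its mask entry.
        cased = "".join(
            ch.upper() if i < len(p.upper_mask) and p.upper_mask[i] else ch
            for i, ch in enumerate(p.word)
        )
        # Punctuation: attach pre-word and post-word marks.
        current.append(p.pre_punc + cased + p.post_punc)
        # Segmentation: close the sentence at a predicted boundary.
        if p.sentence_end:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    no_caps = lambda w: [False] * len(w)
    preds = [
        WordPrediction("hello", "", ",", [True] + [False] * 4, False),
        WordPrediction("how", "", "", no_caps("how"), False),
        WordPrediction("are", "", "", no_caps("are"), False),
        WordPrediction("you", "", "?", no_caps("you"), True),
        WordPrediction("i", "", "", [True], False),
        WordPrediction("am", "", "", no_caps("am"), False),
        WordPrediction("fine", "", ".", no_caps("fine"), True),
    ]
    print(apply_predictions(preds))
    # ['Hello, how are you?', 'I am fine.']
```

Because all three signals come from one forward pass over the same encoder output, the merge step is a simple loop over words.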
Architecture
- Encoder: XLM-RoBERTa-base (12 layers, d=768, 12 heads, SentencePiece tokenizer)
- 4 classification heads:
  - post_punc: Linear(768 -> 256 -> 17), post-word punctuation (., ,, ?, !, :, ;, -, etc.)
  - pre_punc: Linear(768 -> 256 -> 2), pre-word punctuation (¿, ¡)
  - sbd: Linear(772 -> 128 -> 2), sentence boundary detection
  - truecase: Linear(769 -> 128 -> 16), per-character upper/lower case
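As a rough size check, the head dimensions listed above can be turned into parameter counts. This sketch assumes each `Linear(a -> b -> c)` entry denotes two stacked dense layers with bias terms; that reading is an assumption about the notation, not something confirmed by the upstream model.

```python
# (in_features, hidden, out_features) as listed for each head.
HEADS = {
    "post_punc": (768, 256, 17),
    "pre_punc": (768, 256, 2),
    "sbd": (772, 128, 2),
    "truecase": (769, 128, 16),
}


def mlp_params(n_in, n_hidden, n_out):
    """Weights plus biases of two stacked dense layers (assumed layout)."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)


head_params = {name: mlp_params(*dims) for name, dims in HEADS.items()}
total = sum(head_params.values())

for name, count in head_params.items():
    print(f"{name:10s} {count:>8,d}")
print(f"{'total':10s} {total:>8,d}")
```

Under these assumptions the four heads add up to roughly 0.6 million parameters, so almost all of the file size comes from the XLM-RoBERTa encoder rather than the classification heads.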
Languages
Supports 47 languages via XLM-RoBERTa's multilingual encoder. Quality is best on the 12 languages the classification heads were trained on (EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, CS, DA), but the model generalizes to all 47 XLM-R languages.
Files
| File | Size | Description |
|---|---|---|
| pcs-xlmr-base.gguf | 903 MB | Full-precision (F16), reference quality |
| pcs-xlmr-base-q4_k.gguf | 155 MB | Q4_K quantised, ~6x smaller |
Usage
CrispASR CLI
# Apply PCS to unpunctuated text
crispasr --punc-model pcs-xlmr-base-q4_k.gguf \
-f audio.wav \
--backend parakeet
# Standalone text processing
echo "hello how are you doing today i am fine" | \
crispasr-pcs pcs-xlmr-base-q4_k.gguf
# Output: "Hello, how are you doing today? I am fine."
CrisperWeaver (Flutter GUI)
Download from Model Management (Post-processors section). Enable "Restore punctuation" in Advanced Options; PCS then runs automatically as a post-processing step after transcription.
C API
#include "crispasr.h"
void* pcs = crispasr_pcs_init("pcs-xlmr-base-q4_k.gguf");
const char* result = crispasr_pcs_process(pcs, "hello how are you");
// result: "Hello, how are you?"
crispasr_pcs_free_text(result);
crispasr_pcs_free(pcs);
Dart FFI
final pcs = crispasr.PcsModel.open('pcs-xlmr-base-q4_k.gguf');
final text = pcs.process('hello how are you doing today');
print(text); // "Hello, how are you doing today?"
pcs.close();
Comparison with other post-processors
| Model | Languages | Punct | Truecase | SBD | Size (Q4_K) |
|---|---|---|---|---|---|
| PCS | 47 | 17 types | per-char | yes | 155 MB |
| FireRedPunc | ZH + EN | yes | yes | no | ~100 MB |
| Fullstop-punc | EN/DE/FR/IT | yes | yes | no | ~300 MB |
| Truecaser LSTM | DE/EN/ES/RU | no | yes | no | ~3 MB |
PCS is the most comprehensive option: it handles all three tasks in one pass across the widest language set.
License
MIT (same as upstream 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase).
Links
- Upstream model: 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase
- Engine: CrispASR
- App: CrisperWeaver