PCS (Punctuation + Capitalization + Segmentation) — GGUF

GGUF conversion of 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase for use with CrispASR and CrisperWeaver.

Model

PCS performs three text post-processing tasks in a single pass:

- Punctuation restoration: adds commas, periods, question marks, exclamation marks, colons, semicolons, dashes, and 10 other punctuation types
- Truecasing: restores proper capitalization (per-character upper/lower classification)
- Sentence boundary detection: identifies sentence breaks
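
To make the three outputs concrete, here is a minimal Python sketch of how per-word predictions could be merged back into text. It is a simplification under stated assumptions: the function name `apply_pcs_predictions` and the input layout are hypothetical, and the model itself works on SentencePiece tokens rather than whole words.

```python
def apply_pcs_predictions(words, post_punc, upper_masks, sentence_end):
    """Rebuild punctuated, cased sentences from per-word predictions.

    Hypothetical input layout (not the real tensor format):
      words        - lowercase words from the ASR output
      post_punc    - punctuation string to append after each word ("" for none)
      upper_masks  - one list of booleans per word, True = uppercase that character
      sentence_end - True where a sentence boundary follows the word
    """
    sentences, current = [], []
    for word, punc, mask, is_end in zip(words, post_punc, upper_masks, sentence_end):
        # Truecasing: apply the per-character upper/lower decision.
        cased = "".join(ch.upper() if up else ch for ch, up in zip(word, mask))
        # Punctuation restoration: append the predicted post-word mark.
        current.append(cased + punc)
        # Sentence boundary detection: close the sentence here.
        if is_end:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


words = "hello how are you doing today i am fine".split()
post_punc = [",", "", "", "", "", "?", "", "", "."]
# Only the first letter of "hello" and the word "i" are marked uppercase.
upper_masks = [[i == 0 and w in ("hello", "i") for i in range(len(w))] for w in words]
sentence_end = [False] * 5 + [True] + [False] * 2 + [True]

result = apply_pcs_predictions(words, post_punc, upper_masks, sentence_end)
```

With these hand-written predictions, `result` holds two sentences, one per detected boundary.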

PCS is particularly useful for ASR backends that output unpunctuated lowercase text, such as wav2vec2, fastconformer-ctc, firered-asr, parakeet-ctc, and omniasr-ctc.

Architecture

Encoder: XLM-RoBERTa-base (12 layers, d=768, 12 heads, SentencePiece tokenizer)
4 classification heads:
- post_punc: Linear(768 -> 256 -> 17) — post-word punctuation (., ,, ?, !, :, ;, -, etc.)
- pre_punc: Linear(768 -> 256 -> 2) — pre-word punctuation (¿, ¡)
- sbd: Linear(772 -> 128 -> 2) — sentence boundary detection
- truecase: Linear(769 -> 128 -> 16) — per-character upper/lower case
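
The head dimensions above imply a small parameter budget. The sketch below tallies it, assuming each head is exactly two fully connected layers with biases and nothing else. That is an assumption: the source gives only the dimensions. The names `HEADS` and `mlp_params` are illustrative.

```python
# (input dim, hidden dim, output dim) for each head, as listed above.
HEADS = {
    "post_punc": (768, 256, 17),
    "pre_punc": (768, 256, 2),
    "sbd": (772, 128, 2),
    "truecase": (769, 128, 16),
}


def mlp_params(d_in, d_hidden, d_out):
    """Weights plus biases for two stacked linear layers (assumed layout)."""
    return (d_in * d_hidden + d_hidden) + (d_hidden * d_out + d_out)


head_params = {name: mlp_params(*dims) for name, dims in HEADS.items()}
total_head_params = sum(head_params.values())

for name, count in head_params.items():
    print(f"{name:10s} {count:>8,d}")
print(f"{'total':10s} {total_head_params:>8,d}")
```

Under that assumption the four heads together stay below one million parameters, so nearly all of the file size comes from the encoder.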

Languages

Supports 47 languages via XLM-RoBERTa's multilingual encoder. Quality is best on the 12 languages the classification heads were trained on (EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, CS, DA), but the model generalizes to all 47 supported languages.

Files

File	Size	Description
`pcs-xlmr-base.gguf`	903 MB	F16 (unquantised), reference quality
`pcs-xlmr-base-q4_k.gguf`	155 MB	Q4_K quantised — ~6x smaller
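
As a quick check on the "~6x smaller" figure, the ratio can be computed from the two sizes in the table. This is plain arithmetic on the listed values, and the variable names are illustrative.

```python
F16_MB = 903   # pcs-xlmr-base.gguf
Q4_K_MB = 155  # pcs-xlmr-base-q4_k.gguf

ratio = F16_MB / Q4_K_MB                  # how many times smaller the Q4_K file is
saved_pct = 100 * (1 - Q4_K_MB / F16_MB)  # share of disk space saved

print(f"{ratio:.2f}x smaller, {saved_pct:.1f}% saved")
```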

Usage

CrispASR CLI

# Transcribe audio and apply PCS to the transcript
crispasr --punc-model pcs-xlmr-base-q4_k.gguf \
  -f audio.wav \
  --backend parakeet

# Standalone text processing
echo "hello how are you doing today i am fine" | \
  crispasr-pcs pcs-xlmr-base-q4_k.gguf
# Output: "Hello, how are you doing today? I am fine."
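
The standalone command can also be driven from a script. The following Python sketch pipes text to `crispasr-pcs` on stdin, mirroring the shell example above. It assumes the binary is on `PATH` and writes the processed text to stdout. The helper names `build_pcs_command` and `punctuate` are hypothetical.

```python
import subprocess


def build_pcs_command(model_path, binary="crispasr-pcs"):
    """Return the argument list used in the shell example: <binary> <model>."""
    return [binary, model_path]


def punctuate(text, model_path, binary="crispasr-pcs"):
    """Send text on stdin and return whatever the tool prints on stdout."""
    completed = subprocess.run(
        build_pcs_command(model_path, binary),
        input=text + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


# Usage (requires the crispasr-pcs binary and the model file):
# print(punctuate("hello how are you doing today i am fine",
#                 "pcs-xlmr-base-q4_k.gguf"))
```

Passing the arguments as a list avoids shell quoting issues when the input text contains quotes or other special characters.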

CrisperWeaver (Flutter GUI)

Download the model from Model Management (Post-processors section), then enable "Restore punctuation" in Advanced Options. PCS then runs automatically as a post-processing step after transcription.

C API

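#include <stdio.h>
#include "crispasr.h"

int main(void) {
    void* pcs = crispasr_pcs_init("pcs-xlmr-base-q4_k.gguf");
    if (!pcs) return 1; // assumes NULL signals a load failure
    const char* result = crispasr_pcs_process(pcs, "hello how are you");
    if (result) {
        printf("%s\n", result); // e.g. "Hello, how are you?"
        crispasr_pcs_free_text(result); // release the returned string
    }
    crispasr_pcs_free(pcs);
    return 0;
}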
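
The same four C functions can be reached from Python through `ctypes`. The sketch below is an assumption-laden illustration, not an official binding: the argument and return types are inferred from the C snippet above, the shared-library filename in the usage comment is hypothetical, and treating a NULL return as failure is a guess.

```python
import ctypes


def bind_pcs(lib):
    """Declare the signatures inferred from the C example on a loaded library."""
    lib.crispasr_pcs_init.argtypes = [ctypes.c_char_p]
    lib.crispasr_pcs_init.restype = ctypes.c_void_p
    lib.crispasr_pcs_process.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
    # Keep the raw pointer (not c_char_p) so it can be handed back to free_text.
    lib.crispasr_pcs_process.restype = ctypes.c_void_p
    lib.crispasr_pcs_free_text.argtypes = [ctypes.c_void_p]
    lib.crispasr_pcs_free_text.restype = None
    lib.crispasr_pcs_free.argtypes = [ctypes.c_void_p]
    lib.crispasr_pcs_free.restype = None
    return lib


def punctuate(lib, model_path, text):
    """Load the model, process one string, and release both resources."""
    handle = lib.crispasr_pcs_init(model_path.encode("utf-8"))
    if not handle:  # assumption: NULL means the model failed to load
        raise RuntimeError("could not load PCS model")
    try:
        ptr = lib.crispasr_pcs_process(handle, text.encode("utf-8"))
        if not ptr:
            raise RuntimeError("PCS processing failed")
        try:
            # Copy the C string into a Python str before freeing it.
            return ctypes.string_at(ptr).decode("utf-8")
        finally:
            lib.crispasr_pcs_free_text(ptr)
    finally:
        lib.crispasr_pcs_free(handle)


# Usage (library filename is hypothetical and platform dependent):
# lib = bind_pcs(ctypes.CDLL("libcrispasr.so"))
# print(punctuate(lib, "pcs-xlmr-base-q4_k.gguf", "hello how are you"))
```

Declaring the return type of `crispasr_pcs_process` as `c_void_p` matters here: with `c_char_p`, `ctypes` would convert the result to `bytes` and discard the pointer, leaving nothing to pass to `crispasr_pcs_free_text`.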

Dart FFI

final pcs = crispasr.PcsModel.open('pcs-xlmr-base-q4_k.gguf');
final text = pcs.process('hello how are you doing today');
print(text); // "Hello, how are you doing today?"
pcs.close();

Comparison with other post-processors

Model	Languages	Punct	Truecase	SBD	Size (Q4_K)
PCS	47	17 types	per-char	yes	155 MB
FireRedPunc	ZH + EN	yes	yes	no	~100 MB
Fullstop-punc	EN/DE/FR/IT	yes	yes	no	~300 MB
Truecaser LSTM	DE/EN/ES/RU	no	yes	no	~3 MB

Of the models listed, PCS is the only one that handles all three tasks in one pass, and it covers the widest language set.

License

MIT (same as upstream 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase).
