PCS (Punctuation + Capitalization + Segmentation) - GGUF

GGUF conversion of 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase for use with CrispASR and CrisperWeaver.

Model

PCS performs three text post-processing tasks in a single pass:

  1. Punctuation restoration: adds commas, periods, question marks, exclamation marks, colons, semicolons, dashes, and 10 other punctuation types
  2. Truecasing: restores proper capitalization (per-character upper/lower classification)
  3. Sentence boundary detection: identifies sentence breaks

PCS is particularly useful for ASR backends that output unpunctuated lowercase text (wav2vec2, fastconformer-ctc, firered-asr, parakeet-ctc, omniasr-ctc).
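
To make the three tasks concrete, here is a minimal sketch of how per-word predictions could be turned back into text. It is an illustration only: the `pcs_word_pred` struct, its field names, and `pcs_apply` are hypothetical and are not part of the CrispASR API, and the case handling covers ASCII only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-word prediction record (not a CrispASR type). */
typedef struct {
    const char *word;         /* lowercase ASCII word from the ASR backend */
    const char *post_punc;    /* punctuation to append, "" for none */
    unsigned    upper_mask;   /* bit i set => uppercase character i */
    int         sentence_end; /* nonzero if a sentence ends after this word */
} pcs_word_pred;

/* Joins the words with single spaces, applying case and punctuation.
 * Returns the number of bytes written (without the terminator), or -1 if
 * `out` is too small. If n_sentences is non-NULL it receives the number
 * of sentence boundaries seen. */
static int pcs_apply(const pcs_word_pred *preds, size_t n,
                     char *out, size_t cap, size_t *n_sentences)
{
    assert(preds != NULL || n == 0);
    size_t pos = 0, sentences = 0;
    if (cap == 0) return -1;
    for (size_t w = 0; w < n; w++) {
        size_t wlen = strlen(preds[w].word);
        size_t plen = strlen(preds[w].post_punc);
        size_t need = (w > 0 ? 1 : 0) + wlen + plen;
        if (pos + need + 1 > cap) return -1;  /* +1 for the terminator */
        if (w > 0) out[pos++] = ' ';
        for (size_t i = 0; i < wlen; i++) {
            unsigned char c = (unsigned char)preds[w].word[i];
            /* Characters beyond the mask width stay lowercase. */
            int up = i < 8 * sizeof(unsigned) &&
                     ((preds[w].upper_mask >> i) & 1u);
            out[pos++] = (char)(up ? toupper(c) : c);
        }
        memcpy(out + pos, preds[w].post_punc, plen);
        pos += plen;
        if (preds[w].sentence_end) sentences++;
    }
    out[pos] = '\0';
    if (n_sentences) *n_sentences = sentences;
    return (int)pos;
}
```

Feeding it the nine words of "hello how are you doing today i am fine", with a comma after the first word, a question mark and sentence break after "today", and a period and sentence break after "fine", reproduces the output shown in the Usage section.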

Architecture

  • Encoder: XLM-RoBERTa-base (12 layers, d=768, 12 heads, SentencePiece tokenizer)
  • 4 classification heads:
    • post_punc: Linear(768 -> 256 -> 17), post-word punctuation (., ,, ?, !, :, ;, -, etc.)
    • pre_punc: Linear(768 -> 256 -> 2), pre-word punctuation (¿, ¡)
    • sbd: Linear(772 -> 128 -> 2), sentence boundary detection
    • truecase: Linear(769 -> 128 -> 16), per-character upper/lower case
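
Reading the `Linear(in -> hidden -> out)` notation as two stacked linear layers, a single head can be sketched as below. This is an assumption-laden illustration, not the CrispASR implementation: the ReLU between the layers, the row-major weight layout, and the names `linear` and `head_predict` are all hypothetical. The function is generic in its input width, since the source does not explain why the sbd and truecase heads take 772 and 769 inputs rather than 768.

```c
#include <assert.h>
#include <stddef.h>

/* y = W x + b, with W stored row-major as n_out rows of n_in floats. */
static void linear(const float *W, const float *b, const float *x,
                   float *y, size_t n_in, size_t n_out)
{
    for (size_t o = 0; o < n_out; o++) {
        float acc = b[o];
        for (size_t i = 0; i < n_in; i++)
            acc += W[o * n_in + i] * x[i];
        y[o] = acc;
    }
}

/* Two-layer head: hidden = relu(W1 x + b1), logits = W2 hidden + b2.
 * ReLU is an assumed activation; the source does not name one.
 * `hidden` and `logits` are caller-provided buffers of n_hidden and
 * n_out floats. Returns the index of the largest logit. */
static size_t head_predict(const float *x, size_t n_in,
                           const float *W1, const float *b1, size_t n_hidden,
                           const float *W2, const float *b2, size_t n_out,
                           float *hidden, float *logits)
{
    assert(n_out > 0);
    linear(W1, b1, x, hidden, n_in, n_hidden);
    for (size_t h = 0; h < n_hidden; h++)
        if (hidden[h] < 0.0f) hidden[h] = 0.0f;
    linear(W2, b2, hidden, logits, n_hidden, n_out);
    size_t best = 0;
    for (size_t o = 1; o < n_out; o++)
        if (logits[o] > logits[best]) best = o;
    return best;
}
```

For the post_punc head this would be called with n_in = 768, n_hidden = 256, and n_out = 17, once per token embedding produced by the encoder.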

Languages

PCS supports 47 languages via XLM-RoBERTa's multilingual encoder. Quality is best on the 12 languages the classification heads were trained on (EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, CS, DA), but the model also generalizes to the rest of the 47.
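
An application that wants to warn users when a transcript is in a language outside the 12 best-supported ones could use a small lookup like the following. The helper name `pcs_is_primary_lang` and the use of two-letter lowercase codes are assumptions made for this sketch; only the list of languages comes from the paragraph above.

```c
#include <ctype.h>
#include <stddef.h>

/* The 12 languages named above, as two-letter lowercase codes. */
static const char *const PCS_PRIMARY_LANGS[] = {
    "en", "de", "fr", "es", "it", "pt",
    "nl", "pl", "ru", "uk", "cs", "da"
};

/* Returns 1 if `code` is a two-letter code (any case) from the list
 * above, 0 otherwise (including NULL or codes of another length). */
static int pcs_is_primary_lang(const char *code)
{
    if (code == NULL || code[0] == '\0' || code[1] == '\0' || code[2] != '\0')
        return 0;
    char a = (char)tolower((unsigned char)code[0]);
    char b = (char)tolower((unsigned char)code[1]);
    size_t n = sizeof PCS_PRIMARY_LANGS / sizeof PCS_PRIMARY_LANGS[0];
    for (size_t i = 0; i < n; i++)
        if (PCS_PRIMARY_LANGS[i][0] == a && PCS_PRIMARY_LANGS[i][1] == b)
            return 1;
    return 0;
}
```

A return value of 0 does not mean PCS will fail on that language, only that it falls outside the set where quality is stated to be best.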

Files

| File | Size | Description |
|---|---|---|
| pcs-xlmr-base.gguf | 903 MB | Full-precision (F16), reference quality |
| pcs-xlmr-base-q4_k.gguf | 155 MB | Q4_K quantised, ~6x smaller |

Usage

CrispASR CLI

# Transcribe audio and apply PCS to the backend's unpunctuated output
crispasr --punc-model pcs-xlmr-base-q4_k.gguf \
  -f audio.wav \
  --backend parakeet

# Standalone text processing
echo "hello how are you doing today i am fine" | \
  crispasr-pcs pcs-xlmr-base-q4_k.gguf
# Output: "Hello, how are you doing today? I am fine."

CrisperWeaver (Flutter GUI)

Download the model from Model Management (Post-processors section), then enable "Restore punctuation" in Advanced Options. PCS then runs automatically as a post-processing step after transcription.

C API

#include "crispasr.h"

void* pcs = crispasr_pcs_init("pcs-xlmr-base-q4_k.gguf");
const char* result = crispasr_pcs_process(pcs, "hello how are you");
// result: "Hello, how are you?"
crispasr_pcs_free_text(result);
crispasr_pcs_free(pcs);

Dart FFI

final pcs = crispasr.PcsModel.open('pcs-xlmr-base-q4_k.gguf');
final text = pcs.process('hello how are you doing today');
print(text); // "Hello, how are you doing today?"
pcs.close();

Comparison with other post-processors

| Model | Languages | Punct | Truecase | SBD | Size (Q4_K) |
|---|---|---|---|---|---|
| PCS | 47 | 17 types | per-char | yes | 155 MB |
| FireRedPunc | ZH + EN | yes | yes | no | ~100 MB |
| Fullstop-punc | EN/DE/FR/IT | yes | yes | no | ~300 MB |
| Truecaser LSTM | DE/EN/ES/RU | no | yes | no | ~3 MB |

PCS is the most comprehensive option: it handles all three tasks in one pass across the widest language set.

License

MIT (same as upstream 1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase).
