This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Improving TrOCR Robustness under Unicode Diacritic Obfuscation

This repository contains a fine-tuned checkpoint of a text recognition model. It is intended to defend automated hate speech detection pipelines against character-level diacritic attacks, in which text is obfuscated with stacked Unicode combining marks (Zalgo text).
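To make the attack concrete, the following is a minimal sketch of diacritic obfuscation and of a purely rule-based way to undo it. It is an illustration only, not the injection procedure used to build the training data: the function names `inject_diacritics` and `strip_diacritics`, the `marks_per_char` parameter, and the choice of the Combining Diacritical Marks block (U+0300 to U+036F) are all assumptions made for this example.

```python
import random
import unicodedata

# Assumption: draw marks from the Combining Diacritical Marks block only.
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]


def inject_diacritics(word, marks_per_char=2, seed=None):
    """Append random combining marks after each letter (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    for ch in word:
        out.append(ch)
        if ch.isalpha():
            out.extend(rng.choice(COMBINING_MARKS) for _ in range(marks_per_char))
    return "".join(out)


def strip_diacritics(text):
    """Rule-based baseline: decompose, then drop every nonspacing mark (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    noisy = inject_diacritics("hate", marks_per_char=3, seed=1)
    print(repr(noisy))
    print(strip_diacritics(noisy))
```

Note that this rule-based baseline also removes legitimate accents (for example, it maps "café" to "cafe"), so it is shown only to clarify what the obfuscation does at the code-point level.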

Model Details

  • Base Architecture: Microsoft TrOCR (microsoft/trocr-base-printed)
  • Task: Optical Character Recognition (OCR) / Vision Encoder-Decoder
  • Fine-Tuning Objective: Recover the clean underlying text from input obfuscated with Unicode combining diacritical marks, so that downstream NLP classifiers are shielded from the obfuscation.
  • Academic Context: Developed as part of the EE-559 Deep Learning course project (2026) at EPFL.
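A minimal loading sketch with the Hugging Face `transformers` library is given below. It rests on several assumptions: that the checkpoint loads with the standard `VisionEncoderDecoderModel` class, that the processor of the base model (`microsoft/trocr-base-printed`) is appropriate for it, and that the input is an image file containing a single rendered word or line. The function name `recognize` is hypothetical. Because access to the repository requires accepting its conditions, an authenticated Hugging Face session is probably needed before the download will succeed.

```python
import sys

MODEL_ID = "Mamaa2001/trocr-model-diacritic"
# Assumption: reuse the base model's processor (image preprocessing + tokenizer).
BASE_PROCESSOR_ID = "microsoft/trocr-base-printed"


def recognize(image_path, model_id=MODEL_ID, processor_id=BASE_PROCESSOR_ID):
    """Run OCR on one image file and return the decoded text (hypothetical helper)."""
    # Imported lazily so the module can be inspected without these packages.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(processor_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


if __name__ == "__main__" and len(sys.argv) > 1:
    print(recognize(sys.argv[1]))
```

How obfuscated text is rendered into images before recognition is not described here, so that step is left out of the sketch.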

Training Data & Lineage

The model was fine-tuned using word-level diacritic injection mappings paired with:

License & Citation

In line with the license of the base model, this fine-tuned checkpoint is distributed under the MIT License.

If you use this model or reproduce our evaluation pipeline, please credit Microsoft's TrOCR and the Jigsaw/Google Civil Comments curation teams.

Checkpoint format: Safetensors, 0.3B parameters, tensor type F32.

Dataset used to train Mamaa2001/trocr-model-diacritic