Full-text search
482 results
viklofg / swedish-ocr-correction
README.md
model
6 matches
ml6team / byt5-base-dutch-ocr-correction
README.md
model
4 matches
tags:
transformers, pytorch, t5, text2text-generation, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
# Dutch OCR Correction
This model is a fine-tuned ByT5 model that corrects OCR mistakes found in Dutch sentences. The [google/byt5-base](https://huggingface.co/google/byt5-base) model is fine-tuned on the Dutch section of the [OSCAR](https://huggingface.co/datasets/oscar) dataset.
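A minimal usage sketch (not taken from the model card; the noisy sample and helper name are invented): wrap the checkpoint in a `transformers` text2text pipeline. Because ByT5 works directly on bytes, no language-specific tokenizer setup is involved.

```python
MODEL_ID = "ml6team/byt5-base-dutch-ocr-correction"

def correct_ocr(text: str, corrector=None, max_length: int = 128) -> str:
    """Run noisy OCR output through the correction model.

    `corrector` is injectable for testing; by default the Hugging Face
    pipeline is built, which downloads the checkpoint on first use.
    """
    if corrector is None:
        from transformers import pipeline  # heavyweight, imported lazily
        corrector = pipeline("text2text-generation", model=MODEL_ID)
    return corrector(text, max_length=max_length)[0]["generated_text"]
```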
yelpfeast / byt5-base-english-ocr-correction
README.md
model
14 matches
tags:
transformers, pytorch, t5, text2text-generation, en, dataset:wikitext, arxiv:2105.13626, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
# ByT5 for OCR Correction
This model is a fine-tuned version of [byt5-base](https://huggingface.co/google/byt5-base) for OCR correction. ByT5 was introduced in [this paper](https://arxiv.org/abs/2105.13626); the idea and code for fine-tuning the model for OCR correction were taken from [here](https://blog.ml6.eu/ocr-correction-with-byt5-5994d1217c07).
PleIAs / OCRonos
README.md
model
10 matches
tags:
transformers, safetensors, llama, text-generation, conversational, fr, en, de, es, it, license:apache-2.0, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
OCRonos is a series of models for the correction of badly digitized texts, as part of the **Bad Data Toolbox**.
OCRonos models are versatile tools supporting the correction of OCR errors, incorrect word splits and merges, and otherwise broken text structures. The training data includes a highly diverse set of OCRized texts in multiple languages from PleIAs' open pre-training corpus, drawn from cultural heritage sources (Common Corpus) and financial and administrative documents in open data (Finance Commons).
This release currently features a model based on Llama 3 8B, which has been the most extensively tested to date. The model was trained using HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay. Future releases will focus on smaller internal models that provide a better generation cost/quality ratio.
pykale / bart-base-ocr
README.md
model
7 matches
tags:
transformers, safetensors, bart, text2text-generation, en, license:mit, autotrain_compatible, endpoints_compatible, region:us
This model accompanies the paper [Leveraging LLMs for Post-OCR Correction of Historical Newspapers](https://aclanthology.org/2024.lt4hala-1.14/) and is designed to correct OCR text. [BART-base](https://huggingface.co/facebook/bart-base) is fine-tuned for post-OCR correction of historical English, using [BLN600](https://aclanthology.org/2024.lrec-main.219/), a parallel corpus of 19th-century newspaper machine/human transcriptions.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Sketch completing the truncated snippet; model id taken from this listing
tokenizer = AutoTokenizer.from_pretrained("pykale/bart-base-ocr")
model = AutoModelForSeq2SeqLM.from_pretrained("pykale/bart-base-ocr")
corrector = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
```
pykale / bart-large-ocr
README.md
model
7 matches
tags:
transformers, safetensors, bart, text2text-generation, en, license:mit, autotrain_compatible, endpoints_compatible, region:us
This model accompanies the paper [Leveraging LLMs for Post-OCR Correction of Historical Newspapers](https://aclanthology.org/2024.lt4hala-1.14/) and is designed to correct OCR text. [BART-large](https://huggingface.co/facebook/bart-large) is fine-tuned for post-OCR correction of historical English, using [BLN600](https://aclanthology.org/2024.lrec-main.219/), a parallel corpus of 19th-century newspaper machine/human transcriptions.
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Sketch completing the truncated snippet; model id taken from this listing
tokenizer = AutoTokenizer.from_pretrained("pykale/bart-large-ocr")
model = AutoModelForSeq2SeqLM.from_pretrained("pykale/bart-large-ocr")
corrector = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
```
pykale / llama-2-7b-ocr
README.md
model
8 matches
tags:
peft, safetensors, en, base_model:meta-llama/Llama-2-7b-hf, license:mit, region:us
This model accompanies the paper [Leveraging LLMs for Post-OCR Correction of Historical Newspapers](https://aclanthology.org/2024.lt4hala-1.14/) and is designed to correct OCR text. [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) is instruction-tuned for post-OCR correction of historical English, using [BLN600](https://aclanthology.org/2024.lrec-main.219/), a parallel corpus of 19th-century newspaper machine/human transcriptions.
## Usage
```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Sketch completing the truncated snippet; loading the adapter assumes
# access to the gated Llama 2 base weights
model = AutoPeftModelForCausalLM.from_pretrained("pykale/llama-2-7b-ocr")
tokenizer = AutoTokenizer.from_pretrained("pykale/llama-2-7b-ocr")
```
pykale / llama-2-13b-ocr
README.md
model
8 matches
tags:
peft, safetensors, en, base_model:meta-llama/Llama-2-13b-hf, license:mit, region:us
This model accompanies the paper [Leveraging LLMs for Post-OCR Correction of Historical Newspapers](https://aclanthology.org/2024.lt4hala-1.14/) and is designed to correct OCR text. [Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf) is instruction-tuned for post-OCR correction of historical English, using [BLN600](https://aclanthology.org/2024.lrec-main.219/), a parallel corpus of 19th-century newspaper machine/human transcriptions.
## Usage
```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Sketch completing the truncated snippet; loading the adapter assumes
# access to the gated Llama 2 base weights
model = AutoPeftModelForCausalLM.from_pretrained("pykale/llama-2-13b-ocr")
tokenizer = AutoTokenizer.from_pretrained("pykale/llama-2-13b-ocr")
```
versae / filiberto-7B-instruct-exp1
README.md
model
3 matches
jvdzwaan / ocrpostcorrection-task-1
README.md
model
10 matches
tags:
transformers, pytorch, bert, token-classification, post-ocr correction, ocr postcorrection, bg, cs, de, en, es, fi, fr, nl, pl, sl, multilingual, autotrain_compatible, endpoints_compatible, region:us
# OCR postcorrection task 1
This is a BertForTokenClassification model that predicts whether a token is an OCR
mistake or not. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
and finetuned on the dataset of the
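A sketch of driving such a token-classification checkpoint to flag suspect tokens (the helper name is invented, and the label names the model emits are whatever its card defines):

```python
MODEL_ID = "jvdzwaan/ocrpostcorrection-task-1"

def flag_ocr_mistakes(text: str, detector=None):
    """Return (token, label, score) triples for each prediction on `text`.

    `detector` is injectable for testing; by default a Hugging Face
    token-classification pipeline is built (downloads the checkpoint).
    """
    if detector is None:
        from transformers import pipeline  # heavyweight, imported lazily
        detector = pipeline("token-classification", model=MODEL_ID)
    return [(p["word"], p["entity"], p["score"]) for p in detector(text)]
```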
DeepMount00 / OCR_corrector
README.md
model
8 matches
tags:
transformers, safetensors, t5, text2text-generation, it, license:apache-2.0, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
# Italian OCR Error Correction Sequence-to-Sequence Model
## Model Details
This model is the first version of an experimental sequence-to-sequence architecture designed specifically for Italian. It aims to correct approximately 93% of the errors generated by low-quality Optical Character Recognition (OCR) systems, which tend to perform poorly on Italian text. Given raw OCR-scanned text as input, the model outputs a corrected version, significantly reducing errors and improving readability and accuracy.
manu / ocr_correction
README.md
model
2 matches
Var3n / hmByT5_anno
README.md
model
2 matches
tags:
transformers, pytorch, t5, text2text-generation, ByT5, historical, ocr-correction, de, license:mit, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
The model was fine-tuned to correct OCR mistakes. The `max_length` was set to 350.
## Performance
```
SacreBLEU eval dataset: 10.83
```
PleIAs / Segmentext
README.md
model
2 matches
Thang203 / general_nlp_research_paper
README.md
model
10 matches
tags:
bertopic, text-classification, region:us
| 57 | gec - grammatical error - grammatical error correction - error correction - correction | 40 | 57_gec_grammatical error_grammatical error correction_error correction |
| 58 | intent - intent detection - slot - slot filling - filling | 40 | 58_intent_intent detection_slot_slot filling |
| 59 | temporal - events - temporal relations - expressions - temporal relation | 39 | 59_temporal_events_temporal relations_expressions |
| 60 | adaptation - domain - domain adaptation - indomain - translation | 37 | 60_adaptation_domain_domain adaptation_indomain |
| 61 | stance - stance detection - detection - tweets - veracity | 37 | 61_stance_stance detection_detection_tweets |
slone / canine-c-bashkir-gec-v1
README.md
model
4 matches
tags:
transformers, pytorch, canine, token-classification, grammatical error correction, ba, license:apache-2.0, autotrain_compatible, endpoints_compatible, region:us
# Bashkir Spelling Correction v1
This model is a version of [google/canine-c](https://huggingface.co/google/canine-c) fine-tuned to fix corrupted texts.
It was trained on a mixture of two parallel datasets in the Bashkir language:
- sentences post-edited by humans after OCR
pszemraj / grammar-synthesis-large
README.md
model
11 matches
tags:
transformers, pytorch, safetensors, t5, text2text-generation, grammar, spelling, punctuation, error-correction, grammar synthesis, dataset:jfleg, arxiv:2107.06751, license:cc-by-nc-sa-4.0, license:apache-2.0, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
A model fine-tuned for grammar correction on an expanded version of the [JFLEG](https://paperswithcode.com/dataset/jfleg) dataset.
Usage in Python (after `pip install transformers`):
```python
from transformers import pipeline

# Sketch completing the truncated snippet; model id taken from this listing
corrector = pipeline("text2text-generation", model="pszemraj/grammar-synthesis-large")
```
pszemraj / grammar-synthesis-base
README.md
model
10 matches
tags:
transformers, pytorch, safetensors, t5, text2text-generation, grammar, spelling, punctuation, error-correction, grammar synthesis, dataset:jfleg, arxiv:2107.06751, license:cc-by-nc-sa-4.0, autotrain_compatible, text-generation-inference, endpoints_compatible, region:us
A model fine-tuned for grammar correction on an expanded version of the [JFLEG](https://paperswithcode.com/dataset/jfleg) dataset. Check out a [demo notebook on Colab here](https://colab.research.google.com/gist/pszemraj/91abb08aa99a14d9fdc59e851e8aed66/demo-for-grammar-synthesis-base.ipynb).
Usage in Python (after `pip install transformers`):
```python
from transformers import pipeline

# Sketch completing the truncated snippet; model id taken from this listing
corrector = pipeline("text2text-generation", model="pszemraj/grammar-synthesis-base")
```
pszemraj / grammar-synthesis-small
README.md
model
10 matches
tags:
transformers, pytorch, onnx, safetensors, t5, text2text-generation, grammar, spelling, punctuation, error-correction, grammar synthesis, FLAN, dataset:jfleg, arxiv:2107.06751, license:cc-by-nc-sa-4.0, license:apache-2.0, autotrain_compatible, endpoints_compatible, text-generation-inference, region:us
A model fine-tuned for grammar correction on an expanded version of the [JFLEG](https://paperswithcode.com/dataset/jfleg) dataset.
Usage in Python (after `pip install transformers`):
```python
from transformers import pipeline

# Sketch completing the truncated snippet; model id taken from this listing
corrector = pipeline("text2text-generation", model="pszemraj/grammar-synthesis-small")
```
pszemraj / flan-t5-large-grammar-synthesis
README.md
model
12 matches
tags:
transformers, pytorch, onnx, safetensors, t5, text2text-generation, grammar, spelling, punctuation, error-correction, grammar synthesis, FLAN, dataset:jfleg, arxiv:2107.06751, doi:10.57967/hf/0138, license:cc-by-nc-sa-4.0, license:apache-2.0, autotrain_compatible, endpoints_compatible, text-generation-inference, region:us
A model fine-tuned for grammar correction on an expanded version of the [JFLEG](https://paperswithcode.com/dataset/jfleg) dataset. [Demo](https://huggingface.co/spaces/pszemraj/FLAN-grammar-correction) on HF Spaces.
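For completeness, a sketch of running the correction without the pipeline helper, calling the seq2seq API directly (the helper name and sample text are invented; the model id is taken from this listing):

```python
def correct_grammar(
    text: str,
    model_id: str = "pszemraj/flan-t5-large-grammar-synthesis",
    max_length: int = 64,
) -> str:
    """Correct `text` with a direct tokenize -> generate -> decode round trip.

    Downloads the checkpoint on first use; imports are lazy so the
    function can be defined without transformers loaded.
    """
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```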
## Example
![example](https://i.imgur.com/PIhrc7E.png)