Full Text Search - Hugging Face

Full-text search


186 results

louisraedisch / AlphaNum

README.md

dataset

15 matches

tags: task_categories:image-classification, size_categories:100K<n<1M, language:en, license:mit, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition, croissant, region:us

tten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to bolster Optical Character Recognition (OCR) research and development.

For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
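The colour inversion mentioned above can be illustrated with a minimal sketch. It assumes 8-bit grayscale pixels (0 to 255) stored as nested lists; the function name `invert_grayscale` is hypothetical and is not taken from the AlphaNum dataset card, which does not show its own preprocessing code.

```python
def invert_grayscale(image, max_value=255):
    """Return a new image where each pixel p becomes max_value - p.

    Assumption: `image` is a list of rows, each row a list of integer
    pixel intensities in the range 0..max_value.
    """
    return [[max_value - pixel for pixel in row] for row in image]


# A tiny 2x3 "image": dark pixels become light and vice versa.
sample = [[0, 128, 255], [10, 200, 45]]
inverted = invert_grayscale(sample)
```

Applying the function twice returns the original image, which is a quick sanity check for this kind of preprocessing step.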

## Data Sources

benhachem / KHATT

README.md

dataset

4 matches

tags: task_categories:image-to-text, size_categories:1K<n<10K, language:ar, OCR, Optical Character Recognition, Arabic OCR, arabic, ocr, Textline images, croissant, region:us

ting Recognition (ICFHR 2012), Bari, Italy, 2012, pp. 447-452, IEEE Computer Society.

iarata / PHCR-DB25

README.md

dataset

21 matches

tags: size_categories:1K<n<10K, language:fa, ocr, character-recognition, persian, historical, handwritten, nastaliq, character, croissant, doi:10.57967/hf/1799, region:us

tten Characters

## Dataset Description

- **Model**: https://huggingface.co/iarata/Few-Shot-PHCR

TrainingDataPro / ocr-trains-dataset

README.md

dataset

7 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, code, finance, region:us

ough optical character recognition (OCR) technology, which extracts text from images, in this case, **the train number**.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/train-numbers?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-trains-dataset)** to buy the dataset

The dataset can be used to train machine learning models for extracting and analyzing text from train-related documents or images, to develop algorithms or models for real-time updates, or to build intelligent systems related to trains and transportation.

TheBritishLibrary / blbooks

README.md

dataset

15 matches

tags: task_categories:text-generation, task_categories:fill-mask, task_categories:other, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:machine-generated, multilinguality:multilingual, size_categories:100K<n<1M, source_datasets:original, language:de, language:en, language:es, language:fr, language:it, language:nl, license:cc0-1.0, digital-humanities-research, croissant, region:us

- [Optical Character Recognition](#optical-character-recognition)

- [OCR word confidence](#ocr-word-confidence)

- [Dataset Structure](#dataset-structure)

- [Data Instances](#data-instances)

- [Data Fields](#data-fields)

nateraw / rendered-sst2

README.md

dataset

3 matches

tags: task_categories:image-classification, task_ids:multi-class-image-classification, annotations_creators:machine-generated, language_creators:crowdsourced, multilinguality:monolingual, size_categories:1K<n<10K, source_datasets:extended|sst2, language:en, license:unknown, croissant, region:us

y on optical character recognition. This dataset was generated by rendering sentences in the Stanford Sentiment Treebank v2 dataset.

This dataset contains two classes (positive and negative) and is divided into three splits: a train split containing 6920 images (3610 positive and 3310 negative), a validation split containing 872 images (444 positive and 428 negative), and a test split containing 1821 images (909 positive and 912 negative).
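The split sizes quoted above can be cross-checked with a short sketch. The counts are copied from the snippet; the dictionary layout and the helper name `check_splits` are assumptions made for illustration and do not reflect how the dataset itself is stored.

```python
# Per-split class counts as quoted in the rendered-sst2 snippet.
splits = {
    "train": {"positive": 3610, "negative": 3310, "total": 6920},
    "validation": {"positive": 444, "negative": 428, "total": 872},
    "test": {"positive": 909, "negative": 912, "total": 1821},
}


def check_splits(split_counts):
    """Verify positive + negative == total per split; return the overall image count."""
    for name, counts in split_counts.items():
        if counts["positive"] + counts["negative"] != counts["total"]:
            raise ValueError(f"class counts do not add up for split {name!r}")
    return sum(counts["total"] for counts in split_counts.values())


total_images = check_splits(splits)
```

A check like this is cheap to run after downloading any dataset whose card states per-class counts.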

TrainingDataPro / ocr-receipts-text-detection

README.md

dataset

8 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, code, finance, region:us

to **Optical Character Recognition (OCR)** and is useful for retail.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/ocr-receipts-text-detection?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-receipts-text-detection)** to buy the dataset

Each image in the dataset is accompanied by bounding box annotations, indicating the precise locations of specific text segments on the receipts. The text segments are categorized into four classes: **item, store, date_time and total**.

TrainingDataPro / ocr-barcodes-detection

README.md

dataset

8 matches

tags: task_categories:image-to-text, language:en, license:cc-by-nc-nd-4.0, code, finance, region:us

lly, Optical Character Recognition (**OCR**) has been performed on each bounding box to extract the barcode numbers.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-barcodes-detection)** to buy the dataset

The dataset is particularly valuable for applications in *grocery retail, inventory management, supply chain optimization, and automated checkout systems*. It serves as a valuable resource for researchers, developers, and businesses working on barcode-related projects in the retail and logistics domains.

gksriharsha / chitralekha

README.md

dataset

3 matches

tags: task_categories:image-to-text, size_categories:1M<n<10M, language:te, license:gpl-3.0, croissant, region:us

for Optical Character Recognition (OCR) in the Telugu language, featuring an impressive array of 80+ configurations. Each configuration in this dataset corresponds to a unique font, meticulously curated by Dr. Rakesh Achanta and sourced from his GitHub repository (https://github.com/TeluguOCR/banti_telugu_ocr).

The dataset is specifically designed to support and enhance the development of OCR models, ranging from simple Convolutional Recurrent Neural Network (CRNN) architectures to more advanced systems like TrOCR. The versatility of this dataset lies in its large volume and diversity, making it an ideal choice for researchers and developers aiming to build robust OCR systems for the Telugu script.

cpans / idcard_name

README.md

dataset

5 matches

tags: license:apache-2.0, code, region:us

OCR (Optical Character Recognition) recognition, you can explore various open-source platforms and repositories such as GitHub, Model Zoo, or specific frameworks' model hubs like TensorFlow Hub or PyTorch Hub. ID OCR recognition models are designed to extract text from identity cards, including personal details like name, ID number, date of birth, and other relevant information. These models are trained on diverse datasets to accurately recognize and extract text from various ID card formats and designs.

<a href="https://github.com/CCCpan/Gebaini"> Click on me free access </a>

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646ec72b66f7b97a94fe3aa5/ehrut2cuO2UzJ239Vh0QO.png)

learn2train / the_times_archive_1824

README.md

dataset

3 matches

tags: language:en, license:cc0-1.0, newspaper, history, croissant, region:us

from Optical Character Recognition software on digitised newspaper pages. This dataset includes the plain text from the OCR alongside some minimal metadata associated with the newspaper from which the text is derived.

This dataset can be used for:

historical research and digital humanities research

TrainingDataPro / ocr-generated-machine-readable-zone-mrz-text-detection

README.md

dataset

9 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, code, legal, croissant, region:us

nd **Optical Character Recognition (OCR)** results.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/ocr-machine-readable-zone-mrz?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-generated-machine-readable-zone-mrz-text-detection)** to buy the dataset

This dataset is useful for developing applications related to *document verification, identity authentication, or automated data extraction from identification documents*.

biglam / hmd_newspapers

README.md

dataset

5 matches

tags: task_categories:text-generation, size_categories:1M<n<10M, language:en, license:cc0-1.0, newspapers, croissant, region:us

from Optical Character Recognition software on digitised newspaper pages. This dataset includes the plain text from the OCR alongside some minimal metadata associated with the newspaper from which the text is derived and OCR confidence score information generated from the OCR software.

### Supported Tasks and Leaderboards

taln-ls2n / semeval-2010-pre

README.md

dataset

3 matches

tags: task_categories:text-generation, annotations_creators:unknown, language_creators:unknown, multilinguality:monolingual, size_categories:n<1K, language:en, license:cc-by-4.0, croissant, region:us

g an Optical Character Recognition (OCR) system and perform document logical structure detection using ParsCit v110505.

We use the detected logical structure to remove author-assigned keyphrases and select only relevant elements: title, headers, abstract, introduction, related work, body text and conclusion.

We finally apply a systematic dehyphenation at line breaks.

* `lvl-3`: we further abridge the input text from level 2 preprocessed documents to the following: title, headers, abstract, introduction, related work, background and conclusion.
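The dehyphenation step described above can be sketched as follows. This is a naive illustration only: the snippet does not say how the dataset authors implemented it, and the regular expression and the function name `dehyphenate` are assumptions.

```python
import re

# Assumed pattern: a word character, a hyphen, a line break (with optional
# surrounding whitespace), then a lowercase letter continuing the word.
_HYPHEN_BREAK = re.compile(r"(\w)-\s*\n\s*([a-z])")


def dehyphenate(text):
    """Join words that were split with a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


sample = "key-\nphrase extraction for docu-\nments"
cleaned = dehyphenate(sample)
```

Note that a rule this simple also merges genuine hyphenated compounds that happen to break across lines (for example `well-` followed by `known`), so a production pipeline would likely need a dictionary check or similar safeguard.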

bigscience-data / roots_en_odiencorp

README.md

dataset

3 matches

tags: language:en, license:cc-by-nc-sa-4.0, region:us

and optical character recognition. OdiEnCorp 2.0 served in WAT 2020 English-Odia Indic Task.

https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3211

bigscience-data / roots_indic-or_odiencorp

README.md

dataset

3 matches

tags: language:or, license:cc-by-nc-sa-4.0, region:us

and optical character recognition. OdiEnCorp 2.0 served in WAT 2020 English-Odia Indic Task.

https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3211

Livingwithmachines / hmd-erwt-training

README.md

dataset

3 matches

tags: task_categories:fill-mask, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:machine-generated, multilinguality:monolingual, size_categories:100K<n<1M, language:en, license:cc0-1.0, library,lam,newspapers,1800-1900, croissant, region:us

from Optical Character Recognition software on digitised newspaper pages. This dataset includes the plain text from the OCR alongside some minimal metadata associated with the newspaper from which the text is derived and OCR confidence score information generated from the OCR software.

#### Breakdown of word counts over time

Whilst the dataset covers a time period between 1800 and 1870, the number of words is not distributed evenly across that period. The figures below give a sense of the breakdown over time in terms of the number of words which appear in the dataset.

biglam / blbooks-parquet

README.md

dataset

15 matches

tags: task_categories:text-generation, task_categories:fill-mask, task_categories:other, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:machine-generated, multilinguality:multilingual, size_categories:100K<n<1M, source_datasets:blbooks, language:de, language:en, language:es, language:fr, language:it, language:nl, license:cc0-1.0, digital-humanities-research, croissant, region:us

- [Optical Character Recognition](#optical-character-recognition)

- [OCR word confidence](#ocr-word-confidence)

- [Dataset Structure](#dataset-structure)

- [Data Instances](#data-instances)

- [Data Fields](#data-fields)

PypayaTech / PypayaNumbers

README.md

dataset

8 matches

tags: task_categories:feature-extraction, size_categories:10K<n<100K, license:mit, ocr, numbers, computervision, region:us

k of Optical Character Recognition (OCR) and object detection. Specifically, it can be used for tasks like digit recognition in images.

The dataset does not contain any natural language data.

saurabh1896 / OMR-scanned-documents

README.md

dataset

4 matches

tags: croissant, region:us

form recognition, data validation, and patient data management.

Additionally, this dataset serves as a valuable training and evaluation resource for image processing and optical character recognition (OCR) algorithms, enhancing the accuracy and efficiency of document digitization efforts within the healthcare sector. With the potential to improve data accuracy, reduce administrative burdens, and enhance patient care, the medical forms dataset with scanned documents is a cornerstone for advancing healthcare data management and accessibility.