Full Text Search - Hugging Face



364 results

louisraedisch / AlphaNum

README.md

dataset

15 matches

tags: task_categories:image-classification, language:en, license:mit, size_categories:100K<n<1M, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition

tten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to bolster Optical Character Recognition (OCR) research and development.

For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
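The color inversion described above can be sketched as follows. This is a minimal illustration, assuming 8-bit grayscale pixels (values 0 to 255) stored as nested lists; the function name `invert_grayscale` is hypothetical and is not taken from the AlphaNum repository.

```python
from typing import List


def invert_grayscale(pixels: List[List[int]]) -> List[List[int]]:
    """Invert an 8-bit grayscale image given as rows of pixel values.

    Assumption: every value lies in 0..255, so the inverse of p is 255 - p.
    """
    return [[255 - p for p in row] for row in pixels]


# A tiny 2x2 "image": black (0) becomes white (255) and vice versa.
sample = [[0, 255], [100, 30]]
inverted = invert_grayscale(sample)
```

Applying the function twice returns the original image, which is a quick sanity check for any inversion step.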

## Data Sources

benhachem / KHATT

README.md

dataset

4 matches

tags: task_categories:image-to-text, language:ar, size_categories:1K<n<10K, region:us, OCR, Optical Character Recognition, Arabic OCR, arabic, ocr, Textline images

ting Recognition (ICFHR 2012), Bari, Italy, 2012, pp. 447-452, IEEE Computer Society.

Nada2125 / Khatt-Dataset-Unique-lines-full

README.md

dataset

3 matches

tags: task_categories:image-to-text, language:ar, license:mit, size_categories:1K<n<10K, format:text, modality:text, library:datasets, library:mlcroissant, region:us, OCR, Optical Character Recognition, Arabic OCR, arabic, ocr, Textline images

iarata / PHCR-DB25

README.md

dataset

21 matches

tags: language:fa, size_categories:1K<n<10K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, doi:10.57967/hf/1799, region:us, ocr, character-recognition, persian, historical, handwritten, nastaliq, character

tten Characters

## Dataset Description

- **Model**: https://huggingface.co/iarata/Few-Shot-PHCR

LULab / myOCR

README.md

dataset

9 matches

tags: language:my, license:cc-by-nc-sa-4.0, size_categories:10K<n<100K, format:text, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, ocr, myanmar-language, myanmar

OCR: Optical Character Recognition Corpus for Myanmar language (Burmese)

The Burmese text data are word-segmented with the delimiter _

Line-level Text Images for OCR are under the folder dataset and zipped.
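Since the snippet says the text is word-segmented with the delimiter `_`, recovering the word list could look like the sketch below. The helper name `split_segmented` and the placeholder sample line are hypothetical; whether the corpus puts whitespace around the delimiter is an assumption, so tokens are stripped defensively.

```python
from typing import List


def split_segmented(line: str, delimiter: str = "_") -> List[str]:
    """Split a word-segmented line into words.

    Assumption: words are separated by `delimiter`; surrounding whitespace
    and empty tokens (e.g. from a trailing delimiter) are discarded.
    """
    tokens = (token.strip() for token in line.split(delimiter))
    return [token for token in tokens if token]


# Placeholder sample, not taken from the corpus.
words = split_segmented("word1_word2_word3\n")
```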

huyhuy123 / ViOCRVQA

README.md

dataset

6 matches

tags: modality:image, arxiv:2404.18397, region:us

mese Optical Character Recognition - Visual Question Answering

![examples](sample.png)

bsesic / HebrewManuscripts

README.md

dataset

7 matches

tags: task_categories:object-detection, language:he, license:mit, size_categories:n<1K, region:us, hebrew, manuscripts, digital-humanities

tter Recognition Dataset

## Dataset Description

This dataset contains images of **Hebrew letters** and the **stop symbols** for training and evaluating optical character recognition (OCR) models. The dataset is designed to support the development of machine learning models capable of identifying individual Hebrew letters from images, making it ideal for tasks such as:

Omarrran / Kashmiri_one_word_text_reconition_dataset

README.md

dataset

3 matches

tags: size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us

for Optical Character Recognition (OCR) tasks involving the Kashmiri language. It contains images generated with different fonts and texts from a Kashmiri text corpus.

## Dataset Structure

- **Training Set**: 85% of the total samples.

abdoelsayed / CORU

README.md

dataset

3 matches

tags: task_categories:object-detection, task_categories:text-classification, task_categories:zero-shot-classification, language:en, language:ar, license:mit, size_categories:10K<n<100K, modality:image, modality:text, arxiv:2406.04493, region:us

s of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings in Egypt, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing.

## Dataset Overview

CORU is divided into three challenges:

suchut / thaitrocr-eval-dataset-beta

README.md

dataset

3 matches

tags: language:th, language:en, license:cc-by-sa-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, OCR, dataset, evaluation, multilingual, handwritten

ting Optical Character Recognition (OCR) models across various domains. It includes images and textual data derived from various open-source websites.

This dataset aims to provide a comprehensive evaluation resource for researchers and developers working on OCR systems, particularly in Thai language processing.

### Data Fields

TrainingDataPro / ocr-trains-dataset

README.md

dataset

7 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, region:us, code, finance

ough optical character recognition (OCR) technology, which extracts text from images, in this case, **the train number**.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/train-numbers?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-trains-dataset)**

The dataset can be used to train machine learning models for extracting and analyzing text from train-related documents or images, to develop algorithms or models for real-time updates, or to build intelligent systems related to trains and transportation.

openthaigpt / thai-ocr-evaluation

README.md

dataset

3 matches

tags: language:th, language:en, license:cc-by-sa-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, OCR, dataset, evaluation, multilingual, handwritten

ting Optical Character Recognition (OCR) models across various domains. It includes images and textual data derived from various open-source websites.

This dataset aims to provide a comprehensive evaluation resource for researchers and developers working on OCR systems, particularly in Thai language processing.

### Data Fields

m-biriuchinskii / ICDAR2017-filtered-1800-1900

README.md

dataset

6 matches

tags: task_categories:image-to-text, language:fr, size_categories:n<1K, format:parquet, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, OCR, NLP, TAL

Text Recognition, focusing on monograph texts written between 1800 and 1900. It consists of a total of **957 documents**, divided into training, validation, and testing sets, and is designed for post-correction of OCR (Optical Character Recognition) text.

- **Total Documents**: 957

- **Training Set**: 765

- **Validation Set**: 95

SEACrowd / alice_thi

README.md

dataset

11 matches

tags: language:tha, license:unknown, arxiv:2406.10118, region:us, optical-character-recognition

images, which is split into Thai handwritten character dataset (THI-C68) for 14490 images and Thai handwritten digit dataset (THI-D10) for 9555 images. The data was collected from 150 native writers aged from 20 to 23 years old. The participants were allowed to write only the isolated Thai script on the form and

TrainingDataPro / ocr-receipts-text-detection

README.md

dataset

8 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, region:us, code, finance

to **Optical Character Recognition (OCR)** and is useful for retail.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/ocr-receipts-text-detection?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-receipts-text-detection)**

Each image in the dataset is accompanied by bounding box annotations, indicating the precise locations of specific text segments on the receipts. The text segments are categorized into four classes: **item, store, date_time and total**.

alexbeatson / burmese_ocr_data

README.md

dataset

9 matches

tags: language:my, language:en, license:other, size_categories:n<1K, doi:10.57967/hf/3361, region:us, ocr

ning Optical Character Recognition (OCR) models.

The data was curated from the [Burma Library](https://www.burmalibrary.org/) archive, which collects and preserves government and NGO documents. These documents were processed using Google Document AI to extract text and bounding boxes. Images of the identified text were then cropped and organized in this dataset.

TheBritishLibrary / blbooks

README.md

dataset

15 matches

tags: task_categories:text-generation, task_categories:fill-mask, task_categories:other, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:machine-generated, multilinguality:multilingual, source_datasets:original, language:de, language:en, language:es, language:fr, language:it, language:nl, license:cc0-1.0, size_categories:100K<n<1M, region:us, digital-humanities-research

- [Optical Character Recognition](#optical-character-recognition)

- [OCR word confidence](#ocr-word-confidence)

- [Dataset Structure](#dataset-structure)

- [Data Instances](#data-instances)

- [Data Fields](#data-fields)

nateraw / rendered-sst2

README.md

dataset

3 matches

tags: task_categories:image-classification, task_ids:multi-class-image-classification, annotations_creators:machine-generated, language_creators:crowdsourced, multilinguality:monolingual, source_datasets:extended|sst2, language:en, license:unknown, size_categories:1K<n<10K, format:parquet, modality:image, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us

y on optical character recognition. This dataset was generated by rendering sentences in the Stanford Sentiment Treebank v2 dataset.

This dataset contains two classes (positive and negative) and is divided into three splits: a train split containing 6920 images (3610 positive and 3310 negative), a validation split containing 872 images (444 positive and 428 negative), and a test split containing 1821 images (909 positive and 912 negative).
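The split sizes quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the snippet; the dictionary layout and variable names are illustrative assumptions, not part of the dataset's own tooling.

```python
# Per-split class counts and stated totals, copied from the description above.
splits = {
    "train": {"positive": 3610, "negative": 3310, "stated_total": 6920},
    "validation": {"positive": 444, "negative": 428, "stated_total": 872},
    "test": {"positive": 909, "negative": 912, "stated_total": 1821},
}

# Each split's class counts should add up to its stated total.
consistent = all(
    s["positive"] + s["negative"] == s["stated_total"] for s in splits.values()
)

# Overall number of images across the three splits.
total_images = sum(s["stated_total"] for s in splits.values())
```

All three splits are internally consistent, and together they contain 9613 images.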

TrainingDataPro / ocr-barcodes-detection

README.md

dataset

8 matches

tags: task_categories:image-to-text, language:en, license:cc-by-nc-nd-4.0, region:us, code, finance

lly, Optical Character Recognition (**OCR**) has been performed on each bounding box to extract the barcode numbers.

# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-barcodes-detection)**

The dataset is particularly valuable for applications in *grocery retail, inventory management, supply chain optimization, and automated checkout systems*. It serves as a valuable resource for researchers, developers, and businesses working on barcode-related projects in the retail and logistics domains.

cpans / idcard_name

README.md

dataset

5 matches

tags: license:apache-2.0, region:us, code

OCR (Optical Character Recognition), you can explore various open-source platforms and repositories such as GitHub, Model Zoo, or specific frameworks' model hubs like TensorFlow Hub or PyTorch Hub. ID OCR models are designed to extract text from identity cards, including personal details like name, ID number, date of birth, and other relevant information. These models are trained on diverse datasets to accurately recognize and extract text from various ID card formats and designs.

<a href="https://github.com/CCCpan/Gebaini">Click here for free access</a>

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646ec72b66f7b97a94fe3aa5/ehrut2cuO2UzJ239Vh0QO.png)