Full-text search
326 results
louisraedisch / AlphaNum
README.md
dataset
15 matches
tags:
task_categories:image-classification, language:en, license:mit, size_categories:100K<n<1M, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition
tten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to bolster Optical Character Recognition (OCR) research and development.
For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
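The snippet does not say how the inversion was performed. As a minimal sketch, assuming 8-bit grayscale values in the range 0 to 255, inverting an image amounts to replacing each pixel `p` with `255 - p`. The helper name `invert_grayscale` and the tiny sample image are hypothetical and used only for illustration.

```python
def invert_grayscale(pixels, max_value=255):
    """Return a copy of a 2-D grayscale image with every intensity inverted.

    Assumes integer pixel values in [0, max_value]; this is an illustrative
    assumption, not the dataset author's documented procedure.
    """
    return [[max_value - p for p in row] for row in pixels]


# Tiny stand-in for a 24x24 image: a bright stroke on a dark background,
# which is how MNIST digits are stored.
sample = [
    [0, 255, 0],
    [0, 200, 0],
    [0, 255, 0],
]

# After inversion the stroke is dark on a bright background.
inverted = invert_grayscale(sample)
```

Applying the function twice returns the original image, which makes it easy to sanity-check.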
## Data Sources
Nada2125 / Khatt-Dataset-Unique-lines-full
README.md
dataset
3 matches
iarata / PHCR-DB25
README.md
dataset
21 matches
tags:
language:fa, size_categories:1K<n<10K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, doi:10.57967/hf/1799, region:us, ocr, character-recognition, persian, historical, handwritten, nastaliq, character
tten Characters
## Dataset Description
- **Model**: https://huggingface.co/iarata/Few-Shot-PHCR
abdoelsayed / CORU
README.md
dataset
3 matches
tags:
task_categories:object-detection, task_categories:text-classification, task_categories:zero-shot-classification, language:en, language:ar, license:mit, size_categories:10K<n<100K, modality:image, modality:text, arxiv:2406.04493, region:us
s of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings in Egypt, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing.
## Dataset Overview
CORU is divided into three challenges:
suchut / thaitrocr-eval-dataset-beta
README.md
dataset
3 matches
tags:
language:th, language:en, license:cc-by-sa-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, OCR, dataset, evaluation, multilingual, handwritten
ting Optical Character Recognition (OCR) models across various domains. It includes images and textual data derived from various open-source websites.
This dataset aims to provide a comprehensive evaluation resource for researchers and developers working on OCR systems, particularly in Thai language processing.
### Data Fields
TrainingDataPro / ocr-trains-dataset
README.md
dataset
7 matches
tags:
task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, region:us, code, finance
ough optical character recognition (OCR) technology, which extracts text from images, in this case, **the train number**.
# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/train-numbers?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-trains-dataset)**
The dataset can be used to train machine learning models for extracting and analyzing text from train-related documents or images, to develop algorithms or models for real-time updates, or to build intelligent systems related to trains and transportation.
openthaigpt / thai-ocr-evaluation
README.md
dataset
3 matches
tags:
language:th, language:en, license:cc-by-sa-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, OCR, dataset, evaluation, multilingual, handwritten
ting Optical Character Recognition (OCR) models across various domains. It includes images and textual data derived from various open-source websites.
This dataset aims to provide a comprehensive evaluation resource for researchers and developers working on OCR systems, particularly in Thai language processing.
### Data Fields
SEACrowd / alice_thi
README.md
dataset
11 matches
tags:
language:tha, license:unknown, arxiv:2406.10118, region:us, optical-character-recognition
4045 character
images, which is split into Thai handwritten character dataset (THI-C68) for
14490 images and Thai handwritten digit dataset (THI-D10) for 9555 images. The
data was collected from 150 native writers aged from 20 to 23 years old. The
participants were allowed to write only the isolated Thai script on the form and
TheBritishLibrary / blbooks
README.md
dataset
15 matches
tags:
task_categories:text-generation, task_categories:fill-mask, task_categories:other, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:machine-generated, multilinguality:multilingual, source_datasets:original, language:de, language:en, language:es, language:fr, language:it, language:nl, license:cc0-1.0, size_categories:100K<n<1M, region:us, digital-humanities-research
- [Optical Character Recognition](#optical-character-recognition)
- [OCR word confidence](#ocr-word-confidence)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
ytu-ce-cosmos / turkce-kitap
README.md
dataset
3 matches
tags:
size_categories:100K<n<1M, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us
OCR (Optical Character Recognition) abilities of the [Turkish-LLaVA-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-LLaVA-v0.1) model. It was created by collecting **100,000** books entirely from Turkish sources. The primary goal of this dataset is to enhance the model's ability to detect and interpret any text present in images.
## Dataset Usage in Finetuning
This dataset played a crucial role in the finetuning process of the Turkish-LLaVA-v0.1 model. It was concatenated with [another dataset](#) (Soon..) to form a comprehensive training set that significantly refined the model's OCR capabilities. This enhancement process ensured that the model could accurately recognize and interpret Turkish text in various visual contexts.
nateraw / rendered-sst2
README.md
dataset
3 matches
tags:
task_categories:image-classification, task_ids:multi-class-image-classification, annotations_creators:machine-generated, language_creators:crowdsourced, multilinguality:monolingual, source_datasets:extended|sst2, language:en, license:unknown, size_categories:1K<n<10K, format:parquet, modality:image, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us
y on optical character recognition. This dataset was generated by rendering sentences in the Stanford Sentiment Treebank v2 dataset.
This dataset contains two classes (positive and negative) and is divided into three splits: a train split containing 6920 images (3610 positive and 3310 negative), a validation split containing 872 images (444 positive and 428 negative), and a test split containing 1821 images (909 positive and 912 negative).
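The split sizes quoted above can be checked for internal consistency. The following is a small sketch that only restates the numbers from the snippet; the dictionary layout and the helper `positive_fraction` are hypothetical names chosen for illustration, not part of the dataset's API.

```python
# Counts copied from the dataset description above.
splits = {
    "train": {"total": 6920, "positive": 3610, "negative": 3310},
    "validation": {"total": 872, "positive": 444, "negative": 428},
    "test": {"total": 1821, "positive": 909, "negative": 912},
}


def positive_fraction(counts):
    """Share of positive examples in one split."""
    return counts["positive"] / counts["total"]


for name, counts in splits.items():
    # Each split's class counts should add up to its stated size.
    assert counts["positive"] + counts["negative"] == counts["total"]
    print(f"{name}: {positive_fraction(counts):.3f} positive")

# Number of images across all three splits.
total_images = sum(counts["total"] for counts in splits.values())
```

All three splits are close to balanced, so plain accuracy is a reasonable first metric for this two-class task.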
TrainingDataPro / ocr-receipts-text-detection
README.md
dataset
8 matches
tags:
task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, region:us, code, finance
to **Optical Character Recognition (OCR)** and is useful for retail.
# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/ocr-receipts-text-detection?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-receipts-text-detection)**
Each image in the dataset is accompanied by bounding box annotations, indicating the precise locations of specific text segments on the receipts. The text segments are categorized into four classes: **item, store, date_time and total**.
TrainingDataPro / ocr-barcodes-detection
README.md
dataset
8 matches
tags:
task_categories:image-to-text, language:en, license:cc-by-nc-nd-4.0, region:us, code, finance
lly, Optical Character Recognition (**OCR**) has been performed on each bounding box to extract the barcode numbers.
# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-barcodes-detection)**
The dataset is particularly valuable for applications in *grocery retail, inventory management, supply chain optimization, and automated checkout systems*. It serves as a valuable resource for researchers, developers, and businesses working on barcode-related projects in the retail and logistics domains.
cpans / idcard_name
README.md
dataset
5 matches
tags:
license:apache-2.0, region:us, code
OCR (Optical Character Recognition) recognition, you can explore various open-source platforms and repositories such as GitHub, Model Zoo, or specific frameworks' model hubs like TensorFlow Hub or PyTorch Hub. ID OCR models are designed to extract text from identity cards, including personal details such as name, ID number, date of birth, and other relevant information. These models are trained on diverse datasets to accurately recognize and extract text from various ID card formats and designs.
<a href="https://github.com/CCCpan/Gebaini"> Click on me free access </a>
![image/png](https://cdn-uploads.huggingface.co/production/uploads/646ec72b66f7b97a94fe3aa5/ehrut2cuO2UzJ239Vh0QO.png)
learn2train / the_times_archive_1824
README.md
dataset
3 matches
tags:
language:en, license:cc0-1.0, size_categories:10K<n<100K, format:parquet, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, newspaper, history
from Optical Character Recognition software on digitised newspaper pages. This dataset includes the plain text from the OCR alongside some minimal metadata associated with the newspaper from which the text is derived.
This dataset can be used for:
historical research and digital humanities research
TrainingDataPro / ocr-generated-machine-readable-zone-mrz-text-detection
README.md
dataset
9 matches
tags:
task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, size_categories:n<1K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, code, legal
nd **Optical Character Recognition (OCR)** results.
# 💴 For Commercial Usage: To discuss your requirements, learn about the price, and buy the dataset, leave a request on **[TrainingData](https://trainingdata.pro/datasets/ocr-machine-readable-zone-mrz?utm_source=huggingface&utm_medium=cpc&utm_campaign=ocr-generated-machine-readable-zone-mrz-text-detection)**
This dataset is useful for developing applications related to *document verification, identity authentication, or automated data extraction from identification documents*.
SEACrowd / baybayin
README.md
dataset
14 matches
tags:
language:tgl, license:cc-by-4.0, arxiv:2406.10118, region:us, optical-character-recognition
ayin characters, Latin
characters, and 4 character symbols of Baybayin diacritics in MATLAB format. It
consisted of 17000 images for Baybayin (1000 per character), 18200 images for
Latin (700 per character), and 2000 images for Baybayin diacritics (500 per
symbol). Each character image is strictly center-fitted with a size 56x56
Salesforce / blip3-ocr-200m
README.md
dataset
4 matches
tags:
language:en, license:apache-2.0, size_categories:10M<n<100M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2408.08872, region:us, dataset, ocr, multimodal, vision, image-text-to-text
ates Optical Character Recognition (OCR) data during the pre-training phase of VLMs. This integration enhances vision-language alignment by providing detailed textual information alongside visual data.
- **Text-Rich Content**: The dataset focuses on text-rich images and includes OCR-specific annotations, enabling more accurate handling of content like documents, charts, and other text-heavy visuals.
- **Parquet Format**: The dataset is stored in Parquet format, facilitating efficient storage, processing, and retrieval of OCR metadata and images. This format is well-suited for handling large-scale datasets and can be easily integrated into data processing pipelines.
### Purpose: