Full-text search
143 results
AntiplagiatCompany / HWR200
README.md
dataset
11 matches
tags:
language:ru, license:apache-2.0, size_categories:10K<n<100K, doi:10.57967/hf/3226, region:us, ocr, htr, handwritten text recognition, near duplicate detection, reuse detection
t of handwritten texts images in Russian
This is a dataset of handwritten text images in Russian, created by 200 writers with
different handwriting and photographed in different environments.
shaoncsecu / BN-HTRd_Splitted
README.md
dataset
12 matches
tags:
task_categories:image-segmentation, task_categories:image-to-text, language:bn, license:cc-by-4.0, size_categories:10K<n<100K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, arxiv:2206.08977, doi:10.57967/hf/0546, region:us, Handwriting Recognition, Document Imaging, Annotation, Image Segmentation, Bengali Language, Word Spotting
ngla Handwritten Text Recognition (HTR)"
Link: https://data.mendeley.com/datasets/743k6dm543
### Description
We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus - which acted as ground truth texts for the handwritings. Our dataset contains a total of 786 full-page images collected from 150 different writers. With a staggering 108,147 instances of handwritten words, distributed over 13,867 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provided the bounding box annotations (YOLO format) for the segmentation of words/lines and the ground truth annotations for full-text, along with the segmented images and their positions. The contents of our dataset came from a diverse news category, and annotators of different ages, genders, and backgrounds, having variability in writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting classification tasks such as end-to-end document recognition, word-spotting, word/line segmentation, and so on.
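The YOLO-format bounding boxes mentioned above can be turned into pixel coordinates with a few lines of Python. This is a minimal sketch that assumes the common YOLO label layout (`class x_center y_center width height`, all normalized to the range 0 to 1); the exact file layout used by BN-HTRd is not shown in this snippet, and the helper names `yolo_to_pixel_box` and `parse_yolo_line` are hypothetical.

```python
def yolo_to_pixel_box(x_center, y_center, width, height, img_w, img_h):
    """Convert one normalized YOLO box to (left, top, right, bottom) in pixels."""
    left = (x_center - width / 2) * img_w
    top = (y_center - height / 2) * img_h
    right = (x_center + width / 2) * img_w
    bottom = (y_center + height / 2) * img_h
    # Round to whole pixels so the box can be used for cropping.
    return round(left), round(top), round(right), round(bottom)


def parse_yolo_line(line, img_w, img_h):
    """Parse one label line, assumed to be 'class x_center y_center width height'."""
    class_id, *coords = line.split()
    return int(class_id), yolo_to_pixel_box(*map(float, coords), img_w, img_h)


if __name__ == "__main__":
    # Hypothetical label line for a 1000x800 page image.
    print(parse_yolo_line("0 0.5 0.5 0.2 0.1", 1000, 800))
```

The returned tuple is in the order most image libraries expect for cropping a word or line region.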
Teklia / Belfort-line
README.md
dataset
7 matches
tags:
task_categories:image-to-text, language:fr, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, historical, handwritten
:** [Handwritten Text Recognition from Crowdsourced Annotations](https://doi.org/10.1145/3604951.3605517)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
Teklia / NorHand-v1-line
README.md
dataset
7 matches
tags:
task_categories:image-to-text, language:nb, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, historical, handwritten
for Handwritten Text Recognition in Norwegian](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_27)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
johnlockejrr / KHATT_v1.0_dataset
README.md
dataset
23 matches
tags:
task_categories:image-to-text, language:ar, license:mit, modality:image, region:us, atr, htr, ocr, historical, handwritten, arabic
FUPM Handwritten Arabic TexT) database is a database of unconstrained handwritten Arabic text written by 1000 different writers. This research database’s development was undertaken by a research group from KFUPM, Dhahran, Saudi Arabia headed by Professor Sabri Mahmoud in collaboration with Professor Fink from TU-Dortmund, Germany and Dr. Märgner from TU-Braunschweig, Germany.
The database includes 2000 similar-text paragraph images and 2000 unique-text paragraph images and their extracted text line images. The images are accompanied by manually verified ground truth and a Latin representation of the ground truth. The database can be used in various areas of handwriting recognition research, including but not limited to text recognition and writer identification. Interested readers can refer to papers [1] and [2] for more details on the database. Version 1.0 of the KHATT database is available free of charge (for academic and research purposes) to researchers.
Database Overview:
m-biriuchinskii / ICDAR2017-filtered-1800-1900
README.md
dataset
8 matches
tags:
task_categories:image-to-text, language:fr, size_categories:n<1K, format:parquet, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, OCR, NLP, TAL
n on Handwritten Text Recognition, focusing on monograph texts written between 1800 and 1900. It consists of a total of **957 documents**, divided into training, validation, and testing sets, and is designed for post-correction of OCR (Optical Character Recognition) text.
- **Total Documents**: 957
- **Training Set**: 765
- **Validation Set**: 95
TrainingDataPro / ocr-text-detection-in-the-documents
README.md
dataset
27 matches
tags:
task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, size_categories:n<1K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, code, legal, finance
OCR Text Detection in the Documents Object Detection dataset
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different *layouts, font sizes, and styles*. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
biglam / unsilence_voc
README.md
dataset
11 matches
tags:
task_categories:token-classification, task_ids:named-entity-recognition, language:nl, license:cc-by-4.0, size_categories:1K<n<10K, format:parquet, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, arxiv:2210.02194, region:us, lam
tity Recognition
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
CATMuS / medieval
README.md
dataset
21 matches
tags:
task_categories:image-to-text, language:fr, language:en, language:nl, language:it, language:es, language:ca, license:cc-by-4.0, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, optical-character-recognition, humanities, handwritten-text-recognition
Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting manuscript images into machine-readable formats,
enabling researchers and scholars to analyse vast collections efficiently.
Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks,
particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging.
We introduce the **Consistent Approaches to Transcribing Manuscripts (CATMuS)** dataset for medieval manuscripts,
CATMuS / modern
README.md
dataset
34 matches
tags:
task_categories:image-to-text, language:fr, language:de, language:en, language:it, language:es, language:oc, language:la, license:cc-by-4.0, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, optical-character-recognition, humanities, handwritten-text-recognition, modern documents, contemporary documents, good quality
Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting manuscript images into machine-readable formats, enabling researchers and scholars to analyze vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources, remains challenging.
We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for **m**odern and **c**ontemporary manuscripts (McCATMuS), which offers:
- a uniform framework for annotating modern and contemporary manuscripts;
c3rl / IIIT-INDIC-HW-WORDS-Tamil
README.md
dataset
11 matches
tags:
language:ta, language:en, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us
s of hand written words in Devanagari by various humans and the corresponding text of those images.
## Overview
The dataset, originally developed by the Centre for Visual Information Technology (CVIT) at IIIT Hyderabad, has been transformed into Parquet format to facilitate its use in modern machine learning workflows. This dataset primarily targets recognition of handwritten Tamil words and aims to advance research and development in handwritten text recognition technologies for Indic scripts.
c3rl / IIIT-INDIC-HW-WORDS-Hindi
README.md
dataset
11 matches
tags:
task_categories:image-to-text, task_categories:image-classification, task_categories:image-to-image, language:hi, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us
s of hand written words in Devanagari by various humans and the corresponding text of those images.
## Overview
The dataset, originally developed by the Centre for Visual Information Technology (CVIT) at IIIT Hyderabad, has been transformed into Parquet format to facilitate its use in modern machine learning workflows. This dataset primarily targets recognition of handwritten Hindi words and aims to advance research and development in handwritten text recognition technologies for Indic scripts.
Voxel51 / USPS
README.md
dataset
15 matches
tags:
task_categories:image-classification, language:en, license:unknown, size_categories:1K<n<10K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, library:fiftyone, region:us, fiftyone, image, image-classification
for Handwritten Text Recognition Research](https://ieeexplore.ieee.org/document/291440) and available at [https://paperswithcode.com/dataset/usps](https://paperswithcode.com/dataset/usps).
- **Language(s) (NLP):** en
- **License:** unknown
### Dataset Sources [optional]
flwrlabs / usps
README.md
dataset
3 matches
tags:
task_categories:image-classification, license:unknown, size_categories:1K<n<10K, format:parquet, modality:image, library:datasets, library:pandas, library:mlcroissant, library:polars, arxiv:2007.14390, region:us
for handwritten text recognition research},
journal={IEEE Transactions on pattern analysis and machine intelligence},
volume={16},
number={5},
pages={550--554},
CATMuS / medieval-segmentation
README.md
dataset
10 matches
tags:
task_categories:image-segmentation, task_categories:object-detection, task_categories:mask-generation, license:cc-by-4.0, size_categories:1K<n<10K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, layout-analysis, humanities, historical-documents
news text and headlines, social media posts, translated sentences, ...).
#### Data Collection and Processi
#### Who are the source data producers?
benhachem / KHATT
README.md
dataset
10 matches
tags:
task_categories:image-to-text, language:ar, size_categories:1K<n<10K, region:us, OCR, Optical Character Recognition, Arabic OCR, arabic, ocr, Textline images
FUPM Handwritten Arabic TexT (KHATT) database
### Version 1.0 (September 2012 Release)
The database contains handwritten Arabic text images and its ground-truth developed for
research in the area of Arabic handwritten text. It contains the line images and their ground-truth. It was used for the pilot experimentation as reported in the paper: <ins> S. A. Mahmoud, I. Ahmad, M. Alshayeb, W. G. Al-Khatib, M. T. Parvez, G. A. Fink, V. Margner, and H. EL Abed, “KHATT: Arabic Offline
Teklia / IAM-line
README.md
dataset
8 matches
tags:
task_categories:image-to-text, language:en, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, modern, handwritten
ting recognition](https://doi.org/10.1007/s100320200071)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
louisraedisch / AlphaNum
README.md
dataset
9 matches
tags:
task_categories:image-classification, language:en, license:mit, size_categories:100K<n<1M, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition
s of handwritten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to bolster Optical Character Recognition (OCR) research and development.
For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
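The color inversion described above can be sketched in plain Python. This assumes 8-bit grayscale pixels in the range 0 to 255 stored as nested lists; the snippet does not state the bit depth or the tooling actually used, and `invert_grayscale` is a hypothetical helper name.

```python
def invert_grayscale(pixels, max_value=255):
    """Return a color-inverted copy of a grayscale image.

    `pixels` is a list of rows, each row a list of integer intensities.
    `max_value` is the assumed white level (255 for 8-bit images).
    """
    return [[max_value - value for value in row] for row in pixels]


if __name__ == "__main__":
    # A tiny 2x2 example: light strokes on dark become dark strokes on light.
    sample = [[0, 255], [100, 155]]
    print(invert_grayscale(sample))
```

Applying the function twice returns the original image, which is a quick way to check that no pixel values were clipped.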
## Data Sources
iapp / thai_handwriting_dataset
README.md
dataset
13 matches
tags:
task_categories:text-to-image, task_categories:image-to-text, language:th, license:apache-2.0, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, handwriting-recognition, ocr
ting Recognition dataset (train-0000.parquet)
2. Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)
## Maintainer
kobkrit@iapp.co.th
agomberto / FrenchCensus-handwritten-texts
README.md
dataset
13 matches
tags:
task_categories:image-to-text, language:fr, license:mit, size_categories:1K<n<10K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, imate-to-text, trocr
ting text recognition. These datasets have been published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census at DAS 2022](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10).
The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.
We publish here only the lines. If you want the pages, go [here](https://zenodo.org/record/6581158). This dataset is made up of 4800 annotated lines extracted from 80 double pages of the 1926 Paris census.
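The figures above are consistent with each other, as a short check shows. This sketch assumes that a double page consists of two table pages of 30 rows each, which the description implies but does not state outright; the names below are hypothetical.

```python
ROWS_PER_TABLE_PAGE = 30   # stated: each census table contains 30 rows
PAGES_PER_DOUBLE_PAGE = 2  # assumption: one double page holds two table pages


def expected_line_count(double_pages):
    """Number of line images expected from a given number of double pages."""
    return double_pages * PAGES_PER_DOUBLE_PAGE * ROWS_PER_TABLE_PAGE


if __name__ == "__main__":
    # 80 double pages, as stated in the dataset description.
    print(expected_line_count(80))
```

Under that assumption, 80 double pages yield 80 × 2 × 30 = 4800 lines, matching the stated dataset size.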