Full-text search
699 results
AntiplagiatCompany / HWR200
README.md
dataset
11 matches
tags: language:ru, license:apache-2.0, size_categories:10K<n<100K, doi:10.57967/hf/3226, region:us, ocr, htr, handwritten text recognition, near duplicate detection, reuse detection
19
# HWR200: New open access dataset of handwritten texts images in Russian
20
21
This is a dataset of handwritten text images in Russian created by 200 writers with
⋯
36
* Total number of images with text is 30030
37
* Number of writers is 200
38
* Every handwritten text is captured in three ways: scanned, photographed in poor light, and photographed in good light
⋯
56
id: <original text file name>
⋯
61
id: <original text file name>
⋯
84
title={HWR200: New open access dataset of handwritten texts images in Russian},
UniqueData / ocr-text-detection-in-the-documents
README.md
dataset
30 matches
tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, OCR, text, NLP, deep learning, document text recognition
18
# OCR Text Detection in the Documents Object Detection dataset
19
...n of images that have been annotated with the location of text in the document. The dataset is specifically curated for ...
21
...ng box annotations that outline the exact location of the text within the document.
⋯
25
The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for te...
⋯
36
...ting the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are pr...
⋯
39
- **"Text Title"** - corresponds to titles, the box is **red**
40
- **"Text Paragraph"** - corresponds to paragraphs of text, the box is **blue**
41
- **"Table"** - corresponds to the table, the box is **green**
42
- **"Handwritten"** - corresponds to handwritten text, the box is **purple**
⋯
47
# Text Detection in the Documents might be made in accordance with your requirements.
Mobiusi / Pharmacy-Prescription-Text-Extraction-Dataset
README.md
dataset
14 matches
tags: task_categories:image-text-to-text, language:en, license:cc-by-nc-sa-4.0, size_categories:1B<n<10B, region:us, Text Recognition, Image Classification, Natural Language Processing, Healthcare Informatization, Intelligent Drug Management, Electronic Prescription Systems
19
# Pharmacy Prescription Text Extraction Dataset
21
...owever, current solutions have significant limitations in text extraction accuracy and handwritten text recognition capa...
caveman273 / aida-handwritten
README.md
dataset
12 matches
tags: task_categories:image-to-text, language:fi, language:sv, language:en, license:mit, size_categories:1K<n<10K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, HTR, OCR, handwritten
34
# Handwritten OCR training data from AIDA-project
⋯
38
This dataset contains handwritten textline images and their transcriptions from the AIDA-project. It is a subset of the ...
⋯
42
The dataset was created for handwritten text recognition (HTR).
⋯
54
- `text`: the transcription
⋯
62
| `text` | string | Ground-truth transcription |
caveman273 / aida-ship-info
README.md
dataset
12 matches
tags: task_categories:image-to-text, language:fi, language:sv, language:en, license:mit, size_categories:1K<n<10K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, HTR, OCR, handwritten
34
# Handwritten OCR training data from AIDA-project (Ship Registry)
⋯
38
This dataset contains handwritten textline images and their transcriptions from the AIDA-project. It is a subset of the ...
⋯
42
The dataset was created for handwritten text recognition (HTR).
⋯
54
- `text`: the transcription
⋯
62
| `text` | string | Ground-truth transcription |
shaoncsecu / BN-HTRd_Splitted
README.md
dataset
12 matches
tags: task_categories:image-segmentation, task_categories:image-to-text, language:bn, license:cc-by-4.0, size_categories:10K<n<100K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, arxiv:2206.08977, doi:10.57967/hf/0546, region:us, Handwriting Recognition, Document Imaging, Annotation, Image Segmentation, Bengali Language, Word Spotting
21
...A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR)"
22
Link: https://data.mendeley.com/datasets/743k6dm543
23
24
### Description
25
We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words...
⋯
47
...A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR) and Line Segmentation},
48
publisher = {arXiv},
Teklia / Belfort-line
README.md
dataset
7 matches
tags: task_categories:image-to-text, language:fr, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, historical, handwritten
45
- **Paper:** [Handwritten Text Recognition from Crowdsourced Annotations](https://doi.org/10.1145/3604951.3605517)
⋯
51
Text lines were extracted using an automatic model and may contain segmentation errors. The transcriptions were obtained...
⋯
66
'text': 'les intérêts des 30000 francs jusqu'au moment de la'
⋯
74
- `text`: the label transcription of the image.
johnlockejrr / KHATT_v1.0_dataset
README.md
dataset
23 matches
tags: task_categories:image-to-text, language:ar, license:mit, modality:image, region:us, atr, htr, ocr, historical, handwritten, arabic
48
KHATT (KFUPM Handwritten Arabic TexT) database is a database of unconstrained handwritten Arabic Text written by 1000 di...
50
...and 2000 unique-text paragraph images and their extracted text line images. The images are accompanied with manually ver...
⋯
58
... paragraph images and their segmented line images (source text from different topics like arts, education, health, natur...
59
- 2000 paragraph images containing similar text, each covering all Arabic characters and shapes and their segmented line...
⋯
63
... and binarization and noise removal techniques beside handwritten text recognition.
⋯
67
...ärgner, Gernot A. Fink, KHATT: an open Arabic offline handwritten text database , Pattern Recognition.[http://www.scienc...
69
...Volker Margner, Haikal El Abed, KHATT: Arabic offline handwritten text database, 13th International Conference on Fronti...
⋯
82
'text': 'رفاظ قيار يؤل نب فوؤر هبحصب ماغرض رفظم حون بهذ'
Teklia / NorHand-v1-line
README.md
dataset
7 matches
tags: task_categories:image-to-text, language:nb, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, historical, handwritten
45
...Comprehensive Comparison of Open-Source Libraries for Handwritten Text Recognition in Norwegian](https://link.springer.c...
⋯
50
The NorHand v1 dataset comprises line images and text from Norwegian letters and diaries of the 19th and early 20th centuries.
⋯
65
'text': 'fredag 1923'
⋯
73
- `text`: the label transcription of the image.
rustensai / russian-handwriting-ocr
README.md
dataset
7 matches
tags: task_categories:image-to-text, language:ru, size_categories:10K<n<100K, format:parquet, format:optimized-parquet, modality:image, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, region:us, handwriting-recognition, russian, ocr, qwen3-vl
15
# Russian Handwritten Text Recognition Dataset
⋯
48
"content": [
49
{"type": "text", "text": "Распознай текст с изображения."},
50
{"type": "image"}
⋯
55
"content": [
56
{"type": "text", "text": "<transcribed_text>"}
57
]
Limerencii / russian-handwriting-ocr
README.md
dataset
7 matches
tags: task_categories:image-to-text, language:ru, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, handwriting-recognition, russian, ocr, qwen3-vl
15
# Russian Handwritten Text Recognition Dataset
⋯
48
"content": [
49
{"type": "text", "text": "Распознай текст с изображения."},
50
{"type": "image"}
⋯
55
"content": [
56
{"type": "text", "text": "<transcribed_text>"}
57
]
UkrainianCatholicUniversity / rukopys
README.md
dataset
49 matches
tags: task_categories:object-detection, task_categories:image-to-text, language:uk, license:cc-by-nc-sa-4.0, size_categories:1K<n<10K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, handwriting-recognition, htr, ocr, bounding-box, ukrainian, document-analysis, cyrillic
51
# RUKOPYS: Ukrainian Handwritten Text Recognition Dataset
53
...) is the first large-scale open dataset for Ukrainian handwritten text recognition (HTR). It spans over a century of Ukr...
⋯
57
> **Competition:** RUKOPYS powers the [Handwritten to Data](https://www.kaggle.com/competitions/handwritten-to-data) cha...
⋯
65
It combines four sources that differ across every dimension that makes handwriting recognition hard:
⋯
78
...ndwriting dataset, many pages naturally contain **printed text** alongside handwriting — preprinted headers, textbook ex...
⋯
111
...ion | `dictation` | 2020–2025 | 509 | Phone photos of handwritten Ukrainian National Dictation. One canonical text per y...
113
...–2025 | 246 | Scanned student exam work from 5 faculties: text, math formulas, chemistry, tables. |
swswswswsw / rukopys
README.md
dataset
32 matches
tags: task_categories:object-detection, task_categories:image-to-text, language:uk, license:cc-by-nc-sa-4.0, size_categories:1K<n<10K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, handwriting-recognition, htr, ocr, bounding-box, ukrainian, document-analysis, cyrillic
51
# RUKOPYS: Ukrainian Handwritten Text Recognition Dataset
53
...) is the first large-scale open dataset for Ukrainian handwritten text recognition (HTR). It spans over a century of Ukr...
⋯
57
> **Competition:** RUKOPYS powers the [Handwritten to Data](https://www.kaggle.com/competitions/handwritten-to-data) cha...
⋯
65
It combines four sources that differ across every dimension that makes handwriting recognition hard:
⋯
97
...ion | `dictation` | 2020–2025 | 456 | Phone photos of handwritten Ukrainian National Dictation. One canonical text per y...
99
...–2025 | 246 | Scanned student exam work from 5 faculties: text, math formulas, chemistry, tables. |
⋯
109
metadata.jsonl # bbox + type + language + legibility + text
Virajbhanage / rukopys
README.md
dataset
35 matches
tags: task_categories:object-detection, task_categories:image-to-text, language:uk, license:cc-by-nc-sa-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, handwriting-recognition, htr, ocr, bounding-box, ukrainian, document-analysis, cyrillic
51
# RUKOPYS: Ukrainian Handwritten Text Recognition Dataset
53
...) is the first large-scale open dataset for Ukrainian handwritten text recognition (HTR). It spans over a century of Ukr...
⋯
57
> **Competition:** RUKOPYS powers the [Handwritten to Data](https://www.kaggle.com/competitions/handwritten-to-data) cha...
⋯
65
It combines four sources that differ across every dimension that makes handwriting recognition hard:
⋯
78
...ndwriting dataset, many pages naturally contain **printed text** alongside handwriting — preprinted headers, textbook ex...
⋯
111
...ion | `dictation` | 2020–2025 | 509 | Phone photos of handwritten Ukrainian National Dictation. One canonical text per y...
113
...–2025 | 246 | Scanned student exam work from 5 faculties: text, math formulas, chemistry, tables. |
Iltanix / rukopys
README.md
dataset
44 matches
tags: task_categories:object-detection, task_categories:image-to-text, language:uk, license:cc-by-nc-sa-4.0, size_categories:n<1K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, handwriting-recognition, htr, ocr, bounding-box, ukrainian, document-analysis, cyrillic
51
# RUKOPYS: Ukrainian Handwritten Text Recognition Dataset
53
...) is the first large-scale open dataset for Ukrainian handwritten text recognition (HTR). It spans over a century of Ukr...
⋯
57
> **Competition:** RUKOPYS powers the [Handwritten to Data](https://www.kaggle.com/competitions/handwritten-to-data) cha...
⋯
65
It combines four sources that differ across every dimension that makes handwriting recognition hard:
⋯
78
...ndwriting dataset, many pages naturally contain **printed text** alongside handwriting — preprinted headers, textbook ex...
⋯
111
...ion | `dictation` | 2020–2025 | 509 | Phone photos of handwritten Ukrainian National Dictation. One canonical text per y...
113
...–2025 | 246 | Scanned student exam work from 5 faculties: text, math formulas, chemistry, tables. |
BSC-CSSH / AMSMB-line-transcription
README.md
dataset
23 matches
tags: task_categories:image-to-text, annotations_creators:expert-generated, language:ca, language:la, license:cc-by-sa-4.0, size_categories:1K<n<10K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, handwritten-text-recognition, htr, transcription
51
Dataset for line-level handwritten text recognition on medieval historical manuscripts, consisting of 3,369 lines (image...
⋯
55
This is a dataset for line-level handwritten text recognition of medieval manuscripts, focusing on notarial charters wri...
⋯
70
- `text` (`string`): the transcription of the line.
⋯
94
...odels, and developing tools for processing and extracting text from historical documents, making their content accessibl...
⋯
128
title={{AMSMB HTR: A Dataset for Handwritten Text Recognition in Medieval Notarial Charters Written on Parchment (1208...
LocalDoc / azerbaijani-htr-benchmark
README.md
dataset
12 matches
tags: task_categories:image-to-text, language:az, license:cc-by-4.0, size_categories:n<1K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, htr, handwritten-text-recognition, azerbaijani, ocr, benchmark
25
# Azerbaijani Handwritten OCR Benchmark
26
27
A manually annotated benchmark for handwritten text recognition (HTR) on Azerbaijani Latin script. Real-world scanned pa...
⋯
31
## `lines` — Line-level recognition
32
33
Cropped images of single text lines paired with their transcription. Rotated regions are deskewed (warped to be axis-ali...
⋯
44
| `text` | string | Ground truth transcription |
⋯
81
text = sample["lines"]["text"][i]
82
print(f" Line {line_id}: bbox={bbox}, text={text}")
⋯
91
1. Real Azerbaijani handwritten pages were collected
LocalDoc / azerbaijani-htr-synthetic
README.md
dataset
18 matches
tags: task_categories:image-to-text, language:az, license:cc-by-4.0, size_categories:1M<n<10M, format:parquet, format:optimized-parquet, modality:image, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, region:us, ocr, htr, handwritten-text-recognition, azerbaijani, synthetic
49
# Azerbaijani Synthetic Handwritten OCR Dataset
50
51
A large-scale synthetic dataset for training handwritten text recognition (HTR) models on Azerbaijani Latin script. Gene...
⋯
71
| `text` | string | Ground truth transcription (NFC-normalized) |
⋯
77
The pipeline takes plain text from Azerbaijani corpora, renders each line using a randomly selected handwriting font, ap...
79
### Step 1 — Text corpus assembly
80
81
Two text sources were combined to balance natural prose with document-specific patterns rarely seen in standard corpora:
⋯
85
... producing realistic short strings that are common in handwritten documents but absent from prose corpora:
⋯
117
...mations.** Instead of rendering the full line as a single text element, each word is rendered separately with:
m-biriuchinskii / ICDAR2017-filtered-1800-1900
README.md
dataset
8 matches
tags: task_categories:image-to-text, language:fr, size_categories:n<1K, format:parquet, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, OCR, NLP, TAL
55
... a filtered version of the *ICDAR2017* Competition on Handwritten Text Recognition, focusing on monograph texts written ...
⋯
63
...rection, specifically addressing the challenges of French text of 19th century.
⋯
87
- **1st line**: "[OCR_toInput] " => Raw OCRed text to be denoised.
⋯
92
For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.
⋯
98
...iginal dataset source: [ICDAR2017 Competition on Post-OCR Text Correction](http://l3i.univ-larochelle.fr/ICDAR2017PostOC...
m-biriuchinskii / ICDAR2017-filtered-1800-1900-6
README.md
dataset
8 matches
tags: language:fr, size_categories:1K<n<10K, format:parquet, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us
59
... a filtered version of the *ICDAR2017* Competition on Handwritten Text Recognition, focusing on monograph texts written ...
⋯
66
...rection, specifically addressing the challenges of French text of 19th century.
⋯
70
...ith detailed information about OCR (Optical Character Recognition) outputs and their corresponding ground truths (GT). B...
⋯
76
| **Region_OCR** | `string` | OCR-recognized region of text. |
⋯
96
...iginal dataset source: [ICDAR2017 Competition on Post-OCR Text Correction](http://l3i.univ-larochelle.fr/ICDAR2017PostOC...