Full-text search
84 results
AntiplagiatCompany / HWR200
README.md
dataset
10 matches
tags:
size_categories:10K<n<100K, language:ru, license:apache-2.0, ocr, htr, handwritten text recognition, near duplicate detection, reuse detection, croissant, region:us
t of handwritten texts images in Russian
This is a dataset of handwritten text images in Russian, created by 200 writers with
different handwriting and photographed in different environments.
shaoncsecu / BN-HTRd_Splitted
README.md
dataset
12 matches
tags:
task_categories:image-segmentation, task_categories:image-to-text, size_categories:10K<n<100K, language:bn, license:cc-by-4.0, Handwriting Recognition, Document Imaging, Annotation, Image Segmentation, Bengali Language, Word Spotting, croissant, arxiv:2206.08977, doi:10.57967/hf/0546, region:us
ngla Handwritten Text Recognition (HTR)"
Link: https://data.mendeley.com/datasets/743k6dm543
### Description
We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla script, comprising word-, line-, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus, which served as the ground-truth text for the handwriting. Our dataset contains a total of 786 full-page images collected from 150 different writers. With 108,147 instances of handwritten words, distributed over 13,867 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provide bounding box annotations (YOLO format) for the segmentation of words and lines, and ground-truth annotations for the full text, along with the segmented images and their positions. The contents of our dataset come from diverse news categories, and the annotators were of different ages, genders, and backgrounds, with varied writing styles. The BN-HTRd dataset can serve as a basis for various handwriting tasks such as end-to-end document recognition, word spotting, and word/line segmentation.
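The description above mentions bounding box annotations in YOLO format. As a minimal sketch, assuming the conventional YOLO text layout (one box per line, `class x_center y_center width height`, all coordinates normalized to the image size), a label line can be converted to pixel coordinates as follows. The function and type names are hypothetical, and the dataset's actual files may be organized differently.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    class_id: int
    left: float
    top: float
    right: float
    bottom: float


def yolo_line_to_pixel_box(line: str, image_width: int, image_height: int) -> PixelBox:
    """Convert one YOLO label line to pixel coordinates.

    Assumes the conventional layout: "class x_center y_center width height",
    with the four box values normalized to the range [0, 1].
    """
    class_id, x_center, y_center, width, height = line.split()
    # Scale the normalized values back to pixels.
    x_center_px = float(x_center) * image_width
    y_center_px = float(y_center) * image_height
    width_px = float(width) * image_width
    height_px = float(height) * image_height
    return PixelBox(
        class_id=int(class_id),
        left=x_center_px - width_px / 2,
        top=y_center_px - height_px / 2,
        right=x_center_px + width_px / 2,
        bottom=y_center_px + height_px / 2,
    )


# Illustrative label line, not taken from the dataset.
box = yolo_line_to_pixel_box("0 0.5 0.5 0.25 0.5", image_width=200, image_height=100)
print(box)
```

The resulting box can then be used to crop a word or line from the full-page image.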
Teklia / Belfort-line
README.md
dataset
6 matches
tags:
task_categories:image-to-text, language:fr, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us
:** [Handwritten Text Recognition from Crowdsourced Annotations](https://doi.org/10.1145/3604951.3605517)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
Teklia / NorHand-v1-line
README.md
dataset
6 matches
tags:
task_categories:image-to-text, language:nb, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us
for Handwritten Text Recognition in Norwegian](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_27)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
TrainingDataPro / ocr-text-detection-in-the-documents
README.md
dataset
27 matches
tags:
task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, code, legal, finance, croissant, region:us
OCR Text Detection in the Documents Object Detection dataset
The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.
The dataset contains a variety of document types, including different *layouts, font sizes, and styles*. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.
biglam / unsilence_voc
README.md
dataset
11 matches
tags:
task_categories:token-classification, task_ids:named-entity-recognition, size_categories:1K<n<10K, language:nl, license:cc-by-4.0, lam, croissant, arxiv:2210.02194, region:us
tity Recognition
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
CATMuS / medieval
README.md
dataset
20 matches
tags:
task_categories:image-to-text, size_categories:100K<n<1M, language:fr, language:en, language:nl, language:it, language:es, language:ca, license:cc-by-4.0, optical-character-recognition, humanities, handwritten-text-recognition, croissant, region:us
Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting manuscript images into machine-readable formats,
enabling researchers and scholars to analyse vast collections efficiently.
Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks,
particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains challenging.
We introduce the **Consistent Approaches to Transcribing Manuscripts (CATMuS)** dataset for medieval manuscripts,
benhachem / KHATT
README.md
dataset
10 matches
tags:
task_categories:image-to-text, size_categories:1K<n<10K, language:ar, OCR, Optical Character Recognition, Arabic OCR, arabic, ocr, Textline images, croissant, region:us
FUPM Handwritten Arabic TexT (KHATT) database
### Version 1.0 (September 2012 Release)
The database contains handwritten Arabic text images and their ground truth, developed for
research in the area of Arabic handwritten text. It contains the line images and their ground truth. It was used for the pilot experimentation reported in the paper: S. A. Mahmoud, I. Ahmad, M. Alshayeb, W. G. Al-Khatib, M. T. Parvez, G. A. Fink, V. Margner, and H. EL Abed, “KHATT: Arabic Offline
louisraedisch / AlphaNum
README.md
dataset
9 matches
tags:
task_categories:image-classification, size_categories:100K<n<1M, language:en, license:mit, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition, croissant, region:us
s of handwritten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to support Optical Character Recognition (OCR) research and development.
For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
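The color inversion mentioned above can be illustrated with a minimal sketch. It assumes 8-bit grayscale pixels (values 0 to 255) stored as nested lists; the function name is hypothetical, and the snippet does not show how the dataset authors actually performed the conversion.

```python
def invert_grayscale(image: list[list[int]]) -> list[list[int]]:
    """Invert an 8-bit grayscale image given as rows of pixel values.

    Assumption: each pixel is an integer in [0, 255], so the inverted
    value is simply 255 minus the original value.
    """
    return [[255 - pixel for pixel in row] for row in image]


# Tiny illustrative image: black becomes white and vice versa.
sample = [[0, 255], [100, 30]]
print(invert_grayscale(sample))
```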
## Data Sources
agomberto / FrenchCensus-handwritten-texts
README.md
dataset
13 matches
tags:
task_categories:image-to-text, size_categories:1K<n<10K, language:fr, license:mit, imate-to-text, trocr, croissant, region:us
ting text recognition. These datasets have been published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census at DAS 2022](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10).
The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.
We publish here only the lines. If you want the pages, go [here](https://zenodo.org/record/6581158). This dataset is made up of 4,800 annotated lines extracted from 80 double pages of the 1926 Paris census.
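The line count follows from the figures quoted above. A quick check, assuming that one double page holds two census pages (the snippet does not state this explicitly):

```python
ROWS_PER_PAGE = 30          # stated above: each page corresponds to 30 lines
DOUBLE_PAGES = 80           # stated above
PAGES_PER_DOUBLE_PAGE = 2   # assumption: a double page is two facing pages

expected_lines = DOUBLE_PAGES * PAGES_PER_DOUBLE_PAGE * ROWS_PER_PAGE
print(expected_lines)  # 4800
```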
Teklia / POPP-line
README.md
dataset
4 matches
tags:
task_categories:image-to-text, language:fr, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us
:** [Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early 20th Century Paris Census](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
Teklia / CASIA-HWDB2-line
README.md
dataset
6 matches
tags:
task_categories:image-to-text, language:zh, license:mit, atr, htr, ocr, modern, handwritten, croissant, region:us
line handwritten Chinese character recognition: Benchmarking on new databases](https://www.sciencedirect.com/science/article/abs/pii/S0031320312002919)
- **Point of Contact:** [TEKLIA](https://teklia.com)
## Dataset Summary
Teklia / Esposalles-line
README.md
dataset
3 matches
Teklia / RIMES-2011-line
README.md
dataset
7 matches
tags:
task_categories:image-to-text, language:fr, license:mit, atr, htr, ocr, modern, handwritten, croissant, region:us
ase (Recognition and Indexation of handwritten documents and faxes) was created to evaluate automatic recognition and indexing systems for handwritten letters.
The database was collected by asking volunteers to write handwritten letters in exchange for gift certificates. Volunteers were given a fictitious identity (same gender as the real one) and up to 5 scenarios. Each scenario was chosen from among 9 realistic topics: change of personal data (address, bank account), request for information, opening and closing (customer account), change of contract or order, complaint (poor quality of service...), payment difficulties (request for delay, tax exemption...), reminder, complaint with other circumstances and a target (administrations or service providers (telephone, electricity, bank, insurance)). The volunteers wrote a letter with this information in their own words. The layout was free and the only request was to use white paper and write legibly in black ink.
The campaign was a success, with more than 1,300 people contributing to the RIMES database by writing up to 5 letters. The resulting RIMES database contains 12,723 pages, corresponding to 5,605 letters of two to three pages each.
lowercaseonly / cghd
README.md
dataset
16 matches
tags:
task_categories:object-detection, task_categories:image-segmentation, size_categories:1K<n<10K, language:en, language:de, license:cc-by-3.0, croissant, region:us
for Handwritten Circuit Diagrams (GTDB-HD)
This repository contains images of hand-drawn electrical circuit diagrams, together with bounding box annotations for object detection and segmentation ground truth files. This dataset is intended for training models (e.g. neural networks) to extract electrical graphs from raster graphics.
## Structure
The folder structure is made up as follows:
OCR-Ethiopic / HHD-Ethiopic
README.md
dataset
10 matches
tags:
license:cc-by-4.0, doi:10.57967/hf/0691, region:us
mage recognition tasks. It contains a collection of historical handwritten manuscripts in the Ethiopic script. The dataset is intended to facilitate research and development for Ethiopic text-image recognition.
### Dataset Details
- __Size__: 79,684
- __Training Set__: 57,374
wanderkid / UniMER_Dataset
README.md
dataset
6 matches
tags:
size_categories:1M<n<10M, language:en, language:zh, license:apache-2.0, data, math, MER, arxiv:2404.15254, region:us
sion Recognition (MER). It encompasses the comprehensive UniMER-1M training set, featuring over one million instances that represent a diverse and intricate range of mathematical expressions, coupled with the UniMER Test Set, meticulously designed to benchmark MER models against real-world scenarios. The dataset details are as follows:
- **UniMER-1M Training Set:**
- Total Samples: 1,061,791 LaTeX-image pairs
- Composition: A balanced mix of concise and complex, extended formula expressions
ai-forever / MERA
README.md
dataset
90 matches
tags:
language:ru, license:mit, croissant, arxiv:2007.01852, arxiv:2112.00861, region:us
— a text fragment with a question from the game “What? Where? When?”;
- `topic` — a string containing the category of the question;
- `outputs` — a string containing the correct answer to the question.
#### *Data Instances*
armvectores / handwritten_text_detection
README.md
dataset
5 matches
tags:
task_categories:object-detection, size_categories:n<1K, language:hy, handwritten text, dictation, YOLOv8, croissant, region:us
# Handwritten text detection dataset
## Data domain
The blanks were provided by the youth organization "Armenian Club" ([telegram](https://t.me/armenian_club), [instagram](https://www.instagram.com/armenian.club?igsh=MTJjYTN0dTdjamtxMQ==)), Moscow, Russia.