Full Text Search - Hugging Face

Full-text search

143 results

AntiplagiatCompany / HWR200

README.md

dataset

11 matches

tags: language:ru, license:apache-2.0, size_categories:10K<n<100K, doi:10.57967/hf/3226, region:us, ocr, htr, handwritten text recognition, near duplicate detection, reuse detection

t of handwritten texts images in Russian

This is a dataset of handwritten text images in Russian, created by 200 writers with different handwriting and photographed in different environments.

shaoncsecu / BN-HTRd_Splitted

README.md

dataset

12 matches

tags: task_categories:image-segmentation, task_categories:image-to-text, language:bn, license:cc-by-4.0, size_categories:10K<n<100K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, arxiv:2206.08977, doi:10.57967/hf/0546, region:us, Handwriting Recognition, Document Imaging, Annotation, Image Segmentation, Bengali Language, Word Spotting

ngla Handwritten Text Recognition (HTR)"

Link: https://data.mendeley.com/datasets/743k6dm543

### Description

We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus - which acted as ground truth texts for the handwritings. Our dataset contains a total of 786 full-page images collected from 150 different writers. With a staggering 108,147 instances of handwritten words, distributed over 13,867 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provided the bounding box annotations (YOLO format) for the segmentation of words/lines and the ground truth annotations for full-text, along with the segmented images and their positions. The contents of our dataset came from a diverse news category, and annotators of different ages, genders, and backgrounds, having variability in writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting classification tasks such as end-to-end document recognition, word-spotting, word/line segmentation, and so on.
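The snippet above mentions bounding box annotations in YOLO format. As a minimal sketch, assuming the common YOLO text convention (one box per line as `class x_center y_center width height`, with coordinates normalized to the image size), a line can be converted to pixel coordinates as follows. The exact file layout used by BN-HTRd is not shown in this snippet, and the names `PixelBox` and `yolo_line_to_pixel_box` are hypothetical.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    class_id: int
    left: float
    top: float
    right: float
    bottom: float


def yolo_line_to_pixel_box(line: str, image_width: int, image_height: int) -> PixelBox:
    """Convert one YOLO-style annotation line to pixel coordinates.

    Assumes the common convention: "class x_center y_center width height",
    with the four box values normalized to the range [0, 1].
    """
    class_id, x_center, y_center, width, height = line.split()
    # Scale the normalized values back to pixels.
    x_c = float(x_center) * image_width
    y_c = float(y_center) * image_height
    w = float(width) * image_width
    h = float(height) * image_height
    # YOLO stores the box center, so shift by half the size to get the corners.
    return PixelBox(int(class_id), x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)


if __name__ == "__main__":
    print(yolo_line_to_pixel_box("0 0.5 0.5 0.25 0.5", 800, 400))
```

The resulting corner coordinates can be passed to an image library to crop individual words or lines from a page image.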

Teklia / Belfort-line

README.md

dataset

7 matches

tags: task_categories:image-to-text, language:fr, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, historical, handwritten

:** [Handwritten Text Recognition from Crowdsourced Annotations](https://doi.org/10.1145/3604951.3605517)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

Teklia / NorHand-v1-line

README.md

dataset

7 matches

tags: task_categories:image-to-text, language:nb, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, historical, handwritten

for Handwritten Text Recognition in Norwegian](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_27)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

johnlockejrr / KHATT_v1.0_dataset

README.md

dataset

23 matches

tags: task_categories:image-to-text, language:ar, license:mit, modality:image, region:us, atr, htr, ocr, historical, handwritten, arabic

FUPM Handwritten Arabic TexT) database is a database of unconstrained handwritten Arabic Text written by 1000 different writers. This research database’s development was undertaken by a research group from KFUPM, Dhahran, Saudi Arabia headed by Professor Sabri Mahmoud in collaboration with Professor Fink from TU-Dortmund, Germany and Dr. Märgner from TU-Braunschweig, Germany.

The database includes 2000 similar-text paragraph images and 2000 unique-text paragraph images and their extracted text line images. The images are accompanied by manually verified ground truth and a Latin representation of the ground truth. The database can be used in various areas of handwriting recognition research, including but not limited to text recognition and writer identification. Interested readers can refer to papers [1] and [2] for more details on the database. Version 1.0 of the KHATT database is available free of charge (for academic and research purposes) to researchers.

Database Overview:

m-biriuchinskii / ICDAR2017-filtered-1800-1900

README.md

dataset

8 matches

tags: task_categories:image-to-text, language:fr, size_categories:n<1K, format:parquet, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, OCR, NLP, TAL

n on Handwritten Text Recognition, focusing on monograph texts written between 1800 and 1900. It consists of a total of **957 documents**, divided into training, validation, and testing sets, and is designed for post-correction of OCR (Optical Character Recognition) text.

- **Total Documents**: 957

- **Training Set**: 765

- **Validation Set**: 95

TrainingDataPro / ocr-text-detection-in-the-documents

README.md

dataset

27 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, size_categories:n<1K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, code, legal, finance

OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different *layouts, font sizes, and styles*. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

biglam / unsilence_voc

README.md

dataset

11 matches

tags: task_categories:token-classification, task_ids:named-entity-recognition, language:nl, license:cc-by-4.0, size_categories:1K<n<10K, format:parquet, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, arxiv:2210.02194, region:us, lam

tity Recognition

## Table of Contents

- [Table of Contents](#table-of-contents)

- [Dataset Description](#dataset-description)

CATMuS / medieval

README.md

dataset

21 matches

tags: task_categories:image-to-text, language:fr, language:en, language:nl, language:it, language:es, language:ca, license:cc-by-4.0, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, optical-character-recognition, humanities, handwritten-text-recognition

Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting manuscript images into machine-readable formats,

enabling researchers and scholars to analyse vast collections efficiently.

Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks,

particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging.

We introduce the **Consistent Approaches to Transcribing Manuscripts (CATMuS)** dataset for medieval manuscripts,

CATMuS / modern

README.md

dataset

34 matches

tags: task_categories:image-to-text, language:fr, language:de, language:en, language:it, language:es, language:oc, language:la, license:cc-by-4.0, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, optical-character-recognition, humanities, handwritten-text-recognition, modern documents, contemporary documents, good quality

Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting manuscript images into machine-readable formats, enabling researchers and scholars to analyze vast collections efficiently. Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks, particularly for complex and heterogeneous historical sources, remains nonetheless challenging.

We introduce the Consistent Approaches to Transcribing Manuscripts (CATMuS) dataset for **m**odern and **c**ontemporary manuscripts (McCATMuS), which offers:

- a uniform framework for annotating modern and contemporary manuscripts;

c3rl / IIIT-INDIC-HW-WORDS-Tamil

README.md

dataset

11 matches

tags: language:ta, language:en, size_categories:100K<n<1M, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us

s of hand written words in Devanagari by various humans and the corresponding text of those images.

The dataset, originally developed by the Centre for Visual Information Technology (CVIT) at IIIT Hyderabad, has been transformed into Parquet format to facilitate its use in modern machine learning workflows. This dataset primarily targets recognition of handwritten Tamil words and aims to advance research and development in handwritten text recognition technologies for Indic scripts.

c3rl / IIIT-INDIC-HW-WORDS-Hindi

README.md

dataset

11 matches

tags: task_categories:image-to-text, task_categories:image-classification, task_categories:image-to-image, language:hi, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us

s of hand written words in Devanagari by various humans and the corresponding text of those images.

The dataset, originally developed by the Centre for Visual Information Technology (CVIT) at IIIT Hyderabad, has been transformed into Parquet format to facilitate its use in modern machine learning workflows. This dataset primarily targets recognition of handwritten Hindi words and aims to advance research and development in handwritten text recognition technologies for Indic scripts.

Voxel51 / USPS

README.md

dataset

15 matches

tags: task_categories:image-classification, language:en, license:unknown, size_categories:1K<n<10K, format:imagefolder, modality:image, library:datasets, library:mlcroissant, library:fiftyone, region:us, fiftyone, image, image-classification

for Handwritten Text Recognition Research](https://ieeexplore.ieee.org/document/291440) and available at [https://paperswithcode.com/dataset/usps](https://paperswithcode.com/dataset/usps).

- **Language(s) (NLP):** en

- **License:** unknown

### Dataset Sources [optional]

flwrlabs / usps

README.md

dataset

3 matches

tags: task_categories:image-classification, license:unknown, size_categories:1K<n<10K, format:parquet, modality:image, library:datasets, library:pandas, library:mlcroissant, library:polars, arxiv:2007.14390, region:us

for handwritten text recognition research},

journal={IEEE Transactions on pattern analysis and machine intelligence},

pages={550--554},

CATMuS / medieval-segmentation

README.md

dataset

10 matches

tags: task_categories:image-segmentation, task_categories:object-detection, task_categories:mask-generation, license:cc-by-4.0, size_categories:1K<n<10K, format:imagefolder, modality:image, modality:text, library:datasets, library:mlcroissant, region:us, layout-analysis, humanities, historical-documents

news text and headlines, social media posts, translated sentences, ...).

#### Data Collection and Processi

#### Who are the source data producers?

benhachem / KHATT

README.md

dataset

10 matches

tags: task_categories:image-to-text, language:ar, size_categories:1K<n<10K, region:us, OCR, Optical Character Recognition , Arabic OCR, arabic , ocr, Textline images

FUPM Handwritten Arabic TexT (KHATT) database

### Version 1.0 (September 2012 Release)

The database contains handwritten Arabic text images and their ground truth, developed for research in the area of Arabic handwritten text. It contains the line images and their ground truth. It was used for the pilot experimentation reported in the paper: S. A. Mahmoud, I. Ahmad, M. Alshayeb, W. G. Al-Khatib, M. T. Parvez, G. A. Fink, V. Margner, and H. EL Abed, “KHATT: Arabic Offline

Teklia / IAM-line

README.md

dataset

8 matches

tags: task_categories:image-to-text, language:en, license:mit, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, atr, htr, ocr, modern, handwritten

ting recognition](https://doi.org/10.1007/s100320200071)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

louisraedisch / AlphaNum

README.md

dataset

9 matches

tags: task_categories:image-classification, language:en, license:mit, size_categories:100K<n<1M, format:imagefolder, modality:image, library:datasets, library:mlcroissant, region:us, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition

s of handwritten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to bolster Optical Character Recognition (OCR) research and development.

For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
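Color inversion of a grayscale image can be sketched as follows. This is only an illustration of the general operation, assuming 8-bit pixel values in the range 0 to 255; the actual preprocessing code used for AlphaNum is not shown in this snippet, and `invert_grayscale` is a hypothetical helper.

```python
def invert_grayscale(image: list[list[int]], max_value: int = 255) -> list[list[int]]:
    """Return a color-inverted copy of a grayscale image.

    The image is a list of rows, each row a list of pixel intensities.
    Assumes intensities lie in [0, max_value] (8-bit by default).
    """
    # Each pixel is mirrored around the intensity range: dark becomes light.
    return [[max_value - pixel for pixel in row] for row in image]


if __name__ == "__main__":
    # A tiny 2x2 example: white-on-black becomes black-on-white.
    print(invert_grayscale([[0, 255], [100, 30]]))
```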

## Data Sources

iapp / thai_handwriting_dataset

README.md

dataset

13 matches

tags: task_categories:text-to-image, task_categories:image-to-text, language:th, license:apache-2.0, size_categories:10K<n<100K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, handwriting-recognition, ocr

ting Recognition dataset (train-0000.parquet)

2. Thai Handwritten Free Dataset by Wang (train-0001.parquet onwards)

kobkrit@iapp.co.th

agomberto / FrenchCensus-handwritten-texts

README.md

dataset

13 matches

tags: task_categories:image-to-text, language:fr, license:mit, size_categories:1K<n<10K, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, region:us, imate-to-text, trocr

ting text recognition. These datasets have been published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census at DAS 2022](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10).

The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.

We publish here only the lines. If you want the pages, go [here](https://zenodo.org/record/6581158). This dataset is made up of 4800 annotated lines extracted from 80 double pages of the 1926 Paris census.
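The line count follows from the figures above under one assumption that the snippet does not state explicitly: that each double page contributes two single pages of 30 rows each. A short sketch of that arithmetic, with hypothetical constant and function names:

```python
ROWS_PER_PAGE = 30          # stated in the snippet: each page corresponds to 30 lines
PAGES_PER_DOUBLE_PAGE = 2   # assumption: a double page holds two single pages


def expected_line_count(double_pages: int) -> int:
    """Number of text lines expected from a given number of double pages."""
    return double_pages * PAGES_PER_DOUBLE_PAGE * ROWS_PER_PAGE


if __name__ == "__main__":
    # 80 double pages, as stated in the snippet.
    print(expected_line_count(80))
```

Under that assumption, 80 double pages give 80 × 2 × 30 = 4800 lines, which matches the stated dataset size.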