Full Text Search - Hugging Face

Full-text search


84 results

AntiplagiatCompany / HWR200

README.md

dataset

10 matches

tags: size_categories:10K<n<100K, language:ru, license:apache-2.0, ocr, htr, handwritten text recognition, near duplicate detection, reuse detection, croissant, region:us

t of handwritten texts images in Russian

This is a dataset of handwritten text images in Russian, created by 200 writers with

different handwriting and photographed in different environments.

shaoncsecu / BN-HTRd_Splitted

README.md

dataset

12 matches

tags: task_categories:image-segmentation, task_categories:image-to-text, size_categories:10K<n<100K, language:bn, license:cc-by-4.0, Handwriting Recognition, Document Imaging, Annotation, Image Segmentation, Bengali Language, Word Spotting, croissant, arxiv:2206.08977, doi:10.57967/hf/0546, region:us

ngla Handwritten Text Recognition (HTR)"

Link: https://data.mendeley.com/datasets/743k6dm543

### Description

We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus, which acted as ground truth texts for the handwriting. Our dataset contains a total of 786 full-page images collected from 150 different writers. With a staggering 108,147 instances of handwritten words, distributed over 13,867 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provide bounding box annotations (YOLO format) for the segmentation of words/lines and the ground truth annotations for full-text, along with the segmented images and their positions. The contents of our dataset come from diverse news categories, and the annotators are of different ages, genders, and backgrounds, which gives variability in writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting classification tasks such as end-to-end document recognition, word-spotting, word/line segmentation, and so on.
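The description mentions bounding boxes in YOLO format. As a minimal sketch, assuming the common YOLO label layout of `class x_center y_center width height` with coordinates normalized to the image size (the dataset's actual files may differ), one label line could be converted to pixel coordinates like this. The function name `yolo_to_pixel_box` is hypothetical.

```python
def yolo_to_pixel_box(label_line, image_width, image_height):
    """Convert one YOLO label line to (class_id, left, top, right, bottom) in pixels.

    Assumes the usual normalized layout: class x_center y_center width height.
    """
    class_id, x_center, y_center, width, height = label_line.split()
    # Scale the normalized values back to pixel units.
    x_center = float(x_center) * image_width
    y_center = float(y_center) * image_height
    width = float(width) * image_width
    height = float(height) * image_height
    # YOLO stores the box centre, so shift by half the size to get the corners.
    left = round(x_center - width / 2)
    top = round(y_center - height / 2)
    right = round(x_center + width / 2)
    bottom = round(y_center + height / 2)
    return int(class_id), left, top, right, bottom


# A box centred in a 1000x800 image, a quarter as wide and an eighth as tall.
box = yolo_to_pixel_box("0 0.5 0.5 0.25 0.125", 1000, 800)
print(box)
```

Corner coordinates are often more convenient than centre coordinates when cropping word or line images from a page.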

Teklia / Belfort-line

README.md

dataset

6 matches

tags: task_categories:image-to-text, language:fr, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us

:** [Handwritten Text Recognition from Crowdsourced Annotations](https://doi.org/10.1145/3604951.3605517)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

Teklia / NorHand-v1-line

README.md

dataset

6 matches

tags: task_categories:image-to-text, language:nb, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us

for Handwritten Text Recognition in Norwegian](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_27)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

TrainingDataPro / ocr-text-detection-in-the-documents

README.md

dataset

27 matches

tags: task_categories:image-to-text, task_categories:object-detection, language:en, license:cc-by-nc-nd-4.0, code, legal, finance, croissant, region:us

OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different *layouts, font sizes, and styles*. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

biglam / unsilence_voc

README.md

dataset

11 matches

tags: task_categories:token-classification, task_ids:named-entity-recognition, size_categories:1K<n<10K, language:nl, license:cc-by-4.0, lam, croissant, arxiv:2210.02194, region:us

tity Recognition

## Table of Contents

- [Table of Contents](#table-of-contents)

- [Dataset Description](#dataset-description)

CATMuS / medieval

README.md

dataset

20 matches

tags: task_categories:image-to-text, size_categories:100K<n<1M, language:fr, language:en, language:nl, language:it, language:es, language:ca, license:cc-by-4.0, optical-character-recognition, humanities, handwritten-text-recognition, croissant, region:us

Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting manuscripts images into machine-readable formats,

enabling researchers and scholars to analyse vast collections efficiently.

Despite significant technological progress, establishing consistent ground truth across projects for HTR tasks,

particularly for complex and heterogeneous historical sources like medieval manuscripts in Latin scripts (8th-15th century CE), remains nonetheless challenging.

We introduce the **Consistent Approaches to Transcribing Manuscripts (CATMuS)** dataset for medieval manuscripts,

benhachem / KHATT

README.md

dataset

10 matches

tags: task_categories:image-to-text, size_categories:1K<n<10K, language:ar, OCR, Optical Character Recognition, Arabic OCR, arabic, ocr, Textline images, croissant, region:us

FUPM Handwritten Arabic TexT (KHATT) database

### Version 1.0 (September 2012 Release)

The database contains handwritten Arabic text images and their ground truth, developed for

research in the area of Arabic handwritten text. It contains the line images and their ground truth. It was used for the pilot experimentation as reported in the paper: S. A. Mahmoud, I. Ahmad, M. Alshayeb, W. G. Al-Khatib, M. T. Parvez, G. A. Fink, V. Margner, and H. EL Abed, “KHATT: Arabic Offline

Teklia / IAM-line

README.md

dataset

7 matches

tags: task_categories:image-to-text, language:en, license:mit, atr, htr, ocr, modern, handwritten, croissant, region:us

ting recognition](https://doi.org/10.1007/s100320200071)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

louisraedisch / AlphaNum

README.md

dataset

9 matches

tags: task_categories:image-classification, size_categories:100K<n<1M, language:en, license:mit, OCR, Handwriting, Character Recognition, Grayscale Images, ASCII Labels, Optical Character Recognition, croissant, region:us

s of handwritten characters and numerals as well as special characters, each sized 24x24 pixels. This dataset is designed to bolster Optical Character Recognition (OCR) research and development.

For consistency, images extracted from the MNIST dataset have been color-inverted to match the grayscale aesthetics of the AlphaNum dataset.
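The source does not say how the colour inversion was carried out. A minimal sketch, assuming 8-bit grayscale pixels (values 0 to 255) stored as a list of rows, is to subtract each pixel from 255. The function name `invert_grayscale` is hypothetical.

```python
def invert_grayscale(image):
    """Invert an 8-bit grayscale image given as a list of rows of ints in 0..255."""
    # White (255) becomes black (0) and vice versa; mid-grays are mirrored.
    return [[255 - pixel for pixel in row] for row in image]


# A tiny 2x2 example: a black and a white pixel, and two grays.
sample = [[0, 255], [100, 30]]
print(invert_grayscale(sample))
```

Applying the function twice returns the original image, which is a quick way to sanity-check the operation.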

## Data Sources

agomberto / FrenchCensus-handwritten-texts

README.md

dataset

13 matches

tags: task_categories:image-to-text, size_categories:1K<n<10K, language:fr, license:mit, imate-to-text, trocr, croissant, region:us

ting text recognition. These datasets have been published in [Recognition and information extraction in historical handwritten tables: toward understanding early 20th century Paris census at DAS 2022](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10).

The 3 datasets are called “Generic dataset”, “Belleville”, and “Chaussée d’Antin” and contain lines made from the extracted rows of census tables from 1926. Each table in the Paris census contains 30 rows, thus each page in these datasets corresponds to 30 lines.

We publish here only the lines. If you want the pages, go [here](https://zenodo.org/record/6581158). This dataset is made up of 4,800 annotated lines extracted from 80 double pages of the 1926 Paris census.
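The stated line count can be checked against the table layout described above. The short calculation below assumes that a double page consists of two pages, each holding one 30-row table; that reading is an assumption, not something the description states explicitly.

```python
ROWS_PER_TABLE = 30          # stated: each census table contains 30 rows
DOUBLE_PAGES = 80            # stated: lines come from 80 double pages
PAGES_PER_DOUBLE_PAGE = 2    # assumption: one table on each side of a double page

# One annotated line per table row.
expected_lines = DOUBLE_PAGES * PAGES_PER_DOUBLE_PAGE * ROWS_PER_TABLE
print(expected_lines)
```

Under that assumption the product matches the 4,800 annotated lines quoted in the description.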

Teklia / POPP-line

README.md

dataset

4 matches

tags: task_categories:image-to-text, language:fr, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us

:** [Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early 20th Century Paris Census](https://link.springer.com/chapter/10.1007/978-3-031-06555-2_10)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

Teklia / CASIA-HWDB2-line

README.md

dataset

6 matches

tags: task_categories:image-to-text, language:zh, license:mit, atr, htr, ocr, modern, handwritten, croissant, region:us

line handwritten Chinese character recognition: Benchmarking on new databases](https://www.sciencedirect.com/science/article/abs/pii/S0031320312002919)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

Teklia / Esposalles-line

README.md

dataset

3 matches

tags: task_categories:image-to-text, language:ca, license:mit, atr, htr, ocr, historical, handwritten, croissant, region:us

ting recognition](https://doi.org/10.1016/j.patcog.2012.11.024)

- **Point of Contact:** [TEKLIA](https://teklia.com)

## Dataset Summary

Teklia / RIMES-2011-line

README.md

dataset

7 matches

tags: task_categories:image-to-text, language:fr, license:mit, atr, htr, ocr, modern, handwritten, croissant, region:us

ase (Recognition and Indexation of handwritten documents and faxes) was created to evaluate automatic recognition and indexing systems for handwritten letters.

The database was collected by asking volunteers to write handwritten letters in exchange for gift certificates. Volunteers were given a fictitious identity (same gender as the real one) and up to 5 scenarios. Each scenario was chosen from among 9 realistic topics: change of personal data (address, bank account), request for information, opening and closing (customer account), change of contract or order, complaint (poor quality of service...), payment difficulties (request for delay, tax exemption...), reminder, complaint with other circumstances and a target (administrations or service providers (telephone, electricity, bank, insurance)). The volunteers wrote a letter with this information in their own words. The layout was free and the only request was to use white paper and write legibly in black ink.

The campaign was a success, with more than 1,300 people contributing to the RIMES database by writing up to 5 letters. The resulting RIMES database contains 12,723 pages, corresponding to 5,605 mails of two to three pages each.

lowercaseonly / cghd

README.md

dataset

16 matches

tags: task_categories:object-detection, task_categories:image-segmentation, size_categories:1K<n<10K, language:en, language:de, license:cc-by-3.0, croissant, region:us

for Handwritten Circuit Diagrams (GTDB-HD)

This repository contains images of hand-drawn electrical circuit diagrams, along with accompanying bounding box annotations for object detection and segmentation ground truth files. This dataset is intended for training models (e.g. neural networks) to extract electrical graphs from raster graphics.

The folder structure is made up as follows:

OCR-Ethiopic / HHD-Ethiopic

README.md

dataset

10 matches

tags: license:cc-by-4.0, doi:10.57967/hf/0691, region:us

mage recognition tasks. It contains a collection of historical handwritten manuscripts in the Ethiopic script. The dataset is intended to facilitate research and development for Ethiopic text-image recognition.

### Dataset Details

- __Size__: 79,684

- __Training Set__: 57,374

wanderkid / UniMER_Dataset

README.md

dataset

6 matches

tags: size_categories:1M<n<10M, language:en, language:zh, license:apache-2.0, data, math, MER, arxiv:2404.15254, region:us

sion Recognition (MER). It encompasses the comprehensive UniMER-1M training set, featuring over one million instances that represent a diverse and intricate range of mathematical expressions, coupled with the UniMER Test Set, meticulously designed to benchmark MER models against real-world scenarios. The dataset details are as follows:

- **UniMER-1M Training Set:**

- Total Samples: 1,061,791 LaTeX-image pairs

- Composition: A balanced mix of concise and complex, extended formula expressions

ai-forever / MERA

README.md

dataset

90 matches

tags: language:ru, license:mit, croissant, arxiv:2007.01852, arxiv:2112.00861, region:us

— a text fragment with a question from the game “What? Where? When?”;

- `topic` — a string containing the category of the question;

- `outputs` — a string containing the correct answer to the question.

#### *Data Instances*

armvectores / handwritten_text_detection

README.md

dataset

5 matches

tags: task_categories:object-detection, size_categories:n<1K, language:hy, handwritten text, dictation, YOLOv8, croissant, region:us

# Handwritten text detection dataset

The blanks were provided by the youth organization "Armenian Club" ([telegram](https://t.me/armenian_club), [instagram](https://www.instagram.com/armenian.club?igsh=MTJjYTN0dTdjamtxMQ==)), Moscow, Russia.