# Text Recognition
This page is a manual preparation guide for datasets that are not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all of these scripts will eventually be migrated.
## Overview

| Dataset | images | annotation file (training) | annotation file (test) |
| --- | --- | --- | --- |
| coco_text | homepage | train_labels.json | - |
| ICDAR2011 | homepage | - | - |
| SynthAdd | SynthText_Add.zip (code:627x) | train_labels.json | - |
| OpenVINO | Open Images | annotations | annotations |
| DeText | homepage | - | - |
| Lecture Video DB | homepage | - | - |
| LSVT | homepage | - | - |
| IMGUR | homepage | - | - |
| KAIST | homepage | - | - |
| MTWI | homepage | - | - |
| ReCTS | homepage | - | - |
| IIIT-ILST | homepage | - | - |
| VinText | homepage | - | - |
| BID | homepage | - | - |
| RCTW | homepage | - | - |
| HierText | homepage | - | - |
| ArT | homepage | - | - |
(*) Since the official homepage is currently unavailable, we provide an alternative link for quick reference. However, we do not guarantee the correctness of the dataset.
## Install AWS CLI (optional)

Since some datasets require the AWS CLI to be installed in advance, we provide a quick installation guide here:

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# this command will require you to input keys, you can skip them except
# for the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]
```
For users in China, these datasets can also be downloaded from OpenDataLab at high speed.
## ICDAR 2011 (Born-Digital Images)

Step1: Download `Challenge1_Training_Task3_Images_GT.zip`, `Challenge1_Test_Task3_Images.zip`, and `Challenge1_Test_Task3_GT.txt` from the homepage, under "Task 1.3: Word Recognition (2013 edition)".

```bash
mkdir icdar2011 && cd icdar2011
mkdir annotations

# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate

# For images
mkdir crops
unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge1_Test_Task3_Images.zip -d crops/test

# For annotations
mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
```
Step2: Convert the original annotations to `train_labels.json` and `test_labels.json` with the following command:

```bash
python tools/dataset_converters/textrecog/ic11_converter.py PATH/TO/icdar2011
```
After running the above commands, the directory structure should be as follows:

```text
├── icdar2011
│   ├── crops
│   ├── train_labels.json
│   └── test_labels.json
```
## coco_text

Step1: Download the images from the homepage.

Step2: Download `train_labels.json`, then arrange the files as sketched below.
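Since this section has no ready-made script, here is a minimal sketch of how the files could be arranged to match the expected layout. The archive name `train_words.zip` is an assumption; substitute whatever file the homepage actually provides:

```bash
mkdir coco_text && cd coco_text
# Hypothetical archive name -- use the file actually downloaded from the homepage
unzip -q /path/to/train_words.zip      # should yield a train_words/ directory of word crops
mv /path/to/train_labels.json .
```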
After running the above commands, the directory structure should be as follows:

```text
├── coco_text
│   ├── train_labels.json
│   └── train_words
```
## SynthAdd

Step1: Download `SynthText_Add.zip` from SynthAdd (code: 627x).

Step2: Download `train_labels.json`.

Step3: Run the following commands:

```bash
mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .
unzip SynthText_Add.zip
mv /path/to/train_labels.json .

# create soft link
cd /path/to/mmocr/data/recog
ln -s /path/to/SynthAdd SynthAdd
```
After running the above commands, the directory structure should be as follows:

```text
├── SynthAdd
│   ├── train_labels.json
│   └── SynthText_Add
```
## OpenVINO
Step1 (optional): Install AWS CLI.
Step2: Download the Open Images subsets `train_1`, `train_2`, `train_5`, `train_f`, and `validation` to `openvino/`.

```bash
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
```
Step3: Generate `train_{1,2,5,f}_labels.json` and `val_labels.json` and crop images using 4 processes with the following command:

```bash
python tools/dataset_converters/textrecog/openvino_converter.py /path/to/openvino 4
```
After running the above commands, the directory structure should be as follows:

```text
├── OpenVINO
│   ├── image_1
│   ├── image_2
│   ├── image_5
│   ├── image_f
│   ├── image_val
│   ├── train_1_labels.json
│   ├── train_2_labels.json
│   ├── train_5_labels.json
│   ├── train_f_labels.json
│   └── val_labels.json
```
## DeText

Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from "Task 3: End to End" on the homepage.

```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```
Step2: Generate `train_labels.json` and `test_labels.json` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
python tools/dataset_converters/textrecog/detext_converter.py PATH/TO/detext --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── detext
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── test_labels.json
```
## NAF

Step1: Download `labeled_images.tar.gz` to `naf/`.

```bash
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
rm -rf NAF_dataset && rm labeled_images.tar.gz
```
Step2: Generate `train_labels.json`, `val_labels.json`, and `test_labels.json` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
python tools/dataset_converters/textrecog/naf_converter.py PATH/TO/naf --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── naf
│   ├── crops
│   ├── train_labels.json
│   ├── val_labels.json
│   └── test_labels.json
```
## Lecture Video DB
This section is not fully tested yet.
The LV dataset already provides cropped images and the corresponding annotations.
Step1: Download `IIIT-CVid.zip` to `lv/`.

```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For images
mv IIIT-CVid/Crops ./

# For annotations
mv IIIT-CVid/train.txt train_labels.json && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_labels.json

rm IIIT-CVid.zip
```
Step2: Generate `train_labels.json`, `val.json`, and `test.json` with the following command:

```bash
python tools/dataset_converters/textrecog/lv_converter.py PATH/TO/lv
```
After running the above commands, the directory structure should be as follows:

```text
├── lv
│   ├── Crops
│   ├── train_labels.json
│   └── test_labels.json
```
## LSVT
This section is not fully tested yet.
Step1: Download `train_full_images_0.tar.gz`, `train_full_images_1.tar.gz`, and `train_full_labels.json` to `lsvt/`.

```bash
mkdir lsvt && cd lsvt

# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json

mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs

rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```
Step2: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Annotations of the LSVT test split are not publicly available; split a
# validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/lsvt/ignores
python tools/dataset_converters/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── lsvt
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## IMGUR
This section is not fully tested yet.
Step1: Run `download_imgur5k.py` to download the images. You can merge PR#5 into your local repository to enable much faster, parallel downloading.

```bash
mkdir imgur && cd imgur

git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

rm -rf IMGUR5K-Handwriting-Dataset
```
Step2: Generate `train_labels.json`, `val_label.json`, and `test_labels.json` and crop images with the following command:

```bash
python tools/dataset_converters/textrecog/imgur_converter.py PATH/TO/imgur
```
After running the above commands, the directory structure should be as follows:

```text
├── imgur
│   ├── crops
│   ├── train_labels.json
│   ├── test_labels.json
│   └── val_label.json
```
## KAIST
This section is not fully tested yet.
Step1: Download `KAIST_all.zip` to `kaist/`.

```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations

# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip && rm KAIST_all.zip
```
Step2: Extract the zips:

```bash
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
```
Step3: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Since KAIST does not provide an official split, you can split the dataset
# by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/kaist/ignores
python tools/dataset_converters/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── kaist
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## MTWI
This section is not fully tested yet.
Step1: Download `mtwi_2018_train.zip` from the homepage.

```bash
mkdir mtwi && cd mtwi

unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations

rm mtwi_2018_train.zip
```
Step2: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Annotations of the MTWI test split are not publicly available; split a
# validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
python tools/dataset_converters/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── mtwi
│   ├── crops
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## ReCTS
This section is not fully tested yet.
Step1: Download `ReCTS.zip` to `rects/` from the homepage.

```bash
mkdir rects && cd rects

# Download ReCTS dataset
# You can also find the Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip

mv img imgs && mv gt_unicode annotations

rm ReCTS.zip -f && rm -rf gt
```
Step2: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Annotations of the ReCTS test split are not publicly available; split a
# validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
python tools/dataset_converters/textrecog/rects_converter.py PATH/TO/rects --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── rects
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## ILST
This section is not fully tested yet.
Step1: Download `IIIT-ILST.zip` from the OneDrive link.

Step2: Run the following commands:

```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST

# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..

# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```
Step3: Generate `train_labels.json` and `val_label.json` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset; e.g., with a val-ratio of 0.2, 20% of the data are left out as the validation set.

```bash
python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```
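For instance, to split 20% of the data into a validation set while keeping vertical texts, combine the two flags mentioned above:

```bash
python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4 --val-ratio 0.2 --preserve-vertical
```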
After running the above commands, the directory structure should be as follows:

```text
├── IIIT-ILST
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## VinText
This section is not fully tested yet.
Step1: Download `vintext.zip` to `vintext/`.

```bash
mkdir vintext && cd vintext

# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese

# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```
Step2: Generate `train_labels.json`, `test_labels.json`, and `unseen_test_labels.json`, and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):

```bash
python tools/dataset_converters/textrecog/vintext_converter.py PATH/TO/vintext --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── vintext
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   ├── test_labels.json
│   └── unseen_test_labels.json
```
## BID
This section is not fully tested yet.
Step1: Download `BID Dataset.zip`.

Step2: Run the following commands to preprocess the dataset:

```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip

# Unzip and rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID

# The BID dataset has permission issues; you may need to grant
# permissions on these files
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations

# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

# Remove unnecessary files
rm -rf desktop.ini
```
Step3: Generate `train_labels.json` and `val_label.json` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset; e.g., with a val-ratio of 0.2, 20% of the data are left out as the validation set.

```bash
python tools/dataset_converters/textrecog/bid_converter.py PATH/TO/BID --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── BID
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## RCTW
This section is not fully tested yet.
Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the homepage, then extract the zips to `rctw/imgs` and `rctw/annotations`, respectively, e.g. as sketched below.
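A minimal sketch of Step1, assuming the two image archives are parts of one binary-split zip (the exact joining step may differ depending on how the archive was split; `7z x train_images.zip.001` is a common alternative):

```bash
mkdir -p rctw/imgs rctw/annotations && cd rctw
# Join the split archive parts before extracting
cat /path/to/train_images.zip.001 /path/to/train_images.zip.002 > train_images.zip
unzip -q train_images.zip -d imgs
unzip -q /path/to/train_gts.zip -d annotations
```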
Step2: Generate `train_labels.json` and `val_label.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset; e.g., with a val-ratio of 0.2, 20% of the data are left out as the validation set.

```bash
# Annotations of the RCTW test split are not publicly available; split a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
python tools/dataset_converters/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
```
After running the above commands, the directory structure should be as follows:

```text
├── rctw
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## HierText
This section is not fully tested yet.
Step1 (optional): Install AWS CLI.
Step2: Clone the HierText repo to get the annotations:

```bash
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
```
Step3: Download `train.tgz` and `validation.tgz` from AWS:

```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```
Step4: Process the raw data:

```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.json.gz
gzip -d annotations/validation.json.gz

# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
```
Step5: Generate `train_labels.json` and `val_label.json`. HierText includes different levels of annotation: `paragraph`, `line`, and `word`; check the original paper for details. E.g., set `--level paragraph` to get paragraph-level annotations, `--level line` for line-level annotations, or `--level word` for word-level annotations, as shown below.

```bash
# Collect word annotation from HierText --level word
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```
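For instance, to collect line-level annotations instead, only the `--level` flag changes:

```bash
python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level line --nproc 4
```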
After running the above commands, the directory structure should be as follows:

```text
├── HierText
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json
```
## ArT
This section is not fully tested yet.
Step1: Download `train_task2_images.tar.gz` and `train_task2_labels.json` from the homepage to `art/`.

```bash
mkdir art && cd art
mkdir annotations

# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json

# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/

# Remove unnecessary files
rm train_task2_images.tar.gz
```
Step2: Generate `train_labels.json` and `val_label.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset; e.g., with a val-ratio of 0.2, 20% of the data are left out as the validation set.

```bash
# Annotations of the ArT test split are not publicly available; split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textrecog/art_converter.py PATH/TO/art
```
After running the above commands, the directory structure should be as follows:

```text
├── art
│   ├── crops
│   ├── train_labels.json
│   └── val_label.json (optional)
```