# Text Recognition
This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all of these scripts will eventually be migrated.
## Overview
| Dataset | images | annotation file (training) | annotation file (test) |
|---|---|---|---|
| coco_text | homepage | train_labels.json | - |
| ICDAR2011 | homepage | - | - |
| SynthAdd | SynthText_Add.zip (code:627x) | train_labels.json | - |
| OpenVINO | Open Images | annotations | annotations |
| DeText | homepage | - | - |
| NAF | homepage | - | - |
| Lecture Video DB | homepage | - | - |
| LSVT | homepage | - | - |
| IMGUR | homepage | - | - |
| KAIST | homepage | - | - |
| MTWI | homepage | - | - |
| ReCTS | homepage | - | - |
| IIIT-ILST | homepage | - | - |
| VinText | homepage | - | - |
| BID | homepage | - | - |
| RCTW | homepage | - | - |
| HierText | homepage | - | - |
| ArT | homepage | - | - |
(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
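For reference, the converted `*_labels.json` files generally follow MMOCR's unified annotation format: a `metainfo` header plus a `data_list` of samples, each carrying an image path and its text instances. A quick sanity check on a converted file (the path below is illustrative):

```bash
# Inspect a converted annotation file (path is illustrative)
python -c "
import json
labels = json.load(open('data/recog/icdar2011/train_labels.json'))
print(labels['metainfo'])      # dataset / task metadata
print(labels['data_list'][0])  # first sample: img_path and text instances
"
```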
## Install AWS CLI (optional)
Since some datasets require the AWS CLI to be installed in advance, we provide a quick installation guide here:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install ./aws/install -i /usr/local/aws-cli -b /usr/local/bin !aws configure # this command will require you to input keys, you can skip them except # for the Default region name # AWS Access Key ID [None]: # AWS Secret Access Key [None]: # Default region name [None]: us-east-1 # Default output format [None]
For users in China, these datasets can also be downloaded from [OpenDataLab](https://opendatalab.com/) at high speed.
## ICDAR 2011 (Born-Digital Images)
Step1: Download `Challenge1_Training_Task3_Images_GT.zip`, `Challenge1_Test_Task3_Images.zip`, and `Challenge1_Test_Task3_GT.txt` from the homepage, **Task 1.3: Word Recognition (2013 edition)**.

```bash
mkdir icdar2011 && cd icdar2011
mkdir annotations

# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate

# For images
mkdir crops
unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge1_Test_Task3_Images.zip -d crops/test

# For annotations
mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
```

Step2: Convert the original annotations to `train_labels.json` and `test_labels.json` with the following command:

```bash
python tools/dataset_converters/textrecog/ic11_converter.py PATH/TO/icdar2011
```

After running the above codes, the directory structure should be as follows:

```text
├── icdar2011
│   ├── crops
│   ├── train_labels.json
│   └── test_labels.json
```
## coco_text
Step1: Download the images from the homepage.

Step2: Download `train_labels.json`.
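There is no dedicated script for these two steps; a minimal sketch for arranging the manually downloaded files into the expected layout (the paths are placeholders):

```bash
mkdir coco_text && cd coco_text

# Move the manually downloaded files into place
mv /path/to/train_labels.json .
mv /path/to/train_words .
```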
After running the above codes, the directory structure should be as follows:
```text
├── coco_text
│   ├── train_labels.json
│   └── train_words
```
## SynthAdd
Step1: Download `SynthText_Add.zip` from SynthAdd (code: 627x).

Step2: Download `train_labels.json`.

Step3: Arrange the files with the following commands:

```bash
mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .
unzip SynthText_Add.zip

mv /path/to/train_labels.json .

# create soft link
cd /path/to/mmocr/data/recog
ln -s /path/to/SynthAdd SynthAdd
```

After running the above codes, the directory structure should be as follows:

```text
├── SynthAdd
│   ├── train_labels.json
│   └── SynthText_Add
```
## OpenVINO
Step1 (optional): Install AWS CLI.
Step2: Download Open Images subsets `train_1`, `train_2`, `train_5`, `train_f`, and `validation` to `openvino/`.

```bash
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
```

Step3: Generate `train_{1,2,5,f}_labels.json`, `val_labels.json` and crop images using 4 processes with the following command:

```bash
python tools/dataset_converters/textrecog/openvino_converter.py /path/to/openvino 4
```

After running the above codes, the directory structure should be as follows:

```text
├── OpenVINO
│   ├── image_1
│   ├── image_2
│   ├── image_5
│   ├── image_f
│   ├── image_val
│   ├── train_1_labels.json
│   ├── train_2_labels.json
│   ├── train_5_labels.json
│   ├── train_f_labels.json
│   └── val_labels.json
```
## DeText
Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the homepage.

```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```

Step2: Generate `train_labels.json` and `test_labels.json` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
python tools/dataset_converters/textrecog/detext_converter.py PATH/TO/detext --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── detext
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── test_labels.json
```
## NAF
Step1: Download `labeled_images.tar.gz` to `naf/`.

```bash
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz
```

Step2: Generate `train_labels.json`, `val_labels.json`, and `test_labels.json` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
python tools/dataset_converters/textrecog/naf_converter.py PATH/TO/naf --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── naf
│   ├── crops
│   ├── train_labels.json
│   ├── val_labels.json
│   └── test_labels.json
```
## Lecture Video DB
This section is not fully tested yet.
The LV dataset already provides cropped images and the corresponding annotations.
Step1: Download `IIIT-CVid.zip` to `lv/`.

```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For images
mv IIIT-CVid/Crops ./

# For annotations
mv IIIT-CVid/train.txt train_labels.json && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_labels.json

rm IIIT-CVid.zip
```

Step2: Generate `train_labels.json`, `val_labels.json`, and `test_labels.json` with the following command:

```bash
python tools/dataset_converters/textrecog/lv_converter.py PATH/TO/lv
```

After running the above codes, the directory structure should be as follows:

```text
├── lv
│   ├── Crops
│   ├── train_labels.json
│   └── test_labels.json
```
## LSVT
This section is not fully tested yet.
Step1: Download `train_full_images_0.tar.gz`, `train_full_images_1.tar.gz`, and `train_full_labels.json` to `lsvt/`.

```bash
mkdir lsvt && cd lsvt

# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json

mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs

rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```

Step2: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Annotations of LSVT test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/lsvt/ignores
python tools/dataset_converters/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── lsvt
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## IMGUR
This section is not fully tested yet.
Step1: Run `download_imgur5k.py` to download the images. You can merge PR#5 in your local repository to enable much faster, parallel image downloads.

```bash
mkdir imgur && cd imgur

git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

rm -rf IMGUR5K-Handwriting-Dataset
```

Step2: Generate `train_labels.json`, `val_label.json`, and `test_labels.json` and crop images with the following command:

```bash
python tools/dataset_converters/textrecog/imgur_converter.py PATH/TO/imgur
```

After running the above codes, the directory structure should be as follows:

```text
├── imgur
│   ├── crops
│   ├── train_labels.json
│   ├── test_labels.json
│   └── val_label.json
```
## KAIST
This section is not fully tested yet.
Step1: Download `KAIST_all.zip` to `kaist/`.

```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations

# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip && rm KAIST_all.zip
```

Step2: Extract the zips:

```bash
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
```

Step3: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/kaist/ignores
python tools/dataset_converters/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── kaist
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## MTWI
This section is not fully tested yet.
Step1: Download `mtwi_2018_train.zip` from the homepage.

```bash
mkdir mtwi && cd mtwi

unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations

rm mtwi_2018_train.zip
```

Step2: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Annotations of MTWI test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
python tools/dataset_converters/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── mtwi
│   ├── crops
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## ReCTS
This section is not fully tested yet.
Step1: Download `ReCTS.zip` to `rects/` from the homepage.

```bash
mkdir rects && cd rects

# Download ReCTS dataset
# You can also find the Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip

mv img imgs && mv gt_unicode annotations

rm ReCTS.zip -f && rm -rf gt
```

Step2: Generate `train_labels.json` and `val_label.json` (optional) with the following command:

```bash
# Annotations of ReCTS test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
python tools/dataset_converters/textrecog/rects_converter.py PATH/TO/rects --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── rects
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## ILST
This section is not fully tested yet.
Step1: Download `IIIT-ILST.zip` from the OneDrive link.

Step2: Run the following commands:

```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST

# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..

# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```

Step3: Generate `train_labels.json` and `val_label.json` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── IIIT-ILST
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## VinText
This section is not fully tested yet.
Step1: Download `vintext.zip` to `vintext/`.

```bash
mkdir vintext && cd vintext

# Download dataset from Google Drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese

# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```

Step2: Generate `train_labels.json`, `test_labels.json`, and `unseen_test_labels.json`, and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts).

```bash
python tools/dataset_converters/textrecog/vintext_converter.py PATH/TO/vintext --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── vintext
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   ├── test_labels.json
│   └── unseen_test_labels.json
```
## BID
This section is not fully tested yet.
Step1: Download `BID Dataset.zip`.
Step2: Run the following commands to preprocess the dataset:

```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip

# Unzip and Rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID

# The BID dataset has permission issues; grant access so the
# files can be read
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations

# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

# Remove unnecessary files
rm -rf desktop.ini
```

Step3: Generate `train_labels.json` and `val_label.json` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
python tools/dataset_converters/textrecog/bid_converter.py PATH/TO/BID --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── BID
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## RCTW
This section is not fully tested yet.
Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the homepage, then extract the zips to `rctw/imgs` and `rctw/annotations`, respectively, as sketched below.
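A minimal sketch of Step1, assuming the archives have been downloaded manually (`7z` from p7zip is used here because `train_images.zip` is a split archive):

```bash
mkdir -p rctw/imgs rctw/annotations && cd rctw

# 7z picks up train_images.zip.002 automatically when given the first part
7z x /path/to/train_images.zip.001 -oimgs
unzip -q /path/to/train_gts.zip -d annotations
```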
Step2: Generate `train_labels.json` and `val_label.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
# Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
python tools/dataset_converters/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── rctw
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
```
## HierText
This section is not fully tested yet.
Step1 (optional): Install AWS CLI.
Step2: Clone the HierText repo to get the annotations:

```bash
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
```

Step3: Download `train.tgz` and `validation.tgz` from AWS:

```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```

Step4: Process the raw data:

```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.json.gz
gzip -d annotations/validation.json.gz

# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
```

Step5: Generate `train_labels.json` and `val_label.json`. HierText includes three levels of annotation: `paragraph`, `line`, and `word`; check the original paper for details. E.g., set `--level paragraph` for paragraph-level annotations, `--level line` for line-level, or `--level word` for word-level.

```bash
# Collect word annotation from HierText --level word
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```

After running the above codes, the directory structure should be as follows:

```text
├── HierText
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json
```
## ArT
This section is not fully tested yet.
Step1: Download `train_task2_images.tar.gz` and `train_task2_labels.json` from the homepage to `art/`.

```bash
mkdir art && cd art
mkdir annotations

# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json

# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/

# Remove unnecessary files
rm train_task2_images.tar.gz
```

Step2: Generate `train_labels.json` and `val_label.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
# Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textrecog/art_converter.py PATH/TO/art
```

After running the above codes, the directory structure should be as follows:

```text
├── art
│   ├── crops
│   ├── train_labels.json
│   └── val_label.json (optional)
```