Why does Chinese image OCR output garbled text?

by Viking714 - opened Jun 3, 2023

Jun 3, 2023

Hello, I recently used this model for OCR on a Chinese image, but the output words are wrong. The code I used is below:

from PIL import Image
img_pil = Image.open('/kaggle/input/timuimage/timu.jpg')
image = img_pil.convert("RGB")

from transformers import LayoutXLMProcessor
processor = LayoutXLMProcessor.from_pretrained("Microsoft/layoutlmv3-base-chinese")
feature_extractor = processor.feature_extractor

# Preprocess image to text

encoded_inputs = feature_extractor(image)
words = encoded_inputs.words

# Just output the words in a format

text = ""
for word in words[0]:
    text = text + word
print(text)

The output is as below:
re\1AlltTTiani|iete44si)ii"eahi|WAiL“4HNHHAilKtintteersNaaiftyUeawliditieeaHuseuay1he‘4LrLHauiiiasiliatififiaigMtiiarecuaEtaaii!t~BCpecaaOaeeiyfnaeipiesaoriyeae4raBiia4aiaei{thiulEiuaadlfh,aeaatteateeileweypakPotHsae

The image I used is from https://www.kaggle.com/datasets/viking714/timuimage; it is public, so anyone can view it.
I used the same method to OCR English images with the LayoutXLM and LayoutLMv2 models, and both worked fine.
Thank you very much.

yuyijiong

Aug 27, 2023

You need to set the OCR language to Chinese + English, that is, 'chi_sim+eng':

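from transformers import LayoutLMv3ImageProcessor, LayoutLMv3Processor, XLMRobertaTokenizer

model_name = "microsoft/layoutlmv3-base-chinese"
# ocr_lang is handed to Tesseract, so it recognizes Simplified Chinese and English
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor, tokenizer=tokenizer, apply_ocr=True)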

resulmamiyev

Sep 15, 2023

Hello, I was trying to use it in the same way. But I got this error:

ValueError Traceback (most recent call last)
in <cell line: 4>()
2 image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
3 tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
----> 4 processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)

ValueError: Received XLMRobertaTokenizer for argument tokenizer, but a ('LayoutLMv3Tokenizer', 'LayoutLMv3TokenizerFast') was expected.

What could be wrong? Thanks.

yuyijiong

Sep 15, 2023

Find the source code of LayoutLMv3Processor and change
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast")
to
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast",'XLMRobertaTokenizer','XLMRobertaTokenizerFast','LayoutXLMTokenizer')

fandl

Nov 2, 2023

Hello, has this been solved? I followed the method above, but the output still contains only English.

alex1qaz

Dec 1, 2023

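Following the earlier answers, the approach below gives Chinese results. If it does not work, check whether your tesseract-ocr installation is missing the chi_sim.traineddata file, which is usually stored in /usr/share/tesseract-ocr/4.00/tessdata/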

from transformers import XLMRobertaTokenizer, AutoModel, AutoProcessor, LayoutLMv3ImageProcessor, LayoutLMv3Processor
model_name = "Microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
feature_extractor = processor.feature_extractor
inputs = feature_extractor(image)
inputs['words']
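
The check for a missing chi_sim.traineddata file described in this reply could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `missing_traineddata` is hypothetical, and the tessdata directory is an assumption that differs between installations (the path used here is the one quoted in the reply).

```python
from pathlib import Path


def missing_traineddata(ocr_lang, tessdata_dir):
    """Return the language codes in `ocr_lang` that have no .traineddata file.

    `ocr_lang` uses Tesseract's '+'-separated form, e.g. 'chi_sim+eng'.
    """
    tessdata = Path(tessdata_dir)
    missing = []
    for code in ocr_lang.split("+"):
        code = code.strip()
        # Each Tesseract language is stored as <code>.traineddata
        if code and not (tessdata / f"{code}.traineddata").is_file():
            missing.append(code)
    return missing


if __name__ == "__main__":
    # Assumed location, taken from the reply above; adjust for your system.
    tessdata_dir = "/usr/share/tesseract-ocr/4.00/tessdata/"
    print(missing_traineddata("chi_sim+eng", tessdata_dir))
```

An empty list means every requested language has a data file in that directory. If chi_sim is listed, Tesseract has no Chinese model to load from there, which would be consistent with the English-only output reported earlier in the thread.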

Alvein

Aug 31, 2024

edited Aug 31, 2024

I don't understand why you would modify the source code. You only need to define your own subclass, LayoutLMv3ChineseProcessor.

from transformers import LayoutLMv3ImageProcessor, LayoutLMv3Processor, XLMRobertaTokenizer

model_name = "microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

# Subclass that widens the accepted tokenizer classes instead of patching the library source
class LayoutLMv3ChineseProcessor(LayoutLMv3Processor):
    tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast", 'XLMRobertaTokenizer', 'XLMRobertaTokenizerFast', 'LayoutXLMTokenizer')

processor = LayoutLMv3ChineseProcessor(image_processor=image_processor, tokenizer=tokenizer, apply_ocr=True)

tianchiguaixia

Sep 5, 2024

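None of the replies above get to the core of the problem. There is no need to change any code. Do it my way, and training and inference are done in fewer than 59 lines of code: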

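from transformers import LayoutLMv3ImageProcessor, LayoutLMv3Processor, LayoutXLMTokenizer

tokenizer = LayoutXLMTokenizer.from_pretrained(
    "./layoutlmv3-base-chinese"
)

# apply_ocr=False: words and boxes come from your own OCR step, not Tesseract
image_processor = LayoutLMv3ImageProcessor.from_pretrained(
    "./layoutlmv3-base-chinese", apply_ocr=False
)

processor = LayoutLMv3Processor(tokenizer=tokenizer, image_processor=image_processor, apply_ocr=False)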
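
With apply_ocr=False the image processor no longer runs Tesseract, so the words and their bounding boxes have to come from your own OCR step, and LayoutLMv3 expects the boxes scaled to a 0-1000 range. A minimal sketch of that scaling is below. The function name and the sample pixel values are illustrative assumptions, not code from this thread.

```python
def normalize_box(box, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


if __name__ == "__main__":
    # Hypothetical OCR result for an 800x600 image.
    words = ["发票", "Total"]
    pixel_boxes = [(80, 30, 240, 90), (400, 300, 520, 330)]
    boxes = [normalize_box(b, 800, 600) for b in pixel_boxes]
    print(list(zip(words, boxes)))
```

The resulting lists would then typically be passed to the processor together with the image, for example processor(image, words, boxes=boxes, return_tensors="pt"), assuming the processor was built with OCR disabled as in the reply above.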

AhaLucas

Sep 25, 2024

How was this figure generated?

tianchiguaixia

Sep 26, 2024

It comes from the model's inference output.
