Why does Chinese image OCR output garbled characters?

#4
by Viking714 - opened

Hello, I recently used this model for Chinese image OCR, but the output words are wrong. The code I used is below:

from PIL import Image
img_pil = Image.open('/kaggle/input/timuimage/timu.jpg')
image = img_pil.convert("RGB")

from transformers import LayoutXLMProcessor
processor = LayoutXLMProcessor.from_pretrained("Microsoft/layoutlmv3-base-chinese")
feature_extractor = processor.feature_extractor

# Preprocess image to text

encoded_inputs = feature_extractor(image)
words = encoded_inputs.words

# Just output the words as one string

text = ""
for word in words[0]:
    text = text + word
print(text)

The output is as below:
re\1AlltTTiani|iete44si)ii"eahi|WAiL“4HNHHAilKtintteersNaaiftyUeawliditieeaHuseuay1he‘4LrLHauiiiasiliatififiaigMtiiarecuaEtaaii!t~BCpecaaOaeeiyfnaeipiesaoriyeae4raBiia4aiaei{thiulEiuaadlfh,aeaatteateeileweypakPotHsae

The image I used is from https://www.kaggle.com/datasets/viking714/timuimage; it is public, so anyone can view it.
I used the same method to OCR English images with the LayoutXLM and LayoutLMv2 models, and both worked fine.
Thank you very much.

You need to set the OCR language to Chinese + English, i.e. 'chi_sim+eng'.

from transformers import LayoutLMv3ImageProcessor, LayoutLMv3Processor, XLMRobertaTokenizer
model_name="microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)

Hello, I tried to use it in the same way, but I got this error:


ValueError Traceback (most recent call last)
in <cell line: 4>()
2 image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
3 tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
----> 4 processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)

ValueError: Received XLMRobertaTokenizer for argument tokenizer, but a ('LayoutLMv3Tokenizer', 'LayoutLMv3TokenizerFast') was expected.

What could be wrong? Thanks.

Find the source code of LayoutLMv3Processor and change
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast")
to
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast", "XLMRobertaTokenizer", "XLMRobertaTokenizerFast", "LayoutXLMTokenizer")
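
If editing the installed library source is inconvenient, the same change can probably be applied at runtime instead. The sketch below assumes the installed transformers version validates the tokenizer against the processor's tokenizer_class class attribute, as the error message and the edit above suggest; allow_tokenizers is a hypothetical helper name, not part of transformers, and other library versions may behave differently.

```python
def allow_tokenizers(processor_cls, extra_names):
    """Extend processor_cls.tokenizer_class with extra tokenizer class names.

    Assumption: processor_cls stores the accepted tokenizer class names in a
    `tokenizer_class` attribute (a string or a tuple of strings).
    """
    current = processor_cls.tokenizer_class
    if isinstance(current, str):
        current = (current,)
    # Keep the original order and skip names that are already accepted.
    merged = list(current)
    for name in extra_names:
        if name not in merged:
            merged.append(name)
    processor_cls.tokenizer_class = tuple(merged)
    return processor_cls.tokenizer_class


# Intended use (not run here; requires transformers to be installed):
# from transformers import LayoutLMv3Processor
# allow_tokenizers(
#     LayoutLMv3Processor,
#     ["XLMRobertaTokenizer", "XLMRobertaTokenizerFast", "LayoutXLMTokenizer"],
# )
```

The helper would need to be called before constructing LayoutLMv3Processor, and unlike a source edit it would have to be repeated in every session.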

Hello, has this been solved? I followed the method above, but the final output still contains only English.

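Following the earlier answers, the approach below produces Chinese results. If it does not work, check whether your tesseract-ocr installation is missing the chi_sim.traineddata file, which is usually stored in /usr/share/tesseract-ocr/4.00/tessdata/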

from transformers import XLMRobertaTokenizer, LayoutLMv3ImageProcessor, LayoutLMv3Processor
model_name = "Microsoft/layoutlmv3-base-chinese"
# Run Tesseract with both the Simplified Chinese and English language data
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
feature_extractor = processor.feature_extractor
# `image` is the RGB PIL image loaded as in the original post
inputs = feature_extractor(image)
inputs['words']
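
To check for the missing language file mentioned above, something like the following might help. It is only a sketch: missing_traineddata is a hypothetical helper, it uses the standard library alone, and the tessdata path is the one quoted in this reply, which may differ on your system.

```python
from pathlib import Path


def missing_traineddata(ocr_lang, tessdata_dir):
    """Return the language codes in ocr_lang that have no .traineddata file in tessdata_dir."""
    tessdata = Path(tessdata_dir)
    missing = []
    # An ocr_lang value such as 'chi_sim+eng' lists language codes joined by '+'.
    for code in ocr_lang.split("+"):
        code = code.strip()
        if code and not (tessdata / f"{code}.traineddata").is_file():
            missing.append(code)
    return missing


if __name__ == "__main__":
    # Path taken from the reply above; adjust it for your installation.
    print(missing_traineddata("chi_sim+eng", "/usr/share/tesseract-ocr/4.00/tessdata/"))
```

An empty list means both language files were found in that directory. If chi_sim is reported as missing on Debian or Ubuntu, installing the tesseract-ocr-chi-sim package is the usual way to add it.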
