Happy holiday!
Hi! Did you succeed with this project? I'm trying almost the same thing, but for handwritten medieval Hebrew manuscripts. I have a lot of data but don't know where to start. How did you set up your dataset? I saw it on your Hugging Face account, but it is already pickled. Thanks, and happy holiday!
Thanks for your interest in the project! Actually, I did not succeed much with that project. My problem was mostly the data (or the lack of it): I had 30K lines, while Microsoft used over 600M(!) for their TrOCR.
Basically, the code I used is mostly this, with Swin as the encoder and BEREL as the decoder.
The dataset can be created with tools like TRDG (see https://huggingface.co/datasets/sivan22/synth-HTR).
Then you can use this method to create an HF dataset.
You may be interested in this thread as well: forum
Wishing you a kosher and happy holiday.
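For readers wondering what "create an HF dataset" from line images and transcriptions can look like, here is a minimal sketch. It is not necessarily the method linked above; it assumes the `imagefolder` layout of the `datasets` library, where a `metadata.jsonl` file next to the images carries a `file_name` column plus any extra columns. The function name `write_metadata` and the `text` column name are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def write_metadata(data_dir, pairs):
    """Write a metadata.jsonl file describing (image file name, transcription) pairs.

    data_dir: folder that holds (or will hold) the line images.
    pairs:    iterable of (file_name, text) tuples; file_name is relative to data_dir.
    Returns the path of the written metadata file.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    out = data_dir / "metadata.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for file_name, text in pairs:
            # ensure_ascii=False keeps Hebrew text readable in the file
            row = {"file_name": file_name, "text": text}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out


# Assumed follow-up step (requires the `datasets` package and real image files):
#   from datasets import load_dataset
#   ds = load_dataset("imagefolder", data_dir="lines")
```

With the images and `metadata.jsonl` in the same folder, the commented `load_dataset` call should expose each line as an image together with its `text` column, which avoids shipping the data as a pickle.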
Thank you so much for the clarification! I have a lot of data for medieval manuscripts (Mishnah, responsa, Talmud, etc.) in ALTO, and I wrote a Python script that extracts it as line images (for TrOCR) with their transcriptions; that should be sufficient. Now I have to find a good model for training. I'm thinking of BEREL-2.0 or HeBERT, but I don't know yet. I trained a test model, but the result was just gibberish, so I think I missed something. I have worked mostly with eScriptorium/Kraken for OCR and NLP, and I'm new to Hugging Face and BERT.
This is the script I tried to train with: https://github.com/johnlockejrr/sam-trocr/blob/main/sam-trocr.py
And my result:
(huggingface-source-py3.10) incognito@DESKTOP-NHKR7QL:~/TrOCR-py3.10$ python test_ocr.py LINES/sam_gt/2.4.jpg
/home/incognito/huggingface-source-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:1252: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
� � �� ��� � �ת� �ת ��ת���ת � �ב ��ב� �ב��ב � � �� �תת�תת �תב �ת� ���� ����
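On the ALTO-to-lines step mentioned earlier in this post, here is a rough sketch of pulling line boxes and transcriptions out of an ALTO file with only the standard library. It is not the script from the post. It assumes each `TextLine` carries `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes and one or more `String` children with a `CONTENT` attribute; the function name `alto_lines` and the sample XML are made up for illustration. Cropping the line images (for example with an imaging library) is left out.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, so ALTO v2/v3/v4 files are handled alike."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return a list of {"box": (x, y, w, h), "text": str} for each non-empty TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Collect the CONTENT of every String element inside this line
        words = [
            child.get("CONTENT", "")
            for child in elem.iter()
            if local_name(child.tag) == "String"
        ]
        text = " ".join(w for w in words if w)
        if not text:
            continue  # skip lines without a transcription
        box = tuple(
            int(float(elem.get(key, 0)))
            for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        lines.append({"box": box, "text": text})
    return lines


# Made-up sample: one transcribed line and one empty line
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="10" VPOS="20" WIDTH="300" HEIGHT="40">
      <String CONTENT="שלום"/><SP/><String CONTENT="עולם"/>
    </TextLine>
    <TextLine HPOS="10" VPOS="70" WIDTH="280" HEIGHT="38"/>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
```

Calling `alto_lines(SAMPLE)` yields one entry per transcribed line. Printing a few of these pairs next to the cropped images is a cheap way to confirm that images and labels are still aligned before looking for problems in the model itself.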