trocr-old-russian

Info

The model is trained to recognize printed text in Old Russian (pre-reform orthography).

Usage

Basic usage

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the fine-tuned weights; the processor comes from the base TrOCR checkpoint.
hf_model = VisionEncoderDecoderModel.from_pretrained("Serovvans/trocr-prereform-orthography")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")

# The processor expects a 3-channel image, so convert grayscale or RGBA input to RGB.
image = Image.open("./path/to/your/image").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids and decode them into text.
generated_ids = hf_model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
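
When several images need to be recognized, they can be passed through the model in batches instead of one at a time. The following is a minimal sketch and not part of the repository: the helper names `chunked` and `recognize_lines` and the default `batch_size` are hypothetical, and it assumes that `model` and `processor` were loaded as in the basic usage example and that each image contains a single line of text.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def recognize_lines(image_paths: Sequence[str], model, processor,
                    batch_size: int = 8) -> List[str]:
    """Recognize several images batch by batch (hypothetical helper)."""
    from PIL import Image  # imported lazily so `chunked` works without Pillow

    texts: List[str] = []
    for batch in chunked(image_paths, batch_size):
        # Convert every image to RGB, as in the single-image example.
        images = [Image.open(path).convert("RGB") for path in batch]
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

A call such as `recognize_lines(["line1.png", "line2.png"], hf_model, processor)` would return one string per image, in input order; the file names here are placeholders.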

Usage for recognizing a page

  1. Download the following files from the repository: recognize_page.py, requirements_page.txt

  2. Install tesseract-ocr

    • Linux
    sudo apt install tesseract-ocr
    
    • MacOS
    brew install tesseract
    
  3. Install poppler

    • Linux
    sudo apt-get install poppler-utils
    
    • MacOS
    brew install poppler
    
    • Windows
      1. Download the latest poppler package from @oschwartz10612's build, which is the most up-to-date version.
      2. Move the extracted directory to the desired location on your system.
      3. Add the bin/ directory to your PATH.
      4. Verify the installation by opening cmd and checking that pdftoppm -h runs.
  4. Install dependencies:

    pip install -r requirements_page.txt

  5. Use this code:

from recognize_page import recognize_page

page_path = "cleaned_pages/C2V10/page11.png"
text = recognize_page(page_path, text_output_path="output/file.txt")

print("Page text:\n")
print(text)
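
Before running the page or book scripts, it can help to confirm that the external tools from steps 2 and 3 are reachable. The sketch below is not part of the repository: `missing_tools` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and that Poppler provides `pdftoppm`, the command used in the Windows check above.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("tesseract", "pdftoppm")) -> List[str]:
    """Return the names of the command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are available.")
```

An empty result only shows that the executables exist on PATH; it does not check their versions or installed language data.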

Usage for recognizing a book from a PDF file

  1. Download the following files from the repository: recognize_book.py, recognize_page.py, requirements_book.txt

  2. Install tesseract-ocr

    • Linux
    sudo apt install tesseract-ocr
    
    • MacOS
    brew install tesseract
    
  3. Install poppler

    • Linux
    sudo apt-get install poppler-utils
    
    • MacOS
    brew install poppler
    
    • Windows
      1. Download the latest poppler package from @oschwartz10612's build, which is the most up-to-date version.
      2. Move the extracted directory to the desired location on your system.
      3. Add the bin/ directory to your PATH.
      4. Verify the installation by opening cmd and checking that pdftoppm -h runs.
  4. Install dependencies:

    pip install -r requirements_book.txt

  5. Use this code:

from recognize_book import recognize_book

recognize_book(book_path="path/to/your/book.pdf", text_output_path="book_text.txt")
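
To process a whole folder of PDF files, the call above can be wrapped in a loop. This is a minimal sketch under stated assumptions: `recognize_books_in_dir` is a hypothetical helper that is not part of the repository, and it assumes only that the function passed in accepts the `book_path` and `text_output_path` keyword arguments shown above. The recognizer is passed as a parameter so the helper can be tried with a stand-in function.

```python
from pathlib import Path
from typing import Callable, List


def recognize_books_in_dir(input_dir: str, output_dir: str,
                           recognize_fn: Callable[..., object]) -> List[Path]:
    """Run `recognize_fn` on every PDF in `input_dir`; return the text file paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        # One text file per book, named after the PDF.
        text_path = out_dir / f"{pdf_path.stem}.txt"
        # Same keyword arguments as in the recognize_book call shown above.
        recognize_fn(book_path=str(pdf_path), text_output_path=str(text_path))
        outputs.append(text_path)
    return outputs
```

With the repository script available, this would be called as `recognize_books_in_dir("books", "texts", recognize_book)`, where the two directory names are placeholders.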

Metrics on the test set

  • CER (Character Error Rate) = 0.095
  • WER (Word Error Rate) = 0.298
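
The card does not state how these metrics were computed. A common definition divides the edit (Levenshtein) distance between the recognized text and the reference by the reference length, counted in characters for CER and in words for WER. The sketch below illustrates that definition; the function names are hypothetical, and splitting words on whitespace is an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character errors divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a CER of 0.095 corresponds to roughly one character error per ten reference characters.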