---
license: apache-2.0
base_model:
- PleIAs/OCRonos-Vintage
library_name: transformers
language:
- es
pipeline_tag: text-generation
tags:
- OCR
- text-correction
- ocr-correction
- archives
- GPT2
- history
- SLM
- pre-train
- drama
---

**Filiberto 124M Instruct** is a small specialized model for OCR correction of Spanish Golden Age dramas, based on [OCRonos-Vintage](https://hf.co/PleIAs/OCRonos-Vintage), a model for OCR correction of cultural heritage archives.

With only 124 million parameters, Filiberto 124M Instruct runs easily on CPU and can correct text at scale on GPU (>10k tokens/second).

## Training

The pre-training data consisted of individual verses and their corrections taken from the [TEXORO](https://etso.es/texoro) corpus, via a collaboration with [ETSO](https://etso.es/), totalling ~5 million tokens. Pre-training ran for 5 epochs with Levanter (500 steps total, each processing 1024 sequences of 512 tokens) on a TPUv4-32 for 15 minutes. Tokenization is currently done with the GPT-2 tokenizer.

## Example of OCR correction

Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` for the OCRized text submission and `### Correction ###` for the generated correction.
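As an illustration, an instruction example follows this template. The `format_example` helper and the sample verse pair below are just a sketch of the format described above, not part of the released training code:

```python
# Assemble one instruction example in Filiberto's hard-coded prompt format.
# (format_example is an illustrative helper, not part of the released code.)
def format_example(ocr_text: str, correction: str) -> str:
    return f"### Text ###\n{ocr_text}\n\n\n### Correction ###\n{correction}"

example = format_example(
    "y otras mil vezes los braços.",  # raw OCRized verse
    "y otras mil veces los brazos.",  # hypothetical corrected verse
)
print(example)
```

At inference time, only the `### Text ###` half is supplied and the model completes the `### Correction ###` half.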
Filiberto 124M Instruct can be imported like any GPT-2-style model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Use a GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

Inference can then be run like this:

```python
# Generate a correction for an OCRized prompt
def ocr_correction(prompt, max_new_tokens=600):
    prompt = f"### Text ###\n{prompt}\n\n\n### Correction ###\n"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate the corrected text
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        top_k=50,
    )

    # Decode and keep only the part after the correction marker
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

ocr_result = ocr_correction("Otra vez, Don Iuan, me dad,\ny otras mil vezes los braços.")
print(ocr_result)
```

For example, this OCRized drama excerpt:

```
Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis? Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
```

would yield a correction with modernized spelling along these lines:

```
Otra vez, don Juan, me dad,
y otras mil veces los brazos.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
¿Cómo venís? Yo me siento
tan alegre, tan ufano,
tan venturoso, tan vano,
que no podrá el pensamiento
encareceros jamás
las venturas que poseo,
porque el pensamiento creo
```
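Since pre-training used individual verses and 512-token sequences, long plays are best corrected in small chunks rather than in a single call. A minimal sketch of such pre-chunking (the `split_into_verses` helper is an illustrative assumption, not part of the model's API):

```python
# Split an OCRized drama into small groups of verse lines so that each
# chunk stays well under the 512-token context seen during pre-training.
# (split_into_verses is an illustrative helper, not part of the model card.)
def split_into_verses(text: str, lines_per_chunk: int = 8) -> list[str]:
    lines = [line for line in text.splitlines() if line.strip()]
    return [
        "\n".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]

drama = """Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad."""

chunks = split_into_verses(drama, lines_per_chunk=2)
```

Each chunk can then be passed to `ocr_correction` and the corrected pieces re-joined in order.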