---
license: apache-2.0
base_model:
- PleIAs/OCRonos-Vintage
library_name: transformers
language:
- es
pipeline_tag: text-generation
tags:
- OCR
- text-correction
- ocr-correction
- archives
- GPT2
- history
- SLM
- pre-train
- drama
---

**Filiberto 124M Instruct** is a small specialized model for OCR correction of Spanish Golden Age dramas, based on [OCRonos-Vintage](https://hf.co/PleIAs/OCRonos-Vintage), a model for OCR correction of cultural heritage archives.

With only 124 million parameters, Filiberto 124M Instruct runs easily on CPU and can correct text at scale on GPU (>10k tokens/second).

## Training

The pre-training data consisted of individual verses and their corrections taken from the [TEXORO](https://etso.es/texoro) corpus, via a collaboration with [ETSO](https://etso.es/), totalling ~5 million tokens. Pre-training ran for 5 epochs with Levanter (500 steps total, each processing 1024 sequences of 512 tokens) on a TPUv4-32 for 15 minutes. Tokenization is currently done with the GPT-2 tokenizer.

## Example of OCR correction

Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` for the OCRized text submission and `### Correction ###` for the generated correction.
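As an illustration, an instruction example follows this template. The `format_example` helper and the sample verse pair below are just a sketch of the format described above, not part of the released training code:

```python
# Assemble one instruction example in Filiberto's hard-coded prompt format.
# (format_example is an illustrative helper, not part of the released code.)
def format_example(ocr_text: str, correction: str) -> str:
    return f"### Text ###\n{ocr_text}\n\n\n### Correction ###\n{correction}"

example = format_example(
    "y otras mil vezes los braços.",  # raw OCRized verse
    "y otras mil veces los brazos.",  # hypothetical corrected verse
)
print(example)
```

At inference time, only the `### Text ###` half is supplied and the model completes the `### Correction ###` half.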
Filiberto 124M Instruct can be imported like any GPT-2-style model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Use a GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

Inference can then be run like this:

```python
# Generate a correction for an OCRized prompt
def ocr_correction(prompt, max_new_tokens=600):
    prompt = f"### Text ###\n{prompt}\n\n\n### Correction ###\n"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate the corrected text
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        top_k=50,
    )

    # Decode and keep only the part after the correction marker
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

ocr_result = ocr_correction("Otra vez, Don Iuan, me dad,\ny otras mil vezes los braços.")
print(ocr_result)
```

For example, this OCRized drama excerpt:

```
Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis? Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
```

would yield a correction with modernized spelling along these lines:

```
Otra vez, don Juan, me dad,
y otras mil veces los brazos.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
¿Cómo venís? Yo me siento
tan alegre, tan ufano,
tan venturoso, tan vano,
que no podrá el pensamiento
encareceros jamás
las venturas que poseo,
porque el pensamiento creo
```
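Since pre-training used individual verses and 512-token sequences, long plays are best corrected in small chunks rather than in a single call. A minimal sketch of such pre-chunking (the `split_into_verses` helper is an illustrative assumption, not part of the model's API):

```python
# Split an OCRized drama into small groups of verse lines so that each
# chunk stays well under the 512-token context seen during pre-training.
# (split_into_verses is an illustrative helper, not part of the model card.)
def split_into_verses(text: str, lines_per_chunk: int = 8) -> list[str]:
    lines = [line for line in text.splitlines() if line.strip()]
    return [
        "\n".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]

drama = """Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad."""

chunks = split_into_verses(drama, lines_per_chunk=2)
```

Each chunk can then be passed to `ocr_correction` and the corrected pieces re-joined in order.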