Running out of memory with 12GB VRAM

#17 opened by redstar6486

Hi.
I'm trying to run this model on an RTX 3060 12GB, but I get a CUDA out-of-memory error. Is there any workaround that doesn't involve running it on the CPU? I'm translating a lot of text (subtitle files), and setting device_map to "auto" instead of "cuda" just makes it very slow.
Thank you in advance.
This is the code I'm using:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class Translator:
    def __init__(self, model_name_or_path: str = 'google/madlad400-3b-mt') -> None:
        # Load the model weights onto the GPU (full float32 precision by default).
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, device_map="cuda")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def translate(self, input_text: str, target_language: str = "fa") -> str:
        # MADLAD-400 expects the target language as a "<2xx>" token prefixed to the input.
        inputs = self.tokenizer(f"<2{target_language}> {input_text}", return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=250)
        output = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        return output[0]
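
One workaround I have been considering (just a sketch based on the transformers docs, not yet tested on my card) is loading the weights in bfloat16: the float32 weights of a 3B-parameter model alone are roughly 12 GB, so half precision should cut that in half before activations are even counted.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/madlad400-3b-mt"

# Loading in bfloat16 roughly halves VRAM compared to float32 and is
# supported on Ampere GPUs such as the RTX 3060.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("<2fa> Hello, how are you?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=250)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

Would something like this (or 8-bit quantization via bitsandbytes) be the recommended approach here?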
