--- base_model: meta-llama/Meta-Llama-3-8B library_name: peft license: llama3 tags: - axolotl - generated_from_trainer model-index: - name: llama-3-8b-ocr-correction results: [] datasets: - pbevan11/synthetic-ocr-correction-gpt4o repository: https://github.com/pbevan1/finetune-llm-ocr-correction --- [Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
See axolotl config axolotl version: `0.4.1` ```yaml base_model: meta-llama/Meta-Llama-3-8B model_type: AutoModelForCausalLM tokenizer_type: AutoTokenizer load_in_8bit: false load_in_4bit: true strict: false lora_fan_in_fan_out: false data_seed: 49 seed: 49 datasets: - path: ft_data/alpaca_data.jsonl type: alpaca dataset_prepared_path: last_run_prepared val_set_size: 0.1 output_dir: ./qlora-alpaca-out hub_model_id: pbevan11/llama-3-8b-ocr-correction adapter: qlora lora_model_dir: sequence_len: 4096 sample_packing: true pad_to_sequence_len: true lora_r: 32 lora_alpha: 16 lora_dropout: 0.05 lora_target_linear: true lora_fan_in_fan_out: lora_target_modules: wandb_project: ocr-ft wandb_entity: sncds wandb_name: test gradient_accumulation_steps: 4 micro_batch_size: 2 # was 16 eval_batch_size: 2 # was 16 num_epochs: 3 optimizer: paged_adamw_32bit lr_scheduler: cosine learning_rate: 0.0002 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: false gradient_checkpointing: true early_stopping_patience: resume_from_checkpoint: local_rank: logging_steps: 1 xformers_attention: flash_attention: true loss_watchdog_threshold: 5.0 loss_watchdog_patience: 3 warmup_steps: 10 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 1 debug: deepspeed: weight_decay: 0.0 fsdp: fsdp_config: special_tokens: pad_token: "<|end_of_text|>" ```

# llama-3-8b-ocr-correction This model is a qlora fine-tuned adapter for [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on the [pbevan11/synthetic-ocr-correction-gpt4o](https://huggingface.co/datasets/pbevan11/synthetic-ocr-correction-gpt4o) dataset. It achieves the following results on the evaluation set: - Loss: 0.1778 ## Usage First, download the model ```python from peft import AutoPeftModelForCausalLM from transformers import AutoTokenizer model_id='pbevan11/llama-3-8b-ocr-correction' model = AutoPeftModelForCausalLM.from_pretrained(model_id).cuda() tokenizer = AutoTokenizer.from_pretrained(model_id) tokenizer.pad_token = tokenizer.eos_token ``` Then, construct the prompt template like so: ```python def prompt(instruction, inp): return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {inp} ### Response: """ def prompt_tok(instruction, inp, return_ids=False): _p = prompt(instruction, inp) input_ids = tokenizer(_p, return_tensors="pt", truncation=True).input_ids.cuda() out_ids = model.generate(input_ids=input_ids, max_new_tokens=5000, do_sample=False) ids = out_ids.detach().cpu().numpy() if return_ids: return out_ids full_output = tokenizer.batch_decode(ids, skip_special_tokens=True)[0] response_start = full_output.find("### Response:") if response_start != -1: return full_output[response_start + len("### Response:"):] else: return full_output[len(_p):] ``` Finally, you can get predictions like this: ```python # model inputs instruction = "You are an assistant that takes a piece of text that has been corrupted during OCR digitisation, and produce a corrected version of the same text." inp = "Do Not Kule Oi't hy.er-l'rieed AjijqIi: imac - Analyst (fteuiers) Hcuiers - A | ) | ilf, <;/) in |) nter |iic . conic! deeiilf. l.o sell n lower-|)rieofl wersinn oi its Macintosh cornutor to nttinct ronsnnu-rs already euami'red ot its iPod music jiayo-r untl annoyoil. by sccnrit.y problems ivitJi Willtlows PCs , Piper.iaffray analyst. (Jcne Muster