---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- HuggingFaceM4/WebSight
---

# Model Card for Llama-3.2-11B-Vision-WebSight

Llama 3.2 Vision Instruct fine-tuned on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight, a dataset of website screenshots paired with their HTML/CSS code. Given a screenshot, the model generates code for a matching web page.

## Model Details

### Model Description

* **Developed by:** pdufour
* **Model type:** Vision Language Model
* **Language(s) (NLP):** English
* **License:** MIT
* **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

## How to Get Started with the Model

```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load the base model in 4-bit (requires bitsandbytes) and attach the LoRA adapter.
model = PeftModel.from_pretrained(
    AutoModelForVision2Seq.from_pretrained(
        "meta-llama/Llama-3.2-11B-Vision-Instruct",
        device_map="auto",
        load_in_4bit=True,
    ),
    "pdufour/Llama-3.2-11B-Vision-WebSight",
)
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    # Pass all processor outputs (input_ids, attention_mask, pixel values, ...) to generate.
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

10k samples from the HuggingFaceM4/WebSight vision-language dataset (website screenshots paired with their HTML code), used for instruction tuning.

### Training Procedure

#### Training Hyperparameters

* **Training regime:** Fine-tuning with LoRA
* **Learning rate:** 0.0002
* **Batch size:** 10
* **Gradient accumulation steps:** 1
* **Number of epochs:** 3.0
* **Optimizer:** adamw_torch_fused
* **LR scheduler type:** constant
* **Weight decay:** 0.0
* **FP16 Training:** False

A hedged sketch of a PEFT setup matching these values is included at the end of this card.

### Speeds, Sizes, Times

* **Training Duration:** Unknown
* **Number of Trainable Parameters:** Unknown
* **Model Size:** 0.08 GB (adapter weights)

## Evaluation

### Metrics

#### Results

* **epoch:** 0.9000
* **grad_norm:** 0.2568
* **learning_rate:** 0.0002
* **loss:** 0.0791
* **step:** 900.0000

## Technical Specifications

### Model Architecture and Objective

LoRA-tuned vision-language model based on the Llama 3.2 Vision architecture; the training objective is to generate HTML/CSS code for a given web page screenshot.

### Compute Infrastructure

* **Hardware Type:** GPU
* **Number of GPUs:** 1

### Software

* **Framework versions:**
  * PEFT 0.13.2
  * PyTorch 2.5.0+cu121

## Model Card Contact

For questions about this model, please file an issue on the GitHub repository.
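
## Appendix: Training Setup Sketch

For reference, the snippet below is a minimal, hedged sketch of a LoRA setup consistent with the hyperparameters listed above. Only the `TrainingArguments` values are taken from this card; the LoRA rank, alpha, dropout, and target modules are assumptions, and the data collation over WebSight is omitted because it is not documented here.

```python
# Hypothetical reconstruction of the fine-tuning setup. The LoRA rank, alpha,
# dropout, and target modules below are assumptions -- only the
# TrainingArguments values come from the "Training Hyperparameters" section.
from transformers import AutoModelForVision2Seq, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", device_map="auto"
)

lora_config = LoraConfig(
    r=8,                                                       # assumption: rank not reported
    lora_alpha=16,                                             # assumption
    lora_dropout=0.05,                                         # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# These values mirror the "Training Hyperparameters" section of this card.
training_args = TrainingArguments(
    output_dir="llama-3.2-11b-vision-websight-lora",
    learning_rate=2e-4,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=1,
    num_train_epochs=3.0,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    weight_decay=0.0,
    fp16=False,
)

# A Trainer with an image/text collator over HuggingFaceM4/WebSight would
# complete the loop; that part is omitted because the card does not document
# the preprocessing pipeline.
```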