---
license: apache-2.0
inference: false
---

# LLaVA-RLHF Model Card

## Model details

**Model type:**
LLaVA-RLHF is an aligned, end-to-end trained large multimodal model that combines a CLIP vision encoder with Vicuna for general-purpose visual and language understanding. It achieves strong visual reasoning and perception capabilities in the spirit of the multimodal GPT-4.
Trained with Factually Augmented RLHF, LLaVA-RLHF is reported to be more helpful and to hallucinate less than LLaVA and other open-source LMMs.

**Usage:**
**NOTE: The RLHFed model is trained with LoRA and the bfloat16 data type.**
Users must apply the PEFT LoRA adapter on top of the LLaVA-SFT+ model.

```python
import torch
from peft import PeftModel
# Assumption: LlavaLlamaForCausalLM is importable from the LLaVA codebase
# that ships with the LLaVA-RLHF repository.
from llava.model import LlavaLlamaForCausalLM

dtype = torch.bfloat16

model_path = "LLaVA-RLHF-13b-v1.5-336/sft_model"
lora_path = "LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model"

# Load the LLaVA-SFT+ base model in bfloat16 on a single GPU.
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map={"": "cuda:0"},
    torch_dtype=dtype,
)
# Apply the RLHF LoRA adapter on top of the SFT model.
model = PeftModel.from_pretrained(
    model,
    lora_path,
)
```

**Model date:**
LLaVA was trained in Sept 2024.

**Paper or resources for more information:**
https://llava-rlhf.github.io/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/llava-rlhf/LLaVA-RLHF/issues

## Intended use

**Primary intended uses:**
The primary use of LLaVA-RLHF is research on large multimodal chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset

- 595K filtered image-text pairs from CC3M.
- 150K GPT-generated multimodal instruction-following chat data.
- 83K VQA v2 instruction-following VQA data.
- 16K A-OKVQA instruction-following CoT-VQA data.
- 23K FLICKR instruction-following spotting captioning data.
- 10K LLaVA-based human preference data.
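
To get a rough sense of how the sources listed above compare in size, the counts can be tallied with a few lines of Python. This is only an illustrative sketch: the sizes are the rounded figures from this card, the names `DATASET_SIZES_K` and `mixture_shares` are hypothetical, and the card does not state how the sources are split across training stages or how they are weighted during training, so the shares below describe raw example counts only.

```python
# Rounded sizes copied from the list above, in thousands of examples.
# Assumption: every entry is treated as a plain example count.
DATASET_SIZES_K = {
    "CC3M filtered image-text pairs": 595,
    "GPT-generated multimodal instruction-following chat": 150,
    "VQA v2 instruction-following VQA": 83,
    "A-OKVQA instruction-following CoT-VQA": 16,
    "FLICKR instruction-following spotting captioning": 23,
    "LLaVA-based human preference data": 10,
}


def mixture_shares(sizes):
    """Return each source's fraction of the summed example count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    total_k = sum(DATASET_SIZES_K.values())
    print(f"Total: {total_k}K examples")
    for name, share in mixture_shares(DATASET_SIZES_K).items():
        print(f"{name}: {share:.1%}")
```

Summed this way, the listed sources come to roughly 877K examples, with the CC3M image-text pairs making up the largest share by count.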