---
language: en
tags:
- multimodal
- text
- image
- image-to-text
datasets:
- HuggingFaceM4/OBELICS
- laion/laion2B-en
- coyo-700m
- mmc4
pipeline_tag: text-generation
inference: true
---

## Paper

More details can be found in our paper at https://arxiv.org/abs/2403.01487. We have released the pretrained model and the PyTorch code at https://github.com/InfiMM/infimm-hd/. Feel free to build your own model from our pretrained weights.

## Quickstart

Use the code below to get started with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("Infi-MM/infimm-hd", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/test.jpg"},  # replace with the path to your image
            "Please describe the image in detail.",
        ],
    }
]

inputs = processor(prompts)

# Load the model in bfloat16 on GPU 0.
model = (
    AutoModelForCausalLM.from_pretrained(
        "Infi-MM/infimm-hd",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    .to(0)
    .eval()
)

# Cast the image tensor to bfloat16 and move all inputs to the model's device.
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
for k in inputs:
    inputs[k] = inputs[k].to(model.device)

generated_ids = model.generate(
    **inputs,
    min_new_tokens=0,
    max_new_tokens=256,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
```

## License

This project is licensed under **CC BY-NC 4.0**. The copyright of the images belongs to the original authors. See [LICENSE](LICENSE) for more information.

## Contact Us

Please feel free to contact us via email at [infimmbytedance@gmail.com](mailto:infimmbytedance@gmail.com) if you have any questions.

## Citation

```latex
@misc{liu2024infimmhd,
      title={InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding},
      author={Haogeng Liu and Quanzeng You and Xiaotian Han and Yiqi Wang and Bohan Zhai and Yongfei Liu and Yunzhe Tao and Huaibo Huang and Ran He and Hongxia Yang},
      year={2024},
      eprint={2403.01487},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```