---
language: en
tags:
- multimodal
- text
- image
- image-to-text
license: cc-by-nc-4.0
datasets:
- HuggingFaceM4/OBELICS
- laion/laion2B-en
- coyo-700m
- mmc4
pipeline_tag: text-generation
inference: true
---

## Paper

More details can be found in our paper.

## Quickstart

Use the code below to get started with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("infimm/infimm-hd", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/test.jpg"},  # replace with your image path
            "Please describe the image in detail.",
        ],
    }
]

inputs = processor(prompts)

# Load the model in bf16 and move it to GPU 0.
model = (
    AutoModelForCausalLM.from_pretrained(
        "infimm/infimm-hd",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    .to(0)
    .eval()
)

# Cast the image tensors to bf16 and move all inputs to the model's device.
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
for k in inputs:
    inputs[k] = inputs[k].to(model.device)

generated_ids = model.generate(
    **inputs,
    min_new_tokens=0,
    max_new_tokens=256,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
```

For a sketch of a multi-image prompt, see the example at the end of this card.

## License

This project is licensed under **CC BY-NC 4.0**. The copyright of the images belongs to the original authors. See [LICENSE](LICENSE) for more information.

## Contact Us

Please feel free to contact us via email at [infimmbytedance@gmail.com](mailto:infimmbytedance@gmail.com) if you have any questions.
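
## Multi-Image Prompts

The `content` list in the quickstart mixes image entries and text strings, so interleaved prompts may also work. Below is a minimal sketch, assuming the processor accepts more than one `{"image": ...}` entry per turn (an assumption, not something confirmed by this card); the file paths are placeholders. It reuses the `processor` and `model` objects from the quickstart.

```python
# Hypothetical two-image comparison prompt; the schema mirrors the
# single-image quickstart above, but multi-image support is assumed.
prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/first.jpg"},   # placeholder path
            {"image": "/xxx/second.jpg"},  # placeholder path
            "What are the differences between these two images?",
        ],
    }
]

inputs = processor(prompts)
# Then cast, move to device, and call model.generate as in the quickstart.
```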