kyusonglee committed
Commit b7bfbfe • 1 Parent(s): 359d7b1

Update README.md

Files changed (1)
  1. README.md +79 -3
README.md CHANGED
@@ -1,3 +1,79 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ # OmChat: A Family of Powerful Native Multimodal Language Models
+ We are thrilled to announce the release of OmChat Beta 2.0, a research version of our models from Om AI. This release pairs the Qwen2 7B LLM base with the InternViT-6B vision tower to form the OmChat Beta 13B model. The models are open-source and available to researchers in the multimodal field, with the aim of advancing research and contributing to the progress of the AI ecosystem.
+
+ In the near future, we plan to release OmChat Beta 2.1, which will add the long-context support described in the OmChat paper, as well as a lighter version of the model. We will continue to publish our latest versions for research purposes. We evaluated the models on the OpenCompass benchmarks.
+
+ ## Updates
+ * 08/10/2024: The OmChat open-source project has been unveiled. 🎉
+ * 07/06/2024: [The OmChat research paper has been published.](https://arxiv.org/abs/2407.04923)
+
+ ### An Example with Hugging Face Transformers
+ Download the model from Hugging Face:
+ ```bash
+ git lfs install
+ git clone https://huggingface.co/omlab/omchat-v2.0-13B-single-beta_hf
+ ```
+
+ ```python
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoProcessor
+
+ # Hub ID of the model. To load the local clone from the step above instead,
+ # pass the path of the cloned directory.
+ model_id = "omlab/omchat-v2.0-13B-single-beta_hf"
+ model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float16).cuda().eval()
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+ # Fetch a sample image and build the model inputs.
+ url = "https://www.ilankelman.org/stopsigns/australia.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ prompt = "What's the content of the image?"
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
+
+ # Greedy decoding (do_sample=False), without gradient tracking.
+ with torch.inference_mode():
+     output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False, eos_token_id=model.generation_config.eos_token_id, pad_token_id=processor.tokenizer.pad_token_id)
+
+ # Decode only the newly generated tokens (everything after the prompt).
+ outputs = processor.tokenizer.decode(output_ids[0, inputs.input_ids.shape[1]:]).strip()
+ print(outputs)
+ # The image features a stop sign in front of a Chinese archway, with a black car driving past. The stop sign is located on the left side of the scene, while the car is on the right side. There are also two statues of lions on either side of the archway, adding to the cultural ambiance of the scene.<|im_end|>
+ ```
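The example slices `output_ids` at `inputs.input_ids.shape[1]`, which suggests the generated sequence begins with the prompt tokens, and its sample output still ends with `<|im_end|>`. A minimal sketch of that post-processing follows, using plain Python lists and strings in place of tensors. The names `new_tokens`, `strip_end_marker`, and `END_MARKER` are hypothetical helpers for illustration, not part of the OmChat code.

```python
from typing import List, Sequence

# End-of-turn marker as it appears in the sample output (assumed to be literal text).
END_MARKER = "<|im_end|>"


def new_tokens(output_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Return only the tokens that follow the first prompt_len prompt tokens."""
    return list(output_ids[prompt_len:])


def strip_end_marker(text: str, marker: str = END_MARKER) -> str:
    """Trim whitespace and remove one trailing end-of-turn marker, if present."""
    text = text.strip()
    if text.endswith(marker):
        text = text[: -len(marker)].rstrip()
    return text


if __name__ == "__main__":
    # Toy stand-ins: three prompt tokens followed by two generated tokens.
    prompt_ids = [101, 102, 103]
    generated = prompt_ids + [7, 8]
    print(new_tokens(generated, len(prompt_ids)))
    print(strip_end_marker("A stop sign in front of an archway.<|im_end|>"))
```

With a Hugging Face tokenizer, passing `skip_special_tokens=True` to `decode` is the usual way to drop such markers. Whether `<|im_end|>` is registered as a special token for this model is an assumption worth checking before relying on it.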
+
+ ### Available HF Models from Om AI
+ - [omchat-v2.0-13B-single-beta_hf](https://huggingface.co/omlab/omchat-v2.0-13B-single-beta_hf): Currently supports single images only; we will soon release models with multi-image and video support.
+
+ ## Citation
+ If you find this repository useful, please cite our paper:
+ ```bibtex
+ @article{zhao2024omchat,
+   title={OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding},
+   author={Zhao, Tiancheng and Zhang, Qianqian and Lee, Kyusong and Liu, Peng and Zhang, Lu and Fang, Chunxin and Liao, Jiajia and Jiang, Kelei and Ma, Yibo and Xu, Ruochen},
+   journal={arXiv preprint arXiv:2407.04923},
+   year={2024}
+ }
+ ```
+
+ ## Acknowledgement
+ The codebase and models are built upon the following projects:
+ - [LLaVA](https://github.com/haotian-liu/LLaVA)
+ - [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
+ - [InternVL2](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/)
+ - [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA)
+ - [Qwen2](https://github.com/QwenLM/Qwen2)
+
+ ## Projects from Om AI Team
+ If you are interested in multimodal algorithms, large language models, and agent technologies, we invite you to explore our other research:
+ 🔆 [OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer](https://arxiv.org/abs/2406.16620)
+ 🏠 [Github Repository](https://github.com/om-ai-lab/OmAgent)
+
+ 🔆 [How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection](https://arxiv.org/abs/2308.13177) (AAAI24)
+ 🏠 [Github Repository](https://github.com/om-ai-lab/OVDEval/tree/main)
+
+ 🔆 [OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network](https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cvi2.12268) (IET Computer Vision)
+ 🏠 [Github Repository](https://github.com/om-ai-lab/OmDet)