klldmofashi committed
Commit e87d80e
1 Parent(s): c6c4425

Update README

Files changed (1)
  1. README.md +45 -2
README.md CHANGED
@@ -7,7 +7,50 @@ tags:
  - VLM
  ---
 
- VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
-
-
- The code is release at https://github.com/Efficient-Large-Model/VILA. Welcome to have a try and share your feedbacks!
+ # VILA Model Card
+
+ ## Model details
+
+ **Model type:**
+ VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling multi-image VLM. VILA is deployable on the edge, including on Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance. VILA unveils appealing capabilities, including multi-image reasoning, in-context learning, visual chain-of-thought, and better world knowledge.
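+
+ As a minimal sketch of the idea behind 4-bit weight (W4) quantization that schemes such as AWQ build on, the snippet below shows plain group-wise asymmetric quantization of a weight matrix. The helper names are illustrative; this omits AWQ's activation-aware scaling and weight packing and is not the TinyChat implementation.
+
+ ```python
+ import torch
+
+ def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
+     """Group-wise asymmetric 4-bit quantization of a 2-D weight matrix.
+     Illustrative only: real W4 kernels pack two 4-bit values per byte and
+     fuse dequantization into the matmul."""
+     out_features, in_features = w.shape
+     groups = w.reshape(out_features, in_features // group_size, group_size)
+     w_min = groups.amin(dim=-1, keepdim=True)
+     w_max = groups.amax(dim=-1, keepdim=True)
+     scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 levels for 4 bits
+     zero = (-w_min / scale).round()
+     q = (groups / scale + zero).round().clamp(0, 15).to(torch.uint8)
+     return q, scale, zero
+
+ def dequantize_w4(q, scale, zero, shape):
+     """Reconstruct an approximate floating-point weight from the quantized form."""
+     return ((q.float() - zero) * scale).reshape(shape)
+
+ # Round-trip sanity check on a random weight matrix.
+ w = torch.randn(1024, 1024)
+ q, scale, zero = quantize_w4_groupwise(w)
+ w_hat = dequantize_w4(q, scale, zero, w.shape)
+ print("max abs reconstruction error:", (w - w_hat).abs().max().item())
+ ```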
+
+ **Model date:**
+ VILA-7b was trained in Feb 2024.
+
+ **Paper or resources for more information:**
+ https://github.com/Efficient-Large-Model/VILA
+
+ ```
+ @misc{lin2023vila,
+       title={VILA: On Pre-training for Visual Language Models},
+       author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
+       year={2023},
+       eprint={2312.07533},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV}
+ }
+ ```
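+
+ The checkpoints are hosted on the Hugging Face Hub. As a minimal sketch (the repo id `Efficient-Large-Model/VILA-7b` is an assumption; substitute the checkpoint you actually want), the weights can be fetched locally and then used with the inference scripts in the GitHub repository above:
+
+ ```python
+ # Minimal sketch: download the released weights to a local directory.
+ # The repo id below is an assumption; replace it with the desired VILA checkpoint.
+ from huggingface_hub import snapshot_download
+
+ local_dir = snapshot_download(repo_id="Efficient-Large-Model/VILA-7b")
+ print("Checkpoint downloaded to:", local_dir)
+ ```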
+
+ ## License
+ - The code is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file.
+ - The pretrained weights are released under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
+ - The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
+   - [Model License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA
+   - [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI
+   - [Dataset Licenses](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/LICENSE) for each dataset used during training
+
+ **Where to send questions or comments about the model:**
+ https://github.com/Efficient-Large-Model/VILA/issues
+
+ ## Intended use
+ **Primary intended uses:**
+ The primary use of VILA is research on large multimodal models and chatbots.
+
+ **Primary intended users:**
+ The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+ ## Training dataset
+ See [Dataset Preparation](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/README.md) for more details.
+
+ ## Evaluation dataset
+ A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.