zixianma committed f2405a0 (parent: 2798911): Update README.md
---
license: cc-by-nc-4.0
base_model: lmms-lab/llava-onevision-qwen2-7b-mid-stage-a4
model-index:
- name: llama3-siglip-taco-8b
  results: []
---

# 🌮 TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

<h3 align="left"> <a href="https://taco-project.github.io/">🌐 Website</a> | <a href="https://arxiv.org/pdf/2412.05479">📑 arXiv</a> | <a href="">🤗 Model Weights</a> | <a href="">💻 Demo</a></h3>

<h5 align="left"> If you like our project or are interested in its updates, please star us :) Thank you! ⭐ </h5>

## Model description
We introduce TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA): it executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, then integrates both the thoughts and the action outputs to produce a coherent response. Our TACO models outperform the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning.
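
The inference loop described above can be sketched as follows. This is a minimal illustration, not TACO's actual implementation: the `Calculate` tool stub and the `(thought, action, argument)` trace format are assumptions made for exposition; the real action space and output format are defined in our GitHub repository.

```python
# Minimal sketch of a chain-of-thought-and-action (CoTA) loop.
# The tool stub and trace format are illustrative assumptions,
# not TACO's actual implementation.

def calculate(expr: str) -> str:
    # Stand-in for the Calculate tool: evaluate an arithmetic expression.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"Calculate": calculate}

def run_cota(steps):
    """Execute a list of (thought, action, argument) steps.

    A step whose action is "Terminate" returns its argument as the final
    answer; every other action is dispatched to an external tool, and the
    tool's output is appended to the context the model conditions on.
    """
    context = []
    for thought, action, arg in steps:
        if action == "Terminate":
            return arg, context
        observation = TOOLS[action](arg)
        context.append((thought, action, observation))
    return None, context

# Toy trace: read two price tags, sum them, then answer.
answer, context = run_cota([
    ("Sum the two price tags.", "Calculate", "12 + 30"),
    ("The total is 42.", "Terminate", "42"),
])
```

The key design point is that action outputs (here, the calculator result) are fed back into the context before the model emits its final `Terminate` step.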
<p align="center">
  <img src="teaser.png" width="1000" style="margin-bottom: 0.2;"/>
</p>
<p align="center">Figure 1. TACO vs. other multi-modal models</p>

## Usage

See our [GitHub repository](https://github.com/SalesforceAIResearch/TACO).

## Intended uses & limitations

This model is intended for complex, multi-step, multi-modal question-answering tasks. It is trained to answer visual questions using some of the following 15 actions: `OCR`, `LocalizeObjects`, `GetObjects`, `EstimateRegionDepth`, `EstimateObjectDepth`, `Crop`, `ZoomIn`, `QueryLanguageModel`, `GetImageToImagesSimilarity`, `GetImageToTextsSimilarity`, `GetTextToImagesSimilarity`, `DetectFaces`, `QueryKnowledgeBase`, `Calculate`, and `SolveMathEquation`. Additionally, the `Terminate` action is supported for the model to provide a final answer.
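
A caller consuming the model's outputs needs to recognize which of these actions was requested and reject anything outside this action space. Below is a minimal sketch that assumes, hypothetically, that actions appear as `Action("argument")` calls; the exact output format the released checkpoints emit is specified in our GitHub repository.

```python
import re

# The 15 tool actions plus Terminate, as listed above.
SUPPORTED_ACTIONS = {
    "OCR", "LocalizeObjects", "GetObjects", "EstimateRegionDepth",
    "EstimateObjectDepth", "Crop", "ZoomIn", "QueryLanguageModel",
    "GetImageToImagesSimilarity", "GetImageToTextsSimilarity",
    "GetTextToImagesSimilarity", "DetectFaces", "QueryKnowledgeBase",
    "Calculate", "SolveMathEquation", "Terminate",
}

def parse_action(text: str):
    """Extract an (action, argument) pair from a line like Calculate("1+2").

    Returns None for malformed lines or unsupported actions. The call
    syntax here is an illustrative assumption, not TACO's actual format.
    """
    m = re.match(r'\s*(\w+)\((.*)\)\s*$', text)
    if not m or m.group(1) not in SUPPORTED_ACTIONS:
        return None
    return m.group(1), m.group(2).strip('"\'')
```

Validating against a fixed registry like this keeps a tool-calling harness from executing arbitrary action names that a model might hallucinate.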

For other types of tasks that don't benefit from the actions above, you might need to train a new model or further finetune this one with other actions.

## Training and evaluation data

See our [paper](https://arxiv.org/pdf/2412.05479) for details.

## Training procedure and hyperparameters

See our [paper](https://arxiv.org/pdf/2412.05479) for details.

## Training results

See our [paper](https://arxiv.org/pdf/2412.05479) for details.

### License information

This release is for research purposes only, in support of an academic paper. This repository is licensed under the noncommercial license [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

### Citation

Please cite us if you find our repository helpful. Thank you!

```bibtex
@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action},
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479},
}
```