Aman committed on
Commit
f773f0f
·
verified ·
1 Parent(s): 0e79ded

Update README.md

Files changed (1)
  1. README.md +157 -0
README.md CHANGED
@@ -1,3 +1,160 @@
1
  ---
2
  license: mit
3
  ---
4
+
5
+ # Multi-Modal Experience Inspired AI Creation
6
+
7
+ [![Python](https://img.shields.io/badge/Python-3.7-blue.svg)](#PyTorch)
8
+ [![PyTorch](https://img.shields.io/badge/PyTorch-1.10-green.svg)](#PyTorch)
9
+
10
+ [**Paper**](https://arxiv.org/pdf/2209.02427.pdf) |
11
+ [**Data**](https://github.com/Aman-4-Real/MMTG#Data) |
12
+ [**GitHub**](https://github.com/Aman-4-Real/MMTG)
13
+
14
+ This repository contains the source code and datasets for the ACM MM 2022 paper [Multi-Modal Experience Inspired AI Creation](https://arxiv.org/pdf/2209.02427.pdf) by Cao et al.
15
+
16
+
17
+ # Abstract
18
+ AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academic communities, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. However, in reality, humans usually make creations according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper, we define and solve a novel AI creation problem based on human experiences.
19
+ <details> <summary> More (Click me) </summary> More specifically, we study how to generate texts based on sequential multi-modal information. Compared with the previous works, this task is much more difficult because the designed model has to well understand and adapt the semantics among different modalities and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we firstly design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments by comparing our model with a series of representative baselines, where we can demonstrate significant improvements in our model based on both automatic and human-centered metrics.
20
+ </details>
21
+
22
+ # Before You Start
23
+ - Please note that this work targets AI creation in **Chinese**, so the dataset and model checkpoints below are all in Chinese. However, we have also trained our model on English data, constructed from English poems with the same pipeline, and obtained similarly good generation results. You can construct your own English data (based on English corpora such as poems and English text-image datasets such as [MovieNet](https://movienet.github.io/)) and adapt the model to your own domain if necessary.
24
+ - Some parts of our work are based on the large-scale Chinese multimodal pre-trained model [WenLan (a.k.a. BriVL)](https://arxiv.org/abs/2103.06561). Please refer to [this repo](https://github.com/chuhaojin/WenLan-api-document) for usage details. For an English version, you can replace WenLan with OpenAI CLIP or another multimodal representation model (more details in our paper).
25
+
26
+
27
+ # Setup
28
+ Clone the repository and create a new virtual environment:
29
+ ```
30
+ $ git clone https://github.com/Aman-4-Real/MMTG.git
31
+ $ cd MMTG/
32
+ $ conda create -n mmtg python=3.7
33
+ $ conda activate mmtg
34
+ ```
35
+ Install the Python packages. Change the cudatoolkit version according to your environment if necessary.
36
+ ```
37
+ $ conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
38
+ $ pip install -r requirements.txt
39
+ ```
40
+
41
+
42
+ # Download
43
+ Here are the resources, which you can download at [Hugging Face](https://huggingface.co/Aman/MMTG), [GoogleDrive](https://drive.google.com/drive/folders/1y7yD6s8U7-Vm_n-G4trYdfQzZjVXXgiX?usp=sharing) or [BaiduNetDisk(0dwq)](https://pan.baidu.com/s/1_Xlfz-7MdL1gDi47EoT06g):
44
+ | FileName | Description | Path |
45
+ | - | - | - |
46
+ | \*\_data\_\*.pkl | Train, validation, and test data. | _sharing_link_/data/ |
47
+ | mmtg_ckpt.pth | The checkpoint of MMTG for your reproduction. | _sharing_link_/ckpts/ |
48
+ | GPT2_lyrics_ckpt_epoch00.ckpt | The pre-trained decoder checkpoint. It is based on GPT2 and trained on lyrics corpus. | _sharing_link_/ckpts/ |
49
+ | token_id2emb_dict.pkl | A dict that maps each token in the vocabulary to its WenLan embedding. | _sharing_link_/ |
50
+ ## Data
51
+ The dataset used in our paper is released as follows. Due to copyright issues, we only release the visual features of the images used in our dataset. Each `.pkl` file stores a list, and each item in the list has the following format:
52
+ ```
53
+ {
54
+ 'topic': STRING # the topic words
55
+ 'topic_emb': LIST # embs of the topic words
56
+ 'lyrics': LIST # list of lyrics sentences
57
+ 'img_0_emb': LIST # emb of the 1st image
58
+ 'r_0': STRING # the 1st text
59
+ 'r_0_emb': LIST # emb of the 1st text
60
+ 'img_1_emb': LIST # emb of the 2nd image
61
+ 'r_1': STRING # the 2nd text
62
+ 'r_1_emb': LIST # emb of the 2nd text
63
+ ...,
64
+ 'img_4_emb': LIST # emb of the 5th image
65
+ 'r_4': STRING # the 5th text
66
+ 'r_4_emb': LIST # emb of the 5th text
67
+ 'rating': INT # the sample level, from 1 (least positive) to 5 (most positive).
68
+ }
69
+ ```
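The format above can be checked programmatically before training. The following is a minimal sketch, not part of the released code: the helper names (`expected_keys`, `load_items`, `missing_keys`) and the toy file `toy_data.pkl` are hypothetical, and the five positions (indices 0 to 4) are read off the format listing. As with any pickle file, only load files from a source you trust.

```python
import pickle
import tempfile
from pathlib import Path

NUM_STEPS = 5  # indices 0..4, as in the format listing above


def expected_keys(num_steps=NUM_STEPS):
    """Keys every item should carry according to the format listing."""
    keys = ["topic", "topic_emb", "lyrics", "rating"]
    for i in range(num_steps):
        keys += [f"img_{i}_emb", f"r_{i}", f"r_{i}_emb"]
    return keys


def load_items(path):
    """Load one .pkl file, which is expected to store a list of dict items."""
    with open(path, "rb") as f:
        items = pickle.load(f)
    if not isinstance(items, list):
        raise TypeError(f"expected a list, got {type(items).__name__}")
    return items


def missing_keys(item):
    """Return the expected keys that are absent from one item."""
    return [k for k in expected_keys() if k not in item]


# Toy stand-in so the sketch runs without the real download.
toy_item = {"topic": "toy topic", "topic_emb": [0.0, 0.1],
            "lyrics": ["line one", "line two"], "rating": 5}
for i in range(NUM_STEPS):
    toy_item[f"img_{i}_emb"] = [0.0, 0.1]
    toy_item[f"r_{i}"] = f"text {i}"
    toy_item[f"r_{i}_emb"] = [0.0, 0.1]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_data.pkl"  # hypothetical file name
    with open(toy_path, "wb") as f:
        pickle.dump([toy_item], f)
    items = load_items(toy_path)

print(len(items), missing_keys(items[0]))
```

To inspect the real data, point `load_items` at one of the downloaded `.pkl` files instead of the toy file.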
70
+ For the test data, there are additional keys:
71
+ ```
72
+ {
73
+ 'score_0': {
74
+ 'img_rel': [2, 2], # the relevance score of the 1st image and the 1st & 2nd lyrics sentences (range from 1 to 5).
75
+ 'r_rel': [1, 1], # the relevance score of the 1st text and the 1st & 2nd lyrics sentences (range from 1 to 5).
76
+ 'cmp_rel': [0, 0] # whether the image or the text is more relevant to the lyrics. 0 refers to the image and 2 refers to the text (1 means a tie).
77
+ } # each list above is [rater1_score, rater2_score]
78
+ ...,
79
+ 'score_4': ...
80
+ }
81
+ ```
82
+ You can use this additional labeled information to analyze your parameters (like attention weights) and results.
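One way to use these labels is to average the two raters' scores per position and flag where they disagree on `cmp_rel`. The sketch below assumes every test item carries `score_0` through `score_4` in the layout shown above; `summarize_scores` and the toy item are hypothetical and not part of the released code.

```python
from statistics import mean


def summarize_scores(item, num_steps=5):
    """Average both raters' scores for each score_i entry of a test item."""
    rows = []
    for i in range(num_steps):
        score = item[f"score_{i}"]
        rows.append({
            "step": i,
            "img_rel": mean(score["img_rel"]),
            "r_rel": mean(score["r_rel"]),
            # cmp_rel: 0 = image more relevant, 1 = tie, 2 = text more relevant
            "raters_agree": len(set(score["cmp_rel"])) == 1,
        })
    return rows


# Toy item that only carries the score keys, for illustration.
toy = {f"score_{i}": {"img_rel": [2, 2], "r_rel": [1, 1], "cmp_rel": [0, 0]}
       for i in range(5)}
toy["score_1"] = {"img_rel": [3, 4], "r_rel": [5, 4], "cmp_rel": [2, 1]}

summary = summarize_scores(toy)
for row in summary:
    print(row)
```

The per-position averages can then be compared with, for example, the attention weights your model assigns to the image and text at the same position.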
83
+
84
+ ## Checkpoints
85
+ `mmtg_ckpt.pth`: The checkpoint of MMTG for your reproduction. It is trained on the dataset we released. You can simply load it and use it to generate on your own data or for the demo.
86
+
87
+ `GPT2_lyrics_ckpt_epoch00.ckpt`: The pre-trained decoder checkpoint. As mentioned in our paper, we use a pre-trained GPT2 to initialize our decoder and fine-tune it on our lyrics corpus (phase 1). The full training (phase 2) then starts from this fine-tuned checkpoint.
88
+
89
+ ## Other
90
+ `token_id2emb_dict.pkl`: A dict that maps each token in the vocabulary to its WenLan embedding. It is used to convert token ids to the corresponding embeddings in phase 1 and phase 2, which adapts the text embedding space to the image embedding space. You can also replace WenLan with another pre-trained multimodal representation model (like OpenAI CLIP) to construct an English version.
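The lookup this file enables can be sketched as below. This is an illustration only: it assumes the dict is keyed by integer token ids and stores one embedding list per id, which this README does not spell out, so inspect the real file before relying on it. `ids_to_embeddings` and the toy 3-dimensional vectors are hypothetical.

```python
def ids_to_embeddings(token_ids, id2emb, unk_emb=None):
    """Map a sequence of token ids to their embeddings via a dict lookup.

    Ids missing from the dict fall back to `unk_emb` when it is given,
    otherwise a KeyError is raised.
    """
    embeddings = []
    for token_id in token_ids:
        if token_id in id2emb:
            embeddings.append(id2emb[token_id])
        elif unk_emb is not None:
            embeddings.append(unk_emb)
        else:
            raise KeyError(f"token id {token_id} has no embedding")
    return embeddings


# Toy dict with 3-dimensional vectors; the real embeddings are much larger.
toy_id2emb = {0: [0.0, 0.0, 0.0], 1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6]}

embs = ids_to_embeddings([1, 2, 1], toy_id2emb)
embs_with_unk = ids_to_embeddings([1, 99], toy_id2emb, unk_emb=toy_id2emb[0])
print(embs)
print(embs_with_unk)
```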
91
+
92
+
93
+ # Usage
94
+ 1. Download the `data files`, `pre-trained GPT2 checkpoint`, and `token_id2emb_dict.pkl`.
95
+ 2. Put them in `./data/`, `./src/pretrained/` (change the path in `./src/configs.py` correspondingly) and `./src/vocab/` respectively.
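Before training, it can help to confirm that the files landed in the folders named in the two steps above. The following is a minimal sketch with a hypothetical helper `missing_resources`; it assumes the default layout, so adjust `EXPECTED` if you changed the pre-trained checkpoint path in `./src/configs.py`.

```python
import tempfile
from pathlib import Path

# Folder -> file pattern, following the two steps above.
# "*.pkl" is used for the data folder because the exact data file names may vary.
EXPECTED = {
    "data": "*.pkl",
    "src/pretrained": "GPT2_lyrics_ckpt_epoch00.ckpt",
    "src/vocab": "token_id2emb_dict.pkl",
}


def missing_resources(repo_root):
    """Return 'folder/pattern' entries for which no matching file exists."""
    root = Path(repo_root)
    missing = []
    for folder, pattern in EXPECTED.items():
        folder_path = root / folder
        if not folder_path.is_dir() or not any(folder_path.glob(pattern)):
            missing.append(f"{folder}/{pattern}")
    return missing


# Demo on a throwaway directory instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    missing_before = missing_resources(tmp)
    for folder, name in [("data", "test_data.pkl"),
                         ("src/pretrained", "GPT2_lyrics_ckpt_epoch00.ckpt"),
                         ("src/vocab", "token_id2emb_dict.pkl")]:
        target = Path(tmp) / folder
        target.mkdir(parents=True, exist_ok=True)
        (target / name).touch()
    missing_after = missing_resources(tmp)

print(missing_before)
print(missing_after)
```

In a real checkout, call `missing_resources(".")` from the repository root; an empty list means all three resources were found.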
96
+
97
+ ## Training
98
+ Change your configs and run:
99
+ ```
100
+ $ cd src/
101
+ $ bash train.sh
102
+ ```
103
+
104
+ ## Generate
105
+ Change your configs and run:
106
+ ```
107
+ $ cd src/
108
+ $ bash generate.sh
109
+ ```
110
+ This will generate the results of the test data and save them in your `save_samples_path`. You can also use the checkpoint we released to generate on your own data. The format of the data is the same as the test data (without the scores and ratings). You can refer to `./data/test_data.pkl` for more details.
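To see what such an input item looks like, one option is to start from a test-style item and drop the annotation keys. This is a hedged sketch: `strip_labels`, the toy item, and the output name `my_data.pkl` are hypothetical, and whether `lyrics` must be present at generation time is not stated here, so the sketch simply keeps it.

```python
import pickle
import tempfile
from pathlib import Path


def strip_labels(item):
    """Copy an item without 'rating' and the 'score_i' annotation keys."""
    return {k: v for k, v in item.items()
            if k != "rating" and not k.startswith("score_")}


# Toy test-style item with placeholder embeddings.
toy_test_item = {"topic": "toy topic", "topic_emb": [0.0, 0.1],
                 "lyrics": ["line one", "line two"], "rating": 5}
for i in range(5):
    toy_test_item[f"img_{i}_emb"] = [0.0, 0.1]
    toy_test_item[f"r_{i}"] = f"text {i}"
    toy_test_item[f"r_{i}_emb"] = [0.0, 0.1]
    toy_test_item[f"score_{i}"] = {"img_rel": [2, 2], "r_rel": [1, 1],
                                   "cmp_rel": [0, 0]}

own_item = strip_labels(toy_test_item)

# Save as a list of items, matching the list layout of the released files.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "my_data.pkl"  # hypothetical file name
    with open(out_path, "wb") as f:
        pickle.dump([own_item], f)
    with open(out_path, "rb") as f:
        reloaded = pickle.load(f)

print(sorted(own_item))
```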
111
+
112
+ <!-- ## Demo
113
+ We provide a demo to easily visualize the input and the output. You can run:
114
+ ```
115
+ $ cd src/demo/
116
+ $ python main.py
117
+ ```
118
+ Then go to the interactive and more user-friendly page and enjoy! -->
119
+
120
+
121
+ # Citation
122
+ If you find this paper and repo useful, please cite us in your work:
123
+
124
+ <!--
125
+ ```
126
+ @article{cao2022multi,
127
+ title={Multi-Modal Experience Inspired AI Creation},
128
+ author={Cao, Qian and Chen, Xu and Song, Ruihua and Jiang, Hao and Yang, Guang and Cao, Zhao},
129
+ journal={arXiv preprint arXiv:2209.02427},
130
+ year={2022}
131
+ }
132
+ ```
133
+ -->
134
+
135
+ ```
136
+ @inproceedings{10.1145/3503161.3548189,
137
+ author = {Cao, Qian and Chen, Xu and Song, Ruihua and Jiang, Hao and Yang, Guang and Cao, Zhao},
138
+ title = {Multi-Modal Experience Inspired AI Creation},
139
+ year = {2022},
140
+ isbn = {9781450392037},
141
+ publisher = {Association for Computing Machinery},
142
+ address = {New York, NY, USA},
143
+ url = {https://doi.org/10.1145/3503161.3548189},
144
+ doi = {10.1145/3503161.3548189},
145
+ booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
146
+ pages = {1445–1454},
147
+ numpages = {10},
148
+ keywords = {AI creation, multi-modal, experience},
149
+ location = {Lisboa, Portugal},
150
+ series = {MM '22}
151
+ }
152
+ ```
153
+ For any questions, please feel free to reach me at caoqian4real@ruc.edu.cn.
154
+
155
+
156
+
157
+
158
+
159
+
160
+