---
license: mit
---

# MT-LLaMA Model Card

## Model details

**Model type:**
MT-LLaMA is an open-source multi-task model obtained by fine-tuning LLaMA on the large collection of tasks in [P3](https://huggingface.co/datasets/bigscience/P3) (i.e., the T0 training mixture). Concretely, the datasets used during training, grouped by task, are listed below:
* Multi-choice QA: CommonsenseQA, Cosmos QA, DREAM, QuAIL, QuaRTz, QASC, QuaRel, SciQ, Social IQA, Wiki Hop, WiQA  
* Extractive QA: Adversarial QA, DuoRC, Quoref, ROPES  
* Closed-Book QA: Hotpot QA, Wiki QA  
* Sentiment Classification: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp  
* Topic Classification: AG News, DBPedia, TREC  
* Structure-to-Text Generation: Common Gen, Wiki Bio  
* Text Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum  
* Paraphrase Identification: MRPC, PAWS, QQP  

**Organizations developing the model:**
The MT-LLaMA team, with members from Alibaba DAMO Academy and the Chinese University of Hong Kong.

## Intended use

You can try the code from our [GitHub repo](https://github.com/DAMO-NLP-SG/MT-LLaMA); a minimal loading sketch is given below.
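
The snippet below is a hedged sketch rather than the official recipe: it assumes the released weights can be loaded as a standard LLaMA-style causal LM with Hugging Face `transformers`, and the checkpoint path is a placeholder. Please follow the GitHub repo for the actual instructions.

```python
# Minimal usage sketch. Assumptions: the released checkpoint loads as a
# standard LLaMA-style causal LM, and "DAMO-NLP-SG/mt-llama-7b" is a
# placeholder path, not an official model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "DAMO-NLP-SG/mt-llama-7b"  # placeholder; see the GitHub repo for the released weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

# Prompts follow the formats listed under "Zero-shot Evaluation" below.
prompt = (
    "Answer the question according to the context. "
    "Question: Where is the Eiffel Tower located? "
    "Context: The Eiffel Tower is a wrought-iron lattice tower in Paris, France. Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```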

                                          
## Zero-shot Evaluation

We primarily follow the protocols of [Bigscience T0](https://openreview.net/forum?id=9Vrb9D0WI4) to assess the generalization capability of our Multi-task LLaMA to: (1) _**Unseen Datasets**_ (i.e., datasets from seen tasks); (2) _**Unseen Tasks**_.
                                                     
#### Prompt Format

Extractive QA:

1. XQuAD, TyDiQA, MLQA, SQuAD
   ```
    Input: Answer the question according to the context. Question: ${question}. Context: ${context}. Answer:
    Output: ${Answer}
   ```

Sentiment:

1. SST-2
   ```
   Input: ${sentence} Based on this review, would the user recommend this product? No or Yes?
   Output: Yes / No
   ```
Multiple-Choice QA:

1. OpenbookQA
   ```
   Input: ${question} Which is the correct answer? - (A) ${choiceA} - (B) ${choiceB} - (C) ${choiceC} - (D) ${choiceD}
   Output: ${choiceA} / ${choiceB} / ${choiceC} / ${choiceD}
   ```
Sentence Completion:

1. COPA
   ```
   Input: ${premise} {% if question == "cause" %} This happened because... {% else %} As a consequence... {% endif %} Help me pick the more plausible option: - ${text1} - ${text2}
   Output: ${text1} / ${text2}
   ```
Coreference Resolution:
1. Winogrande:
   ```
   Input: ${sentence} In the previous sentence, does _ refer to ${option1} or ${option2}?
   Output: ${option1} / ${option2}
   ```
Word Sense Disambiguation:
1. WiC
   ```
   Input: Does the word "${word}" have the same meaning in these two sentences? Yes, No? ${sentence1} ${sentence2}
   Output: Yes / No
   ```
Natural Language Inference:

1. MNLI:
   ```
   Input: ${premise} Question: Does this imply that ${hypothesis}? Please respond with 'Yes', 'No', or 'Maybe'.
   Output: Yes / No / Maybe
   ```
2. RTE
   ```
   Input: Given ${premise} Is it guaranteed true that "${hypothesis}"? Yes or no?
   Output: Yes / no
   ```
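
As in T0-style evaluation, the classification-style prompts above are typically scored by rank classification: format the input, compute the likelihood of each candidate answer string, and predict the highest-scoring one. The helper below is a simplified, unofficial sketch of that idea using the RTE template as an example; `model` and `tokenizer` are assumed to be loaded as in the earlier snippet, and `score_option` is a hypothetical name, not part of the repo.

```python
# Unofficial sketch of rank-classification scoring for a prompt with a fixed
# answer set (here RTE). Assumes `model` and `tokenizer` from the loading
# sketch above.
import torch

def score_option(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    option_ids = tokenizer(option, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i + 1, so the option tokens are
    # predicted by the positions starting at the last prompt token.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, option_ids[0].unsqueeze(1)).sum().item()

premise = "The cat sat on the mat."
hypothesis = "A cat is on a mat."
prompt = f'Given {premise} Is it guaranteed true that "{hypothesis}"? Yes or no? '
prediction = max(["Yes", "no"], key=lambda opt: score_option(prompt, opt))
print(prediction)
```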
#### Results on _Unseen Datasets_

| Model       | XQuAD-en (F1/EM) | TyDiQA-en (F1/EM) | MLQA-en (F1/EM) | SQuAD (F1/EM) | SST-2 (Acc.) | OpenbookQA (Acc.) |
|:------------|------------------|-------------------|-----------------|---------------|--------------|-------------------|
| LLaMA-7b    | 9.5 / 2.0        | 14.3 / 2.6        | 13.4 / 3.3      | 29.4 / 11.5   | 50.5         | 32.4              |
| MT-LLaMA-7b | 42.3 / 31.1      | 38.9 / 26.9       | 45.4 / 31.5     | 85.9 / 77.6   | 92.6         | 38.2              |

#### Results on _Unseen Tasks_

| Model       | COPA (Acc.) | Winogrande (Acc.)  | WiC (Acc.) | MNLI (Acc.) | RTE (Acc.) |
|:------------|-------------|--------------------|------------|-------------|------------|
| LLaMA-7b    | 56.0        | 49.3               | 51.7       | 30.2        | 52.7       |
| MT-LLaMA-7b | 88.0        | 54.9               | 52.2       | 49.6        | 79.1       |

## Acknowledgement

* Our training code is largely borrowed from [FastChat](https://github.com/lm-sys/FastChat).
* We are also grateful for the efforts of [LLaMA](https://github.com/facebookresearch/llama) (from FAIR) and [T0](https://github.com/bigscience-workshop/t-zero) (from BigScience), which serve as the foundation of our work.

If you find this resource useful, please cite the repo as follows:
```
@software{damonlpsg2023mtllama,
  author = {Xu, Weiwen and Li, Xin and Bing, Lidong},
  title = {Multi-task Instruction-tuned LLaMA},
  year = 2023,
  url = {https://github.com/DAMO-NLP-SG/MT-LLaMA}
}
```