zihanliu committed
Commit ab81972
1 Parent(s): 4647a16

Update README.md

Files changed (1)
  1. README.md +17 -17
README.md CHANGED
@@ -13,28 +13,28 @@ tags:
 
 
 ## Model Details
-We introduce Llama3-ChatQA-1.5, which excels at conversational question answering (QA) and retrieval-augmented generation (RAG). Llama3-ChatQA-1.5 is developed using an improved training recipe from [ChatQA (1.0)](https://arxiv.org/abs/2401.10225), and it is built on top of the [Llama-3 base model](https://huggingface.co/meta-llama/Meta-Llama-3-8B). Specifically, we incorporate more conversational QA data to enhance its tabular and arithmetic calculation capabilities. Llama3-ChatQA-1.5 has two variants: Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B. Both models were originally trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM); we then converted the checkpoints to Hugging Face format. **For more information about ChatQA, check the [website](https://chatqa-project.github.io/)!**
+We introduce Llama3-ChatQA-1.5, which excels at conversational question answering (QA) and retrieval-augmented generation (RAG). Llama3-ChatQA-1.5 is developed using an improved training recipe from [ChatQA (1.0)](https://arxiv.org/pdf/2401.10225v3), and it is built on top of the [Llama-3 base model](https://huggingface.co/meta-llama/Meta-Llama-3-8B). Specifically, we incorporate more conversational QA data to enhance its tabular and arithmetic calculation capabilities. Llama3-ChatQA-1.5 has two variants: Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B. Both models were originally trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM); we then converted the checkpoints to Hugging Face format. **For more information about ChatQA, check the [website](https://chatqa-project.github.io/)!**
 
 ## Other Resources
-[Llama3-ChatQA-1.5-70B](https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B) &ensp; [Evaluation Data](https://huggingface.co/datasets/nvidia/ChatRAG-Bench) &ensp; [Training Data](https://huggingface.co/datasets/nvidia/ChatQA-Training-Data) &ensp; [Retriever](https://huggingface.co/nvidia/dragon-multiturn-query-encoder) &ensp; [Website](https://chatqa-project.github.io/) &ensp; [Paper](https://arxiv.org/abs/2401.10225)
+[Llama3-ChatQA-1.5-70B](https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B) &ensp; [Evaluation Data](https://huggingface.co/datasets/nvidia/ChatRAG-Bench) &ensp; [Training Data](https://huggingface.co/datasets/nvidia/ChatQA-Training-Data) &ensp; [Retriever](https://huggingface.co/nvidia/dragon-multiturn-query-encoder) &ensp; [Website](https://chatqa-project.github.io/) &ensp; [Paper](https://arxiv.org/pdf/2401.10225v3)
 
 ## Benchmark Results
 Results in [ChatRAG Bench](https://huggingface.co/datasets/nvidia/ChatRAG-Bench) are as follows:
 
-| | ChatQA-1.0-7B | Command-R-Plus | Llama-3-instruct-70b | GPT-4-0613 | ChatQA-1.0-70B | ChatQA-1.5-8B | ChatQA-1.5-70B |
-| -- |:--:|:--:|:--:|:--:|:--:|:--:|:--:|
-| Doc2Dial | 37.88 | 33.51 | 37.88 | 34.16 | 38.90 | 39.33 | 41.26 |
-| QuAC | 29.69 | 34.16 | 36.96 | 40.29 | 41.82 | 39.73 | 38.82 |
-| QReCC | 46.97 | 49.77 | 51.34 | 52.01 | 48.05 | 49.03 | 51.40 |
-| CoQA | 76.61 | 69.71 | 76.98 | 77.42 | 78.57 | 76.46 | 78.44 |
-| DoQA | 41.57 | 40.67 | 41.24 | 43.39 | 51.94 | 49.60 | 50.67 |
-| ConvFinQA | 51.61 | 71.21 | 76.60 | 81.28 | 73.69 | 78.46 | 81.88 |
-| SQA | 61.87 | 74.07 | 69.61 | 79.21 | 69.14 | 73.28 | 83.82 |
-| TopiOCQA | 45.45 | 53.77 | 49.72 | 45.09 | 50.98 | 49.96 | 55.63 |
-| HybriDial* | 54.51 | 46.70 | 48.59 | 49.81 | 56.44 | 65.76 | 68.27 |
-| INSCIT | 30.96 | 35.76 | 36.23 | 36.34 | 31.90 | 30.10 | 32.31 |
-| Average (all) | 47.71 | 50.93 | 52.52 | 53.90 | 54.14 | 55.17 | 58.25 |
-| Average (excluding HybriDial) | 46.96 | 51.40 | 52.95 | 54.35 | 53.89 | 53.99 | 57.14 |
+| | ChatQA-1.0-7B | Command-R-Plus | Llama-3-instruct-70b | GPT-4-0613 | GPT-4-Turbo | ChatQA-1.0-70B | ChatQA-1.5-8B | ChatQA-1.5-70B |
+| -- |:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| Doc2Dial | 37.88 | 33.51 | 37.88 | 34.16 | 35.35 | 38.90 | 39.33 | 41.26 |
+| QuAC | 29.69 | 34.16 | 36.96 | 40.29 | 40.10 | 41.82 | 39.73 | 38.82 |
+| QReCC | 46.97 | 49.77 | 51.34 | 52.01 | 51.46 | 48.05 | 49.03 | 51.40 |
+| CoQA | 76.61 | 69.71 | 76.98 | 77.42 | 77.73 | 78.57 | 76.46 | 78.44 |
+| DoQA | 41.57 | 40.67 | 41.24 | 43.39 | 41.60 | 51.94 | 49.60 | 50.67 |
+| ConvFinQA | 51.61 | 71.21 | 76.60 | 81.28 | 84.16 | 73.69 | 78.46 | 81.88 |
+| SQA | 61.87 | 74.07 | 69.61 | 79.21 | 79.98 | 69.14 | 73.28 | 83.82 |
+| TopiOCQA | 45.45 | 53.77 | 49.72 | 45.09 | 48.32 | 50.98 | 49.96 | 55.63 |
+| HybriDial* | 54.51 | 46.70 | 48.59 | 49.81 | 47.86 | 56.44 | 65.76 | 68.27 |
+| INSCIT | 30.96 | 35.76 | 36.23 | 36.34 | 33.75 | 31.90 | 30.10 | 32.31 |
+| Average (all) | 47.71 | 50.93 | 52.52 | 53.90 | 54.03 | 54.14 | 55.17 | 58.25 |
+| Average (excluding HybriDial) | 46.96 | 51.40 | 52.95 | 54.35 | 54.72 | 53.89 | 53.99 | 57.14 |
 
 Note that ChatQA-1.5 is built on the Llama-3 base model, while ChatQA-1.0 is built on the Llama-2 base model. ChatQA-1.5 used some samples from the HybriDial training dataset. To ensure a fair comparison, we also report average scores excluding HybriDial. The data and evaluation scripts for ChatRAG Bench can be found [here](https://huggingface.co/datasets/nvidia/ChatRAG-Bench).
 
@@ -185,7 +185,7 @@ Zihan Liu (zihanl@nvidia.com), Wei Ping (wping@nvidia.com)
 ## Citation
 <pre>
 @article{liu2024chatqa,
-  title={ChatQA: Building GPT-4 Level Conversational QA Models},
+  title={ChatQA: Surpassing GPT-4 on Conversational QA and RAG},
   author={Liu, Zihan and Ping, Wei and Roy, Rajarshi and Xu, Peng and Lee, Chankyu and Shoeybi, Mohammad and Catanzaro, Bryan},
   journal={arXiv preprint arXiv:2401.10225},
   year={2024}}
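
Since the card notes that the Megatron-LM checkpoints were converted to Hugging Face format, a minimal loading sketch with `transformers` follows. It is not part of this commit: the repo id `nvidia/Llama3-ChatQA-1.5-8B` is an assumption inferred from the variant names above, and the plain-text prompt is a placeholder; the model card's own prompt template (outside this diff hunk) should be used in practice.

```python
# Minimal sketch (not from this commit): load a converted Hugging Face
# checkpoint and run generation. The repo id below is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nvidia/Llama3-ChatQA-1.5-8B"  # assumed id; the 70B variant is linked above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # place weights on available devices
)

# Placeholder prompt: ChatQA expects a system/context/user template defined
# elsewhere in the README, which this diff hunk does not show.
prompt = "System: Answer using the given context.\n\nUser: What does ChatQA-1.5 improve over ChatQA-1.0?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```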