Commit 89e4656 (verified) by saadob12 · 1 Parent(s): 86990d6

Update README.md

Files changed (1)
  1. README.md +73 -27
README.md CHANGED
@@ -1,28 +1,66 @@
1
- # Training Data
2
- **Autochart:** Zhu, J., Ran, J., Lee, R. K. W., Choo, K., & Li, Z. (2021). AutoChart: A Dataset for Chart-to-Text Generation Task. arXiv preprint arXiv:2108.06897.
3
 
4
- **Gitlab Link for the data**: https://gitlab.com/bottle_shop/snlg/chart/autochart
5
 
6
- Train split for this model: Train 23336, Validation 1297, Test 1296
7
 
8
- # Example use:
9
- Append ```C2T: ``` before every input to the model
10
 
 
11
 
12
- ```
13
- tokenizer = AutoTokenizer.from_pretrained(saadob12/t5_C2T_autochart)
14
- model = AutoModelForSeq2SeqLM.from_pretrained(saadob12/t5_C2T_autochart)
15
 
16
- data = 'Trade statistics of Qatar with developing economies in North Africa bar_chart Year-Trade with economies of Middle East & North Africa(%)(Merchandise exports,Merchandise imports) x-y1-y2 values 2000 0.591869968616745 3.59339030672154 , 2001 0.53415012207203 3.25371165779341 , 2002 3.07769793440318 1.672796364224 , 2003 0.6932513078579471 1.62522475477827 , 2004 1.17635914189321 1.80540331396412'
17
 
18
  prefix = 'C2T: '
19
- tokens = tokenizer.encode(prefix + data, truncation=True, padding='max_length', return_tensors='pt')
20
  generated = model.generate(tokens, num_beams=4, max_length=256)
21
  tgt_text = tokenizer.decode(generated[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
22
- summary = str(tgt_text).strip('[]""')
23
- #Summary: This barchart shows the number of trade statistics of qatar with developing economies in north africa from 2000 through 2004. The unit of measurement in this graph is Trade with economies of Middle East & North Africa(%) as shown on the y-axis. The first group data denotes the change of Merchandise exports. There is a go up and down trend of the number. The peak of the number is found in 2002 and the lowest number is found in 2001. The changes in the number may be related to the conuntry's national policies. The second group data denotes the change of Merchandise imports. There is a go up and down trend of the number. The number in 2000 being the peak, and the lowest number is found in 2003. The changes in the number may be related to the conuntry's national policies.
 
 
 
24
  ```
25
- # Limitations
26
  You can use the model to generate summaries of data files.
27
  Works well for general statistics like the following:
28
 
@@ -39,21 +77,29 @@ May or may not generate an **okay** summary at best for the following kind of da
39
 
40
  | Model | BLEU score | BLEURT|
41
  |:---:|:---:|:---:|
42
- | t5-small | 25.4 | -0.11 |
43
- | t5-base | 28.2 | 0.12 |
44
- | t5-large | 35.4 | 0.34 |
45
-
46
 
47
 
48
  # Citation
49
-
50
  Kindly cite my work. Thank you.
51
  ```
52
- @misc{obaid ul islam_2022,
53
- title={saadob12/t5_C2T_autochart Hugging Face},
54
- url={https://huggingface.co/saadob12/t5_C2T_autochart},
55
- journal={Huggingface.co},
56
- author={Obaid ul Islam, Saad},
57
- year={2022}
58
- }
59
- ```
1
+ # Tackling Hallucinations in Neural Chart Summarization
 
2
 
3
+ ## Introduction
4
 
5
+ This is the trained model from the paper [Tackling Hallucinations in Neural Chart Summarization](https://aclanthology.org/2023.inlg-main.30/), which details the investigations and state-of-the-art (SOTA) improvements.
6
 
7
+ ### Abstract
 
8
 
9
+ Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the problem of hallucinations in neural chart summarization. Our analysis reveals that the target side of chart summarization training datasets often contains additional information, leading to hallucinations. We propose a natural language inference (NLI) based method to preprocess the training data and demonstrate through human evaluation that our approach significantly reduces hallucinations. Additionally, we found that shortening long-distance dependencies in the input sequence and adding chart-related information such as titles and legends enhances overall performance.
10
 
11
+ ## Main Findings from the Paper
12
+
13
+ - **Enhanced Context Provision:** Emphasizing the importance of providing more context and reducing long-distance dependencies in the input format.
14
+ - **NLI Cleaning Step:** Introducing an NLI-based cleaning step to eliminate ungrounded information in the training data.
15
+ - **Reduction of Intrinsic Hallucinations:** Demonstrating that reducing long-distance dependencies and adding more context leads to fewer intrinsic hallucinations.
16
+ - **Cause of Extrinsic Hallucinations:** Identifying that extrinsic hallucinations are caused by ungrounded information in training summaries.
17
+ - **Human Evaluation Results:** Showing that using NLI to filter training summaries significantly reduces hallucinations.
18
+
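The NLI cleaning step listed above can be pictured as a sentence-level filter over each training summary. The following is a minimal sketch under stated assumptions, not the paper's implementation: `filter_summary`, `split_sentences`, the `threshold` value, and the `toy_entail_prob` scorer are all hypothetical names introduced for illustration. In practice the scorer would be replaced by a real NLI model that returns an entailment probability for a (chart data, sentence) pair.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_summary(
    chart_data: str,
    summary: str,
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> str:
    """Keep only the summary sentences that the scorer considers grounded."""
    kept = [
        sentence
        for sentence in split_sentences(summary)
        if entail_prob(chart_data, sentence) >= threshold
    ]
    return " ".join(kept)


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model.

    Returns 1.0 when every number in the hypothesis also occurs in the
    premise, else 0.0. Sentences without numbers are always kept.
    """
    def numbers(text: str) -> set:
        return set(re.findall(r"\d+(?:\.\d+)?", text))

    return 1.0 if numbers(hypothesis) <= numbers(premise) else 0.0


# Toy data for illustration only.
chart_data = "Response - Share of cases, x-y values Circulatory system disease 62.7%, Cancer 13.3%"
summary = (
    "Circulatory system disease accounts for 62.7% of cases. "
    "The country confirmed 501 deaths."
)
cleaned = filter_summary(chart_data, summary, toy_entail_prob)
print(cleaned)
```

Here the second sentence is dropped because the figure 501 does not appear in the chart data, which is the kind of ungrounded target-side information the findings describe.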
19
+ ## Training Data
20
+
21
+ ### GitLab Link for the Original AutoChart Data
22
+
23
+ [AutoChart](https://gitlab.com/bottle_shop/snlg/chart/autochart)
24
+
25
+ ### Training Details
26
+
27
+ The model was trained using optimized prompts based on findings from the paper [Tackling Hallucinations in Neural Chart Summarization](https://aclanthology.org/2023.inlg-main.30/).
28
+
29
+ - **Optimized Prompt Dataset:** [Hallucinations-C2T Data](https://github.com/WorldHellow/Hallucinations-C2T/tree/main/data)
30
+
31
+ ## Example Use
32
+
33
+ ### Input Prompt Template
34
+
35
+ The input prompt template consists of the chart title, the x-y labels, and the x-y values.
36
 
37
+ **Prepending the C2T Prefix:**
38
+ Prepend `C2T: ` to every input to the model.
39
+
40
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("saadob12/t5_C2T_big")
+ model = AutoModelForSeq2SeqLM.from_pretrained("saadob12/t5_C2T_big")
+
+ # Input string: chart title, then the x-y labels, then the x-y values.
+ data = (
+     'Breakdown of coronavirus (COVID-19) deaths in South Korea as of March 16, 2020\n'
+     'by chronic disease x-y labels\n'
+     'Response - Share of cases, x-y values Circulatory system disease* 62.7%, '
+     'Endocrine and metabolic diseases** 46.7%, Mental illness*** 25.3%, '
+     'Respiratory diseases*** 24%, Urinary and genital diseases 14.7%, Cancer 13.3%, '
+     'Nervous system diseases 4%, Digestive system diseases 2.7%, Blood and hematopoietic diseases 1.3%'
+ )
 
  prefix = 'C2T: '
+ tokens = tokenizer.encode(prefix + data, truncation=True, padding='max_length', return_tensors='pt')
  generated = model.generate(tokens, num_beams=4, max_length=256)
  tgt_text = tokenizer.decode(generated[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
+ # Strip stray brackets and double quotes from both ends of the decoded text.
+ summary = tgt_text.strip('[]""')
+ # Summary:
+ # As of March 16, 2020, around 62.7% of COVID-19 deaths in South Korea were related to circulatory system diseases.
+ # Other chronic diseases include endocrine and metabolic diseases, mental illness, and cancer.
+ # South Korea confirmed 30,017 infection cases, including 501 deaths.
  ```
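The input string in the example above can also be assembled from structured chart data. The helper below is a minimal sketch: `build_c2t_input` and `C2T_PREFIX` are hypothetical names, and the exact separators (the newline after `x-y labels`, the ` - ` between the labels, the `, ` between value pairs) are inferred from the single example in this README rather than from a documented specification.

```python
from typing import Iterable, Tuple

C2T_PREFIX = "C2T: "  # prefix placed in front of every model input


def build_c2t_input(
    title: str,
    x_label: str,
    y_label: str,
    points: Iterable[Tuple[str, str]],
) -> str:
    """Assemble title, x-y labels, and x-y values into one prompt string.

    The layout mirrors the README example; it is an assumption, not a
    documented format.
    """
    pairs = ", ".join(f"{x} {y}" for x, y in points)
    return f"{C2T_PREFIX}{title} x-y labels\n{x_label} - {y_label}, x-y values {pairs}"


# Illustrative values taken from the example chart above.
prompt = build_c2t_input(
    title="Share of cases by chronic disease",
    x_label="Response",
    y_label="Share of cases",
    points=[("Cancer", "13.3%"), ("Nervous system diseases", "4%")],
)
print(prompt)
```

The returned string already carries the `C2T: ` prefix, so it can be passed to `tokenizer.encode` directly instead of `prefix + data`.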
63
+ # Intended Use and Limitations
64
  You can use the model to generate summaries of data files.
65
  Works well for general statistics like the following:
66
 
 
77
 
78
  | Model | BLEU score | BLEURT|
79
  |:---:|:---:|:---:|
80
+ | bert-small | 25.4 | -0.11 |
81
+ | bert-base | 28.2 | 0.12 |
82
+ | bert-large | 35.4 | 0.34 |
 
83
 
84
 
85
  # Citation
 
86
  Kindly cite my work. Thank you.
87
  ```
88
+ @inproceedings{obaid-ul-islam-etal-2023-tackling,
89
+ title = {Tackling Hallucinations in Neural Chart Summarization},
90
+ author = {Obaid ul Islam, Saad and Škrjanec, Iza and Dusek, Ondrej and Demberg, Vera},
91
+ booktitle = {Proceedings of the 16th International Natural Language Generation Conference},
92
+ month = sep,
93
+ year = {2023},
94
+ address = {Prague, Czechia},
95
+ publisher = {Association for Computational Linguistics},
96
+ url = {https://aclanthology.org/2023.inlg-main.30},
97
+ doi = {10.18653/v1/2023.inlg-main.30},
98
+ pages = {414--423},
99
+ abstract = {Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we tackle the problem of hallucinations in neural chart summarization. Our analysis shows that the target side of chart summarization training datasets often contains additional information, leading to hallucinations. We propose a natural language inference (NLI) based method to preprocess the training data and show through human evaluation that our method significantly reduces hallucinations. We also found that shortening long-distance dependencies in the input sequence and adding chart-related information like title and legends improves the overall performance.}
100
+ }
101
+
102
+ ```
103
+
104
+
105
+