# LT3

LT3 is a novel Conditional Transformer designed for generating synthetic medical instructions as an alternative to real medical data, addressing data privacy restrictions and scarcity issues. It has demonstrated better generation quality and diversity than Large Language Models (LLMs), as well as the ability to effectively train NER models with performance comparable to that achieved with real data. In addition, our research proposes a new Beam Search Decoding algorithm (B2SD), which outperformed state-of-the-art methods on our task.

This work was presented at *NeurIPS 2023's Workshop on Synthetic Data Generation with Generative AI*.

Our pre-print can be found here: https://arxiv.org/abs/2310.19727.

*Authors: Samuel Belkadi, Nicolo Micheletti, Lifeng Han, Warren Del-Pinto, Goran Nenadic.*

## Setup

This repository is a ready-to-run version of LT3. In order to generate synthetic data, the following requirements must be met:
- Python version: 3.11.7
- Pip version: 23.3.1
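You can check which versions you have installed by running ```python3 --version``` and ```pip --version```.
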
Then, the required packages must be installed by running the following command at the root of the repository:
```pip install -r requirements.txt```

## Usage: Generate data

In order to generate synthetic data, you must follow the steps given below:

1. First, create or edit the file named `medications.txt` at the root of the repository. This file lists medication names along with the number of prescriptions to generate for each medication. Each line must provide a medication name and its number of generations in the form `name:amount` (e.g., `aspirin:5` to get five different prescriptions of aspirin), as illustrated by the sample file below.
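
   For illustration, a `medications.txt` requesting five prescriptions of aspirin and two of paracetamol (paracetamol is used here purely as an example) would look like:

   ```
   aspirin:5
   paracetamol:2
   ```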

2. Then, the file `generate.py` must be called to generate the desired synthetic data. You can call this script using the command ```python3 generate.py```, to which the following arguments can be added (a sample invocation is shown after this list):
   - `-in` (`--input_path`): path to the input dictionary file (default: `./medications.txt`);
   - `-out` (`--output_path`): path to the output file (default: `./generations.json`);
   - `-bs` (`--beam_size`): beam size for beam search decoding (default: 4);
   - `-mspd` (`--max_step_prob_diff`): maximum step probability difference for beam search decoding (default: 1.0);
   - `-nrpl` (`--no_repetition_length`): minimum length between repeated special characters in generations (default: 4);
   - `-a` (`--alpha`): alpha value (hyperparameter) for beam search decoding (default: 0.6);
   - `-tlp` (`--tree_length_product`): tree length product for beam search decoding (default: 3).
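
   For example, the following invocation (the argument values here are chosen purely for illustration) uses a wider beam and a tighter step probability difference:

   ```python3 generate.py -in ./medications.txt -out ./generations.json -bs 8 -mspd 0.5```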

3. Finally, the generated prescriptions will be available in the desired output file. Note that the number of requested prescriptions may exceed the maximum number of possible generations if you ask for too many prescriptions and saturate the decoding tree, or if you over-restrict the model with your chosen hyperparameters (arguments). In that case, some prescriptions are given as `-` to signify that they could not be generated. A short sketch for inspecting the output file is given below.
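
The following minimal sketch counts how many prescriptions were successfully generated per medication. It assumes that `generations.json` maps each medication name to a list of generated prescription strings; the actual schema of the output file may differ.

```python
import json

# Load the generated prescriptions. The schema assumed here (medication name ->
# list of generated strings) is for illustration only; adjust it to match the
# actual structure of generations.json.
with open("./generations.json", "r") as f:
    generations = json.load(f)

for medication, prescriptions in generations.items():
    # Per step 3 above, "-" marks a prescription that could not be generated
    # (e.g. a saturated decoding tree or over-restrictive hyperparameters).
    valid = [p for p in prescriptions if p != "-"]
    print(f"{medication}: {len(valid)}/{len(prescriptions)} prescriptions generated")
```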

## Evaluation results

### Lexical Similarity Evaluation against References

The results below show that LT3’s generations are the closest match to the reference samples. We used multi-reference evaluation to consolidate our results. Higher scores are better.

| Models   | BLEU      | ROUGE-1   | ROUGE-2   | ROUGE-L   | BERTScore |
|----------|-----------|-----------|-----------|-----------|-----------|
| T5 Small | 71.75     | 76.16     | 66.24     | 75.55     | 0.70      |
| T5 Base  | 71.98     | 76.28     | 66.30     | 75.45     | 0.70      |
| T5 Large | 69.89     | 75.07     | 65.19     | 74.22     | 0.68      |
| LT3      | **78.52** | **78.16** | **68.72** | **77.55** | **0.72**  |

### Lexical Diversity Evaluation within Generated Outputs

The results below measure the diversity of each model's outputs. For each label, we measured the Jaccard similarity score between the model's generations. A higher Jaccard score indicates more similarity between the two populations, while a lower score indicates better diversity for our task.

|         | Median Jaccard Score | Average Jaccard Score |
|---------|----------------------|-----------------------|
| LT3     | **0.650**            | **0.652**             |
| T5 Base | 0.658                | 0.660                 |

### Downstream NER Evaluation

The results below demonstrate the effectiveness of our generated synthetic dataset for training an NER model, compared to training on real data.

![NER Results](./NER_results.png)

## Thank you

Feel free to use LT3 for any research purpose.

Please contact us if you have any questions, and **cite our work** whenever you use it.