raynardj committed on
Commit
30378a2
1 Parent(s): b112147

Create README.md

Files changed (1)
  1. README.md +40 -0
README.md ADDED
@@ -0,0 +1,40 @@
+ # RoBERTa-base fine-tuned on [PubMed](https://pubmed.ncbi.nlm.nih.gov/) abstracts
+ > We limit the training text to the following [MeSH](https://www.ncbi.nlm.nih.gov/mesh/) terms:
+ * All child MeSH terms of ```Biomarkers, Tumor(D014408)```, such as ```Carcinoembryonic Antigen(D002272)```
+ * All child MeSH terms of ```Carcinoma(D002277)```, which cover around 80 kinds of carcinoma, such as ```Carcinoma, Lewis Lung(D018827)```
+ * All child MeSH terms of ```Clinical Trial(D016439)```
+ * The training text file amounts to 531 MB
+ ## Training
+ * Trained on the masked language modeling task with ```mlm_probability=0.15```, on 2 Tesla V100 32G GPUs
+ ```python
+ from transformers import TrainingArguments
+
+ # `config.save` comes from the author's own config object: the checkpoint directory
+ training_args = TrainingArguments(
+     output_dir=config.save,
+     overwrite_output_dir=True,
+     num_train_epochs=3,
+     per_device_train_batch_size=30,
+     per_device_eval_batch_size=60,
+     evaluation_strategy='steps',  # evaluate every `eval_steps` training steps
+     save_total_limit=2,  # keep at most 2 checkpoints on disk
+     eval_steps=250,
+     metric_for_best_model='eval_loss',
+     greater_is_better=False,  # a lower eval loss is better
+     load_best_model_at_end=True,
+     prediction_loss_only=True,
+     report_to="none")
+ ```
+
+ ```yaml
+
+ ---
+ language:
+ - en
+ tags:
+ - pubmed
+ - cancer
+ - gene
+ - clinical trial
+ license: apache-2.0
+ datasets:
+ - pubmed
+ ---
+ ```
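
To make the `mlm_probability=0.15` setting from the Training section more concrete, here is a minimal, dependency-free sketch of BERT-style dynamic masking. It is not the author's training code: the README does not show the data collator, so the 80/10/10 replacement split (the scheme used by Hugging Face's `DataCollatorForLanguageModeling`), the helper name `mask_tokens`, and the token ids below are all assumptions made for illustration.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking.

    Hypothetical helper for illustration only, not part of any library.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    # -100 is the label value that PyTorch's cross-entropy loss ignores by default
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected
        # independently with probability `mlm_probability`.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80% of selected tokens: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Illustrative ids (assumed): 0 = start token, 2 = end token, 50264 = mask token
BOS_ID, EOS_ID, MASK_ID, VOCAB_SIZE = 0, 2, 50264, 50265

example = [BOS_ID] + list(range(100, 120)) + [EOS_ID]
inputs, labels = mask_tokens(example, MASK_ID, VOCAB_SIZE,
                             special_ids={BOS_ID, EOS_ID},
                             rng=random.Random(0))
print(inputs)
print(labels)
```

Because the selection is redone on every call, each epoch sees a different masking of the same abstract, which is why this is usually called dynamic masking.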