afg1 committed on
Commit
1493c75
1 Parent(s): aae4abd

Update README.md

Files changed (1)
  1. README.md +21 -0
README.md CHANGED
@@ -30,6 +30,27 @@ We achieve pretty good results:
30
 
31
  Though the test set is only 59 examples, with 22 discussing disease.
32
 
33
+ ## Key stats
34
+ The model is ~600 MB. It runs quickly on the MPS device of an M-series Mac and should also run reasonably fast on a standard CPU.
35
+
36
+ The context window is 4,096 tokens, inherited from the base Longformer model. In training we limit the context to 1,280 tokens because that
37
+ was slightly longer than the longest abstract we saw. Abstracts longer than that may cause trouble.
38
+
39
+ ## Limitations
40
+ The base model was pretrained with masked language modelling (MLM) on Wikipedia text (and possibly other sources). As such, it might not have a
41
+ great understanding of scientific literature.
42
+
43
+ The dataset used to train this model was _tiny_, at only 588 examples overall. This leaves only 470 examples
44
+ for training, with 59 each for validation and testing. These were deliberately sampled to be roughly evenly split between the
45
+ two classes.
46
+
47
+ The dataset these are derived from is also _massively_ imbalanced: of its 19,229 examples, only 294 are not about disease. As a result,
48
+ the model is trained on a dataset that heavily undersamples the disease-context abstracts.
49
+
50
+ While the model has been tested on some abstracts derived from lncBook's annotations, it hasn't been properly tested on 'wild' abstracts.
51
+
52
+ ## Next steps
53
+
54
  The next step will be to be able to classify both the specific disease (e.g. lung adenocarcinoma), and the non-disease
55
  context (e.g. localisation) a paper discusses.
56
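
The sampling described in the added Limitations text can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual pipeline. It assumes the disease class is randomly undersampled to match the 294 non-disease examples, and that the resulting 588 examples are split 80/10/10 with rounding, which reproduces the 470/59/59 counts quoted in the README. The function names, the label encoding, and the placeholder IDs are all hypothetical.

```python
import random


def balanced_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the same size as the minority class.

    Returns a shuffled list of (example, label) pairs.
    Label 1 marks the majority (disease) class, label 0 the minority class;
    this encoding is an assumption made for the sketch.
    """
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))
    combined = [(x, 1) for x in kept] + [(x, 0) for x in minority]
    rng.shuffle(combined)
    return combined


def train_val_test_split(examples, train_frac=0.8, val_frac=0.1):
    """Split a list into train/validation/test parts; the test part takes the remainder."""
    n_train = round(len(examples) * train_frac)
    n_val = round(len(examples) * val_frac)
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    test = examples[n_train + n_val:]
    return train, val, test


# Placeholder IDs standing in for abstracts; only the counts come from the README.
TOTAL_EXAMPLES = 19_229
NON_DISEASE = 294
disease = [f"disease_{i}" for i in range(TOTAL_EXAMPLES - NON_DISEASE)]
non_disease = [f"other_{i}" for i in range(NON_DISEASE)]

balanced = balanced_undersample(disease, non_disease)
train, val, test = train_val_test_split(balanced)
print(len(balanced), len(train), len(val), len(test))
```

With the counts from the README this prints `588 470 59 59`. The fixed seed keeps the sample reproducible. Note that this simple split is not stratified, so the class balance within each part is only approximate, in line with the "roughly" in the README.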