Update README.md
README.md
CHANGED
@@ -30,6 +30,27 @@ We achieve pretty good results:
 Though the test set is only 59 examples, with 22 discussing disease.

+## Key stats
+The model is ~600 MB. It runs quickly on the MPS device of an M-series Mac, and should also run reasonably quickly on a normal CPU.
+
+The context window is 4,096 tokens, inherited from the base Longformer model. In training we limit the context to 1,280 tokens because that
+was a bit longer than the longest abstract we saw. Abstracts longer than that may cause trouble.
+
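The two token limits above suggest checking an abstract's length before trusting a prediction. A minimal sketch, assuming the abstract has already been converted to a list of token IDs by the model's own tokenizer (not shown here); `MODEL_MAX_TOKENS`, `TRAIN_MAX_TOKENS`, `check_length`, and `truncate` are hypothetical names for illustration and are not part of this repository:

```python
# Limits quoted in the README; the constant names are illustrative only.
MODEL_MAX_TOKENS = 4096   # context window of the base Longformer model
TRAIN_MAX_TOKENS = 1280   # context limit applied during training


def check_length(token_ids):
    """Classify an already-tokenized abstract by its length.

    Returns one of:
      "ok"       - no longer than the limit used in training
      "untested" - fits the context window, but longer than the training limit
      "too_long" - exceeds the context window and would need truncating
    """
    n = len(token_ids)
    if n <= TRAIN_MAX_TOKENS:
        return "ok"
    if n <= MODEL_MAX_TOKENS:
        return "untested"
    return "too_long"


def truncate(token_ids, limit=TRAIN_MAX_TOKENS):
    """Keep only the first `limit` tokens (simple head truncation)."""
    return token_ids[:limit]
```

Head truncation is only one option and is an assumption here; the README does not say how over-long inputs were handled.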
+## Limitations
+The base model was trained with masked language modelling (MLM) on Wikipedia text (and possibly other sources). As such, it might not have a
+great understanding of scientific literature.
+
+The dataset used to train this model was _tiny_, at only 588 examples overall. This leaves only 470 samples
+for training, with 59 each for validation and testing. These were deliberately sampled to be roughly equally distributed between the
+two classes.
+
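The split sizes quoted above are consistent with an 80/10/10 split of 588 examples. A minimal sketch of such a split, assuming a plain shuffled split; the README does not say how the split was actually made or whether it was stratified by class, and `split_dataset` with its parameters is a hypothetical helper:

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the examples and split them into (train, val, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]      # everything left over
    return train, val, test


# Stand-in for the 588 labelled abstracts.
train, val, test = split_dataset(range(588))
print(len(train), len(val), len(test))  # 470 59 59
```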
+The dataset these are derived from is also _massively_ imbalanced: of its 19,229 examples, only 294 are not disease. As a result,
+the model is trained on a dataset that hugely undersamples the disease-context abstracts.
+
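The scale of the undersampling can be worked out from the counts given in this README. A rough sketch, assuming the balanced set keeps every non-disease example and fills the remainder with disease examples; the README only says the classes are "roughly" equal, so the exact 294/294 split is an assumption:

```python
SOURCE_TOTAL = 19_229      # examples in the source dataset (from the README)
SOURCE_NON_DISEASE = 294   # non-disease examples in the source dataset
BALANCED_TOTAL = 588       # size of the dataset used for training and evaluation

source_disease = SOURCE_TOTAL - SOURCE_NON_DISEASE

# Assumption: all non-disease examples are kept, and the rest of the balanced
# set is made up of disease examples.
kept_non_disease = min(SOURCE_NON_DISEASE, BALANCED_TOTAL // 2)
kept_disease = BALANCED_TOTAL - kept_non_disease

print(f"disease examples in source: {source_disease}")
print(f"non-disease share of source: {SOURCE_NON_DISEASE / SOURCE_TOTAL:.2%}")
print(f"disease examples kept: {kept_disease} of {source_disease} "
      f"({kept_disease / source_disease:.2%})")
```

Under that assumption, only a small fraction of the available disease abstracts ever reach the model, which is the undersampling described above.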
+While the model has been tested on some abstracts derived from lncBook's annotations, it hasn't really been tested on 'wild' abstracts.
+
+## Next steps
+
 The next step will be to classify both the specific disease (e.g. lung adenocarcinoma) and the non-disease
 context (e.g. localisation) a paper discusses.