Vilnius-Lithuania-iGEM committed on
Commit afc2bea
1 Parent(s): 066f3ec

updated README.md

Files changed (1)
  1. README.md +20 -79
README.md CHANGED
@@ -1,98 +1,39 @@
- ---
- language:
- - ru
- - en
- thumbnail: https://raw.githubusercontent.com/JetRunner/BERT-of-Theseus/master/bert-of-theseus.png
- tags:
- - translation
- - fsmt
- license: Apache 2.0
- datasets:
- - wmt19
- metrics:
- - bleu
- - sacrebleu
- ---
-
- # MyModel

  ## Model description

- This is a ported version of [fairseq wmt19 transformer](https://github.com/pytorch/fairseq/blob/master/examples/wmt19/README.md) for {src_lang}-{tgt_lang}.
-
- For more details, please see, [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616).
-
- The abbreviation FSMT stands for FairSeqMachineTranslation

- All four models are available:

- * [wmt19-en-ru](https://huggingface.co/facebook/wmt19-en-ru)
- * [wmt19-ru-en](https://huggingface.co/facebook/wmt19-ru-en)
- * [wmt19-en-de](https://huggingface.co/facebook/wmt19-en-de)
- * [wmt19-de-en](https://huggingface.co/facebook/wmt19-de-en)

  ## Intended uses & limitations

  #### How to use

- ```python
- from transformers.tokenization_fsmt import FSMTTokenizer
- from transformers.modeling_fsmt import FSMTForConditionalGeneration
- mname = "facebook/wmt19-ru-en"
- tokenizer = FSMTTokenizer.from_pretrained(mname)
- model = FSMTForConditionalGeneration.from_pretrained(mname)
-
- input = "Машинное обучение - это здорово, не так ли?"
- input_ids = tokenizer.encode(input, return_tensors="pt")
- outputs = model.generate(input_ids)
- decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(decoded) # Machine learning is great, isn't it?
  ```

- #### Limitations and bias
-
- - The original (and this ported model) doesn't seem to handle well inputs with repeated sub-phrases, [content gets truncated](https://discuss.huggingface.co/t/issues-with-translating-inputs-containing-repeated-phrases/981)
-
-
- ## Training data

- Pretrained weights were left identical to the original model released by fairseq. For more details, please, see the [paper](https://arxiv.org/abs/1907.06616).
-
-
- ## Training procedure

  ## Eval results

- pair | fairseq | transformers
- -------|---------|----------
- ru-en | [41.3](http://matrix.statmt.org/matrix/output/1907?run_id=6937) | 39.20
-
-
- The score was calculated using this code:

- ```bash
- git clone https://github.com/huggingface/transformers
- cd transformers
- export PAIR=ru-en
- export DATA_DIR=data/$PAIR
- export SAVE_DIR=data/$PAIR
- export BS=8
- export NUM_BEAMS=15
- mkdir -p $DATA_DIR
- sacrebleu -t wmt19 -l $PAIR --echo src > $DATA_DIR/val.source
- sacrebleu -t wmt19 -l $PAIR --echo ref > $DATA_DIR/val.target
- echo $PAIR
- PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py facebook/wmt19-$PAIR $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
- ```
-
- ### BibTeX entry and citation info
-
- ```bibtex
- @inproceedings{...,
- year={2020},
- title={Facebook FAIR's WMT19 News Translation Task Submission},
- author={Ng, Nathan and Yee, Kyra and Baevski, Alexei and Ott, Myle and Auli, Michael and Edunov, Sergey},
- booktitle={Proc. of WMT},
- }
- ```
 
+ # Albumin-15s

  ## Model description

+ This is a version of [Albert-base-v2](https://huggingface.co/albert-base-v2) fine-tuned to compare 15-mer aptamers and determine which of the two has a higher affinity for the target protein Albumin.

+ The Albert model was pretrained on English text. Natural language shares many structural similarities with protein and aptamer sequences, which is why we fine-tuned it: the model learns position-aware embeddings for aptamers and can better distinguish between sequences.

+ More information can be found in our [GitHub]() and our iGEM [wiki]().

  ## Intended uses & limitations

+ You can use the fine-tuned model for masked aptamer pair sequence classification, i.e. predicting which of the two aptamers has a higher affinity for the target protein Albumin, but it is mostly intended to be fine-tuned again on aptamers of a different length or on expanded datasets.

  #### How to use

+ This model can be used to predict the compared affinity of an aptamer pair, together with a dataset preprocessing function that encodes records of the form (Sequence1, Sequence2, Label), where Label is a binary indicator of whether Sequence1 has a higher affinity for the target protein Albumin (a sketch of such an encoding is given after the loading example below).

+ ```python
+ from transformers import AutoTokenizer, BertModel
+
+ mname = "Vilnius-Lithuania-iGEM/Albumin"
+ tokenizer = AutoTokenizer.from_pretrained(mname)  # the AutoTokenizer import implies the checkpoint also ships a tokenizer
+ model = BertModel.from_pretrained(mname)
  ```
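The actual preprocessing function lives in the project repository and is not shown in this card; the snippet below is only a minimal sketch of how one (Sequence1, Sequence2, Label) record could be encoded as a single two-segment input, reusing the `tokenizer` and `model` loaded above. The sequences, label, and `max_length` are illustrative assumptions, not project data.

```python
# Hypothetical record, for illustration only.
record = {
    "Sequence1": "AGCTTGACGTTAGCA",  # made-up 15-mer aptamer
    "Sequence2": "TTGACCGTAACGTGA",  # made-up 15-mer aptamer
    "Label": 1,                      # 1 if Sequence1 binds Albumin with higher affinity
}

# Encode the pair as one two-segment input (segment A = Sequence1, segment B = Sequence2).
encoded = tokenizer(
    record["Sequence1"],
    record["Sequence2"],
    padding="max_length",
    truncation=True,
    max_length=64,        # assumed value; use the length the model was fine-tuned with
    return_tensors="pt",
)

outputs = model(**encoded)
print(outputs.last_hidden_state.shape)  # (1, 64, hidden_size) token embeddings for the pair
```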

+ To predict batches of sequences, you have to employ the custom functions shown in [git/prediction.ipynb]() (a rough sketch of batched inference is given below).
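Those notebook functions are not reproduced here; the following is only a rough sketch of what batched inference with the encoder loaded above could look like. The aptamer pairs are made up, and mapping the pooled embeddings to the binary label is left to the notebook's own custom functions.

```python
import torch

# Made-up aptamer pairs, for illustration only.
pairs = [
    ("AGCTTGACGTTAGCA", "TTGACCGTAACGTGA"),
    ("CCGTAAGCTTGACGT", "GACGTTAGCATTGAC"),
]

# Tokenize all pairs at once as two-segment inputs.
batch = tokenizer(
    [first for first, _ in pairs],
    [second for _, second in pairs],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**batch)

# One pooled embedding per pair; prediction.ipynb's custom functions turn these
# (or the token-level states) into the "which sequence is more affine" label.
pair_embeddings = outputs.pooler_output
print(pair_embeddings.shape)  # (batch_size, hidden_size)
```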
 
 
 
 
 

+ #### Limitations and bias

+ - The fine-tuned Albert model appears to be limited to about 90 % accuracy when predicting which aptamer is more suitable for a target protein. Albert-large or a much larger dataset of 15-mer aptamers could add a few percent of accuracy; however, the extrapolation case has not been studied, and we cannot confirm that this model is state-of-the-art when one of the aptamers is exceptionally good (has almost maximum entropy with respect to Albumin).

  ## Eval results

+ accuracy : 0.8601
+ precision: 0.8515
+ recall   : 0.8725
+ f1       : 0.8618
+ roc_auc  : 0.9388

+ The scores were calculated using sklearn.metrics; a minimal example is shown below.
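For reference, this is a sketch of how such scores can be computed with sklearn.metrics; the label and score arrays below are placeholders, not the project's evaluation data.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Placeholder arrays for illustration; in the real evaluation they come from
# the held-out aptamer-pair test set.
y_true  = [1, 0, 1, 1, 0]                  # ground-truth labels
y_pred  = [1, 0, 1, 0, 0]                  # hard model predictions
y_score = [0.91, 0.12, 0.78, 0.45, 0.33]   # predicted probability of the positive class

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_score))
```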