Update README.md
README.md
CHANGED
@@ -8,32 +8,36 @@ Ankh3 is a protein language model that is jointly optimized on two objectives:
* Protein sequence completion.

1. Masked Language Modeling:
- The idea of this task is to intentionally 'corrupt' an input protein sequence by
masking a certain percentage (X%) of its individual tokens (amino acids),
and then training the model to reconstruct the original sequence.

- Example of a protein sequence before and after corruption:

Original protein sequence: MKAYVLINSRGP

This sequence will be masked/corrupted using sentinel tokens as shown below:
Sequence after corruption: M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>

The decoder learns to map each sentinel token to the actual amino acid that was masked.
In this example: <extra_id_0> K means that <extra_id_0> corresponds to the "K" amino acid, and so on.

Decoder output: <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P

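The corruption step in the example above can be sketched in plain Python. This is only an illustration of the input/target format, not the model's actual preprocessing code: the function name `corrupt_sequence` and the hand-picked mask positions are assumptions made for this sketch, and each masked amino acid gets its own sentinel token, as in the README example.

```python
def corrupt_sequence(sequence, masked_positions):
    """Replace each masked residue with a sentinel token (hypothetical helper).

    Returns the corrupted encoder input and the decoder target, both as
    space-separated token strings.
    """
    masked = set(masked_positions)
    encoder_tokens = []
    decoder_tokens = []
    sentinel_id = 0
    for index, amino_acid in enumerate(sequence):
        if index in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            # The encoder sees the sentinel in place of the residue.
            encoder_tokens.append(sentinel)
            # The decoder target pairs the sentinel with the hidden residue.
            decoder_tokens.extend([sentinel, amino_acid])
            sentinel_id += 1
        else:
            encoder_tokens.append(amino_acid)
    return " ".join(encoder_tokens), " ".join(decoder_tokens)


# Positions chosen by hand to match the README example (K, V, N, P).
corrupted, target = corrupt_sequence("MKAYVLINSRGP", [1, 4, 7, 11])
print(corrupted)
print(target)
```

In real training the masked positions would be sampled at random according to the masking percentage rather than fixed by hand.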
2. Protein Sequence Completion:
- The idea of this task is to cut the input sequence into
two segments, where the first segment is fed to the encoder
and the decoder is tasked to auto-regressively generate the
second segment conditioned on the first segment representation
produced by the encoder.

- Example of protein sequence completion:

Original sequence: MKAYVLINSRGP

We pass the first part, "MKAYVL", to the encoder, and the decoder is trained
to output the second part, "INSRGP", given the representation of the
first part provided by the encoder.
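The split described above can be sketched as follows. This is a minimal illustration of how a training pair might be formed, not the authors' pipeline: the helper names `split_for_completion` and `autoregressive_steps` and the fixed split index are assumptions made for this example.

```python
def split_for_completion(sequence, split_index):
    """Cut a sequence into an encoder input and a decoder target (hypothetical helper)."""
    if not 0 < split_index < len(sequence):
        raise ValueError("split_index must leave both segments non-empty")
    return sequence[:split_index], sequence[split_index:]


def autoregressive_steps(decoder_target):
    """List (tokens generated so far, next token) pairs for the decoder target."""
    return [(decoder_target[:t], decoder_target[t]) for t in range(len(decoder_target))]


encoder_input, decoder_target = split_for_completion("MKAYVLINSRGP", 6)
print(encoder_input)
print(decoder_target)

# Each step predicts one residue given the encoder representation of the
# first segment plus the residues the decoder has already produced.
for generated_so_far, next_token in autoregressive_steps(decoder_target):
    print(repr(generated_so_far), "->", next_token)
```

The split index is fixed at 6 here only to reproduce the README example; how the split point is chosen during training is not stated in this section.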