hazemessam committed on
Commit f77377b · verified · 1 Parent(s): 9f7c1ad

Update README.md

Files changed (1)
  1. README.md +20 -16
README.md CHANGED
@@ -8,32 +8,36 @@ Ankh3 is a protein language model that is jointly optimized on two objectives:
  * Protein sequence completion.
 
  1. Masked Language Modeling:
- The idea of this task is to intentionally 'corrupt' an input protein sequence by
- masking a certain percentage (X%) of its individual tokens (amino acids),
- and then train the model to reconstruct the original sequence.
+ - The idea of this task is to intentionally 'corrupt' an input protein sequence by
+ masking a certain percentage (X%) of its individual tokens (amino acids),
+ and then train the model to reconstruct the original sequence.
 
- Example on a protein sequence before and after corruption:
- Original protein sequence: MKAYVLINSRGP
-
- This sequence will be masked/corrupted using sentinel tokens as shown below:
- Sequence after corruption: M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
-
- The decoder learns to correspond each sentinel token to the actual amino acid that was masked.
- In this example: <extra_id_0> K means that <extra_id_0> corresponds to the "K" amino acid and so on.
- Decoder output: <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
+ - Example on a protein sequence before and after corruption:
+
+ Original protein sequence: MKAYVLINSRGP
+
+ This sequence will be masked/corrupted using sentinel tokens as shown below:
+ Sequence after corruption: M <extra_id_0> A Y <extra_id_1> L I <extra_id_2> S R G <extra_id_3>
+
+
+ The decoder learns to correspond each sentinel token to the actual amino acid that was masked.
+ In this example: <extra_id_0> K means that <extra_id_0> corresponds to the "K" amino acid and so on.
+
+ Decoder output: <extra_id_0> K <extra_id_1> V <extra_id_2> N <extra_id_3> P
 
 
 
  2. Protein Sequence Completion:
- The idea of this task is to cut the input sequence into
+ - The idea of this task is to cut the input sequence into
  two segments, where the first segment is fed to the encoder
  and the decoder is tasked to auto-regressively generate the
  second segment conditioned on the first segment representation
  outputted from the encoder.
-
- Example on protein sequence completion:
-
+
+ - Example on protein sequence completion:
+
  Original sequence: MKAYVLINSRGP
+
  We will pass "MKAYVL" of it to the encoder, and the decoder is trained
  that given the representation of the first part provided by the encoder,
  it should output the second part which is: "INSRGP"
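
The two objectives described in the README text above can be illustrated with a small, self-contained sketch. It only reproduces the input/target formatting of the README's own examples; it is not Ankh3's actual preprocessing code. The function names `mask_tokens` and `split_for_completion` are hypothetical, the masked positions are passed in explicitly rather than sampled at some X% rate, and each masked residue is assumed to get its own sentinel token (adjacent masked residues are not merged into one span).

```python
def mask_tokens(sequence, positions):
    """Replace each residue at `positions` with its own sentinel token.

    Returns (encoder_input, decoder_target) as space-separated strings.
    Assumption: one sentinel per masked residue, numbered left to right.
    """
    masked = set(positions)
    encoder_input, decoder_target = [], []
    sentinel_id = 0
    for i, residue in enumerate(sequence):
        if i in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            encoder_input.append(sentinel)
            # The target pairs each sentinel with the residue it hides.
            decoder_target.extend([sentinel, residue])
            sentinel_id += 1
        else:
            encoder_input.append(residue)
    return " ".join(encoder_input), " ".join(decoder_target)


def split_for_completion(sequence, cut):
    """Split a sequence into an encoder prefix and a decoder target suffix."""
    return sequence[:cut], sequence[cut:]


# Reproduce the README examples (0-based positions of K, V, N, P).
corrupted, target = mask_tokens("MKAYVLINSRGP", [1, 4, 7, 11])
prefix, suffix = split_for_completion("MKAYVLINSRGP", 6)
print(corrupted)
print(target)
print(prefix, suffix)
```

With these inputs, `corrupted` and `target` match the "Sequence after corruption" and "Decoder output" lines in the README, and `prefix`/`suffix` match the "MKAYVL" / "INSRGP" split.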