It is released along with [this paper](google.com).

## Intended uses & limitations

This model is designed for auto-regressively generating CDR3 \\(\beta\\) sequences against a pMHC of interest. This means the model assumes a plausible pMHC is provided as input: we have not tested the model on peptide-MHC pairs where the binding affinity is low, and we do not expect the model to adjust its predictions in such cases. This model is intended for academic purposes and should not be used in a clinical setting.
### How to use

You can use this model directly for conditional CDR3 \\(\beta\\) generation. The following is a minimal sketch: the checkpoint ID, source formatting, and generation settings are illustrative assumptions, not the card's exact values.

```python
import re

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder checkpoint ID: substitute this card's repository.
checkpoint = "<this-repo-id>"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Source string for the pMHC of interest: peptide plus MHC pseudo-sequence.
# The exact template is assumed; see "Example Input" under Finetuning.
peptide = "AVFDRKSDAK"
mhc_pseudo = "<34-residue NetMHCpan pseudo-sequence for HLA-A*11:01>"
inputs = tokenizer(f"{peptide} {mhc_pseudo}", return_tensors="pt")

# Sample candidate CDR3b sequences for this pMHC.
outputs = model.generate(**inputs, max_new_tokens=25,
                         do_sample=True, num_return_sequences=10)

# Strip bracketed special tokens (e.g. "[CLS]") from the decoded strings.
cdr3b_sequences = [re.sub(r'\[.*\]', '', x)
                   for x in tokenizer.batch_decode(outputs)]
print(cdr3b_sequences)
# e.g. [...,
#       'CASSLGTGGNQPQHF']
```

This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences. Again a minimal sketch; generating from an empty source string is an assumption here:

```python
import re

from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "<this-repo-id>"  # placeholder checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Sample CDR3b sequences without conditioning on a pMHC.
inputs = tokenizer("", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=25,
                         do_sample=True, num_return_sequences=10)
cdr3b_sequences = [re.sub(r'\[.*\]', '', x)
                   for x in tokenizer.batch_decode(outputs)]
```

The model was trained on a corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/).

### Preprocessing

All amino acid sequences and V/J gene names were standardized using the `tidytcells` package (see [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/)). MHC allele information was standardized using [`mhcgnomes`](https://pypi.org/project/mhcgnomes/) before being mapped to the MHC pseudo-sequence as defined in [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/).
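
For illustration, a minimal sketch of this standardization step; this is not the repo's preprocessing code, and the specific calls and inputs are assumptions:

```python
import tidytcells as tt
import mhcgnomes

# Standardize a TCR V gene symbol and a CDR3b junction sequence
# (hypothetical inputs).
v_gene = tt.tr.standardize("TRBV20-1")
cdr3b = tt.junction.standardize("CASSLGTGGNQPQHF")

# Normalize an MHC allele name.
allele = mhcgnomes.parse("HLA-A*11:01")
print(v_gene, cdr3b, allele.to_string())

# Mapping the normalized allele to its NetMHCpan pseudo-sequence would use
# a lookup table from NetMHCpan (not shown here).
```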
### Pre-training

The pre-training objective masks `mlm_probability` tokens, grouped into spans of size `max_span_length`.
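
For illustration, one way such span masking can be implemented; this is a sketch, not the repo's actual collator, and the defaults are assumptions:

```python
import numpy as np

def span_mask(num_tokens, mlm_probability=0.15, max_span_length=3, seed=None):
    """Boolean mask covering ~mlm_probability of positions in short spans."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_tokens, dtype=bool)
    target = int(round(num_tokens * mlm_probability))
    while mask.sum() < target:
        span = rng.integers(1, max_span_length + 1)  # span length in [1, max]
        start = rng.integers(0, num_tokens)          # random start position
        mask[start:start + span] = True
    return mask
```
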
### Finetuning

TCRT5 was finetuned on peptide-pseudosequence -> CDR3 \\(\beta\\) source:target pairs using the canonical cross-entropy loss:
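
$$
\mathcal{L} = CE(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{i=1}^n \mathbf{y}_i \log \hat{\mathbf{y}}_i = - \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log p_\theta (y_{ij} \mid \mathbf{x})
$$

where \\(n\\) is the target sequence length and \\(k\\) is the vocabulary size.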
```
Example Input:
...
```

## Results
This fine-tuned model achieves the following results on conditional CDR3 \\(\beta\\) generation for our validation set of the top-20 peptide-MHCs with the most abundant known TCRs (in alphabetical order):
1. AVFDRKSDAK_A*11:01
2. CRVRLCCYVL_C*07:02