---
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- biology
- immunology
- seq2seq
pipeline_tag: text2text-generation
base_model:
- dkarthikeyan1/tcrt5_pre_tcrdb
---

# TCRT5 model (finetuned)

## Model description

TCRT5 is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model built on the [T5 architecture](https://github.com/google-research/text-to-text-transfer-transformer/tree/main/t5), operationalized by the associated HuggingFace [abstraction](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/t5#transformers.T5ForConditionalGeneration). It is released along with [this paper](google.com).

## Intended uses & limitations

This model is designed for auto-regressively generating CDR3 \\(\beta\\) sequences against a pMHC of interest, and it assumes that a plausible pMHC is provided as input. We have not tested the model on peptides and MHC sequences where the peptide-MHC binding affinity is low, and we do not expect the model to adjust its predictions accordingly. This model is intended for academic purposes and should not be used in a clinical setting.

### How to use

You can use this model directly for conditional CDR3 \\(\beta\\) generation:

```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")

pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors='pt')

# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30

outputs = tcrt5.generate(**encoded_pmhc, max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams)

# Use a regex to strip out the [TCR] tag
cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(outputs, skip_special_tokens=True)]

>>> cdr3b_sequences
['CASSLGTGGTDTQYF', 'CASSPGTGGTDTQYF', 'CASSLGQGGTEAFF', 'CASSVGTGGTDTQYF', 'CASSLGTGGSYEQYF', 'CASSPGQGGTEAFF', 'CASSSGTGGTDTQYF', 'CASSLGGGGTDTQYF', 'CASSLGGGSYEQYF', 'CASSLGTGGNQPQHF']
```

This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences:

```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")

# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30

unconditional_outputs = tcrt5.generate(max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams)

# Use a regex to strip out the [TCR] tag
uncond_cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(unconditional_outputs, skip_special_tokens=True)]

>>> uncond_cdr3b_sequences
['CASSLGGETQYF', 'CASSLGQGNTEAFF', 'CASSLGQGNTGELFF', 'CASSLGTSGTDTQYF', 'CASSLGLAGSYNEQFF', 'CASSLGLAGTDTQYF', 'CASSLGQGYEQYF', 'CASSLGLAGGNTGELFF', 'CASSLGGTGELFF', 'CASSLGQGAYEQYF']
```

**Note:** For conditional generation, we found that model performance was greatest using beam search decoding. However, we also observed a reduction in sequence diversity with this decoding method. If you would like to generate more diverse sequences, TCRT5 supports a range of alternative decoding strategies, described [here](https://huggingface.co/docs/transformers/generation_strategies) and [here](https://huggingface.co/blog/how-to-generate).
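For instance, a minimal sketch of nucleus (top-p) sampling with the same model; the `top_p` and `temperature` values below are illustrative placeholders, not tuned recommendations from the paper:

```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")

pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors='pt')

# Sample instead of beam search for more diverse CDR3b sequences.
# The sampling hyperparameters here are illustrative, not recommendations.
sampled_outputs = tcrt5.generate(
    **encoded_pmhc,
    max_new_tokens=25,
    do_sample=True,
    top_p=0.95,
    temperature=1.0,
    num_return_sequences=10,
)

# Strip the [TCR] tag, as in the examples above
sampled_cdr3b = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(sampled_outputs, skip_special_tokens=True)]
```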
### Limitations and bias

A known bias of TCRT5's predictions is its preference for sampling sequences with high V(D)J recombination probability, as computed by [OLGA](https://github.com/statbiophys/OLGA). This can be attenuated by using alternative decoding methods such as ancestral sampling.

## Training data

TCRT5 was pre-trained on masked span reconstruction of ~14M TCR sequences from [TCRdb](http://bioinfo.life.hust.edu.cn/TCRdb/) as well as ~780k peptide-pseudosequence pairs taken from [IEDB](https://www.iedb.org/). Finetuning was done using a parallel corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/), [IEDB](https://www.iedb.org/), [McPAS](https://friedmanlab.weizmann.ac.il/McPAS-TCR/), and semi-synthetic examples from [MIRA](https://pmc.ncbi.nlm.nih.gov/articles/PMC7418738/).

## Training procedure

### Preprocessing

All amino acid sequences and V/J gene names were standardized using the [`tidytcells`](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/) package. MHC allele information was standardized using [`mhcgnomes`](https://pypi.org/project/mhcgnomes/) before mapping each allele to its MHC pseudo-sequence as defined in [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/).

### Pre-training

TCRT5 was pre-trained with masked language modeling (MLM), specifically span reconstruction similar to the original T5 training objective. For a given sequence, the model masks 15% of the tokens using contiguous spans of random length between 1 and 3, marked by the sentinel tokens introduced in the T5 paper. The entire masked sequence is passed to the model, which is trained to reconstruct a concatenated sequence of the sentinel tokens followed by the tokens they mask. This forces the model to learn richer k-mer dependencies within the masked sequences.

```
Masks 'mlm_probability' tokens grouped into spans of size up to 'max_span_length'
according to the following algorithm:
 * Randomly generate span lengths that add up to round(mlm_probability * seq_len)
   (ignoring pad tokens) for each sequence.
 * Ensure that the spans are not directly adjacent so that max_span_length is observed.
 * Once the span masks are generated according to T5 standards, mask the inputs
   and generate the targets.

Example Input: CASSLGQGYEQYF
Masked Input:  CASSLG[X]GY[Y]F
Target:        [X]Q[Y]EQY[Z]
```
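For illustration, here is a simplified, self-contained re-implementation of this span-corruption scheme in plain Python. It is a sketch of the algorithm described above, not the actual training code, and the `<extra_id_i>` sentinel names (standing in for the `[X]`, `[Y]`, `[Z]` placeholders) are assumptions rather than the exact tokens in the model's vocabulary:

```python
import random


def t5_span_corrupt(seq, mlm_probability=0.15, max_span_length=3, seed=0):
    """Toy T5-style span corruption for a single amino-acid sequence.

    Masks roughly `mlm_probability` of the residues using non-adjacent spans
    of length 1..`max_span_length`, replaces each span with a sentinel token,
    and returns (masked_input, target), where the target concatenates each
    sentinel with the residues it hides and ends with a closing sentinel.
    Assumes mlm_probability is small relative to the sequence length.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(mlm_probability * len(seq)))

    masked_positions, spans = set(), []
    while sum(len(s) for s in spans) < n_to_mask:
        span_len = rng.randint(1, max_span_length)
        start = rng.randint(0, len(seq) - span_len)
        span = list(range(start, start + span_len))
        # Reject spans that touch an already-masked position so that spans
        # stay non-adjacent and the maximum span length is respected.
        if any({p - 1, p, p + 1} & masked_positions for p in span):
            continue
        spans.append(span)
        masked_positions.update(span)

    spans.sort(key=lambda s: s[0])
    masked_input, target, cursor = [], [], 0
    for i, span in enumerate(spans):
        masked_input.append(seq[cursor:span[0]] + f"<extra_id_{i}>")
        target.append(f"<extra_id_{i}>" + seq[span[0]:span[-1] + 1])
        cursor = span[-1] + 1
    masked_input.append(seq[cursor:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel ("[Z]" above)
    return "".join(masked_input), "".join(target)


# Follows the pattern shown in the example block above (exact spans depend on the seed).
masked_input, target = t5_span_corrupt("CASSLGQGYEQYF")
```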
### Finetuning

TCRT5 was finetuned on peptide-pseudosequence -> CDR3 \\(\beta\\) source:target pairs using the canonical cross-entropy loss.

```
Example Input: [PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]
Target:        [TCR]CASSLGYNEQFF[EOS]
```

## Results

This fine-tuned model achieves the following results on conditional CDR3 \\(\beta\\) generation on our validation set of the top-20 peptide-MHCs with the most abundant known TCRs (in alphabetical order):

1. AVFDRKSDAK_**A*11:01**
2. CRVRLCCYVL_**C*07:02**
3. EAAGIGILTV_**A*02:01**
4. ELAGIGILTV_**A*02:01**
5. GILGFVFTL_**A*02:01**
6. GLCTLVAML_**A*02:01**
7. IVTDFSVIK_**A*11:01**
8. KLGGALQAK_**A*03:01**
9. LLLDRLNQL_**A*02:01**
10. LLWNGPMAV_**A*02:01**
11. LPRRSGAAGA_**B*07:02**
12. LVVDFSQFSR_**A*11:01**
13. NLVPMVATV_**A*02:01**
14. RAKFKQLL_**B*08:01**
15. SPRWYFYYL_**B*07:02**
16. STLPETAAVRR_**A*11:01**
17. TPRVTGGGAM_**B*07:02**
18. TTDPSFLGRY_**A*01:01**
19. YLQPRTFLL_**A*02:01**
20. YVLDHLIVV_**A*02:01**

Benchmark results:

| Model | Char-BLEU | F@100 | SeqRec% | Diversity (num_seq) | Avg. Jaccard Dissimilarity | Perplexity |
|:-----------------:|:---------:|:-----:|:-------:|:-------------------:|:--------------------------:|:----------:|
| TCRT5 (finetuned) | 96.4 | 0.09 | 89.2 | 1300 (2000 max) | 94.4/100 | 2.48 |

### BibTeX entry and citation info

```bibtex
@article{dkarthikeyan2024tcrtranslate,
  title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
  author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
  journal={bioRxiv},
  year={2024},
}
```