## Model Overview

AgroNT is a DNA language model trained primarily on edible plant genomes. More specifically, AgroNT uses the transformer architecture with self-attention and a masked language modeling objective to leverage the widely available genotype data from 48 different plant species and learn general representations of nucleotide sequences. AgroNT contains 1 billion parameters and has a context window of 1024 tokens.

AgroNT uses a non-overlapping 6-mer tokenizer to convert genomic nucleotide sequences into tokens. As a result, the 1024 tokens correspond to approximately 6144 base pairs.

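As a quick sanity check on the numbers above (a back-of-the-envelope sketch that assumes every token in the window is a full 6-mer; special and standalone-base tokens make the exact figure slightly smaller):

```python
# Approximate genomic span covered by one AgroNT input window,
# assuming every token is a full 6-mer.
context_tokens = 1024
bases_per_token = 6
print(context_tokens * bases_per_token)  # 6144 base pairs
```
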
## How to use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch


model_name = 'agro-nt'

# fetch model and tokenizer from InstaDeep's hf repo
agro_nt_model = AutoModelForMaskedLM.from_pretrained(f'InstaDeepAI/{model_name}')
agro_nt_tokenizer = AutoTokenizer.from_pretrained(f'InstaDeepAI/{model_name}')

print(f"Loaded the {model_name} model with {agro_nt_model.num_parameters()} parameters and corresponding tokenizer.")

# example sequence and tokenization
sequences = ['ATATACGGCCGNC']

batch_tokens = agro_nt_tokenizer(sequences, padding='longest')['input_ids']
print(f"Tokenized sequence: {agro_nt_tokenizer.batch_decode(batch_tokens)}")

torch_batch_tokens = torch.tensor(batch_tokens)
attention_mask = torch_batch_tokens != agro_nt_tokenizer.pad_token_id

# inference
outs = agro_nt_model(
    torch_batch_tokens,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

# get the final layer embeddings and language model head logits
embeddings = outs['hidden_states'][-1].detach().numpy()
logits = outs['logits'].detach().numpy()
```
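
One common way to get a single fixed-size embedding per sequence is to mean-pool the final hidden states over non-padding positions. The snippet below reuses `outs` and `attention_mask` from the example above; the pooling choice is an illustration, not an official recommendation.

```python
# Mean-pool the final hidden states over non-padding positions to obtain one
# fixed-size embedding per sequence (illustrative pooling choice).
hidden_states = outs['hidden_states'][-1]                    # (batch, seq_len, hidden_dim)
mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
mean_embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)                                 # (batch, hidden_dim)
```
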
## Pre-training

#### Data
Our pre-training dataset was built from the reference genomes of (mostly) edible plants contained in the Ensembl Plants database. The dataset consists of approximately 10.5 million genomic sequences across 48 different species.

#### Processing
All reference genomes for each species were assembled into a single FASTA file. In this FASTA file, all nucleotides other than A, T, C, G were replaced by N. We used a tokenizer to convert strings of letters into sequences of tokens. The tokenizer's alphabet consisted of the $4^6 = 4096$ possible 6-mer combinations obtained by combining A, T, C, G, as well as five additional tokens representing standalone A, T, C, G, and N. It also included three special tokens: the padding ([PAD]), masking ([MASK]), and beginning-of-sequence (also called class; [CLS]) tokens. This resulted in a vocabulary of 4104 tokens. To tokenize an input sequence, the tokenizer started with a class token and then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter N was present or when the sequence length was not a multiple of 6).
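
A simplified sketch of that scheme is shown below (for illustration only; the tokenizer shipped with the model may handle edge cases differently):

```python
# Simplified sketch of the tokenization scheme described above: a [CLS] token,
# then left-to-right non-overlapping 6-mers over A/T/C/G, falling back to
# standalone single-base tokens when a full 6-mer cannot be formed.
def toy_tokenize(sequence):
    tokens = ["[CLS]"]
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + 6]
        if len(chunk) == 6 and set(chunk) <= set("ATCG"):
            tokens.append(chunk)     # one of the 4096 6-mer tokens
            i += 6
        else:
            tokens.append(chunk[0])  # standalone A, T, C, G or N token
            i += 1
    return tokens

print(toy_tokenize("ATATACGGCCGNC"))
# ['[CLS]', 'ATATAC', 'G', 'G', 'C', 'C', 'G', 'N', 'C']
```
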
#### Training
The MLM objective was used to pre-train AgroNT in a self-supervised manner. In a self-supervised learning setting, annotations (supervision) for each sequence are not needed, as we can mask some proportion of the sequence and use the information contained in the unmasked portion to predict the masked locations. This allows us to leverage the vast amount of unlabeled genomic sequencing data available. Specifically, 15% of the tokens in the input sequence are selected for corruption: 80% of these are replaced with a mask token, 10% are randomly replaced by another token from the vocabulary, and the final 10% keep their original token. The tokenized sequence is passed through the model and a cross-entropy loss is computed for the masked tokens. Pre-training was carried out with a sequence length of 1024 tokens and an effective batch size of 1.5M tokens for 315k update steps, resulting in the model training on a total of 472.5B tokens.
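
The corruption step can be sketched as follows (an illustration of the standard BERT-style 80/10/10 scheme with the proportions quoted above, not the actual pre-training code; the token ids in the toy usage are placeholders):

```python
# Sketch of the 15% selection and 80/10/10 corruption described above.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    labels = input_ids.clone()
    # select ~15% of positions; unselected positions are ignored by the loss (-100)
    selected = torch.rand(input_ids.shape) < mlm_probability
    labels[~selected] = -100

    corrupted = input_ids.clone()
    # 80% of the selected positions -> [MASK]
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[to_mask] = mask_token_id
    # half of the remaining selected positions (10% of selected) -> random token;
    # the final 10% keep their original token
    to_randomize = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    corrupted[to_randomize] = torch.randint(vocab_size, input_ids.shape)[to_randomize]
    return corrupted, labels

# toy usage: vocabulary of 4104 tokens, sequence length of 1024 (mask id is a placeholder)
ids = torch.randint(0, 4104, (1, 1024))
corrupted, labels = mask_tokens(ids, mask_token_id=4103, vocab_size=4104)
```
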
#### Hardware
Model pre-training was carried out on Google TPU v4 accelerators, specifically a TPU v4-1024 containing 512 devices. We trained for a total of approximately four days.