---
license: cc-by-nc-sa-4.0
widget:
- text: ACCTGATTCTGAGTC
tags:
- DNA
- biology
- genomics
- segmentation
---

# segment-nt-30kb

Segment-NT-30kb is a segmentation model that leverages the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements in input sequences of up to 30kb. These include gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).

**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### Model Sources

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models]() TODO: Add link to preprint

### How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the model:

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

The snippet below computes the logits and the corresponding probabilities for two dummy DNA sequences.

```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

# Hugging Face access token; set it if the repository requires authentication.
hf_token = None

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb", use_auth_token=hf_token, trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb", use_auth_token=hf_token, trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Run the model, masking out the padding tokens
attention_mask = tokens_ids != tokenizer.pad_token_id
outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Keep the logits as a tensor so that softmax can be applied to them
logits = outs.logits.detach()
probabilities = torch.nn.functional.softmax(logits, dim=-1)
```

## Training data

The **segment-nt-30kb** model was trained on all human chromosomes except chromosomes 20 and 21, which were kept as the test set, and chromosome 22, which was used as the validation set. During training, sequences are randomly sampled from the genome together with their annotations. The sequences in the validation and test sets, however, are kept fixed by using a sliding window of length 30,000 over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.

## Training procedure

### Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

```
```

### Training

The model was trained on a DGX H100 node with 8 GPUs on a total of 23B tokens for 3 days. It was trained successively on 3kb, 10kb, 20kb and finally 30kb sequences, each time with an effective batch size of 256 sequences.

### Architecture

The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.

### BibTeX entry and citation info

TODO: Add BibTeX citation here

```bibtex
```