metadata

license: cc-by-nc-sa-4.0
widget:
  - text: ACCTGA<mask>TTCTGAGTC
tags:
  - DNA
  - biology
  - genomics
  - segmentation

segment-nt-30kb

Segment-NT-30kb is a segmentation model that leverages the Nucleotide Transformer (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements, on input sequences of up to 30kb. These include gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites) elements.

Developed by: InstaDeep

Model Sources

Repository: Nucleotide Transformer
Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models TODO: Add link to preprint

How to use

Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:

pip install --upgrade git+https://github.com/huggingface/transformers.git

A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.

# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

# Set this to your Hugging Face access token if the download requires
# authentication; None uses anonymous access or a cached login.
hf_token = None

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb", token=hf_token, trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb", token=hf_token, trust_remote_code=True)


# Choose the length to which the input sequences are padded. By default, the 
# model max length is chosen, but feel free to decrease it as the time taken to 
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
torch_tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Compute the embeddings
attention_mask = torch_tokens != tokenizer.pad_token_id
outs = model(
    torch_tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

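# Keep the logits as a torch tensor so that softmax can be applied to them
logits = outs.logits.detach()
# Convert the logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)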

Training data

The segment-nt-30kb model was trained on all human chromosomes except for chromosomes 20 and 21, kept as the test set, and chromosome 22, used as the validation set. During training, sequences were randomly sampled from the genome together with their associated annotations. The sequences in the validation and test sets, however, were kept fixed by using a sliding window of length 30,000 over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
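As a minimal sketch of how fixed evaluation windows can be derived from a chromosome, the snippet below enumerates window coordinates. The function name `fixed_windows`, the stride (assumed equal to the window length, so windows do not overlap), the dropping of a trailing partial window, and the toy chromosome length are all assumptions made for illustration; the text above only states the window length of 30,000.

```python
def fixed_windows(chrom_length, window=30_000, stride=None):
    """Yield (start, end) coordinates of fixed windows over one chromosome.

    Assumption: the stride defaults to the window length (non-overlapping
    windows) and a trailing partial window is dropped. Neither is stated
    in the model card.
    """
    if stride is None:
        stride = window
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


# Toy chromosome length, chosen for illustration only.
windows = list(fixed_windows(100_000))
print(windows)
```

Because the coordinates depend only on the chromosome length, the same windows are produced on every run, which is what keeps an evaluation set fixed.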

Training procedure

Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mer tokens as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
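The input format above can be illustrated with a small pure-Python sketch. It is not the Nucleotide Transformer Tokenizer itself: the function `kmer_tokenize` is hypothetical, and the handling of leftover nucleotides (emitted here as single-nucleotide tokens) is an assumption that should be checked against the real tokenizer.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA string into non-overlapping k-mers, prefixed with <CLS>.

    Assumption: nucleotides left over after the last full k-mer are emitted
    as single-nucleotide tokens. This is an illustrative stand-in, not the
    library's tokenizer.
    """
    sequence = sequence.upper()
    n_full = len(sequence) // k
    tokens = ["<CLS>"]
    tokens += [sequence[i * k:(i + 1) * k] for i in range(n_full)]
    tokens += list(sequence[n_full * k:])
    return tokens


# Rebuild the example sequence from the five 6-mers shown above.
example = "".join(["ACGTGT", "ACGTGC", "ACGGAC", "GACTAG", "TCAGCA"])
tokens = kmer_tokenize(example)
print(" ".join(f"<{t}>" if t != "<CLS>" else t for t in tokens))
```

In practice, the tokenizer loaded with `AutoTokenizer.from_pretrained` should be used, since it also maps tokens to vocabulary ids and handles padding.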

Training

The model was trained on a DGX H100 node with 8 GPUs on a total of 23B tokens for 3 days. The model was trained successively on 3kb, 10kb, 20kb and finally 30kb sequences, each time with an effective batch size of 256 sequences.
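To give a rough sense of scale, the sketch below estimates the number of 6-mer tokens in one effective batch at each sequence length. It counts only full 6-mers and ignores the CLS token and any leftover nucleotides, which is a simplifying assumption; the constant names are illustrative.

```python
K = 6                # nucleotides per token (6-mer tokenization)
BATCH_SIZE = 256     # effective batch size, in sequences
LENGTHS_BP = (3_000, 10_000, 20_000, 30_000)

# Approximate tokens per effective batch, counting full 6-mers only.
tokens_per_batch = {
    length: (length // K) * BATCH_SIZE for length in LENGTHS_BP
}

for length, n_tokens in tokens_per_batch.items():
    print(f"{length:>6} bp -> ~{n_tokens:,} tokens per batch")
```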

Architecture

The model is composed of the nucleotide-transformer-v2-500m-multi-species encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 63 million parameters, bringing the total number of parameters to 562M.

BibTeX entry and citation info

#TODO: Add bibtex citation here