MsAlEhR/MetaBERTa-bigbird-gene

Model Overview: This model builds on the BigBird architecture and follows an approach similar to the one detailed in our paper, "Leveraging Large Language Models for Metagenomic Analysis." It is optimized to improve BigBird's performance on large gene sequence data. Trained specifically on gene sequences, it is intended for metagenomic analysis and is evaluated on several tasks, including classification and sequence embedding.

Model Architecture:

Base Model: BigBird transformer architecture
Tokenizer: Custom K-mer Tokenizer with k-mer length of 6 and overlapping tokens
Training: Trained on a diverse dataset of 497 genes from 2000 bacterial and archaeal genomes
Embeddings: Generates sequence embeddings using mean pooling of hidden states
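
To make the tokenizer setting above concrete, the sketch below splits a sequence into overlapping 6-mers in plain Python. It is only an illustration: `overlapping_kmers` is a hypothetical helper, not part of the KmerTokenizer package, and the real tokenizer also maps each k-mer to an integer ID, which is not reproduced here.

```python
def overlapping_kmers(seq, k=6):
    """Return all k-mers of `seq` taken with a stride of 1 (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 overlapping k-mers (none if n < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


kmers = overlapping_kmers("ATCGATGC", k=6)
print(kmers)  # ['ATCGAT', 'TCGATG', 'CGATGC']
```

Because consecutive k-mers share k - 1 bases, the token count grows almost one-to-one with sequence length, which is one reason a long-context architecture such as BigBird suits gene-length inputs.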

Dataset: Scorpio Gene-Taxa Benchmark Dataset: https://zenodo.org/records/12964684

Steps to Use the Model:

Install KmerTokenizer:

pip install git+https://github.com/MsAlEhR/KmerTokenizer.git

Example Code:

from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch

# Example gene sequence
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"

# Initialize the tokenizer (overlapping 6-mers, up to 4096 tokens)
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=4096)
tokenized_output = tokenizer.kmer_tokenize(seq)
pad_token_id = 2  # Set pad token ID

# Convert tokenized output to LongTensor and add batch dimension
inputs = torch.tensor([tokenized_output], dtype=torch.long)

# Create attention mask (1 for tokens, 0 for padding), shape (1, seq_len)
attention_mask = (inputs != pad_token_id).long()

# Load the pre-trained BigBird model
model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)

# Generate hidden states; gradients are not needed for inference
with torch.no_grad():
    outputs = model(input_ids=inputs, attention_mask=attention_mask)

# Get embeddings from the last hidden state, shape (1, seq_len, hidden_size)
embeddings = outputs.hidden_states[-1]

# Expand attention mask to match the embedding dimensions, shape (1, seq_len, 1)
expanded_attention_mask = attention_mask.unsqueeze(-1)

# Compute mean sequence embeddings over non-padding tokens, shape (1, hidden_size)
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
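
The mean pooling step above averages only the positions where the attention mask is 1, so padding tokens do not dilute the embedding. The toy example below works through that arithmetic in plain Python with made-up numbers. `masked_mean` is a hypothetical helper, and the 2-dimensional vectors merely stand in for the model's hidden states.

```python
def masked_mean(hidden_states, mask):
    """Average the token vectors whose mask entry is 1 (hypothetical helper)."""
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask selects no tokens")
    totals = [0.0] * len(hidden_states[0])
    for vector, keep in zip(hidden_states, mask):
        if keep:
            for j, value in enumerate(vector):
                totals[j] += value
    return [total / kept for total in totals]


# Three token vectors; the last one plays the role of padding and is masked out.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(hidden_states, mask))  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average toward 100, which is why the example code divides by the number of real tokens rather than by the padded length.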

Citation: For a detailed overview of leveraging large language models for metagenomic analysis, and for the Scorpio work behind the dataset, refer to our papers:

Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. IEEE SPMB.

Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences. bioRxiv, 2024-07.