|
--- |
|
license: mit |
|
pipeline_tag: fill-mask |
|
tags: |
|
- biology |
|
- metagenomics |
|
- Roberta |
|
--- |
|
### Leveraging Large Language Models for Metagenomic Analysis |
|
|
|
**Model Overview:** |
|
This model builds on the RoBERTa architecture, following an approach similar to the one described in our paper "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on V100 GPUs. |
|
|
|
**Model Architecture:** |
|
- **Base Model:** RoBERTa transformer architecture |

- **Tokenizer:** Custom `KmerTokenizer` with a k-mer length of 6 and overlapping tokens (see the sketch below) |

- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18k genomes (Bacteria and Archaea) |
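
With overlapping tokens, the tokenizer slides a window of length 6 across a fragment with a stride of 1, so a fragment of length *L* yields *L* − 5 k-mers. The snippet below is only an illustrative sketch of that windowing, not the actual `KmerTokenizer`, which presumably also maps each k-mer to a vocabulary ID and pads sequences to `maxlen`.

```python
# Illustrative sketch of overlapping k-mer extraction (not the actual KmerTokenizer).
def overlapping_kmers(seq: str, k: int = 6) -> list:
    """Slide a window of length k over the sequence with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(overlapping_kmers("ATCGATGC", k=6))
# ['ATCGAT', 'TCGATG', 'CGATGC']  -> 8 - 6 + 1 = 3 overlapping 6-mers
```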
|
|
|
|
|
**Steps to Use the Model:** |
|
|
|
1. **Install KmerTokenizer:** |

   ```sh |
   pip install git+https://github.com/MsAlEhR/KmerTokenizer.git |
   ``` |

2. **Example Code:** |
|
```python |
|
from KmerTokenizer import KmerTokenizer |
|
from transformers import AutoModel |
|
import torch |
|
|
|
# Example gene sequence |
|
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC" |
|
|
|
# Initialize the tokenizer |
|
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400) |
|
tokenized_output = tokenizer.kmer_tokenize(seq) |
|
pad_token_id = 2 # Set pad token ID |
|
|
|
# Create attention mask (1 for tokens, 0 for padding) |
|
attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0) |
|
|
|
# Convert tokenized output to LongTensor and add batch dimension |
|
inputs = torch.tensor([tokenized_output], dtype=torch.long) |
|
|
|
# Load the pre-trained MetaBERTa model (RoBERTa-based) with hidden states enabled |
|
model = AutoModel.from_pretrained("MsAlEhR/MetaBerta-400-fragments-18k-genome", output_hidden_states=True) |
|
|
|
# Generate hidden states |
|
outputs = model(input_ids=inputs, attention_mask=attention_mask) |
|
|
|
# Get embeddings from the last hidden state |
|
embeddings = outputs.hidden_states[-1] |
|
|
|
# Expand attention mask to match the embedding dimensions |
|
expanded_attention_mask = attention_mask.unsqueeze(-1) |
|
|
|
# Compute mean sequence embeddings |
|
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1) |
|
|
|
``` |
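
The pooled `mean_sequence_embeddings` tensor has shape `(batch_size, hidden_size)` and can be used directly for downstream analysis. The sketch below (an illustrative example, not part of the original card) reuses the `tokenizer`, `model`, and `pad_token_id` defined above to embed several fragments at once; it assumes `kmer_tokenize` pads every sequence to the same `maxlen` with the pad token ID.

```python
# Sketch: batch embedding of multiple fragments, reusing tokenizer/model/pad_token_id
# from the example above. Assumes kmer_tokenize pads each sequence to maxlen,
# so all tokenized outputs have equal length and can be stacked.
seqs = [
    "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC",
    "GGGGCCCCATATATATATGCGCGCGCTTTTAAAACCGG",
]

batch_ids = torch.tensor([tokenizer.kmer_tokenize(s) for s in seqs], dtype=torch.long)
batch_mask = (batch_ids != pad_token_id).long()

with torch.no_grad():
    batch_outputs = model(input_ids=batch_ids, attention_mask=batch_mask)

# Mean-pool the last hidden state over non-padding positions
batch_embeddings = batch_outputs.hidden_states[-1]
mask_expanded = batch_mask.unsqueeze(-1)
batch_mean_embeddings = (mask_expanded * batch_embeddings).sum(dim=1) / mask_expanded.sum(dim=1)

print(batch_mean_embeddings.shape)  # (number of sequences, hidden_size)
```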
|
|
|
**Citation:** |
|
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers: |
|
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*. |

> |

> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA Sequences. *bioRxiv*. |