---
license: mit
pipeline_tag: mask-generation
tags:
- biology
- metagenomics
- Roberta
---
### Leveraging Large Language Models for Metagenomic Analysis
**Model Overview:**
This model builds on the RoBERTa architecture and follows the approach described in our paper, "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on V100 GPUs.
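To confirm the underlying architecture and hyperparameters, the checkpoint's configuration can be inspected directly. This is a minimal sketch using the Hugging Face `AutoConfig` API; the commented values are expectations, not guarantees:
```python
from transformers import AutoConfig

# Inspect the configuration shipped with the checkpoint
config = AutoConfig.from_pretrained("MsAlEhR/MetaBerta-400-fragments-18k-genome")
print(config.model_type)  # expected to report a RoBERTa-style architecture
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
```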
**Model Architecture:**
- **Base Model:** RoBERTa transformer architecture
- **Tokenizer:** Custom KmerTokenizer with a k-mer length of 6 and overlapping tokens (see the sketch after this list)
- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18k genomes (Bacteria and Archaea)
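For intuition, overlapping k-mer tokenization slides a window of length 6 across the sequence one base at a time. The snippet below is only a minimal illustration of that idea, not the actual `KmerTokenizer` implementation (which, as used in the example further down, also maps k-mers to integer IDs and applies padding):
```python
# Minimal sketch of overlapping k-mer splitting (k = 6, stride 1).
# Not the real KmerTokenizer: the real tokenizer also converts k-mers
# to integer IDs and pads/truncates to a fixed length.
def split_into_kmers(seq: str, k: int = 6) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(split_into_kmers("ATCGATGC"))
# ['ATCGAT', 'TCGATG', 'CGATGC']
```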
**Steps to Use the Model:**
1. **Install KmerTokenizer:**
   ```sh
   pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
   ```
2. **Example Code:**
```python
from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch
# Example gene sequence
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"
# Initialize the tokenizer
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
tokenized_output = tokenizer.kmer_tokenize(seq)
pad_token_id = 2 # Set pad token ID
# Create attention mask (1 for tokens, 0 for padding)
attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)
# Convert tokenized output to LongTensor and add batch dimension
inputs = torch.tensor([tokenized_output], dtype=torch.long)
# Load the pre-trained RoBERTa-based model with hidden states enabled
model = AutoModel.from_pretrained("MsAlEhR/MetaBerta-400-fragments-18k-genome", output_hidden_states=True)
# Generate hidden states
outputs = model(input_ids=inputs, attention_mask=attention_mask)
# Get embeddings from the last hidden state
embeddings = outputs.hidden_states[-1]
# Expand attention mask to match the embedding dimensions
expanded_attention_mask = attention_mask.unsqueeze(-1)
# Compute mean sequence embeddings
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
```
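The resulting `mean_sequence_embeddings` tensor has shape `(batch_size, hidden_size)` and serves as a fixed-length representation of the input fragment. As one illustrative downstream use, two fragments can be compared by the cosine similarity of their embeddings. The `embed` helper below is a hypothetical wrapper around the exact steps shown above and reuses `tokenizer`, `model`, and `pad_token_id` from that example:
```python
import torch
import torch.nn.functional as F

def embed(seq: str) -> torch.Tensor:
    # Hypothetical helper: same tokenize -> encode -> mean-pool steps as above
    tokens = tokenizer.kmer_tokenize(seq)
    input_ids = torch.tensor([tokens], dtype=torch.long)
    mask = (input_ids != pad_token_id).long()
    with torch.no_grad():
        hidden = model(input_ids=input_ids, attention_mask=mask).hidden_states[-1]
    mask = mask.unsqueeze(-1)
    return torch.sum(mask * hidden, dim=1) / torch.sum(mask, dim=1)

# Compare two fragments via cosine similarity of their mean embeddings
emb_a = embed("ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC")
emb_b = embed("ATGCATGCATGCATGCATGCATGCATGCATGC")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")
```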
**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences. *bioRxiv*.