metadata

license: apache-2.0

Model Overview

PlantCaduceus is a DNA language model built on the Caduceus architecture and pre-trained with a masked language modeling objective on the genomes of 16 angiosperm species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models of increasing size:

PlantCaduceus_l20: 20 layers, 384 hidden size, 20M parameters
PlantCaduceus_l24: 24 layers, 512 hidden size, 40M parameters
PlantCaduceus_l28: 28 layers, 768 hidden size, 112M parameters
PlantCaduceus_l32: 32 layers, 1024 hidden size, 225M parameters
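
As a rough aid for choosing among the variants listed above, the sketch below stores the published layer counts, hidden sizes, and parameter counts in a table and picks the largest model that fits a parameter budget. The names `PLANTCADUCEUS_VARIANTS` and `largest_variant_within` are hypothetical helpers introduced here for illustration; they are not part of the PlantCaduceus code.

```python
# Variant table copied from the list above:
# name -> (layers, hidden size, parameters in millions).
PLANTCADUCEUS_VARIANTS = {
    "PlantCaduceus_l20": (20, 384, 20),
    "PlantCaduceus_l24": (24, 512, 40),
    "PlantCaduceus_l28": (28, 768, 112),
    "PlantCaduceus_l32": (32, 1024, 225),
}


def largest_variant_within(max_params_millions):
    """Return the largest listed variant that fits the budget, or None if none fits.

    Hypothetical helper: "largest" means the highest parameter count
    that does not exceed max_params_millions.
    """
    candidates = [
        (params, name)
        for name, (_layers, _hidden, params) in PLANTCADUCEUS_VARIANTS.items()
        if params <= max_params_millions
    ]
    return max(candidates)[1] if candidates else None


print(largest_variant_within(50))  # PlantCaduceus_l24
```

Parameter count is only a proxy for memory and compute cost, so treat the result as a starting point rather than a recommendation.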

How to use

import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer
model_path = 'maize-genetics/PlantCaduceus_l24'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence into input IDs
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False,
)
input_ids = encoding["input_ids"].to(device)

# Run a forward pass without gradient tracking and request per-layer hidden states
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
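
One possible way to turn the hidden states requested above into a single vector per sequence is mean pooling over token positions. The sketch below is a minimal illustration under stated assumptions, not the authors' method: it assumes `outputs.hidden_states` is a sequence of tensors shaped `(batch, seq_len, hidden_dim)` with the final layer last, and it averages over every position, including any special tokens the tokenizer adds. The helper `mean_pool_last_layer` is hypothetical, and dummy tensors with arbitrary sizes stand in for real model output so the example runs without downloading the model.

```python
import torch


def mean_pool_last_layer(hidden_states):
    """Average the final layer's per-token vectors into one vector per sequence.

    Assumes each element of hidden_states has shape (batch, seq_len, hidden_dim).
    """
    last_layer = hidden_states[-1]
    # Collapse the sequence dimension, leaving (batch, hidden_dim).
    return last_layer.mean(dim=1)


# Placeholder for outputs.hidden_states; the sizes (1, 16, 8) are arbitrary
# and do not reflect the real model's dimensions.
dummy_hidden_states = (torch.zeros(1, 16, 8), torch.ones(1, 16, 8))

sequence_embedding = mean_pool_last_layer(dummy_hidden_states)
print(sequence_embedding.shape)  # torch.Size([1, 8])
```

With the real model, the call would be `mean_pool_last_layer(outputs.hidden_states)`, provided the assumption about the output layout holds.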