---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Built on the Caduceus architecture with a masked language modeling objective, PlantCaduceus is pre-trained on genomic sequences from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **PlantCaduceus_l20**: 20 layers, 384 hidden size, 20M parameters
- **PlantCaduceus_l24**: 24 layers, 512 hidden size, 40M parameters
- **PlantCaduceus_l28**: 28 layers, 768 hidden size, 112M parameters
- **PlantCaduceus_l32**: 32 layers, 1024 hidden size, 225M parameters

## How to use

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the model and tokenizer, moving the model to GPU if one is available
model_path = 'maize-genetics/PlantCaduceus_l24'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence and run a forward pass
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
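The hidden states returned above can be pooled into a fixed-length embedding for downstream tasks. A minimal sketch of mean pooling, using a random tensor as a stand-in for `outputs.hidden_states[-1]` so it runs without downloading the model; the hidden size of 512 matches PlantCaduceus_l24:

```python
import torch

# Stand-in for outputs.hidden_states[-1], shaped (batch, seq_len, hidden_size).
# With the real model, use: hidden_states = outputs.hidden_states[-1]
hidden_states = torch.randn(1, 16, 512)

# Mean-pool over the sequence (token) dimension to get one vector per sequence
embedding = hidden_states.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 512])
```

The resulting vector can be fed to a lightweight classifier or used for similarity search; other pooling choices (e.g. taking a single token's hidden state) are equally valid depending on the task.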