Get started πŸš€

Installation

From source

Currently, installing from source is the only way to install the package; a pip-installable release will be published soon. To install the package from source, run the following command:

pip install git+https://github.com/goodarzilab/cdsFM.git
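
Once the installation finishes, you can verify it with a minimal import check (a sketch only: it confirms the package is importable, while model weights are typically fetched later when you load a checkpoint with from_pretrained):

from cdsFM import AutoEnCodon, AutoDeCodon  # wrapper classes used in the examples below
print("cdsFM import OK")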

Applications

Now that you have cdsFM installed, you can use the AutoEnCodon and AutoDeCodon classes, which serve as wrappers around the pre-trained models. Here are some examples of how to use them:

Sequence Embedding Extraction with EnCodon

The following is an example of how to use the EnCodon model to extract sequence embeddings:

from cdsFM import AutoEnCodon

# Load your dataframe containing sequences
seqs = ...

# Load a pre-trained EnCodon model
model = AutoEnCodon.from_pretrained("goodarzilab/encodon-620M")

# Extract embeddings
embeddings = model.get_embeddings(seqs, batch_size=32)
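
As a follow-up, here is a minimal sketch of using the extracted embeddings, assuming get_embeddings returns one embedding vector per input sequence in a NumPy-compatible array:

import numpy as np

# Stack per-sequence embeddings into an array of shape (num_sequences, hidden_dim)
emb = np.asarray(embeddings)

# Cosine similarity between the first two sequences (illustrative only)
a, b = emb[0], emb[1]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity between sequences 0 and 1: {cosine:.3f}")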

Sequence Generation with DeCodon

You can generate organism-specific coding sequences with DeCodon as follows:

from cdsFM import AutoDeCodon

# Load a pre-trained DeCodon model
model = AutoDeCodon.from_pretrained("goodarzilab/DeCodon-200M")

# Generate!
gen_seqs = model.generate(
    taxid=9606, # NCBI Taxonomy ID for Homo sapiens
    num_return_sequences=32, # Number of sequences to return
    max_length=1024, # Maximum length of the generated sequence
    batch_size=8, # Batch size for generation
)
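
To keep the generated sequences, a minimal sketch, assuming gen_seqs is a list of nucleotide strings, is to write them to a FASTA file:

# Write generated sequences to a FASTA file (assumes gen_seqs is a list of strings)
with open("decodon_generated.fasta", "w") as fasta:
    for i, seq in enumerate(gen_seqs):
        fasta.write(f">decodon_taxid9606_seq{i}\n{seq}\n")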

Tokenization

EnCodon and DeCodon are pre-trained on coding sequences of up to 2048 codons (i.e., 6144 nucleotides), a limit that includes the <CLS> token automatically prepended to the beginning of the sequence and the <SEP> token appended at the end. The tokenizer's vocabulary consists of 64 codons and 5 special tokens, namely <CLS>, <SEP>, <PAD>, <MASK>, and <UNK>.
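
To make the codon-level scheme concrete, the snippet below is a plain-Python illustration of how a coding sequence is split into codon tokens and framed by the special tokens; it is only a sketch of the scheme described above, not the package's actual tokenizer API:

def to_codon_tokens(cds: str, max_len: int = 2048) -> list[str]:
    # Split the nucleotide sequence into non-overlapping triplets (codons)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # The 2048-token limit includes the <CLS> and <SEP> special tokens
    return ["<CLS>"] + codons[: max_len - 2] + ["<SEP>"]

print(to_codon_tokens("ATGGCTTAA"))  # ['<CLS>', 'ATG', 'GCT', 'TAA', '<SEP>']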


HuggingFace πŸ€—

A collection of pre-trained EnCodon & DeCodon checkpoints is available on HuggingFace πŸ€—. The following table lists the available models:

| Model | Name | Num. params | Description | Weights |
|---|---|---|---|---|
| EnCodon | encodon-80M | 80M | Pre-trained checkpoint | πŸ€— |
| EnCodon | encodon-80M-euk | 80M | Eukaryotic-expert | πŸ€— |
| EnCodon | encodon-620M | 620M | Pre-trained checkpoint | πŸ€— |
| EnCodon | encodon-620M-euk | 620M | Eukaryotic-expert | πŸ€— |
| DeCodon | decodon-200M | 200M | Pre-trained checkpoint | πŸ€— |
| DeCodon | decodon-200M-euk | 200M | Eukaryotic-expert | πŸ€— |
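
Any checkpoint above can be loaded by passing its repository id to the corresponding wrapper class. For example (the repository ids below are assumed to follow the goodarzilab/<model name> pattern used in the table):

from cdsFM import AutoEnCodon, AutoDeCodon

# Eukaryotic-expert checkpoints (ids assumed from the table above)
encodon_euk = AutoEnCodon.from_pretrained("goodarzilab/encodon-620M-euk")
decodon_euk = AutoDeCodon.from_pretrained("goodarzilab/decodon-200M-euk")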

Citation

@article{Naghipourfar2024,
  title = {A Suite of Foundation Models Captures the Contextual Interplay Between Codons},
  url = {http://dx.doi.org/10.1101/2024.10.10.617568},
  DOI = {10.1101/2024.10.10.617568},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Naghipourfar, Mohsen and Chen, Siyu and Howard, Mathew and Macdonald, Christian and Saberi, Ali and Hagen, Timo and Mofrad, Mohammad and Coyote-Maestas, Willow and Goodarzi, Hani},
  year = {2024},
  month = oct
}