
PlasmidGPT (Addgene GPT-2 Compatible Version)

This is a compatibility-enhanced version of PlasmidGPT by Bin Shao (lingxusb), repackaged for direct use with the Hugging Face transformers library and Hub infrastructure.

🔬 About PlasmidGPT

PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from Addgene. It generates de novo plasmid sequences that resemble engineered plasmids while maintaining low sequence identity to the training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.

Original work: PlasmidGPT: a generative framework for plasmid design and annotation
Original repository: github.com/lingxusb/PlasmidGPT
Original model: huggingface.co/lingxusb/PlasmidGPT

Key Features

  • Novel Sequence Generation: Generates novel plasmid sequences rather than replicating training data
  • Conditional Generation: Supports generation based on user-specified starting sequences
  • Versatile Predictions: Predicts sequence-related attributes including lab of origin, species, and vector type
  • Transformer Architecture: Decoder-only transformer with 12 layers and 110 million parameters

🆚 Differences from Original

This version provides:

  • ✅ Native HuggingFace transformers compatibility (no custom loading required)
  • ✅ Standard model format (model.safetensors instead of .pt)
  • ✅ Direct AutoModel and AutoTokenizer support
  • ✅ Simplified installation and usage

📦 Installation

pip install torch transformers

🚀 Quick Start

Basic Sequence Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load model and tokenizer straight from the Hub; trust_remote_code=True
# allows any custom code shipped with the repository to be used.
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Seed the model with a starting DNA sequence; generation continues from it.
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")

Generate Multiple Sequences

# Sample several candidates in one call; top_k and top_p (nucleus sampling)
# restrict choices to the most likely tokens while temperature adds diversity.
outputs = model.generate(
    input_ids,
    max_length=500,
    num_return_sequences=5,
    temperature=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")

Extract Embeddings

# Ask the model to return hidden states from every layer.
model.config.output_hidden_states = True

with torch.no_grad():
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    outputs = model(input_ids)
    # Mean-pool the last layer over the token dimension to get one
    # fixed-length vector per sequence.
    hidden_states = outputs.hidden_states[-1]
    embedding = hidden_states.mean(dim=1).cpu().numpy()

print(f"Embedding shape: {embedding.shape}")

🎯 Use Cases

  • Plasmid Design: Generate novel plasmid sequences for synthetic biology applications
  • Sequence Analysis: Extract meaningful embeddings for downstream ML tasks
  • Feature Prediction: Predict properties like lab of origin, species, or vector type
  • Conditional Generation: Create sequences starting from specific promoters or genes (see the sketch after this list)
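
As a concrete conditional-generation example, the sketch below seeds the model with the canonical T7 promoter and lets it continue; the model and tokenizer objects come from the Quick Start, and the generation settings mirror it:

# Seed generation with the T7 promoter so the continuation is conditioned
# on a biologically meaningful starting element.
t7_promoter = "TAATACGACTCACTATAG"
promoter_ids = tokenizer.encode(t7_promoter, return_tensors='pt').to(device)

conditioned = model.generate(
    promoter_ids,
    max_length=300,
    do_sample=True,
    temperature=1.0,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(conditioned[0], skip_special_tokens=True))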

📊 Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 110 million |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Context length | 2048 tokens |
| Vocabulary size | 30,002 |
| Training data | 153k Addgene plasmid sequences |
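
To verify these values against the downloaded checkpoint, you can inspect the config directly; the attribute names below are the standard transformers GPT2Config fields:

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)
# Standard GPT-2 config attributes; expected values per the table above
print(config.n_layer, config.n_head, config.n_embd)   # 12, 12, 768
print(config.n_positions, config.vocab_size)          # 2048, 30002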

📚 Citation

If you use this model, please cite the original PlasmidGPT paper:

@article{shao2024plasmidgpt,
  title={PlasmidGPT: a generative framework for plasmid design and annotation},
  author={Shao, Bin and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.09.30.615762},
  url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}

📄 License

This model inherits the license from the original PlasmidGPT repository. Please refer to the original repository for licensing details.

🙏 Credits

Original Author: Bin Shao (lingxusb)
Original Work: PlasmidGPT GitHub Repository
Paper: bioRxiv 2024.09.30.615762

This compatibility version was created to facilitate easier integration with modern ML workflows while preserving all capabilities of the original model.

🔗 Related Resources

  • Original repository: github.com/lingxusb/PlasmidGPT
  • Original model: huggingface.co/lingxusb/PlasmidGPT
  • Paper: bioRxiv 2024.09.30.615762 (https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)

⚠️ Notes

  • The model generates DNA sequences for research purposes only
  • Generated sequences should be validated before experimental use
  • The model was trained on Addgene plasmids and performs best on similar sequence types
  • For prediction tasks (lab of origin, species, vector type), refer to the original repository for trained prediction model weights; a minimal do-it-yourself recipe is sketched after this list
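
The original repository ships trained prediction heads; the sketch below only illustrates the general recipe of fitting a simple classifier on PlasmidGPT embeddings. It reuses the embed() helper from the embeddings example, requires scikit-learn (pip install scikit-learn), and the sequences and labels are placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data; replace with your own annotated plasmids.
sequences = ["ATGGCTAGCGAATTCGGCGCGCCT", "ATGAAAGCACTGATTCTGGTGGTA"]
labels = ["vector_type_a", "vector_type_b"]

X = np.stack([embed(s) for s in sequences])  # embed() from the embeddings example
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:1]))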