PromoGen2 Model for Prokaryotic Promoter Sequence Generation

PromoGen2 is a specialized language model developed for generating and scoring prokaryotic promoter sequences. The model is particularly suitable for species with limited experimentally verified data. This model card provides guidance on loading the model, generating sequences, and scoring them using a custom scoring function.

Model Details

Model type: Transformer-based language model (GPT-2 architecture)
Primary use case: Generating and scoring species-specific promoter sequences
Tags: Prokaryotic promoters, sequence generation, synthetic biology

Installation

Ensure the required packages are installed:

pip install torch transformers[torch] biopython datasets pandas numpy scipy seaborn matplotlib jupyter notebook

Loading the Model and Tokenizer

To get started, load the model and tokenizer with Hugging Face's transformers library.

from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline
import torch

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("jinyuan22/promogen2-xsmall")
tokenizer = AutoTokenizer.from_pretrained("jinyuan22/promogen2-xsmall")

# Set device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline("text-generation", model=model, device=device, tokenizer=tokenizer)

Generating Sequences

Use the text-generation pipeline to generate sequences from a prompt (txt). Sampling is controlled by the temperature, repetition penalty, and top-p parameters. num_return_sequences sets the total number of sequences to produce, and batch_size sets how many are generated per pipeline call.

# Define input text and generation parameters
txt = "<|bos|>5"
num_return_sequences = 5
batch_size = 2
max_new_tokens = 50
repetition_penalty = 1.2
top_p = 0.9
temperature = 0.7
do_sample = True

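# Generate sequences in batches until num_return_sequences have been produced
all_outputs = []
for i in range(0, num_return_sequences, batch_size):
    # The last batch may be smaller so the total matches num_return_sequences
    n = min(batch_size, num_return_sequences - i)
    outputs = pipe(
        txt,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        temperature=temperature,
        do_sample=do_sample
    )
    all_outputs.extend(outputs)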
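
When num_return_sequences is not a multiple of batch_size, the final batch has to be smaller for the total to match the requested count. The following is a minimal sketch of that arithmetic, which can be run without loading the model. The helper name batch_sizes is hypothetical and is not part of PromoGen2 or transformers.

```python
def batch_sizes(total, batch_size):
    """Return how many sequences to request in each generation call.

    Hypothetical helper for illustration only.
    """
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    # Every batch is full except possibly the last one
    return [min(batch_size, total - start) for start in range(0, total, batch_size)]


sizes = batch_sizes(5, 2)
print(sizes, "-> total", sum(sizes))
```

With the values used in this card (5 sequences, batches of 2), the plan is two full batches followed by a single sequence.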

Scoring Generated Sequences

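A custom scoring function (score) evaluates each generated sequence. It wraps the sequence in the prompt format with the provided tag (or without one if tag is "none") and returns the model's mean per-token cross-entropy loss, that is, the average negative log-likelihood of the sequence. Lower scores mean the model considers the sequence more likely.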

@torch.no_grad()
def score(seq, tag="none"):
    # Format input with specified tag
    if tag == "none":
        inputs = tokenizer(f"<|bos|>5{seq}3<|eos|>", return_tensors="pt")
    else:
        inputs = tokenizer(f"<|bos|>{tag}5{seq}3{tag}<|eos|>", return_tensors="pt")
    # Move tensors to the same device as the model (the pipeline above already moved the model)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    # With labels=input_ids the model returns the mean per-token cross-entropy loss
    pred = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    return pred['loss'].item()
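
Because the value returned by score is a mean per-token cross-entropy loss, it can be converted to a perplexity, which is sometimes easier to compare across sequences. The following is a minimal sketch, assuming the loss is expressed in nats (as PyTorch's cross-entropy is). The helper name loss_to_perplexity is hypothetical, and the example losses are made-up numbers for illustration, not model output.

```python
import math


def loss_to_perplexity(loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


# Made-up losses for two hypothetical sequences (not real model output)
example_losses = {"seq_a": 1.2, "seq_b": 0.8}

# Lower loss (and lower perplexity) means the model finds the sequence more likely
ranked = sorted(example_losses, key=example_losses.get)
for name in ranked:
    print(name, round(loss_to_perplexity(example_losses[name]), 3))
```

Ranking by loss and ranking by perplexity give the same order, since the exponential is monotonic.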

Post-processing and Saving Outputs

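The generated sequences are cleaned of special tokens, the 5 and 3 markers, and the tag, and then scored using the score function. Each sequence and its score are written to output.txt in a FASTA-style format, with the index and score in the header line.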

# Post-process generated sequences: strip special tokens, the 5/3 markers, and the tag
tag = "none"
strip_tokens = ["<|bos|>", "<|eos|>", "5", "3", tag]
seqs = []
for output in all_outputs:
    seq = output["generated_text"]
    for token in strip_tokens:
        seq = seq.replace(token, "")
    seqs.append(seq)
scores = [score(seq, tag) for seq in seqs]

# Save sequences and scores (seq_score avoids shadowing the score function)
with open("output.txt", "w") as f:
    for i, (seq, seq_score) in enumerate(zip(seqs, scores)):
        f.write(f">{i}|score={seq_score}\n{seq}\n")
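
The saved file can be read back and ranked for downstream use. The following is a minimal sketch, assuming the header layout written above (">index|score=value" followed by the sequence). The function names read_scored_fasta and rank_by_score are hypothetical, and the demo records are placeholders rather than model output.

```python
import os
import tempfile


def read_scored_fasta(path):
    """Parse records written as '>index|score=value' followed by sequence line(s)."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                index, score_field = line[1:].split("|", 1)
                value = float(score_field.split("=", 1)[1])
                records.append([int(index), value, ""])
            elif records:
                # Append sequence text to the most recent header
                records[-1][2] += line.strip()
    return [tuple(record) for record in records]


def rank_by_score(records):
    """Sort records so the lowest loss (most likely sequence) comes first."""
    return sorted(records, key=lambda record: record[1])


# Demo with placeholder records written to a temporary file
demo = ">0|score=1.5\nACGT\n>1|score=0.9\nTTGACA\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output.txt")
    with open(demo_path, "w") as f:
        f.write(demo)
    records = rank_by_score(read_scored_fasta(demo_path))

for index, value, seq in records:
    print(index, value, seq)
```

Sorting in ascending order puts the sequences the model considers most likely first, since the score is a loss.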

Example Parameters

txt: Prompt string that starts generation (here "<|bos|>5")
tag: Tag to define the context or label for generation and scoring ("none" if no specific tag is used)
num_return_sequences: Total number of sequences to generate
batch_size: Number of sequences generated per pipeline call
max_new_tokens: Maximum number of tokens generated per sequence, not counting the prompt
repetition_penalty: Penalty applied to tokens that have already appeared, to discourage repetition (1.0 means no penalty)
top_p: Cumulative probability threshold for nucleus sampling
temperature: Sampling temperature; lower values make output more deterministic, higher values more diverse
do_sample: Set to True for sampling-based generation; top_p and temperature only take effect when sampling
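
To make the effect of temperature and top_p concrete, the following is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy four-token vocabulary. It illustrates the idea only and is not the transformers implementation, which operates on tensors and handles edge cases differently. All function names and the toy logits are assumptions made for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the most probable tokens until their total reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index after temperature scaling and nucleus filtering."""
    filtered = nucleus_filter(softmax_with_temperature(logits, temperature), top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.7))
print(sample_token(toy_logits, rng=random.Random(0)))
```

Lowering temperature concentrates probability on the top-scoring token, and lowering top_p shrinks the set of tokens that can be sampled at all.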

Usage Notes

The code above uses a CUDA GPU when one is available and falls back to CPU otherwise; the model and all input tensors must be on the same device.
This setup supports sequence generation tasks tailored to synthetic biology, particularly for organisms lacking experimentally verified promoter data.