This model uses the Block Diffusion architecture (code, paper) and AgroNT's 6-nucleotide tokens (4^6 = 4,096 possible 6-mers).
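The 4^6 = 4,096 figure can be checked with a short sketch that enumerates every 6-mer over A, C, G, T and splits a sequence into non-overlapping 6-nucleotide chunks. This is only an illustration: the token IDs and the `split_into_kmers` helper are hypothetical, and the real AgroNT tokenizer also carries special tokens and fallback tokens that are not modelled here.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
K = 6

# Every possible 6-mer over A, C, G, T: 4**6 = 4096 of them.
KMERS = ["".join(p) for p in product(NUCLEOTIDES, repeat=K)]

# Hypothetical IDs for illustration; NOT the real vocabulary order.
KMER_TO_ID = {kmer: i for i, kmer in enumerate(KMERS)}


def split_into_kmers(sequence, k=K):
    """Split a DNA string into non-overlapping k-mers.

    Returns (kmers, leftover), where leftover holds the trailing
    nucleotides that do not fill a complete k-mer.
    """
    sequence = sequence.upper()
    cut = len(sequence) - len(sequence) % k
    kmers = [sequence[i:i + k] for i in range(0, cut, k)]
    return kmers, sequence[cut:]


kmers, leftover = split_into_kmers("AAATGGTTATTGCAAATCTC")
print(len(KMERS))  # 4096
print(kmers)       # ['AAATGG', 'TTATTG', 'CAAATC']
print(leftover)    # TC
```

The leftover nucleotides are what a tokenizer would otherwise have to cover with shorter tokens.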
Starting from the dna-blockdiff-2 weights, the model was trained on the papaya genome for one epoch. Output tokens are restricted so that the model never emits single-nucleotide tokens or N (unknown nucleotide) tokens.
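One common way to restrict output tokens is to set the logits of disallowed tokens to negative infinity before sampling. The sketch below shows that idea on a toy vocabulary. It is an assumption about how such a restriction can be done, not a description of this model's code; `mask_disallowed`, the vocabulary, and the logits are all hypothetical. Note that this toy rule would also block special tokens such as `<cls>`, which a real setup would need to handle separately.

```python
import math


def mask_disallowed(logits, vocab, k=6):
    """Return a copy of logits with -inf wherever the token is not a full k-mer over ACGT."""
    masked = []
    for token, logit in zip(vocab, logits):
        is_full_kmer = len(token) == k and set(token) <= set("ACGT")
        masked.append(logit if is_full_kmer else -math.inf)
    return masked


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary (hypothetical): two 6-mers, a single-nucleotide token, and N.
vocab = ["AAATGG", "TTATTG", "A", "N"]
logits = [1.0, 0.5, 3.0, 2.0]

probs = softmax(mask_disallowed(logits, vocab))
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

After masking, the single-nucleotide and N tokens get probability zero even though their raw logits were the highest, and the remaining probability mass is shared among the 6-mer tokens.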
Training loss fluctuated, but the validation curve (on the human genome) improved consistently.
from transformers import AutoModelForMaskedLM
m = AutoModelForMaskedLM.from_pretrained(
"monsoon-nlp/dna-blockdiff-papaya",
trust_remote_code=True,
)
cd bd3lms && python -u main.py \
loader.eval_batch_size=1 \
model=small \
algo=bd3lm \
algo.T=5000 \
algo.backbone=hf_dit \
data=instadeep \
model.length=256 \
block_size=4 \
wandb=null \
mode=ppl_eval \
eval.checkpoint_path="monsoon-nlp/dna-blockdiff-papaya" \
model.attn_backend=sdpa \
sampling.nucleus_p=0.9 \
sampling.kv_cache=true \
sampling.logdir=$PWD/sample_logs/samples_genlen_bd3lm_blocksize4 \
data.tokenizer_name_or_path="monsoon-nlp/dna-blockdiff-papaya"
Use this fork of the code, and avoid changing the parameters much from the ones shown here.
cd bd3lms && python -u main.py \
loader.eval_batch_size=1 \
model=small \
algo=bd3lm \
algo.T=5000 \
algo.backbone=hf_dit \
data=instadeep \
model.length=256 \
block_size=4 \
wandb=null \
mode=sample_eval \
eval.checkpoint_path="monsoon-nlp/dna-blockdiff-papaya" \
model.attn_backend=sdpa \
sampling.nucleus_p=0.9 \
sampling.kv_cache=true \
sampling.logdir=$PWD/sample_logs/samples_genlen_bd3lm_blocksize4 \
data.tokenizer_name_or_path="monsoon-nlp/dna-blockdiff-papaya"
100% 64/64 [00:07<00:00, 8.73it/s]
Sliding Window Gen PPL: 100% 1/1 [00:00<00:00, 4.26it/s]
Text samples: ['<cls> AAATGG TTATTG CAAATC TCTAAA GAAGTA TTAAGA GAATGA TAAGAT ATGTTG AGAGAA TTACAC AGCATT GAGAAG TCTAAA TTGAAA AACCAT AAAAAT GTGAGT AGGTCA GTATGT AAGAAT TGTGTT GAACTT ATCAAT ATGTAG ACATCA TTTTGA TATAAA TATATA AAGAAA ATTTAA AAAAAA TAATAA ATAACT TTAAAA TGTTAA TAATAT TAAAAT GGAGAA GAATAA CCTTTA TTATCT ATTACA ATAATA ATTATA TTTTGG ATGAAA CATTCA GAATAT TAGATA ATTTTT ATTAAT GTATCT TCAAAT GAACAA ACTTAT ATTTAA AAACTC TAAAAT ATTTAT AGACTA AAAACT AGAGAA ATTAAT AATAAA AATAAA AAACAC AAATTT ATAAAA CCAAAT AAAGGT AATAAA AACAAA ATATTT ACAAAT AACTAT TAATGA AGTTAA AAAATG AATAAA TTTATA ATAAAA TATTTA TGTTTT AAATTA AAAATT TGAATA AAACTC ACAAAT TATTTA AATACT AATATG TATTTA TATAAT AATATA TGAAAA AATTAT GAATTT TAATTA AAATTT TTATAT TTATAA AAATTT ATATTA ATTAAT TTTTAA CAACTT AAATAA AAAGGA ATATTA AAGTCA ATAATT ATATAT TACTTA TAGACA AATAAA AAAATT CTCAAT AAAATT TAAAAT ATTAAA ATTTTG AAATTA AAATAA AAATAT AATAAT TCACTT CACACA ATACAA CTAACT TATACA ATTAAT TTAAAA GATTAA TTGAAT AAAATT ATTATC ACATGA AATTGG AATAAA CAAAAT AATATA TAAATA TATCAA AAATTG ATATAT GAAAAT CTTTAT GTGAAA TTTTAA GAAATA AATTTA ATATGC TGTTTT AAATTT TTTAAA TTTATT AAATTA AATTAA TATTAA ATTTTA ATAATA AAAATT TATAAT AATTAA TAATTT ATTAGC TTAAAA TTAAAT ATTTTA ATGTAA AAACTA TAATGC AATTTA AAGATT TTTTTA AATTAT ATAAGT TAATAA CTATAA TAATAC ATTTCT TTAATT AAAGAA GAAATT TTAAAT TTAAAT TTTTAA GTTAGA ATTACA TTAAAA TATAAA TATAAT AAATAA TAATTA TTAAAA TATACT AAATAG TTTATT AATTAT ATACTT AATATA ATATTT AATATT ATTATA AAAAAT AATCAT ATATAT ATAATT TTTTTT CTTTTT AACTTA TAAATT AATCAG TTATGA TACTTT ATAAAT ATTTGT TAATGG TGAATG AATATG CTTGAA AAGAAC AAAGAA GAAATT AAGAGA ACTTGA ATTTGG TGGTTA ATAAAT CTAATT ATATAT ATTATA TAAAAA TAGGAA TAATTT GAAAAT TAATAG AAAAGA AAAAGA ATAATT TTATGC TTCTTT ATATAA TTTAAC AAATAT TTTTTT ATAATA ATAATA TAATTA AACTTA AATTAT ATTATA TTCATC ATTATA']
Generative perplexity: tensor(19.5559, device='cuda:0')
Entropy: tensor(3.4023, device='cuda:0')
The script measures the perplexity of the sampled sequence under gpt2-large, but that is not a useful way to evaluate the accuracy of a DNA sequence.
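As a simple DNA-specific sanity check, the sampled text can be joined into one nucleotide string and its base composition computed. This is a minimal sketch, not the author's evaluation method; the helper names are hypothetical, and the sample used here is just the first four tokens of the output above.

```python
from collections import Counter


def sample_to_sequence(sample_text):
    """Drop special tokens such as <cls> and join the remaining tokens into one DNA string."""
    tokens = [
        t for t in sample_text.split()
        if not (t.startswith("<") and t.endswith(">"))
    ]
    return "".join(tokens)


def base_fractions(sequence):
    """Return the fraction of each of A, C, G, T in the sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    return {base: counts.get(base, 0) / len(sequence) for base in "ACGT"}


sample = "<cls> AAATGG TTATTG CAAATC TCTAAA"
sequence = sample_to_sequence(sample)
fractions = base_fractions(sequence)
gc_content = fractions["G"] + fractions["C"]

print(sequence)    # AAATGGTTATTGCAAATCTCTAAA
print(gc_content)  # 0.25
```

Comparing these fractions against a reference genome would be a separate step and is not shown here.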
Base model
monsoon-nlp/dna-blockdiff-2