
enigma-1.5b

Model Details

It's a 2.5b model trained on ~1 billion individual letters of DNA, similar to training a text model at the per-character level instead of the sub-word level. It has its own tokenizer, which sits somewhere between a char-level and a BPE tokenizer.
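
For intuition, per-character DNA tokenization looks roughly like the sketch below. This is not the actual PerCharTokenizer from this repo; the vocabulary and the handling of unknown characters are assumptions.

# Minimal sketch of per-character DNA tokenization (illustrative only;
# the real PerCharTokenizer in this repo may differ).
class CharDNATokenizer:
    def __init__(self):
        # assumed vocabulary: the four bases plus a catch-all token at index 0
        self.chars = ["N", "A", "T", "G", "C"]
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, seq):
        return [self.stoi.get(ch, 0) for ch in seq]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharDNATokenizer()
print(tok.encode("ATGC"))  # [1, 2, 3, 4]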

EnBERT, the decoder-only model, is trained on lots of DNA sequences tokenized with a k-mer tokenizer trained specifically for this purpose, which means it has a larger vocab size than enigma-2.5b. The same architecture is used to train a 430m model that is per-character based like the 2.5b model, but performs better than it.
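
For comparison, k-mer tokenization splits a sequence into windows of k bases, which is why the vocabulary grows to 4^k tokens. A minimal sketch (the choice of k=3, non-overlapping windows, and the vocabulary construction are all assumptions, not the actual EnBERT tokenizer):

# Minimal k-mer tokenization sketch (illustrative; the trained k-mer
# tokenizer used for EnBERT may build its vocabulary differently).
from itertools import product

def kmer_vocab(k=3):
    # every possible k-mer over the four bases -> 4^k entries
    return {"".join(p): i for i, p in enumerate(product("ATGC", repeat=k))}

def kmer_encode(seq, vocab, k=3):
    # non-overlapping windows of k bases
    return [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, k)]

vocab = kmer_vocab(3)                # 64 tokens
print(kmer_encode("ATGCGT", vocab))  # two 3-mer tokens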

Uses

Can be used to generate new DNA sequences from a given input of tokens, or used for further research. It's very basic in nature for now; I'll add more functionality later, including DNA classification, masked-token generation, etc., and maybe even an MoE technique in the future.

Direct Use

Load the model and it can then be used to generate new sequences, with max_length=512 for the 2.5b model and 256 for the enigma-430m model.

Bias, Risks, and Limitations

This model was trained on only around ~500 MB of DNA data, and at the per-character level rather than the sub-word or sequence level used in language models. That gives it more precision, but it is limited by the amount of training. I wasn't able to train it on other datasets for better generalization because of my technical limits: lack of GPUs and good hardware.

How to Get Started with the Model

Use the code below to get started with the model.

# Load the model directly from the Hugging Face Hub
from transformers import AutoModel
model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")

# Or generate from the model using the repo's own code
import torch
from model import Transformer
from tokenizer import PerCharTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
token = PerCharTokenizer()
model = Transformer(vocab_size=vocab_size)  # vocab_size must match the tokenizer's vocabulary

input_seq = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
token_input = token.encode(input_seq)
context = torch.tensor([token_input], dtype=torch.long, device=device)
generated_output = token.decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated_output)

Training Details

Training Data

Data was taken from this dataset: human_ref_data. Eight ~500 MB files were consolidated into one big dataset. I've also uploaded the data used for training.
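
The consolidation step amounts to concatenating the raw files; a minimal sketch (the directory and file names here are placeholders, not the actual dataset layout):

# Sketch: concatenate the raw files into one big training file
# (paths are placeholders; adjust to the actual human_ref_data layout).
import glob

with open("train.txt", "w") as out:
    for path in sorted(glob.glob("human_ref_data/*.txt")):
        with open(path) as f:
            out.write(f.read())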

Training Procedure

These models were trained for 3k-4k iterations each, on ~500 million letters of DNA, roughly 500 MB of data. Final losses were around ~0.02 for the 47-million-parameter model and ~0.003 for the 2.5-billion-parameter model. I had saved a lot more data than this, but couldn't train further due to technical limitations. Try to train it yourself if possible; the enigma/TrainEnigma file contains all the functions needed to train it from scratch or pre-train it.

Functions:

This used a basic training procedure: get_batch() generates batches of data, estimate_loss() estimates the losses, and train() is the master function that calls the others every iteration or at set intervals. A sketch of how these fit together is shown below.
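
A minimal sketch of how these pieces typically fit together, assuming a nanoGPT-style model whose forward pass returns (logits, loss) and using the keys from config-enigma.json; see enigma/TrainEnigma for the actual implementation.

# Sketch of the training loop described above (illustrative).
import torch

def get_batch(data, block_size, batch_size, device):
    # sample random contiguous chunks of the tokenized DNA tensor
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model, data, cfg, device):
    model.eval()
    losses = torch.zeros(cfg["eval_iters"])
    for i in range(cfg["eval_iters"]):
        x, y = get_batch(data, cfg["block_size"], cfg["batch_size"], device)
        _, loss = model(x, y)   # assumes forward returns (logits, loss)
        losses[i] = loss.item()
    model.train()
    return losses.mean().item()

def train(model, data, cfg, device):
    opt = torch.optim.AdamW(model.parameters(), lr=cfg["learning_rate"])
    for it in range(cfg["max_iters"]):
        if it % cfg["eval_interval"] == 0:
            print(f"iter {it}: loss {estimate_loss(model, data, cfg, device):.4f}")
        x, y = get_batch(data, cfg["block_size"], cfg["batch_size"], device)
        _, loss = model(x, y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()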

Training Hyperparameters

Configurations are saved in the enigma/config-enigma.json file. These settings are for the 2.5b model.

{
  "batch_size": 10,
  "block_size": 512,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 384,
  "n_head": 12,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
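
For reference, the config can be loaded and unpacked like this (a minimal sketch; the keys are the ones in the JSON above, the path follows the repo layout):

# Load hyperparameters from the config file shown above.
import json

with open("enigma/config-enigma.json") as f:
    params = json.load(f)

batch_size = params["batch_size"]  # 10
block_size = params["block_size"]  # 512, also the generation max_length
d_model = params["d_model"]        # 384
n_head = params["n_head"]          # 12
n_layer = params["n_layer"]        # 12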

Model Architecture and Objective

EnBERT is a 47-million-parameter model that follows the BERT architecture, with one additional masked self-attention layer to predict next tokens. Enigma-2.5b is a full encoder-decoder transformer with a fairly complex architecture.

[architecture diagram]

Encoder Part:


It consists of two different layers, each followed by its own normalization and dropout layers. Input embeddings along with positional embeddings are fed to the encoder block:

Self Attention:
  • Each head of the self-attention layer is similar to the one used in grokAI. The Key and Query matrices have biases, whereas the Value matrix doesn't.
  • After applying torch.matmul() to the Key and Query matrices, relative positional embeddings are applied to the attention matrix.
  • The attention and value matrices are then multiplied using torch.matmul().
  • The multi-head attention layer then concatenates all the outputs together and passes them through a linear layer (see the sketch after this list).
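
A single head along these lines might look like the sketch below. The learned additive bias standing in for the relative positional embeddings, and the head_size/block_size parameters, are assumptions rather than the exact implementation.

# Sketch of one self-attention head as described above (illustrative).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    def __init__(self, d_model, head_size, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(d_model, head_size, bias=True)     # Key has a bias
        self.query = nn.Linear(d_model, head_size, bias=True)   # Query has a bias
        self.value = nn.Linear(d_model, head_size, bias=False)  # Value does not
        # assumed form of the relative positional term: a learned bias
        # added to the attention matrix after Q @ K^T
        self.rel_pos = nn.Parameter(torch.zeros(block_size, block_size))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, _ = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        att = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (B, T, T)
        att = att + self.rel_pos[:T, :T]                       # relative positional term
        att = self.dropout(F.softmax(att, dim=-1))
        return att @ v                                         # weight the value matrix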

FeedForward:

  • The normalized outputs are then passed to a position-wise feed-forward layer with an expansion_factor of 5.
  • GELU is used as the activation function, along with two linear layers: one for the input projection and one for the output projection.
  • Finally, dropout is applied, and the resulting outputs carry deep global contextual information about the input tokens (a sketch follows this list).
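
A position-wise feed-forward block matching this description could be sketched as follows (expansion_factor=5 and GELU come from the text above; everything else is an assumption):

# Sketch of the position-wise feed-forward layer described above.
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, dropout, expansion_factor=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion_factor * d_model),  # input projection
            nn.GELU(),                                       # GELU activation
            nn.Linear(expansion_factor * d_model, d_model),  # output projection
            nn.Dropout(dropout),                             # final dropout
        )

    def forward(self, x):
        return self.net(x)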

Decoder Part:


It consists of three different layers:

Masked Attention:
  • This layer is similar to the self-attention implemented in the encoder part, except it has a triangular mask that prevents tokens from attending to the context of future tokens.
  • The rest is the same: relative positional embeddings are applied in the same way, but to the masked attention matrix this time.
  • The attention and value matrices are then multiplied using torch.matmul().
  • The multi-head attention layer then concatenates all the outputs together and passes them through a linear layer (the masking step is sketched after this list).
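
The triangular mask itself can be illustrated in a few lines (only this masking step differs from the encoder head sketched earlier):

# Sketch of the causal (triangular) mask used by masked attention.
import torch

T = 5
mask = torch.tril(torch.ones(T, T))              # lower-triangular matrix
att = torch.randn(T, T)                          # raw attention scores
att = att.masked_fill(mask == 0, float("-inf"))  # block attention to future tokens
att = torch.softmax(att, dim=-1)                 # each row attends only to positions <= its own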

Self-Attention:

  • Before this layer, the outputs from the encoder and the masked-attention layer are added together and then passed in.
  • It is the same as the encoder's unmasked attention layer; the Key, Query and Value matrices are created using the same technique.
  • Finally, all the outputs are normalized and passed to a dropout layer.

FeedForward:

  • The normalized outputs are then passed to a position-wise feed-forward layer with an expansion_factor of 5.
  • GELU is used as the activation function, along with two linear layers: one for the input projection and one for the output projection.
  • Finally, dropout is applied, and the resulting outputs carry deep global contextual information about the input tokens (a structural sketch of the full decoder block follows below).
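
Putting the three decoder layers together, the flow described above (masked attention, adding the encoder outputs before the unmasked self-attention, then the feed-forward) might look roughly like the sketch below. It uses the standard nn.MultiheadAttention in place of the custom heads described above, and the normalization placement is an assumption.

# Structural sketch of one decoder block as described above (illustrative).
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_head, dropout, norm_eps):
        super().__init__()
        self.masked_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                             # same shape as the encoder FFN
            nn.Linear(d_model, 5 * d_model), nn.GELU(),
            nn.Linear(5 * d_model, d_model), nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(d_model, eps=norm_eps)
        self.norm2 = nn.LayerNorm(d_model, eps=norm_eps)
        self.norm3 = nn.LayerNorm(d_model, eps=norm_eps)

    def forward(self, x, enc_out, causal_mask):
        # masked attention with the triangular mask
        masked, _ = self.masked_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + masked)
        # encoder outputs are added before the unmasked self-attention layer
        y = x + enc_out
        attended, _ = self.self_attn(y, y, y)
        y = self.norm2(y + attended)
        # position-wise feed-forward with dropout
        return self.norm3(y + self.ffn(y))

The causal_mask argument can be built with torch.nn.Transformer.generate_square_subsequent_mask(T).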