shivendrra committed
Commit c00e383
1 Parent(s): 84969ed

Update README.md

Files changed (1)
  1. README.md +15 -85
README.md CHANGED
@@ -18,8 +18,10 @@ tags:
 
 
 ## Model Details
- This is a transformer-based model trained on DNA sequence data, capable of generating new DNA sequences. It's a 2.5-billion-parameter model trained until convergence.
- It also has one more BERT-based model, with 47 million parameters, that is also capable of generating new sequences.
 ### Model Description
 
 - **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
@@ -35,7 +37,7 @@ It also has one more BERT based model that has 47million parameters, also capabl
 Can be used to generate new sequences of DNA from a given input of tokens, or for further research. Anyway, it's very basic in nature. I'll add more functionality, including classification of DNA, masked-token generation, etc. Maybe I'll even implement the MoE technique in the future.
 ### Direct Use
 
- Load the model and it can then be used to generate new sequences, with `max_length=512` for the 2.5b model and `256` for the enbert-47m model.
 
 ## Bias, Risks, and Limitations
 
@@ -55,34 +57,16 @@ model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")
 from model import Transformer
 model = Transformer(vocab_size=vocab_size)
 
- class Generator(Transformer):
-   def __init__(self, vocab_size):
-     super().__init__(vocab_size=vocab_size)
-     self.vocab_size = vocab_size
-     self.block_size = Transformer.block_size
-
-   def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
-     generated_tokens = []
-     for _ in range(max_new_tokens):
-       # crop the running sequence to the last block_size tokens
-       idx_cond = idx[:, -self.block_size:]
-       logits, _ = self(idx_cond)
-       logits = logits[:, -1, :]
-       scaled_logits = logits / temperature
-       if top_k > 0:
-         scaled_logits = self._top_k_filtering(scaled_logits, top_k)
-       probs = F.softmax(scaled_logits, dim=-1)
-       sampled_idx = torch.multinomial(probs, num_samples=1)
-       generated_tokens.append(sampled_idx.item())
-       idx = torch.cat((idx, sampled_idx), dim=1)
-     return generated_tokens
-
-   def _top_k_filtering(self, logits, top_k):
-     # keep only the top_k logits; everything below the k-th value is set to -inf before sampling
-     values, indices = torch.topk(logits, top_k, dim=-1)
-     min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
-     filtered_logits = torch.where(logits < min_value, torch.ones_like(logits) * -float('inf'), logits)
-     return filtered_logits
 ```
 
 ## Training Details
@@ -98,60 +82,6 @@ These models were trained to 3k-4k iterations, each. on ~500million letters of D
 Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or to pre-train.
 #### Functions:
 This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and `train()` acts as the master function, calling the other functions every set number of iterations.
-
- ```python
- def get_batch(split):
-     # generate a small batch of data of inputs x and targets y
-     data = train_data if split == 'train' else val_data
-     ix = torch.randint(len(data) - block_size, (batch_size,))
-     x = torch.stack([data[i:i+block_size] for i in ix])
-     y = torch.stack([data[i+1:i+block_size+1] for i in ix])
-     x, y = x.to(device), y.to(device)
-
-     return x, y
-
- @torch.no_grad()
- def estimate_loss():
-     out = {}
-     model.eval()
-     for split in ['train', 'val']:
-         losses = torch.zeros(eval_iters)
-         for k in range(eval_iters):
-             X, Y = get_batch(split)
-             logits, loss = model(X, Y)
-             losses[k] = loss.item()
-         out[split] = losses.mean()
-     model.train()
-     return out
-
- from model import Transformer
- model = Transformer(vocab_size=vocab_size)
- m = model.to(device)
-
- n_param = sum(p.numel() for p in m.parameters())/1e6
- print(f"{n_param:.2f} million")
- optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
- steps = []
- train_losses = []
- val_losses = []
-
- # simple training loop: periodically evaluate, then take one optimizer step per iteration
- for iter in range(max_iters):
-   if iter % eval_interval == 0 or iter == max_iters - 1:
-     losses = estimate_loss()
-     print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
-     steps.append(iter)
-     train_losses.append(losses['train'])
-     val_losses.append(losses['val'])
-
-   xb, yb = get_batch('train')
-   logits, loss = model(xb, yb)
-   optimizer.zero_grad(set_to_none=True)
-   loss.backward()
-   optimizer.step()
-
- torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
- ```
-
 #### Training Hyperparameters
 
 Configurations are saved in the `enigma/config-enigma.json` file. Suitable for the 2.5b model.
 
 
 
 ## Model Details
+ It's a 2.5b model trained on ~1 billion individual letters of DNA, kind of like training a text-based model at the per-character level instead of the sub-word level.
+ It also has its own tokenizer, which is an intersection between a char-level tokenizer and a BPE tokenizer.
+
+ EnBERT, i.e. the decoder-only model, is trained on lots of DNA sequences tokenized with a k-mer tokenizer trained specially for this purpose, which means it has a larger vocab size than enigma-2.5b. The same model architecture is used to train a 430m model that is per-char based, the same as the 2.5b model, but better than it.
 ### Model Description
 
 - **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
 
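For intuition, per-character tokenization of DNA can be sketched roughly as below. This is illustrative only: the vocabulary, special tokens, and method names are assumptions, and the actual `PerCharTokenizer` in the repo's `tokenizer` module may differ.

```python
# Illustrative per-character DNA tokenizer (not the repo's actual implementation).
class SimplePerCharTokenizer:
    def __init__(self):
        # assumed vocabulary: the four bases plus an unknown-base and a padding token
        self.chars = ['A', 'T', 'G', 'C', 'N', '<pad>']
        self.str_to_id = {ch: i for i, ch in enumerate(self.chars)}
        self.id_to_str = {i: ch for i, ch in enumerate(self.chars)}
        self.vocab_size = len(self.chars)

    def encode(self, seq):
        # unknown letters fall back to 'N'
        return [self.str_to_id.get(ch, self.str_to_id['N']) for ch in seq.upper()]

    def decode(self, ids):
        return ''.join(self.id_to_str[i] for i in ids)

# A k-mer tokenizer (as used for EnBERT) enumerates 4**k possible k-mers over A/T/G/C
# (e.g. 4**6 = 4096), which is why its vocab is much larger than a per-char one.
print(SimplePerCharTokenizer().encode("TGCCCT"))  # -> [1, 2, 3, 3, 3, 1]
```

A BPE-style scheme would sit between these two extremes, merging frequent base pairs into larger tokens while keeping single letters in the vocabulary.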
 Can be used to generate new sequences of DNA from a given input of tokens, or for further research. Anyway, it's very basic in nature. I'll add more functionality, including classification of DNA, masked-token generation, etc. Maybe I'll even implement the MoE technique in the future.
 ### Direct Use
 
+ Load the model and it can then be used to generate new sequences, with `max_length=512` for the 2.5b model and `256` for the enigma-430m model.
 
 ## Bias, Risks, and Limitations
 
 
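Putting the direct-use notes together, an end-to-end sketch might look like the following. The checkpoint filename, the `vocab_size` value, and the device handling are assumptions here; the `generate()` call mirrors the snippet shown in the next hunk.

```python
import torch
from model import Transformer            # repo's model definition
from tokenizer import PerCharTokenizer   # repo's tokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

token = PerCharTokenizer()
vocab_size = 6  # assumed per-char vocab size (A/T/G/C + specials); use the repo's actual value

# checkpoint name follows the torch.save(...) pattern in the training script (assumed)
model = Transformer(vocab_size=vocab_size)
model.load_state_dict(torch.load('enigma_2500m.pth', map_location=device))
model = model.to(device)
model.eval()

prompt = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
context = torch.tensor([token.encode(prompt)], dtype=torch.long, device=device)

# the 2.5b model is documented with max_length=512 (256 for the 430m model)
generated = model.generate(context, max_new_tokens=500)
print(token.decode(generated[0].tolist()))
```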
 from model import Transformer
 model = Transformer(vocab_size=vocab_size)
 
+ from tokenizer import PerCharTokenizer
+ token = PerCharTokenizer()
+
+ input = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
+ token_input = token.encode(input)
+ context = torch.tensor([token_input], dtype=torch.long, device=device)
+ generated_output = token.decode(model.generate(context, max_new_tokens=500)[0].tolist())
+ print(generated_output)
 ```
 
 ## Training Details
 
 Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or to pre-train.
 #### Functions:
 This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and `train()` acts as the master function, calling the other functions every set number of iterations.
 #### Training Hyperparameters
 
 Configurations are saved in the `enigma/config-enigma.json` file. Suitable for the 2.5b model.
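Since the hyperparameters live in `enigma/config-enigma.json`, a training or inference script can read them from there. A minimal sketch, assuming key names that match the variables used in the training code (`block_size`, `batch_size`, `learning_rate`, and so on); the actual keys in the JSON may differ:

```python
import json

# load the saved configuration (key names are assumptions; check config-enigma.json)
with open('enigma/config-enigma.json') as f:
    config = json.load(f)

block_size = config['block_size']      # context length
batch_size = config['batch_size']
learning_rate = config['learning_rate']
max_iters = config['max_iters']
eval_interval = config['eval_interval']
eval_iters = config['eval_iters']

print(f"block_size={block_size}, batch_size={batch_size}, lr={learning_rate}")
```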