shivendrra committed
Commit c00e383
1 Parent(s): 84969ed

Update README.md

Files changed (1)
  1. README.md +15 -85
README.md CHANGED
@@ -18,8 +18,10 @@ tags:
 
 
 ## Model Details
- This is a transformer-based model trained on DNA sequence data, capable of generating new DNA sequences. It's a 2.5-billion-parameter model trained until convergence.
- It also has one more BERT-based model, with 47 million parameters, that is also capable of generating new sequences.
 ### Model Description
 
 - **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
@@ -35,7 +37,7 @@ It also has one more BERT based model that has 47million parameters, also capabl
 Can be used to generate new sequences of DNA from a given input of tokens, or for further research. Anyway, it's very basic in nature. I'll add more functionality, including classification of DNA, masked-token generation, etc. Maybe I'll even implement the MoE technique in the future.
 ### Direct Use
 
- Load the model and it can then be used to generate new sequences, with `max_length=512` for the 2.5b model and `256` for the enbert-47m model.
 
 ## Bias, Risks, and Limitations
 
@@ -55,34 +57,16 @@ model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")
 from model import Transformer
 model = Transformer(vocab_size=vocab_size)
 
- class Generator(Transformer):
-   def __init__(self, vocab_size):
-     super().__init__(vocab_size=vocab_size)
-     self.vocab_size = vocab_size
-     self.block_size = Transformer.block_size
-
-   def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
-     generated_tokens = []
-     for _ in range(max_new_tokens):
-       # crop the running sequence to the last block_size tokens
-       idx_cond = idx[:, -self.block_size:]
-       logits, _ = self(idx_cond)
-       logits = logits[:, -1, :]
-       scaled_logits = logits / temperature
-       if top_k > 0:
-         scaled_logits = self._top_k_filtering(scaled_logits, top_k)
-       probs = F.softmax(scaled_logits, dim=-1)
-       sampled_idx = torch.multinomial(probs, num_samples=1)
-       generated_tokens.append(sampled_idx.item())
-       idx = torch.cat((idx, sampled_idx), dim=1)
-     return generated_tokens
-
-   def _top_k_filtering(self, logits, top_k):
-     # keep only the top_k logits; everything below the k-th value is set to -inf before sampling
-     values, indices = torch.topk(logits, top_k, dim=-1)
-     min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
-     filtered_logits = torch.where(logits < min_value, torch.ones_like(logits) * -float('inf'), logits)
-     return filtered_logits
 ```
 
 ## Training Details
@@ -98,60 +82,6 @@ These models were trained to 3k-4k iterations, each. on ~500million letters of D
 Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or to pre-train.
 #### Functions:
 This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and `train()` acts as the master function, calling the other functions every set number of iterations.
-
- ```python
- def get_batch(split):
-     # generate a small batch of data of inputs x and targets y
-     data = train_data if split == 'train' else val_data
-     ix = torch.randint(len(data) - block_size, (batch_size,))
-     x = torch.stack([data[i:i+block_size] for i in ix])
-     y = torch.stack([data[i+1:i+block_size+1] for i in ix])
-     x, y = x.to(device), y.to(device)
-
-     return x, y
-
- @torch.no_grad()
- def estimate_loss():
-     out = {}
-     model.eval()
-     for split in ['train', 'val']:
-         losses = torch.zeros(eval_iters)
-         for k in range(eval_iters):
-             X, Y = get_batch(split)
-             logits, loss = model(X, Y)
-             losses[k] = loss.item()
-         out[split] = losses.mean()
-     model.train()
-     return out
-
- from model import Transformer
- model = Transformer(vocab_size=vocab_size)
- m = model.to(device)
-
- n_param = sum(p.numel() for p in m.parameters())/1e6
- print(f"{n_param:.2f} million")
- optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
- steps = []
- train_losses = []
- val_losses = []
-
- # simple training loop: periodically evaluate, then take one optimizer step per iteration
- for iter in range(max_iters):
-   if iter % eval_interval == 0 or iter == max_iters - 1:
-     losses = estimate_loss()
-     print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
-     steps.append(iter)
-     train_losses.append(losses['train'])
-     val_losses.append(losses['val'])
-
-   xb, yb = get_batch('train')
-   logits, loss = model(xb, yb)
-   optimizer.zero_grad(set_to_none=True)
-   loss.backward()
-   optimizer.step()
-
- torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
- ```
-
 #### Training Hyperparameters
 
 Configurations are saved in the `enigma/config-enigma.json` file. Suitable for the 2.5b model.
 
 
 
 ## Model Details
+ It's a 2.5b model trained on ~1 billion individual letters of DNA, kind of like training a text-based model at the per-character level instead of the sub-word level.
+ It also has its own tokenizer, which is an intersection between a char-level tokenizer and a BPE tokenizer.
+
+ EnBERT, i.e. the decoder-only model, is trained on lots of DNA sequences tokenized with a k-mer tokenizer trained specially for this purpose, which means it has a larger vocab size than enigma-2.5b. The same model architecture is used to train a 430m model that is per-char based, the same as the 2.5b model, but better than it.
 ### Model Description
 
 - **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
 
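For intuition, per-character tokenization of DNA can be sketched roughly as below. This is illustrative only: the vocabulary, special tokens, and method names are assumptions, and the actual `PerCharTokenizer` in the repo's `tokenizer` module may differ.

```python
# Illustrative per-character DNA tokenizer (not the repo's actual implementation).
class SimplePerCharTokenizer:
    def __init__(self):
        # assumed vocabulary: the four bases plus an unknown-base and a padding token
        self.chars = ['A', 'T', 'G', 'C', 'N', '<pad>']
        self.str_to_id = {ch: i for i, ch in enumerate(self.chars)}
        self.id_to_str = {i: ch for i, ch in enumerate(self.chars)}
        self.vocab_size = len(self.chars)

    def encode(self, seq):
        # unknown letters fall back to 'N'
        return [self.str_to_id.get(ch, self.str_to_id['N']) for ch in seq.upper()]

    def decode(self, ids):
        return ''.join(self.id_to_str[i] for i in ids)

# A k-mer tokenizer (as used for EnBERT) enumerates 4**k possible k-mers over A/T/G/C
# (e.g. 4**6 = 4096), which is why its vocab is much larger than a per-char one.
print(SimplePerCharTokenizer().encode("TGCCCT"))  # -> [1, 2, 3, 3, 3, 1]
```

A BPE-style scheme would sit between these two extremes, merging frequent base pairs into larger tokens while keeping single letters in the vocabulary.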
 Can be used to generate new sequences of DNA from a given input of tokens, or for further research. Anyway, it's very basic in nature. I'll add more functionality, including classification of DNA, masked-token generation, etc. Maybe I'll even implement the MoE technique in the future.
 ### Direct Use
 
+ Load the model and it can then be used to generate new sequences, with `max_length=512` for the 2.5b model and `256` for the enigma-430m model.
 
 ## Bias, Risks, and Limitations
 
 
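Putting the direct-use notes together, an end-to-end sketch might look like the following. The checkpoint filename, the `vocab_size` value, and the device handling are assumptions here; the `generate()` call mirrors the snippet shown in the next hunk.

```python
import torch
from model import Transformer            # repo's model definition
from tokenizer import PerCharTokenizer   # repo's tokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

token = PerCharTokenizer()
vocab_size = 6  # assumed per-char vocab size (A/T/G/C + specials); use the repo's actual value

# checkpoint name follows the torch.save(...) pattern in the training script (assumed)
model = Transformer(vocab_size=vocab_size)
model.load_state_dict(torch.load('enigma_2500m.pth', map_location=device))
model = model.to(device)
model.eval()

prompt = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
context = torch.tensor([token.encode(prompt)], dtype=torch.long, device=device)

# the 2.5b model is documented with max_length=512 (256 for the 430m model)
generated = model.generate(context, max_new_tokens=500)
print(token.decode(generated[0].tolist()))
```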
 from model import Transformer
 model = Transformer(vocab_size=vocab_size)
 
+ from tokenizer import PerCharTokenizer
+ token = PerCharTokenizer()
+
+ input = "TGCCCTGGCTGCTCCGCATTGCAGGAGCTGCGCCCTTCCTTTC"
+ token_input = token.encode(input)
+ context = torch.tensor([token_input], dtype=torch.long, device=device)
+ generated_output = token.decode(model.generate(context, max_new_tokens=500)[0].tolist())
+ print(generated_output)
 ```
 
 ## Training Details
 
 Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, from scratch or to pre-train.
 #### Functions:
 This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and `train()` acts as the master function, calling the other functions every set number of iterations.
 #### Training Hyperparameters
 
 Configurations are saved in the `enigma/config-enigma.json` file. Suitable for the 2.5b model.
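Since the hyperparameters live in `enigma/config-enigma.json`, a training or inference script can read them from there. A minimal sketch, assuming key names that match the variables used in the training code (`block_size`, `batch_size`, `learning_rate`, and so on); the actual keys in the JSON may differ:

```python
import json

# load the saved configuration (key names are assumptions; check config-enigma.json)
with open('enigma/config-enigma.json') as f:
    config = json.load(f)

block_size = config['block_size']      # context length
batch_size = config['batch_size']
learning_rate = config['learning_rate']
max_iters = config['max_iters']
eval_interval = config['eval_interval']
eval_iters = config['eval_iters']

print(f"block_size={block_size}, batch_size={batch_size}, lr={learning_rate}")
```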