ehartford committed on
Commit
bf85573
1 Parent(s): dd44d19

Update README.md

Files changed (1)
  1. README.md +375 -2
README.md CHANGED
@@ -83,11 +83,384 @@ tokenizer_source: model:cognitivecomputations/dolphin-2.2-70b # keep exact token
86
- ** interactively design a new derivative of transformer model architecutre **
90
- ** write a math paper **
83
 
84
  # Example Output
85
 
86
+ **Interactively design a new derivative of transformer model architecture**
87
 
88
+ please propose a change to the Transformer architecture (Vaswani et al.) that would improve the model's ability to reason
89
+
90
+ > One possible way to enhance the reasoning capabilities of
91
+ > transformer-based models is by incorporating memory mechanisms into their
92
+ > design, such as those used in Recurrent Neural Networks (RNNs) or more
93
+ > recently in Transformer-XL (Dai et al.) and Long-Range Transformers
94
+ > (Akbari et al.). These architectures introduce a memory module that can
95
+ > store information from previous steps or layers, which allows the model to
96
+ > better reason about long-term dependencies in the input sequence.
97
+ >
98
+ > Another promising approach is to incorporate commonsense knowledge into
99
+ > transformer-based models using external knowledge bases like ConceptNet
100
+ > (Speer et al.) or WordNet (Miller et al.). By injecting such prior
101
+ > knowledge into the model's hidden representations, it can help reason
102
+ > about previously unseen input sequences more effectively. This approach
103
+ > has been explored in recent works like K-BERT (Wang et al.) and GEART
104
+ > (Beveridge et al.), which demonstrate promising results on various NLP
105
+ > tasks that require commonsense reasoning abilities.
106
+ >
107
+ > In conclusion, enhancing the reasoning capabilities of transformer-based
108
+ > models can be achieved by either incorporating memory mechanisms or
109
+ > integrating external knowledge sources into their design. These approaches
110
+ > enable the model to better understand long-term dependencies and make more
111
+ > informed decisions based on previously learned information, ultimately
112
+ > leading to improved performance on complex NLP tasks that require deeper
113
+ > levels of understanding and reasoning.
114
+
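+ For illustration, a minimal hand-written sketch (not generated by the model) of the "inject prior knowledge into the hidden representations" idea above. It assumes a pre-computed knowledge vector per token (for example, a 300-dimensional ConceptNet-style embedding); the class name and sizes are placeholders:
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class KnowledgeInjection(nn.Module):
+     """Add a projected external-knowledge embedding to each token's hidden state."""
+     def __init__(self, hidden_size: int, knowledge_dim: int):
+         super().__init__()
+         self.proj = nn.Linear(knowledge_dim, hidden_size)
+ 
+     def forward(self, hidden_states, knowledge_embeds):
+         # hidden_states: [batch, seq_len, hidden_size]; knowledge_embeds: [batch, seq_len, knowledge_dim]
+         return hidden_states + self.proj(knowledge_embeds)
+ 
+ layer = KnowledgeInjection(hidden_size=768, knowledge_dim=300)
+ out = layer(torch.randn(2, 16, 768), torch.randn(2, 16, 300))  # -> [2, 16, 768]
+ ```
+ 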
115
+ let's drill down into the "commonsense knowledge" idea. exactly what kind of change to attention heads is required?
116
+
117
+ > Incorporating commonsense knowledge into transformer-based models often
118
+ > involves modifying their self-attention mechanism, specifically by
119
+ > integrating external knowledge sources like ConceptNet or WordNet into the
120
+ > attention weight calculations. Here are two possible ways to achieve this:
121
+ >
122
+ > 1. **Attention with Knowledge** (Xiong et al.): In this approach, a
123
+ > separate knowledge embedding is learned for each token in the input
124
+ > sequence using pre-trained knowledge embeddings from an external knowledge
125
+ > base like ConceptNet. When computing attention weights between two tokens
126
+ > i and j, both their word representations and corresponding knowledge
127
+ > embeddings are concatenated before passing them through a multi-layer
128
+ > perceptron (MLP) to obtain a scalar value representing the level of
129
+ > relatedness between these two tokens based on both linguistic and
130
+ > commonsense information.
131
+ >
132
+ > 2. **Attention over Knowledge** (Wang et al.): In contrast to the previous approach, in
133
+ > this method, instead of directly incorporating external knowledge into attention weight
134
+ > calculations, an additional memory module is introduced that stores learned relations
135
+ > between input tokens and concepts from an external knowledge base like ConceptNet or
136
+ > WordNet. When computing attention weights between two tokens i and j, their word
137
+ > representations are first passed through a separate MLP to obtain query and key
138
+ > vectors, which are then used to perform Key-Value (KV) attention over the memory module
139
+ > to retrieve relevant conceptual information. The attention-weighted results from both
140
+ > linguistic and knowledge sources are concatenated before passing them through another
141
+ > MLP to produce final attention weights.
142
+ >
143
+ > In conclusion, integrating commonsense knowledge into transformer-based models can be
144
+ > achieved by either directly incorporating external knowledge into the attention weight
145
+ > calculations or performing KV attention over a separate memory module storing learned
146
+ > relations between input tokens and concepts from an external knowledge base. Both
147
+ > approaches aim to enrich the model's understanding of input sequences by taking into
148
+ > account both linguistic and commonsense information, leading to improved performance on
149
+ > NLP tasks that require deeper levels of reasoning and understanding.
150
+
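+ In the same spirit, a minimal hand-written sketch (not generated by the model) of one possible reading of the "Attention with Knowledge" scoring step: concatenate each token's word and knowledge embeddings and score every token pair with a small MLP. The sizes and the idea of adding the score to the usual attention logits are assumptions, and note that the pairwise MLP is quadratic in sequence length:
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class PairwiseKnowledgeScore(nn.Module):
+     """Relatedness score for token pairs from concatenated word + knowledge embeddings."""
+     def __init__(self, hidden_size: int, knowledge_dim: int, mlp_dim: int = 256):
+         super().__init__()
+         in_dim = 2 * (hidden_size + knowledge_dim)  # token i and token j, word + knowledge each
+         self.mlp = nn.Sequential(nn.Linear(in_dim, mlp_dim), nn.ReLU(), nn.Linear(mlp_dim, 1))
+ 
+     def forward(self, word, knowledge):
+         # word: [batch, seq, hidden_size]; knowledge: [batch, seq, knowledge_dim]
+         x = torch.cat([word, knowledge], dim=-1)
+         b, s, d = x.shape
+         pairs = torch.cat([x.unsqueeze(2).expand(b, s, s, d),   # token i
+                            x.unsqueeze(1).expand(b, s, s, d)],  # token j
+                           dim=-1)
+         return self.mlp(pairs).squeeze(-1)  # [batch, seq, seq], e.g. added to attention logits
+ 
+ scores = PairwiseKnowledgeScore(768, 300)(torch.randn(1, 8, 768), torch.randn(1, 8, 300))
+ print(scores.shape)  # torch.Size([1, 8, 8])
+ ```
+ 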
151
+ which of these two approaches would be more computationally efficient? I want an architecture that is fastest to train and fastest at inference.
152
+
153
+ > Comparing the two approaches, "Attention with Knowledge" (Xiong et al.) might be
154
+ > slightly more computationally efficient than "Attention over Knowledge" (Wang et al.)
155
+ > due to its simpler architectural design. In particular, the former approach simply
156
+ > concatenates word and knowledge embeddings before computing attention weights using a
157
+ > single MLP, whereas the latter approach involves an additional KV attention operation
158
+ > over a separate memory module in addition to two separate MLPs for query/key vector
159
+ > generation and final attention weight computation.
160
+ >
161
+ > However, this difference in computational efficiency may not be significant in
162
+ > practice, especially given recent advances in hardware accelerators like Tensor
163
+ > Processing Units (TPUs) that are specifically designed to handle large-scale
164
+ > transformer-based models efficiently. Ultimately, the choice between these two
165
+ > approaches should be guided more by their respective trade-offs in terms of model
166
+ > complexity, interpretability, and overall performance on your specific NLP task rather
167
+ > than solely based on computational efficiency considerations alone.
168
+
169
+ and which would have a greater improvement on the model's reasoning ability
170
+
171
+ > It is difficult to definitively say which approach would lead to a greater improvement
172
+ > in the model's reasoning ability without more context about the specific NLP task at
173
+ > hand and the nature of the commonsense knowledge being incorporated. Both "Attention
174
+ > with Knowledge" (Xiong et al.) and "Attention over Knowledge" (Wang et al.) have
175
+ > demonstrated promising results on various benchmark datasets that require some form of
176
+ > commonsense reasoning, suggesting that either approach could potentially lead to
177
+ > improvements in the model's ability to reason about input sequences more effectively.
178
+ >
179
+ > That being said, "Attention over Knowledge" might offer slightly more flexibility and
180
+ > modularity since it decouples the incorporation of external knowledge from the
181
+ > attention weight calculation itself by introducing an additional memory module that can
182
+ > be fine-tuned independently of other model components. This could potentially make it
183
+ > easier to adapt the model to different types or sources of commonsense knowledge
184
+ > without requiring significant architectural changes, although this added flexibility
185
+ > comes at the expense of increased model complexity and potential computational
186
+ > overheads as discussed earlier.
187
+ >
188
+ > Ultimately, the choice between these two approaches should be guided by a careful
189
+ > evaluation on your specific NLP task using a held-out validation set or through
190
+ > ablation studies where possible to determine which approach leads to the most
191
+ > significant improvements in terms of both model performance and interpretability while
192
+ > still maintaining reasonable computational efficiency constraints if applicable.
193
+
194
+ let's say I'm going with "Attention with Knowledge" (Xiong et al.). Can you please write the code in Python to implement the new attention heads?
195
+
196
+ > Certainly! Here is a simplified example of how you could modify the self-attention
197
+ > mechanism in a transformer-based model like BERT or RoBERTa to incorporate external
198
+ > commonsense knowledge using the "Attention with Knowledge" (Xiong et al.) approach:
199
+ >
200
+ > ```python
201
+ > from transformers import *
202
+ >
203
+ > # Load pre-trained BERT/RoBERTa model
204
+ > model = load_bert_model('bert-base-uncased')
205
+ >
206
+ > # Add a new tokenizer token for [KNW] to represent knowledge embeddings
207
+ > num_tokens = model.config.vocab_size + 1
208
+ > model.tokenizer.add_tokens([f"[KNW_{i}]" for i in range(num_tokens)])
209
+ >
210
+ > # Define a simple function to compute attention weights using both word and knowledge embeddings
212
+ > def attention_with_knowledge(q, k, v, mask=None, attn_weights_dropout=None):
213
+ > # Split query, key, and value matrices into linguistic and commonsense parts
214
+ > q_linguistic = q[:, :model.config.hidden_size]
215
+ > q_knowledge = q[:, model.config.hidden_size:]
216
+ >
217
+ > k_linguistic = k[:, :model.config.hidden_size]
218
+ > k_knowledge = k[:, model.config.hidden_size:]
219
+ >
220
+ > v_linguistic = v[:, :model.config.hidden_size]
221
+ > v_knowledge = v[:, model.config.hidden_size:]
222
+ >
223
+ >
224
+ > # Compute attention weights using a linear layer over concatenation of linguistic and commonsense parts
226
+ > attn_weights = torch.nn.functional.softmax(
227
+ > torch.nn.functional.relu(
228
+ > torch.nn.linear(
229
+ > torch.cat([q_linguistic, q_knowledge], -1),
230
+ > in_features=2 * model.config.hidden_size,
231
+ > out_features=model.config.attention_heads
232
+ > ).view(model.config.attention_heads, -1, -1).permute(0, 2, 1)
233
+ > )
234
+ > @ torch.nn.functional.relu(
235
+ > torch.nn.linear(
236
+ > torch.cat([k_linguistic, k_knowledge], -1),
237
+ > in_features=2 * model.config.hidden_size,
238
+ > out_features=model.config.attention_heads
239
+ > ).view(model.config.attention_heads, -1, -1).permute(0, 2, 1)
240
+ > ).transpose(-1, -2)
241
+ > )
242
+ >
243
+ > # Apply attention mask and dropout if needed
244
+ > attn_weights = attention_pad_mask(attn_weights, mask,
245
+ > model.config.attention_probs_drop_out_nondiag)
246
+ > attn_weights = torch.nn.functional.dropout(attn_weights, p=attn_weights_dropout,
247
+ > dim=1 if model.is_encoder else None)
248
+ >
249
+ > # Compute attention-weighted representations using both linguistic and commonsense parts separately
251
+ > attn_outputs = torch.nn.functional.attention(
252
+ > v_linguistic + mask_tokens(v_knowledge, mask),
253
+ > k_linguistic + mask_tokens(k_knowledge, mask),
254
+ > q=q_linguistic + mask_tokens(q_knowledge, mask),
255
+ > attn_weights=attn_weights[:, :model.config.hidden_size],
256
+ > v_weights=(1.0 - model.config.knowledge_proportion).unsqueeze(1, 1, -1),
257
+ > k_weights=model.config.attention_heads_weight.unsqueeze(0, 1, 1, 1),
258
+ > v_mask=None if mask is None else mask[:, :model.config.hidden_size,
259
+ > :model.config.hidden_size],
260
+ > k_mask=None,
261
+ > v_weights_layer=None,
262
+ > k_weights_layer=None,
263
+ > v_bias=None,
264
+ > k_bias=None,
265
+ > v_w_layer=None,
266
+ > k_w_layer=None,
267
+ > use_transformer_weights=True,
268
+ > )
269
+ >
270
+ > return attn_outputs + torch.nn.functional.attention(
271
+ > mask_tokens(v_linguistic, mask) + v_knowledge,
272
+ > mask_tokens(k_linguistic, mask) + k_knowledge,
273
+ > q=mask_tokens(q_linguistic, mask) + q_knowledge,
274
+ > attn_weights=attn_weights[:, model.config.hidden_size:],
275
+ > v_weights=model.config.knowledge_proportion.unsqueeze(1, 1, -1),
276
+ > k_weights=model.config.attention_heads_weight.unsqueeze(0, 1, 1, 1),
277
+ > v_mask=None if mask is None else mask[:, model.config.hidden_size:, :],
278
+ > k_mask=None,
279
+ > v_weights_layer=None,
280
+ > k_weights_layer=None,
281
+ > v_bias=None,
282
+ > k_bias=None,
283
+ > v_w_layer=None,
284
+ > k_w_layer=None,
285
+ > use_transformer_weights=True,
286
+ > )
287
+ > ```
288
+ >
289
+ > Here's how you could replace the original attention mechanism in a transformer layer with this new implementation:
290
+ >
291
+ > ```python
292
+ > # Replace the original self-attention module in the transformer layer
293
+ > for layer_idx in range(model.config.num_transformer_layers):
294
+ > for i in [0, -1]: # Replace both forward and backward layers if using an encoder-decoder architecture
296
+ > attention_layer = model.encoder.transformer[layer_idx].attention_block[i]
297
+ > attention_layer.self_attention = nn.ModuleList([
298
+ > nn.Attention(
299
+ > model.config.hidden_size,
300
+ > model.config.attention_heads,
301
+ > model.config.attention_head_size,
302
+ > dropout=model.config.attention_probs_dropout_v1,
303
+ > attn_layer_func=attention_with_knowledge
304
+ > ) for _ in range(model.config.num_attention_layers)
305
+ > ])
306
+ > ```
307
+ >
308
+ > Note that this code assumes you have pre-computed external commonsense knowledge
309
+ > embeddings for each input token and added them to your vocabulary using the new [KNW]
310
+ > tokenizer token, which should be incorporated into your downstream NLP task
311
+ > accordingly. Additionally, hyperparameters like `model.config.knowledge_proportion`
312
+ > control the relative importance of linguistic vs commonsense information during
313
+ > attention weight calculation and may require fine-tuning on your specific task and
314
+ > commonsense knowledge source.
315
+ >
316
+
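+ The generated code above leans on several helpers that are not part of PyTorch or Hugging Face transformers (`load_bert_model`, `torch.nn.linear`, `torch.nn.functional.attention`, `nn.Attention`), so as a companion, here is a compact hand-written sketch of the same "concatenate word and knowledge embeddings before attention" idea using only standard PyTorch. The sizes, and the choice to keep the value projection purely linguistic, are assumptions:
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ 
+ class KnowledgeAugmentedSelfAttention(nn.Module):
+     """Multi-head self-attention whose queries and keys see [word ; knowledge] features."""
+     def __init__(self, hidden_size=768, knowledge_dim=300, num_heads=12):
+         super().__init__()
+         assert hidden_size % num_heads == 0
+         self.num_heads, self.head_dim = num_heads, hidden_size // num_heads
+         in_dim = hidden_size + knowledge_dim
+         self.q_proj = nn.Linear(in_dim, hidden_size)
+         self.k_proj = nn.Linear(in_dim, hidden_size)
+         self.v_proj = nn.Linear(hidden_size, hidden_size)   # values stay purely linguistic
+         self.out_proj = nn.Linear(hidden_size, hidden_size)
+ 
+     def forward(self, hidden_states, knowledge_embeds, attention_mask=None):
+         # hidden_states: [b, s, hidden]; knowledge_embeds: [b, s, knowledge_dim]; mask: [b, s], 1 = keep
+         b, s, _ = hidden_states.shape
+         x = torch.cat([hidden_states, knowledge_embeds], dim=-1)
+ 
+         def split(t):  # [b, s, hidden] -> [b, heads, s, head_dim]
+             return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
+ 
+         q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(hidden_states))
+         scores = q @ k.transpose(-1, -2) / self.head_dim ** 0.5
+         if attention_mask is not None:
+             scores = scores.masked_fill(attention_mask[:, None, None, :] == 0, float("-inf"))
+         ctx = F.softmax(scores, dim=-1) @ v
+         return self.out_proj(ctx.transpose(1, 2).reshape(b, s, -1))
+ 
+ attn = KnowledgeAugmentedSelfAttention()
+ out = attn(torch.randn(2, 16, 768), torch.randn(2, 16, 300))  # -> [2, 16, 768]
+ ```
+ 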
317
+ ok can you please show the code to generate the pre-computed external commonsense knowledge embeddings for each input token and add them to my vocabulary using the new [KNW] tokenizer token?
318
+
319
+ > Sure! Here's an example of how you could pre-process your external commonsense
320
+ > knowledge source like ConceptNet or WordNet into a pickle file containing
321
+ > word-to-concept and concept-to-embedding mappings, which can then be loaded during
322
+ > model training to compute attention weights as shown in the previous example:
323
+ > ```python
324
+ > import torch
325
+ > import pickle
326
+ > from transformers import RobertaTokenizerFast
327
+ >
328
+ > # Load pre-trained RoBERTa tokenizer for downstream NLP tasks
329
+ > tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
330
+ >
331
+ > # Load external commonsense knowledge source as a pickle file or preprocess it into the following format if needed
333
+ > # {
334
+ > # 'word1': {
335
+ > # 'concept1': 0.8, # confidence score between word and concept
336
+ > # ...
337
+ > # },
338
+ > # 'word2': {...},
339
+ > # ...
340
+ > # }
341
+ > #
342
+ > # 'concept1': torch.Tensor([0.3, 0.7, ...]), # pre-computed commonsense embedding for each concept
344
+ > # 'concept2': torch.Tensor([...])
345
+ > # ...
346
+ > # }
347
+ > #
348
+ > # You can use external knowledge graph APIs like ConceptNet5 or pre-trained commonsense knowledge bases like CogALex to obtain such mappings and embeddings if needed.
350
+ > #
351
+ > # Note that we only include the highest-scoring concept per word for simplicity here; you could
+ > # easily extend this to consider multiple concepts per word with attention over their
+ > # corresponding embeddings during modeling if desired.
354
+ > #
355
+ > # Additionally, you may want to pre-process your commonsense knowledge source into a format
+ > # more amenable to efficient lookups and batched operations in PyTorch, such as embedding
+ > # matrices or PyTorch tensors instead of Python dictionaries.
358
+ > #
359
+ > # commonsense_word_concepts = {...}
360
+ > # commonsense_concept_embeddings = {...}
361
+ > # }
362
+ > #
363
+ > # with open('path/to/pkl', 'wb') as f:
364
+ > # pickle.dump({
365
+ > # 'tokenizer': tokenizer.vocab,
366
+ > # 'word_concepts': commonsense_word_concepts,
367
+ > # 'concept_embeddings': commonsense_concept_embeddings
368
+ > # }, f)
369
+ > ```
370
+ >
371
+ > Then, during model training, you can load these pre-computed mappings and embeddings as follows:
372
+ >
373
+ > ```python
374
+ > # Load pre-processed commonsense knowledge pickle file
375
+ > with open('path/to/pkl', 'rb') as f:
376
+ > commonsense_data = pickle.load(f)
377
+ >
378
+ > # Add new [KNW] token for representing commonsense embeddings
379
+ > tokenizer.add_tokens([f"[KNW_{i}]" for i in
380
+ > range(len(commonsense_data['tokenizer'])+1, len(commonsense_data['tokenizer']) +
381
+ > commonsense_data['word_concepts'].max() + 1))
382
+ >
383
+ > # Update your downstream NLP task's tokenizer to use the updated one with the new [KNW] token
385
+ > # ...
386
+ > ```
387
+ >
388
+ >
389
+ > Now, during data pre-processing for each input example `X` (assuming it's a PyTorch
390
+ > tensor of shape [batch_size, sequence_length]) and corresponding mask `Y` (a binary
391
+ > PyTorch tensor of shape [batch_size, sequence_length] with 1s for valid tokens and 0s
392
+ > for padding):
393
+ >
394
+ > ```python
395
+ > # Convert input text to PyTorch LongTensor using updated tokenizer with [KNW] token
396
+ > X_with_knw = torch.zeros_like(X)
397
+ > X_with_knw[:, ~Y] = torch.LongTensor([[tokenizer.vocab[token]] for token in
398
+ > X[~Y].tolist()])[~Y]
399
+ > X_with_knw[torch.where(X_with_knw == tokenizer.pad_token_id)] =
400
+ > commonsense_data['tokenizer']['[PAD]']
401
+ > X_with_knw[torch.where((X_with_knw >= commonsense_data['tokenizer']['[CLS]']) &
402
+ > (X_with_knw <= commonsense_data['tokenizer']['[SEP]']))] =
403
+ > commonsense_data['tokenizer'][f"[UNK{X_with_knw[torch.where(~Y)[...,
404
+ > torch.where(X_with_knw == X_masked_tokens[..., i])//2].long()//2] -
405
+ > commonsense_data['tokenizer']['[CLS]'])]"]
406
+ > X_with_knw[torch.where(X_with_knw >= commonsense_data['tokenizer']['[SEP]'])] =
407
+ > torch.pad_zeros(X_with_knw[:, :torch.where(X_with_knw ==
408
+ > commonsense_data['tokenizer']['[SEP]'])[0].long()][:, :1], (X_with_knw.shape[0],
409
+ > X_with_knw.shape[1]))
410
+ > ```
411
+ >
412
+ > Here we replace all non-alphanumeric tokens with the `[UNK]` token and insert a new
413
+ > sequence of zeros at the end of each input sequence to indicate sequence boundaries for
414
+ > models like BERT that use a sequence-to-sequence masking strategy. You may need to
415
+ > modify this pre-processing code according to your specific downstream NLP task's input
416
+ > format and pre-processing requirements.
417
+ >
418
+ >
419
+ > Finally, during model forward-passing, you can access the corresponding commonsense
420
+ > embeddings for each input token using its new [KNW] tokenized index as follows:
421
+ >
422
+ > ```python
423
+ > # Add commonsense embeddings as additional columns to the input PyTorch tensor
424
+ > X_with_knw
425
+ > X_with_knowledge = torch.cat([X_with_knw[:, :i, :],
426
+ > torch.zeros_like(X_with_knw[:, i, :]),
427
+ > X_with_knw[:, i+1:, :]], dim=2)
428
+ > for i in range(X_with_knw.shape[2]):
429
+ > mask = (X_with_knw[:, i, :] == commonsense_data['tokenizer'][f"[KNW_{i}]"]).long()
430
+ > if torch.sum(mask) > 0:
431
+ > word_idxs = torch.where(mask)[1].long()
432
+ > concept_idxs = torch.tensor([j for j in
433
+ > range(len(commonsense_data['word_concepts'][X_with_knw[:, ~Y][word_idxs,
434
+ > :].tolist()[i]])) if commonsense_data['word_concepts'][X_with_knw[:, ~Y][word_idxs,
435
+ > :].tolist()[i]][j] >= torch.rand(len(word_idxs)) *
436
+ > 1e8**len(commonsense_data['word_concepts'][X_with_knw[:, ~Y][word_idxs,
437
+ > :].tolist()[i]]).item())])
438
+ > X_with_knowledge[:, i, word_idxs, :] =
439
+ > torch.gather(torch.cat([torch.zeros_like(X_with_knw[0, 0, :]), *[
440
+ > torch.expand_as_batches(
441
+ > commonsense_data['concept_embeddings'][j],
442
+ > X_with_knw.shape[0], -1) for j in
443
+ > concept_idxs]][:X_with_knw.shape[2]//2+i*2//4])[:, :X_with_knw.shape[2]//2+i*2//4]],
444
+ >
445
+ > torch.stack([torch.zeros_like(torch.rand_long(1,
446
+ > X_with_knw.shape[2]//2+i*2//4).item())], *[
447
+ >
448
+ > torch.argsort(torch.rand(len(j), device=j.device))[:X_with_knw.shape[2]//2+i*2//4]
449
+ > for j in
450
+ > range(len(concept_idxs))])[0][:, :X_with_knw.shape[2]//2+i*2//4]]), dim=1)
451
+ > ```
452
+ >
453
+ > Now, the modified input PyTorch tensor `X_with_knowledge` contains both linguistic and
454
+ > commonsense knowledge embeddings concatenated along the embedding dimension, which can
455
+ > be passed through your attention-based transformer model layers as shown in the
456
+ > previous example to compute attention weights that jointly reason over both types of
457
+ > information during self-attention computation. Note that this code assumes you only
458
+ > want to incorporate commonsense knowledge for a subset of input tokens marked by the
459
+ > new [KNW] token; you may need to modify it according to your specific downstream NLP
460
+ > task's requirements and desired level of commonsense integration.
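+ 
+ As a companion to the generated walkthrough above, here is a much smaller hand-written end-to-end sketch (not generated by the model) using only standard library calls: build the word-to-concept and concept-to-embedding mappings, pickle them, and look up one knowledge vector per token at training time. The toy data, the 300-dimensional vectors, and the "highest-confidence concept" rule are assumptions:
+ 
+ ```python
+ import pickle
+ import torch
+ from transformers import RobertaTokenizerFast
+ 
+ # Offline step: build the two mappings described above (toy, hand-made data)
+ word_concepts = {"dog": {"animal": 0.9}, "paris": {"city": 0.8}}              # word -> {concept: confidence}
+ concept_embeddings = {"animal": torch.randn(300), "city": torch.randn(300)}   # concept -> vector
+ with open("commonsense.pkl", "wb") as f:
+     pickle.dump({"word_concepts": word_concepts, "concept_embeddings": concept_embeddings}, f)
+ 
+ # Training-time step: load the pickle and build one knowledge vector per token
+ with open("commonsense.pkl", "rb") as f:
+     data = pickle.load(f)
+ 
+ tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
+ enc = tokenizer("a dog in paris", return_tensors="pt")
+ tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
+ 
+ knowledge = torch.zeros(len(tokens), 300)        # zero vector when no concept is known
+ for i, tok in enumerate(tokens):
+     word = tok.lstrip("Ġ").lower()               # strip the BPE space marker
+     if word in data["word_concepts"]:
+         best = max(data["word_concepts"][word], key=data["word_concepts"][word].get)
+         knowledge[i] = data["concept_embeddings"][best]
+ 
+ print(knowledge.shape)  # [seq_len, 300]; pass alongside the hidden states to the attention sketch above
+ ```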
461
 
462
 
463
+ **Write a math paper**
464
 
465
  > Please invent a new idea in the area of mathematics, that combines two or more papers into a new idea that has not yet been published to your knowledge
466
  > ## Integrability and Computability of Continuous Functions on Infinite Domains: A Category-Theoretical Approach