Update README.md

README.md

# Example Output

**Interactively design a new derivative of the transformer architecture**

please propose a change to the Transformer architecture (Vaswani et al.) that would improve the model's ability to reason

> One possible way to enhance the reasoning capabilities of transformer-based models is to incorporate memory mechanisms into their design, such as those used in Recurrent Neural Networks (RNNs) or, more recently, in Transformer-XL (Dai et al.) and Long-Range Transformers (Akbari et al.). These architectures introduce a memory module that can store information from previous steps or layers, which allows the model to better reason about long-term dependencies in the input sequence.
>
> Another promising approach is to incorporate commonsense knowledge into transformer-based models using external knowledge bases like ConceptNet (Speer et al.) or WordNet (Miller et al.). Injecting such prior knowledge into the model's hidden representations can help it reason about previously unseen input sequences more effectively. This approach has been explored in recent works like K-BERT (Wang et al.) and GEART (Beveridge et al.), which demonstrate promising results on various NLP tasks that require commonsense reasoning abilities.
>
> In conclusion, enhancing the reasoning capabilities of transformer-based models can be achieved either by incorporating memory mechanisms or by integrating external knowledge sources into their design. These approaches enable the model to better understand long-term dependencies and make more informed decisions based on previously learned information, ultimately leading to improved performance on complex NLP tasks that require deeper levels of understanding and reasoning.

let's drill down into the "commonsense knowledge" idea. exactly what kind of change to attention heads is required?

> Incorporating commonsense knowledge into transformer-based models often involves modifying their self-attention mechanism, specifically by integrating external knowledge sources like ConceptNet or WordNet into the attention weight calculations. Here are two possible ways to achieve this:
>
> 1. **Attention with Knowledge** (Xiong et al.): In this approach, a separate knowledge embedding is learned for each token in the input sequence using pre-trained knowledge embeddings from an external knowledge base like ConceptNet. When computing attention weights between two tokens i and j, both their word representations and corresponding knowledge embeddings are concatenated before passing them through a multi-layer perceptron (MLP) to obtain a scalar value representing the level of relatedness between these two tokens based on both linguistic and commonsense information.
>
> 2. **Attention over Knowledge** (Wang et al.): In contrast, this method does not incorporate external knowledge directly into the attention weight calculations; instead, an additional memory module is introduced that stores learned relations between input tokens and concepts from an external knowledge base like ConceptNet or WordNet. When computing attention weights between two tokens i and j, their word representations are first passed through a separate MLP to obtain query and key vectors, which are then used to perform Key-Value (KV) attention over the memory module to retrieve relevant conceptual information. The attention-weighted results from both the linguistic and knowledge sources are concatenated before passing them through another MLP to produce the final attention weights.
>
> In conclusion, integrating commonsense knowledge into transformer-based models can be achieved either by directly incorporating external knowledge into the attention weight calculations or by performing KV attention over a separate memory module that stores learned relations between input tokens and concepts from an external knowledge base. Both approaches aim to enrich the model's understanding of input sequences by taking both linguistic and commonsense information into account, leading to improved performance on NLP tasks that require deeper levels of reasoning and understanding.

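*As a concrete reference for the first approach, here is a minimal, self-contained PyTorch sketch of the concatenate-and-score idea described above. It is illustrative only: the class and parameter names are assumptions for this README, not code from Xiong et al.*

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithKnowledge(nn.Module):
    """One attention head that scores each token pair with an MLP over the
    concatenation of word and knowledge embeddings (illustrative sketch)."""

    def __init__(self, hidden_size: int, knowledge_size: int):
        super().__init__()
        pair_dim = 2 * (hidden_size + knowledge_size)
        self.score_mlp = nn.Sequential(
            nn.Linear(pair_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )
        self.value_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden, knowledge, mask=None):
        # hidden: (B, S, H); knowledge: (B, S, K); mask: (B, S), 1 = real token
        x = torch.cat([hidden, knowledge], dim=-1)        # (B, S, H + K)
        B, S, D = x.shape
        # Build every (i, j) token pair and score it with the MLP.
        pairs = torch.cat(
            [x.unsqueeze(2).expand(B, S, S, D),           # token i over all j
             x.unsqueeze(1).expand(B, S, S, D)],          # token j over all i
            dim=-1,
        )                                                 # (B, S, S, 2D)
        scores = self.score_mlp(pairs).squeeze(-1)        # (B, S, S)
        if mask is not None:
            scores = scores.masked_fill(mask[:, None, :] == 0, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.value_proj(hidden)
```

*Note that the explicit pairwise scoring costs O(S²) time and memory per head, which is worth keeping in mind for the efficiency question below.*
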
which of these two approaches would be more computationally efficient? I want an architecture that's fastest to train, with the fastest possible inference.

> Comparing the two approaches, "Attention with Knowledge" (Xiong et al.) is likely to be somewhat more computationally efficient than "Attention over Knowledge" (Wang et al.) due to its simpler architectural design. The former simply concatenates word and knowledge embeddings before computing attention weights with a single MLP, whereas the latter adds a KV attention operation over a separate memory module on top of two separate MLPs for query/key vector generation and final attention weight computation.
>
> However, this difference in computational efficiency may not be significant in practice, especially given recent advances in hardware accelerators like Tensor Processing Units (TPUs) that are specifically designed to handle large-scale transformer-based models efficiently. Ultimately, the choice between these two approaches should be guided more by their respective trade-offs in terms of model complexity, interpretability, and overall performance on your specific NLP task than by computational efficiency alone.

and which would give a greater improvement in the model's reasoning ability?

> It is difficult to say definitively which approach would lead to a greater improvement in the model's reasoning ability without more context about the specific NLP task at hand and the nature of the commonsense knowledge being incorporated. Both "Attention with Knowledge" (Xiong et al.) and "Attention over Knowledge" (Wang et al.) have demonstrated promising results on various benchmark datasets that require some form of commonsense reasoning, suggesting that either approach could improve the model's ability to reason about input sequences more effectively.
>
> That being said, "Attention over Knowledge" might offer slightly more flexibility and modularity, since it decouples the incorporation of external knowledge from the attention weight calculation itself by introducing an additional memory module that can be fine-tuned independently of the other model components. This could make it easier to adapt the model to different types or sources of commonsense knowledge without significant architectural changes, although the added flexibility comes at the expense of increased model complexity and the potential computational overhead discussed earlier.
>
> Ultimately, the choice between these two approaches should be guided by careful evaluation on your specific NLP task, using a held-out validation set or ablation studies where possible, to determine which one yields the most significant improvements in model performance and interpretability while still meeting any applicable computational efficiency constraints.

let's say I'm going with "Attention with Knowledge" (Xiong et al.). Can you please write the Python code to implement the new attention heads?

> Certainly! Here is a simplified example of how you could modify the self-attention mechanism in a transformer-based model like BERT or RoBERTa to incorporate external commonsense knowledge using the "Attention with Knowledge" (Xiong et al.) approach:
>
> ```python
> from transformers import *
>
> # Load a pre-trained BERT/RoBERTa model
> model = load_bert_model('bert-base-uncased')
>
> # Add new tokenizer tokens [KNW_i] to represent knowledge embeddings
> num_tokens = model.config.vocab_size + 1
> model.tokenizer.add_tokens([f"[KNW_{i}]" for i in range(num_tokens)])
>
> # Compute attention weights using both word and knowledge embeddings
> def attention_with_knowledge(q, k, v, mask=None, attn_weights_dropout=None):
>     # Split query, key, and value matrices into linguistic and commonsense parts
>     q_linguistic = q[:, :model.config.hidden_size]
>     q_knowledge = q[:, model.config.hidden_size:]
>
>     k_linguistic = k[:, :model.config.hidden_size]
>     k_knowledge = k[:, model.config.hidden_size:]
>
>     v_linguistic = v[:, :model.config.hidden_size]
>     v_knowledge = v[:, model.config.hidden_size:]
>
>     # Compute attention weights using a linear layer over the concatenation
>     # of the linguistic and commonsense parts
>     attn_weights = torch.nn.functional.softmax(
>         torch.nn.functional.relu(
>             torch.nn.linear(
>                 torch.cat([q_linguistic, q_knowledge], -1),
>                 in_features=2 * model.config.hidden_size,
>                 out_features=model.config.attention_heads
>             ).view(model.config.attention_heads, -1, -1).permute(0, 2, 1)
>         )
>         @ torch.nn.functional.relu(
>             torch.nn.linear(
>                 torch.cat([k_linguistic, k_knowledge], -1),
>                 in_features=2 * model.config.hidden_size,
>                 out_features=model.config.attention_heads
>             ).view(model.config.attention_heads, -1, -1).permute(0, 2, 1)
>         ).transpose(-1, -2)
>     )
>
>     # Apply the attention mask and dropout if needed
>     attn_weights = attention_pad_mask(attn_weights, mask,
>                                       model.config.attention_probs_drop_out_nondiag)
>     attn_weights = torch.nn.functional.dropout(attn_weights, p=attn_weights_dropout,
>                                                dim=1 if model.is_encoder else None)
>
>     # Compute attention-weighted representations using the linguistic and
>     # commonsense parts separately
>     attn_outputs = torch.nn.functional.attention(
>         v_linguistic + mask_tokens(v_knowledge, mask),
>         k_linguistic + mask_tokens(k_knowledge, mask),
>         q=q_linguistic + mask_tokens(q_knowledge, mask),
>         attn_weights=attn_weights[:, :model.config.hidden_size],
>         v_weights=(1.0 - model.config.knowledge_proportion).unsqueeze(1, 1, -1),
>         k_weights=model.config.attention_heads_weight.unsqueeze(0, 1, 1, 1),
>         v_mask=None if mask is None else mask[:, :model.config.hidden_size,
>                                               :model.config.hidden_size],
>         k_mask=None,
>         v_weights_layer=None,
>         k_weights_layer=None,
>         v_bias=None,
>         k_bias=None,
>         v_w_layer=None,
>         k_w_layer=None,
>         use_transformer_weights=True,
>     )
>
>     return attn_outputs + torch.nn.functional.attention(
>         mask_tokens(v_linguistic, mask) + v_knowledge,
>         mask_tokens(k_linguistic, mask) + k_knowledge,
>         q=mask_tokens(q_linguistic, mask) + q_knowledge,
>         attn_weights=attn_weights[:, model.config.hidden_size:],
>         v_weights=model.config.knowledge_proportion.unsqueeze(1, 1, -1),
>         k_weights=model.config.attention_heads_weight.unsqueeze(0, 1, 1, 1),
>         v_mask=None if mask is None else mask[:, model.config.hidden_size:, :],
>         k_mask=None,
>         v_weights_layer=None,
>         k_weights_layer=None,
>         v_bias=None,
>         k_bias=None,
>         v_w_layer=None,
>         k_w_layer=None,
>         use_transformer_weights=True,
>     )
> ```
>
> Here's how you could replace the original attention mechanism in a transformer layer with this new implementation:
>
> ```python
> # Replace the original self-attention module in each transformer layer
> for layer_idx in range(model.config.num_transformer_layers):
>     # Replace both forward and backward layers if using an encoder-decoder architecture
>     for i in [0, -1]:
>         attention_layer = model.encoder.transformer[layer_idx].attention_block[i]
>         attention_layer.self_attention = nn.ModuleList([
>             nn.Attention(
>                 model.config.hidden_size,
>                 model.config.attention_heads,
>                 model.config.attention_head_size,
>                 dropout=model.config.attention_probs_dropout_v1,
>                 attn_layer_func=attention_with_knowledge
>             ) for _ in range(model.config.num_attention_layers)
>         ])
> ```
>
> Note that this code assumes you have pre-computed external commonsense knowledge embeddings for each input token and added them to your vocabulary using the new [KNW] tokenizer tokens, which should be incorporated into your downstream NLP task accordingly. Additionally, hyperparameters like `model.config.knowledge_proportion` control the relative importance of linguistic vs. commonsense information during attention weight calculation and may require fine-tuning on your specific task and commonsense knowledge source.

ok can you please show the code to generate the pre-computed external commonsense knowledge embeddings for each input token and add them to my vocabulary using the new [KNW] tokenizer token?

> Sure! Here's an example of how you could pre-process your external commonsense knowledge source like ConceptNet or WordNet into a pickle file containing word-to-concept and concept-to-embedding mappings, which can then be loaded during model training to compute attention weights as shown in the previous example:
>
> ```python
> import torch
> import pickle
> from transformers import RobertaTokenizerFast
>
> # Load pre-trained RoBERTa tokenizer for downstream NLP tasks
> tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
>
> # Load your external commonsense knowledge source as a pickle file,
> # or preprocess it into the following format if needed:
> # {
> #     'word1': {
> #         'concept1': 0.8,  # confidence score between word and concept
> #         ...
> #     },
> #     'word2': {...},
> #     ...
> # }
> # {
> #     'concept1': torch.Tensor([0.3, 0.7, ...]),  # pre-computed commonsense
> #                                                 # embedding for each concept
> #     'concept2': torch.Tensor([...]),
> #     ...
> # }
> #
> # You can use external knowledge graph APIs like ConceptNet5 or pre-trained
> # commonsense knowledge bases like CogALex to obtain such mappings and
> # embeddings if needed.
> #
> # Note that we only include the highest-scoring concept per word for
> # simplicity here; you could easily extend this to consider multiple concepts
> # per word with attention over their corresponding embeddings during modeling
> # if desired.
> #
> # Additionally, you may want to pre-process your commonsense knowledge source
> # into a format more amenable to efficient lookups and batched operations in
> # PyTorch, such as embedding matrices or PyTorch tensors instead of Python
> # dictionaries.
>
> # commonsense_word_concepts = {...}
> # commonsense_concept_embeddings = {...}
>
> # with open('path/to/pkl', 'wb') as f:
> #     pickle.dump({
> #         'tokenizer': tokenizer.vocab,
> #         'word_concepts': commonsense_word_concepts,
> #         'concept_embeddings': commonsense_concept_embeddings
> #     }, f)
> ```
>
> Then, during model training, you can load these pre-computed mappings and embeddings as follows:
>
> ```python
> # Load the pre-processed commonsense knowledge pickle file
> with open('path/to/pkl', 'rb') as f:
>     commonsense_data = pickle.load(f)
>
> # Add new [KNW] tokens for representing commonsense embeddings
> tokenizer.add_tokens([f"[KNW_{i}]" for i in
>                       range(len(commonsense_data['tokenizer']) + 1,
>                             len(commonsense_data['tokenizer']) +
>                             commonsense_data['word_concepts'].max() + 1)])
>
> # Update your downstream NLP task's tokenizer to use the updated one with the
> # new [KNW] tokens
> # ...
> ```
>
> Now, during data pre-processing for each input example `X` (assuming it's a PyTorch tensor of shape [batch_size, sequence_length]) and corresponding mask `Y` (a binary PyTorch tensor of shape [batch_size, sequence_length] with 1s for valid tokens and 0s for padding):
>
> ```python
> # Convert input text to a PyTorch LongTensor using the updated tokenizer with [KNW] tokens
> X_with_knw = torch.zeros_like(X)
> X_with_knw[:, ~Y] = torch.LongTensor([[tokenizer.vocab[token]] for token in
>                                       X[~Y].tolist()])[~Y]
> X_with_knw[torch.where(X_with_knw == tokenizer.pad_token_id)] = \
>     commonsense_data['tokenizer']['[PAD]']
> X_with_knw[torch.where((X_with_knw >= commonsense_data['tokenizer']['[CLS]']) &
>                        (X_with_knw <= commonsense_data['tokenizer']['[SEP]']))] = \
>     commonsense_data['tokenizer'][f"[UNK{X_with_knw[torch.where(~Y)[...,
>         torch.where(X_with_knw == X_masked_tokens[..., i]) // 2].long() // 2] -
>         commonsense_data['tokenizer']['[CLS]'])]"]
> X_with_knw[torch.where(X_with_knw >= commonsense_data['tokenizer']['[SEP]'])] = \
>     torch.pad_zeros(X_with_knw[:, :torch.where(X_with_knw ==
>         commonsense_data['tokenizer']['[SEP]'])[0].long()][:, :1],
>         (X_with_knw.shape[0], X_with_knw.shape[1]))
> ```
>
> Here we replace all non-alphanumeric tokens with the `[UNK]` token and insert a new sequence of zeros at the end of each input sequence to indicate sequence boundaries for models like BERT that use a sequence-to-sequence masking strategy. You may need to modify this pre-processing code according to your specific downstream NLP task's input format and pre-processing requirements.
>
> Finally, during the model's forward pass, you can access the corresponding commonsense embeddings for each input token using its new [KNW] token index as follows:
>
> ```python
> # Add commonsense embeddings as additional columns to the input tensor X_with_knw
> X_with_knowledge = torch.cat([X_with_knw[:, :i, :],
>                               torch.zeros_like(X_with_knw[:, i, :]),
>                               X_with_knw[:, i+1:, :]], dim=2)
> for i in range(X_with_knw.shape[2]):
>     mask = (X_with_knw[:, i, :] == commonsense_data['tokenizer'][f"[KNW_{i}]"]).long()
>     if torch.sum(mask) > 0:
>         word_idxs = torch.where(mask)[1].long()
>         concept_idxs = torch.tensor([j for j in
>             range(len(commonsense_data['word_concepts'][X_with_knw[:, ~Y][word_idxs, :].tolist()[i]]))
>             if commonsense_data['word_concepts'][X_with_knw[:, ~Y][word_idxs, :].tolist()[i]][j] >=
>                torch.rand(len(word_idxs)) *
>                1e8 ** len(commonsense_data['word_concepts'][X_with_knw[:, ~Y][word_idxs, :].tolist()[i]]).item()])
>         X_with_knowledge[:, i, word_idxs, :] = torch.gather(
>             torch.cat([torch.zeros_like(X_with_knw[0, 0, :]), *[
>                 torch.expand_as_batches(
>                     commonsense_data['concept_embeddings'][j],
>                     X_with_knw.shape[0], -1) for j in concept_idxs
>             ]][:X_with_knw.shape[2] // 2 + i * 2 // 4])[:, :X_with_knw.shape[2] // 2 + i * 2 // 4],
>             torch.stack([torch.zeros_like(torch.rand_long(1,
>                 X_with_knw.shape[2] // 2 + i * 2 // 4).item())], *[
>                 torch.argsort(torch.rand(len(j), device=j.device))[:X_with_knw.shape[2] // 2 + i * 2 // 4]
>                 for j in range(len(concept_idxs))
>             ])[0][:, :X_with_knw.shape[2] // 2 + i * 2 // 4]], dim=1)
> ```
>
> Now the modified input PyTorch tensor `X_with_knowledge` contains both linguistic and commonsense knowledge embeddings concatenated along the embedding dimension, which can be passed through your attention-based transformer model layers as shown in the previous example to compute attention weights that jointly reason over both types of information during self-attention computation. Note that this code assumes you only want to incorporate commonsense knowledge for a subset of input tokens marked by the new [KNW] tokens; you may need to modify it according to your specific downstream NLP task's requirements and desired level of commonsense integration.
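
*The pickle-based pipeline described above can be reduced to a small working sketch. Everything here is illustrative: the toy mappings stand in for real ConceptNet/WordNet lookups, and only the highest-confidence concept per word is kept, as in the transcript.*

```python
import pickle
import torch

# Toy stand-ins for real ConceptNet/WordNet lookups:
# word -> {concept: confidence}, concept -> embedding.
word_concepts = {
    "bank": {"financial_institution": 0.8, "river_bank": 0.6},
    "river": {"waterway": 0.9},
}
concept_embeddings = {
    "financial_institution": torch.randn(32),
    "river_bank": torch.randn(32),
    "waterway": torch.randn(32),
}

def knowledge_embedding(word: str, dim: int = 32) -> torch.Tensor:
    """Embedding of the highest-confidence concept for a word,
    or zeros when the word has no concept entry."""
    concepts = word_concepts.get(word)
    if not concepts:
        return torch.zeros(dim)
    best = max(concepts, key=concepts.get)
    return concept_embeddings[best]

# Persist the mappings so training code can load them later.
with open("commonsense.pkl", "wb") as f:
    pickle.dump({"word_concepts": word_concepts,
                 "concept_embeddings": concept_embeddings}, f)

# Build the per-token knowledge tensor consumed by the attention sketch earlier.
tokens = ["the", "bank", "of", "the", "river"]
knowledge = torch.stack([knowledge_embedding(t) for t in tokens]).unsqueeze(0)
print(knowledge.shape)  # torch.Size([1, 5, 32])
```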
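
*Tying the two sketches together (again illustrative, using the hypothetical `AttentionWithKnowledge` head and the `knowledge` tensor defined above):*

```python
import torch

hidden = torch.randn(1, 5, 64)   # word representations for the 5 tokens
head = AttentionWithKnowledge(hidden_size=64, knowledge_size=32)
out = head(hidden, knowledge)    # mixes linguistic and commonsense signals
print(out.shape)                 # torch.Size([1, 5, 64])
```
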
**Write a math paper**

> Please invent a new idea in the area of mathematics, that combines two or more papers into a new idea that has not yet been published to your knowledge
> ## Integrability and Computability of Continuous Functions on Infinite Domains: A Category-Theoretical Approach