Update README.md

README.md

> > ok can you please show the code to generate the pre-computed external commonsense …
>
> … want to incorporate commonsense knowledge for a subset of input tokens marked by the
> new [KNW] token; you may need to modify it according to your specific downstream NLP
> task's requirements and desired level of commonsense integration.
>
> > Transformer architecture (Vaswani et al.) does poorly in theory of mind. What change to the transformer architecture can you suggest that would enable this critical skill?
>
> 1. **Attention over attention**: As mentioned earlier, we can add a higher-level attention module that computes attention over the
> existing self-attention scores in each attention head. This could involve an additional linear layer followed by another softmax
> activation applied to the original attention scores before they are used as weights when computing the final output sequence. By
> learning which attention heads should attend more or less to certain input tokens, this module could potentially mimic some aspects of
> metacognitive processes involved in ToM reasoning, such as focusing on salient cues or disregarding irrelevant distractions (see the
> first sketch after this list).
>
> 2. **Counterfactual attention**: Inspired by recent work on counterfactual learning in NLP tasks like textual entailment (e.g., Meng
> et al., "A counterfactual approach to answer factual questions," ACL 2017), we could also try introducing a new type of attention head
> within each self-attention layer that compares two alternative input representations instead of just one actual input sequence and its
> own past context as in standard transformer models. Specifically, for every timestep t, rather than computing attention weights
> a_t^j(k) = softmax(Q_t^j(k)^T * K_t^j(k)), where Q, K, V are query, key, value matrices extracted from the current input sequence X_t
> and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T
> * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
> encourage each head to pay more or less attention to certain input tokens depending on whether they are likely counterfactual
> alternatives given the rest of the sentence context, which could potentially help improve its ability to reason about what might have
> happened if different words had been used instead (see the second sketch after this list).
>
> 3. **Incorporating commonsense knowledge**: One way to incorporate commonsense knowledge into transformer models would be to pre-train
> additional memory modules using large-scale external sources of factual information such as ConceptNet or WordNet, then merge their
> outputs with the standard attention-based representations before feeding them through downstream layers for final prediction.
> Alternatively, we could also try injecting hard-coded binary indicators into each self-attention layer's query matrices Q_t^j(k) to
> indicate whether certain input tokens correspond to known commonsense concepts such as people, locations, actions, etc., which could
> serve as additional signals guiding the model's attention allocation decisions during training (see the third sketch after this list).
>
> 4. **Incorporating game-theoretic reasoning**: While incorporating full-fledged game-theoretic elements like Markov decision processes
> or stochastic games into transformer architectures may still be computationally prohibitive due to their high dimensionality and
> non-Markovian nature, we could potentially take baby steps in this direction by introducing simple binary masks over some pairs
> of attention weights a_t^j(k, l) in item 2 above, based on whether they correspond to tokens that are likely adversaries or allies
> according to some predefined heuristics (e.g., two consecutive nouns within a small window are considered potential collaborators,
> while two verbs separated by more than N words are seen as competitors). This could encourage the model to learn more
> sophisticated dependencies between input tokens that go beyond simple co-occurrence patterns and instead capture more nuanced
> relational structures indicative of underlying strategic interactions between different entities in the text (see the fourth sketch
> after this list).
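The first proposal maps naturally onto a small module. A minimal sketch, assuming PyTorch and reading "attention over attention" as a learned softmax gate over the per-head score maps; the class name and shapes are illustrative, not taken from the reply:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOverAttention(nn.Module):
    """Re-weights per-head attention scores with a second, learned attention."""

    def __init__(self, num_heads: int):
        super().__init__()
        # Linear layer that mixes the heads' scores for each (query, key) pair.
        self.head_mixer = nn.Linear(num_heads, num_heads)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, T, T) raw attention logits from a self-attention layer.
        mixed = self.head_mixer(scores.permute(0, 2, 3, 1))  # mix across heads
        gate = F.softmax(mixed, dim=-1).permute(0, 3, 1, 2)  # per-head gate in [0, 1]
        # Gate the original scores, then renormalise over the keys as usual.
        return F.softmax(scores * gate, dim=-1)

# Example: weights = AttentionOverAttention(8)(torch.randn(2, 8, 16, 16))
```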
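The pairwise formula in the second proposal can be written out directly. A single-head sketch, again assuming PyTorch; the 1/sqrt(d) scaling and the masking of the degenerate k == l pairs are my additions to keep the softmax well-behaved:

```python
import torch
import torch.nn.functional as F

def counterfactual_attention(q, k, v, alpha=0.5):
    # q, k, v: (T, d) query/key/value matrices for one head and one sequence.
    T, d = k.shape
    # K(k) + alpha * (K(l) - K(k)) for every ordered pair (k, l): shape (T, T, d).
    k_cf = k[:, None, :] + alpha * (k[None, :, :] - k[:, None, :])
    # Logits a_t(k, l) = Q_t^T * k_cf(k, l), scaled as in standard attention.
    logits = torch.einsum("td,kld->tkl", q, k_cf) / d ** 0.5
    # Exclude k == l before normalising over all remaining (k, l) pairs.
    mask = torch.eye(T, dtype=torch.bool)
    logits = logits.masked_fill(mask[None, :, :], float("-inf"))
    attn = F.softmax(logits.reshape(T, -1), dim=-1).reshape(T, T, T)
    # Aggregate values by key index k, summing out the counterfactual index l.
    return torch.einsum("tkl,kd->td", attn, v)

# Example: out = counterfactual_attention(*torch.randn(3, 6, 16))
```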
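For the third proposal, the indicator-injection variant is straightforward to sketch. Assuming PyTorch; the concept-type ids (say, person/location/action from a ConceptNet lookup) are a hypothetical input, and a learned per-type offset stands in for a literal hard-coded bit:

```python
import torch
import torch.nn as nn

class ConceptAwareQueries(nn.Module):
    """Shifts the queries of tokens flagged as known commonsense concepts."""

    def __init__(self, d_model: int, num_concept_types: int = 4):
        super().__init__()
        # One learned query offset per concept type; index 0 means "no concept"
        # and maps to a zero vector, leaving unflagged tokens untouched.
        self.concept_bias = nn.Embedding(num_concept_types + 1, d_model, padding_idx=0)

    def forward(self, queries: torch.Tensor, concept_type: torch.Tensor) -> torch.Tensor:
        # queries: (batch, T, d_model); concept_type: (batch, T) integer ids.
        # The shifted queries bias the subsequent Q K^T scores for concept tokens.
        return queries + self.concept_bias(concept_type)

# Example: q = ConceptAwareQueries(64)(torch.randn(2, 10, 64),
#                                      torch.randint(0, 5, (2, 10)))
```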
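The fourth proposal's heuristics can be prototyped as a flag matrix over token pairs. The POS tags, window size, and distance threshold N below are placeholders, and whether to hard-mask unflagged pairs or merely bias the flagged ones is left open; the last comment shows one way to plug it into the (t, k, l) logits from the previous sketch:

```python
import torch

def ally_adversary_flags(pos_tags, window=2, n=5):
    # pos_tags: list of coarse POS tags per token, e.g. ["NOUN", "VERB", ...].
    # Returns a (T, T) bool matrix; True marks a pair flagged as strategic.
    T = len(pos_tags)
    flags = torch.zeros(T, T, dtype=torch.bool)
    for k in range(T):
        for l in range(T):
            if k == l:
                continue
            dist = abs(k - l)
            # Two nearby nouns: potential collaborators.
            if pos_tags[k] == "NOUN" and pos_tags[l] == "NOUN" and dist <= window:
                flags[k, l] = True
            # Two distant verbs: potential competitors.
            if pos_tags[k] == "VERB" and pos_tags[l] == "VERB" and dist > n:
                flags[k, l] = True
    return flags

# One soft variant: boost flagged pairs instead of masking the rest, e.g.
# logits = logits + 2.0 * ally_adversary_flags(tags)[None].float()
```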
**Write a math paper**