ehartford committed
Commit 67d3388
1 Parent(s): bf85573

Update README.md

Files changed (1)
  1. README.md +35 -0
README.md CHANGED
@@ -458,6 +458,41 @@ ok can you please show the code to generate the pre-computed external commonsens
  > want to incorporate commonsense knowledge for a subset of input tokens marked by the
  > new [KNW] token; you may need to modify it according to your specific downstream NLP
  > task's requirements and desired level of commonsense integration.
+ >
+ > > Transformer architecture (Vaswani et al.) does poorly in theory of mind. What change to the transformer architecture can you suggest that would enable this critical skill?
+ >
+ > 1. **Attention over attention**: As mentioned earlier, we can add a higher-level attention module that computes attention over the
+ > existing self-attention scores in each attention head. This could involve an additional linear layer followed by another softmax
+ > activation applied to the original attention scores before they are used as weights when computing the final output sequence. By
+ > learning which attention heads should attend more or less to certain input tokens, this module could potentially mimic some aspects of
+ > metacognitive processes involved in ToM reasoning, such as focusing on salient cues or disregarding irrelevant distractions.
+ >
+ > 2. **Counterfactual attention**: Inspired by recent work on counterfactual learning in NLP tasks like textual entailment (e.g., Meng
+ > et al., "A counterfactual approach to answer factual questions," ACL 2017), we could also try introducing a new type of attention head
+ > within each self-attention layer that compares two alternative input representations instead of just one actual input sequence and its
+ > own past context as in standard transformer models. Specifically, for every timestep t, rather than computing attention weights
+ > a_t^j(k) = softmax(Q_t^j(k)^T * K_t^j(k)), where Q, K, V are query, key, value matrices extracted from the current input sequence X_t
+ > and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T
+ > * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
+ > encourage each head to pay more or less attention to certain input tokens depending on whether they are likely counterfactual
+ > alternatives given the rest of the sentence context, which could potentially help improve its ability to reason about what might have
+ > happened if different words had been used instead.
+ >
+ > 3. **Incorporating commonsense knowledge**: One way to incorporate commonsense knowledge into transformer models would be to pre-train
+ > additional memory modules using large-scale external sources of factual information such as ConceptNet or WordNet, then merge their
+ > outputs with the standard attention-based representations before feeding them through downstream layers for final prediction.
+ > Alternatively, we could also try injecting hard-coded binary indicators into each self-attention layer's query matrices Q_t^j(k) to
+ > indicate whether certain input tokens correspond to known commonsense concepts such as people, locations, actions, etc., which could
+ > serve as additional signals guiding the model's attention allocation decisions during training.
+ >
+ > 4. **Incorporating game-theoretic reasoning**: While incorporating full-fledged game-theoretic elements like Markov decision processes
+ > or stochastic games into transformer architectures may still be computationally prohibitive due to their high dimensionality and
+ > non-Markovian nature, we could potentially take baby steps in this direction by introducing simple binary masks over some pairs
+ > of attention weights a_t^j(k, l) in item 2 above based on whether they correspond to tokens that are likely adversaries or allies
+ > according to some predefined heuristics (e.g., two consecutive nouns within a small window size are considered potential collaborators,
+ > while two verbs separated by more than N words are seen as competitors). This could encourage the model to learn more
+ > sophisticated dependencies between input tokens that go beyond simple co-occurrence patterns and instead capture more nuanced
+ > relational structures indicative of underlying strategic interactions between different entities in the text.
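
To make item 1 above concrete, here is a minimal PyTorch sketch of an attention-over-attention module. The class name, the choice of mixing across heads with a single linear layer, and the tensor layout are illustrative assumptions, not part of the quoted answer; the sketch only demonstrates the "learned second softmax over existing attention scores" idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOverAttention(nn.Module):
    """Hypothetical module: re-weights per-head attention logits with a
    second learned projection followed by another softmax (item 1)."""

    def __init__(self, num_heads: int):
        super().__init__()
        # Mixes information across heads, so each head can learn which
        # heads' scores to amplify or suppress.
        self.head_mixer = nn.Linear(num_heads, num_heads)

    def forward(self, attn_logits: torch.Tensor) -> torch.Tensor:
        # attn_logits: (batch, heads, q_len, k_len) pre-softmax scores.
        mixed = self.head_mixer(attn_logits.permute(0, 2, 3, 1))
        mixed = mixed.permute(0, 3, 1, 2)  # back to (batch, heads, q, k)
        # Second softmax over the key dimension, as the quoted text suggests.
        return F.softmax(mixed, dim=-1)

# Usage: weights = AttentionOverAttention(8)(scores); out = weights @ v
```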
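The pairwise formula in item 2 can be prototyped directly. This is a sketch under stated assumptions: alpha is a fixed hyperparameter, the k == l diagonal (which reduces to standard attention) is left in for simplicity rather than excluded as the formula requires, and a (batch, heads, seq, dim) layout is assumed. Note that materializing all blended key pairs is quadratic in sequence length, one reason this remains speculative.

```python
import torch
import torch.nn.functional as F

def counterfactual_attention(q: torch.Tensor, k: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    """Pairwise counterfactual attention weights from item 2.

    q, k: (batch, heads, seq, dim). Returns (batch, heads, seq, seq, seq):
    for each query position t, attention to key position k when that key
    is nudged toward the counterfactual alternative at position l.
    """
    # Blend every key k toward every alternative key l:
    # k_cf[..., k, l, :] = K_k + alpha * (K_l - K_k)
    k_cf = (1 - alpha) * k.unsqueeze(-2) + alpha * k.unsqueeze(-3)
    # Score each query t against each blended key pair (k, l).
    scores = torch.einsum("bhtd,bhkld->bhtkl", q, k_cf)
    # Softmax over the actual-key axis k, separately per counterfactual l.
    return F.softmax(scores / q.shape[-1] ** 0.5, dim=-2)
```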
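Item 3's second suggestion, binary indicators injected into the query matrices, might look like the following sketch. Using a learned embedding per concept type (rather than a raw 0/1 flag), the zero initialization, and the names here are assumptions for illustration; the tagger that produces `concept_ids` (e.g., a ConceptNet lookup) is external and hypothetical.

```python
import torch
import torch.nn as nn

class KnowledgeTaggedQueries(nn.Module):
    """Hypothetical layer: adds a learned offset to the query vectors of
    tokens flagged as commonsense concepts (people, locations, actions, ...)."""

    def __init__(self, d_model: int, num_concept_types: int):
        super().__init__()
        # One learned query offset per concept type; index 0 = "no concept".
        self.concept_bias = nn.Embedding(num_concept_types + 1, d_model)
        nn.init.zeros_(self.concept_bias.weight)  # start as a no-op

    def forward(self, queries: torch.Tensor,
                concept_ids: torch.Tensor) -> torch.Tensor:
        # queries: (batch, seq, d_model); concept_ids: (batch, seq) integer
        # tags produced by an external knowledge tagger.
        return queries + self.concept_bias(concept_ids)
```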
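Finally, the heuristic ally/adversary masks of item 4 could be prototyped as below. The POS tags, the window size, and the N threshold are placeholder values taken from the quoted heuristics, not tested choices; the resulting mask would be added, suitably scaled, to the attention logits from item 2 before the softmax.

```python
import torch

def interaction_mask(pos_tags: list[str], window: int = 3,
                     n: int = 8) -> torch.Tensor:
    """Toy heuristic from item 4. Returns a (seq, seq) tensor: +1 where two
    nouns fall within `window` tokens of each other (potential collaborators),
    -1 where two verbs sit more than `n` tokens apart (potential competitors),
    and 0 otherwise."""
    seq = len(pos_tags)
    mask = torch.zeros(seq, seq)
    for i in range(seq):
        for j in range(seq):
            if i == j:
                continue
            dist = abs(i - j)
            if pos_tags[i] == "NOUN" and pos_tags[j] == "NOUN" and dist <= window:
                mask[i, j] = 1.0
            elif pos_tags[i] == "VERB" and pos_tags[j] == "VERB" and dist > n:
                mask[i, j] = -1.0
    return mask
```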
 
 
  **Write a math paper**