Update README.md
README.md
CHANGED
@@ -458,9 +458,9 @@ ok can you please show the code to generate the pre-computed external commonsense
 > want to incorporate commonsense knowledge for a subset of input tokens marked by the
 > new [KNW] token; you may need to modify it according to your specific downstream NLP
 > task's requirements and desired level of commonsense integration.
-
-
-
+
+Transformer architecture (Vaswani et al.) does poorly in theory of mind. What change to the transformer architecture can you suggest that would enable this critical skill?
+
 > 1. **Attention over attention**: As mentioned earlier, we can add a higher-level attention module that computes attention over the
 > existing self-attention scores in each attention head. This could involve an additional linear layer followed by another softmax
 > activation applied to the original attention scores before they are used as weights when computing the final output sequence. By
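For concreteness, here is a minimal PyTorch sketch of one way the "attention over attention" idea in the hunk above could be wired into a self-attention block: the ordinary attention distribution is passed through an extra linear layer and a second softmax before it is used to mix the value vectors. The class and parameter names (`AttentionOverAttention`, `meta`, `max_len`) are illustrative assumptions, not something from this README.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOverAttention(nn.Module):
    """Hypothetical sketch: re-score each head's attention distribution
    with an extra linear layer + second softmax ("attention over attention")."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 512):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Assumed design choice: the higher-level attention is a learned
        # linear map over the key axis, shared across heads and queries.
        self.meta = nn.Linear(max_len, max_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))                           # (b, h, t, dh)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)                            # ordinary attention
        # Attention over attention: linear layer applied to each row of the
        # attention matrix (sliced to the current length), second softmax.
        attn = F.linear(attn, self.meta.weight[:t, :t], self.meta.bias[:t])
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

# Quick shape check:
# AttentionOverAttention(64, 4)(torch.randn(2, 16, 64)).shape -> (2, 16, 64)
```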
@@ -472,8 +472,7 @@ ok can you please show the code to generate the pre-computed external commonsense
 > within each self-attention layer that compares two alternative input representations instead of just one actual input sequence and its
 > own past context as in standard transformer models. Specifically, for every timestep t, rather than computing attention weights
 > a_t^j(k) = softmax(Q_t^j(k)^T * K_t^j(k)) where Q, K, V are query, key, value matrices extracted from the current input sequence X_t
-> and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T
-> * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
+> and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
 > encourage each head to pay more or less attention to certain input tokens depending on whether they are likely counterfactual
 > alternatives given the rest of the sentence context, which could potentially help improve its ability to reason about what might have
 > happened if different words had been used instead.
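The counterfactual scoring in the hunk above can be sketched the same way: for every ordered pair of timesteps k != l, the key K_k is interpolated toward K_l by alpha, each query scores all such interpolated keys, and the result is softmaxed. The README's formula leaves the normalization axis and the aggregation back to one weight per token open, so the joint softmax over pairs and the marginalization over l below are assumptions (as is the class name `CounterfactualAttention`); note the score tensor is O(T^3), so this is only practical for short sequences.

```python
import math
import torch
import torch.nn as nn

class CounterfactualAttention(nn.Module):
    """Hypothetical sketch of the counterfactual attention weights
    a_t^j(k, l) = softmax(Q_t^j(k)^T * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k))))."""

    def __init__(self, d_model: int, n_heads: int, alpha: float = 0.5):
        super().__init__()
        assert d_model % n_heads == 0 and alpha > 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.alpha = alpha
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))                           # (b, h, t, dh)
        # Counterfactual keys for every ordered pair (k, l):
        # K~[k, l] = K[k] + alpha * (K[l] - K[k])                -> (b, h, t, t, dh)
        k_cf = k.unsqueeze(3) + self.alpha * (k.unsqueeze(2) - k.unsqueeze(3))
        # Score every (query, k, l) triple: q_t^T K~[k, l]       -> (b, h, t, t, t)
        scores = torch.einsum('bhqd,bhkld->bhqkl', q, k_cf) / math.sqrt(self.d_head)
        # Assumption: drop degenerate pairs k == l, softmax jointly over pairs.
        eye = torch.eye(t, dtype=torch.bool, device=x.device)
        scores = scores.masked_fill(eye, float('-inf'))
        attn = scores.flatten(-2).softmax(dim=-1).view(b, self.n_heads, t, t, t)
        # Assumption: marginalize over the counterfactual index l so each
        # actual token k gets a single weight, then mix values as usual.
        weights = attn.sum(dim=-1)                               # (b, h, t, t)
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)
```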