Update README.md
README.md
CHANGED
@@ -458,9 +458,9 @@ ok can you please show the code to generate the pre-computed external commonsense
 > want to incorporate commonsense knowledge for a subset of input tokens marked by the
 > new [KNW] token; you may need to modify it according to your specific downstream NLP
 > task's requirements and desired level of commonsense integration.
-
-
-
+
+Transformer architecture (Vaswani et al.) does poorly in theory of mind. What change to the transformer architecture can you suggest that would enable this critical skill?
+
 > 1. **Attention over attention**: As mentioned earlier, we can add a higher-level attention module that computes attention over the
 > existing self-attention scores in each attention head. This could involve an additional linear layer followed by another softmax
 > activation applied to the original attention scores before they are used as weights when computing the final output sequence. By
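For concreteness, here is a minimal PyTorch sketch of one way the "attention over attention" idea in the hunk above could be wired into a self-attention block: the ordinary attention distribution is passed through an extra linear layer and a second softmax before it is used to mix the value vectors. The class and parameter names (`AttentionOverAttention`, `meta`, `max_len`) are illustrative assumptions, not something from this README.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOverAttention(nn.Module):
    """Hypothetical sketch: re-score each head's attention distribution
    with an extra linear layer + second softmax ("attention over attention")."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 512):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Assumed design choice: the higher-level attention is a learned
        # linear map over the key axis, shared across heads and queries.
        self.meta = nn.Linear(max_len, max_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))                           # (b, h, t, dh)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)                            # ordinary attention
        # Attention over attention: linear layer applied to each row of the
        # attention matrix (sliced to the current length), second softmax.
        attn = F.linear(attn, self.meta.weight[:t, :t], self.meta.bias[:t])
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

# Quick shape check:
# AttentionOverAttention(64, 4)(torch.randn(2, 16, 64)).shape -> (2, 16, 64)
```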
@@ -472,8 +472,7 @@ ok can you please show the code to generate the pre-computed external commonsense
 > within each self-attention layer that compares two alternative input representations instead of just one actual input sequence and its
 > own past context as in standard transformer models. Specifically, for every timestep t, rather than computing attention weights
 > a_t^j(k) = softmax(Q_t^j(k)^T * K_t^j(k)) where Q, K, V are query, key, value matrices extracted from the current input sequence X_t
-> and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T
-> * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
+> and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
 > encourage each head to pay more or less attention to certain input tokens depending on whether they are likely counterfactual
 > alternatives given the rest of the sentence context, which could potentially help improve its ability to reason about what might have
 > happened if different words had been used instead.
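The counterfactual scoring in the hunk above can be sketched the same way: for every ordered pair of timesteps k != l, the key K_k is interpolated toward K_l by alpha, each query scores all such interpolated keys, and the result is softmaxed. The README's formula leaves the normalization axis and the aggregation back to one weight per token open, so the joint softmax over pairs and the marginalization over l below are assumptions (as is the class name `CounterfactualAttention`); note the score tensor is O(T^3), so this is only practical for short sequences.

```python
import math
import torch
import torch.nn as nn

class CounterfactualAttention(nn.Module):
    """Hypothetical sketch of the counterfactual attention weights
    a_t^j(k, l) = softmax(Q_t^j(k)^T * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k))))."""

    def __init__(self, d_model: int, n_heads: int, alpha: float = 0.5):
        super().__init__()
        assert d_model % n_heads == 0 and alpha > 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.alpha = alpha
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))                           # (b, h, t, dh)
        # Counterfactual keys for every ordered pair (k, l):
        # K~[k, l] = K[k] + alpha * (K[l] - K[k])                -> (b, h, t, t, dh)
        k_cf = k.unsqueeze(3) + self.alpha * (k.unsqueeze(2) - k.unsqueeze(3))
        # Score every (query, k, l) triple: q_t^T K~[k, l]       -> (b, h, t, t, t)
        scores = torch.einsum('bhqd,bhkld->bhqkl', q, k_cf) / math.sqrt(self.d_head)
        # Assumption: drop degenerate pairs k == l, softmax jointly over pairs.
        eye = torch.eye(t, dtype=torch.bool, device=x.device)
        scores = scores.masked_fill(eye, float('-inf'))
        attn = scores.flatten(-2).softmax(dim=-1).view(b, self.n_heads, t, t, t)
        # Assumption: marginalize over the counterfactual index l so each
        # actual token k gets a single weight, then mix values as usual.
        weights = attn.sum(dim=-1)                               # (b, h, t, t)
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)
```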