11.5. Multi-Head Attention — Dive into Deep Learning 1.0.3 documentation
d2l.ai

Top Highlights

  • combine knowledge from different behaviors of the same attention mechanism
  • heads can be computed in parallel
  • In practice, given the same set of queries, keys, and values, we may want our model to combine knowledge from different behaviors of the same attention mechanism, such as capturing dependencies of various ranges (e.g., shorter-range vs. longer-range) within a sequence. Thus, it may be beneficial to allow our attention mechanism to jointly use different representation subspaces of queries, keys, and values.
  • independently learned linear projections. Then these projected queries, keys, and values are fed into attention pooling in parallel.
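
The highlighted passages describe the core design: queries, keys, and values are each transformed by several independently learned linear projections, the projected copies are fed into attention pooling in parallel (one "head" per projection), and the heads' outputs are concatenated and passed through a final learned projection. The following is a minimal PyTorch sketch of that idea, not the book's own implementation; the class name MultiHeadAttention and the parameters num_hiddens and num_heads are chosen here for illustration.

```python
import math
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention sketch: h independent linear projections
    of queries, keys, and values, scaled dot-product attention per head
    computed in parallel, then a final learned output projection."""

    def __init__(self, num_hiddens, num_heads):
        super().__init__()
        assert num_hiddens % num_heads == 0
        self.num_heads = num_heads
        self.W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_v = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=False)

    def split_heads(self, X):
        # (batch, seq, num_hiddens) -> (batch, heads, seq, num_hiddens / heads)
        batch, seq, _ = X.shape
        return X.reshape(batch, seq, self.num_heads, -1).permute(0, 2, 1, 3)

    def forward(self, queries, keys, values):
        q = self.split_heads(self.W_q(queries))
        k = self.split_heads(self.W_k(keys))
        v = self.split_heads(self.W_v(values))
        d = q.shape[-1]
        # Scaled dot-product attention, evaluated for all heads in parallel.
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)
        attn = torch.softmax(scores, dim=-1)
        out = attn @ v  # (batch, heads, seq, head_dim)
        # Concatenate the heads and apply the final output projection.
        batch, _, seq, _ = out.shape
        out = out.permute(0, 2, 1, 3).reshape(batch, seq, -1)
        return self.W_o(out)

# Usage: a batch of 2 sequences of length 4, model width 8, 2 heads.
attention = MultiHeadAttention(num_hiddens=8, num_heads=2)
X = torch.randn(2, 4, 8)
print(attention(X, X, X).shape)  # torch.Size([2, 4, 8])
```

Because each head sees a different learned projection of the same inputs, the heads can specialize, for example in shorter-range versus longer-range dependencies, while still being computed in parallel.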
