The Story of Heads

This post accompanies the ACL 2019 paper Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.

From this post, you will learn:
  • how we evaluate the importance of attention heads in the Transformer
  • which functions the most important encoder heads perform
  • how we prune the vast majority of attention heads in the Transformer without seriously affecting quality
  • which types of model attention are most sensitive to the number of attention heads, and at which layers
June 2019