The Story of Heads

From this post, you will learn:
- how we evaluate the importance of attention heads in the Transformer
- which functions the most important encoder heads perform
- how we can prune the vast majority of attention heads in the Transformer without seriously affecting quality
- which types of model attention (encoder self-attention, decoder self-attention or decoder-encoder attention) are most sensitive to the number of attention heads, and on which layers


Head Importance
Previous work analyzing how representations are formed by the Transformer’s multi-head attention mechanism has not taken into account the varying importance of different heads. This obscures the roles played by individual heads which, as we will show, influence the generated translations to differing extents.
The figure below shows the confidence of all heads in the model, where a head’s “confidence” is the average of its maximum attention weight over a set of evaluation sentences. We see that a few heads are extremely confident: on average, they assign more than 80% of their attention mass to a single token.
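As a concrete illustration, here is a minimal sketch of how such a confidence score could be computed; the function and variable names are ours, and we assume a head's per-sentence attention weights are available as numpy arrays.

```python
import numpy as np

def head_confidence(attention_maps):
    """Estimate a head's 'confidence': the average of its maximum
    attention weight, taken over all positions in a set of sentences.

    attention_maps: list of arrays of shape (query_len, key_len),
    one per sentence, holding the head's attention weights.
    """
    max_weights = []
    for attn in attention_maps:
        # for each query position, how much mass goes to the single
        # most-attended token
        max_weights.extend(attn.max(axis=-1))
    return float(np.mean(max_weights))

# toy usage: a very confident head puts almost all mass on one token per row
attn = np.array([[0.90, 0.05, 0.05],
                 [0.02, 0.96, 0.02],
                 [0.01, 0.01, 0.98]])
print(head_confidence([attn]))  # ~0.95
```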

Attention weight confidence is only an indirect indicator of a head’s importance, so we also measure the relative contribution of each head with layer-wise relevance propagation (LRP). LRP was originally designed to compute the contributions of single pixels to the predictions of image classifiers. It back-propagates relevance recursively from the output layer to the input layer, relying on a conservation principle: intuitively, the total contribution of neurons at each layer is constant.
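To make the conservation principle concrete, here is a minimal numpy sketch of the standard LRP epsilon rule for a single linear layer. This is an illustration of the general idea only, not the exact propagation rules we use for the Transformer, and the function name is ours.

```python
import numpy as np

def lrp_linear(x, w, relevance_out, eps=1e-6):
    """One LRP step through a linear layer z = x @ w (epsilon rule):
    the relevance of each output is redistributed onto the inputs in
    proportion to each input's contribution to that output."""
    z = x @ w                                   # forward activations
    z = np.where(z >= 0, z + eps, z - eps)      # stabilizer, avoids division by zero
    s = relevance_out / z                       # relevance per unit of activation
    return x * (w @ s)                          # relevance of each input neuron

x = np.array([1.0, 2.0, -1.0])
w = np.array([[0.5, -0.2],
              [0.1,  0.3],
              [-0.4, 0.6]])
r_out = np.array([1.0, 2.0])
r_in = lrp_linear(x, w, r_out)
# conservation principle: total relevance is (approximately) preserved
print(r_in.sum(), r_out.sum())   # both ~3.0
```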
We adapt LRP to the Transformer model to calculate relevance, which measures the degree of association between two arbitrary neurons in the network. The way we use LRP here differs from how attribution methods are used in computer vision in two important ways:
- we evaluate the importance of a neuron or a part of the network, not of an input element (a pixel or a token);
- we evaluate importance averaged over a dataset, not for a single prediction.

Head Functions
We now turn to investigating whether heads play consistent and interpretable roles within the model. We examined attention matrices, paying particular attention to heads ranked highly by LRP, and identified three functions that heads might be playing: positional (attending to an adjacent token), syntactic (attending to a token in a specific syntactic relation) and attention to the least frequent tokens in the sentence.
Positional heads

[Figures: example attention maps for models trained on WMT EN-DE, WMT EN-FR, WMT EN-RU and OpenSubtitles EN-RU.]
We refer to a head as “positional” if at least 90% of the time its maximum attention weight is assigned to a specific relative position (in practice either -1 or +1, i.e. attention to adjacent tokens).
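A minimal sketch of how this test could be implemented, assuming the head's self-attention matrices are available per sentence; the function name and arguments are ours.

```python
import numpy as np

def is_positional(attention_maps, offset, threshold=0.9):
    """Check whether a head is 'positional': its maximum attention weight
    points to the token at a fixed relative position `offset`
    (e.g. -1 or +1) in at least `threshold` of cases."""
    hits, total = 0, 0
    for attn in attention_maps:                 # attn: (seq_len, seq_len) self-attention weights
        argmax = attn.argmax(axis=-1)           # most-attended token for each position
        positions = np.arange(len(argmax))
        targets = positions + offset
        valid = (targets >= 0) & (targets < attn.shape[1])   # skip positions with no such neighbour
        hits += int(np.sum(argmax[valid] == targets[valid]))
        total += int(valid.sum())
    return total > 0 and hits / total >= threshold
```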

Syntactic heads

[Figures: example attention maps of heads tracking syntactic dependencies: subject → verb, verb → subject, object → verb and verb → object.]
We hypothesize that, when used to perform translation, the Transformer’s encoder may be responsible for disambiguating the syntactic structure of the source sentence. We therefore wish to know whether a head attends to tokens corresponding to any of the major syntactic relations in a sentence. In our analysis, we looked at nominal subject (nsubj), direct object (dobj), adjectival modifier (amod) and adverbial modifier (advmod) relations. We calculate for each head how often it assigns its maximum attention weight (excluding EOS) to a token with which it is in one of the aforementioned dependency relations. We do so by comparing its attention weights to a dependency structure predicted by CoreNLP on a large number of held-out sentences.
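Here is a minimal sketch of this measurement for a single head, assuming per-sentence attention matrices and dependency pairs from a parser such as CoreNLP are already extracted; the function name, argument names and data layout are illustrative assumptions.

```python
import numpy as np

def dependency_hit_rate(attention_maps, dependency_pairs, eos_index=-1):
    """For one head: how often its maximum attention weight (excluding EOS)
    goes from a token to the token it is linked to by a given dependency
    relation (e.g. from a verb to its nominal subject).

    attention_maps:    list of (seq_len, seq_len) arrays, one per sentence.
    dependency_pairs:  list of {source_position: target_position} dicts,
                       one per sentence, extracted from a parser."""
    hits, total = 0, 0
    for attn, pairs in zip(attention_maps, dependency_pairs):
        attn = np.array(attn, copy=True)
        attn[:, eos_index] = 0.0                # exclude attention to EOS
        argmax = attn.argmax(axis=-1)           # most-attended token for each position
        for src, dst in pairs.items():
            hits += int(argmax[src] == dst)
            total += 1
    return hits / max(total, 1)
```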
For more details, read the paper.

[Figures: results for models trained on WMT EN-DE, WMT EN-FR, WMT EN-RU and OpenSubtitles EN-RU.]
Rare tokens
For all models, we find a head pointing to the least frequent tokens in a sentence. For models trained on OpenSubtitles, among sentences where the least frequent token in a sentence is not in the top-500 most frequent tokens, this head points to the rarest token in 66% of cases, and to one of the two least frequent tokens in 83% of cases. For models trained on WMT, this head points to one of the two least frequent tokens in more than 50% of such cases.
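One way such a check could be implemented is sketched below. Whether to aggregate attention over positions or look at each position separately is a design choice we make here for simplicity; the function and argument names are ours, and we assume per-sentence attention matrices and corpus token frequencies are available.

```python
import numpy as np

def rare_token_hit_rate(attention_maps, token_frequencies, top_k=2):
    """How often a head's most-attended token (attention summed over all
    positions) is among the `top_k` least frequent tokens of the sentence.

    attention_maps:     list of (seq_len, seq_len) arrays, one per sentence.
    token_frequencies:  list of (seq_len,) arrays with corpus frequencies
                        of the sentence's tokens."""
    hits = 0
    for attn, freqs in zip(attention_maps, token_frequencies):
        most_attended = int(attn.sum(axis=0).argmax())   # token receiving most attention overall
        rarest = set(np.argsort(freqs)[:top_k])          # indices of the k least frequent tokens
        hits += int(most_attended in rarest)
    return hits / max(len(attention_maps), 1)
```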

Pruning Attention Heads
We have identified certain functions of the most relevant heads at each layer and shown that, to a large extent, they are interpretable. What about the remaining heads? Are they useless for translation quality, or do they play equally vital but less easily defined roles? We introduce a method for pruning attention heads to try to answer these questions.

Method
In the standard Transformer, the results of different attention heads in a layer are concatenated:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}_i(\mathrm{head}_i)\, W^O.$$

We modify this by multiplying the representation computed by each head by a scalar gate $g_i$:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}_i(g_i \cdot \mathrm{head}_i)\, W^O.$$

Ideally, we would like to apply $L_0$ regularization to the gates so that unimportant heads are switched off completely, i.e. $g_i = 0$.
Unfortunately, the $L_0$ norm is non-differentiable and so cannot be directly incorporated as a regularization term in the objective function. Instead, we use a stochastic relaxation: each gate $g_i$ is a random variable drawn independently from a head-specific Hard Concrete distribution with parameters $\phi_i$. These distributions have non-zero probability mass at 0 and 1, so gates can be exactly closed or exactly open.

We use the sum of the probabilities of heads being non-zero ($L_C$) as a relaxation of the $L_0$ norm:

$$L_C(\phi) = \sum_{i=1}^{h} \big(1 - P(g_i = 0 \mid \phi_i)\big),$$

and add it to the translation cross-entropy loss with a coefficient $\lambda$:

$$L(\theta, \phi) = L_{\mathrm{xent}}(\theta, \phi) + \lambda L_C(\phi).$$

When applying the regularizer, we start from the converged model trained without the $L_C$ penalty, add the gates, and fine-tune the whole model with the regularized objective. By varying the coefficient $\lambda$, we obtain models with different numbers of retained heads.
We observe that the model converges to solutions where gates are either almost completely closed or completely open. This means that at test time we can treat the model as a standard Transformer and use only a subset of heads.
Expected L0 regularization with stochastic gates has also been used in another ACL paper to produce sparse and interpretable classifiers.
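To make the gating mechanism more tangible, here is a minimal PyTorch-style sketch of per-head Hard Concrete gates and the corresponding expected-$L_0$ penalty. The class name, hyper-parameter values and usage lines are illustrative assumptions, not the exact implementation used in our experiments.

```python
import math
import torch
import torch.nn as nn

class HeadGates(nn.Module):
    """One scalar Hard Concrete gate per attention head: a stochastic
    relaxation that lets gates be exactly 0 or 1 with non-zero probability."""

    def __init__(self, n_heads, beta=0.33, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))  # per-head gate parameters
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # reparameterized sample from the Hard Concrete distribution
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # stretch to (gamma, zeta) and rectify, so gates can be exactly 0 or 1
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # sum over heads of P(gate != 0); this plays the role of L_C above
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

# usage sketch: heads has shape (batch, n_heads, seq_len, head_dim)
gates = HeadGates(n_heads=8)
g = gates()                                   # (n_heads,) gate values in [0, 1]
# gated_heads = heads * g.view(1, -1, 1, 1)   # multiply each head by its gate
# loss = cross_entropy + lambda_coef * gates.expected_l0()
```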
BLEU score
Encoder heads
First, let's look at how translation quality is affected by pruning attention heads. The figure below shows the BLEU score as a function of the number of retained encoder heads (EN-RU). The regularization is applied by fine-tuning the trained model.
Surprisingly, for OpenSubtitles, we lose only 0.25 BLEU when we prune all but 4 heads out of 48. For the more complex WMT task, 10 heads in the encoder are sufficient to stay within 0.15 BLEU of the full model.
[Table: BLEU scores of the model with all heads, the pruned models, and models with the same head configuration trained from scratch.]
In the rightmost column we provide BLEU scores for models trained with exactly the same number and configuration of heads in each layer as the corresponding pruned models but starting from a random initialization of parameters. Here the degradation in translation quality is more significant than for pruned models with the same number of heads. This agrees with the observations made in model compression papers: sparse architectures learned through pruning cannot be trained from scratch to reach the same test set performance as a model trained with joint sparsification and optimization. See for example Zhu and Gupta, 2017 or Gale et al., 2019.
Pruning for analysis
Functions of retained encoder heads
[Figure: functions of the encoder heads retained at different stages of pruning, color-coded by function.]
Note that the model with 17 heads retains heads with all the functions that we identified previously, even though 2⁄3 of the heads have been pruned. This indicates that these functions are indeed the most important. Furthermore, when we have fewer heads in the model, some functions “drift” to other heads: for example, we see positional heads starting to track syntactic dependencies; hence some heads are assigned more than one color at certain stages.
Importance of attention types
[Figures: number of retained heads of each attention type (encoder self-attention, decoder self-attention, decoder-encoder attention) by layer, for different amounts of pruning.]
Conclusions
We evaluate the contribution made by individual attention heads to Transformer model performance on translation. We use layer-wise relevance propagation to show that the relative contribution of heads varies: only a small subset of heads appears to be important for the translation task. Important heads have one or more interpretable functions in the model, including attending to adjacent words and tracking specific syntactic relations. To determine whether the remaining, less-interpretable heads are crucial to the model's performance, we introduce a new approach to pruning attention heads. We observe that specialized heads are the last to be pruned, which confirms their importance directly. Moreover, the vast majority of heads, especially the encoder self-attention heads, can be removed without seriously affecting performance. In future work, we would like to investigate how our pruning method compares to alternative methods of model compression in NMT.

P.S. There is also a recent study by Paul Michel, Omer Levy and Graham Neubig confirming our observation that the importance of individual heads varies greatly. Have a look at their arXiv paper.