Information-Theoretic Probing with MDL

This is a post for the paper Information-Theoretic Probing with Minimum Description Length.

Probing classifiers often fail to adequately reflect differences in representations and can show different results depending on hyperparameters.
As an alternative to the standard probes,
  • we propose information-theoretic probing which measures minimum description length (MDL) of labels given representations;
  • we show that MDL characterizes both probe quality and the amount of effort needed to achieve it;
  • we explain how to easily measure MDL on top of standard probe-training pipelines (a sketch of the online variant follows this list);
  • we show that results of MDL probes are more informative and stable than those of standard probes.
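To give a sense of how the online (prequential) variant can sit on top of an existing probe-training pipeline, here is a minimal sketch. The `train_probe` and `probe.prob` calls are placeholders for whatever probe trainer and predictive distribution you already have, and the fractions are illustrative rather than the exact schedule from the paper.

```python
import math

def online_codelength(train_probe, data, num_classes,
                      fractions=(0.001, 0.002, 0.004, 0.008, 0.016, 0.032,
                                 0.0625, 0.125, 0.25, 0.5, 1.0)):
    """Online (prequential) codelength of the labels, in bits.

    train_probe(subset) -> probe trained on that subset (placeholder)
    probe.prob(x, y)    -> probability the probe assigns to label y (placeholder)
    """
    n = len(data)
    boundaries = [max(1, int(f * n)) for f in fractions]
    # The first tiny block is transmitted with a uniform code over the classes.
    codelength = boundaries[0] * math.log2(num_classes)
    for start, end in zip(boundaries, boundaries[1:]):
        probe = train_probe(data[:start])   # probe trained on the data seen so far
        block = data[start:end]             # next block of labels to transmit
        codelength += sum(-math.log2(probe.prob(x, y)) for x, y in block)
    return codelength
```

Compression is then the uniform codelength, n·log2(K) bits, divided by this online codelength: higher compression means the representations make the labels easier to learn.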
March 2020


Evolution of Representations in the Transformer

This is a post for the EMNLP 2019 paper The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives.

We look at the evolution of representations of individual tokens in Transformers trained with different objectives (MT, LM, and BERT-style MLM) from the Information Bottleneck perspective and show that:
  • LMs gradually forget the past when forming predictions about the future;
  • for MLMs, the evolution proceeds in two stages: context encoding and token reconstruction;
  • MT representations get refined with context, but comparatively less processing happens.
September 2019


When a Good Translation is Wrong in Context

This is a post for the ACL 2019 paper When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion.

From this post, you will learn:
  • which phenomena cause context-agnostic translations to be inconsistent with each other
  • how we create test sets addressing the most frequent phenomena
  • about a novel set-up for context-aware NMT with a large amount of sentence-level data and much less document-level data
  • about a new model for this set-up, the Context-Aware Decoder (CADec): a two-pass MT model which first produces a draft translation of the current sentence, then corrects it using context (a sketch of this two-pass flow follows this list).
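To make the two-pass idea concrete, here is a rough sketch of the decoding flow; `base_model.translate` and `cadec.correct` are hypothetical interfaces standing in for the sentence-level model and CADec.

```python
def translate_with_context(base_model, cadec, source, context_src, context_tgt):
    # First pass: a context-agnostic draft translation of the current sentence,
    # produced by the model trained on the large sentence-level corpus.
    draft = base_model.translate(source)
    # Second pass: CADec corrects the draft using the previous source sentences
    # and their translations as context.
    return cadec.correct(source=source, draft=draft,
                         context_src=context_src, context_tgt=context_tgt)
```

This split is what lets most of the training signal come from sentence-level data, with the much smaller document-level corpus needed only for the correction pass.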
July 2019


The Story of Heads

This is a post for the ACL 2019 paper Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.

From this post, you will learn:
  • how we evaluate the importance of attention heads in the Transformer (one of the importance scores is sketched after this list)
  • which functions the most important encoder heads perform
  • how we prune the vast majority of attention heads in the Transformer without seriously affecting quality
  • which types of model attention are most sensitive to the number of attention heads, and at which layers
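As a taste of the head-importance side, one of the scores discussed in the post is a head's "confidence": how concentrated its attention is, on average. A minimal sketch, assuming attention probabilities collected from a PyTorch model and ignoring padding for brevity:

```python
import torch

def head_confidence(attn):
    """attn: [batch, num_heads, tgt_len, src_len] attention probabilities.

    Returns, per head, the maximum attention weight averaged over positions
    and examples; confidently focused heads score close to 1.
    """
    max_per_position = attn.max(dim=-1).values   # [batch, num_heads, tgt_len]
    return max_per_position.mean(dim=(0, 2))     # [num_heads]
```

The pruning itself works differently: as described in the post, heads are multiplied by stochastic gates trained with an L0-style penalty, so unimportant heads are switched off by the model rather than removed by hand.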
June 2019