This is a post for the EMNLP 2020 paper
Information-Theoretic Probing with Minimum Description Length.
Probing classifiers often fail to adequately reflect differences in representations
and can show different results depending on hyperparameters.
As an alternative to the standard probes,
- we propose information-theoretic probing which measures
minimum description length (MDL) of labels given representations;
- we show that MDL characterizes both probe quality and
the amount of effort needed to achieve it;
- we explain how to easily measure MDL on top of standard probe-training pipelines;
- we show that results of MDL probes are more informative and stable than those of standard probes.
March 2020
How to understand if a model captures a linguistic property?
How would you understand whether a model (e.g., ELMo, BERT) has learned to encode some linguistic property?
Usually this question is narrowed down to the following.
We have: data $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_1, \dots, x_n$ are representations from a model and $y_1, \dots, y_n$ are labels for some linguistic task ($y_i \in \{1, \dots, K\}$ for $K$ classes).
We want: to measure to what extent representations $x_{1:n} = (x_1, \dots, x_n)$ capture labels $y_{1:n} = (y_1, \dots, y_n)$.
Standard Probing: train a classifier, use its accuracy
The most popular approach is to use probing classifiers (aka probes, probing tasks, diagnostic classifiers).
These classifiers are trained to predict a linguistic property from frozen representations,
and accuracy of the classifier is used to measure how well these representations encode the property.
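For concreteness, here is a minimal sketch of such a pipeline, assuming the frozen representations and labels have already been extracted. The logistic-regression probe and all names below are illustrative stand-ins, not the exact setup of any particular paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for real data: frozen representations X and linguistic labels y.
X = np.random.randn(1000, 768)           # e.g., one vector per token
y = np.random.randint(0, 17, size=1000)  # e.g., 17 POS tags

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)  # the probing classifier
probe.fit(X_train, y_train)                # trained on frozen representations
print("probe accuracy:", probe.score(X_test, y_test))  # the number usually reported
```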
Looks reasonable and simple, right? Yes, but...
Wait, how about some sanity checks?
While such probes are extremely popular, several 'sanity checks' showed that
differences in accuracies fail to reflect differences in representations.
These sanity checks are different kinds of random baselines. For example,
Zhang & Bowman (2018) compared
probe scores for trained models and randomly initialized ones. They were able to see reasonable differences in the scores
only when reducing the amount of classifier training data.
Later Hewitt & Liang (2019) proposed to
put probe accuracy in context with the probe's ability to memorize word types. They constructed so-called 'control tasks',
which are defined manually for each linguistic task. A control task defines random output for a word type,
regardless of context. For example, for POS tags, each word is assigned
a random label based on the empirical distribution of tags.
Control tasks were used to perform an exhaustive search over hyperparameters for training probes and
to find a setting with the largest difference in scores between the linguistic task and its control task.
This was achieved by reducing the size of the probing model (e.g., using fewer neurons).
Houston, we have a problem.
We see that accuracy of a probe does not always reflect what we want it to reflect, at least not without explicit
search for a setting where it does. But then another problem arises: if we tune a probe to behave in a certain way for one task
(e.g., we tune its setting to have low accuracy for a certain control task), how do we know if it's still reasonable
for other tasks? Well, we do not (if you do, please tell!).
All in all, we want a probing method which does not require a human to tell it what to say.
Information-Theoretic Viewpoint
Let us come back to the initial task: we have data $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with representations $x_{1:n}$ and labels $y_{1:n}$, and want to measure to what extent $x_{1:n}$ encode $y_{1:n}$.
Let us look at this task from a new perspective, different from the current most popular approach.
Our idea can be summarized in two sentences.
Regularity in representations with respect to labels can be exploited to compress the data.
Better compression ↔ stronger regularity ↔ representations better encode labels.
Basically, this is it. All that is left is to make this precise, theoretically justified, and illustrated experimentally.
For this, I will need some help from Alice and Bob.
Adventures of Alice, Bob and the Data
Let us imagine that Alice has all pairs $(x_i, y_i)$ from $\mathcal{D}$, Bob has just the representations $x_{1:n}$, and that Alice wants to communicate the labels $y_{1:n}$ to Bob. Transmitting data is a lot of work, and surely Alice has much better things to do: let's help Alice to compress the data!
Formally, the task is to encode labels $y_{1:n}$ knowing representations $x_{1:n}$ in an optimal way, i.e., with the minimal codelength (in bits) needed to transmit $y_{1:n}$.
The resulting minimal number of bits is minimum description length (MDL)
of labels given representations, and we will refer to probes measuring MDL as
description length probes or MDL probes.
Changing a probe's goal converts it into an MDL probe
It turns out that to evaluate MDL we don't need to change much in standard probe-training pipelines.
This is done via a simple trick: we know that data can be compressed using some probabilistic model $p(y|x)$, and we just set this model to be a probing classifier. This is how we turn a standard probe into an MDL probe: by changing its goal from predicting labels to transmitting data.
Note that since Bob does not know the precise model Alice is using, we will also have to
transmit the model, either explicitly or implicitly. Thus, the overall codelength is a combination
of quality-of-fit of the model (compressed data length), together with the cost of transmitting the model itself.
Intuitively, codelength characterizes not only final quality of a probe
(data codelength), but also
the amount of effort needed to achieve this quality.
The Data Part
TL;DR: cross-entropy is the data codelength
Let's assume for now that Alice and Bob have agreed in advance on a model $p(y|x)$ and both know the inputs $x_{1:n}$; we just need to transmit the data using this known model.
Then there exists a code to transmit the labels $y_{1:n}$ losslessly with codelength

$$L_{p}(y_{1:n}|x_{1:n}) = -\sum_{i=1}^{n} \log_2 p(y_i|x_i). \qquad (1)$$

This is the Shannon-Huffman code, which gives the optimal bound on the codelength if the data are independent and come from the conditional probability distribution $p(y|x)$.
Learning is Compression
Note that (1) is exactly the categorical cross-entropy loss evaluated on the model $p$ (up to the use of base-2 logarithms). This shows that the task of compressing the labels $y_{1:n}$ is equivalent to learning a model $p(y|x)$ of the data: the cross-entropy of a learned model, measured in bits, is the codelength needed to transmit the data.
Compression is compared against Uniform Code
Compression is usually compared against the uniform encoding, which does not require any learning from the data. For $K$ classes, it assumes $p(y|x) = \frac{1}{K}$, and yields the codelength $L_{unif}(y_{1:n}|x_{1:n}) = n \log_2 K$ bits.
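As a small illustration (not part of the original pipeline), both codelengths can be computed directly from a probe's predicted probabilities; the array names below are assumptions.

```python
import numpy as np

def data_codelength_bits(probs, y):
    """Eq. (1): Shannon-Huffman codelength = cross-entropy of the probe, in bits.

    probs[i, k] is the probe's probability p(y_i = k | x_i); y holds the gold labels.
    """
    return -np.log2(probs[np.arange(len(y)), y]).sum()

def uniform_codelength_bits(n_examples, n_classes):
    """Baseline: transmit every label with the uniform code, n * log2(K) bits."""
    return n_examples * np.log2(n_classes)

# Sanity check: a dummy "uniform" probe gives exactly the uniform codelength.
probs = np.full((1000, 17), 1 / 17)
y = np.random.randint(0, 17, size=1000)
print(data_codelength_bits(probs, y), uniform_codelength_bits(1000, 17))
```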
The Amount of Effort Part: Strength of the Regularity in the Data
Based on real events! These are t-SNE projections of representations of different occurrences of the token "is" from the last layer of an MT encoder (strong regularity) and the last LM layer (weak regularity). The colors show CCG tags (only the 4 most frequent tags).
Curious why MT and LM behave this way? Look at our
Evolution of Representations post!
Now we come to the other component of the total codelength, which
characterizes how hard it is to achieve final quality of a trained probe. Intuitively,
if representations have some clear structure with respect to labels, this structure can be
understood with
less effort; for example,
- the "rule" explaining this structure (i.e., a probing model) can be
simple, and/or
- the amount of data needed to reveal this structure can be small.
This is exactly how our vague (so far) notion of the "amount of effort" is translated into codelength.
We explain this more formally when describing the two methods for evaluating MDL we use:
variational coding and online coding;
they differ in the way they incorporate the model cost: directly or indirectly.
Variational Code: Explicit Transmission of the Model
Variational code is an instance of two-part codes,
where a model (probe weights) is transmitted explicitly and then used to encode the data.
It has been known for quite a while that
this joint cost (model and data codes) is exactly the loss function of variational learning methods.
The Idea: Informal version
What is happening in this part:
- model weights are transmitted explicitly;
- each weight is treated as a random variable;
- weights with higher variance can be transmitted with lower precision (i.e., fewer bits);
- by choosing sparsity-inducing priors we will obtain induced (sparse) probe architectures.
Codelength: Formal Derivations
May the force be with you!
Here we assume that Alice and Bob have agreed on a model class $\mathcal{H} = \{p_{\theta}(y|x) \mid \theta \in \Theta\}$, and for any model $p_{\theta}$, Alice first transmits its parameters $\theta$ and then encodes the data while relying on the model. The description length decomposes accordingly:

$$L(y_{1:n}|x_{1:n}) = L(\theta) + L_{\theta}(y_{1:n}|x_{1:n}).$$
To compute the description length of the parameters, we can further assume that Alice and Bob have agreed on a prior distribution over the parameters, $\alpha(\theta)$. Now, we can rewrite the total description length as

$$L(y_{1:n}|x_{1:n}) = -\log_2 \alpha(\theta) + m \log_2 \frac{1}{\varepsilon} - \sum_{i=1}^{n} \log_2 p_{\theta}(y_i|x_i), \qquad (2)$$

where $m$ is the number of parameters and $\varepsilon$ is a prearranged precision for each parameter.
With deep learning models (in our case, probing classifiers), such straightforward codes for the parameters are highly inefficient. Instead, in the variational approach, weights are treated as random variables, and the description length is given by the expectation

$$L^{var}(y_{1:n}|x_{1:n}) = \mathbb{E}_{\theta \sim \beta}\Big[-\sum_{i=1}^{n} \log_2 p_{\theta}(y_i|x_i) - \log_2 \alpha(\theta) + \log_2 \beta(\theta)\Big] = KL(\beta \,\|\, \alpha) - \mathbb{E}_{\theta \sim \beta}\sum_{i=1}^{n} \log_2 p_{\theta}(y_i|x_i), \qquad (3)$$

where $\beta(\theta)$ is a distribution encoding uncertainty about the parameter values. The distribution $\beta(\theta)$ is chosen by minimizing the codelength given in Eq. (3).
If you are interested in the formal justification for this description length, it relies on the bits-back argument and can be found, for example, here.
However, the underlying intuition is straightforward: parameters we are uncertain about can be transmitted at a lower cost, as the uncertainty can be used to determine the required precision. The entropy term in equation (3), $H(\beta) = -\mathbb{E}_{\theta \sim \beta} \log_2 \beta(\theta)$, quantifies this discount.
The negated codelength $-L^{var}(y_{1:n}|x_{1:n})$ is known as the evidence lower bound (ELBO) and is used as the objective in variational inference. The distribution $\beta(\theta)$ approximates the intractable posterior distribution $p(\theta|x_{1:n}, y_{1:n})$.
Consequently, any variational method can in principle be used to estimate the codelength.
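To make the formula concrete, here is a rough Monte-Carlo sketch of Eq. (3) for a linear softmax probe with a fully factorized Gaussian $\beta$ and a standard Gaussian prior $\alpha$. This is a simplification for illustration only (the paper uses the sparsity-inducing priors described below), and all names are hypothetical.

```python
import numpy as np

def gaussian_kl_bits(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights, converted to bits."""
    kl_nats = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
    return kl_nats / np.log(2)

def variational_codelength_bits(X, y, mu, sigma, n_samples=16, seed=0):
    """Monte-Carlo estimate of Eq. (3): E_beta[data codelength] + KL(beta || alpha).

    mu, sigma: mean and std of beta over the weights of a linear softmax probe,
    both of shape (n_features, n_classes).
    """
    rng = np.random.default_rng(seed)
    data_bits = []
    for _ in range(n_samples):
        W = rng.normal(mu, sigma)                     # sample theta ~ beta
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerically stable log-softmax
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        data_bits.append(-log_p[np.arange(len(y)), y].sum() / np.log(2))
    return np.mean(data_bits) + gaussian_kl_bits(mu, sigma)
```

In practice $\beta$ is trained by minimizing exactly this quantity; the sketch only evaluates it for a given (mu, sigma).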
We Use: Bayesian Compression
You can choose priors for weights to get something interesting. For example, if you choose
sparsity-inducing priors on the parameters, you can get
variational dropout.
What is more interesting, you can do this in such a way that whole neurons (not just individual weights) are pruned: this is the
Bayesian network compression method we use.
This allows us to assess the probe complexity
both using its description length
and by inspecting the discovered architecture.
Intuition
If regularity in the data is strong,
it can be explained with a "simple rule" (i.e., a small probing model).
A small probing model is easy to communicate, i.e., only a few parameters require high precision.
As we will see in the experiments,
similar probe accuracies often come at very different model costs, i.e., different total codelengths.
Note that the variational code gives us probe architecture (and model size) as a
byproduct of training, and does not require
manual search over probe hyperparameters.
Online Code: Implicit Transmission of the Model
Online code provides a way to transmit data without directly transmitting the model. Intuitively, it
measures the ability to learn from different amounts of data.
In this setting, the data is transmitted in
a sequence of portions; at each step, the data transmitted so far is used to understand the regularity
in this data and compress the following portion.
Codelength: Formal Derivations
Formally, Alice and Bob agree on the form of the model $p_{\theta}(y|x)$ with learnable parameters $\theta$, its initial random seeds, and its learning algorithm. They also choose timesteps $0 = t_0 < t_1 < \dots < t_S = n$ and encode the data by blocks. Alice starts by communicating $y_{1:t_1}$ with a uniform code, then both Alice and Bob learn a model $p_{\theta_1}(y|x)$ that predicts $y$ from $x$ using the data $\{(x_i, y_i)\}_{i=1}^{t_1}$, and Alice uses that model to communicate the next data block $y_{t_1+1:t_2}$. Then both Alice and Bob learn a model $p_{\theta_2}(y|x)$ from the larger block $\{(x_i, y_i)\}_{i=1}^{t_2}$ and use it to encode $y_{t_2+1:t_3}$. This process continues until the entire dataset is transmitted. The resulting online codelength is

$$L^{online}(y_{1:n}|x_{1:n}) = t_1 \log_2 K - \sum_{k=1}^{S-1} \log_2 p_{\theta_k}(y_{t_k+1:t_{k+1}}|x_{t_k+1:t_{k+1}}). \qquad (4)$$
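Below is a hedged sketch of this procedure with a logistic-regression probe standing in for the real probing model; the timestep choice and all names are illustrative, and it assumes every class occurs in the first block.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength_bits(X, y, timesteps, n_classes):
    """Sketch of Eq. (4): encode y block by block, always using a probe trained
    on the data transmitted so far. timesteps = [t_1, ..., t_S] with t_S = len(y)."""
    total_bits = timesteps[0] * np.log2(n_classes)         # first block: uniform code
    for t_prev, t_next in zip(timesteps[:-1], timesteps[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:t_prev], y[:t_prev])
        log_p = probe.predict_log_proba(X[t_prev:t_next])   # natural log
        cols = np.searchsorted(probe.classes_, y[t_prev:t_next])
        total_bits += -log_p[np.arange(t_next - t_prev), cols].sum() / np.log(2)
    return total_bits
```

The model cost of Eq. (5) below can then be obtained by subtracting the cross-entropy (in bits) of a probe trained on all of the data from this total.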
Data and Model components of Online Code
While the online code does not incorporate model cost explicitly, we can still evaluate model cost by interpreting the difference between the cross-entropy of the model trained on all data, $L^{data}(y_{1:n}|x_{1:n}) = -\sum_{i=1}^{n} \log_2 p_{\theta_S}(y_i|x_i)$, and the online codelength as the cost of the model:

$$L^{model}(y_{1:n}|x_{1:n}) = L^{online}(y_{1:n}|x_{1:n}) - L^{data}(y_{1:n}|x_{1:n}). \qquad (5)$$

Indeed, $L^{data}$ is the codelength of the data if one knows the model parameters, and $L^{online}$ is the codelength if one does not know them.
Intuition
Online code
measures the ability to learn from different amounts of data, and
it is related to the area under the learning curve, which plots quality
as a function of the number of training examples.
Intuitively, if the regularity in the data is strong, it can be
revealed
using a small subset of the data, i.e., early in the transmission process, and can be exploited to efficiently
transmit the rest of the dataset.
Description Length and Control Tasks
Compression methods agree in results:
- for the linguistic task, the best layer is the first;
- for the control task, codes become
larger as we move up from the embedding layer
(this is expected since the control task measures the ability to remember word types);
- codelengths for the control task are substantially larger than for
the linguistic task.
Layer 0: MDL is correct, accuracy is not.
What is even more surprising, codelength identifies the control task even when accuracy indicates the opposite:
for layer 0, accuracy for the control task is higher, but
the codelength is twice as large as for the linguistic task.
This is because codelength characterizes how hard it is to achieve this accuracy:
for the control task, accuracy is higher, but the cost of achieving this score is very high.
We will illustrate this a bit later when looking at the model component of the code.
Embedding vs contextual: drastic difference.
For the linguistic task,
note that codelength for the embedding layer is approximately twice as large as that for the first layer.
Later, when looking at the random model baseline, we will see the same trends for several other tasks
and will show that even contextualized representations obtained with a randomly initialized model are a lot
better than the embedding layer alone.
Model: small for linguistic, large for control.
Let us now look at the data and model components of code separately
(as shown in formulas (3) and (5) for variational and online codes, respectively).
The trends for the two compression methods are similar:
for the control
task, model size is several times larger than for the linguistic task. This is something that
probe accuracy alone is not able to reflect:
while representations have structure with respect to the linguistic task labels,
the same representations do not have structure with
respect to random labels.
For variational code, it means that the linguistic task labels can be "explained"
with a small model, but random labels can be learned only using a larger model.
For online code, it shows that the linguistic task labels can be learned
from a small amount of data, but random labels can be learned (memorized) well only using a large dataset.
For online code, we can also recall that it is related to the area under the learning curve, which plots quality as a function of the number of training examples.
This figure shows
learning curves corresponding to online code. We can clearly see the difference between behavior of the linguistic and control tasks.
Architecture: sparse for linguistic, dense for control.
Let us now recall that Bayesian compression
gives us not only variational codelength, but also the induced probe architecture.
Probes learned for linguistic tasks have small architectures with only 33-75 neurons at the second
and third layers of the probe. In contrast, models learned for control tasks are quite dense.
This relates to the control tasks
paper which proposed to use random label baselines.
The authors considered several predefined
probe architectures and picked one of them based on a manually defined criterion. In contrast, variational
code gives probe architecture as a byproduct of training and
does not require human guidance.
MDL results are stable, accuracy is not
Hyperparameters: change results for accuracy but not for MDL.
While so far we have discussed in detail the results for the default settings, we also tried
10 different settings and saw that accuracy varies greatly across them.
For example, for layer 0 there are settings with contradictory
results: accuracy can be higher either for the linguistic or for the control task depending on the settings
(look at the figure). In striking contrast to accuracy,
MDL results are stable across settings, thus
MDL does not require search for probe settings.
Random Seed: affects accuracy but not MDL.
The figure shows results for 5 random seeds (linguistic task).
We see that using accuracy can lead to different
rankings of layers depending on a random seed, making it hard to draw conclusions about their relative qualities.
For example, accuracies for layers 1 and 2 are 97.48 and 97.31 for seed 1, but 97.38 and 97.48
for seed 0.
On the contrary, the MDL results are stable and the scores given to different layers are well separated.
Description Length and Random Models
In the paper, we also consider another type of random baselines: randomly initialized models.
Following previous work, we experiment with ELMo and compare it with a version of the ELMo model in which all weights above the lexical layer (layer 0) are replaced with random orthonormal matrices, while the embedding layer itself is retained from the trained ELMo.
We experiment with 7 edge probing tasks (5 of them are listed above).
Here we will only briefly mention our findings: if you are interested in more details, look at the paper.
We find that:
- layer 0 vs contextual: even random contextual representations are better
As we already saw in the previous part, codelength shows a drastic difference between the
embedding layer (layer 0) and contextualized representations: codelengths differ by roughly a factor of two for most tasks.
Both compression methods show that even for the randomly initialized model,
contextualized representations are better than lexical representations.
- codelengths for the randomly initialized model are larger than for the trained one
This is more prominent when comparing compression against context-agnostic representations rather than just looking at the bare scores. For all tasks, compression bounds for the randomly initialized model are closer to those of the context-agnostic layer 0 than to those of representations from the trained model.
This shows that the gain from using context for the randomly initialized model is at most half of the gain for the trained model.
- randomly initialized layers do not evolve
For all tasks, MDL (and the data and model components of the code taken separately) is identical across layers of the randomly initialized model.
For the trained model, this is not the case: layer 2 is worse than layer 1 for all tasks.
This is one more illustration of the general process explained
in our Evolution of Representations post:
the way representations evolve between layers is
defined by the training objective. For the randomly initialized model,
since no training objective has been optimized, no evolution happens.
Conclusions
We propose information-theoretic probing which measures minimum description length (MDL)
of labels given representations. We show that MDL naturally characterizes not only probe quality,
but also the amount of effort needed to achieve it (or, intuitively,
strength of the regularity in representations with respect to the labels);
this is done in a theoretically justified way without manual search for settings.
We explain how to easily measure MDL on top of standard probe-training pipelines. We show that results of
MDL probing are more informative and stable than those of standard probes.
Want to know more?