Language Modeling

What does it mean to "model something"?

Imagine that we have, for example, a model of a physical world. What do you expect it to be able to do? Well, if it is a good model, it probably can predict what happens next given some description of "context", i.e., the current state of things. Something of the following kind:

We have a tower of that many toy cubes of that size made from this material. We push the bottom cube from this point in that direction with this force. What will happen?

A good model would simulate the behavior of the real world: it would "understand" which events are in better agreement with the world, i.e., which of them are more likely.

What about language?

For language, the intuition is exactly the same! What is different, is the notion of an event. In language, an event is a linguistic unit (text, sentence, token, symbol), and a goal of a language model is to estimate the probabilities of these events.

Language Models (LMs) estimate the probability of different linguistic units: symbols, tokens, token sequences.

But how can this be useful?

We deal with LMs every day!

We see language models in action every day - look at some examples. Usually models in large commercial services are a bit more complicated than the ones we will discuss today, but the idea is the same: if we can estimate probabilities of words/sentences/etc, we can use them in various, sometimes even unexpected, ways.

What is easy for humans, can be very hard for machines

Lena: The morphosyntax example is from the slides by Alex Lascarides and Sharon Goldwater,
Foundations of Natural Language Processing course at the University of Edinburgh.

We, humans, already have some feeling of "probability" when it comes to natural language. For example, when we talk, usually we understand each other quite well (at least, what's being said). We disambiguate between different options which sound similar without even realizing it!

But how a machine is supposed to understand this? A machine needs a language model, which estimates the probabilities of sentences. If a language model is good, it will assign a larger probability to a correct option.

General Framework

Text Probability

Our goal is to estimate probabilities of text fragments; for simplicity, let's assume we deal with sentences. We want these probabilities to reflect knowledge of a language. Specifically, we want sentences that are "more likely" to appear in a language to have a larger probability according to our language model.

How likely is a sentence to appear in a language?

Let's check if simple probability theory can help. Imagine we have a basket with balls of different colors. The probability to pick a ball of a certain color (let's say green) from this basket is the frequency with which green balls occur in the basket.

What if we do the same for sentences? Since we can not possibly have a text corpus that contains all sentences in a natural language, a lot of sentences will not occur in the corpus. While among these sentences some are clearly more likely than the others, all of them will receive zero probability, i.e., will look equally bad for the model. This means, the method is not good and we have to do something mode clever.

Sentence Probability: Decompose Into Smaller Parts

We can not reliably estimate sentence probabilities is we treat them as atomic units. Instead, let's decompose the probability of a sentence into probabilities of smaller parts.

For example, let's take the sentence I saw a cat on a mat and imagine that we read it word by word. At each step, we estimate the probability of all seen so far tokens. We don't want any computations not to be in vain (no way!), so we won't throw away previous probability once a new word appears: we will update it to account for a new word. Look at the illustration.

Formally, let \(y_1, y_2, \dots, y_n\) be tokens in a sentence, and \(P(y_1, y_2, \dots, y_n)\) the probability to see all these tokens (in this order). Using the product rule of probability (aka the chain rule), we get \[P(y_1, y_2, \dots, y_n)=P(y_1)\cdot P(y_2|y_1)\cdot P(y_3|y_1, y_2)\cdot\dots\cdot P(y_n|y_1, \dots, y_{n-1})= \prod \limits_{t=1}^n P(y_t|y_{\mbox{<}t}).\] We decomposed the probability of a text into conditional probabilities of each token given the previous context.

We got: Left-to-Right Language Models

What we got is the standard left-to-right language modeling framework. This framework is quite general: N-gram and neural language models differ only in a way they compute the conditional probabilities \(P(y_t|y_1, \dots, y_{t-1})\).

Lena: Later in the course we will see other language models: for example, Masked Language Models or models that decompose the joint probability differently (e.g., arbitrary order of tokens and not fixed as the left-to-right order).

We will come to specifics of N-gram and neural models a bit later. Now, we discuss how to generate a text using a language model.


Generate a Text Using a Language Model

Once we have a language model, we can use it to generate text. We do it one token at a time: predict the probability distribution of the next token given previous context, and sample from this distribution.

Alternatively, you can apply greedy decoding: at each step, pick the token with the highest probability. However, this usually does not work well: a bit later I will show you examples from real models.

Despite its simplicity, such sampling is quite common in generation. In section Generation Strategies we will look at different modifications of this approach to get samples with certain qualities; e.g., more or less "surprising".


N-gram Language Models

Let us recall that the general left-to-right language modeling framework decomposes probability of a token sequence, into conditional probabilities of each token given previous context: \[P(y_1, y_2, \dots, y_n)=P(y_1)\cdot P(y_2|y_1)\cdot P(y_3|y_1, y_2)\cdot\dots\cdot P(y_n|y_1, \dots, y_{n-1})= \prod \limits_{t=1}^n P(y_t|y_{\mbox{<}t}).\] The only thing which is not clear so far is how to compute these probabilities.

We need to: define how to compute the conditional probabilities \(P(y_t|y_1, \dots, y_{t-1})\).

Similar to count-based methods we saw earlier in the Word Embeddings lecture, n-gram language models also count global statistics from a text corpus.

How: estimate based on global statistics from a text corpora, i.e., count.

That is, the way n-gram LMs estimate probabilities \(P(y_t|y_{\mbox{<}t}) = P(y_t|y_1, \dots, y_{t-1})\) is almost the same as the way we earlier estimated the probability to pick a green ball from a basket. This innocent "almost" contains the key components of n-gram LMs: Markov property and smoothings.

Markov Property (Independence Assumption)

The straightforward way to compute \(P(y_t|y_1, \dots, y_{t-1})\) is \[P(y_t|y_1, \dots, y_{t-1}) = \frac{N(y_1, \dots, y_{t-1}, y_t)}{N(y_1, \dots, y_{t-1})},\] where \(N(y_1, \dots, y_k)\) is the number of times a sequence of tokens \((y_1, \dots, y_k)\) occur in the text.

For the same reasons we discussed before, this won't work well: many of the fragments \((y_1, \dots, y_{t})\) do not occur in a corpus and, therefore, will zero out the probability of the sentence. To overcome this problem, we make an independence assumption (assume that the Markov property holds):

The probability of a word only depends on a fixed number of previous words.

Formally, n-gram models assume that \[P(y_t|y_1, \dots, y_{t-1}) = P(y_t|y_{t-n+1}, \dots, y_{t-1}).\] For example,
  • n=3 (trigram model): \(P(y_t|y_1, \dots, y_{t-1}) = P(y_t|y_{t-2}, y_{t-1})\),
  • n=2 (bigram model): \(P(y_t|y_1, \dots, y_{t-1}) = P(y_t|y_{t-1})\),
  • n=1 (unigram model): \(P(y_t|y_1, \dots, y_{t-1}) = P(y_t)\).

Look how the standard decomposition changes for n-gram models.


Smoothing: Redistribute Probability Mass

Let's imagine we deal with a 4-gram language model and consider the following example:

What if either denominator or numerator is zero? Both these cases are not really good for the model. To avoid these problems (and some other), it is common to use smoothings. Smoothings redistribute probability mass: they "steal" some mass from seen events and give to the unseen ones.

Lena: at this point, usually I'm tempted to imagine a brave Robin Hood, stealing from the rich and giving to the poor - just like smoothings do with the probability mass. Unfortunately, I have to stop myself, because, let's be honest, smoothings are not so clever - it would be offensive to Robin.

Avoid zeros in the denominator

If the phrase cat on a never appeared in our corpus, we will not be able to compute the probability. Therefore, we need a "plan B" in case this happens.


Backoff (aka Stupid Backoff)

One of the solutions is to use less context for context we don't know much about. This is called backoff:

  • if you can, use trigram;
  • if not, use bigram;
  • if even bigram does not help, use unigram.
This is rather stupid (hence the title), but works fairly well.


More clever: Linear interpolation

A more clever solution is to mix all probabilities: unigram, bigram, trigram, etc. For this, we need scalar positive weights \(\lambda_0, \lambda_1, \dots, \lambda_{n-1}\) such that \(\sum\limits_{i}\lambda_i=1\). Then the updated probability is:

The coefficients \(\lambda_i\) can be picked by cross-validation on the development set. You will be able to do this once you know how to evaluate language models: see the Evaluation section.

Avoid zeros in the numerator

If the phrase cat on a mat never appeared in our corpus, the probability of the whole sentence will be zero - but this does not mean that the sentence is impossible! To avoid this, we also need a "plan B".


Laplace smoothing (aka add-one smoothing)

The simplest way to avoid this is just to pretend we saw all n-grams at least one time: just add 1 to all counts! Alternatively, instead of 1, you can add a small \(\delta\):


More Clever Smoothings

Kneser-Ney Smoothing.
The most popular smoothing for n-gram LMs is Kneser-Ney smoothing: it is a more clever variant of the back-off. More details are here.

Generation (and Examples)

The generation procedure for a n-gram language model is the same as the general one: given current context (history), generate a probability distribution for the next token (over all tokens in the vocabulary), sample a token, add this token to the sequence, and repeat all steps again. The only part which is specific to n-gram models is the way we compute the probabilities. Look at the illustration.

Examples of generated text

To show you some examples, we trained a 3-gram model on 2.5 million of English sentences.

Dataset details. The data is the English side of WMT English-Russian translation data. It consists of 2.5 million sentence pairs (a pair of sentences in English and Russian which are supposed to be translations of each other). The dataset contains news data, Wikipedia titles and 1 million crawled sentences released by Yandex. This data is one of the standard datasets for machine translation; for language modeling, we used only the English side.

Note that everything you will see below is generated by a model and presented without changes or filtering. Any content you might not like appeared as a result of training data. The best we can do is to use the standard datasets, and we did.


How to: Look at the samples from a n-gram LM. What is clearly wrong with these samples? What in the design of n-gram models leads to this problem?



You probably noticed that these samples are not fluent: it can be clearly seen that the model does not use long context, and relies only on a couple of tokens. The inability to use long contexts is the main shortcoming of n-gram models.

Now, we take the same model, but perform greedy decoding: at each step, we pick the token with the highest probability. We used 2-token prefixes from the examples of samples above (for each example, the prefix fed to the model is underlined).


How to: Look at the examples generated by the same model using greedy decoding. Do you like these texts? How would you describe them?



We see that greedy texts are:

  • shorter - the _eos_ token has high probability;
  • very similar - many texts end up generating the same phrase.

To overcome the main flaw of n-gram LMs, fixed context size, we will now come to neural models. As we will see later, when longer contexts are used, greedy decoding is not so awful.



Neural Language Models

In our general left-to-right language modeling framework, the probability of a token sequence is: \[P(y_1, y_2, \dots, y_n)=P(y_1)\cdot P(y_2|y_1)\cdot P(y_3|y_1, y_2)\cdot\dots\cdot P(y_n|y_1, \dots, y_{n-1})= \prod \limits_{t=1}^n P(y_t|y_{\mbox{<}t}).\] Let us recall, again, what is left to do.

We need to: define how to compute the conditional probabilities \(P(y_t|y_1, \dots, y_{t-1})\).

Differently from n-gram models that define formulas based on global corpus statistics, neural models teach a network to predict these probabilities.

How: Train a neural network to predict them.

Intuitively, neural Language Models do two things:

  • process context → model-specific
    The main idea here is to get a vector representation for the previous context. Using this representation, a model predicts a probability distribution for the next token. This part could be different depending on model architecture (e.g., RNN, CNN, whatever you want), but the main point is the same - to encode context.
  • generate a probability distribution for the next token → model-agnostic
    Once a context has been encoded, usually the probability distribution is generated in the same way - see below.

This is classification!

We can think of neural language models as neural classifiers. They classify prefix of a text into |V| classes, where the classes are vocabulary tokens.

High-Level Pipeline

Since left-to-right neural language models can be thought of as classifiers, the general pipeline is very similar to what we saw in the Text Classification lecture. For different model architectures, the general pipeline is as follows:

  • feed word embedding for previous (context) words into a network;
  • get vector representation of context from the network;
  • from this vector representation, predict a probability distribution for the next token.

Similarly to neural classifiers, we can think about the classification part (i.e., how to get token probabilities from a vector representation of a text) in a very simple way.

Vector representation of a text has some dimensionality \(d\), but in the end, we need a vector of size \(|V|\) (probabilities for \(|V|\) tokens/classes). To get a \(|V|\)-sized vector from a \(d\)-sized, we can use a linear layer. Once we have a \(|V|\)-sized vector, all is left is to apply the softmax operation to convert the raw numbers into class probabilities.

Another View: Dot Product with Output Word Embeddings

If we look at the final linear layer more closely, we will see that it has \(|V|\) columns and each of them corresponds to a token in the vocabulary. Therefore, these vectors can be thought of as output word embeddings.

Now we can change our model illustration according to this view. Applying the final linear layer is equivalent to evaluating the dot product between text representation h and each of the output word embeddings.

Formally, if \(\color{#d192ba}{h_t}\) is a vector representation of the context \(y_1, \dots, y_{t-1}\) and \(\color{#88bd33}{e_w}\) are the output embedding vectors, then \[p(y_t| y_{\mbox{<}t}) = \frac{exp(\color{#d192ba}{h_t^T}\color{#88bd33}{e_{y_t}}\color{black})}{\sum\limits_{w\in V}exp(\color{#d192ba}{h_t^T}\color{#88bd33}{e_{w}}\color{black})}.\] Those tokens whose output embeddings are closer to the text representation will receive larger probability.

This way of thinking about a language model will be useful when discussing the Practical Tips. Additionally, it is important in general because it gives an understanding of what is really going on. Therefore, below I'll be using this view.


Training and the Cross-Entropy Loss

Lena: This is the same cross-entropy loss we discussed in the Text Classification lecture.

Neural LMs are trained to predict a probability distributions of the next token given the previous context. Intuitively, at each step we maximize the probability a model assigns to the correct token.

Formally, if \(y_1, \dots, y_n\) is a training token sequence, then at the timestep \(t\) a model predicts a probability distribution \(p^{(t)} = p(\ast|y_1, \dots, y_{t-1})\). The target at this step is \(p^{\ast}=\mbox{one-hot}(y_t)\), i.e., we want a model to assign probability 1 to the correct token, \(y_t\), and zero to the rest.

The standard loss function is the cross-entropy loss. Cross-entropy loss for the target distribution \(p^{\ast}\) and the predicted distribution \(p^{}\) is \[Loss(p^{\ast}, p^{})= - p^{\ast} \log(p) = -\sum\limits_{i=1}^{|V|}p_i^{\ast} \log(p_i).\] Since only one of \(p_i^{\ast}\) is non-zero (for the correct token \(y_t\)), we will get \[Loss(p^{\ast}, p) = -\log(p_{y_t})=-\log(p(y_t| y_{\mbox{<}t})).\] At each step, we maximize the probability a model assigns to the correct token. Look at the illustration for a single timestep.

For the whole sequence, the loss will be \(-\sum\limits_{t=1}^n\log(p(y_t| y_{\mbox{<}t}))\). Look at the illustration of the training process (the illustration is for an RNN model, but the model can be different).



Cross-Entropy and KL divergence

When the target distribution is one-hot (\(p^{\ast}=\mbox{one-hot}(y_t)\)), the cross-entropy loss \(Loss(p^{\ast}, p^{})= -\sum\limits_{i=1}^{|V|}p_i^{\ast} \log(p_i)\) is equivalent to Kullback-Leibler divergence \(D_{KL}(p^{\ast}|| p^{})\).

Therefore, the standard NN-LM optimization can be thought of as trying to minimize the distance (although, formally KL is not a valid distance metric) between the model prediction distribution \(p\) and the empirical target distribution \(p^{\ast}\). With many training examples, this is close to minimizing the distance to the actual target distribution.

Models: Recurrent

Now we will look at how we can use recurrent models for language modeling. Everything you will see here will apply to all recurrent cells, and by "RNN" in this part I refer to recurrent cells in general (e.g. vanilla RNN, LSTM, GRU, etc).

Simple: One-Layer RNN

The simplest model is a one-layer recurrent network. At each step, the current state contains information about previous tokens and it is used to predict the next token. In training, you feed the training examples. At inference, you feed as context the tokens your model generated; this usually happens until the _eos_ token is generated.

Multiple layers: feed the states from one RNN to the next one

To get a better text representation, you can stack multiple layers. In this case, inputs for the higher RNN are representations coming from the previous layer.

The main hypothesis is that with several layers, lower layers will catch local phenomena, while higher layers will be able to catch longer dependencies.

Models: Convolutional

Lena: In this part, I assume you read the Convolutional Models section in the Text Classification lecture. If you haven't, read the Convolutional Models Supplementary.

Compared to CNNs for text classification, language models have several differences. Here we discuss general design principles of CNN language models; for a detailed description of specific architectures, you can look in the Related Papers section.

When designing a CNN language model, you have to keep in mind the following things:

  • prevent information flow from future tokens
    To predict a token, a left-to-right LM has to use only previous tokens - make sure your CNN does not see anything but them! For example, you can shift tokens to the right by using padding - look at the illustration above.
  • do not remove positional information
    Differently from text classification, positional information is very important for language models. Therefore, do not use pooling (or be very careful in how you do it).
  • if you stack many layers, do not forget about residual connections
    If you stack many layers, it may difficult to train a very deep network well. To avoid this, use residual connections - look for the details below.

Receptive field: with many layers, can be large

When using convolutional models without global pooling, your model will inevitably have a fixed-sized context. This might seem undesirable: the fixed context size problem is exactly what we didn't like in the n-gram models!

However, if for n-gram models typical context size is 1-4, contexts in convolutional models can be quite long. Look at the illustration: with only 3 convolutional layers with small kernel size 3, a network has a context of 7 tokens. If you stack many layers, you can get a very large context length.

Residual connections: train deep networks easily

To process longer contexts you need a lot of layers. Unfortunately, when stacking a lot of layers, you can have a problem with propagating gradients from top to bottom through a deep network. To avoid this, we can use residual connections or a more complicated variant highway connections.

Residual connections are very simple: they add input of a block to its output. In this way, the gradients over inputs will flow not only indirectly through the block, but also directly through the sum.

Highway connections have the same motivation, but a use a gated sum of input and output instead of the simple sum. This is similar to LSTM gates where a network can learn the types of information it may want to carry on from bottom to top (or, in case of LSTMs, from left to right).

Look at the example of a convolutional network with residual connections. Typically, we put residual connections around blocks with several layers. A network can several such blocks - remember, you need a lot of layers to get a decent receptive field.

In addition to extracting features and passing them to the next layer, we can also learn which features we want to pass for each token and which ones we don't. More details are in this paper summary.

P.S. Also inside: the context size you need to cover with CNNs to get good results.



Generation Strategies

As we saw before, to generate a text using a language model you just need to sample tokens from the probability distribution predicted by a model.

Coherence and Diversity

You can modify the distributions predicted by a model in different ways to generate texts with some properties. While the specific desired text properties may depend on the task you care about (as always), usually you would want the generated texts to be:

  • coherent - the generated text has to make sense;
  • diverse - the model has to be able to produce very different samples.

Lena: Recall the incoherent samples from an n-gram LM - not nice!

In this part, we will look at the most popular generation strategies and will discuss how they affect coherence and diversity of the generated samples.

Standard Sampling

The most standard way of generating sequences is to use the distributions predicted by a model, without any modifications.

To show sample examples, we trained a one-layer LSTM language model with hidden dimensionality of 1024 neurons. The data is the same as for the n-gram model (2.5m English sentences from WMT English-Russian dataset).


How to: Look at the samples from an LSTM LM. Pay attention to coherence and diversity. Are these samples better than those of an n-gram LM we saw earlier?

Lena: we sample until the _eos_ token is generated.



Sampling with temperature

A very popular method of modifying language model generation behavior is to change the softmax temperature. Before applying the final softmax, its inputs are divided by the temperature \(\tau\):

Formally, the computations change as follows:

Note that the sampling procedure remains standard: the only thing which is different is how we compute the probabilities.

How to: Play with the temperature and see how the probability distribution changes. Note how changes the difference between the probability of the most likely class (green) and others.

  • What happens when the temperature is close to zero?
  • What happens when the temperature is high?
  • Sampling with which temperature corresponds to greedy decoding?

Note that you can also change the number of classes and generate another probability distribution.



Examples: Samples with Temperatures 2 and 0.2

Now when you understand how the temperature changes the distributions, it's time to look at the samples with different temperature.

How to: Look at the samples with the temperature 2. How are these samples are different from the standard ones? Try to characterize both coherence and diversity.

Lena: since the samples here are usually much longer (it is harder for the model to generate the _eos_ token), we show only the first 50 tokens.


Clearly, these samples are very diverse, but most of them do not have much sense. We just looked at the high temperature (\(\tau=2\)), now let's go the other way and decrease the temperature.


How to: Look at the samples with the temperature 0.2. How are these samples are different from the previous ones? Try to characterize both coherence and diversity.

Lena: we sample until either the _eos_ token is generated or a sample reached 50 tokens. Note that we show all samples, without filtering!


Here we have the other problem: the samples lack diversity. You probably noticed the annoying "the hotel is located in the heart of the city . _eos_" - it feels like half of the samples end up generating this sentence! Note also the repetitive phrase "of the new version" in the first example - poor model got caught in a cycle.

To summarize our findings here, use can use temperature to modify sampling quality, but one of the coherence and diversity will suffer at the expense of the other.



Top-K sampling: top K most probable tokens

Varying temperature is tricky: if the temperature is too low, then almost all tokens receive very low probability; if the temperature is too high, plenty of tokens (not very good) will receive high probability.

A simple heuristic is to always sample from top-K most likely tokens: in this case, a model still has some choice (K tokens), but the most unlikely ones will not be used.

How to: Look at the results of the top-K sampling with K=10. How are these samples are different from the standard ones? Try to characterize both coherence and diversity.

Lena: we sample until the _eos_ token is generated.


Fixed K is not always good

While usually top-K sampling is much more effective than changing the softmax temperature alone, the fixed value of K is surely not optimal. Look at the illustration below.

The fixed value of K in the top-K sampling is not good because top-K most probable tokens may

  • cover very small part of the total probability mass (in flat distributions);
  • contain very unlikely tokens (in peaky distributions).

Top-p (aka Nucleus) sampling: top-p% of the probability mass

A more reasonable strategy is to consider not top-K most probable tokens, but top-p% of the probability mass: this solution is called Nucleus sampling.

Look at the illustration: with nucleus sampling, the number of tokens we sample from is dynamic and depends on the properties of the distribution.

How to: Look at the results of the nucleus sampling with p=80%. Are they better than everything we saw before?

Lena: we sample until the _eos_ token is generated.





Evaluating Language Models

TL;DR When reading a new text, how much is a model "surprised"?

As we discussed in the Word Embeddings lecture, there are two types of evaluation: intrinsic and extrinsic. Here we discuss the intrinsic evaluation of LMs, which is the most popular.

When reading a new text, how much is a model "surprised"?

Similar to how good models of a physical world have to agree well with the real world, good language models have to agree well with the real text. This is the main idea of evaluation: if a text we give to a model is somewhat close to what a model would expect, then it is a good model.

Cross-Entropy and Perplexity

But how to evaluate if "a text is somewhat close to what a model would expect"? Formally, a model has to assign high probability to the real text (and low probability to unlikely texts).

Cross-Entropy and Log-Likelihood of a Text

Let us assume we have a held-out text \(y_{1:M}= (y_1, y_2, \dots, y_M)\). Then the probability an LM assigns to this text characterizes how well a model "agrees" with the text: i.e., how well it can predict appearing tokens based on their contexts:

This is log-likelihood: the same as our loss function, but without negation. Note also the logarithm base: in the optimization, the logarithm us usually natural (because it is faster to compute), but in the evaluation, it's log base 2. Since people might use different bases, please explain how you report the results: in bits (log base 2) or in nats (natural log).

Perplexity

Instead of cross-entropy, it is more common to report its transformation called perplexity: \[Perplexity(y_{1:M})=2^{-\frac{1}{M}L(y_{1:M})}.\]

A better model has higher log-likelihood and lower perplexity.

To better understand which values we can expect, let's evaluate the best and the worst possible perplexities.

  • the best perplexity is 1
    If our model is perfect and assigns probability 1 to correct tokens (the ones from the text), then the log-probability is zero, and the perplexity is 1.
  • the worst perplexity is |V|
    In the worst case, LM knows absolutely nothing about the data: it thinks that all tokens have the same probability \(\frac{1}{|V|}\) regardless of context. Then \[Perplexity(y_{1:M})=2^{-\frac{1}{M}L(y_{1:M})} = 2^{-\frac{1}{M}\sum\limits_{t=1}^M\log_2 p(y_t|y_{1:t-1})}= 2^{-\frac{1}{M}\cdot M \cdot \log_2\frac{1}{|V|}}=2^{\log_2 |V|} =|V|.\]

Therefore, your perplexity will always be between 1 and |V|.



Practical Tips

Weight Tying (aka Parameter Sharing)

Note that in an implementation of your model, you will have to define two embedding matrices:

  • input - the ones you use when feeding context words into a network,
  • output - the ones you use before the softmax operation to get predictions.

Usually, these two matrices are different (i.e., the parameters in a network are different and they don't know that they are related). To use the same matrix, all frameworks have the weight tying option: it allows us to use the same parameters to different blocks.

Practical point of view. Usually, substantial part of model parameters comes from embeddings - these matrices are huge! With weight tying, you can significantly reduce a model size.

Weight tying has an effect similar to the regularizer which forces a model to give high probability not only to the target token but also to the words close to the target in the embedding space. More details are here.




Analysis and Interpretability

Visualizing Neurons: Some are Interpretable!

Good Old Classics

The (probably) most famous work which looked at the activations of neurons in neural LMs is the work by Andrej Karpathy, Justin Johnson, Li Fei-Fei Visualizing and Understanding Recurrent Networks.

In this work, (among other things) the authors trained character-level neural language models with LSTMs and visualized activations of neurons. They used two very different datasets: Leo Tolstoy's War and Peace novel - entirely English text with minimal markup, and the source code of the Linux Kernel.

Look at the examples from the Visualizing and Understanding Recurrent Networks paper.
Why do you think the model leaned these things?



More recent: Sentiment Neuron

A more recent fun result is Open-AI's Sentiment Neuron. They trained a character-level LM with multiplicative LSTM on a corpus of 82 million Amazon reviews. Turned out, the model learned to track sentiment!

Note that this result is qualitatively different from the previous one. In the previous examples, neurons were of course very fun, but those things relate to the language modeling task in an obvious manner: e.g., tracking quotes is needed for predicting next tokens. Here, sentiment is a more high-level concept. Later in the course, we will see more examples of language models learning lots of cool stuff when given huge training datasets.

Use Interpretable Neurons to Control Generated Texts

Interpretable neurons not only fun, but also can be used to control your language model. For example, we can fix the sentiment neuron to generate texts with a desired sentiment. Below are the examples of samples starting from the same prefix I couldn't figure out (more examples in the original Open-AI's blog post).

What about neurons (or filters) in CNNs?

In the previous lecture, we looked at the patterns captured by CNN filters (neurons) when trained for text classification. Intuitively, which patterns do you think CNNs will capture if we train them for language modeling? Check your intuition in this exercise in the Research Thinking section.

Contrastive Evaluation: Test Specific Phenomena

To test if your LM knows something very specific, you can use contrastive examples. These are the examples where you have several versions of the same text which differ only in the aspect you care about: one correct and at least one incorrect. A model has to assign higher scores (probabilities) to the correct version.

A very popular phenomenon to look at is subject-verb agreement, initially proposed in Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies paper. In this task, contrastive examples consist of two sentences: one where the verb agrees in number with the subject, and another with the same verb, but incorrect inflection.

Examples can be of different complexity depending on the number of attractors: other nouns in a sentence, that have different grammatical number and can "distract" a model from the subject.


But how do we know if it learned syntax or just collocations/semantic? Use a bit of nonsense! More details are here.




Research Thinking




How to

  • Read the short description at the beginning - this is our starting point, something known.
  • Read a question and think: for a minute, a day, a week, ... - give yourself some time! Even if you are not thinking about it constantly, something can still come to mind.
  • Look at the possible answers - previous attempts to answer/solve this problem.
    Important: You are not supposed to come up with something exactly like here - remember, each paper usually takes the authors several months of work. It's a habit of thinking about these things that counts! All the rest a scientist needs is time: to try-fail-think until it works.

It's well-known that you will learn something easier if you are not just given the answer right away, but if you think about it first. Even if you don't want to be a researcher, this is still a good way to learn things!



A Bit of Analysis

Which patterns a CNN learned? Check your intuition

As we saw in the previous lecture, filters of CNNs trained for sentiment analysis capture very interpretable and informative "clues" for the sentiment (e.g., poorly designed junk, still working perfect, a mediocre product, etc.). What do you think CNNs capture when trained as language models? What could be these "informative clues" for language models?

? Imagine you trained a simple CNN-LM on the data containing parliamentary debates and News Commentary data. How do you imagine the patterns your model might capture?
Possible answers

TL;DR: Models Learn Patterns Useful for the Task at Hand

Let's look at the examples from This EMNLP 2016 paper with a simple convolutional LM. Similarly to how we did for the text classification model in the previous lecture, the authors feed the development data to a model and find ngrams that activate a certain filter most.

While a model for sentiment classification learned to pick things which are related to sentiment, the LM model captures phrases which can be continued similarly. For example, one kernel activates on phrases ending with a month, another - with a name; note also the "comparative" kernel firing at as ... as.



Here will be more exercises!

This part will be expanding from time to time.









Have Fun!




Let's write a paper!

Every great paper starts with an inciting thought - something only a human can have ...


Write a prefix, for example: "Deep Variational" (without quotes)

Waiting for the model to load... (this should take 3-5s)