Neurons in LLMs: Dead, N-gram, Positional
This is a post for the paper Neurons in Large Language Models: Dead, N-gram, Positional.

![](../resources/posts/ffn_neurons/suppressed_concepts-min.png)
With scale, LMs become more exciting but, at the same time, harder to analyze. We show that even with simple methods and a single GPU, you can do a lot! We analyze OPT models up to 66b and find that
- neurons inside LLMs can be:
- dead, i.e. never activate on a large dataset (see the sketch after this list);
- n-gram detectors that explicitly remove information about the current input token;
- positional, i.e. encode "where" regardless of "what", which questions the key-value memory view of FFNs;
- with scale, models have more dead neurons and token detectors and are less focused on absolute position.
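To make the "dead" part concrete, below is a minimal sketch of how one might look for such neurons: run the model over some data and record which FFN neurons ever produce a non-zero activation. It assumes a small OPT checkpoint with ReLU FFN activations from Hugging Face transformers; the module paths and the two toy sentences are illustrative stand-ins for the paper's setup and data.

```python
# Minimal sketch: find "dead" FFN neurons, i.e. neurons whose activation is
# zero for every token in the data. Assumes an OPT-style model (ReLU FFNs);
# the module paths below are specific to the Hugging Face OPT implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small stand-in; the paper goes up to 66b
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

ever_active = {}  # layer index -> bool tensor over FFN neurons

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # `output` is the post-ReLU FFN activation; last dim is the FFN size
        fired = (output.reshape(-1, output.shape[-1]) > 0).any(dim=0).cpu()
        prev = ever_active.get(layer_idx, torch.zeros_like(fired))
        ever_active[layer_idx] = prev | fired
    return hook

for i, layer in enumerate(model.model.decoder.layers):
    layer.activation_fn.register_forward_hook(make_hook(i))

texts = ["A tiny toy corpus.", "In the paper, this is a large diverse dataset."]
with torch.no_grad():
    for t in texts:
        model(**tok(t, return_tensors="pt"))

for i, fired in ever_active.items():
    dead = (~fired).sum().item()
    print(f"layer {i}: {dead} neurons never activated (on this toy data)")
```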
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
NMT Training Process through the Lens of SMT
This is a post for the EMNLP 2021 paper Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT.

![](../resources/posts/nmt_training/morda-min.png)
In SMT, different competences are modelled with distinct models. In NMT, the whole translation task is modelled with a single neural network. How and when does NMT learn all these competences? We show that
- during training, NMT undergoes three different stages:
- target-side language modeling,
- learning how to use the source and approaching word-by-word translation,
- refining translations, visible in increasingly complex reorderings but not captured by, e.g., BLEU;
- not only is this fun, but it can also help in practice! For example, in settings where data complexity matters, such as non-autoregressive NMT.
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
Neural Machine Translation Inside Out
![](../resources/posts/nmt_inside_out/morda_test.png)
Over the last decade, machine translation has shifted from traditional statistical approaches with distinct components and hand-crafted features to end-to-end neural models. We try to understand how NMT works and show that:
- NMT model components can learn to extract features which in SMT were modelled explicitly;
- we can also look at how NMT balances the two different types of context: the source and the target prefix;
- NMT training consists of stages where it focuses on competences mirroring the three core SMT components.
![](../resources/posts/buttons/button_read_more-min.png)
Source and Target Contributions to NMT Predictions
This is a post for the ACL 2021 paper Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation.

In NMT, the generation of a target token is based on two types of context: the source and the prefix of the target sentence. We show how to evaluate the relative contributions of the source and target to NMT predictions (a simplified illustration follows the list below) and find that:
- models suffering from exposure bias are more prone to over-relying on target history (and hence to hallucinating) than the ones where the exposure bias is mitigated;
- models trained with more data rely on the source more and do it more confidently;
- the training process is non-monotonic with several distinct stages.
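The paper measures these contributions with layer-wise relevance propagation. As a much cruder, purely illustrative stand-in, the sketch below compares gradient norms with respect to the source and target-prefix embeddings of a public Marian en-de model; the model name, example sentences, and the gradient-norm proxy are assumptions, not the paper's setup.

```python
# Rough illustration (not the paper's LRP-based method): compare how strongly
# the next-token prediction depends on the source vs. the target prefix,
# using gradient norms at the embeddings as a crude proxy for "contribution".
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"  # an example public NMT model
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name).eval()

src = tok("I saw a cat on a mat .", return_tensors="pt")
tgt = tok(text_target="Ich sah eine", return_tensors="pt", add_special_tokens=False)
dec_ids = torch.cat([torch.tensor([[model.config.decoder_start_token_id]]),
                     tgt["input_ids"]], dim=1)

# Embed tokens ourselves so that gradients w.r.t. the embeddings are available
# (Marian scales embeddings by sqrt(d_model) when scale_embedding is set).
emb = model.get_input_embeddings()
scale = model.config.d_model ** 0.5 if getattr(model.config, "scale_embedding", False) else 1.0
src_emb = (emb(src["input_ids"]) * scale).detach().requires_grad_(True)
tgt_emb = (emb(dec_ids) * scale).detach().requires_grad_(True)

out = model(inputs_embeds=src_emb, attention_mask=src["attention_mask"],
            decoder_inputs_embeds=tgt_emb)
out.logits[0, -1].max().backward()  # gradient of the top next-token logit

src_score = src_emb.grad.norm(dim=-1).sum()
tgt_score = tgt_emb.grad.norm(dim=-1).sum()
total = src_score + tgt_score
print(f"source share: {src_score / total:.2f}, "
      f"target-prefix share: {tgt_score / total:.2f}")
```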
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
![](../resources/posts/buttons/button_view_code-min.png)
Information-Theoretic Probing with MDL
![](../resources/posts/mdl_probes/probe_main_orange-min.png)
Probing classifiers often fail to adequately reflect differences in representations and can show different results depending on hyperparameters. As an alternative to the standard probes,
- we propose information-theoretic probing which measures minimum description length (MDL) of labels given representations;
- we show that MDL characterizes both probe quality and the amount of effort needed to achieve it;
- we explain how to easily measure MDL on top of standard probe-training pipelines (see the sketch after this list);
- we show that results of MDL probes are more informative and stable than those of standard probes.
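To give a feeling for the online (prequential) code, here is a minimal sketch: a probe is repeatedly trained on a growing portion of the data, and each next portion is "transmitted" at its negative log-likelihood under the current probe. The scikit-learn logistic-regression probe and the portion sizes are simplifying assumptions; the paper's probes and exact setup differ.

```python
# Minimal sketch of online (prequential) coding: the codelength of labels
# given representations, with a logistic-regression probe as a stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, fractions=(0.001, 0.002, 0.004, 0.008, 0.016, 0.032,
                                        0.0625, 0.125, 0.25, 0.5, 1.0)):
    y = np.asarray(y)
    n, num_classes = len(y), len(np.unique(y))
    ends = [max(int(round(f * n)), num_classes) for f in fractions]
    # The first portion is encoded with a uniform code over the labels;
    # we assume all classes occur in it (true for reasonably large data).
    total_bits = ends[0] * np.log2(num_classes)
    for start, end in zip(ends[:-1], ends[1:]):
        if end <= start:
            continue
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        probs = probe.predict_proba(X[start:end])
        col = {c: j for j, c in enumerate(probe.classes_)}
        idx = np.array([col[label] for label in y[start:end]])
        total_bits += -np.log2(probs[np.arange(end - start), idx] + 1e-12).sum()
    return total_bits

# Usage: X is a (num_examples, dim) array of representations, y are labels.
# Compression w.r.t. the uniform code is n * log2(num_classes) / total_bits.
```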
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
![](../resources/posts/buttons/button_view_code-min.png)
Evolution of Representations in the Transformer
![](../resources/posts/emnlp19_evolution/fugue_logo_on_white-min-min.png)
We look at the evolution of representations of individual tokens in Transformers trained with different training objectives (MT, LM, and MLM, i.e. BERT-style) from the Information Bottleneck perspective and show that:
- LMs gradually forget the past when forming predictions about the future;
- for MLMs, the evolution proceeds in two stages of context encoding and token reconstruction;
- MT representations get refined with context, but less processing takes place.
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
When a Good Translation is Wrong in Context
This is a post for the ACL 2019 paper When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion.

From this post, you will learn:
- which phenomena cause context-agnostic translations to be inconsistent with each other
- how we create test sets addressing the most frequent phenomena
- about a novel set-up for context-aware NMT with a large amount of sentence-level data and much less document-level data
- about a new model for this set-up (Context-Aware Decoder, aka CADec): a two-pass MT model which first produces a draft translation of the current sentence, then corrects it using context.
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
![](../resources/posts/buttons/button_view_code-min.png)
The Story of Heads
![](../img/paper/acl19_heads-min.png)
From this post, you will learn:
- how we evaluate the importance of attention heads in the Transformer (see the sketch after this list)
- which functions the most important encoder heads perform
- how we prune the vast majority of attention heads in the Transformer without seriously affecting quality
- which types of model attention are most sensitive to the number of attention heads, and at which layers
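As a small taste of the statistics involved, the sketch below computes one simple head characteristic, a head's "confidence" (the average of its maximum attention weight), on the encoder of a public Marian translation model. The model name and example sentences are stand-ins chosen for illustration, not the NMT Transformers analyzed in the paper.

```python
# Minimal sketch: per-head "confidence" = mean of the head's maximum attention
# weight over positions, computed on the encoder self-attention of a public
# NMT model. Padding positions are not excluded here for simplicity.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-en-de"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

batch = tok(["I saw a cat on a mat .", "The weather is nice today ."],
            return_tensors="pt", padding=True)
with torch.no_grad():
    enc = model.get_encoder()(**batch, output_attentions=True)

# enc.attentions: one (batch, num_heads, seq, seq) tensor per encoder layer.
for layer_idx, attn in enumerate(enc.attentions):
    confidence = attn.max(dim=-1).values.mean(dim=(0, 2))  # one value per head
    print(f"layer {layer_idx}:",
          " ".join(f"{c:.2f}" for c in confidence.tolist()))
```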
![](../resources/posts/buttons/button_read_more-min.png)
![](../resources/posts/buttons/button_read_paper-min.png)
![](../resources/posts/buttons/button_view_code-min.png)