### The Rediscovery Hypothesis:

### Language Models Need to Meet Linguistics

### by

### TEZEKBAYEV MAXAT

### Submitted to the Department of Mathematics

### in partial fulfillment of the requirements for the degree of Master of Applied Mathematics

### at the

### NAZARBAYEV UNIVERSITY May 2022

### © Nazarbayev University 2022. All rights reserved.

### Author . . . . Department of Mathematics

### May 8, 2022

### Certified by. . . . ZHENSIBEK ASSYLBEKOV Assistant Professor Thesis Supervisor

### Accepted by . . . .

### Gonzalo Hortelano

### Acting Dean, School of Science and Humanities


### Abstract

There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. This work examines whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis.

In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English.

Thesis Supervisor: ZHENSIBEK ASSYLBEKOV Title: Assistant Professor

## Contents

1 Overview

1.1 Introduction

1.2 Word Embeddings and Language Models

1.3 Pruning Method

1.4 Measuring the Amount of Linguistic Knowledge

2 Main

2.1 Experimental Setup

2.2 Results

3 An Information-Theoretic Framework

3.1 An Information-Theoretic Framework

3.1.1 Notation

3.2 Information Theory Background

3.3 Main Result

3.4 Proof of Theorem 1

3.5 Experiments

3.5.1 Removal Techniques

3.5.2 Results

3.6 Related Work

4 Conclusion

## List of Figures

2-1 Results of applying the LTH (Algorithm 1.3.1) to SGNS and CoVe. In each case we scatter-plot the percentage of remaining weights vs validation loss, which is cross-entropy for CoVe, and a variant of the negative sampling objective for SGNS.

2-2 Probing results for the CoVe embeddings. Horizontal axes indicate validation loss values, which are cross-entropy values. Vertical axes indicate drops in probing performance. In the case of edge probing, we use the NE, POS, and constituent labeling tasks from the suite of [66] and report the drop in micro-averaged F1 score compared to the baseline (unpruned) model. In the case of structural probing, we use the distance probe of [26] and report the drop in undirected unlabeled attachment score (UUAS).

2-3 Similarity and analogy results for the SGNS embeddings. For similarities we report the drop in Spearman's correlation with the human ratings and for analogies the drop in accuracy.

2-4 Zooming in on better performing CoVe models. Indication of the region to be zoomed in (left) and the zoomed-in region (right).

3-1 Illustration of Theorem 1.

3-2 Nullspace projection for a 2-dimensional binary classifier. The decision boundary of $U_i$ is $U_i$'s null-space. Source: [55].

3-3 Loss increase w.r.t. INLP iteration for different tasks. ON stands for OntoNotes, UD for Universal Dependencies. The UD EWT dataset has two types of POS annotation: coarse tags (UPOS) and fine-grained tags (FPOS).

3-4 INLP Dynamics for SGNS. ON stands for OntoNotes, UD for Universal Dependencies. The UD EWT dataset has two types of POS annotation: coarse tags (UPOS) and fine-grained tags (FPOS).

3-5 INLP Dynamics for BERT.

3-6 Synthetic task results for INLP. $\Delta\ell$ is the increase in cross-entropy loss when pseudo-linguistic information is removed from BERT's last layer with the INLP procedure. $\rho$ is estimated with the help of the AWD-LSTM-MoS model [73] as described in Section 3.5.1.

3-7 Criticism and improvement of the probing methodology. An arrow $A \to B$ means that $B$ criticizes and/or improves $A$.

## Chapter 1 Overview

### 1.1 Introduction

Vector representations of words obtained from self-supervised pretraining of neural language models (LMs) on massive unlabeled data have revolutionized NLP in the last decade. This success has spurred an interest in trying to understand what type of knowledge these models actually learn [58, 48].

Of particular interest here is “linguistic knowledge”, which is generally measured by annotating test sets through experts following certain pre-defined linguistic schemas.

These annotation schemas are based on language regularities manually defined by linguists. On the other side, we have language models, which are pretrained predictors that assign a probability to the presence of a token given the surrounding context.

Neural models solve this task by finding patterns in the text. We refer to the claim that such neural language models rediscover linguistic knowledge as the rediscovery hypothesis. It stipulates that the patterns of the language discovered by the model trying to solve the pretraining task correlate with the human-defined linguistic regularities. In this work we measure the amount of linguistics rediscovered by a pretrained model through so-called probing tasks: hidden layers of a neural LM are fed to a simple classifier (a probe) that learns to predict a linguistic structure of interest [16, 1, 10, 26, 65]. In Chapter 2 we attempt to challenge the rediscovery hypothesis through a variety of experiments to understand to what extent it holds.

Those experiments aim to verify whether the path through language regularity is indeed the one taken by pretrained LMs or whether there is another way to reach good LM performance without rediscovering linguistic structure. The experiments show that pretraining loss is indeed tightly linked to the amount of linguistic structure discovered by an LM. We, therefore, fail to reject the rediscovery hypothesis.

This negative attempt, as well as the abundance of positive examples in the literature, motivates us to prove the rediscovery hypothesis mathematically. In Section 3.1 we use information theory to prove the contrapositive of the hypothesis: removal of linguistic information from an LM degrades its performance. Moreover, we show that the decline in the LM quality depends on how strongly the removed property is interdependent with the underlying text: a greater dependence leads to a greater drop. We confirm this result empirically, both with synthetic data and with real annotations on English text.

The result that removing information that contains strong mutual information with the underlying text degrades (masked) word prediction might not seem surprising a posteriori. However, it is this surprise that lies at the heart of most of the work in recent years around the discovery of how easily this information can be extracted from intermediate representations. Our framework also provides a coefficient that measures the dependence between a probing task and the underlying text. This measure can be used to determine more complex probing tasks, whose rediscovery by language models would indeed be surprising.

### 1.2 Word Embeddings and Language Models

We explore one static embedding model, SGNS, and one contextualized embedding model, CoVe.

Word2vec SGNS [41] is a shallow two-layer neural network that produces uncontextualized word embeddings. It is widely accepted that the SGNS vectors capture word semantics to a certain extent, which is confirmed by folklore examples such as $\mathbf{w}_{\text{king}} - \mathbf{w}_{\text{man}} + \mathbf{w}_{\text{woman}} \approx \mathbf{w}_{\text{queen}}$. SGNS [40] is a masked language model with all tokens but one in a sequence being masked. SGNS approximates its cross-entropy loss by a negative sampling procedure [7]. So the question of the relationship between the SGNS objective function and its ability to discover linguistics is also relevant.

CoVe [38] uses the top-level activations of a two-layer BiLSTM encoder from an attentional sequence-to-sequence model [6] trained for English-to-German translation. CoVe is a conditional language model conditioned on the sequence in the source language. The authors used the CommonCrawl-840B GloVe model [46] for English word vectors, which were completely fixed during pretraining, and we follow their setup. This entails that the embedding layer on the source side is not pruned during the LTH procedure. We also concatenate the encoder output with the GloVe embeddings as is done in the original paper [38, Eq. 6].

### 1.3 Pruning Method

The lottery ticket hypothesis (LTH) of [19] claims that a randomly initialized neural network $f(\mathbf{x};\theta)$ with trainable parameters $\theta \in \mathbb{R}^n$ contains subnetworks $f(\mathbf{x};\mathbf{m}\odot\theta)$, $\mathbf{m} \in \{0,1\}^n$, such that, when trained in isolation, they can match the performance of the original network, where $\odot$ is element-wise multiplication. The authors suggest a simple procedure for identifying such subnetworks (Alg. 1.3.1). When this procedure is applied iteratively (Step 5 of Alg. 1.3.1), we get a sequence of pruned models $\{f(\mathbf{x};\mathbf{m}_i\odot\theta_0)\}$ in which each model has fewer parameters than its predecessor: $\|\mathbf{m}_i\|_0 < \|\mathbf{m}_{i-1}\|_0$, where $\|\mathbf{x}\|_0$ is the number of nonzero elements in $\mathbf{x} \in \mathbb{R}^d$. [19] used iterative pruning for image classification models and found subnetworks that were 10%–20% of the sizes of the original networks and met or exceeded their validation accuracies. Such a compression approach retains weights important for the main task while discarding others. We hypothesize that it might be those additional weights that contain the signals used by probes. But if the subnetworks retain linguistic knowledge, then this is evidence in favor of the rediscovery hypothesis.

Algorithm 1.3.1: Lottery ticket hypothesis: identifying winning tickets [19]

1. Randomly initialize a neural network $f(\mathbf{x};\theta_0)$, $\theta_0 \in \mathbb{R}^n$.
2. Train the network for $j$ iterations, arriving at parameters $\theta_j$.
3. Prune $p\%$ of the parameters in $\theta_j$, creating a mask $\mathbf{m} \in \{0,1\}^n$.
4. Reset the remaining parameters to their values in $\theta_0$, creating the winning ticket $f(\mathbf{x};\mathbf{m}\odot\theta_0)$.
5. Repeat from 2 if performing iterative pruning.
6. Train the winning ticket $f(\mathbf{x};\mathbf{m}\odot\theta_0)$ to convergence.
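The steps of Algorithm 1.3.1 can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions: the "network" is a linear least-squares model, pruning removes the surviving weights of smallest magnitude (the criterion is not spelled out in the algorithm above), and the names `train`, `prune_step`, and `find_winning_ticket` as well as the data are hypothetical.

```python
import math
import random

def train(theta0, mask, data, steps=300, lr=0.05):
    """Gradient descent on mean squared error; masked weights stay at zero."""
    theta = [t * m for t, m in zip(theta0, mask)]
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in data:
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for k, xi in enumerate(x):
                grad[k] += err * xi / n
        theta = [(t - lr * g) * m for t, g, m in zip(theta, grad, mask)]
    return theta

def prune_step(theta_j, mask, p):
    """Step 3: zero the mask for a fraction p of surviving weights (smallest magnitude, an assumption)."""
    alive = [k for k, m in enumerate(mask) if m == 1]
    n_drop = max(1, math.floor(p * len(alive)))
    drop = sorted(alive, key=lambda k: abs(theta_j[k]))[:n_drop]
    return [0 if k in drop else m for k, m in enumerate(mask)]

def find_winning_ticket(theta0, data, p=0.2, rounds=3):
    mask = [1] * len(theta0)
    sizes = [sum(mask)]
    for _ in range(rounds):                   # step 5: iterative pruning
        theta_j = train(theta0, mask, data)   # step 2; restarting from theta0 realizes step 4
        mask = prune_step(theta_j, mask, p)   # step 3
        sizes.append(sum(mask))
    return mask, sizes

# Hypothetical toy data: only the first three of ten features matter.
random.seed(0)
d = 10
w_true = [1.0, -2.0, 0.5] + [0.0] * (d - 3)
data = []
for _ in range(100):
    x = [random.gauss(0, 1) for _ in range(d)]
    data.append((x, sum(w * xi for w, xi in zip(w_true, x))))

theta0 = [random.gauss(0, 0.1) for _ in range(d)]   # step 1
mask, sizes = find_winning_ticket(theta0, data)
ticket = train(theta0, mask, data)                  # step 6
```

Here `sizes` records $\|\mathbf{m}_i\|_0$ after each round, so it should shrink at every iteration, mirroring the sequence of pruned models described above.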

### 1.4 Measuring the Amount of Linguistic Knowledge

To properly measure the amount of linguistic knowledge in word vectors we define it as the performance of classifiers (probes) that take those vectors as input and are trained on linguistically annotated data. This definition has the advantage of being able to be measured exactly, at the cost of avoiding the discussion of whether POS tags or syntactic parse trees indeed denote linguistic knowledge captured by humans in their learning process.

We should note that this probing approach has received a lot of criticism recently [25, 51, 69] due to its inability to distinguish between information encoded in the pretrained vectors and information learned by the probing classifier. However, in our study, the question is not "How much linguistics is encoded in the representation vector?", but rather "Does one vector contain more linguistic information than the other?" We compare different representations of the same dimensionality using probing classifiers of the same capacity. Even if part of the probing performance is due to the classifier itself, we claim that the difference in the probing performance will be due to the difference in the amount of linguistic knowledge encoded in the representations we manipulate. This conjecture is strengthened by the findings of [74], who analyzed the representations from pretrained miniBERTas and demonstrated that the trends found through edge probing [66] are the same as those found through better-designed probes such as Minimum Description Length [69]. Therefore in this work we adopt edge probing and structural probing for contextualized embeddings.

For static embeddings, we use the traditional word similarity and word analogy tasks.

Edge probing [66] formulates several linguistic tasks of different nature as text span classification tasks. The probing model is a lightweight classifier on top of the pretrained representations trained to solve those linguistic tasks. In this work we use the part-of-speech tagging (POS), constituent labeling, named entity labeling (NE), and semantic role labeling (SRL) tasks from the suite, in which a probing classifier receives a sequence of tokens and predicts a label for it. For example, in the case of constituent labeling, for a sentence This probe [discovers linguistic knowledge], the sequence in square brackets should be labeled as a verb phrase.

Structural probing [26] evaluates whether syntax trees are embedded in a linear transformation of a neural network’s word representation space. The probe identifies a linear transformation under which squared Euclidean distance encodes the distance between words in the parse tree. [26] show that such transformations exist for both ELMo and BERT but not in static baselines, providing evidence that entire syntax trees can be easily extracted from the vector geometry of deep models.

Word similarity [17] and word analogy [42] tasks can be considered as non-parametric probes of static embeddings and, unlike the other probing tasks, are not learned. The use of word embeddings in the word similarity task has been criticized for the instability of the results obtained [3]. Regarding the word analogy task, [61] raised concerns about the misalignment of assumptions in generating and testing word embeddings. However, the success of static embeddings on these tasks was a crucial part of their widespread adoption.
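To make the non-parametric nature of the analogy probe concrete, here is a minimal sketch of the usual vector-offset scoring. It assumes cosine similarity as the ranking criterion and excludes the three query words from the candidates; the three-dimensional vectors in `emb` are invented for illustration and are not taken from any trained model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' by ranking words against w_b - w_a + w_c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings (not from a real model).
emb = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}
answer = analogy(emb, "man", "king", "woman")
```

No parameters are fitted here: the score depends only on the geometry of the embeddings, which is why such tasks can serve as probes that are not learned.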

## Chapter 2 Main

The first question we pose in this work is the following: is the rediscovery of linguistic knowledge mandatory for models that perform well on their pretraining tasks, typically language modeling or translation, or is it a side effect of overparameterization?^{1} We analyze the correlation between linguistic knowledge and LM performance with pruned pretrained models. By compressing a network through pruning we retain the same overall architecture and can compare probing methods. More importantly, we hypothesize that pruning removes all unnecessary information with respect to the pruning objective (language modeling) and that it might be that information that is used to rediscover linguistic knowledge.

### 2.1 Experimental Setup

We prune the embedding models SGNS and CoVe with the LTH algorithm (Alg. 1.3.1) and evaluate them with probes from Section 1.4 at each pruning iteration. The models are pruned iteratively. Assuming that $\ell_i^{\omega}$ is the validation loss of the embedding model $\omega \in \{\text{SGNS}, \text{CoVe}\}$ at iteration $i$, and $\Delta s_i^{\omega,T} := s_i^{\omega,T} - s_0^{\omega,T}$ is the drop in the corresponding score on the probing task $T \in$ {NE, POS, Const., Struct., Sim., Analogy} compared to the score $s_0^{\omega,T}$ of the baseline (unpruned) model, we obtain pairs $(\ell_i^{\omega}, \Delta s_i^{\omega,T})$ for further analysis. SGNS is pruned in full, while in CoVe we prune everything except the source-side embedding layer. This exception is due to the design of the CoVe model, and we follow the original paper's setup [38].

^{1} Overparameterization is defined informally as “having more parameters than can be estimated from the data”, and therefore using a model richer than necessary for the task at hand. Those additional parameters could be responsible for the good performance of the probes.

Software and datasets. The SGNS model is pretrained on the text8 data [35] using our custom implementation [5]. CoVe is trained on the English–German part of the IWSLT 2016 machine translation task [8] using the OpenNMT-py toolkit [28]. The training set consists of 210K sentence pairs from transcribed TED presentations that cover a wide variety of topics.

The edge probing classifier is trained on the standard benchmark dataset OntoNotes 5.0 [70] using the jiant toolkit [54]. The structural probe is trained on the English UD [63] using the code from the authors [24]. For word similarities we use the WordSim353 dataset [17], while for word analogies we use the Google dataset [40]. All those datasets are in English.

### 2.2 Results

First, we note that the lottery ticket hypothesis is confirmed for the embedding models, since pruning up to 60% of the weights does not significantly harm their performance on held-out data (Fig. 2-1). In the case of SGNS, pruning up to 80% of the weights does not affect its validation loss. Since solving the SGNS objective is essentially a factorization of the pointwise mutual information matrix, in the form $\mathrm{PMI} - \log k \approx \mathbf{W}\mathbf{C}$ [33], this means that a factorization with sparse $\mathbf{W}$ and $\mathbf{C}$ is possible. This observation complements the findings of [68], who showed that a near-optimal factorization is possible with binary $\mathbf{W}$ and $\mathbf{C}$.
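The matrix being factorized can be made concrete with a small sketch that computes the entries $\mathrm{PMI} - \log k$ from co-occurrence counts. It is an illustration only: `shifted_pmi` is a hypothetical helper, $k$ is read here as the number of negative samples, probabilities are plain relative frequencies, and the four word-context pairs are invented.

```python
import math
from collections import Counter

def shifted_pmi(pairs, k=5):
    """Map each observed (word, context) pair to PMI(word, context) - log k."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    return {
        (w, c): math.log(n * total / (word_counts[w] * ctx_counts[c])) - math.log(k)
        for (w, c), n in pair_counts.items()
    }

# Hypothetical co-occurrence data.
pairs = [("cat", "purrs"), ("cat", "purrs"), ("dog", "barks"), ("dog", "barks")]
m = shifted_pmi(pairs, k=1)   # with k = 1 the shift vanishes and the entry is plain PMI
```

In this toy corpus each word always appears with the same context, so every observed entry equals $\log 2$; unobserved pairs are simply absent from the dictionary, since their PMI is $-\infty$.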

The probing scores of the baseline (unpruned) models obtained by us are close to the scores from the original papers [66, 26], and are shown in Table 2.1.

| Model | NE | POS | Const. | Struct. |
|---|---|---|---|---|
| CoVe | .921 | .936 | .808 | .726 |

| Model | Similarity | Analogy |
|---|---|---|
| SGNS | .716 | .332 |

Table 2.1: Probing scores for the baseline (unpruned) models. We report the micro-averaged F1 score for POS, NE, and Constituents; undirected unlabeled attachment score (UUAS) for the structural probe; Spearman's correlation with the human ratings for the similarity task; and accuracy for the analogy task.

Probing results are provided in Fig. 2-2 and 2-3, where we scatter-plot validation loss $\ell_i^{\omega}$ vs drop in probing performance $\Delta s_i^{\omega,T}$ for each of the model-probe combinations. First, we note that, in most cases, the probing score correlates with the pretraining loss, which supports the rediscovery hypothesis. We note that the probing score decreases more slowly for some tasks (e.g., POS tagging), but much more steeply for others (e.g., constituent labeling). This is complementary to the findings of [74], who showed that the syntactic learning curve reaches plateau performance with less pretraining data while solving semantic tasks requires more training data. Our results suggest that similar behavior emerges with respect to the model size: simpler tasks (e.g., POS tagging) can be solved with smaller models, while more complex linguistic tasks (e.g., syntactic constituents or dependency parsing) require a bigger model size.

Figure 2-1: Results of applying the LTH (Algorithm 1.3.1) to SGNS and CoVe. In each case we scatter-plot the percentage of remaining weights vs validation loss, which is cross-entropy for CoVe, and a variant of the negative sampling objective for SGNS.

Figure 2-2: Probing results for the CoVe embeddings. Horizontal axes indicate validation loss values, which are cross-entropy values. Vertical axes indicate drops in probing performance. In the case of edge probing, we use the NE, POS, and constituent labeling tasks from the suite of [66] and report the drop in micro-averaged F1 score compared to the baseline (unpruned) model. In the case of structural probing, we use the distance probe of [26] and report the drop in undirected unlabeled attachment score (UUAS).

Figure 2-3: Similarity and analogy results for the SGNS embeddings. For similarities we report the drop in Spearman's correlation with the human ratings and for analogies the drop in accuracy.

Figure 2-4: Zooming in on better performing CoVe models. Indication of the region to be zoomed in (left) and the zoomed-in region (right).

We noticed that when restricting the loss values of CoVe to the better performing end (zoomed-in regions in Figure 2-4), the drops in probing scores (compared to the scores of the baseline unpruned model) become indistinguishable. Assuming that $\ell_i^{\text{CoVe}}$ is the validation loss of CoVe at iteration $i$, and $\Delta s_i^{\text{CoVe},T}$ is the corresponding drop in the score on the probing task $T \in$ {NE, POS, Constituents, Structural}, we obtain pairs $(\ell_i^{\text{CoVe}}, \Delta s_i^{\text{CoVe},T})$ and set up a simple linear regression for the zoomed-in regions in Figure 2-4:

$$\Delta s_i^{\text{CoVe},T} = \alpha + \beta \cdot \ell_i^{\text{CoVe}} + \epsilon_i.$$

Below are $p$-values when testing $H_0: \beta = 0$ with a two-sided Student $t$-test.

| Task | NE | POS | Constituents | Structural |
|---|---|---|---|---|
| p-value | 0.725 | 0.306 | 0.184 | 0.160 |

Table 2.2: Testing $\beta = 0$, reported are $p$-values.

In all cases $\beta$ is not significantly different from 0. Hence, the correlation between model performance and its linguistic knowledge may become very weak for better solutions to the pretraining objectives. We argue that this phenomenon is related to the amount of information remaining in an LM after pruning, and how much of this information is enough for the probe. It turns out that in the best CoVe (with the lowest pretraining loss) such information is redundant for the probe, and in the not-so-good (but still decent) CoVe there is just enough information for the probe to show its best performance.
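The test of $H_0: \beta = 0$ can be sketched as follows. This is a minimal illustration of ordinary least squares with the usual slope $t$-statistic, not the code used for Table 2.2; the function name `slope_t_statistic` and the five (loss, score drop) pairs are hypothetical.

```python
import math

def slope_t_statistic(x, y):
    """Fit y = alpha + beta * x by least squares; return (beta, t, degrees of freedom)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    rss = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    se_beta = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return beta, beta / se_beta, n - 2

# Hypothetical (validation loss, probing score drop) pairs, invented for illustration.
losses = [1.0, 2.0, 3.0, 4.0, 5.0]
drops = [0.1, -0.1, 0.0, 0.1, -0.1]
beta, t, dof = slope_t_statistic(losses, drops)
```

The two-sided $p$-value is the probability that a Student $t$ variable with `dof` degrees of freedom exceeds $|t|$ in absolute value; computing it needs the $t$ distribution, which a statistics library such as SciPy provides (for instance `scipy.stats.linregress` reports this $p$-value directly). A small $|t|$, as in this toy data, corresponds to a slope that is not significantly different from zero.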

## Chapter 3

## An Information-Theoretic Framework

### 3.1 An Information-Theoretic Framework

Recall that the rediscovery hypothesis asserts that neural language models, in the process of their pretraining, rediscover linguistic knowledge. We will prove this claim by contraposition, which states that without linguistic knowledge, neural LMs cannot perform at their best. A recent paper [14] has already investigated how linearly removing certain linguistic information from BERT's layers impacts its accuracy in predicting a masked token, where removing linearly means that a linear classifier cannot predict the required linguistic property with above majority class accuracy. They showed empirically that dependency information, part-of-speech tags, and named entity labels are important for word prediction, while syntactic constituency boundaries (which mark the beginning and the end of a phrase) are not. One of the questions raised by the authors is how to quantify the relative importance of different properties encoded in the representation for the word prediction task. The current section of this work attempts to answer this question: we provide a metric $\rho$ that is a reliable predictor of such importance. This metric occurs naturally when we take an information-theoretic lens and develop a theoretical framework that ties together linguistic properties, word representations, and language modeling performance. We show that when a linguistic property is removed from word vectors, the decline in the quality of a language model depends on how strongly the removed property is interdependent with the underlying text, which is measured by $\rho$: a greater $\rho$ leads to a greater drop.

The proposed metric has an undeniable advantage: its calculation does not require word representations themselves or a pretrained language model. All that is needed is the text and its linguistic annotation. Thanks to our Theorem 1, we can express the influence of a linguistic property on the word prediction task in terms of the coefficient 𝜌.

### 3.1.1 Notation

We will use plain-faced lowercase letters ($x$) to denote scalars and plain-faced uppercase letters ($X$) for random variables. Bold-faced lowercase letters ($\mathbf{x}$) will denote vectors, both random and non-random, in the Euclidean space $\mathbb{R}^d$, while bold-faced uppercase letters ($\mathbf{X}$) will be used for matrices.

Assuming there is a finite vocabulary $\mathcal{W}$, members of that vocabulary are called tokens. A sentence $W_{1:n}$ is a sequence of tokens $W_i \in \mathcal{W}$, that is, $W_{1:n} = [W_1, W_2, \ldots, W_n]$. A linguistic annotation $T$ of a sentence $W_{1:n}$ may take different forms. For example, it may be a sequence of per-token tags $T = [T_1, T_2, \ldots, T_n]$, or a parse tree $T = (\mathcal{V}, \mathcal{E})$ with vertices $\mathcal{V} = \{W_1, \ldots, W_n\}$ and edges $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$. We only require that $T$ is a deterministic function of $W_{1:n}$. Although in reality two people can give two different annotations of the same text due to inherent ambiguity of language or different linguistic theories, we will treat $T$ as the final (also called gold) annotation after disagreements are resolved between annotators and a common reference annotation is agreed upon.

A language model is formulated as the probability distribution $q_\theta(W_i \mid \xi_i) \approx \Pr(W_i \mid C_i)$, where $C_i$ is the context of $W_i$ (see below for different types of context), and $\xi_i$ is the vector representation of $C_i$. The cross-entropy loss of such a model is $\ell(W_i, \xi_i) := \mathbb{E}_{(W_i, C_i)\sim\mathcal{D}}[-\log q_\theta(W_i \mid \xi_i)]$, where $\mathcal{D}$ is the true joint distribution of word-context pairs $(W, C)$.

Discreteness of representations. Depending on the LM, $C_i$ is usually either the left context $[W_1, \ldots, W_{i-1}]$ (for a causal LM) or the bidirectional context $[W_1, \ldots, W_{i-1}, W_{i+1}, \ldots, W_n]$ (for a masked LM). Although the set $\mathcal{C}$ of all such contexts is infinite, it is still countable. Thus the set of all contextual representations $\{\xi : \xi \text{ is a vector representation of } C,\ C \in \mathcal{C}\}$ is also countable. Hence, we treat $\xi$ as a discrete random vector.

### 3.2 Information Theory Background

For a random variable $X$ with the distribution function $p(x)$, its entropy is defined as

$$\mathrm{H}[X] := \mathbb{E}_X[-\log p(X)].$$

The quantity inside the expectation, $-\log p(x)$, is usually referred to as information content or surprisal, and can be interpreted as quantifying the level of “surprise” of a particular outcome $x$. As $p(x) \to 0$, the surprise of observing $x$ approaches $+\infty$, and conversely as $p(x) \to 1$, the surprise of observing $x$ approaches 0. Then the entropy $\mathrm{H}[X]$ can be interpreted as the average level of “information” or “surprise” inherent in the possible outcomes of $X$. We will use the following property.

Property 3.2.1 If $Y = f(X)$, then

$$\mathrm{H}[Y] \le \mathrm{H}[X], \tag{3.1}$$

i.e. the entropy of a variable cannot increase when the latter is passed through a (deterministic) function.
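Property 3.2.1 is easy to check numerically on a small discrete distribution. The sketch below uses base-2 logarithms, so entropies are in bits; the helper names `entropy` and `pushforward` and the example distribution are assumptions made for illustration.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def pushforward(p, f):
    """Distribution of Y = f(X) when X is distributed according to p."""
    out = defaultdict(float)
    for x, q in p.items():
        out[f(x)] += q
    return dict(out)

p_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # X uniform on four outcomes
p_y = pushforward(p_x, lambda x: x % 2)       # Y = X mod 2 merges outcomes pairwise

h_x = entropy(p_x)
h_y = entropy(p_y)
```

Here $X$ carries two bits and $Y = X \bmod 2$ carries one, so the inequality (3.1) holds strictly; for an injective $f$ it would hold with equality.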

Now let $X$ and $Y$ be a pair of random variables with the joint distribution function $p(x, y)$. The conditional entropy of $Y$ given $X$ is defined as

$$\mathrm{H}[Y \mid X] := \mathbb{E}_{X,Y}[-\log p(Y \mid X)],$$

and can be interpreted as the amount of information needed to describe the outcome of $Y$ given that the outcome of $X$ is known. The following property is important for us.

Property 3.2.2 $\mathrm{H}[Y \mid X] = 0 \Leftrightarrow Y = f(X)$.

The mutual information of $X$ and $Y$ is defined as

$$\mathrm{I}[X;Y] := \mathbb{E}_{X,Y}\left[\log \frac{p(X, Y)}{p(X)\cdot p(Y)}\right]$$

and is a measure of mutual dependence between $X$ and $Y$. We will need the following properties of the mutual information.

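Property 3.2.3

1. $\mathrm{I}[X;Y] = \mathrm{H}[X] - \mathrm{H}[X \mid Y] = \mathrm{H}[Y] - \mathrm{H}[Y \mid X]$.
2. $\mathrm{I}[X;Y] = 0 \Leftrightarrow X$ and $Y$ are independent.
3. For a function $f$, $\mathrm{I}[X;Y] \ge \mathrm{I}[X;f(Y)]$.
4. $\mathrm{H}[X, Y] = \mathrm{H}[X] + \mathrm{H}[Y] - \mathrm{I}[X;Y]$.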

Property 3.2.3.3 is known as the data processing inequality, and it means that post-processing cannot increase information. By post-processing we mean a transformation $f(Y)$ of a random variable $Y$, independent of other random variables. In Property 3.2.3.4, $\mathrm{H}[X, Y]$ is the joint entropy of $X$ and $Y$, which is defined as

$$\mathrm{H}[X, Y] := \mathbb{E}_{X,Y}[-\log p(X, Y)]$$

and is a measure of “information” associated with the outcomes of the tuple $(X, Y)$.
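The properties above can be verified on a small joint distribution. The following sketch computes mutual information directly from its definition and then checks Property 3.2.3.4 and the data processing inequality; the joint table and the helper names are hypothetical, and logarithms are base 2.

```python
import math
from collections import defaultdict

def H(p):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(joint, axis):
    out = defaultdict(float)
    for pair, q in joint.items():
        out[pair[axis]] += q
    return dict(out)

def mutual_information(joint):
    """I[X;Y] = E[log p(X,Y) / (p(X) p(Y))] for a joint table {(x, y): probability}."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(q * math.log2(q / (px[x] * py[y]))
               for (x, y), q in joint.items() if q > 0)

def apply_to_y(joint, f):
    """Joint distribution of (X, f(Y))."""
    out = defaultdict(float)
    for (x, y), q in joint.items():
        out[(x, f(y))] += q
    return dict(out)

# Hypothetical joint distribution in which X is determined by Y (X = 0 iff Y < 2).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 2): 0.1, (1, 3): 0.4}

i_xy = mutual_information(joint)
i_xfy = mutual_information(apply_to_y(joint, lambda y: y % 2))
h_x, h_y, h_xy = H(marginal(joint, 0)), H(marginal(joint, 1)), H(joint)
```

Because $X$ is a function of $Y$ in this table, $\mathrm{I}[X;Y] = \mathrm{H}[X]$, which is one bit. Replacing $Y$ by $Y \bmod 2$ discards part of what $Y$ says about $X$, so the mutual information drops, as Property 3.2.3.3 requires.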

### 3.3 Main Result

Our main result is the following theorem.

Theorem 1 Let

1. $\mathbf{x}_i$ be a (contextualized) embedding of a token $W_i$ in a sentence $W_{1:n}$, and denote

$$\sigma_i := \mathrm{I}[W_i;\mathbf{x}_i]/\mathrm{H}[W_{1:n}], \tag{3.2}$$

2. $T$ be a linguistic annotation of $W_{1:n}$, and the dependence between $T$ and $W_{1:n}$ be measured by the coefficient

$$\rho := \mathrm{I}[T;W_{1:n}]/\mathrm{H}[W_{1:n}], \tag{3.3}$$

3. $\rho > 1 - \sigma_i$,

4. $\tilde{\mathbf{x}}_i$ be a (contextualized) embedding of $W_i$ that contains no information on $T$.

Then the decline in the language modeling quality when using $\tilde{\mathbf{x}}_i$ instead of $\mathbf{x}_i$ is approximately supralinear in $\rho$:

$$\ell(W_i, \tilde{\mathbf{x}}_i) - \ell(W_i, \mathbf{x}_i) \gtrapprox \mathrm{H}[W_{1:n}]\cdot\rho + c \tag{3.4}$$

for $\rho > \rho_0$, with constants $\rho_0 > 0$ and $c$ depending on $\mathrm{H}[W_{1:n}]$ and $\mathrm{I}[W_i;\mathbf{x}_i]$.

The proof is given in Section 3.4. Here we provide a less formal argument. Using visualization tricks as in [45] we can illustrate the essence of the proof by Figure 3-1. First of all, look at the entropy $\mathrm{H}[X]$ as the amount of information in the variable $X$. Imagine amounts of information as bars. These bars overlap if there is shared information between the respective variables.

Figure 3-1: Illustration of Theorem 1.

The annotation $T$ and the embedding vector $\tilde{\mathbf{x}}_i$ are derived from the underlying text $W_{1:n}$, thus $W_{1:n}$ contains more information than $T$ or $\tilde{\mathbf{x}}_i$, hence $\mathrm{H}[W_{1:n}]$ fully covers $\mathrm{H}[T]$ and $\mathrm{H}[\tilde{\mathbf{x}}_i]$. Since the mutual information $\mathrm{I}[W_i;\tilde{\mathbf{x}}_i]$ cannot exceed the information in $\tilde{\mathbf{x}}_i$, we can write

$$\mathrm{I}[W_i;\tilde{\mathbf{x}}_i] \le \mathrm{H}[\tilde{\mathbf{x}}_i]. \tag{3.5}$$

Now recall Eq. (3.3): it simply means that $\rho$ is the fraction of information that is left in $T$ after it was derived from $W_{1:n}$. This immediately implies that the information which is not in $T$ (but was in $W_{1:n}$ initially) is equal to $(1-\rho)\cdot\mathrm{H}[W_{1:n}]$. Since the embedding $\tilde{\mathbf{x}}_i$ contains no information about the annotation $T$, $\mathrm{H}[\tilde{\mathbf{x}}_i]$ and $\mathrm{H}[T]$ do not overlap. But this means that

$$\mathrm{H}[\tilde{\mathbf{x}}_i] \le (1-\rho)\cdot\mathrm{H}[W_{1:n}], \tag{3.6}$$

because $\tilde{\mathbf{x}}_i$ is in the category of no information about $T$.

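Combining (3.2), (3.5), (3.6), and the assumption $\rho > 1 - \sigma_i$, we have

$$\mathrm{I}[W_i;\tilde{\mathbf{x}}_i] \le (1-\rho)\cdot\mathrm{H}[W_{1:n}] < \sigma_i\cdot\mathrm{H}[W_{1:n}] = \mathrm{I}[W_i;\mathbf{x}_i],$$

$$\Rightarrow\quad \mathrm{I}[W_i;\mathbf{x}_i] - \mathrm{I}[W_i;\tilde{\mathbf{x}}_i] \ge (\rho + \sigma_i - 1)\cdot\mathrm{H}[W_{1:n}]. \tag{3.7}$$

The inequality (3.7) is almost the required (3.4); it remains to show that the change in mutual information can be approximated by the change in LM loss, which is done in Lemma 3.4.2.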
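As a worked instance of the bound in (3.7), the sketch below evaluates $(\rho + \sigma_i - 1)\cdot\mathrm{H}[W_{1:n}]$ for given values. The function name `mi_drop_lower_bound` and the numbers are hypothetical and chosen only to make the arithmetic transparent; they are not estimates from any corpus.

```python
def mi_drop_lower_bound(rho, sigma, h_text):
    """Lower bound on I[W_i; x_i] - I[W_i; x~_i], stated only for rho > 1 - sigma."""
    if not (0.0 <= rho <= 1.0 and 0.0 <= sigma <= 1.0):
        raise ValueError("rho and sigma must lie in [0, 1]")
    if rho <= 1.0 - sigma:
        raise ValueError("the bound is only stated for rho > 1 - sigma")
    return (rho + sigma - 1.0) * h_text

# Hypothetical values: the text carries 100 bits, the annotation half of them,
# and the token/embedding pair shares three quarters of them.
bound = mi_drop_lower_bound(rho=0.5, sigma=0.75, h_text=100.0)
```

With these made-up numbers the embedding stripped of $T$ can retain at most $(1-\rho)\cdot 100 = 50$ bits, while the original shares 75 bits with the token, so at least 25 bits of mutual information are lost. When $\rho \le 1 - \sigma_i$ the argument gives no guarantee, which is why the function refuses such inputs.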

Role of $\rho$ and $\sigma_i$. Equation (3.3) quantifies the dependence between $T$ and $W_{1:n}$, and it simply means that the annotation $T$ carries $100\cdot\rho\%$ of the information contained in the underlying text $W_{1:n}$ (in the information-theoretic sense). The quantity $\rho$ is well known as the entropy coefficient in the information theory literature [53]. It can be thought of as an analog of the correlation coefficient for measuring not only linear or monotonic dependence between numerical variables but any kind of statistical dependence between any kind of variables (numerical and non-numerical). As we see from Equation (3.4), the coefficient $\rho$ plays a key role in predicting the LM degradation when the linguistic structure $T$ is removed from the embedding $\mathbf{x}_i$. In Section 3.5 we give a practical way of estimating it for the case when $T$ is a per-token annotation of $W_{1:n}$.

Similarly, Equation (3.2) means that both $W_i$ and $\mathbf{x}_i$ carry at least $100\cdot\sigma_i\%$ of the information contained in $W_{1:n}$. By Firth's distributional hypothesis [18], which states that “you shall know a word by the company it keeps”, we assume that $\sigma_i$ significantly exceeds zero.

Range of $\rho$. In general, mutual information is non-negative. The mutual information of the annotation $T$ and the underlying text $W_{1:n}$ cannot exceed the information contained in either of these variables, i.e. $0 \le \mathrm{I}[T;W_{1:n}] \le \mathrm{H}[W_{1:n}]$, and therefore $\rho \in [0,1]$.
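The range of $\rho$ can be illustrated with a deliberately naive estimate on a toy corpus. The sketch below is not the estimation procedure of Section 3.5: it assumes each tag is a function of its token alone, treats tokens as independent draws, and uses plug-in entropies from counts, so that $\mathrm{I}[T;W] = \mathrm{H}[T]$ and $\rho = \mathrm{H}[T]/\mathrm{H}[W]$. The corpus, the tag dictionary, and the function names are hypothetical.

```python
import math
from collections import Counter

def plug_in_entropy(items):
    """Entropy in bits of the empirical distribution of a list of symbols."""
    counts = Counter(items)
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def entropy_coefficient(tokens, tags):
    """Naive rho = H[T] / H[W], valid only when each tag is a function of its token."""
    return plug_in_entropy(tags) / plug_in_entropy(tokens)

# Hypothetical toy corpus with an unambiguous token-to-tag mapping.
tokens = ["the", "cat", "sat", "the", "dog", "ran", "a", "cat"]
tag_of = {"the": "DET", "a": "DET", "cat": "NOUN",
          "dog": "NOUN", "sat": "VERB", "ran": "VERB"}
tags = [tag_of[w] for w in tokens]

rho = entropy_coefficient(tokens, tags)
```

Since the tags are obtained by merging token types, Property 3.2.1 guarantees that the numerator never exceeds the denominator, so the estimate stays within $[0,1]$. A coarser tag set would push it toward 0 and a tag set as fine as the vocabulary would push it to 1.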

Absence of information. When we write “$\tilde{\mathbf{x}}_i$ contains no information on $T$”, this means that the mutual information between $\tilde{\mathbf{x}}_i$ and $T$ is zero:

$$\mathrm{I}[T;\tilde{\mathbf{x}}_i] = 0. \tag{3.8}$$

In the language of [50], Equation (3.8) assumes that all probes, even the best, perform poorly in extracting $T$ from $\tilde{\mathbf{x}}_i$. This essentially means that the information on $T$ has been filtered out of $\tilde{\mathbf{x}}_i$. In practice, we will approximate this with the Iterative Nullspace Projection [55].
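A single projection step of the kind used in nullspace projection can be sketched as follows. The sketch assumes a linear probe with one weight vector `w` and removes from an embedding its component along `w`; the iterative procedure of [55] repeats such a step with freshly trained classifiers, which is not shown here. All values are hypothetical.

```python
def project_to_nullspace(x, w):
    """Return x minus its component along w, so that the result is orthogonal to w."""
    scale = sum(a * b for a, b in zip(w, x)) / sum(a * a for a in w)
    return [a - scale * b for a, b in zip(x, w)]

w = [3.0, 4.0]   # hypothetical weight vector of a linear probe for T
x = [2.0, 1.0]   # hypothetical embedding

x_tilde = project_to_nullspace(x, w)
score = sum(a * b for a, b in zip(w, x_tilde))   # probe score after projection
```

After the projection the probe's score is zero for every input, so this particular linear classifier can no longer separate the classes. Note that this only guards against linear probes, which is why it is an approximation of the condition (3.8) rather than a guarantee of zero mutual information.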

### 3.4 Proof of Theorem 1

The proof is split into two steps: first, in Lemma 3.4.1 we show that the decrease in mutual information $\Delta\mathrm{I} := \mathrm{I}[W_i;\mathbf{x}_i] - \mathrm{I}[W_i;\tilde{\mathbf{x}}_i]$ is $\Omega(\rho)$, and then in Lemma 3.4.2 we approximate $\Delta\mathrm{I}$ by the increase in cross-entropy loss $\ell(W_i,\tilde{\mathbf{x}}_i) - \ell(W_i,\mathbf{x}_i)$.

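Lemma 3.4.1 Let $T$ be an annotation of a sentence $W_{1:n} = [W_1, \ldots, W_i, \ldots, W_n]$ and let $\mathbf{x}_i$, $\tilde{\mathbf{x}}_i$ be (contextualized) word vectors corresponding to a token $W_i$ such that

$$\mathrm{I}[T;\tilde{\mathbf{x}}_i] = 0. \tag{3.9}$$

Denote

$$\rho := \mathrm{I}[T;W_{1:n}]/\mathrm{H}[W_{1:n}], \quad \rho \in [0,1], \tag{3.10}$$

$$\sigma_i := \mathrm{I}[W_i;\mathbf{x}_i]/\mathrm{H}[W_{1:n}], \quad \sigma_i \in [0,1]. \tag{3.11}$$

If $\rho > 1 - \sigma_i$, then

$$\mathrm{I}[W_i;\tilde{\mathbf{x}}_i] < \mathrm{I}[W_i;\mathbf{x}_i], \tag{3.12}$$

and the difference $\Delta\mathrm{I} := \mathrm{I}[W_i;\mathbf{x}_i] - \mathrm{I}[W_i;\tilde{\mathbf{x}}_i]$ is lower-bounded as

$$\Delta\mathrm{I} \ge (\rho + \sigma_i - 1)\cdot\mathrm{H}[W_{1:n}]. \tag{3.13}$$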

From (3.9) and by Property 3.2.3.4, we have

$$H[T, \tilde{\mathbf{x}}_i] = H[T] + H[\tilde{\mathbf{x}}_i] - \underbrace{I[T; \tilde{\mathbf{x}}_i]}_{0} = H[T] + H[\tilde{\mathbf{x}}_i]. \tag{3.14}$$

On the other hand, both the annotation 𝑇 and the word vectors x˜_{𝑖} are obtained from the underlying sentence 𝑊_{1:𝑛}, and therefore 𝑇 = 𝑇(𝑊_{1:𝑛}) and x˜_{𝑖} = x˜_{𝑖}(𝑊_{1:𝑛}), i.e. the tuple (𝑇, x˜_{𝑖}) is a function of 𝑊_{1:𝑛}, and by Property 3.2.1,

H[𝑇,x˜_{𝑖}]≤H[𝑊_{1:𝑛}]. (3.15)

From (3.14) and (3.15), we get

H[𝑇] + H[˜x_{𝑖}]≤H[𝑊_{1:𝑛}]. (3.16)

Since 𝑇 = 𝑇(𝑊_{1:𝑛}) and x˜_{𝑖} = ˜x_{𝑖}(𝑊_{1:𝑛}), by Property 3.2.2 we have H[𝑇 | 𝑊_{1:𝑛}] = 0
and H[˜x_{𝑖} |𝑊_{1:𝑛}] = 0, and therefore

I[𝑇;𝑊_{1:𝑛}] = H[𝑇] − H[𝑇 | 𝑊_{1:𝑛}] = H[𝑇].

Plugging this into (3.16), rearranging the terms, and taking into account (3.10), we get

$$H[\tilde{\mathbf{x}}_i] \le H[W_{1:n}] - \underbrace{I[T; W_{1:n}]}_{\rho \cdot H[W_{1:n}]} = (1-\rho)\cdot H[W_{1:n}]. \tag{3.17}$$

Also, by Property 3.2.3.1,

$$I[W_i; \tilde{\mathbf{x}}_i] = H[\tilde{\mathbf{x}}_i] - \underbrace{H[\tilde{\mathbf{x}}_i \mid W_i]}_{\ge 0} \le H[\tilde{\mathbf{x}}_i]. \tag{3.18}$$

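From (3.11), (3.17), (3.18), and the assumption 𝜌 > 1−𝜎_{𝑖}, we have

I[𝑊_{𝑖}; x˜_{𝑖}] ≤ H[x˜_{𝑖}] ≤ (1−𝜌)·H[𝑊_{1:𝑛}] < 𝜎_{𝑖}·H[𝑊_{1:𝑛}] = I[𝑊_{𝑖};x_{𝑖}],

which implies (3.12). Subtracting the bound I[𝑊_{𝑖}; x˜_{𝑖}] ≤ (1−𝜌)·H[𝑊_{1:𝑛}] from I[𝑊_{𝑖};x_{𝑖}] = 𝜎_{𝑖}·H[𝑊_{1:𝑛}] gives ∆I ≥ (𝜌+𝜎_{𝑖}−1)·H[𝑊_{1:𝑛}], which is (3.13).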
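The lemma can be checked numerically on a tiny example. The sketch below is only an illustration under stated assumptions: the four-sentence corpus, the "embeddings" `x` and `x_tilde`, and all helper names are hypothetical and are not taken from the experiments. All outcomes are equiprobable, so entropies can be computed exactly by counting.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy in bits of a variable given its value on each equiprobable outcome."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_info(a, b):
    """I[A; B] = H[A] + H[B] - H[A, B], computed over the same outcomes."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Hypothetical corpus: four equiprobable two-token sentences.
sentences = [("a", "w1"), ("a", "w2"), ("b", "w3"), ("b", "w4")]
T = [s[0] for s in sentences]    # annotation: a deterministic function of the sentence
W_i = [s[1] for s in sentences]  # the token to be predicted
x = list(W_i)                    # idealized embedding that identifies the token
# Filtered embedding: keeps only what is independent of T (assumed for illustration).
x_tilde = [{"w1": 0, "w2": 1, "w3": 0, "w4": 1}[w] for w in W_i]

h_w = entropy(sentences)                # H[W_{1:n}]
rho = mutual_info(T, sentences) / h_w   # Eq. (3.10)
sigma = mutual_info(W_i, x) / h_w       # Eq. (3.11)
leak = mutual_info(T, x_tilde)          # Eq. (3.9) requires this to be zero
delta_i = mutual_info(W_i, x) - mutual_info(W_i, x_tilde)
bound = (rho + sigma - 1) * h_w         # right-hand side of Eq. (3.13)

print(f"rho={rho}, sigma={sigma}, I[T;x~]={leak}, delta_I={delta_i}, bound={bound}")
```

In this toy case the condition 𝜌 > 1−𝜎_{𝑖} holds and the lower bound (3.13) is attained with equality, which suggests that the bound cannot be improved without further assumptions.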

Equation (3.10) quantifies the dependence between 𝑇 and 𝑊_{1:𝑛}, and it means that 𝑇 carries 100·𝜌% of the information contained in 𝑊_{1:𝑛}. Similarly, Equation (3.11) means that both 𝑊_{𝑖} and x_{𝑖} carry at least 100·𝜎_{𝑖}% of the information contained in 𝑊_{1:𝑛}. By Firth’s distributional hypothesis [18], 𝜎_{𝑖} significantly exceeds zero.

The mutual information I[𝑊_{𝑖};𝜉_{𝑖}] can be interpreted as the performance of the best language model that tries to predict the token 𝑊_{𝑖} given its contextual representation 𝜉_{𝑖}.^{1} Therefore, inequalities (3.12) and (3.13) mean that removing a linguistic structure from word vectors does harm the performance of a language model based on such word vectors, and the more the linguistic structure is interdependent with the underlying text (which is being predicted by the language model), the greater the harm.

^{1} Just as I[𝑇_{𝑖};𝜉_{𝑖}] is interpreted as the performance of the best probe that tries to predict the linguistic property 𝑇_{𝑖} given the embedding 𝜉_{𝑖} [51].

We treat I[𝑊;𝜉] as the performance of a language model. However, in practice, this performance is measured by the validation loss ℓ of such a model, which is usually the cross-entropy loss. Nevertheless, the change in mutual information can be estimated by the change in the LM objective, as we show below.

Lemma 3.4.2 Let x_{𝑖} and x˜_{𝑖} be (contextualized) embeddings of a token 𝑊_{𝑖} in a sentence 𝑊_{1:𝑛} = [𝑊_{1}, . . . , 𝑊_{𝑖}, . . . , 𝑊_{𝑛}]. Let ℓ(𝑊_{𝑖},𝜉_{𝑖}) be the cross-entropy loss of a neural (masked) language model 𝑞_{𝜃}(𝑤_{𝑖} | 𝜉_{𝑖}), parameterized by 𝜃, that provides a distribution over the vocabulary 𝒲 given a vector representation 𝜉_{𝑖} of the context of 𝑊_{𝑖}. Then

I[𝑊_{𝑖};x_{𝑖}]−I[𝑊_{𝑖}; ˜x_{𝑖}]≈ℓ(𝑊_{𝑖},x˜_{𝑖})−ℓ(𝑊_{𝑖},x_{𝑖}). (3.19)

By Property 3.2.3.1, we have

∆I := I[𝑊_{𝑖};x_{𝑖}] − I[𝑊_{𝑖}; x˜_{𝑖}] = H[𝑊_{𝑖}] − H[𝑊_{𝑖} | x_{𝑖}] − H[𝑊_{𝑖}] + H[𝑊_{𝑖} | x˜_{𝑖}]

= H[𝑊_{𝑖} | x˜_{𝑖}] − H[𝑊_{𝑖} | x_{𝑖}] ≈ H_{𝑞}[𝑊_{𝑖} | x˜_{𝑖}] − H_{𝑞}[𝑊_{𝑖} | x_{𝑖}], (3.20)

where H_{𝑞}[𝑊_{𝑖} | 𝜉_{𝑖}], a cross-entropy, is an estimate of H[𝑊_{𝑖} | 𝜉_{𝑖}] obtained when the true distribution 𝑝(𝑤_{𝑖} | 𝜉_{𝑖}) is replaced by a parametric language model 𝑞_{𝜃}(𝑤_{𝑖} | 𝜉_{𝑖}). This estimate is exactly the cross-entropy loss function of an LM 𝑞:

H_{𝑞}[𝑊_{𝑖} | 𝜉_{𝑖}] = E_{(𝑊_{𝑖},𝐶_{𝑖})∼𝒟}[−log 𝑞_{𝜃}(𝑊_{𝑖} | 𝜉_{𝑖})] =: ℓ(𝑊_{𝑖},𝜉_{𝑖}). (3.21)

From (3.20) and (3.21), we have (3.19).
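The approximation step in (3.20) can be illustrated on a toy joint distribution. The following is a minimal sketch under stated assumptions: the two-word vocabulary, the probability tables, and the helper names are hypothetical. It shows that the cross-entropy loss of an LM head equals the conditional entropy when the head matches the true conditional distribution, and exceeds it otherwise, so that with well-fitted heads the loss increase matches ∆I.

```python
import math

def conditional_entropy(joint):
    """H[W | xi] in nats, where joint[xi][w] = p(w, xi)."""
    h = 0.0
    for row in joint.values():
        p_xi = sum(row.values())
        for p in row.values():
            if p > 0:
                h -= p * math.log(p / p_xi)
    return h

def lm_loss(joint, q):
    """Cross-entropy E_{(w, xi) ~ p}[-log q(w | xi)], cf. Eq. (3.21)."""
    return -sum(p * math.log(q[xi][w])
                for xi, row in joint.items() for w, p in row.items())

def true_conditional(joint):
    """The head that minimizes the KL-divergence: q(w | xi) = p(w | xi)."""
    return {xi: {w: p / sum(row.values()) for w, p in row.items()}
            for xi, row in joint.items()}

# Hypothetical joint distributions p(w, xi) for the original and filtered representations.
joint_x = {"ctx_a": {"cat": 0.4, "dog": 0.1}, "ctx_b": {"cat": 0.1, "dog": 0.4}}
joint_x_tilde = {"ctx": {"cat": 0.5, "dog": 0.5}}  # filtered: uninformative about W

delta_i = conditional_entropy(joint_x_tilde) - conditional_entropy(joint_x)
delta_loss = (lm_loss(joint_x_tilde, true_conditional(joint_x_tilde))
              - lm_loss(joint_x, true_conditional(joint_x)))

# A deliberately mis-fitted head: its loss overestimates the conditional entropy.
q_bad = {"ctx_a": {"cat": 0.6, "dog": 0.4}, "ctx_b": {"cat": 0.4, "dog": 0.6}}
gap = lm_loss(joint_x, q_bad) - conditional_entropy(joint_x)

print(f"delta_I={delta_i:.4f}, delta_loss={delta_loss:.4f}, gap of bad head={gap:.4f}")
```

The gap of the mis-fitted head is the expected KL-divergence between the true conditional and the head, which is why the quality of the approximation depends on how well each head is fitted.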

When we approximate H[𝑊_{𝑖} | x_{𝑖}] ≈ H_{𝑞}[𝑊_{𝑖} | x_{𝑖}], the parameters 𝜃 of 𝑞 = 𝑞_{𝜃}(𝑤_{𝑖} | x_{𝑖}) are only the LM head’s parameters (e.g., the weights and biases of the output softmax layer in BERT), and we assume that x_{𝑖} can be any representation (e.g., from the last encoding layer of BERT). We assume that 𝜃 is chosen to minimize the KL-divergence between the true distribution 𝑝(𝑤_{𝑖} | x_{𝑖}) and the model 𝑞_{𝜃}(𝑤_{𝑖} | x_{𝑖}), thus making the approximation reasonable.

Similarly, when we approximate H[𝑊_{𝑖} | x˜_{𝑖}] ≈ H_{𝑞}[𝑊_{𝑖} | x˜_{𝑖}] we again have the LM head 𝑞_{𝜂}(𝑤_{𝑖} | x˜_{𝑖}), whose parameters 𝜂 should be chosen to minimize the KL-divergence between the true 𝑝(𝑤_{𝑖} | x˜_{𝑖}) and the model 𝑞_{𝜂}(𝑤_{𝑖} | x˜_{𝑖}), for the approximation to be reasonable. In the experimental part (Section 3.5), we do exactly this: after applying a removal technique to a representation, we fine-tune the softmax layer (but keep the encoding layers frozen) so that it can adapt to the modified representation.

### 3.5 Experiments

In this section, we empirically verify the prediction of our theory: the stronger the dependence between a linguistic property and the underlying text, the greater the decline in the performance of a language model that does not have access to that property.

Here we focus on only one contextualized embedding model, BERT, because it is the most widely used one. In addition, we keep SGNS as a model of static embeddings.

### 3.5.1 Removal Techniques

INLP [55] is a method of post-hoc removal of some property 𝑇 from the pretrained embeddings x. INLP neutralizes the ability to linearly predict 𝑇 from x (here 𝑇 is a single tag, and x is a single vector). It does so by training a sequence of auxiliary models 𝜏_{1}, . . . , 𝜏_{𝑘} that predict 𝑇 from x, interpreting each one as conveying information on unique directions in the latent space that correspond to 𝑇, and iteratively removing each of these directions. In the 𝑖^{th} iteration, 𝜏_{𝑖} is a linear model parameterized by a matrix U_{𝑖} and trained to predict 𝑇 from x. [55] use the Linear SVM [11], and we follow their setup. When the embeddings are projected onto null(U_{𝑖}) by a projection matrix P_{null(U_{𝑖})}, we have

U_{𝑖}P_{null(U_{𝑖})}x = 0,

i.e. 𝜏_{𝑖} will be unable to predict 𝑇 from P_{null(U_{𝑖})}x. Figure 3-2 illustrates the method for the case when the property 𝑇 has only two types of tags and x ∈ R^{2}. The number of iterations 𝑘 is taken such that no linear classifier achieves above-majority accuracy when predicting 𝑇 from x˜ = P_{null(U_{𝑘})}P_{null(U_{𝑘−1})} . . . P_{null(U_{1})}x. Before measuring the LM loss we allow the model to adapt to the modified embeddings x˜ by fine-tuning its softmax layer that predicts 𝑊 from x˜, while keeping the rest of the model (encoding layers) frozen. This fine-tuning does not bring back information on 𝑇 as the softmax layer linearly separates classes.

Figure 3-2: Nullspace projection for a 2-dimensional binary classifier. The decision boundary of U_{𝑖} is U_{𝑖}’s null-space. Source: [55].
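The iterative projection loop can be sketched in a few lines. This is a minimal illustration, not the setup used in the experiments: it assumes a binary tag, so each U_{𝑖} reduces to a single weight vector, and it swaps in a small logistic-regression probe trained by gradient descent in place of the Linear SVM. The toy data and all function names are hypothetical.

```python
import math

def train_linear_probe(xs, ys, steps=200, lr=0.5):
    """Logistic-regression probe (a stand-in for the linear SVM), zero-initialized."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, xs, ys):
    preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def project_out(xs, w):
    """Apply P_null(w) = I - w w^T / ||w||^2 to every vector."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    u = [wj / norm for wj in w]
    projected = []
    for x in xs:
        coef = sum(uj * xj for uj, xj in zip(u, x))
        projected.append([xj - coef * uj for xj, uj in zip(x, u)])
    return projected

def inlp(xs, ys):
    """Remove probe directions until the probe is at (or below) majority accuracy."""
    majority = max(ys.count(0), ys.count(1)) / len(ys)
    directions = []
    for _ in range(len(xs[0])):  # at most dim directions can be removed
        w, b = train_linear_probe(xs, ys)
        if accuracy(w, b, xs, ys) <= majority or not any(w):
            break
        directions.append(w)
        xs = project_out(xs, w)
    return xs, directions

# Hypothetical 2-D embeddings: the first coordinate encodes the tag, the second does not.
X = [[2.0, 1.0], [2.0, -1.0], [3.0, 1.0], [3.0, -1.0],
     [-2.0, 1.0], [-2.0, -1.0], [-3.0, 1.0], [-3.0, -1.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
X_filtered, removed = inlp(X, y)
print(X_filtered[:2], len(removed))
```

On this toy input the tag-bearing first coordinate is projected out while the uninformative second coordinate is left intact. With multi-class tags, each iteration would remove the whole row space of U_{𝑖} rather than a single direction.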

Tasks

We perform experiments on two real tasks and a series of synthetic tasks. In real tasks, we cannot arbitrarily change 𝜌—we can only measure it for some given annotations.

Although we have several real annotations that give some variation in the values of 𝜌, we want more control over this metric. Therefore, we come up with synthetic annotations that allow us to smoothly change 𝜌 in a wider range and track the impact on the loss function of a language model.

To remove a property 𝑇 from embeddings, INLP requires an annotated dataset. Therefore, for INLP we use gold annotations.

Real tasks. We consider two real tasks with per-token annotations: part-of-speech tagging (POS) and named entity labeling (NER). The choice of per-token annotations is guided by the suggested method of estimating 𝜌 (Section 3.5.1).

POS is the syntactic task of assigning tags such as noun, verb, adjective, etc. to individual tokens. We consider POS tagged datasets that are annotated with different tagsets:

• Universal Dependencies English Web Treebank [63], which uses two annotation schemas: UPOS corresponding to 17 universal tags across languages, and FPOS which encodes additional lexical and grammatical properties.

• English part of the OntoNotes corpus [70] based on the Penn Treebank annotation schema [36].

NER is the task of predicting the category of an entity referred to by a given token, e.g. does the entity refer to a person, a location, an organization, etc. This task is taken from the English part of the OntoNotes corpus. Table 3.1 reports some statistics on the size of different tagsets.

| UD UPOS | UD FPOS | ON POS | ON NER |
|---|---|---|---|
| 17 | 50 | 48 | 66 |

Table 3.1: Size of the tagsets for Universal Dependencies and OntoNotes PoS tagging and NER datasets.

Synthetic tasks are created as follows. Let 𝑇^{(0)} be an annotation of a corpus 𝑊_{1:𝑛}, which has 𝑚 unique tags, and let the corresponding 𝜌^{(0)} := I[𝑇^{(0)};𝑊_{1:𝑛}]/H[𝑊_{1:𝑛}]. We select the two least frequent tags from the tagset, and conflate them into one tag. This gives us an annotation 𝑇^{(1)} which contains less information about 𝑊_{1:𝑛} than the annotation 𝑇^{(0)}, and thus has 𝜌^{(1)} < 𝜌^{(0)}. In Table 3.2 we give an example of such conflation for a POS-tagged sentence from the OntoNotes corpus [70].

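| 𝑊_{1:9} | When | a | browser | starts | to | edge | near | to | consuming |
|---|---|---|---|---|---|---|---|---|---|
| 𝑇^{(0)} | WRB | DT | NN | VBZ | TO | VB | RB | IN | VBG |
| 𝑇^{(1)} | X | DT | NN | VBZ | X | VB | RB | IN | VBG |

| 𝑊_{10:18} | 500 | MB | of | RAM | on | a | regular | basis | , |
|---|---|---|---|---|---|---|---|---|---|
| 𝑇^{(0)} | CD | NNS | IN | NN | IN | DT | JJ | NN | , |
| 𝑇^{(1)} | CD | NNS | IN | NN | IN | DT | JJ | NN | , |

| 𝑊_{19:22} | something | is | wrong | . |
|---|---|---|---|---|
| 𝑇^{(0)} | NN | VBZ | JJ | . |
| 𝑇^{(1)} | NN | VBZ | JJ | . |

Table 3.2: Example of conflating two least frequent tags (WRB and TO) into one tag (X).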

Next, we select the two least frequent tags from the annotation 𝑇^{(1)} and conflate them. This will give an annotation 𝑇^{(2)} with 𝜌^{(2)} < 𝜌^{(1)}. Iterating this process 𝑚−1 times we will end up with the annotation 𝑇^{(𝑚−1)} that tags all tokens with a single (most frequent) tag. In this last iteration, the annotation has no mutual information with 𝑊_{1:𝑛}, i.e. 𝜌^{(𝑚−1)} = 0.
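The conflation procedure can be sketched as follows. This is an illustrative sketch only: the tag sequence is hypothetical, ties between equally frequent tags are broken alphabetically, and the merged tag is named by joining the two original names (instead of X as in Table 3.2) so that successive merges stay distinguishable. These choices are assumptions, not details taken from the experiments. The unigram entropy is printed only as a rough indicator that each step discards information.

```python
import math
from collections import Counter

def conflate_once(tags):
    """Merge the two least frequent tags into a single new tag."""
    counts = Counter(tags)
    if len(counts) < 2:
        return list(tags)
    # Sort by (frequency, name): alphabetical tie-breaking is an assumption.
    rare = sorted(counts, key=lambda t: (counts[t], t))[:2]
    merged = "+".join(sorted(rare))
    return [merged if t in rare else t for t in tags]

def unigram_entropy(tags):
    """Entropy (bits) of the empirical tag distribution."""
    n = len(tags)
    return -sum(c / n * math.log2(c / n) for c in Counter(tags).values())

def synthetic_annotations(tags):
    """Return the series [T^(0), T^(1), ..., T^(m-1)]."""
    series = [list(tags)]
    while len(set(series[-1])) > 1:
        series.append(conflate_once(series[-1]))
    return series

# Hypothetical annotation with m = 5 unique tags.
toy_tags = ["NN"] * 5 + ["DT"] * 4 + ["VBZ"] * 3 + ["TO"] * 2 + ["WRB"]
series = synthetic_annotations(toy_tags)
for step, annotation in enumerate(series):
    print(step, sorted(set(annotation)), round(unigram_entropy(annotation), 3))
```

The series has 𝑚 annotations (𝑚−1 conflation steps), and the last one assigns the same tag to every token, so its entropy is zero.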

Experimental Setup

We remove (pseudo-)linguistic structures (Section 3.5.1) from BERT and SGNS embeddings using the methods from Section 3.5.1, and measure the decline in the language modeling performance. INLP is applied to the last layers of BERT and SGNS. Let ∆ℓ_{𝜔,𝑇,𝜇} denote the increase in the LM loss of the model 𝜔 ∈ {BERT, SGNS} when information on 𝑇 ∈ {Synthetic, POS, NER} is removed from 𝜔 using the removal method 𝜇 ∈ {INLP}. We compare |∆ℓ_{𝜔,𝑇,𝜇}| against 𝜌 defined by (3.3), which is the strength of interdependence between the underlying text 𝑊_{1:𝑛} and its annotation 𝑇. By Theorem 1, ∆ℓ_{𝜔,𝑇,𝜇} is Ω(𝜌) for any combination of 𝜔, 𝑇, 𝜇.

Estimating 𝜌. Recall that 𝜌 := I[𝑇;𝑊_{1:𝑛}]/H[𝑊_{1:𝑛}] (Eq. 3.3) and that the annotation 𝑇 is a deterministic function of the underlying text 𝑊_{1:𝑛} (Sec. 3.1.1). In this case, we can write

$$\rho = \frac{H[T] - \overbrace{H[T \mid W_{1:n}]}^{0}}{H[W_{1:n}]} = \frac{H[T]}{H[W_{1:n}]}, \tag{3.22}$$

and when 𝑇 is a per-token annotation of 𝑊_{1:𝑛}, i.e. 𝑇 = 𝑇_{1:𝑛} (which is the case for the annotations that we consider), this becomes 𝜌 = H[𝑇_{1:𝑛}]/H[𝑊_{1:𝑛}]. Thus to estimate 𝜌, we simply need to be able to estimate the latter two entropies. This can be done by training an autoregressive sequence model, such as an LSTM, on 𝑊_{1:𝑛} and on 𝑇_{1:𝑛}. The loss function of such a model, the cross-entropy loss, serves as an estimate of the required entropy. Note that we cannot use masked LMs for this estimation as they do not give a proper factorization of the probability 𝑝(𝑤_{1}, . . . , 𝑤_{𝑛}). We therefore chose the AWD-LSTM-MoS model of [73], which is a compact and competitive LM that can be trained in a reasonable time and with moderate computing resources.

In addition, we also estimated the entropies through a vanilla LSTM with tied input and output embeddings [27], and a Kneser-Ney 3-gram model [44] to test how strongly our method depends on the underlying sequence model.
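The estimation recipe can be sketched with any autoregressive sequence model. The example below is a simplified illustration, not the models used here: it substitutes an add-one smoothed bigram model for AWD-LSTM-MoS, LSTM, or Kneser-Ney, assumes a closed vocabulary, and runs on a hypothetical toy corpus. All function names are illustrative.

```python
import math
from collections import Counter

def bigram_cross_entropy(train, heldout, alpha=1.0):
    """Per-token cross-entropy (bits) of an add-alpha smoothed bigram model."""
    vocab = set(train) | set(heldout)  # closed-vocabulary simplification
    context_counts = Counter(train[:-1])
    bigram_counts = Counter(zip(train[:-1], train[1:]))
    total = 0.0
    for prev, cur in zip(heldout[:-1], heldout[1:]):
        p = (bigram_counts[(prev, cur)] + alpha) / (context_counts[prev] + alpha * len(vocab))
        total -= math.log2(p)
    return total / (len(heldout) - 1)

def estimate_rho(train_words, train_tags, heldout_words, heldout_tags):
    """rho ~ H[T_{1:n}] / H[W_{1:n}], each entropy estimated by a held-out cross-entropy."""
    h_tags = bigram_cross_entropy(train_tags, heldout_tags)
    h_words = bigram_cross_entropy(train_words, heldout_words)
    return h_tags / h_words

# Hypothetical toy corpus with a per-token POS-like annotation.
pattern_words = "the dog barks the cat sleeps a dog sleeps a cat barks".split()
pattern_tags = ["DT", "NN", "VBZ"] * 4
train_w, train_t = pattern_words * 20, pattern_tags * 20
held_w, held_t = pattern_words * 5, pattern_tags * 5

rho_pos = estimate_rho(train_w, train_t, held_w, held_t)
rho_constant = estimate_rho(train_w, ["X"] * len(train_w), held_w, ["X"] * len(held_w))
rho_identity = estimate_rho(train_w, train_w, held_w, held_w)
print(rho_constant, round(rho_pos, 3), rho_identity)
```

Because both sequences have the same length, the ratio of per-token cross-entropies equals the ratio of sequence-level ones. The two extreme annotations (a constant tag and the text itself) recover the endpoints 0 and 1 of the range of 𝜌.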

Limitations. The suggested method of estimating 𝜌 through autoregressive sequence models is limited to per-token annotations only. However, according to formula (3.22), to estimate 𝜌 for deeper annotations, it is sufficient to be able to estimate the entropy H[𝑇] of such deeper linguistic structures 𝑇. For example, to estimate the entropy of a parse tree, one can use the cross-entropy produced by a probabilistic parser. The only limitation is the determinism of the annotation process.

Amount of information to remove. INLP has a hyperparameter that controls how much of the linguistic information 𝑇 is removed from the word vectors x—the number of iterations. Following [14] we keep iterating the INLP procedure until the performance of a linear probe that predicts 𝑇 from the filtered embeddings x˜ drops to the majority-class accuracy. When this happens we treat the resulting filtered embeddings x˜ as containing no information on 𝑇.

Optimization details. For the INLP experiments we use pretrained BERT-Base from HuggingFace [71], and an SGNS pretrained in-house [5].

### 3.5.2 Results

Real tasks. Table 3.3 reports the increase in the loss of the pretrained LM when linguistic information (POS or NER) is removed from the pretrained model.

Figure 3-3: Loss increase w.r.t. INLP iteration for different tasks. ON stands for OntoNotes, UD for Universal Dependencies. The UD EWT dataset has two types of POS annotation: coarse tags (UPOS) and fine-grained tags (FPOS).

Figure 3-3 shows how quickly and how much the pre-training loss grows when applying the INLP procedure for various linguistic tasks. For each task, we show the minimum number of iterations at which the probing accuracy drops to (or below) the level of the majority accuracy.

Figure 3-4: INLP Dynamics for SGNS. ON stands for OntoNotes, UD for Universal Dependencies. The UD EWT dataset has two types of POS annotation: coarse tags (UPOS) and fine-grained tags (FPOS).

We illustrate how pre-training loss and probing accuracy change with respect to INLP iteration in Figures 3-4 and 3-5. The number of directions being removed at each INLP iteration is equal to the number of tags in the respective task.

First, we compare the UPOS tagset with the FPOS tagset: intuitively, FPOS should have a tighter link with the underlying text, and therefore result in a higher 𝜌 and, as a consequence, in a greater increase in loss after this information is removed from the word representations. This is confirmed by the numbers reported in Table 3.3.

We also see a greater LM performance drop when the POS information is removed from the models compared to NE information removal. This is in line with Theorem 1, as POS tags depend more strongly on the underlying text than NE labels, as measured by 𝜌.

Figure 3-5: INLP Dynamics for BERT.

Finally, we see that although 𝜌 indeed depends on the underlying sequence model that is used to estimate the entropies H[𝑇_{1:𝑛}] and H[𝑊_{1:𝑛}], all models (AWD-LSTM-MoS, LSTM, and KN-3) preserve the relative order for the annotations that we consider. E.g., all models indicate that the OntoNotes NER annotation is the least interdependent with the underlying text, while the OntoNotes POS annotation is the most interdependent one. In addition, it turns out that for a quick estimate of 𝜌, one can use the KN-3 model, which on a modern laptop calculates the entropy of texts of 100 million tokens in a few minutes, in contrast to the LSTM, which takes several hours on a modern GPU.

| Annotation 𝑇 | ON NER | UD UPOS | UD FPOS | ON POS |
|---|---|---|---|---|
| 𝜌 (AWD-LSTM-MoS) | 0.18 | 0.32 | 0.36 | 0.42 |
| 𝜌 (LSTM) | 0.18 | 0.32 | 0.37 | 0.42 |
| 𝜌 (KN-3) | 0.18 | 0.36 | 0.41 | 0.42 |
| ∆ℓ_{BERT,𝑇,𝜇} | 0.13 | 0.54 | 0.70 | 0.87 |
| ∆ℓ_{SGNS,𝑇,𝜇} | 1.33 | 1.62 | 1.79 | 2.04 |

Table 3.3: Results on POS tagging and NER tasks, with INLP as the removal method 𝜇. ON stands for OntoNotes, UD for Universal Dependencies. The UD EWT dataset has two types of POS annotation: coarse tags (UPOS) and fine-grained tags (FPOS). KN-3 is a Kneser-Ney 3-gram model.

Synthetic tasks. To obtain synthetic data, we apply the procedure described in Sect. 3.5.1 to the OntoNotes POS annotation as it has the highest 𝜌 in Table 3.3 and thus allows us to vary the metric in a wider range.

The results of evaluation on the synthetic tasks through the INLP are provided in Figure 3-6. They validate the predictions of our theory: for the annotations with greater 𝜌 there is a bigger drop in the LM performance (i.e. increase in the LM loss) when the information on such annotations is removed from the embeddings. We notice that ∆ℓ is piecewise-linear in 𝜌 with the slope changing at 𝜌 ≈ 0.4. We attribute this change to the following: for 𝜌 < 0.4, the majority class (i.e. the most frequent tag) is the tag that encapsulates several conflated tags (see Subsection 3.5.1 for details), while for 𝜌 > 0.4, the majority class is the NN tag. This switch causes a significant drop in the majority class accuracy, which in turn causes a significant increase in the number of INLP iterations needed to reach that accuracy, and hence an increase in the amount of information being removed, which implies greater degradation of the LM performance.

Figure 3-6: Synthetic task results for INLP. ∆ℓ is the increase in cross-entropy loss when pseudo-linguistic information is removed from BERT’s last layer with the INLP procedure. 𝜌 is estimated with the help of the AWD-LSTM-MoS model [73] as described in Section 3.5.1.

### 3.6 Related Work

Theoretical analysis. Since the success of early word embedding algorithms like SGNS and GloVe, there have been attempts to analyze theoretically the connection between their pretraining objectives and performance on downstream tasks such as word similarity and word analogy tasks. An incomplete list of such attempts includes those of [33, 4, 23, 20, 67, 15, 2]. Most of these works represent pretraining as a low-rank approximation of some co-occurrence matrix (such as PMI) and then use the empirical fact that the set of columns (or rows) of such a matrix is already a good solution to the analogy and similarity tasks. Recently, we have seen a growing number of works devoted to the theoretical analysis of contextualized embeddings. [29] showed that modern embedding models, as well as the old warrior SGNS, maximize an objective function that is a lower bound on the mutual information between different parts of the text. [32] formalized how solving certain pretraining tasks allows learning representations that provably decrease the sample complexity of downstream supervised tasks. Of particular interest is a recent paper by [60] that relates the pretraining performance of an autoregressive LM to its performance on downstream tasks that can be reformulated as next word prediction tasks. The authors showed that for such tasks, if the pretraining objective is 𝜖-optimal,^{2} then the downstream objective of a linear classifier is 𝒪(√𝜖)-optimal. In Section 3.1 we prove a similar statement, but the difference is that we study how the removal of linguistic information affects the pretraining objective and our approach is not limited to downstream tasks that can be reformulated as next word prediction.

^{2} [60] say that the pre-training loss ℓ is 𝜖-optimal if ℓ − ℓ^{*} ≤ 𝜖, where ℓ^{*} is the minimum achievable loss.

Probing. Early work on probing tried to analyze LSTM language models [34, 62, 1, 10, 22, 30, 66]. Moreover, word similarity [17] and word analogy [43] tasks can be regarded as non-parametric probes of static embeddings such as SGNS [41] and GloVe [46]. Recently the probing approach has been used mainly for the analysis of contextualized word embeddings. [26] for example showed that entire parse trees can be linearly extracted from ELMo’s [47] and BERT’s [12] hidden layers. [66] probed contextualized embeddings for various linguistic phenomena and showed that, in general, contextualized embeddings improve over their non-contextualized counterparts largely on syntactic tasks (e.g., constituent labeling) in comparison to semantic tasks (e.g., coreference). The probing methodology has also shown that BERT learns some reflections of semantics [57] and factual knowledge [49] into the linguistic form which are useful in applications such as word sense disambiguation and question answering respectively. [74] analyzed how the quality of representations in a pretrained model evolves with the amount of pretraining data. They performed extensive probing experiments on various NLU tasks and found that pretraining with 10M sentences was already able to solve most of the syntactic tasks, while it required 1B training sentences to be able to solve tasks requiring semantic knowledge (such as Named Entity Labeling, Semantic Role Labeling, and some others as defined by [66]).

[14] propose to look at probing from a different angle via amnesic probing, which is defined as the drop in performance of a pretrained LM after the relevant linguistic information is removed from one of its layers. The notion of amnesic probing fully relies on the assumption that the amount of the linguistic information contained in the pretrained vectors should correlate with the drop in LM performance after this information is removed. In this work (Section 3.1) we theoretically prove this assumption. While [14] measured LM performance as word prediction accuracy, we focus on the native LM cross-entropy loss. In addition, we answer one of the questions raised by the authors on how to measure the influence of different linguistic properties on the word prediction task: we provide an easy-to-estimate metric that does exactly this.

Criticism of probing. The probing approach has been criticized from different angles. Our attempt to systematize this line of work is given in Figure 3-7. The seminal paper of [25] raises the issue of separation between extracting linguistic structures from contextualized embeddings and learning such structures by the probes themselves. This dichotomy was challenged by [51, 37], but later validated by [76] using an information-theoretic view on probing. Meanwhile, methods were proposed that take into account not only the probing performance but also the ease of extracting linguistic information [69] or the complexity of the probing model [50]. At the same time, [72] and [39] suggested avoiding learnability issues by non-parametric probing^{3} and weak supervision respectively. The remainder of the criticism is directed at the limitations of probing such as insufficient reliability for low-resourced languages [13], lack of evidence that probes indeed extract linguistic structures rather than learn from the linear context only [31], lack of correlation with fine-tuning scores [64] and with pretraining scores [56, 14]. The first part of our work (Chapter 2) partly falls into this latter group, as we did not find any evidence for a correlation between probing scores and pretraining objectives for better performing CoVe [38].

Figure 3-7: Criticism and improvement of the probing methodology (diagram with two groups of works: avoiding learnability issues, and limitations of probing). An arrow 𝐴 → 𝐵 means that 𝐵 criticizes and/or improves 𝐴.

^{3} Parametric probes transform embeddings to linguistic structures using parameterized operations, whereas non-parametric probes use non-parameterized operations on vectors (such as vector addition/subtraction, inner product, Euclidean distance, etc.). The approach of [72] builds a so-called impact matrix and then feeds it into a graph-based algorithm to induce a dependency tree, all done without learning any parameters.

Pruning language models. A recent work by [21] compressed BERT using conventional pruning and showed a linear correlation between pretraining loss and downstream task accuracy. [9] pruned pretrained BERT with LTH and fine-tuned it to downstream tasks, while [52] pruned fine-tuned BERT with LTH and then re-fine-tuned it. [59] showed that the weights needed for specific tasks are a small subset of the weights needed for masked language modeling, but they prune during fine-tuning, which is beyond the scope of our work. [75] propose to learn the masks of the pretrained LM as an alternative to fine-tuning on downstream tasks and show that it is possible to find a subnetwork of a large pretrained model that reaches performance on the downstream tasks comparable to fine-tuning on these tasks. Generally speaking, the findings of the above-mentioned papers are aligned with our findings that the performance of pruned models on downstream tasks is correlated with the pretraining loss. The one difference from Chapter 2 of our work is that most of the previous work looks at the performance of fine-tuned pruned models. In our work, we probe pruned models, i.e. the remaining weights of language models are not adjusted to the downstream probing task. It is not obvious whether the conclusions from the former should carry over to the latter.

## Chapter 4 Conclusion

In this work, we tried to better understand the phenomenon of the rediscovery hypothesis in pretrained language models. Our main contribution is two-fold: we demonstrate that the rediscovery hypothesis

• holds for SGNS and CoVe models even when a significant amount of weights is pruned (in English);

• can be formally defined within an information-theoretic framework and proved (assuming that the linguistic annotation is a deterministic function of the underlying text and that the annotation is sufficiently interdependent with the text).

First (Chapter 2), we performed probing of different pruned instances of the original models. If models are overparametrized, then it could be that the pruned model only keeps the connections that are important for the pretraining task, but not for auxiliary tasks like probing. Our experiments show that there is a correlation between the pretrained model’s cross-entropy loss and probing performance on various linguistic tasks. We believe that such correlation can be interpreted as strong evidence in favor of the rediscovery hypothesis.