Causal Conversations - with Basse and Bojinov
Correlation and Causation: Are regressions always wrong?
Mon, 28 Jun 2021
<p>If you hang around causal folks for long enough, you’ll probably hear a
version of the sentence: “you can’t just run a regression on observational
data” sooner or later. Is that really true? Statisticians (and other
scientists) have been running regressions for a very long time; surely it
can’t be <em>that</em> bad? Let’s have a closer look!</p>
<h2 id="two-stories">Two stories</h2>
<h3 id="a-first-example">A first example</h3>
<p>Let’s say you’re working for an online music streaming service that
recently introduced a paid premium subscription service that allows
members to listen to music ad-free. The premium account model was
rolled out a few months ago and you wish to assess its impact on the
listening habits of users. You dig into the logs collected
by the service and you find something disturbing: it looks like subscribing
to the premium account <strong>reduces</strong> (rather than increases) the listening time,
adjusting for the number of playlists a user has created. At the next team
meeting, you present the following plot:</p>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/noncausal-fit.jpg" />
</div>
<a href="/image/noncausal-fit.jpg" itemprop="contentUrl"></a>
<figcaption><h4>Observed data and fit.</h4>
</figcaption>
</figure>
</div>
<p>You explain that you ran the regression</p>
<p>\[
Y_i = \alpha + \tau \, Z_i + \beta \, X_i + \epsilon_i
\]</p>
<p>and found that the estimated effect of signing up for a premium account is \(\hat{\tau}^{ols} = -3.24\) hours of
the member’s weekly listening time. What’s worse is that this estimate is
statistically significantly different from \(0\)!</p>
<p>Thankfully one of your colleagues points out that among users with any
given number of playlists (\(X_i=x\)), older ones are both more likely
to be premium subscribers, and also likely to spend less time listening
to music than younger users: in other words, age is likely to be a
<em>confounder</em>! Indeed, you dig a little more and plot \(Y\) against \(X\)
again, but this time you overlay the age \(U\):</p>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/correlation-age.jpg" />
</div>
<a href="/image/correlation-age.jpg" itemprop="contentUrl"></a>
<figcaption><h4>Illustrating age as a confounder.</h4>
</figcaption>
</figure>
</div>
<p>This lends support to your colleague’s suspicion. In fact, we know the
ground truth (since, well, this is a fictional example and we generated
the data!). Indeed, the data was generated so that \(Y_i(1) = Y_i(0)\);
that is, the membership has absolutely <strong>no effect</strong> on the listening time!</p>
<p>Clearly, “just running a regression” did not work out so well. Perhaps
those causal folks are onto something.</p>
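<p>To make the story concrete, here is a minimal simulation sketch in R. It is not the code behind the plots above: the variable names, coefficients, and sample size are all assumptions chosen for illustration. Listening time depends on playlists and age but not on premium status, so \(Y_i(1) = Y_i(0)\) by construction, yet the regression that omits age reports a clearly negative "effect".</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Hypothetical simulation: every name and number below is an assumption.
set.seed(1)
n <- 5000
age <- runif(n, 18, 70)                          # confounder U (unmeasured in the story)
playlists <- rpois(n, 10)                        # covariate X
premium <- rbinom(n, 1, plogis((age - 44) / 8))  # Z: older users subscribe more often
# Weekly listening hours Y depend on X and U but NOT on Z, so Y(1) = Y(0).
hours <- 30 + 0.5 * playlists - 0.3 * age + rnorm(n, sd = 2)

# Regression from the post: age is omitted.
tau_naive <- coef(lm(hours ~ premium + playlists))["premium"]
# Same regression, with the confounder added.
tau_adjusted <- coef(lm(hours ~ premium + playlists + age))["premium"]
print(c(naive = unname(tau_naive), adjusted = unname(tau_adjusted)))
</code></pre></div>
<p>In this sketch the coefficient on <code>premium</code> is strongly negative when age is left out and close to zero once age is included, which mirrors what happened at the team meeting.</p>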
<h3 id="a-second-example">A second example</h3>
<p>One of your teammates working on the infrastructure noticed your plot and
was very interested. Indeed, the infrastructure team is having a hard time
figuring out how quickly they should increase their streaming capacity: too
quickly and they are wasting money; too slowly and they are providing
suboptimal bandwidth to the users. Based on your plot, she wonders whether
she could use the number of playlists (\(X\)) and the premium membership
status (\(Z\)) to <strong>predict</strong> the amount of streaming time per user (\(Y\)),
which would be helpful in calibrating the bandwidth. Having learnt your lesson,
you flatly assert that regressions are evil, and that nothing good can come
of them.</p>
<p>Your colleague is persistent though. She fits the regression on the available
data, then waits patiently for new data to come in to check her predictions
against those new observations; the results are displayed in the figure below.</p>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/correlation_pred.jpg" />
</div>
<a href="/image/correlation_pred.jpg" itemprop="contentUrl"></a>
<figcaption><h4>Checking quality of predictions</h4>
</figcaption>
</figure>
</div>
<p>Well, “sorry dude but it just works,” she says, as she goes back to her office
to use her predictions to improve the infrastructure.</p>
<p>You scratch your head in confusion… so regression works?</p>
<h2 id="correlation-vs-causation">Correlation vs Causation</h2>
<h3 id="what-happened">What happened?</h3>
<p>So, what happened there? In the first example, regression gave us the wrong answer;
in the second example, it gave us the right answer. Why?</p>
<p>Well, in the first example, you asked a <strong>causal question</strong>: what
would be the causal effect of giving everyone a premium
subscription? As we saw in our <a href="/post/causal-inf-intro">previous post</a>,
answering such questions with observational data (as opposed to data
from a randomized experiment) requires care: in particular, it
requires that one adjust for all confounders (we will discuss
observational studies in more detail in another post).</p>
<p>The second example, however, has nothing to do with causal inference:
we asked a purely <strong>predictive question</strong>. Now, of course, you still
need to be mindful of the usual prediction challenges (e.g. the
bias-variance tradeoff, overfitting, etc…), but unmeasured confounders
have no special role here.</p>
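<p>The same point can be checked with a small simulation. This is a sketch under assumed numbers, not the analysis behind the figure above: we fit the regression without age on one sample, then score its predictions on fresh data drawn from the same (hypothetical) population.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Hypothetical data generator: names and coefficients are assumptions.
set.seed(2)
simulate <- function(n) {
  age <- runif(n, 18, 70)
  playlists <- rpois(n, 10)
  premium <- rbinom(n, 1, plogis((age - 44) / 8))
  hours <- 30 + 0.5 * playlists - 0.3 * age + rnorm(n, sd = 2)
  data.frame(hours, premium, playlists)   # age is deliberately left out
}
train <- simulate(2000)
test <- simulate(2000)   # "new data" from the same population

fit <- lm(hours ~ premium + playlists, data = train)
rmse <- function(error) sqrt(mean(error^2))
rmse_model <- rmse(test$hours - predict(fit, newdata = test))
rmse_baseline <- rmse(test$hours - mean(train$hours))   # predict the training mean
print(c(model = rmse_model, baseline = rmse_baseline))
</code></pre></div>
<p>The regression beats the mean-only baseline out of sample even though its coefficient on <code>premium</code> has no causal meaning. The check relies on the new data coming from the same population as the training data, which is one of the usual prediction caveats.</p>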
<h3 id="takeaway">Takeaway</h3>
<p>So can you “just run a regression”? It depends on the type of
question you’re asking. If you’re asking a causal question, then you
should be worrying about unmeasured confounders (among other things);
if you’re asking a predictive question (or indeed, a simple
descriptive question), then you can use your usual tools without being
paranoid about unmeasured confounders.</p>
A Leader's Guide to Causal Inference
Mon, 24 May 2021
<h2 id="what-is-causal-inference">What is causal inference?</h2>
<p>Causal inference is a vast field that seeks to address questions relating causes to effects. We more formally (but still not very rigorously) define causal inference as the study of how a <strong>treatment</strong> (aka <strong>action</strong> or <strong>intervention</strong>) affects outcomes of interest relative to an alternate <strong>treatment</strong>. Often, one of the treatments represents a baseline or status quo: it is then called a <strong>control</strong>. Academics have used causal reasoning for over a century to establish many scientific findings we now consider as facts. In the past decade, there has been a rapid increase in the adoption of causal thinking by firms, and it is now an integral part of data science.</p>
<blockquote>
<p><strong>Examples: causal questions</strong></p>
<ol>
<li>What is the effect on the throughput time (outcome) of introducing a new drill to a production line (treatment) relative to the current process (control)?</li>
<li>What is the effect of changing the text on the landing page’s button (treatment) on the clickthrough rate (outcome), relative to the current text (control)?</li>
<li>Which of two hospital admission processes (treatment 1 vs treatment 2) leads to the better health results (outcome)?</li>
</ol>
</blockquote>
<p>The most reliable way to establish causal relationships is to run a randomized experiment. Different fields have different names for these, including A/B tests, clinical trials, and randomized controlled trials, but the idea is the same: a randomized experiment involves randomly assigning subjects (e.g., customers, divisions, companies) to receive either a treatment or a control intervention. The effectiveness of the treatment is then assessed by contrasting the outcomes of the treated subjects with the outcomes of the control subjects.</p>
<p>The simple idea of running experiments has had a profound impact on how managers make decisions, as it allows them to discern their customers’ preferences, evaluate their initiatives, and ultimately test their hypotheses. Experimentation is now an integral part of the product development process at most technology companies. It is increasingly being adopted by non-technology companies as well, as they recognize that experimentation allows managers to continuously challenge their working hypotheses and perform pivots that ultimately lead to better innovations.</p>
<p>Unfortunately, we cannot always run experiments because of ethical concerns, high costs, or an inability to control the random assignment directly. Luckily, this is a challenge that academics have grappled with for a long time, and they have developed many different strategies for identifying causal effects from non-experimental (i.e., observational) data. For example, we know that smoking causes cancer even though no one ever ran a randomized experiment to measure the effect of smoking. However, it is important to understand that causal claims from observational data are inherently less reliable than claims derived from experimental evidence and can be subject to severe biases.</p>
<p>This post provides a high-level overview of these two approaches: observational studies and randomized experiments.</p>
<h2 id="observational-studies">Observational Studies</h2>
<p>The vast majority of statistics courses begin and end with the premise that correlation is not causation. But they rarely explain why we cannot simply treat a correlation as a causal effect.</p>
<h3 id="does-signing-up-to-a-premium-account-reduce-music-streaming">Does signing up to a premium account reduce music streaming?</h3>
<p>Let’s consider a simple example. Imagine you are working as a data scientist for Musicfi, a firm that offers on-demand music streaming services. To keep things simple, let’s assume the firm has two account types: a free account and a premium account. Musicfi’s main measure of customer engagement is total streaming minutes: how many minutes each customer spends on the service per day. As a data scientist, you want to understand your customers, so you decide to perform a simple analysis that compares the total streaming minutes across the two account types. To put this in causal terms:</p>
<ul>
<li>Our treatment is having a premium account.</li>
<li>Our control is having a free account.</li>
<li>Our outcome is total streaming minutes.</li>
</ul>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/StreamingMinutesByAccountType.png" />
</div>
<a href="/image/StreamingMinutesByAccountType.png" itemprop="contentUrl"></a>
<figcaption><h4>Barplot of Streaming Minutes Split by Account Type.</h4>
</figcaption>
</figure>
</div>
<p>From the figure, we can see that customers with premium accounts had lower total streaming minutes than customers with free accounts! The average in the premium group was about 72 minutes compared to about 81 minutes in the free group, meaning that (on average) customers with a premium account listened to less music! We can use a t-test to check whether this observed difference is statistically significant:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="c1"># Welch two-sample t-test: streaming minutes for Free vs. Premium accounts</span>
<span class="nf">t.test</span><span class="p">(</span><span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Free"</span><span class="n">]</span><span class="p">,</span> <span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Premium"</span><span class="n">]</span><span class="p">)</span>
<span class="n">Welch</span> <span class="n">Two</span> <span class="n">Sample</span> <span class="n">t</span><span class="o">-</span><span class="n">test</span>
<span class="n">data</span><span class="o">:</span> <span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Free"</span><span class="n">]</span> <span class="n">and</span> <span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Premium"</span><span class="n">]</span>
<span class="n">t</span> <span class="o">=</span> <span class="m">13.756</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="m">428.07</span><span class="p">,</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span> <span class="o"><</span> <span class="m">2.2e-16</span>
<span class="n">alternative</span> <span class="n">hypothesis</span><span class="o">:</span> <span class="n">true</span> <span class="n">difference</span> <span class="n">in</span> <span class="n">means</span> <span class="n">is</span> <span class="n">not</span> <span class="n">equal</span> <span class="n">to</span> <span class="m">0</span>
<span class="m">95</span> <span class="n">percent</span> <span class="n">confidence</span> <span class="n">interval</span><span class="o">:</span>
<span class="m">8.244544</span> <span class="m">10.993385</span>
<span class="n">sample</span> <span class="n">estimates</span><span class="o">:</span>
<span class="n">mean</span> <span class="n">of</span> <span class="n">x</span> <span class="n">mean</span> <span class="n">of</span> <span class="n">y</span>
<span class="m">81.49807</span> <span class="m">71.87910</span>
</code></pre></td></tr></table>
</div>
</div><p>Suppose we take the above results at face value. In that case, we might incorrectly conclude that having a premium account reduces customer engagement. But this seems unlikely! Our general understanding suggests that buying a premium account should have the opposite effect. So what’s going on? Well, we are likely falling victim to what is often called <strong>selection bias</strong>. Selection bias occurs when the treatment group is systematically different from the control group before the treatment has occurred, making it hard to disentangle differences due to the treatment from those due to the systematic difference between the two groups.</p>
<p>Mathematically, we can model selection bias as a third variable—often called a confounding variable—associated with both a unit’s propensity to receive the treatment and that unit’s outcome. In our Musicfi example, this variable could be the age of the customer: younger customers tend to listen to more music and are less likely to purchase a premium account. Therefore, the age variable is a confounding variable as it limits our ability to draw causal inferences.</p>
<p>To see this, we can compare how age is related to total streaming minutes and the account type.</p>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/StreamingMinutesByAge.png" />
</div>
<a href="/image/StreamingMinutesByAge.png" itemprop="contentUrl"></a>
<figcaption><h4>There is a clear relationship between Streaming Minutes and Age: Younger users stream a lot more than older users!</h4>
</figcaption>
</figure>
</div>
<p>If we wanted to adjust for the age variable (assuming it is observed), we could run a linear regression:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="c1"># Regress streaming minutes on account type, adjusting for age</span>
<span class="o">></span> <span class="nf">summary</span><span class="p">(</span><span class="nf">lm</span><span class="p">(</span><span class="n">StreamingMinutes</span> <span class="o">~</span> <span class="n">AccountType</span> <span class="o">+</span> <span class="n">Age</span><span class="p">))</span>
<span class="n">Call</span><span class="o">:</span>
<span class="nf">lm</span><span class="p">(</span><span class="n">formula</span> <span class="o">=</span> <span class="n">StreamingMinutes</span> <span class="o">~</span> <span class="n">AccountType</span> <span class="o">+</span> <span class="n">Age</span><span class="p">)</span>
<span class="n">Residuals</span><span class="o">:</span>
<span class="n">Min</span> <span class="m">1</span><span class="n">Q</span> <span class="n">Median</span> <span class="m">3</span><span class="n">Q</span> <span class="n">Max</span>
<span class="m">-10.9809</span> <span class="m">-1.9480</span> <span class="m">0.2115</span> <span class="m">2.0191</span> <span class="m">8.3711</span>
<span class="n">Coefficients</span><span class="o">:</span>
<span class="n">Estimate</span> <span class="n">Std.</span> <span class="n">Error</span> <span class="n">t</span> <span class="n">value</span> <span class="nf">Pr</span><span class="p">(</span><span class="o">>|</span><span class="n">t</span><span class="o">|</span><span class="p">)</span>
<span class="p">(</span><span class="n">Intercept</span><span class="p">)</span> <span class="m">100.74363</span> <span class="m">0.37325</span> <span class="m">269.91</span> <span class="o"><</span> <span class="m">2e-16</span> <span class="o">***</span>
<span class="n">AccountTypePremium</span> <span class="m">1.23146</span> <span class="m">0.31415</span> <span class="m">3.92</span> <span class="m">0.000101</span> <span class="o">***</span>
<span class="n">Age</span> <span class="m">-1.05867</span> <span class="m">0.01736</span> <span class="m">-60.97</span> <span class="o"><</span> <span class="m">2e-16</span> <span class="o">***</span>
<span class="o">---</span>
<span class="n">Signif.</span> <span class="n">codes</span><span class="o">:</span> <span class="m">0</span> ‘<span class="o">***</span>’ <span class="m">0.001</span> ‘<span class="o">**</span>’ <span class="m">0.01</span> ‘<span class="o">*</span>’ <span class="m">0.05</span> ‘<span class="n">.’</span> <span class="m">0.1</span> ‘ ’ <span class="m">1</span>
<span class="n">Residual</span> <span class="n">standard</span> <span class="n">error</span><span class="o">:</span> <span class="m">2.93</span> <span class="n">on</span> <span class="m">497</span> <span class="n">degrees</span> <span class="n">of</span> <span class="n">freedom</span>
<span class="n">Multiple</span> <span class="n">R</span><span class="o">-</span><span class="n">squared</span><span class="o">:</span> <span class="m">0.9086</span><span class="p">,</span> <span class="n">Adjusted</span> <span class="n">R</span><span class="o">-</span><span class="n">squared</span><span class="o">:</span> <span class="m">0.9082</span>
<span class="bp">F</span><span class="o">-</span><span class="n">statistic</span><span class="o">:</span> <span class="m">2470</span> <span class="n">on</span> <span class="m">2</span> <span class="n">and</span> <span class="m">497</span> <span class="n">DF</span><span class="p">,</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span><span class="o">:</span> <span class="o"><</span> <span class="m">2.2e-16</span>
</code></pre></td></tr></table>
</div>
</div>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/ResidualsStreamingMinutesByAccountType.png" />
</div>
<a href="/image/ResidualsStreamingMinutesByAccountType.png" itemprop="contentUrl"></a>
<figcaption><h4>To see the effect of AccountType after adjusting for Age, we can plot the residuals from the regression of Streaming Minutes on Age.</h4>
</figcaption>
</figure>
</div>
<p>From the above, we can see that the estimated effect of account type on total streaming minutes, controlling for age, is now positive!
However, even after controlling for age, can we be confident in interpreting this as the causal effect of having a premium account?
The answer is still most likely no, because there may be other confounding variables that are not part of our data set; these are known as unobserved confounders. Even if there were no unobserved confounders, there are more robust methods than linear regression for analyzing observational studies. We will discuss these in more detail in future posts.</p>
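<p>The sign flip is easy to reproduce on simulated data. The sketch below does not use the Musicfi data set: the selection mechanism, the true effect of 1.2 minutes, and the sample size are assumptions picked to resemble the output above.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Hypothetical observational data: all numbers are assumptions.
set.seed(3)
n <- 5000
Age <- round(runif(n, 18, 60))
# Assumed selection mechanism: older customers buy premium more often.
buys_premium <- rbinom(n, 1, plogis((Age - 40) / 6)) == 1
AccountType <- factor(ifelse(buys_premium, "Premium", "Free"),
                      levels = c("Free", "Premium"))
true_effect <- 1.2   # assumed: premium adds 1.2 streaming minutes per day
StreamingMinutes <- 100 - 1.05 * Age + true_effect * buys_premium + rnorm(n, sd = 3)

# Naive comparison of group means, ignoring age.
naive_diff <- mean(StreamingMinutes[AccountType == "Premium"]) -
  mean(StreamingMinutes[AccountType == "Free"])
# Regression adjustment for the observed confounder.
adjusted <- coef(lm(StreamingMinutes ~ AccountType + Age))["AccountTypePremium"]
print(c(naive = naive_diff, adjusted = unname(adjusted)))
</code></pre></div>
<p>Here the naive difference is negative while the age-adjusted coefficient lands near the assumed true effect. That only works because age is the sole confounder in the simulation and it enters linearly; real data offer no such guarantee.</p>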
<h2 id="randomized-experiments">Randomized Experiments</h2>
<p>Randomized experiments remove the selection problem and ensure that there are no confounding variables (observed or unobserved). They do this by removing the individual’s opportunity to select whether or not they receive the treatment. In the Musicfi example, if we randomly upgraded some free accounts to premium accounts, then we would no longer have to adjust for age as (on average) there will be no age difference between the treated and control subjects.</p>
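<p>A quick simulation sketch shows what randomization buys us. As before, the numbers are assumptions rather than Musicfi data: age still drives streaming minutes, but a coin flip decides who gets upgraded.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Hypothetical randomized experiment: all numbers are assumptions.
set.seed(4)
n <- 50000
Age <- round(runif(n, 18, 60))
# Random assignment: a coin flip decides who is upgraded, independently of Age.
Premium <- rbinom(n, 1, 0.5)
true_effect <- 1.2   # assumed effect in minutes per day
StreamingMinutes <- 100 - 1.05 * Age + true_effect * Premium + rnorm(n, sd = 3)

# Covariate balance: the average age should be about the same in both arms.
age_gap <- mean(Age[Premium == 1]) - mean(Age[Premium == 0])
# No adjustment needed: the plain difference in means targets the effect.
diff_in_means <- mean(StreamingMinutes[Premium == 1]) -
  mean(StreamingMinutes[Premium == 0])
print(c(age_gap = age_gap, diff_in_means = diff_in_means))
</code></pre></div>
<p>The age gap between the two arms is close to zero, and the unadjusted difference in means is close to the assumed effect, without ever looking at age.</p>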
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/RandExpAgeByAccountType.png" />
</div>
<a href="/image/RandExpAgeByAccountType.png" itemprop="contentUrl"></a>
<figcaption><h4>There is no difference in the distribution of age across the two account types. </h4>
</figcaption>
</figure>
</div>
<p>So even though age is still correlated with the amount of music that Musicfi customers listen to:</p>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/RandExpStreamingMinutesByAge.png" />
</div>
<a href="/image/RandExpStreamingMinutesByAge.png" itemprop="contentUrl"></a>
<figcaption><h4>Relationship between age and streaming minutes.</h4>
</figcaption>
</figure>
</div>
<p>Age is not a confounding variable, as it is independent of the treatment assignment. We can now attribute any difference in the outcome directly to the intervention, which allows us to conclude that giving people a premium account increases streaming minutes.</p>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/RandExpStreamingMinutesByAccountType.png" />
</div>
<a href="/image/RandExpStreamingMinutesByAccountType.png" itemprop="contentUrl"></a>
<figcaption><h4>Streaming minutes by account type in the randomized experiment.</h4>
</figcaption>
</figure>
</div>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="nf">t.test</span><span class="p">(</span><span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Premium"</span><span class="n">]</span><span class="p">,</span> <span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Free"</span><span class="n">]</span><span class="p">)</span>
<span class="n">Welch</span> <span class="n">Two</span> <span class="n">Sample</span> <span class="n">t</span><span class="o">-</span><span class="n">test</span>
<span class="n">data</span><span class="o">:</span> <span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Premium"</span><span class="n">]</span> <span class="n">and</span> <span class="n">StreamingMinutes[AccountType</span> <span class="o">==</span> <span class="s">"Free"</span><span class="n">]</span>
<span class="n">t</span> <span class="o">=</span> <span class="m">3.0805</span><span class="p">,</span> <span class="n">df</span> <span class="o">=</span> <span class="m">491.55</span><span class="p">,</span> <span class="n">p</span><span class="o">-</span><span class="n">value</span> <span class="o">=</span> <span class="m">0.002183</span>
<span class="n">alternative</span> <span class="n">hypothesis</span><span class="o">:</span> <span class="n">true</span> <span class="n">difference</span> <span class="n">in</span> <span class="n">means</span> <span class="n">is</span> <span class="n">not</span> <span class="n">equal</span> <span class="n">to</span> <span class="m">0</span>
<span class="m">95</span> <span class="n">percent</span> <span class="n">confidence</span> <span class="n">interval</span><span class="o">:</span>
<span class="m">0.5413153</span> <span class="m">2.4479528</span>
<span class="n">sample</span> <span class="n">estimates</span><span class="o">:</span>
<span class="n">mean</span> <span class="n">of</span> <span class="n">x</span> <span class="n">mean</span> <span class="n">of</span> <span class="n">y</span>
<span class="m">89.19549</span> <span class="m">87.70085</span>
</code></pre></td></tr></table>
</div>
</div><p>In addition to ensuring that all covariates are balanced (i.e., the same) on average, randomization also provides a basis for assumption-light inference, as we’ve discussed in our posts on <a href="/post/randomization-based-inference">Randomization Based Inference: The Neymanian approach</a> and <a href="/post/frt">Randomization-based inference: the Fisherian Approach</a>.</p>
Randomization-based inference: the Fisherian Approach
Mon, 05 Apr 2021
<h2 id="why-another-approach-to-inference">Why another approach to inference?</h2>
<h3 id="the-limits-of-the-neymanian-approach">The limits of the Neymanian approach</h3>
<p>The Neymanian approach to inference allows us to construct asymptotic confidence
intervals for the average treatment effect under very mild assumptions. However,
it has three main downsides (we will discuss these in more detail in a subsequent
post):</p>
<ol>
<li>It can only be used with <em>certain</em> assignment mechanisms.</li>
<li>It is valid only asymptotically.</li>
<li>It still makes <em>some</em> assumptions (however mild they are).</li>
</ol>
<h3 id="how-far-can-we-get-with-no-assumptions-whatsoever">How far can we get with no assumptions whatsoever?</h3>
<p>Here, we briefly introduce an alternative approach to inference named after the
famous statistician R.A. Fisher. In contrast to the Neymanian approach to inference,
the Fisherian approach:</p>
<ol>
<li>Works with <em>any</em> assignment mechanism.</li>
<li>Is valid for any sample size \(n\).</li>
<li>Makes absolutely no assumptions whatsoever.</li>
</ol>
<p>Now this looks an awful lot like a free lunch, and we all know that the universe
is never quite so cooperative. So what’s the catch?</p>
<p>Well, the downside is that in its simplest instantiation, this approach to
inference does not give us an estimator of the average treatment effect, let
alone a confidence interval. Instead, it allows us to answer the admittedly
simpler question: based on the data, would it be reasonable to conclude that
the treatment does not affect any of the units in the population? In statistical
language, the Fisherian approach to inference allows us to <strong>test</strong> the following
null hypothesis:</p>
<p>\[
H_0: \,Y_i(0) = Y_i(1) \quad \forall i=1,\ldots,n
\]</p>
<p>At this point, it’s essential to understand what you need to assume with each
inferential approach and what you get in exchange. The Fisher Randomization Test,
as the Fisherian approach is often called, makes
fewer assumptions than the Neymanian approach (in fact, it makes no assumptions
whatsoever) — but the question it answers is, in a sense, narrower than that
answered by the Neymanian approach.</p>
<h2 id="hypothesis-testing-as-stochastic-proof-by-contradiction">Hypothesis testing as stochastic proof by contradiction</h2>
<p>How can we test the null hypothesis \(H_0\) introduced above?
If you’ve taken an introductory statistics class,
you may have seen things like t-tests and chi-square tests, but it’s
unclear how to apply these to test \(H_0\).</p>
<p>At this point, it is helpful to step back and consider
what statistical <em>hypothesis testing</em> is; a particularly fruitful
analogy is with a mathematical device called a <em>proof by contradiction</em>.</p>
<p>Suppose you wish to prove that a statement \(P\) is true. The idea
of the proof by contradiction is to</p>
<ol>
<li>Suppose that the opposite of \(P\) is in fact true.</li>
<li>Show that this leads to an internal contradiction, or an absurd statement.</li>
<li>Conclude that since the opposite of \(P\) leads to an absurdity, \(P\) must be true.</li>
</ol>
<blockquote>
<p><strong>Example (from Wikipedia):</strong></p>
<p>A famous example is the proof of the fact that there exists no
smallest positive rational number. This is easy to show with a proof
by contradiction. Indeed, suppose that there exists a smallest positive
rational number \(r = \frac{a}{b} > 0\). Then the number
\(r' = r/2 = \frac{a}{2b} > 0\) is also a positive rational number,
but we have \(r' < r\). This, however, contradicts the fact that
\(r\) is the smallest positive rational number! We conclude that
the initial supposition is wrong, and that there doesn’t exist a smallest
positive rational number.</p>
</blockquote>
<p>We can think of statistical hypothesis testing as a stochastic version of proof by contradiction.
When we “test” the null hypothesis \(H_0\), our goal is to establish that it is false (or unlikely).
Indeed, we can never verify the null hypothesis; we can only reject it or fail to reject it.
By analogy with the proof by contradiction, the general strategy for hypothesis testing is as follows:</p>
<ol>
<li>Suppose that \(H_0\) is true.</li>
<li>Show that this implies something very unlikely.</li>
<li>Conclude that since \(H_0\) leads to something very unlikely, it
is unlikely to be true.</li>
</ol>
<p>The key difference here is in step 2: showing that something is <strong>impossible</strong> is generally too
stringent a criterion in statistics, so we settle for showing that it is unlikely.</p>
<h2 id="the-fisher-randomization-test">The Fisher Randomization Test</h2>
<p>The Fisher randomization test begins with the observed assignment vector \(W\),
the observed outcome vector \(Y\), and the sharp null hypothesis \(H_0\).
Following the stochastic proof by contradiction analogy, let’s suppose for a moment \(H_0\) is,
in fact, correct, and let’s follow the thread of implications.</p>
<p><strong>Step 1: Hypothesized science table</strong></p>
<p>If \(H_0\) is true, then we know the entire science table! Indeed, for
illustration, suppose that \(W = (1, 1, 0, 1, 0, 0)\) and
\(Y = (3, 1, 2, 1, 3, 0)\) as in our post on the <a href="/post/randomization-based-inference">Neymanian approach</a>.
Again, we assume that the assignment was a <a href="/post/po-introduction/#assignment-mechanism">completely randomized design</a>
with parameter \(n_1 = 3\).</p>
<p>As we’ve discussed in earlier <a href="post/po-introduction/#the-science">posts</a>, we typically only have a partial view of the science table since, for each unit, we only observe one of the two potential outcomes. That is, we observe the following partial table:</p>
<table>
<thead>
<tr>
<th>i</th>
<th>\(y_i(0)\)</th>
<th>\(y_i(1)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>?</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>?</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>?</td>
</tr>
<tr>
<td>4</td>
<td>?</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>3</td>
<td>?</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>?</td>
</tr>
</tbody>
</table>
<p>But if \(H_0\) is true, then \(y_i(1) = y_i(0)\) for all
units \(i\)! So each question mark in the table can be replaced by the
number on the same row in the other column. That is, we have the following
table:</p>
<table>
<thead>
<tr>
<th>i</th>
<th>\(y_i(0)\)</th>
<th>\(y_i(1)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>Now we must be very careful. The science table above is not the true
science table: it is what the true science table would be if \(H_0\)
were true! To make the distinction clear, we denote by
\(\underline{y}(H_0)\) this hypothesized science table.</p>
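<p>As a rough illustration of this step, the sketch below builds the partial table and then fills it in under the sharp null. The names <code>partial</code> and <code>hypothesized</code> are ours, introduced only for this example, and <code>NA</code> stands in for the question marks.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">## Observed data from the running example
W.obs <- c(1, 1, 0, 1, 0, 0)
Y.obs <- c(3, 1, 2, 1, 3, 0)

## Partial science table: NA marks the potential outcome we never observe
partial <- data.frame(
  i  = seq_along(W.obs),
  y0 = ifelse(W.obs == 0, Y.obs, NA),
  y1 = ifelse(W.obs == 1, Y.obs, NA)
)

## Under the sharp null y_i(1) = y_i(0), each NA is replaced by the
## observed value sitting in the same row of the other column
hypothesized <- partial
hypothesized$y0[is.na(partial$y0)] <- partial$y1[is.na(partial$y0)]
hypothesized$y1[is.na(partial$y1)] <- partial$y0[is.na(partial$y1)]

print(partial)
print(hypothesized)
</code></pre></div>
<p>Both columns of <code>hypothesized</code> end up equal to <code>Y.obs</code>, which is the sense in which \(H_0\) pins down the whole table.</p>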
<p><strong>Step 2: Null distribution</strong></p>
<p>Recall that our goal – in a stochastic proof by contradiction – is to
show that supposing that \(H_0\) is true implies something very unlikely. One
way to do so is to check whether “the outcome of the experiment we actually
observe would be considered unusually extreme if \(H_0\) really were true.”
Concretely, take any summary \(T\) of the observed data \(Y, W\) — we
call this summary a “test statistic.” A simple
choice, for instance, could be the average outcome of the treated units,</p>
<p>\[
T_a^{obs} = T_a(Y, W) = \frac{1}{n_1} \sum_{i=1}^n Y_i W_i,
\]</p>
<p>or the difference in means,</p>
<p>\[
T_b^{obs} = T_b(Y, W) = \frac{1}{n_1} \sum_{i=1}^n Y_i W_i - \frac{1}{n-n_1} \sum_{i=1}^n Y_i (1-W_i).
\]</p>
<p>In our specific example, we have \(T_a^{obs} = 5/3\) and \(T_b^{obs} = 5/3 - 5/3 = 0\).
The choice of test statistic is important in practice — but it doesn’t affect the
mechanics of the test so, for the sake of illustration, we will use \(T_b\) here.
Now say we observe a certain value \(T_b^{obs}\) corresponding to the observed
data \(Y, W\). Is this value surprising, if \(H_0\) is true? Well, we can check
that. Suppose that we had observed assignment \(W^{(1)} = (1,1,1,0,0,0)\) instead. If \(H_0\)
is true, then the science is \(\underline{y}(H_0)\), and so we would have observed
\(Y^{(1)} = (3,1,2,1,3,0)\), the same outcomes as before, and the corresponding value of the test statistic</p>
<p>\[
T_b^{(1)} = T_b(Y^{(1)}, W^{(1)}) = 6/3 - 4/3 = 2/3.
\]</p>
<p>Likewise, for any possible assignment \(W^{(k)}\), we can deduce from
\(\underline{y}(H_0)\) what \(Y^{(k)}\) and, therefore, \(T^{(k)}\)
would have been. The distribution of \(T(y(W'), W')\) under \(H_0\), where \(W'\) is drawn
from the assignment mechanism and \(y(W')\) denotes the outcomes we would then observe,
is called the <em>null distribution</em>.</p>
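<p>To make Step 2 concrete, here is a minimal sketch that evaluates both test statistics on the running example and then tabulates the null distribution of \(T_b\). The function names <code>T.a</code> and <code>T.b</code> are ours, and we enumerate the assignments with base R’s <code>combn</code> purely for illustration.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">W.obs <- c(1, 1, 0, 1, 0, 0)
Y.obs <- c(3, 1, 2, 1, 3, 0)

## T_a: average outcome of the treated units
T.a <- function(w, y) sum(y * w) / sum(w)
## T_b: difference in means between treated and control units
T.b <- function(w, y) sum(y * w) / sum(w) - sum(y * (1 - w)) / sum(1 - w)

T.a(W.obs, Y.obs)   # 5/3
T.b(W.obs, Y.obs)   # 5/3 - 5/3 = 0

## Under H0 the outcomes do not change with the assignment, so the
## alternative assignment W1 is paired with the same outcome vector
W1 <- c(1, 1, 1, 0, 0, 0)
T.b(W1, Y.obs)      # 6/3 - 4/3 = 2/3

## Null distribution: evaluate T_b on all choose(6, 3) = 20 assignments
treated.sets <- combn(length(W.obs), sum(W.obs))
T.null <- apply(treated.sets, 2, function(idx) {
  w <- replace(numeric(length(W.obs)), idx, 1)
  T.b(w, Y.obs)
})
table(round(T.null, 3))
</code></pre></div>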
<p><strong>Step 3: p-values quantify our surprise</strong></p>
<p>We now compare the value \(T^{obs}_{b}\) that we actually
observe with the null distribution and ask: Is \(T^{obs}_b\) extreme compared
to the other values or not? Specifically, we can ask how likely it would have been
to observe a value as large as or larger than \(T^{obs}_b\) if \(H_0\) were
true. This is what we call a (one-sided) p-value; formally</p>
<p>\[
pval(Y, W) = \sum_{W'} pr(W') \cdot \mathbb{I}\big(T_b(y(W'), W') \geq T_b^{obs}\big).
\]</p>
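<p>The p-value formula can be evaluated directly on the running example. The sketch below is one way to do it, assuming a completely randomized design so that \(pr(W')\) is uniform over the support; the names <code>support</code>, <code>pr</code> and <code>T.b</code> are ours.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">W.obs <- c(1, 1, 0, 1, 0, 0)
Y.obs <- c(3, 1, 2, 1, 3, 0)
n  <- length(W.obs)
n1 <- sum(W.obs)

## Difference in means, written with sums so the arithmetic mirrors the formula
T.b <- function(w, y) sum(y * w) / sum(w) - sum(y * (1 - w)) / sum(1 - w)

## One column per assignment in the support of the design
support <- combn(n, n1, FUN = function(idx) replace(numeric(n), idx, 1))
## Complete randomization: every assignment in the support is equally likely
pr <- rep(1 / ncol(support), ncol(support))

T.obs  <- T.b(W.obs, Y.obs)
## Under H0 the outcomes stay at Y.obs whatever the assignment
T.null <- apply(support, 2, T.b, y = Y.obs)

## pval = sum over W' of pr(W') * 1{T(W') >= T.obs}
pval <- sum(pr * (T.null >= T.obs))
pval   # 12 of the 20 assignments satisfy the inequality, so 0.6
</code></pre></div>
<p>With a different design, only <code>support</code> and <code>pr</code> would need to change; the last three lines stay the same.</p>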
<p><strong>Step 4: interpreting the p-value</strong></p>
<p>Now that we have obtained a p-value, let’s step back and see what it actually
means, and how it fits into our general perspective of the Fisher Randomization
Test as a stochastic proof by contradiction. To “disprove” \(H_0\),
we assumed it to be true, and computed the probability of observing a value of \(T_b\) at least as extreme as the one we actually observed;
this was given by \(pval(Y, W)\). If \(pval(Y, W)\) is small (typically smaller than 0.05 or 0.01), we reject the null hypothesis; otherwise, we fail to reject it.</p>
<h3 id="summary-of-the-fisher-randomization-test">Summary of the Fisher Randomization Test</h3>
<p>There is a lot more that could be said about the Fisher Randomization Test (and
indeed, we will say more about it in another post), but these are the very
basics. The figure below summarizes the whole process.</p>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/FRT-overview.jpg" />
</div>
<a href="/image/FRT-overview.jpg" itemprop="contentUrl"></a>
<figcaption><h4>The big picture idea of the Fisher randomization test.</h4>
</figcaption>
</figure>
</div>
<h2 id="convince-yourself">Convince yourself!</h2>
<p>Once you really understand it, the Fisher Randomization Test is a
profoundly beautiful — and useful — inferential tool. But even
more so than the Neymanian approach to inference, it takes some getting
used to. And here again, playing with simulations and implementing your
own tests will really make a difference.</p>
<h3 id="a-simple-implementation">A simple implementation</h3>
<p>The code below implements a simple, exact Fisher Randomization Test.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span><span class="lnt">44
</span><span class="lnt">45
</span><span class="lnt">46
</span><span class="lnt">47
</span><span class="lnt">48
</span><span class="lnt">49
</span><span class="lnt">50
</span><span class="lnt">51
</span><span class="lnt">52
</span><span class="lnt">53
</span><span class="lnt">54
</span><span class="lnt">55
</span><span class="lnt">56
</span><span class="lnt">57
</span><span class="lnt">58
</span><span class="lnt">59
</span></code></pre></td>
<td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R">
<span class="nf">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">gtools</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## Test Statistic</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## Two examples of test statistics:</span>
<span class="c1">## Any function of w and y will do, so feel free to experiment!</span>
<span class="bp">T</span><span class="m">.1</span> <span class="o"><-</span> <span class="nf">function</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="nf">mean</span><span class="p">(</span><span class="n">y[w</span><span class="o">==</span><span class="m">1</span><span class="n">]</span><span class="p">)</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">y[w</span><span class="o">==</span><span class="m">0</span><span class="n">]</span><span class="p">)</span>
<span class="bp">T</span><span class="m">.2</span> <span class="o"><-</span> <span class="nf">function</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="nf">sum</span><span class="p">(</span><span class="n">y[w</span><span class="o">==</span><span class="m">1</span><span class="n">]</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## Observed data</span>
<span class="c1">## ----------------------------</span>
<span class="n">W.obs</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
<span class="n">Y.obs</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
<span class="bp">T</span><span class="m">.1</span><span class="n">.obs</span> <span class="o"><-</span> <span class="nf">T.1</span><span class="p">(</span><span class="n">W.obs</span><span class="p">,</span> <span class="n">Y.obs</span><span class="p">)</span>
<span class="bp">T</span><span class="m">.2</span><span class="n">.obs</span> <span class="o"><-</span> <span class="nf">T.2</span><span class="p">(</span><span class="n">W.obs</span><span class="p">,</span> <span class="n">Y.obs</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## Constructing the null science</span>
<span class="c1">## ----------------------------</span>
<span class="n">science.0</span> <span class="o"><-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">y1</span> <span class="o">=</span> <span class="n">Y.obs</span><span class="p">,</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">Y.obs</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## Computing the p-value -- the exact version</span>
<span class="c1">## ----------------------------</span>
<span class="n">W.ls</span> <span class="o"><-</span> <span class="nf">permutations</span><span class="p">(</span><span class="m">6</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">array_tree</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">map</span><span class="p">(</span><span class="o">~</span> <span class="n">W.obs[.]</span><span class="p">)</span> <span class="o">%>%</span>
<span class="n">unique</span>
<span class="bp">T</span><span class="m">.1</span><span class="n">.ls</span> <span class="o"><-</span> <span class="nf">vector</span><span class="p">(</span><span class="s">'numeric'</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">W.ls</span><span class="p">))</span>
<span class="bp">T</span><span class="m">.2</span><span class="n">.ls</span> <span class="o"><-</span> <span class="nf">vector</span><span class="p">(</span><span class="s">'numeric'</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">W.ls</span><span class="p">))</span>
<span class="nf">for</span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq_along</span><span class="p">(</span><span class="n">W.ls</span><span class="p">))</span> <span class="p">{</span>
<span class="n">W</span> <span class="o"><-</span> <span class="n">W.ls[[i]]</span>
<span class="n">Y</span> <span class="o"><-</span> <span class="n">science.0</span><span class="o">$</span><span class="n">y1</span> <span class="o">*</span> <span class="n">W</span> <span class="o">+</span> <span class="n">science.0</span><span class="o">$</span><span class="n">y0</span> <span class="o">*</span> <span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">W</span><span class="p">)</span>
<span class="bp">T</span><span class="m">.1</span><span class="n">.ls[i]</span> <span class="o"><-</span> <span class="nf">T.1</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="bp">T</span><span class="m">.2</span><span class="n">.ls[i]</span> <span class="o"><-</span> <span class="nf">T.2</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">pval.1</span> <span class="o"><-</span> <span class="nf">mean</span><span class="p">(</span><span class="bp">T</span><span class="m">.1</span><span class="n">.ls</span> <span class="o">>=</span> <span class="bp">T</span><span class="m">.1</span><span class="n">.obs</span><span class="p">)</span>
<span class="n">pval.2</span> <span class="o"><-</span> <span class="nf">mean</span><span class="p">(</span><span class="bp">T</span><span class="m">.2</span><span class="n">.ls</span> <span class="o">>=</span> <span class="bp">T</span><span class="m">.2</span><span class="n">.obs</span><span class="p">)</span>
</code></pre></td></tr></table>
</div>
</div><p>One thing that may be surprising, when looking at the code, is that
the true science table appears nowhere! That’s right, the p-values we
obtain at the end depend on the true science only through the observed
outcomes! Another insight, which you will see if you try the code with different values
of <code>Y.obs</code> and <code>W.obs</code>, is that <code>pval.1</code> and <code>pval.2</code> are
always equal! That is a special property of the test statistics <code>T.1</code>
and <code>T.2</code>, but is not true in general.</p>
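<p>One way to probe the equality claim is to check it on many random data sets. With \(n_1\) treated units, \(n_0\) control units and \(S = \sum_i Y_i\), we have \(T_1 = T_2 / n_1 - (S - T_2)/n_0\), an increasing function of \(T_2\), so both statistics rank the assignments identically. The sketch below is only an illustration: <code>exact.pval</code> is a helper of ours, the statistics are rewritten with sums, and the enumeration uses base R’s <code>combn</code>.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">set.seed(1)
n  <- 6
n1 <- 3

## Same two statistics as above, written with sums
T.1 <- function(w, y) sum(y * w) / sum(w) - sum(y * (1 - w)) / sum(1 - w)
T.2 <- function(w, y) sum(y * w)

## Exact one-sided p-value under a completely randomized design
exact.pval <- function(w.obs, y.obs, stat) {
  support <- combn(length(w.obs), sum(w.obs),
                   FUN = function(idx) replace(numeric(length(w.obs)), idx, 1))
  stat.null <- apply(support, 2, stat, y = y.obs)
  mean(stat.null >= stat(w.obs, y.obs))
}

## Draw random integer outcomes and random assignments, compare the p-values
same <- replicate(200, {
  y <- sample(0:5, n, replace = TRUE)
  w <- sample(rep(c(1, 0), c(n1, n - n1)))
  exact.pval(w, y, T.1) == exact.pval(w, y, T.2)
})
all(same)
</code></pre></div>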
<p>You should play around with the code above, changing the values of <code>W.obs</code> and
<code>Y.obs</code> (make sure that they remain vectors of length 6). Here are two useful
exercises:</p>
<ol>
<li>Come up with values of <code>W.obs</code> and <code>Y.obs</code> that yield high p-values.</li>
<li>Come up with values of <code>W.obs</code> and <code>Y.obs</code> that yield low p-values.</li>
</ol>
<h3 id="a-fancier-implementation">A fancier implementation</h3>
<p>As mentioned earlier, one can think of the FRT as simply an algorithm that takes
four inputs: the observed assignments, the observed outcomes, a test statistic,
and the design of the experiment. We will show how to implement such an algorithm
in the simple case where the design is completely randomized.</p>
<p>Before we do that, however, we point out another insight you can get from playing
with the code above. You will notice that the value of <code>Y</code> computed inside the loop (on
line 50 of the code) is constant, and always equal to <code>Y.obs</code>. You can verify
easily that this is true mathematically: both columns of <code>science.0</code> equal <code>Y.obs</code>, so
<code>Y.obs * W + Y.obs * (1 - W)</code> is just <code>Y.obs</code>. This simple observation will allow us to
simplify the code slightly. Below is an implementation of the Fisher Randomization Test
as a function:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="n">FRT</span> <span class="o"><-</span> <span class="nf">function</span><span class="p">(</span><span class="n">Y.obs</span><span class="p">,</span> <span class="n">W.obs</span><span class="p">,</span> <span class="bp">T</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">## Implementation of the FRT</span>
<span class="c1">## Y.obs: the vector of observed outcomes</span>
<span class="c1">## W.obs: the vector of observed assignments</span>
<span class="c1">## T: the test statistic. Must be a function T(W, Y)</span>
<span class="c1">##</span>
<span class="c1">## This function assumes that W was completely randomized,</span>
<span class="c1">## and returns a one sided p-value.</span>
<span class="n">W.ls</span> <span class="o"><-</span> <span class="nf">permutations</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">W.obs</span><span class="p">),</span> <span class="nf">length</span><span class="p">(</span><span class="n">W.obs</span><span class="p">),</span> <span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">W.obs</span><span class="p">))</span> <span class="o">%>%</span>
<span class="nf">array_tree</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">map</span><span class="p">(</span><span class="o">~</span> <span class="n">W.obs[.]</span><span class="p">)</span> <span class="o">%>%</span>
<span class="n">unique</span>
<span class="bp">T</span><span class="n">.obs</span> <span class="o"><-</span> <span class="nf">T</span><span class="p">(</span><span class="n">W.obs</span><span class="p">,</span> <span class="n">Y.obs</span><span class="p">)</span>
<span class="bp">T</span><span class="n">.ls</span> <span class="o"><-</span> <span class="nf">map_dbl</span><span class="p">(</span><span class="n">W.ls</span><span class="p">,</span> <span class="o">~</span><span class="nf">T</span><span class="p">(</span><span class="n">.,</span> <span class="n">Y.obs</span><span class="p">))</span>
<span class="nf">return</span><span class="p">(</span><span class="nf">mean</span><span class="p">(</span><span class="bp">T</span><span class="n">.ls</span> <span class="o">>=</span> <span class="bp">T</span><span class="n">.obs</span><span class="p">))</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This implementation uses <code>map_dbl</code> and the <code>~</code> formula shorthand from the <code>purrr</code> package
to avoid writing an explicit loop, and uses the aforementioned fact that <code>Y</code> and <code>Y.obs</code>
are always equal.</p>
<p>You will notice that although conceptually profound, the FRT can ultimately be implemented with
just a few lines of code!</p>
<h2 id="the-frt-in-practice-a-monte-carlo-approximation">The FRT in practice: a Monte-Carlo approximation</h2>
<h3 id="two-practical-problems-with-our-implementation">Two practical problems with our implementation</h3>
<p>If you’ve tried to modify the code above to run the FRT in a setting with more
than a dozen units, you will have noticed that the execution time rapidly becomes
alarmingly long. The culprit is the line:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="n">W.ls</span> <span class="o"><-</span> <span class="nf">permutations</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">W.obs</span><span class="p">),</span> <span class="nf">length</span><span class="p">(</span><span class="n">W.obs</span><span class="p">),</span> <span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">W.obs</span><span class="p">))</span> <span class="o">%>%</span>
<span class="nf">array_tree</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="o">%>%</span>
<span class="nf">map</span><span class="p">(</span><span class="o">~</span> <span class="n">W.obs[.]</span><span class="p">)</span> <span class="o">%>%</span>
<span class="n">unique</span>
</code></pre></td></tr></table>
</div>
</div><p>That generates all the assignments. Think about it for a moment: the support of a
completely randomized design with \(n\) units, of which \(n_1\) are treated,
has size \(n \choose n_1\).</p>
<ul>
<li>For \(n=10\) and \(n_1=5\), that’s 252 assignments:
totally doable.</li>
<li>For \(n=20\) and \(n_1 = 10\), that’s \(184,756\) assignments:
still doable, but much slower.</li>
<li>For \(n=30\) and \(n_1 = 15\), the support has
more than \(10^8\) assignments: you should probably not try that on your laptop.</li>
</ul>
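<p>These counts are easy to reproduce with base R’s <code>choose</code>. The sketch below also shows one possible way to enumerate exactly \(n \choose n_1\) assignments with <code>combn</code>, instead of going through all \(n!\) orderings as <code>permutations(n, n, 1:n)</code> does before <code>unique</code> collapses them. The helper <code>enumerate.crd</code> is our own name, not part of any package.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">## Size of the support of a completely randomized design
choose(10, 5)    # 252
choose(20, 10)   # 184756
choose(30, 15)   # 155117520, more than 10^8

## Number of orderings of 10 units, for comparison
factorial(10)    # 3628800

## Enumerate the support directly: one 0/1 column per assignment
enumerate.crd <- function(n, n1) {
  combn(n, n1, FUN = function(idx) replace(numeric(n), idx, 1))
}
dim(enumerate.crd(10, 5))   # 10 rows (units), 252 columns (assignments)
</code></pre></div>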
<p>Another problem with our implementation is that it only allows us to run FRTs for
completely randomized experiments, but the theory has no such requirement!</p>
<h3 id="one-solution">One solution</h3>
<p>We can solve the first, computational problem by approximating the null distribution using only a random subset of the possible assignments.</p>
<p>For the second problem, we can replace the permutations with any generic assignment mechanism. A sketch of the code is below.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="n">FRT.generic</span> <span class="o"><-</span> <span class="nf">function</span><span class="p">(</span><span class="n">Y.obs</span><span class="p">,</span> <span class="n">W.obs</span><span class="p">,</span> <span class="bp">T</span><span class="p">,</span> <span class="n">Mechanism</span><span class="p">,</span> <span class="n">iterations</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">## Implementation of the FRT</span>
<span class="c1">## Y.obs: the vector of observed outcomes</span>
<span class="c1">## W.obs: the vector of observed assignments</span>
<span class="c1">## T: the test statistic. Must be a function T(W, Y)</span>
<span class="c1">## Mechanism: a function of no arguments that returns a new assignment W</span>
<span class="c1">## drawn from the same distribution as W.obs</span>
<span class="c1">## iterations: the number of samples to obtain from the assignment</span>
<span class="c1">## mechanism</span>
<span class="c1">##</span>
<span class="c1">## This function returns a one sided p-value.</span>
<span class="c1">## Draw the assignments; simplify = FALSE keeps them as a list for map_dbl</span>
<span class="n">W.ls</span> <span class="o"><-</span> <span class="nf">replicate</span><span class="p">(</span><span class="n">iterations</span><span class="p">,</span> <span class="nf">Mechanism</span><span class="p">(),</span> <span class="n">simplify</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
<span class="bp">T</span><span class="n">.obs</span> <span class="o"><-</span> <span class="nf">T</span><span class="p">(</span><span class="n">W.obs</span><span class="p">,</span> <span class="n">Y.obs</span><span class="p">)</span>
<span class="bp">T</span><span class="n">.ls</span> <span class="o"><-</span> <span class="nf">map_dbl</span><span class="p">(</span><span class="n">W.ls</span><span class="p">,</span> <span class="o">~</span><span class="nf">T</span><span class="p">(</span><span class="n">.,</span> <span class="n">Y.obs</span><span class="p">))</span>
<span class="nf">return</span><span class="p">(</span><span class="nf">mean</span><span class="p">(</span><span class="bp">T</span><span class="n">.ls</span> <span class="o">>=</span> <span class="bp">T</span><span class="n">.obs</span><span class="p">))</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
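<p>To see the Monte-Carlo idea at work, here is a small self-contained sketch in base R. It is only one possible reading of the approach: <code>frt.mc</code> and <code>crd.mechanism</code> are hypothetical names of ours, and the mechanism assumed here is a completely randomized design that keeps the observed number of treated units.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">set.seed(2021)
W.obs <- c(1, 1, 0, 1, 0, 0)
Y.obs <- c(3, 1, 2, 1, 3, 0)

## Difference in means, taking the assignment first and the outcomes second
T.b <- function(w, y) sum(y * w) / sum(w) - sum(y * (1 - w)) / sum(1 - w)

## Assumed mechanism: shuffle the observed assignment, which keeps n1 fixed
crd.mechanism <- function() sample(W.obs)

## Monte-Carlo approximation of the one-sided p-value
frt.mc <- function(y.obs, w.obs, stat, mechanism, iterations) {
  w.draws   <- replicate(iterations, mechanism(), simplify = FALSE)
  stat.null <- vapply(w.draws, stat, numeric(1), y = y.obs)
  mean(stat.null >= stat(w.obs, y.obs))
}

pval.mc <- frt.mc(Y.obs, W.obs, T.b, crd.mechanism, iterations = 10000)
pval.mc   # should land close to the exact value 12/20 = 0.6
</code></pre></div>
<p>Swapping in another design only requires another zero-argument function that draws an assignment, as long as the test statistic remains well defined for every draw.</p>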
</div>Randomization-based inference: the Neymanian approach/post/randomization-based-inference/Fri, 12 Mar 2021 00:00:00 +0000/post/randomization-based-inference/<h2 id="prereqs-and-notational-matters">Prereqs and notational matters</h2>
<p>This post requires a basic understanding of the potential outcomes framework
and familiarity with the standard notation; see our earlier <a href="/post/po-introduction">post</a>
for the necessary background and notation.</p>
<p>Appropriate notation is critical in causal inference, as it can give a visual reminder of some of the subtlety involved. The key challenge is to strike an appropriate balance between mathematical rigor, which would make the notation precise but heavy and sometimes hard to parse, and simplicity, which would make it readable at the cost of the occasional ambiguity. Another complication is that some unfortunate notational habits have become de facto standard in the literature, making their use almost unavoidable. We do our best to use clear and sensible notation while hewing close to the generally accepted standard.</p>
<p>In our first <a href="/post/po-introduction">post</a> on the subject, we used a number
of notational crutches that were appropriate for an introduction to the subject,
but will quickly become cumbersome. For instance, we denoted vectors with the
arrow symbol (<em>e.g.,</em> \(\vec{W}\) and \(\vec{Y}\)): these are now omitted. We
also drop the superscript “DiM” from the difference-in-means estimator \(\hat{\tau}\)
and the superscript “ATE” from the average treatment effect estimand \(\tau\).</p>
<h2 id="the-question-of-inference">The question of inference</h2>
<h3 id="how-good-is-my-guess">How good is my guess?</h3>
<p>Consider a randomized experiment run on a population of \(n\) units
to assess the effect of a binary intervention; assume throughout that
there is no <a href="/post/interference">interference</a>.</p>
<p>To make things very concrete, suppose that
\(n=6\) and that we observe the assignment
\(W = (1, 1, 0, 1, 0, 0)\) and the outcome vector
\(Y = (3, 1, 2, 1, 3, 0)\). Suppose our goal is to estimate the
Average Treatment Effect (ATE),</p>
<p>\[
\tau = \frac{1}{6} \sum_{i=1}^6 \bigg(y_i(1) - y_i(0)\bigg),
\]</p>
<p>using the observed data. One possible solution is to use
the difference-in-means estimator,</p>
<p>\[
\hat{\tau} = \frac{1}{3}(3 + 1 + 1) - \frac{1}{3}(2 + 3 + 0),
\]</p>
<p>which gives us an estimate of \(\hat{\tau} = 0\) for \(\tau\). As we’ve
seen in our <a href="/post/po-introduction">introduction to the potential outcomes framework</a>,
an estimator is a data-driven “guess” of the value of an estimand. The
difference-in-means estimator is one procedure for obtaining such a guess and,
presently, it gives us a guess of \(0\) for the ATE.</p>
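<p>For readers who prefer to see the arithmetic run, the estimate can be reproduced in a couple of lines. The variable name <code>tau.hat</code> is ours.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">W <- c(1, 1, 0, 1, 0, 0)
Y <- c(3, 1, 2, 1, 3, 0)

## Difference in means: average treated outcome minus average control outcome
tau.hat <- sum(Y * W) / sum(W) - sum(Y * (1 - W)) / sum(1 - W)
tau.hat   # (3 + 1 + 1)/3 - (2 + 3 + 0)/3 = 0
</code></pre></div>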
<p>But, how good a guess is this? We could have guessed 42, -5, or 1380.
Why should \(0\) be preferred, given the data we observed?
The answer to this question is complex, and it is at the heart of what statistical inference is.</p>
<h3 id="a-parametric-answer">A parametric answer</h3>
<p>If you have taken an introduction to statistical inference, then your answer
might start with the assumption that:</p>
<p>\[
\begin{aligned}
Y_i \mid W_i = 1 \,\, &\overset{iid}{\sim} \,\, N(\mu_1, \sigma_1^2) \\\<br>
Y_i \mid W_i = 0 \,\, &\overset{iid}{\sim} \,\, N(\mu_0, \sigma_0^2) \\\<br>
\end{aligned}
\]</p>
<p>It is easy to show that the difference-in-means estimator \(\hat{\tau}\) is the
Maximum Likelihood Estimator (MLE) for the parameter \(\theta \equiv \mu_1 - \mu_0\).
But that is an awkward answer for a number of reasons:</p>
<ol>
<li>Since by definition, \(Y_i = W_i \, y_i(1) + (1-W_i) \, y_i(0)\),
assuming that \(Y_i \mid W_i = 1\) and \(Y_i \mid W_i = 0\) follow some
distributions implies that \(y_i(1)\) and \(y_i(0)\) must be random. This
in turn implies that the science table \(\underline{y} = (y(0), y(1))\)
is random, and therefore our estimand \(\tau\) is itself a random
variable! So it’s all nice and good to say that \(\hat{\tau}\) is the MLE
for the parameter \(\theta\), but what does the parameter \(\theta\)
have to do with the random variable \(\tau\)?</li>
<li>The parameter \(\theta \) is also a population parameter; that is, to interpret it, we implicitly require the assumption that there exists some larger population from which we sample our experimental units. This sampling allows us to model the outcomes as random; however, the price we pay is that our inference makes claims about the larger population, NOT our actual sample!</li>
<li>Our key building blocks, in the potential outcomes framework, are… well, <em>the
potential outcomes</em>, not the observed outcomes! The distributions on
\(Y_i \mid W_i = 1\) and \(Y_i \mid W_i = 0\) imply assumptions on the
distributions of \(y(1)\) and \(y(0)\), but these are not transparent.</li>
</ol>
<p>These issues can be partially addressed by modifying the assumption as follows:</p>
<p>\[
\begin{aligned}
Y_i(1) \, \, &\overset{iid}{\sim} N(\mu_1, \sigma_1^2), \\\<br>
Y_i(0) \, \, &\overset{iid}{\sim} N(\mu_0, \sigma_0^2), \\\<br>
W_i \,\, &\perp \,\, (Y_i(1), Y_i(0)),
\end{aligned}
\]
where \( \perp \) is used to indicate that \(W_i\) and \((Y_i(1), Y_i(0))\) are independent.
Under these specifications, \(\hat{\tau}\) is still
the MLE for \(\theta = \mu_1 - \mu_0\), but now we can also establish a
connection between \(\theta\) and \(\tau\):</p>
<p>\[
E[\tau] = \theta,
\]
where the expectation is taken with respect to the distribution of the potential outcomes.</p>
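<p>For completeness, here is a short derivation of that identity. It uses only the linearity of expectation and the assumption, made above, that the potential outcomes are identically distributed with means \(\mu_1\) and \(\mu_0\):</p>
<p>\[
\begin{aligned}
E[\tau] &= E\bigg[\frac{1}{n} \sum_{i=1}^n \big(Y_i(1) - Y_i(0)\big)\bigg] \\
&= \frac{1}{n} \sum_{i=1}^n \big(E[Y_i(1)] - E[Y_i(0)]\big) \\
&= \frac{1}{n} \sum_{i=1}^n (\mu_1 - \mu_0) = \mu_1 - \mu_0 = \theta.
\end{aligned}
\]</p>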
<p>With some work, we can state analogous results without assuming normality, but we still
require that the potential outcomes be independent and identically distributed.
This may seem like a mild assumption, but it is still an assumption — and in some scenarios
(e.g., in the presence of <a href="/post/interference">interference</a>), it is anything but mild.
In addition, we started with \(\tau\) as our estimand, but we had to settle for \(\theta\) which,
even if it is connected to \(\tau\), is not what we were originally after.</p>
<p>It turns out that we don’t need to assume anything about the potential outcomes to say something meaningful about \(\tau\). Unlike traditional statistical inference problems, experiments have an additional source of variability: the random assignment. To use this for inference requires thinking about the problem in a somewhat unique way, as we see below.</p>
<h3 id="randomization-as-the-basis-for-inference">Randomization as the basis for inference</h3>
<p>Our goal is to limit the number of necessary assumptions so that our conclusions can be as objective as possible. To begin with, let’s focus on the one element that we control: the assignment mechanism. Let’s say that the assignment mechanism is a <em>completely randomized design</em> with parameter \(n_1 = 3\), as defined in our <a href="/post/po-introduction">previous post</a>. Then, all the assignments with exactly \(3\) treated units and, therefore, \(3\) control units have equal probability, while all other assignments have probability \(0\).</p>
<p>Let’s see how far we can go with just this simple assumption. As before, we consider the science table to be a constant \(\underline{y} = (y(0), y(1))\). For illustration purposes, let’s take \(\underline{y}\) to be as follows:</p>
<table>
<thead>
<tr>
<th>i</th>
<th>\(y_i(0)\)</th>
<th>\(y_i(1)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>In practice, the science is unknown (if we knew it, why would we be running an experiment?) — here, we give it a concrete value just to illustrate our point.</p>
<p>Now, say we observe the assignment \(W^{(1)} = (1, 1, 1, 0, 0, 0)\). Recall that
we defined the observed outcome of unit \(i\) to be
\(Y_i = W_i \, y_i(1) + (1-W_i) \, y_i(0)\), so the vector
of observed outcomes becomes \(Y = (3, 1, 4, 0, 3, 0)\), and the
difference-in-means estimator is:</p>
<p>\[
\hat{\tau}^{(1)} = \hat{\tau}(W^{(1)}) = \frac{1}{3} (3 + 1 + 4) - \frac{1}{3} (0 + 3 + 0) = \frac{5}{3}.
\]</p>
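<p>The same arithmetic can be reproduced in a few lines of <code>R</code>. This is only a small sketch: the helper names <code>observed_outcomes</code> and <code>diff_in_means</code> are ours, chosen for illustration, and the science table is the illustrative one given above.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Illustrative science table from the post (unknown in a real experiment)
science <- data.frame(y0 = c(2, 1, 2, 0, 3, 0),
                      y1 = c(3, 1, 4, 1, 3, 2))

# Observed outcomes implied by an assignment vector w (hypothetical helper)
observed_outcomes <- function(w, science) {
  w * science$y1 + (1 - w) * science$y0
}

# Difference-in-means estimator (hypothetical helper)
diff_in_means <- function(w, y) mean(y[w == 1]) - mean(y[w == 0])

W1 <- c(1, 1, 1, 0, 0, 0)
Y1 <- observed_outcomes(W1, science)
hat_tau_1 <- diff_in_means(W1, Y1)

print(Y1)         # observed outcome vector
print(hat_tau_1)  # difference in means, 5/3
</code></pre></div>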
<hr>
<p><strong>Note:</strong>
<em>Mathematically, the estimator \(\hat{\tau}\) is a function of both the assignment
\(W\) and the outcome vector \(Y\). Since \(Y\) is itself a function of
\(W\) and the science \(\underline{y}\), we could write
\(\hat{\tau} = \hat{\tau}(W, \underline{y})\), with the understanding that it is
a function that can be computed from observed data since the dependence on \(\underline{y}\)
is only through the observed \(Y\). Ultimately though, since we consider the
science table \(\underline{y}\) as fixed and unknown, we keep it implicit in our
notation for the estimator and simply write \(\hat{\tau}(W)\).</em></p>
<hr>
<p>What happens if we had observed a different assignment? For instance, say we observed the assignment \(W^{(2)} = (1, 0, 1, 0, 1, 0)\),
then we would have observed the outcome vector \(Y = (3, 1, 4, 0, 3, 0)\), and
the difference-in-means estimator would have had the value:</p>
<p>\[
\hat{\tau}^{(2)} = \hat{\tau}(W^{(2)}) = \frac{1}{3} (3 + 4 + 3) - \frac{1}{3} (1+0+0) = 3.
\]</p>
<p>Now, since we consider a completely randomized design with parameter
\(n_1 = 3\), both assignments \(W^{(1)}\) and \(W^{(2)}\) have equal probability of being observed:</p>
<p>\[
pr(W^{(1)}) = pr(W^{(2)}) = \frac{1}{6 \choose 3}.
\]</p>
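<p>If you want to see the support of this design explicitly, one way (a sketch, using base <code>R</code>'s <code>combn</code>; the variable names are ours) is to list every choice of 3 treated units out of 6 and turn each choice into a 0/1 assignment vector:</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">N  <- 6   # number of units
n1 <- 3   # number of treated units

# Each column of treated_sets holds the indices of the treated units
treated_sets <- combn(N, n1)

# Convert each set of treated indices into a 0/1 assignment vector;
# W_support is an N x choose(N, n1) matrix, one assignment per column
W_support <- apply(treated_sets, 2,
                   function(idx) as.numeric(seq_len(N) %in% idx))

n_assignments <- ncol(W_support)
prob_each <- 1 / n_assignments   # every assignment in the support is equally likely

print(n_assignments)
print(prob_each)
</code></pre></div>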
<p>The random variable \(\hat{\tau}\), therefore, takes the value \(\frac{5}{3}\)
with probability \({6 \choose 3}^{-1}\) and the value \(3\) with probability
\({6 \choose 3}^{-1}\). If we repeat this with all \(6 \choose 3\) possible
values of the assignment vector, we obtain the distribution of \(\hat{\tau}\)
induced by the assignment mechanism \(pr(W)\). Formally, for
\(t \in \mathbb{R}\), we define</p>
<p>\[
pr(\hat{\tau} = t) \equiv \sum_{w} \mathbb{I}[\hat{\tau}(w)=t]\,\, pr(W = w)
\]</p>
<p>where the function \(\mathbb{I}[\cdot]\) is an indicator that takes value
\(1\) if whatever is inside it is true, and value \(0\) otherwise.</p>
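<p>Here is a minimal sketch of that definition in <code>R</code>, again using the illustrative science table (which we would not know in practice). The rounding step is our own addition: it groups estimates that are equal up to floating-point error.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Illustrative science table from the post
science <- data.frame(y0 = c(2, 1, 2, 0, 3, 0),
                      y1 = c(3, 1, 4, 1, 3, 2))
N  <- nrow(science)
n1 <- 3

# All assignments in the support of the completely randomized design
W_support <- apply(combn(N, n1), 2,
                   function(idx) as.numeric(seq_len(N) %in% idx))

# Estimate that would be observed under each assignment
hat_tau <- apply(W_support, 2, function(w) {
  y <- w * science$y1 + (1 - w) * science$y0
  mean(y[w == 1]) - mean(y[w == 0])
})

# pr(W = w) is uniform over the support
pr_w <- rep(1 / ncol(W_support), ncol(W_support))

# pr(hat_tau = t): add up pr(W = w) over the assignments w with hat_tau(w) = t
pr_hat_tau <- tapply(pr_w, round(hat_tau, 10), sum)
print(pr_hat_tau)
</code></pre></div>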
<p>In summary, if we knew the entire science table \(\underline{y}\), we could
build the following table, which summarizes the distribution \(pr(\hat{\tau})\).</p>
<table>
<thead>
<tr>
<th>k</th>
<th>\(W^{(k)}\)</th>
<th>\(\hat{\tau}(W^{(k)})\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>\((1,1,1,0,0,0)\)</td>
<td>\(\frac{5}{3}\)</td>
</tr>
<tr>
<td>2</td>
<td>\((1,0,1,0,1,0)\)</td>
<td>\(3\)</td>
</tr>
<tr>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr>
<td>\(6 \choose 3\)</td>
<td>\((0,0,0,1,1,1)\)</td>
<td>\(\frac{1}{3}\)</td>
</tr>
</tbody>
</table>
<p>If you think about it, this table provides a lot of information! For example, it lets us compute our estimator’s expectation and variance—two rather valuable quantities. Of course, we don’t know the entire science table. Therefore we never see the previous table — or to be more specific, <strong>we only ever observe a single row</strong> from that table: this is yet another manifestation of the fundamental problem of causal inference.</p>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/rbi-source-of-randomness.jpg" />
</div>
<a href="/image/rbi-source-of-randomness.jpg" itemprop="contentUrl"></a>
<figcaption><h4>The figure illustrates pr(W) as the source of randomness and provides another visual of the above table.</h4>
</figcaption>
</figure>
</div>
<h2 id="neymanian-inference">Neymanian inference</h2>
<p>The general inferential strategy we discuss in this post is named after the famous
statistician Jerzy Neyman. It leverages the ideas directly from the previous section
by focusing on the properties of the estimator. Indeed, we have
seen that \(\hat{\tau}\) is a random variable whose
distribution is induced by the assignment distribution \(W \sim pr(W)\). It
therefore makes sense to consider quantities like the expectation
\(E[\hat{\tau}]\) of \(\hat{\tau}\) as well as the bias of \(\hat{\tau}\)
for estimating \(\tau\).</p>
<p>Let’s start with the expectation, because it is helpful in reinforcing how
randomization inference works. For an assignment mechanism \(pr(W)\),
define \(\mathbb{W} = \{W: pr(W) > 0\}\) to be the support of the design, and
denote by \(\mathbb{T} = \{\hat{\tau}(w), w \in \mathbb{W}\}\) the set of
all distinct values the estimator can take. By definition, we have</p>
<p>\[
E[\hat{\tau}] = \sum_{t \in \mathbb{T}} t \, \cdot \, pr(\hat{\tau} = t).
\]</p>
<p>The following equivalent formulation, which sums over assignments rather than over values of the estimator, is often more convenient:</p>
<p>\[
E[\hat{\tau}] = \sum_{t \in \mathbb{T}} t \, \cdot \, pr(\hat{\tau} = t) = \sum_{w \in \mathbb{W}} \hat{\tau}(w) \, \cdot \, pr(W = w).
\]</p>
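<p>As a sanity check, the two sums can be compared numerically on the illustrative science table. This is a sketch with our own variable names (<code>E_by_t</code>, <code>E_by_w</code>); both sums should agree up to numerical precision.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Illustrative science table from the post
science <- data.frame(y0 = c(2, 1, 2, 0, 3, 0),
                      y1 = c(3, 1, 4, 1, 3, 2))
N  <- nrow(science)
n1 <- 3

W_support <- apply(combn(N, n1), 2,
                   function(idx) as.numeric(seq_len(N) %in% idx))
pr_w <- rep(1 / ncol(W_support), ncol(W_support))

hat_tau <- apply(W_support, 2, function(w) {
  y <- w * science$y1 + (1 - w) * science$y0
  mean(y[w == 1]) - mean(y[w == 0])
})

# Sum over assignments w in the support
E_by_w <- sum(hat_tau * pr_w)

# Sum over distinct values t of the estimator (rounded to merge float duplicates)
t_rounded <- round(hat_tau, 10)
t_vals <- unique(t_rounded)
pr_t <- sapply(t_vals, function(t) sum(pr_w[t_rounded == t]))
E_by_t <- sum(t_vals * pr_t)

print(c(E_by_w = E_by_w, E_by_t = E_by_t))
</code></pre></div>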
<p>As we discussed, since we don’t know the science table, we only observe a single
realization from the random variable \(\hat{\tau}\), corresponding to a single draw
from the distribution \(pr(\hat{\tau})\). It turns out, however, that we can
say a lot about \(pr(\hat{\tau})\) even if we only observe the estimate
\(\hat{\tau}(W)\) corresponding to the observed assignment \(W\), as
long as we know the assignment mechanism \(pr(W)\).</p>
<p>For instance, in the previous post, we stated the following proposition:</p>
<p><strong>Proposition:</strong>
<em>If \(W\) is assigned according to a completely
randomized design, then the difference in means estimator
\(\hat{\tau}\) is unbiased for the average
treatment effect \(\tau\).</em></p>
<p>Below we provide a formal proof of this result to give a sense of how the math works. Even if you’re not super interested in the math, we strongly recommend that you take a close look, as it will help you understand randomization-based inference better.</p>
<hr>
<p><strong>Proof:</strong> The goal,
stated in mathematical terms, is to show that:</p>
<p>\[
E[\hat{\tau}] = \tau.
\]</p>
<p>We’ll do that in four steps.</p>
<p>(1) Recall from basic probability that the expectation is linear,</p>
<p>\[
\begin{aligned}
E[\hat{\tau}] &= E\bigg[\frac{1}{N_1} \sum_{i=1}^N W_i Y_i - \frac{1}{N_0}\sum_{i=1}^N (1-W_i) Y_i \bigg] \\\<br>
&= \frac{1}{N_1} \sum_{i=1}^N E[W_i Y_i] - \frac{1}{N_0} \sum_{i=1}^N E[(1-W_i) Y_i].
\end{aligned}
\]</p>
<p>(2) Now the tricky part is that both \(W_i\) and \(Y_i\) are random variables,
so we <strong>cannot</strong> write \(E[W_i Y_i] = E[W_i] E[Y_i]\) <em>a priori</em>. The key is to notice that since by definition
\(Y_i = W_i y_i(1) + (1-W_i) y_i(0)\), we have:</p>
<p>\[
W_i Y_i = W_i^2 y_i(1) + W_i (1-W_i) y_i(0) = W_i y_i(1).
\]</p>
<p>This helps because \(y_i(1)\) is a constant, so</p>
<p>\[
E[W_i Y_i] = E[W_i y_i(1)] = E[W_i] y_i(1).
\]</p>
<p>(3) Finally, recall from basic probability that if \(W_i\) is a binary
random variable,
\[
E[W_i] = P(W_i = 1) = \frac{N_1}{N}
\]
where the second equality follows from the fact that the design is completely
randomized. So, all together, we have:
\[
E[W_iY_i] = \frac{N_1}{N} y_i(1)
\]
An exactly parallel derivation shows that:
\[
E[(1-W_i)Y_i] = \frac{N_0}{N} y_i(0)
\]</p>
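<p>Steps (2) and (3) are easy to check by brute force. The sketch below (our own variable names, illustrative science table) averages \(W_i\) and \(W_i Y_i\) over every assignment of a completely randomized design with \(N = 6\) and \(N_1 = 3\).</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Illustrative science table from the post
science <- data.frame(y0 = c(2, 1, 2, 0, 3, 0),
                      y1 = c(3, 1, 4, 1, 3, 2))
N  <- nrow(science)
N1 <- 3

# Rows are units, columns are the equally likely assignments
W_support <- apply(combn(N, N1), 2,
                   function(idx) as.numeric(seq_len(N) %in% idx))

# E[W_i]: since assignments are equally likely, this is a row mean
E_W <- rowMeans(W_support)

# Observed outcomes under each assignment (y1 and y0 recycle down the rows)
Y_all <- W_support * science$y1 + (1 - W_support) * science$y0

# E[W_i Y_i], to be compared with (N1 / N) * y_i(1)
E_WY <- rowMeans(W_support * Y_all)

print(E_W)
print(rbind(E_WY, target = (N1 / N) * science$y1))
</code></pre></div>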
<p>(4) Putting it all together, we have:
\[
\begin{aligned}
E[\hat{\tau}] &= \frac{1}{N_1} \sum_{i=1}^N \frac{N_1}{N} y_i(1) - \frac{1}{N_0} \sum_{i=1}^N \frac{N_0}{N} y_i(0) \\\<br>
&= \frac{1}{N} \sum_{i=1}^N y_i(1) - \frac{1}{N} \sum_{i=1}^N y_i(0) \\\<br>
&= \tau
\end{aligned}
\]
which completes the proof.</p>
<hr>
<p>If you think about it, this result is remarkable. We have made no assumption whatsoever on the potential outcomes \(\underline{y}\)
— these could be anything. The only assumption made in this proposition is
that the treatment was assigned in a particular random way. But since the
assignment mechanism \(pr(W)\) is presumably under our control, this is not
much of an assumption: the proposition tells us that if we run our experiment
a certain way, the difference-in-means estimator \(\hat{\tau}\) will be
unbiased.</p>
<p>Now unbiased estimators are good, but knowing that an estimator is unbiased gives us only a partial picture of its performance. What we really want is a measure of our uncertainty, as provided by confidence intervals (or credible intervals, for Bayesians). Incredibly, it turns out that we can construct such intervals with the difference-in-means estimator and prove that they are valid for the ATE while making only a few very mild assumptions. In particular, we still don’t need to assume that the potential outcomes are independent or identically distributed (or, indeed, even random). The pseudo-theorem below gives you a high-level picture of the kind of result that can be derived:</p>
<blockquote>
<p><strong>Pseudo-theorem:</strong> Suppose that \(pr(W)\) is a completely randomized design.
Then we can
construct an interval \(\widehat{CI} = \widehat{CI}(Y, W)\) that depends only
on observed quantities, such that under mild conditions,
\[
pr(\tau \in \widehat{CI}) \geq 0.95
\]
for large \(N\). The interval \(\widehat{CI}\) is therefore a valid
\(95\%\) confidence interval for the average treatment effect \(\tau\).</p>
</blockquote>
<p>Although stated rather informally, this is a beautiful result. It essentially says that one can construct valid confidence intervals for the ATE without assuming much about \(\underline{y}\) if one controls the assignment mechanism. Stating this result formally requires some more technical tools and concepts that are beyond the scope of this introductory post: we will tackle them in a subsequent article in this series.</p>
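<p>To make this a little more concrete, here is a rough simulation sketch. It assumes one standard construction, which may or may not be the exact interval the pseudo-theorem has in mind: the difference in means plus or minus a normal quantile times the square root of Neyman's conservative variance estimate \(s_1^2/N_1 + s_0^2/N_0\). The science table, the sample sizes, and the function name <code>neyman_ci</code> are all made up for illustration.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">set.seed(1)
N  <- 100
N1 <- 50
N0 <- N - N1

# Hypothetical science table, generated once and then held FIXED
y0 <- rexp(N)
y1 <- y0 + rnorm(N, mean = 1)
tau <- mean(y1) - mean(y0)

# Interval based on Neyman's conservative variance estimate (assumed construction)
neyman_ci <- function(w, y, level = 0.95) {
  est <- mean(y[w == 1]) - mean(y[w == 0])
  se  <- sqrt(var(y[w == 1]) / sum(w == 1) + var(y[w == 0]) / sum(w == 0))
  z   <- qnorm(1 - (1 - level) / 2)
  c(lower = est - z * se, upper = est + z * se)
}

# The only randomness is the assignment: redraw W, keep y0 and y1 unchanged
covered <- replicate(2000, {
  w  <- sample(rep(c(1, 0), c(N1, N0)))  # one draw from the completely randomized design
  y  <- w * y1 + (1 - w) * y0
  ci <- neyman_ci(w, y)
  all(c(ci[["lower"]] <= tau, tau <= ci[["upper"]]))
})
coverage <- mean(covered)
print(coverage)
</code></pre></div>
<p>Note that the potential outcomes are generated once and never redrawn; the coverage is computed over repeated assignments only.</p>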
<h2 id="convince-yourself">Convince yourself!</h2>
<p>The ideas presented in this post are not difficult, but they may feel unfamiliar to many — because they are! We have seen two main sources of confusion in students seeing this material for the first time:</p>
<ol>
<li>Understanding what is random and what is not.</li>
<li>Understanding what is observed and what is not.</li>
</ol>
<p>The best way to become comfortable with randomization-based inference — and
convince yourself that there is no magic involved — is to play around with small
simulations.</p>
<p>The <code>R</code> code below illustrates the example we considered throughout this post. You
can also find it on <a href="https://gist.github.com/gwb/913ba53542e24e63f27291f9f655fb03">GitHub</a>. Go ahead and modify the values of the potential outcomes in the
science table on lines 23 and 24 (making sure to keep them of length 6, so the rest
of the code works unchanged). You will see that if you modify the science, the
value of the variable <code>tau</code> will change, but the value of <code>bias</code> will
remain equal to 0, up to numerical precision!</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span><span class="lnt">44
</span><span class="lnt">45
</span><span class="lnt">46
</span><span class="lnt">47
</span><span class="lnt">48
</span><span class="lnt">49
</span><span class="lnt">50
</span><span class="lnt">51
</span><span class="lnt">52
</span><span class="lnt">53
</span><span class="lnt">54
</span><span class="lnt">55
</span><span class="lnt">56
</span><span class="lnt">57
</span><span class="lnt">58
</span><span class="lnt">59
</span><span class="lnt">60
</span><span class="lnt">61
</span><span class="lnt">62
</span><span class="lnt">63
</span><span class="lnt">64
</span><span class="lnt">65
</span><span class="lnt">66
</span><span class="lnt">67
</span><span class="lnt">68
</span><span class="lnt">69
</span><span class="lnt">70
</span><span class="lnt">71
</span><span class="lnt">72
</span><span class="lnt">73
</span></code></pre></td>
<td class="lntd">
<pre class="chroma"><code class="language-R" data-lang="R"><span class="nf">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span> <span class="c1"># provides `%>%`</span>
<span class="nf">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span> <span class="c1"># provides `array_tree` and `map`</span>
<span class="nf">library</span><span class="p">(</span><span class="n">gtools</span><span class="p">)</span> <span class="c1"># provides `permutations`</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## Estimand function and estimator function</span>
<span class="c1">## ----------------------------</span>
<span class="c1"># average treatment effect function</span>
<span class="n">tau.fn</span> <span class="o"><-</span> <span class="nf">function</span><span class="p">(</span><span class="n">science</span><span class="p">)</span> <span class="nf">mean</span><span class="p">(</span><span class="n">science</span><span class="o">$</span><span class="n">y1</span><span class="p">)</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">science</span><span class="o">$</span><span class="n">y0</span><span class="p">)</span>
<span class="c1"># Difference in means function</span>
<span class="n">hat.tau.fn</span> <span class="o"><-</span> <span class="nf">function</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="nf">mean</span><span class="p">(</span><span class="n">y[w</span><span class="o">==</span><span class="m">1</span><span class="n">]</span><span class="p">)</span> <span class="o">-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">y[w</span><span class="o">==</span><span class="m">0</span><span class="n">]</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## The science table + estimand</span>
<span class="c1">## (unobserved quantities)</span>
<span class="c1">## ----------------------------</span>
<span class="c1"># The science :: try changing it and see what happens!</span>
<span class="n">science</span> <span class="o"><-</span> <span class="nf">data.frame</span><span class="p">(</span><span class="n">y0</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">0</span><span class="p">),</span>
<span class="n">y1</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">2</span><span class="p">))</span>
<span class="c1"># The (unobserved) ATE corresponding to the (unobserved) science</span>
<span class="n">tau</span> <span class="o"><-</span> <span class="nf">tau.fn</span><span class="p">(</span><span class="n">science</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## observed data + estimate</span>
<span class="c1">## (observed quantities)</span>
<span class="c1">## ----------------------------</span>
<span class="c1"># Observed assignment vector</span>
<span class="n">W.obs</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span>
<span class="c1"># Observed outcome vector</span>
<span class="n">Y.obs</span> <span class="o"><-</span> <span class="n">science</span><span class="o">$</span><span class="n">y1</span> <span class="o">*</span> <span class="n">W.obs</span> <span class="o">+</span> <span class="n">science</span><span class="o">$</span><span class="n">y0</span> <span class="o">*</span> <span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">W.obs</span><span class="p">)</span>
<span class="c1"># Observed estimate of the ATE</span>
<span class="n">hat.tau.obs</span> <span class="o"><-</span> <span class="nf">hat.tau.fn</span><span class="p">(</span><span class="n">W.obs</span><span class="p">,</span> <span class="n">Y.obs</span><span class="p">)</span>
<span class="c1">## ----------------------------</span>
<span class="c1">## bias of estimator</span>
<span class="c1">## ----------------------------</span>
<span class="c1"># The list of all possible assignments in a completely randomized design </span>
<span class="c1"># with 3 treated units and 3 control units. You can check that the </span>
<span class="c1"># list has 20 elements, which is exactly 6 choose 3.</span>
<span class="n">W.ls</span> <span class="o"><-</span> <span class="n">gtools</span><span class="o">::</span><span class="nf">permutations</span><span class="p">(</span><span class="m">6</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">)</span> <span class="o">%>%</span>
<span class="n">purrr</span><span class="o">::</span><span class="nf">array_tree</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="o">%>%</span>
<span class="n">purrr</span><span class="o">::</span><span class="nf">map</span><span class="p">(</span><span class="o">~</span> <span class="n">W.obs[.]</span><span class="p">)</span> <span class="o">%>%</span>
<span class="n">unique</span>
<span class="c1"># For each possible assignment W, compute the outcome Y that would be </span>
<span class="c1"># observed, and the corresponding estimate</span>
<span class="c1">#</span>
<span class="c1"># Note: in practice, we only observe a single element of hat.tau.ls, </span>
<span class="c1"># namely hat.tau.obs</span>
<span class="n">hat.tau.ls</span> <span class="o"><-</span> <span class="nf">vector</span><span class="p">(</span><span class="s">'numeric'</span><span class="p">,</span> <span class="n">length</span> <span class="o">=</span> <span class="nf">length</span><span class="p">(</span><span class="n">W.ls</span><span class="p">))</span>
<span class="nf">for</span><span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="nf">seq_along</span><span class="p">(</span><span class="n">W.ls</span><span class="p">))</span> <span class="p">{</span>
<span class="n">W</span> <span class="o"><-</span> <span class="n">W.ls[[i]]</span>
<span class="n">Y</span> <span class="o"><-</span> <span class="n">science</span><span class="o">$</span><span class="n">y1</span> <span class="o">*</span> <span class="n">W</span> <span class="o">+</span> <span class="n">science</span><span class="o">$</span><span class="n">y0</span> <span class="o">*</span> <span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">W</span><span class="p">)</span>
<span class="n">hat.tau.ls[i]</span> <span class="o"><-</span> <span class="nf">hat.tau.fn</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="p">}</span>
<span class="c1"># The bias of the difference in means estimator (under complete randomization)</span>
<span class="n">bias</span> <span class="o"><-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">hat.tau.ls</span><span class="p">)</span> <span class="o">-</span> <span class="n">tau</span>
</code></pre></td></tr></table>
</div>
</div>Introduction to Identification/post/identification/Fri, 12 Feb 2021 00:00:00 +0000/post/identification/<h2 id="identification-what-is-it">Identification: What is it?</h2>
<p>Identification and identifiable are two commonly used words in Statistics and
Economics. But have you ever wondered what it really means to say that a
quantity is identifiable from the data? Statisticians seem to agree on a
definition in the context of parametric models — calling a parameter
identifiable if distinct values of the parameter correspond to distinct members
of a parametric family. Intuitively, this means that if we had an infinite
amount of data, we could learn the actual model parameters used to generate the
data — it’s kinda neat!</p>
<p>So this definition makes sense for a parametric model, but what about
nonparametric models? And what if you don’t have an explicit model that you want
to use? For instance, in the context of an experiment or an observational study,
we might want to ask whether the average treatment effect of an intervention is
identifiable from the data. Unfortunately, the simple definition of parametric
identifiability does not apply if we take a nonparametric or randomization-based
approach.</p>
<p>This post will build on the parametric intuition to describe a more general
notion of identifiability that works for all types of settings. Our goal is to
explain the definition and provide examples of how to apply it in a range of
different contexts. The next few posts in the series will dive deeper into
the role of identification in the causal context.</p>
<h2 id="a-general-notion-of-identifiability">A general notion of identifiability</h2>
<h3 id="basic-framework">Basic framework</h3>
<p>The identification framework we will consider consists of three elements:</p>
<ol>
<li>A <em>statistical universe</em> \(S\) that contains all of the objects relevant to a given
problem.</li>
<li>An <em>estimand mapping</em> \(\theta : S \to \Theta \) that describes what aspect of the
statistical universe we are trying to learn about.</li>
<li>An <em>observation mapping</em> \( \lambda: S \to \Lambda \) that tells us what parts of our statistical universe
we observe.</li>
</ol>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/identification-diagram.png" />
</div>
<a href="/image/identification-diagram.png" itemprop="contentUrl"></a>
<figcaption><h4>Diagram of the main objects.</h4>
</figcaption>
</figure>
</div>
<p>We then define identification by studying the inherent relationship between the
estimand mapping and the observation mapping using the <em>induced binary relation</em>
\(R\). Intuitively, the induced binary relation connects “what we know” to
“what we are trying to learn” through the “statistical universe” in which we
operate.</p>
<hr>
<p><strong>Binary relation interlude:</strong> If you are not familiar with <a href="https://en.wikipedia.org/wiki/Binary_relation">binary
relations</a>, or it’s been a while
since you’ve seen them, they are basically a generalization of a
function. Mathematically, a binary relation \(R\) from a set \(\Theta\) to
\(\Lambda\) is a subset of the cartesian product \(\Theta \times
\Lambda\). For \(\vartheta \in \Theta\) and \(\ell \in \Lambda\), we say
that \(\vartheta\) is \(R\)-related to \(\ell\) if \((\vartheta, \ell)
\in R\) and write \(\vartheta R \ell\).</p>
<p>For example, let \(\Theta\) be the set of prime numbers, \(\Lambda\) be the
set of integers, and \(R\) the “divides” relation such that \(\vartheta R
\ell \) if \(\vartheta\) divides \(\ell\) (<em>e.g.</em>, \(3 R 3\), \(3 R
6\), but 3 is not in relation with 2).</p>
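<p>For readers who prefer code, the “divides” relation can be written down explicitly on a small, finite slice of these sets. This is a sketch: the truncation to a few primes and to the integers 1 through 12, and the helper <code>is_related</code>, are our own choices for illustration.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">Theta  <- c(2, 3, 5, 7)   # a few primes (finite slice, for illustration)
Lambda <- 1:12            # a few integers

# The relation R is a subset of the Cartesian product Theta x Lambda
pairs <- expand.grid(theta = Theta, ell = Lambda)
R <- pairs[pairs$ell %% pairs$theta == 0, ]

# theta is R-related to ell if the pair (theta, ell) belongs to R
is_related <- function(theta, ell) ell %in% R$ell[R$theta == theta]

print(is_related(3, 3))
print(is_related(3, 6))
print(is_related(3, 2))

# Unlike a function, one theta can be related to several ell
print(R$ell[R$theta == 3])
</code></pre></div>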
<hr>
<h3 id="definition-of-identification">Definition of Identification</h3>
<p>Specifying the three elements of the framework for a given problem gives rise
to a special binary relation.</p>
<dl>
<dt><strong>Definition (Induced binary relation)</strong></dt>
<dd>Consider \(\mathcal{S} \) and \(\theta, \lambda \in G(\mathcal{S} )\),
where \(G(\mathcal{S} )\) is the set of all functions with domain
\(\mathcal{S}\) and let \(\Theta = Img(\theta)\) and \(\Lambda =
Img(\lambda)\). We define the binary relation \(R_{\theta, \lambda}\)
induced by \(\mathcal{S}, \theta, \lambda\) as the subset
\(R_{\theta,\lambda} = \{(\theta(S), \lambda(S)), S \in \mathcal{S}\}
\subseteq \Theta \times \Lambda\).</dd>
</dl>
<p>With this in place, we say that \(\theta\), the estimand mapping, is
identifiable from \(\lambda\) if the induced binary relation \(R\) is
injective. In words, if there is a 1-1 relationship between what we are trying
to <em>estimate</em> and <em>what we observe</em>, then the estimand mapping is identifiable
from the observation mapping. This can be formalized mathematically as follows:</p>
<dl>
<dt><strong>Definition (Identifiability)</strong></dt>
<dd>Consider \(\mathcal{S}, \lambda, \theta\) and \(\Theta, \Lambda\) as in
the previous definition. For a given \(\ell_0 \in \Lambda\), let
\(\mathcal{S}(\ell_0) = \{S \in \mathcal{S}: \lambda(S) = \ell_0\}\). The
function \(\theta\) is said to be \(R_{\theta,\lambda}\)-identifiable at
\(\ell_0 \in \Lambda\) if there exists \(\vartheta_0 \in \Theta\), such
that, for all \(S \in \mathcal{S}(\ell_0)\), we have that
\(\theta(S)=\vartheta_{0}\).
<p>The function \(\theta\) is said to be \(R_{\theta,\lambda}\)-identifiable
everywhere from \(\lambda\) if it is
\(R_{\theta,\lambda}\)-identifiable at \(\ell\) for all \(\ell \in
\Lambda\). We will usually simply say that \(\theta\) is identifiable.</p>
</dd>
</dl>
<p>This definition might seem really abstract, but it’s exactly the intuitive
definition we gave in the introduction. It says that if the part
\(\mathcal{S}(\ell_0)\) of the statistical universe \(\mathcal{S}\) that is
coherent with the observed data \(\ell_0\) uniquely corresponds to a single
estimand of interest — i.e. \(\theta(S) = \vartheta_0, \forall S \in
\mathcal{S}(\ell_0)\) — then that estimand is identifiable!</p>
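<p>On a finite statistical universe, the definition can be checked mechanically. The toy universe below, with elements \(S = (a, b)\), and the function <code>is_identifiable</code> are purely hypothetical, meant as a sketch of the definition rather than a general-purpose tool.</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R"># Hypothetical finite universe: each row is one element S = (a, b)
universe <- data.frame(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1))

# theta is identifiable if it takes a single value on each set S(ell_0),
# i.e. on each group of elements sharing the same observation lambda(S)
is_identifiable <- function(universe, theta, lambda) {
  rows <- seq_len(nrow(universe))
  theta_vals  <- sapply(rows, function(i) theta(universe[i, ]))
  lambda_vals <- sapply(rows, function(i) lambda(universe[i, ]))
  n_distinct  <- tapply(theta_vals, lambda_vals,
                        function(v) length(unique(v)))
  all(n_distinct == 1)
}

lambda_sum <- function(S) S$a + S$b   # we only get to observe a + b
theta_sum  <- function(S) S$a + S$b   # estimand: the sum itself
theta_a    <- function(S) S$a         # estimand: the first component

res_sum <- is_identifiable(universe, theta_sum, lambda_sum)  # TRUE
res_a   <- is_identifiable(universe, theta_a, lambda_sum)    # FALSE
print(c(sum = res_sum, a = res_a))
</code></pre></div>
<p>The second check fails because observing \(a + b = 1\) is coherent with both \(a = 0\) and \(a = 1\).</p>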
<h2 id="examples">Examples</h2>
<p>Below we describe how the above definition applies to parametric and nonparametric models as well as finite population settings.</p>
<h3 id="parametric-models">Parametric models</h3>
<p>Consider a parametric model where \(P(\vartheta)\) is a distribution indexed
by \(\vartheta \in \Theta\).</p>
<p>The <em>statistical universe</em> is then \(\mathcal{S} = \{(P(\vartheta),\vartheta): \vartheta\in \Theta \}\). Notice that each element \(S = (P(\vartheta), \vartheta)\) contains both the model and the parameters that determine the model.</p>
<p>The <em>estimand mapping</em> is \(\theta(S) = \vartheta\). This defines the quantity of interest: in the simplest case, it is just the parameters of the model, but it can also be any function or subset of the parameters, i.e. \(\theta(S) = \phi(\vartheta)\).</p>
<p>The <em>observation mapping</em> is \( \lambda(S) = P(\vartheta) \). This is simply the model for a given parameter.</p>
<p>Remember, identification is about understanding the limits of our estimation
with an infinite amount of data, and so the observations correspond exactly to
the full distribution. If \(\phi(\vartheta) = \vartheta\), the induced binary
relation \(R_{\theta,\lambda}\) reduces to a simple function
\(R_{\theta,\lambda}: \vartheta \rightarrow P(\vartheta)\) and the abstract
definition we gave above is exactly equivalent to the textbook definition of
identification for parametric statistical models.</p>
<dl>
<dt><strong>Linear Regression Example</strong></dt>
<dd>Consider a \(p\)-dimensional random vector \(X \sim P(X)\) for some
distribution \(P(X)\).
Let \(P(Y \mid X; \beta,\sigma^2) = \mathcal{N}(X^t\beta, \sigma^2)\), where
\(\mathcal{N}(\mu,\sigma^2)\) is the normal distribution with mean \(\mu\)
and variance \(\sigma^2\), and let \(P_{\beta,\sigma^2}(X,Y) = P(Y \mid X;
\beta, \sigma^2) P(X)\).
<p><em>Example question:</em> Are the regression parameters (\(\beta\) and \(\sigma^2\))
identifiable?</p>
<p><em>Identification setup:</em> We can study the identifiability of the parameter
\(\vartheta = (\beta, \sigma^2)\) from the joint distribution
\(P_{\vartheta}(X,Y)\) using our framework, by letting \(\mathcal{S} =
\{(P_{\vartheta}, \vartheta), \vartheta \in \Theta\}\), where \(\Theta =
\mathbb{R}^p \times \mathbb{R}^{+}\), \(\lambda(S) = P_{\vartheta}\), and
\(\theta(S) = \vartheta\). In this case, the induced binary relation
\(R_{\theta,\lambda}\) reduces to a function, and so \(\vartheta\)
is identifiable iff the function \(R_{\theta,\lambda}\) is injective. It
is easy to verify that this is the case if and only if the \(p \times p\) matrix
\(E[XX^t]\) has full rank. If, in contrast, we take \(\theta'(S) = \beta\),
then the induced binary relation \(R_{\theta', \lambda}\) is no longer
a function, but our general definition of identifiability still applies.</p>
</dd>
</dl>
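<p>The rank condition is easy to see in a small simulation. In the sketch below (simulated covariates and variable names are our own), the second covariate is an exact multiple of the first, so the second-moment matrix of \(X\) is rank deficient and two different values of \(\beta\) yield exactly the same conditional mean \(X^t\beta\), hence the same distribution of \((X, Y)\).</p>
<div class="highlight"><pre class="chroma"><code class="language-R" data-lang="R">set.seed(1)
n  <- 1000
x1 <- rnorm(n)

X_collinear <- cbind(x1, 2 * x1)    # second column is a multiple of the first
X_full      <- cbind(x1, rnorm(n))  # two unrelated covariates

# Sample analogue of the p x p second-moment matrix of X
second_moment <- function(X) crossprod(X) / nrow(X)

rank_collinear <- qr(second_moment(X_collinear))$rank
rank_full      <- qr(second_moment(X_full))$rank

# Two distinct coefficient vectors; their difference (-2, 1) is
# annihilated by the collinear design
beta_a <- c(1, 0)
beta_b <- c(-1, 1)
same_mean <- isTRUE(all.equal(drop(X_collinear %*% beta_a),
                              drop(X_collinear %*% beta_b)))

print(c(rank_collinear = rank_collinear, rank_full = rank_full))
print(same_mean)
</code></pre></div>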
<h3 id="nonparametric-identification">Nonparametric identification</h3>
<p>Applying the mathematical definition of identification to nonparametric models is relatively straightforward.</p>
<dl>
<dt><strong>Missing Data Example</strong></dt>
<dd>If \(Y\) is a random variable representing a response of interest, let
\(Z\) be a missing data indicator that is equal to \(1\) if the response
\(Y\) is observed, and \(0\) otherwise; that is, the data we actually observe
is drawn from \(P(Y \mid Z=1)\).
<p><em>Example question:</em> Is the distribution of the missing outcomes \(P(Y \mid Z=0)\)
identifiable from that of the observed outcomes \(P(Y \mid Z=1)\) combined with
that of the missing data indicator \(P(Z)\)?</p>
<p><em>Identification setup:</em> Let \(\mathcal{S}\) be a family of joint distributions for \(Z\) and \(Y\),
and define \(\lambda(S) = (P(Y \mid Z=1), P(Z))\), and \(\theta(S) = P(Y \mid Z=0)\). The
question can be answered by studying the injectivity of the induced mapping \(R_{\theta, \lambda}\). We leave that question unanswered for now, but will revisit it in a subsequent installment of this series.</p>
</dd>
</dl>
<p>Notice that the same definition worked in both nonparametric and parametric models <em>without</em> having to be adjusted or tweaked.</p>
<h3 id="finite-population-identification">Finite population identification</h3>
<p>In the previous two sections, we assumed an infinite amount of data—that is the usual starting point for identification. But what do you do if you plan to take a finite population perspective? Clearly, the assumption that you have an infinite amount of data doesn’t make sense! It turns out that we can again use the mathematical definition of identification by defining the appropriate statistical universe, estimand mapping, and observation mapping.</p>
<dl>
<dt><strong>Causal Inference Example</strong></dt>
<dd>With \(N\) units, let each unit be assigned to one of two treatment interventions, \(Z_i =1\) for treatment and \(Z_i=0\) for control. Under the stable unit treatment value assumption, each unit \(i\) has two potential outcomes \(Y_i(1)\) and \(Y_i(0)\), corresponding to the outcome of unit \(i\) under treatment and control, respectively. For each unit \(i\), the observed outcome is \(Y^{\ast}_i = Y_i(Z_i) = Y_i(1) Z_i + Y_i(0) (1-Z_i)\). Let \(Y(1) = (Y_1(1), \ldots, Y_N(1))\) and \(Y(0) = (Y_1(0), \ldots, Y_N(0))\) be the vectors of potential outcomes and \(Y = (Y(1), Y(0))\).
<p><em>Example question:</em> Is \(\tau(Y) = \overline{Y(1)} - \overline{Y(0)}\) identifiable from the observed data \((Y^\ast, Z)\)?</p>
<p><em>Identification setup:</em> Let \(\mathcal{S}_Y = \mathbb{R}^N \times \mathbb{R}^N\) be the set of all possible values for \(Y\), \(\mathcal{S}_Z = \{0,1\}^N\) be the set of all possible values for \(Z\), and \(\mathcal{S} = \mathcal{S}_Y \times \mathcal{S}_Z\).</p>
<p>Take \(\theta(S) = \tau(Y)\) and \(\lambda(S) = (Y^\ast, Z)\) as the estimand and observation mapping, respectively. The question is then answerable by studying the injectivity of the induced binary relation \(R_{\theta, \lambda}\).</p>
</dd>
</dl>
<p>Once again, we will revisit this example in more detail in a future post; our goal here
is simply to show how identification questions can be formulated in a unified fashion within a simple framework.</p>
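<p>As a small illustration of the causal inference example, the two mappings can be written as plain functions. This is only a sketch of how the estimand and observation mappings are evaluated on elements of the statistical universe; the function names and the four-unit numbers are hypothetical, and it is not the fuller analysis deferred to a later post.</p>

```python
import numpy as np

def estimand(y1, y0):
    """Estimand mapping: theta(S) = mean(Y(1)) - mean(Y(0))."""
    return y1.mean() - y0.mean()

def observation(y1, y0, z):
    """Observation mapping: lambda(S) = (Y*, Z)."""
    y_obs = np.where(z == 1, y1, y0)
    return y_obs, z

z = np.array([1, 1, 0, 0])

# Element S_a of the statistical universe (hypothetical numbers).
y1_a = np.array([3.0, 5.0, 4.0, 4.0])
y0_a = np.array([1.0, 1.0, 2.0, 0.0])

# Element S_b: same assignment, different unobserved potential outcomes.
y1_b = np.array([3.0, 5.0, 0.0, 0.0])
y0_b = np.array([3.0, 5.0, 2.0, 0.0])

obs_a = observation(y1_a, y0_a, z)
obs_b = observation(y1_b, y0_b, z)

print("same observed outcomes:", np.array_equal(obs_a[0], obs_b[0]))
print("estimand at S_a:", estimand(y1_a, y0_a))
print("estimand at S_b:", estimand(y1_b, y0_b))
```

<p>Here the two elements share the same observation but map to different estimand values, which is the kind of pairing one looks for when studying the induced binary relation.</p>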
<h2 id="further-reading">Further reading</h2>
<p>We dive deeper into some of the formalism behind identification in a <a href="https://arxiv.org/abs/2002.06041">recent
paper</a>. A lot of the pioneering work on
identifiability was done by the economist Charles Manski: see for instance his
2009 monograph <em>‘Identification for prediction and decisions’</em> for a very
accessible introduction. For a recent in-depth survey of the area, see Lewbel
2019: <em>‘The identification Zoo’</em>. Finally, as mentioned in the introduction, this
is the first installment in a series dedicated to identifiability — stay tuned
for the next installments!</p>
<h2 id="references">References</h2>
<p><a href="https://arxiv.org/abs/2002.06041">Basse, G., & Bojinov, I. (2020). <em>A general theory of identification.</em> arXiv preprint arXiv:2002.06041.</a></p>
<p>Lewbel, A. (2019). <em>The identification zoo: Meanings of identification in econometrics.</em> Journal of Economic Literature.</p>
<p>Manski, C. (2009). <em>Identification for prediction and decisions.</em> Harvard University Press.</p>A Leader's Guide to Heterogeneous Treatment Effects/post/heterogenous/Mon, 25 Jan 2021 00:00:00 +0000/post/heterogenous/<p>People don’t always agree; that is a fact of life. Similarly, when running an experiment, not everyone has the same reaction to the intervention! It’s critical that data scientists, academics, and the general public understand that the global average may not always be the most important or meaningful measure. Instead, it is often more informative to study how the effect of an intervention varies across different population subgroups. This post explains, at a high level, what <em>heterogeneous treatment effects</em> are, why they are essential, and how to think about them.</p>
<h2 id="what-are-heterogeneous-treatment-effects">What are heterogeneous treatment effects?</h2>
<h3 id="intuitive-definition">Intuitive definition</h3>
<p>When analyzing a randomized experiment or observational study, analysts often report the population <em>average treatment effect</em> (ATE) as the main (and often only) summary of the causal effect of the intervention. But what does it really mean? And who does it apply to? Just because we see that an intervention has a +1 (statistically significant) effect on average, can we conclude that the treatment would have an effect of +1 on every untreated unit, if we were to treat them? In short, the answer is no. The average is an important summary, but it often fails to provide the full picture. For example, an ATE of +1 may correspond to a scenario in which the intervention affects every unit equally; but it may also correspond to a case in which it has a +3 effect for half the units and -1 for the other half (see the figure below).</p>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/heterogeneous_causal.jpg" />
</div>
<a href="/image/heterogeneous_causal.jpg" itemprop="contentUrl"></a>
<figcaption><h4>The figure illustrates how two different interventions can produce the same average causal effect. The left shows everyone having the same reaction to the intervention, whereas the right shows that some units have very positive responses and others have an adverse reaction.</h4>
</figcaption>
</figure>
</div>
<p>Heterogeneous literally means “diverse in character or content,” so when we talk about “heterogeneous treatment effects,” we acknowledge the fact that every experimental unit may have a different response to the intervention. In practice, however, it is extremely difficult to reliably assess the effect of the intervention on each individual unit, and we must instead settle on assessing the effect of the intervention on subgroups of units that share similar characteristics. For example, it is common in clinical trials to report separate estimates for how a drug impacts children and for how it impacts adults, to reflect the biological differences. Technology companies similarly track responses across different platforms and countries.</p>
<h3 id="which-heterogeneous-effects-to-look-at">Which heterogeneous effects to look at?</h3>
<p>Any collection of average effects on subgroups of the population can be considered heterogeneous effects. Typically, the subgroups are defined by looking at each unit’s covariates (or characteristics). The extreme case is the individual treatment effects that try to ascertain each experimental unit’s causal effect separately; these, however, are virtually impossible to estimate without making strong assumptions. So what is the right middle ground between the average treatment effect (which is easy to estimate but not very informative) and the individual treatment effects (which are very informative but hard to estimate)? There is no unique answer to this question as it depends on the context. One crucial tradeoff to remember is that the smaller the subgroups we consider, the more difficult it is to estimate that subgroup’s effect.</p>
<p>Broadly speaking, there are two approaches to looking for heterogeneous treatment effects. The first requires knowledge of which groups to look at, and the second attempts to learn them from the data.</p>
<h2 id="estimating-heterogeneous-effects-within-pre-specified-groups">Estimating heterogeneous effects within pre-specified groups</h2>
<h3 id="design-considerations">Design Considerations</h3>
<p>If we expect that some groups will react differently to our intervention, we can separate them before starting the experiment. In other words, we can run smaller-scale experiments within the pre-defined groups. That way, we can obtain separate estimates of the causal effects within the groups. For example, suppose you work at a technology company that has a presence in many countries. In that case, it is relatively common to separately run experiments within each country to ensure that we can estimate how the innovations are perceived differently across different geographies.</p>
<h3 id="post-experiment-analysis">Post experiment analysis</h3>
<p>We can estimate heterogeneous effects even when experiments were not run separately in each
subgroup. Imagine we have a single binary covariate (for example, whether or not someone likes pie or whether or not someone is under the age of 18) that was recorded before the treatment assignment. Then, it is easy to measure the effect within the different values of this covariate. We simply group the units that like pie and the units that dislike pie and separately compute each group’s effect. We can carry out inference in the standard way, as discussed in our post on the <a href="/post/po-introduction/">potential outcomes framework</a>. But, since we are testing multiple hypotheses, we need to perform a multiple comparison test adjustment to ensure that we control the overall type I error at the appropriate level. The simplest way to do this is via the <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a> that divides the nominal type I error by the number of tests.</p>
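<p>The subgroup analysis described above can be sketched as follows. This is a minimal illustration on simulated data, assuming a coin-flip assignment, a difference-in-means estimate with a Neyman-style standard error, and a normal approximation for the p-value; the helper <code>subgroup_effect</code> and the effect sizes are hypothetical.</p>

```python
import math
import numpy as np

def subgroup_effect(y, z):
    """Difference in means, its standard error, and a two-sided normal p-value."""
    y1, y0 = y[z == 1], y[z == 0]
    est = y1.mean() - y0.mean()
    se = math.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    p_value = math.erfc(abs(est / se) / math.sqrt(2.0))
    return est, se, p_value

rng = np.random.default_rng(1)
n = 2000
likes_pie = rng.integers(0, 2, size=n)        # binary pre-treatment covariate
z = rng.integers(0, 2, size=n)                # coin-flip treatment assignment
effect = np.where(likes_pie == 1, 2.0, 0.0)   # hypothetical subgroup effects
y = rng.normal(size=n) + effect * z

alpha = 0.05
groups = {"likes pie": likes_pie == 1, "dislikes pie": likes_pie == 0}
threshold = alpha / len(groups)               # Bonferroni-adjusted level

results = {}
for name, mask in groups.items():
    est, se, p = subgroup_effect(y[mask], z[mask])
    results[name] = (est, se, p)
    print(f"{name}: estimate={est:.2f}, se={se:.2f}, p={p:.3g}, "
          f"reject at Bonferroni level: {p < threshold}")
```

<p>Each subgroup's p-value is compared with the nominal level divided by the number of subgroups tested, rather than with the nominal level itself.</p>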
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/heterogeneous_causal_known_groups.jpg" />
</div>
<a href="/image/heterogeneous_causal_known_groups.jpg" itemprop="contentUrl"></a>
<figcaption><h4>The figure shows the potential outcomes for two distinct groups of units. The left shows the control outcomes, and the right shows the treatment outcomes. Group 1 loved the intervention, whereas group 2 isn't as keen. </h4>
</figcaption>
</figure>
</div>
<p>The post-experiment analysis approach works well with large experiments. However, depending on the adopted design, it might be the case that some groups are disproportionately assigned to certain treatments, making it hard to assess the heterogeneous effect accurately. That is why it is generally better to account for the subgroups of interest directly in the design, as suggested above.</p>
<h2 id="group-discovery">Group discovery</h2>
<p>The pre-specified strategy works well when there is only a handful of known groups. However, it starts to break down when the number of possible groups becomes large, which is often the case if we are interested in understanding how the effects differ across interactions of multiple covariates (<em>e.g.</em>, children that like pie and adults that don’t like pie). When there are many possible groupings, and we don’t know which to look at, there is rarely enough data to perform an exhaustive search, so we have to be more strategic about finding these groups.</p>
<p>There are two intuitive approaches to accomplish this. The first begins with the whole population and sequentially tries to split the sample into subgroups that reacted similarly to the intervention but differently from everyone else. The second starts with everyone in small groups and carefully combines the groups that reacted similarly to the intervention until we have a manageable number of distinct groups such that, within the groups, the units responded similarly, but across groups, there are significant differences in the effect. See the two examples below for more details.</p>
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/heterogeneous_causal_unknown_groups.jpg" />
</div>
<a href="/image/heterogeneous_causal_unknown_groups.jpg" itemprop="contentUrl"></a>
<figcaption><h4>The figure shows the potential outcomes for a set of ungrouped units. The left shows the control outcomes, and the right shows the treatment outcomes. Based on the treatment effect, we can identify two distinct groups: Group 1 loved the intervention, and group 2 didn't. </h4>
</figcaption>
</figure>
</div>
<h3 id="example-1-split-the-data-wager--athey-2018">Example 1: Split the data (Wager & Athey, 2018):</h3>
<p>The machine learning community has widely used tree-based methods for both classification and regression tasks. Typically, these methods aim to partition covariate space into a set of homogeneous rectangles and then fit a simple model on each. The partitioning is done by finding a point that minimizes the within-group variability of the outcome. In causal inference, we focus on the causal effect; thus, applying this idea requires simply augmenting the objective function to find subgroups with estimated causal effects different from the overall group. This approach works well on small to moderate size data sets; however, the computation time increases rapidly as the data’s dimensions grow. This method is described in detail in Wager & Athey (2018) and is implemented in the grf R package (Athey et al. 2019).</p>
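<p>To give a feel for the splitting idea, here is a toy sketch of a single split on one covariate. It is not the procedure of Wager & Athey (2018) and does not use the grf package: it simply scans candidate thresholds and keeps the one whose two sides have the most different difference-in-means estimates. The scoring rule, the function names, and the simulated data are all assumptions made for illustration.</p>

```python
import numpy as np

def diff_in_means(y, z):
    return y[z == 1].mean() - y[z == 0].mean()

def best_split(x, y, z, candidates):
    """Return the threshold whose two sides have the most different estimated effects."""
    best_c, best_score = None, -np.inf
    for c in candidates:
        left = x <= c
        right = ~left
        tau_left = diff_in_means(y[left], z[left])
        tau_right = diff_in_means(y[right], z[right])
        # Weight by the share of units on each side so tiny groups are not favoured.
        score = left.mean() * right.mean() * (tau_left - tau_right) ** 2
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score

rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(size=n)                  # a single covariate
z = rng.integers(0, 2, size=n)           # coin-flip treatment assignment
tau = np.where(x > 0.5, 2.0, 0.0)        # hypothetical effect: only units with x > 0.5 respond
y = rng.normal(size=n) + tau * z

candidates = np.linspace(0.1, 0.9, 17)
split, score = best_split(x, y, z, candidates)
print(f"selected threshold: {split:.2f}")
```

<p>A tree-based method would apply a search of this kind recursively and over many covariates, which is where the computational cost mentioned above comes from.</p>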
<h3 id="example-2-regroup-the-data-sepehri--diciccio-2020">Example 2: Regroup the data (Sepehri & DiCiccio, 2020):</h3>
<p>An alternative to the above approach, which groups the data by sequentially splitting larger groups into smaller ones, is to reverse the process. In particular, we can begin by grouping everyone into a fixed number of predefined subgroups based on their covariates. We then iteratively merge the two “most similar” clusters at each step until a stopping criterion is met. The notion of similarity and the stopping criterion can be defined using a formal hypothesis test; the resulting algorithm scales easily to massive data sets and provides inferential guarantees about the output. This method is described in detail in Sepehri and DiCiccio (2020).</p>
<h2 id="references">References</h2>
<p>Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148-1178.</p>
<p>Sepehri, A., & DiCiccio, C. (2020). Interpretable Assessment of Fairness During Model Evaluation. arXiv preprint arXiv:2010.13782.</p>
<p>Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.</p>A Leader's Guide to Interference/post/interference/Mon, 18 Jan 2021 00:00:00 +0000/post/interference/<h2 id="what-is-interference">What is interference?</h2>
<h3 id="basics">Basics</h3>
<p>Suppose you wish to test whether a drug is effective at reducing a person’s
blood pressure. Typically, you would assign some units to take the drug
(treatment) and others to take a placebo (control). You would then contrast
the outcome (blood pressure) of the units in the treatment group, to that of
the units in the control group.</p>
<link rel="stylesheet" href="/css/hugo-easy-gallery.css" />
<div class="box">
<figure itemprop="associatedMedia"
itemscope itemtype="http://schema.org/ImageObject" >
<div class="img">
<img itemprop="thumbnail" src="/image/interference_sutva.jpg" />
</div>
<a href="/image/interference_sutva.jpg" itemprop="contentUrl"></a>
<figcaption><h4>Example of treatment comparison where the experimental units belong to different households, but where that structure is ignored.</h4>
</figcaption>
</figure>
</div>
<h3 id="examples-of-interference-scenarios">Examples of interference scenarios</h3>
<p>Notice that in this particular example, it seems reasonable to assume that a
drug pill (treatment) affects only the blood pressure (outcome) of the
particular unit who takes that pill. In more general terms, an individual
unit’s treatment assignment affects only its own outcome: we say that there
is no interference between units. In contrast, if a unit’s treatment assignment
may affect another unit’s outcome, we say that there is interference between
units.</p>
<p>The best way to get familiar with the idea is to consider a few examples:</p>
<p><strong>Example 1: SDP example (Rogers and Feller 2016)</strong></p>
<blockquote>
<p>An important challenge that school districts across the US face is a high
student-absence rate: 10% of students in K-12 schools are chronically absent,
meaning that they miss more than 18 days of school in the year. In a recent
paper, Rogers and Feller studied the effectiveness of a mail-based intervention,
with the following design: they randomly selected households with children
attending K-12 in the School District of Philadelphia. Within these selected
households (let’s call them “treated” households), they selected at random one
of the children attending K-12 (if more than one) and sent information to the
parents about that child’s current absence rate. While the intervention was
ostensibly aimed at reducing the absence rate of the child targeted by the
intervention, Basse and Feller (2018) showed that the intervention also affected the
siblings of targeted children.</p>
</blockquote>
<p><strong>Example 2: Get out the vote experiment (Bond et al. 2012)</strong></p>
<blockquote>
<p>In a study designed to assess the effectiveness of a get-out-the-vote intervention
on Facebook, Bond and colleagues sent messages to 61 million Facebook users encouraging
them to go out and vote. They found that the intervention not only increased the
probability that the treated individuals (i.e. those who received the intervention)
would vote, but also increased the probability that their close friends would
vote!</p>
</blockquote>
<p><strong>Example 3: 401k investment example (Duflo and Saez 2003)</strong></p>
<blockquote>
<p>Duflo and Saez designed an experiment to study the determinants of an employee’s
choice to sign up for a tax-deferred retirement account at a large university. As
part of this study, they randomly selected a number of departments, and within
those departments, they randomly selected a number of employees to whom they sent
an encouragement to participate in a “benefits information fair” (the encouragement
included some money). They note that the intervention “multiplied by more than five
the attendance rate of these treated individuals (relative to controls), and
tripled that of untreated individuals within departments where some individuals
were treated.”</p>
</blockquote>
<h2 id="what-can-go-wrong-if-i-ignore-interference">What can go wrong if I ignore interference?</h2>
<h3 id="comparing-apples-and-oranges">Comparing apples and oranges</h3>
<p>The strategy illustrated in Figure 1 works well if there is no interference
between units, but things can go very wrong if units interfere with each other.
To understand what the problem is, consider the household example from Rogers
and Feller 2016, described above. Although the treatment status is binary (a
unit is treated or control), there are, in fact, three effective treatment
statuses: a unit can be treated (pure treated), it can be untreated but have
a sibling treated (exposed), or it can be untreated and have no sibling treated
(pure control).</p>
<p>Notice, in particular, how untreated units may either be pure control or exposed.
The comparison of Figure 1 between treated and untreated units now looks as follows:</p>
<p><img src="/image/interference_partial.jpg" alt="Partial interference"></p>
<p>The “control group” now contains a mix of pure control units, and exposed units:
the treatment has spilled over to the control group!</p>
<h3 id="so-what">So what?</h3>
<p>Well, suppose that the effect of receiving the treatment directly is almost the
same as the effect of receiving spillovers from treatment (this is not so unlikely: Basse
and Feller (2018) found spillovers roughly half as large as the direct effect).
Then the average response in the “control group” is inflated by the presence of
spillover units: this may lead to severe bias (in this case, you would
underestimate the effect of your intervention).</p>
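<p>A small simulation makes the contamination argument tangible. The sketch below assumes two children per household, a household-level coin flip, and hypothetical effect sizes (a direct effect of 1.0 and a spillover of 0.5); none of these numbers come from the studies cited above.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n_households = 20000                      # two children per household (assumption)
direct, spillover = 1.0, 0.5              # hypothetical effect sizes

hh_treated = rng.random(n_households) < 0.5
# Child 0 receives the treatment in treated households; child 1 is the sibling.
treated = np.column_stack([hh_treated, np.zeros(n_households, dtype=bool)])
exposed = np.column_stack([np.zeros(n_households, dtype=bool), hh_treated])
pure_control = ~(treated | exposed)

y = rng.normal(size=(n_households, 2)) + direct * treated + spillover * exposed

# Naive contrast: treated vs. everyone untreated (pure control and exposed mixed).
naive = y[treated].mean() - y[~treated].mean()
# Clean contrast: treated vs. pure control only.
clean = y[treated].mean() - y[pure_control].mean()

print(f"naive estimate: {naive:.2f}, treated vs. pure control: {clean:.2f}")
```

<p>Under these assumptions about one third of the untreated children are exposed, so the naive contrast is pulled towards zero by roughly a third of the spillover.</p>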
<p>How much bias? It depends on what you are willing to assume about the interference
structure, but if you believe that the treatment assignment of any unit may affect
any other unit’s response arbitrarily, then the bias may be arbitrarily bad.</p>
<h3 id="what-is-the-direction-of-the-bias">What is the direction of the bias?</h3>
<p>In the examples we’ve presented so far, it seems—at least intuitively—that the
bias is gently pulling us towards 0. So is this always true? Does interference
just slightly dilute the effect? The simple answer is no. There are many other
examples where interference can arbitrarily change the sign and magnitude of an
effect. This is particularly prominent when experimental units are competing for
a limited resource; for example, in marketplaces.</p>
<p>Consider a ridesharing firm that has developed a new pricing algorithm. Their
hope is that this innovation will both reduce wait time for passengers and
increase revenue for drivers (and in turn the firm) by introducing a temporary
price increase in areas where demand exceeds supply. Now, let’s imagine running
an experiment in Boston, where for every driver we flip an unbiased coin: if it lands
heads the driver is assigned to the algorithm; if it lands tails they stick with
the current approach. Focusing on revenue, suppose the new algorithm has a negative
effect, that is, if everyone were assigned to the new version (or treatment) then
the company would generate less revenue as fewer riders would be willing to pay
the higher prices. But in the experiment, the drivers in the treatment group are
sent to the busier areas and so they pick up more riders at a higher price. The
results say that the treatment is generating a lot more revenue than the control.
But there are two things happening. The first is that there is a redistribution
where the revenue is shifting from being evenly spread amongst all the drivers to
being more concentrated in the treatment group. The second is that the overall
revenue is actually going down! When all the drivers start to receive treatment
this redistribution will no longer occur and so the perceived positive effect will
disappear.</p>
<p><img src="/image/Interference_effect_impact.jpg" alt="The effect of interference"></p>
<p>When the experimental units are competing for a limited resource, most
experiments perform a redistribution between treatment and control, making it seem
like there is a positive or negative effect when really it’s just 0.</p>
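<p>The redistribution story can be reproduced with a deliberately simple toy market. Everything in the sketch below is a hypothetical assumption: total revenue falls linearly as more drivers use the new algorithm, and a treated driver captures twice the revenue of a control driver.</p>

```python
def per_driver_revenue(share_treated, n_drivers=1000, priority=2.0):
    """Toy market: return (treated, control) revenue per driver.

    Hypothetical assumptions: total revenue falls linearly from 100_000 to
    90_000 as the treated share goes from 0 to 1, and a treated driver
    captures `priority` times the revenue of a control driver.
    """
    total = 100_000.0 - 10_000.0 * share_treated
    n_treated = share_treated * n_drivers
    n_control = n_drivers - n_treated
    unit = total / (priority * n_treated + n_control)   # revenue of one control driver
    return priority * unit, unit

# What a 50/50 driver-level experiment would report.
treated_rev, control_rev = per_driver_revenue(0.5)
experiment_estimate = treated_rev - control_rev

# What actually happens when everyone is switched to the new algorithm.
all_treated, _ = per_driver_revenue(1.0)
_, all_control = per_driver_revenue(0.0)
true_effect = all_treated - all_control

print(f"experiment estimate: {experiment_estimate:+.1f} per driver")
print(f"effect of a full launch: {true_effect:+.1f} per driver")
```

<p>In this toy market the experiment reports a gain of about 63 per driver, while a full launch loses 10 per driver: the sign flips because the experiment mostly measures revenue moving from control drivers to treated drivers.</p>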
<h2 id="what-can-i-do-about-it">What can I do about it?</h2>
<h3 id="design">Design</h3>
<p>Many authors have proposed different experimental designs to either alleviate
the interference or to directly measure it. At the heart of most of these
designs is a simple idea of creating groups such that most of the interference
is contained within the group and there is little (or usually none) across the
groups. Once these groups have been defined, we can vary treatment across the
groups as well as within the groups. Typically the cross-group variation allows
us to measure the average treatment effect, whereas the within-group variation
allows for the estimation of the spillovers.</p>
<p>An alternative approach that has also gained a lot of popularity in technology
companies is to use switchback (or time series) experiments. This class of experiments
treats the city as a single unit and alternates the treatment over time, thereby
translating the problem of interference across drivers into a problem of
interference over time. This approach will be the object of another post.</p>
<h3 id="analysis">Analysis</h3>
<p>Analyzing a well-designed study that is subject to interference is—although trickier
than a study with no interference—relatively straightforward. The main idea is to
continue respecting the isolation generated in the design phase and only focus on
well-specified contrasts, as illustrated below. One thing to keep in mind is that
we still need to normalize each observation by the appropriate treatment exposure probabilities.
Without proper normalization, even isolated contrasts can lead to biased results.</p>
<p><img src="/image/interference_partial_est.jpg" alt="Estimation under partial interference"></p>
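<p>The role of the exposure probabilities can be illustrated with a short simulation. The sketch below assumes a two-stage design in which households are selected with probability <code>q</code> and one child per selected household is targeted uniformly at random; it compares an unweighted contrast with a Hajek-style (normalized inverse probability weighted) contrast. The design, the outcome model, and the helper <code>hajek_mean</code> are illustrative assumptions rather than the analysis of any paper cited here.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n_households, q = 30000, 0.5                       # q: household selection probability
direct, spillover = 1.0, 0.5                       # hypothetical effect sizes

sizes = rng.integers(1, 4, size=n_households)      # 1 to 3 children per household
hh_treated = rng.random(n_households) < q
chosen = rng.integers(0, sizes)                    # index of the targeted child

hh = np.repeat(np.arange(n_households), sizes)     # household id, one row per child
first_row = np.cumsum(sizes) - sizes               # row of each household's first child
position = np.arange(hh.size) - first_row[hh]      # child index within the household
size = sizes[hh]

treated = hh_treated[hh] & (position == chosen[hh])
exposed = hh_treated[hh] & ~treated
control = ~hh_treated[hh]

# Baseline outcome depends on household size, which also drives exposure probabilities.
y = size + rng.normal(size=hh.size) + direct * treated + spillover * exposed

p_treated = q / size                               # P(child is the targeted one)
p_control = np.full(hh.size, 1.0 - q)              # P(household not selected)

def hajek_mean(y, prob, mask):
    """Normalized inverse-probability-weighted mean over the units in `mask`."""
    w = mask / prob
    return np.sum(w * y) / np.sum(w)

unweighted = y[treated].mean() - y[control].mean()
weighted = hajek_mean(y, p_treated, treated) - hajek_mean(y, p_control, control)

print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}, target: {direct:.2f}")
```

<p>Children in small households are more likely to be the targeted child, so the unweighted treated group over-represents them; weighting each observation by the inverse of its exposure probability corrects for that imbalance.</p>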
<h2 id="mind-the-gap-from-intuition-to-formalization">Mind the gap: from intuition to formalization</h2>
<p>In trying to provide helpful intuition, we have brushed a number of important
details under the proverbial rug. We conclude this note by pointing out the
holes and over-simplifications in our treatment of the subject and provide
some starting points for a fuller understanding of the subject.</p>
<ul>
<li><strong>Assumption on the interference structure.</strong> In Figure 1, we noted that the
siblings of treated units received “some spillover,” but how do we know that
these spillovers don’t also affect units in untreated households? We don’t:
this is an assumption we have implicitly made on the interference mechanism.
Such assumptions are generally formulated via the concept of
<em>exposure mapping</em>.</li>
<li><strong>Defining the effects of interest.</strong> Throughout this note, we have thrown
around the words direct effect and spillover effect without defining them
formally. While it is easy to get an “intuitive” understanding of these,
this intuition can be misleading in more complex situations. Rigorously
defining these effects requires quite a bit of formalism.</li>
<li><strong>Spillover or spillovers?</strong> An important idea that gets lost in our
oversimplification is the fact that there is, in general, no single
unambiguous definition of “spillover effect”. Instead, there are many
possible “spillover effects”, and one must be clear about which one they are
interested in.</li>
<li><strong>Analysis.</strong> A popular approach is to rely on inverse probability weighted
estimators (e.g. Horvitz-Thompson and Hajek). There is an ongoing research
effort to provide a satisfactory asymptotic theory for a large class of designs
and interference structures.</li>
</ul>
<p>For an accessible introduction to the main formalism, we suggest starting out with
Basse and Feller 2018, and then Aronow and Samii 2017: the former focuses on a special
case of the latter, which might make it more accessible as a starting point. For an
example of marketplace interference, see Basse et al. 2016. For recent advances in the
asymptotic theory, see Leung 2020 and Chin 2018. For different designs, see Ugander
et al. 2013, Rogers and Feller 2016, Baird et al. 2018, and Saint-Jacques et al. 2019.</p>
<h2 id="references">References</h2>
<p>Aronow, P. M., & Samii, C. (2017). Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4), 1912-1947.</p>
<p>Baird, S., Bohren, J. A., McIntosh, C., & Özler, B. (2018). Optimal design of experiments in the presence of interference. Review of Economics and Statistics, 100(5), 844-860.</p>
<p>Basse, G., & Feller, A. (2018). Analyzing two-stage experiments in the presence of interference. Journal of the American Statistical Association, 113(521), 41-55.</p>
<p>Basse, G. W., Soufiani, H. A., & Lambert, D. (2016, May). Randomization and the pernicious effects of limited budgets on auction experiments. In Artificial Intelligence and Statistics (pp. 1412-1420).</p>
<p>Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D., Marlow, C., Settle, J. E., & Fowler, J. H. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415), 295-298.</p>
<p>Chin, A. (2018). Central limit theorems via Stein’s method for randomized experiments under interference. arXiv preprint arXiv:1804.03105.</p>
<p>Duflo, E., & Saez, E. (2003). The role of information and social interactions in retirement plan decisions: Evidence from a randomized experiment. The Quarterly journal of economics, 118(3), 815-842.</p>
<p>Leung, M. P. (2020). Treatment and spillover effects under network interference. Review of Economics and Statistics, 102(2), 368-380.</p>
<p>Saint-Jacques, G., Varshney, M., Simpson, J., & Xu, Y. (2019). Using Ego-Clusters to Measure Network Effects at LinkedIn. arXiv preprint arXiv:1903.08755.</p>
<p>Rogers, T., & Feller, A. (2016). Discouraged by peer excellence: Exposure to exemplary peer performance causes quitting. Psychological science, 27(3), 365-374.</p>
<p>Ugander, J., Karrer, B., Backstrom, L., & Kleinberg, J. (2013). Graph cluster randomization: Network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 329-337).</p>Introduction to the Potential Outcomes Framework/post/po-introduction/Mon, 18 Jan 2021 00:00:00 +0000/post/po-introduction/<p>The Potential Outcomes Framework (aka the Neyman-Rubin Causal Model) is arguably the most widely used framework for causal inference in the social sciences. This post gives an accessible introduction to the framework’s key elements — interventions, potential outcomes, estimands, assignment mechanisms, and estimators.</p>
<h2 id="an-informal-first-look-a-game-and-a-story">An informal first look: a game and a story</h2>
<h3 id="a-game-with-boxes">A game with boxes</h3>
<p>Suppose we play the following game. I put two boxes in front of you, one
labelled \(0\) and the other labelled \(1\). Each box contains a slip
of paper on which I’ve written some number. Denote by \(y(0)\) the number
in the box labelled \(0\) and \(y(1)\) the number in the box labelled
\(1\), both of which are unknown to you. Your goal is to guess the
difference between the number in box \(1\) and the number in box \(0\),
denoted:</p>
<p>\[
\tau = y(1) - y(0).
\]</p>
<p>The game is played as follows: you must choose a box and open it to read the
number it contains. The trick, however, is that the boxes are rigged so that
the moment you open one of the boxes, the other box self-destructs so you
can never know the number it contained.</p>
<p>In general, a variant of the game is played in which \(N\) pairs of boxes
are arranged in \(N\) rows indexed \(i = 1, \ldots, N \). Each box is
sealed and contains a slip of paper with a number. Denote by \(y_i(0)\)
the number in the box labelled \(0\) in the \(i^{th}\) row, and
\(y_i(1)\) the number in the box labelled \(1\) in the same row. This
time, the goal is to guess the average difference:</p>
<p>\[
\tau = \frac{1}{N} \sum_{i=1}^N \{y_i(1) - y_i(0)\}.
\]</p>
<p>You get to open a single box in each row — the rule being, as above,
that the moment you open one of the boxes in row \(i\) the other box
in the same row self-destructs. The name of this game, as you’ve probably guessed, is causal inference. Now let’s look at another way of motivating this framework.</p>
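<p>For readers who like to see the game played, here is a minimal simulation. It assumes the player opens box \(1\) in a randomly chosen half of the rows and guesses \(\tau\) with the difference between the two revealed averages; that strategy, and the numbers written in the boxes, are assumptions made for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n_rows = 10000

# Numbers written inside the boxes (unknown to the player).
y0 = rng.normal(loc=5.0, scale=1.0, size=n_rows)
y1 = y0 + rng.normal(loc=2.0, scale=0.5, size=n_rows)
tau = np.mean(y1 - y0)                    # the target of the game

# The player opens box 1 in a random half of the rows and box 0 in the others.
open_box_1 = np.zeros(n_rows, dtype=bool)
open_box_1[rng.choice(n_rows, size=n_rows // 2, replace=False)] = True
revealed = np.where(open_box_1, y1, y0)   # the unopened box self-destructs

# Guess: average of revealed box-1 numbers minus average of revealed box-0 numbers.
guess = revealed[open_box_1].mean() - revealed[~open_box_1].mean()

print(f"true average difference: {tau:.2f}, guess: {guess:.2f}")
```

<p>The player never sees both numbers in any row, yet with many rows and a random choice of boxes the guess lands close to the true average difference.</p>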
<h3 id="a-story-with-a-genie">A story with a genie</h3>
<p>Guillaume is in a bit of a bind at the moment: he must give a talk in one hour but currently has a throbbing headache. He contemplates taking an aspirin pill, but he is unsure of the effect that it would have on his headache. Fortunately, he met a genie a long time ago who promised to grant him a wish, and this seems like a good time to use it. He conjures the genie and asks what the effect of taking the pill would be on his headache. After pointing out that Guillaume could simply have asked for his headache to be cured, the genie tells him that although he is all-powerful, he cannot answer the question because he doesn’t know what “the effect of the pill” even means! Guillaume scratches his head and starts to babble stuff about causal effects, but the genie interrupts him:</p>
<ul>
<li>
<p>Genie: Look, here’s what we’re going to do. If you were all-powerful like me,
describe precisely the quantity that you would like to know.</p>
</li>
<li>
<p>Guillaume: Ok, if I were all powerful, here is what I would do. First I would take the
aspirin, wait an hour, and then assess the state of my headache.</p>
</li>
<li>
<p>Genie: Wait, how would you “assess” the state of your headache?</p>
</li>
<li>
<p>Guillaume: I have a special scale for rating headaches, from 1 (feeling great) to
10 (feeling awful). You’re all-powerful, so you should know my scale.</p>
</li>
<li>
<p>Genie: Ah, right. Ok, sure, so you’d get a number between 1 and 10. Then what
would you do?</p>
</li>
<li>
<p>Guillaume: Well, after that I would go back in time to the exact moment I took the
aspirin pill, but this time I would not take it. I would then wait an hour
and rate the state of my headache: this would give me another number. What
I call a causal effect is the difference between these two numbers.</p>
</li>
<li>
<p>Genie: I see. Wait a sec. Done. The answer is -4, which means the aspirin would
help…</p>
</li>
<li>
<p>Guillaume: Got it, thank you for your help!</p>
</li>
</ul>
<h3 id="potential-outcomes-in-a-nutshell">Potential outcomes in a nutshell</h3>
<p>Hopefully, the connection between the first version of the game and the
story is clear. The number written in the box labelled \(1\) corresponds
to the strength of the headache should Guillaume take the pill, and the
number written in the box labelled \(0\) corresponds to the strength
of the headache should he not take the pill. So
we can write \(y(1)\) for the strength of Guillaume’s headache (an hour from
now) should he take the pill, and \(y(0)\) for the strength of his headache
should he not take the pill. The causal effect of the pill is then defined
as \(\tau = y(1) - y(0)\).</p>
<p>The box metaphor crystallizes the key intuition behind the potential
outcomes framework. Before he decides which action to take (pill or no pill),
\(y(1)\) and \(y(0)\) exist as potentialities: they are the potential
state of the headache (one hour from now) if Guillaume takes the pill (resp.
doesn’t take the pill). Like the numbers in the boxes, they are well-defined,
and exist prior to Guillaume’s action: they are the “potential outcomes” of
Guillaume’s action. Like the numbers in the boxes, they are both unknown prior
to Guillaume’s action. Like the numbers in the boxes, one of the two will be
revealed depending on which action is taken, while the other will never be
knowable.</p>
<p>The quantities \(y(1)\) and \(y(0)\) are known as the <strong>potential outcomes</strong>.
If Guillaume takes the pill, then he observes \(y(1)\), and \(y(0)\) becomes
counterfactual; if he does not take the pill, he observes \(y(0)\) and
\(y(1)\) becomes counterfactual. The fundamental problem of causal inference
is that in either case, the counterfactual outcome forever remains unknown.</p>
<p>This, in a nutshell, summarizes the intuition behind potential outcomes. The
next section formalizes this intuition, and introduces the remaining components
of the framework.</p>
<h2 id="the-rubin-causal-model">The Rubin Causal Model</h2>
<h3 id="assignments-and-potential-outcomes">Assignments and potential outcomes</h3>
<p>Consider a population of \(n\) units indexed \(i = 1, \ldots, n\). Suppose
that each unit may receive one of two interventions — for concreteness, let’s
call one of them <em>treatment</em> and the other <em>control</em>. For each unit \(i\)
denote by \(W_i\) the treatment indicator for unit \(i\), where \(W_i = 1\) if
unit \(i\) is assigned to treatment and \(W_i = 0\) if unit \(i\) is
assigned to control. The vector of treatment indicators is denoted
\(\vec{W} = (W_1, \ldots, W_n)\). Since the treatment will be assigned at
random, \(\vec{W}\) will be a random vector. We will denote by an uppercase
\(\vec{W}\) the random variable, and by lower case
\(\vec{w} = (w_1, \ldots, w_n)\) a specific realization of the assignment. We then use the notation \(Pr(\vec{W} = \vec{w})\) to denote the probability that the
random assignment vector \(\vec{W}\) will take the value \(\vec{w}\). The
distribution of \(\vec{W}\), called the <em>assignment mechanism</em>, will be discussed
in more detail below.</p>
<p><em>Note: Throughout, we will stick to the convention that uppercase letters denote
random variables, while lowercase letters denote constants or specific realizations
of random variables.</em></p>
<p>In the examples above, we introduced the potential outcomes intuitively, as
describing the response of a unit if it were assigned a specific treatment.
More formally, for each unit \(i\) consider a function
\(y_i: \{0,1\}^n \rightarrow \mathbb{R}\), so that for any assignment
\(\vec{w} \in \{0,1\}^n\), \(y_i(\vec{w})\) is the potential outcome of
unit \(i\) if unit 1 were assigned to treatment \(w_1\), unit 2 to \(w_2\), and so on.</p>
<p>In our aspirin example, there is a single unit \((n=1)\), therefore
the assignment vector \(\vec{w}\) reduces to a scalar
\(w_1 \in \{0,1\}\), and the potential outcomes for the single unit
\(i=1\) are \(y_1(w_1)\) for \(w_1 \in \{0,1\}\); in this case, the
outcomes of unit \(i=1\) depend only on the unit’s own treatment assignment \(w_1\),
so there are only two potential outcomes: \(y_1(1)\) and \(y_1(0)\).
When there is more than one unit, the outcome of any unit may, a priori,
depend on the treatment assigned to any other unit: this is captured by the
notation \(y_i(\vec{w})\). In particular, each unit has not two but
\(|\{0,1\}^n| = 2^n\) potential outcomes.</p>
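<p>To make this count concrete, here is a minimal Python sketch that enumerates every assignment vector in \(\{0,1\}^n\) for a small population. The helper name <code>all_assignments</code> and the choice \(n = 3\) are illustrative assumptions, not part of the framework.</p>

```python
from itertools import product

def all_assignments(n):
    """Return every assignment vector in {0,1}^n as a tuple."""
    return list(product((0, 1), repeat=n))

assignments = all_assignments(3)
print(len(assignments))  # 8, i.e. 2**3
print(assignments[:3])   # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```

<p>Without further assumptions, each unit would need one potential outcome per entry of this list, which is why the number of potential outcomes grows so quickly with \(n\).</p>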
<h3 id="the-no-interference-assumption">The no-interference assumption</h3>
<p>Let’s go back, once again, to our aspirin example, but suppose that we
have not just a single individual, but \(n\) units participating in
the drug trial. Following the framework so far, we can write the potential
outcomes for unit \(i\) as \(y_i(\vec{w})\) for \({\vec{w} \in \{0,1\}^n}\).
As stated above, this allows for unit \(i\)’s outcome to depend on another
unit, say unit \(j\)’s treatment assignment. In the context of the drug
trial though, this seems a bit overkill. Barring very special circumstances
(e.g. the units enrolled in the trial know each other), it seems reasonable
to assume that the outcome of the trial for unit \(i\) depends only
on the treatment assigned to unit \(i\) itself — that is, it should
depend on \(\vec{w}\) only through \(w_i\). This assumption is
known as the no-interference assumption, and can be formally stated as
follows:</p>
<p><strong>Assumption (No Interference):</strong> For all \(i=1, \ldots, n\), it holds that:
\[
\forall \vec{w}, \vec{w}' \in \{0,1\}^n, \quad
w_i = w_i' \quad \Rightarrow \quad y_i(\vec{w}) = y_i(\vec{w}')
\]
With a slight abuse of notation, we can write \(y_i(\vec{w}) = y_i(w_i)\).</p>
<p>In words, the assumption says that if you are a trial participant,
the status of your headache depends only on whether you took the aspirin —
not on whether the other trial participants took the aspirin. This assumption is
reasonable in a broad range of settings, and simplifies matters considerably.
Indeed, under the no-interference assumption, each unit \(i\) has only two
potential outcomes: its outcome under treatment, \(y_i(1)\), and its outcome under
control, \(y_i(0)\).</p>
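<p>The assumption can also be read as a property that one could check by brute force if the full outcome function \(y_i(\vec{w})\) were known. The sketch below does this for two hypothetical outcome functions; the names <code>satisfies_no_interference</code>, <code>own_only</code> and <code>with_spillover</code> are made up for illustration, and units are indexed from 0 in the code.</p>

```python
from itertools import product

def satisfies_no_interference(y_i, i, n):
    """Check that y_i(w) depends on w only through w[i], over all of {0,1}^n."""
    seen = {}  # maps the unit's own treatment w[i] to the first outcome seen with it
    for w in product((0, 1), repeat=n):
        outcome = y_i(w)
        if seen.setdefault(w[i], outcome) != outcome:
            return False
    return True

# Hypothetical outcome functions for unit 0 in a population of n = 3 units.
def own_only(w):
    return 5 - 2 * w[0]          # depends only on the unit's own treatment

def with_spillover(w):
    return 5 - 2 * w[0] - w[1]   # also depends on the treatment of unit 1

print(satisfies_no_interference(own_only, i=0, n=3))        # True
print(satisfies_no_interference(with_spillover, i=0, n=3))  # False
```

<p>In practice the outcome function is never fully known, so no-interference remains an assumption justified by the context of the experiment rather than something verified from data in this way.</p>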
<h3 id="the-science">The Science</h3>
<p>If there is no interference, then all the information of a causal problem is
contained in the treatment and control potential outcomes vectors. Taken together,
they form what is sometimes called <strong>the Science Table</strong> or just <strong>the Science</strong>,
denoted \(\underline{y} = (\vec{y}(1), \vec{y}(0))\). An example of the Science for
6 units is given below.</p>
<table>
<thead>
<tr>
<th>i</th>
<th>\(y_i(0)\)</th>
<th>\(y_i(1)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>This table gives, for each unit \(i\), its potential outcome under treatment,
\(y_i(1)\), and its potential outcome under control, \(y_i(0)\). For instance,
for unit \(i = 4\), we read \(y_4(0) = 0\) and \(y_4(1)=1\).</p>
<p>An important fact about the Science is that we can never observe it fully. Indeed,
if a unit \(i\) is assigned to treatment (\(w_i=1\)), then we only observe
its potential outcome \(y_i(1)\). If, however, it is assigned to control
(\(w_i=0\)), then we only observe its control potential outcome, \(y_i(0)\).</p>
<p>So, suppose for instance that \(\vec{W} = (1, 1, 0, 1, 0, 0)\), then we would only
observe a partial version of the table, with missing elements:</p>
<table>
<thead>
<tr>
<th>i</th>
<th>\(W_i\)</th>
<th>\(y_i(0)\)</th>
<th>\(y_i(1)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>?</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>?</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>2</td>
<td>?</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>?</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>0</td>
<td>3</td>
<td>?</td>
</tr>
<tr>
<td>6</td>
<td>0</td>
<td>0</td>
<td>?</td>
</tr>
</tbody>
</table>
<p>This is what makes causal inference a challenging problem — we will get back to this point below. Still, it is useful to consider the full Science table conceptually because it allows us to rigorously define causal effects.</p>
<h3 id="causal-estimands">Causal Estimands</h3>
<p>An estimand can be defined as a “quantity we would compute if we were omniscient.”
In the current context, being omniscient would mean knowing the entire Science Table
\(\underline{y}\), so we can think of an estimand as a function of the Science, say
\(\tau(\underline{y})\). The simplest quantity that satisfies this definition is the
so-called individual treatment effect (ITE) for unit \(i\), defined as:</p>
<p>\[
\tau_i = y_i(1) - y_i(0), \quad i = 1, \ldots, n.
\]</p>
<p>As its name indicates, the ITE \(\tau_i\) is the causal effect
of the treatment on unit \(i\): it compares the response of unit \(i\) if it
were assigned to receive the treatment to the response of unit \(i\) if it
were assigned to receive the control. Since it depends on both the treatment <strong>and</strong>
the control potential outcome of the same unit \(i\), the ITE can never be computed
in practice: this is what makes it an estimand, a quantity that we could only compute
if we were omniscient.</p>
<p>If we have more than a single unit, we are generally interested in some sort of
average effect of the treatment. The Average Treatment Effect (ATE) is defined
as:</p>
<p>\[
\tau^{ATE} = \frac{1}{n} \sum_{i=1}^n \{ y_i(1) - y_i(0)\} = \frac{1}{n} \sum_{i=1}^n \tau_i
\]</p>
<p>What makes these estimands causal is that they are based on contrasts between
potential outcomes.</p>
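<p>As a worked example, the sketch below computes the individual treatment effects and the ATE for the six-unit Science table shown earlier. This is only possible because the full table is given here for illustration; the variable names are our own.</p>

```python
from fractions import Fraction

# Science table from the example above: unit -> (y_i(0), y_i(1)).
science = {1: (2, 3), 2: (1, 1), 3: (2, 4), 4: (0, 1), 5: (3, 3), 6: (0, 2)}

# Individual treatment effects: tau_i = y_i(1) - y_i(0).
ite = {i: y1 - y0 for i, (y0, y1) in science.items()}

# Average treatment effect: the mean of the individual effects.
ate = Fraction(sum(ite.values()), len(ite))

print(ite)  # {1: 1, 2: 0, 3: 2, 4: 1, 5: 0, 6: 2}
print(ate)  # 1
```

<p>For this table, the ATE is exactly 1, even though the individual effects range from 0 to 2.</p>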
<h2 id="estimating-causal-effects">Estimating causal effects</h2>
<p>In the previous section, we have introduced the key pieces of the Rubin Causal Model:</p>
<ol>
<li>an experimental population,</li>
<li>an intervention,</li>
<li>the potential outcomes, and</li>
<li>the causal estimands — that is, the quantity that we would like to
learn.</li>
</ol>
<p>So far, we have remained in the realm of the <em>potential</em> — all the quantities introduced
exist prior to the experiment being actually conducted. We have defined our
objective — our causal estimand — but have said nothing about how one would actually
go about estimating that quantity. To do so, we need to move to the realm of the
<em>observed</em>.</p>
<h3 id="observed-outcomes">Observed Outcomes</h3>
<p>An important fact about the Science, which we have alluded to repeatedly, is that
we can never observe it fully. Indeed, for each unit \(i\), we only observe the
unique potential outcome associated with the treatment to which the unit is assigned.
The observed outcome, denoted \(Y_i\), can therefore be written:</p>
<p>\[
Y_i = y_i(W_i) = W_i \,y_i(1) + (1-W_i) \,y_i(0)
\]</p>
<p>Since \(W_i\) is a random variable, the observed outcome \(Y_i\) will also be
a random quantity — hence, we write it in uppercase. We will denote by
\(\vec{Y} = y(\vec{W}) = (Y_1, \ldots, Y_n)\) the vector of observed outcomes.
Once the experiment has been run, the analyst observes only two quantities: the
assignment \(\vec{W}\) and the observed outcomes \(\vec{Y}\). For the Science table
displayed in the previous section and the assignment vector \(\vec{W} = (1, 1, 0, 1, 0, 0)\), the observed
outcome vector is \(\vec{Y} = (3, 1, 2, 1, 3, 0)\).</p>
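<p>The formula \(Y_i = W_i \, y_i(1) + (1-W_i) \, y_i(0)\) translates directly into code. Here is a small sketch that reproduces the observed outcome vector for the example table; the function name <code>observed_outcomes</code> is our own choice.</p>

```python
# Potential outcomes from the example Science table, in unit order 1..6.
y0 = [2, 1, 2, 0, 3, 0]
y1 = [3, 1, 4, 1, 3, 2]

# The assignment vector used in the example.
w = [1, 1, 0, 1, 0, 0]

def observed_outcomes(w, y0, y1):
    """Apply Y_i = W_i * y_i(1) + (1 - W_i) * y_i(0) to every unit."""
    return [wi * a1 + (1 - wi) * a0 for wi, a0, a1 in zip(w, y0, y1)]

print(observed_outcomes(w, y0, y1))  # [3, 1, 2, 1, 3, 0]
```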
<h3 id="assignment-mechanism">Assignment Mechanism</h3>
<p>When we introduced the assignment vector \(\vec{W}\), we said that it was a
random quantity, but we didn’t say much about its distribution \(Pr(\vec{W})\)
other than it was called the <em>assignment mechanism</em> (or the <em>design</em>).</p>
<p>There are, of course, many possible assignment mechanisms for an experiment on
a population of \(n\) individuals — as many, in fact, as there are distributions
with support \(\{0,1\}^n\). To keep things simple, we introduce just two of the
simplest and most popular such distributions.</p>
<p><strong>Definition (Bernoulli Design):</strong> We say that \(\vec{W}\) is
assigned according to a Bernoulli design (or Bernoulli assignment mechanism) with
parameter \(\pi\) if each unit \(i\) is assigned to treatment independently
with probability \(\pi\). That is,</p>
<p>\[
W_i \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\pi).
\]</p>
<p><strong>Definition (Completely Randomized Design):</strong> We say that \(\vec{W}\) is
assigned according to a Completely Randomized Design with parameter \(n_1\)
if all assignments with exactly \(n_1\) treated units are equally likely.</p>
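<p>Both designs are easy to simulate. The following is a minimal sketch using Python's standard <code>random</code> module; the function names, the seed, and the parameter values (\(n = 6\), \(\pi = 0.5\), \(n_1 = 3\)) are illustrative assumptions.</p>

```python
import random

def bernoulli_design(n, pi, rng):
    """Assign each unit to treatment independently with probability pi."""
    return [1 if rng.random() < pi else 0 for _ in range(n)]

def completely_randomized_design(n, n1, rng):
    """Pick exactly n1 treated units uniformly at random among the n units."""
    treated = set(rng.sample(range(n), n1))
    return [1 if i in treated else 0 for i in range(n)]

rng = random.Random(0)  # arbitrary seed, used only for reproducibility
w_bern = bernoulli_design(6, 0.5, rng)
w_crd = completely_randomized_design(6, 3, rng)

print(w_bern, sum(w_bern))  # the number of treated units varies from draw to draw
print(w_crd, sum(w_crd))    # the number of treated units is always 3
```

<p>The key practical difference is visible in the sums: under the Bernoulli design the number of treated units is itself random (and can even be 0 or \(n\)), whereas the completely randomized design fixes it at \(n_1\).</p>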
<h3 id="estimators">Estimators</h3>
<p>A causal estimand, such as the \(\tau^{ATE}\), depends on the entire science
table, meaning we can never compute it directly. In practice, we
estimate it using data that we observe: the observed assignment
\(\vec{W}\) and the observed outcomes \(\vec{Y}\).</p>
<p>An estimator is a function \(\hat{\tau}\) of the observed data, say
\(\hat{\tau}(\vec{W}, \vec{Y})\). It can be thought of as a “data-driven”
guess for the estimand of interest; indeed, since it depends only on observed data,
an estimator can always be computed. It is only a guess, though, because it uses
only the observed portion \(\vec{Y}\) of the science table \(\underline{y}\) to
estimate the estimand of interest. Specifically, since it depends on the random
assignment vector \(\vec{W}\), the estimator \(\hat{\tau}(\vec{W}, \vec{Y})\)
is itself a random variable.</p>
<p>Choosing an appropriate estimator of a given estimand is an interesting topic but is beyond this post’s scope. If the causal estimand of interest is
\(\tau^{ATE}(\underline{y})\) as defined above, then a natural estimator for it
is the difference in means:</p>
<p>\[
\hat{\tau}^{DiM}(\vec{W},\vec{Y}) = \frac{1}{n_1(\vec{W})} \sum_{i=1}^n W_i Y_i - \frac{1}{n_0(\vec{W})} \sum_{i=1}^n (1-W_i) Y_i
\]</p>
<p>where \(n_1(\vec{W}) = \sum_{i=1}^n W_i\) and \(n_0(\vec{W}) = n - n_1(\vec{W})\).
For the science table we displayed in the previous section and assignment
\(\vec{W} = (1, 1, 0, 1, 0, 0)\), we saw that the observed outcome vector
was \(\vec{Y} = (3, 1, 2, 1, 3, 0)\) and therefore the estimate is:</p>
<p>\[
\hat{\tau}^{DiM} = \frac{1}{3} (3 + 1 + 1) - \frac{1}{3}(2 + 3 + 0) = 0
\]</p>
<p>If the treatment was assigned according to a Completely Randomized Design as defined
above, then \(\hat{\tau}^{DiM}\) is, in the precise sense described below, a good
estimator.</p>
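<p>The difference-in-means computation above can be sketched in a few lines of Python. The function name <code>difference_in_means</code> is our own, and exact fractions are used only to avoid floating-point noise in this small example.</p>

```python
from fractions import Fraction

def difference_in_means(w, y):
    """Mean observed outcome among treated units minus that among control units."""
    n1 = sum(w)
    n0 = len(w) - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one treated and one control unit")
    treated_mean = Fraction(sum(wi * yi for wi, yi in zip(w, y)), n1)
    control_mean = Fraction(sum((1 - wi) * yi for wi, yi in zip(w, y)), n0)
    return treated_mean - control_mean

w = [1, 1, 0, 1, 0, 0]
y = [3, 1, 2, 1, 3, 0]
print(difference_in_means(w, y))  # 0
```

<p>Note that for the example Science table the true ATE is 1, while this particular assignment yields an estimate of 0: a single estimate can differ from the estimand, which is why the properties of the estimator over repeated randomizations matter.</p>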
<h2 id="randomization-based-inference-a-primer">Randomization-based inference: a primer</h2>
<p>So far, we haven’t talked about models — or about randomness, really. That’s
because the Rubin Causal Model is largely agnostic about these considerations.
The RCM helps us define clearly what is the causal quantity we are after, and
separate this from what we actually observe.</p>
<p>Once this has been established, we can take a number of paths to estimate causal effects and assess the uncertainty of those estimates. Here we will give a primer on one such approach that is particularly natural and helpful in randomized experiments. Specifically, suppose that the treatment \(\vec{W}\) is assigned according to a completely randomized design. As mentioned above, apart from \(Pr(\vec{W})\), which we have just specified, we have not assumed any model. In particular, we have not assumed that the potential outcomes follow any distribution, nor that they are independent draws from some distribution, nor even that they are random. At the same time, we have also not precluded the potential outcomes from having been randomly drawn from some distribution. That is the wonderful nature of the potential outcomes framework: it requires no assumptions on the outcomes!</p>
<p>Let’s see how far we can push this idea, and let’s consider \(\underline{y}\) to
be fixed (and, prior to the experiment, unknown). Notice that the observed outcomes
\(\vec{Y}\) are still random, since they depend on \(\vec{W}\) which is itself
random. We can then state the following result.</p>
<p><strong>Proposition:</strong> If \(\vec{W}\) is assigned according to a completely
randomized design, then the difference in means estimator
\(\hat{\tau}^{DiM}\) is unbiased for the average
treatment effect \(\tau^{ATE}\).</p>
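<p>For the small example used in this post, the proposition can be checked by brute force: with \(n = 6\) and \(n_1 = 3\), a completely randomized design puts equal probability on each of the \(\binom{6}{3} = 20\) assignments, so the expectation of \(\hat{\tau}^{DiM}\) is the plain average of the 20 possible estimates. The sketch below is a numerical check on this one table, not a proof, and its variable names are our own.</p>

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes from the example Science table, in unit order 1..6.
y0 = [2, 1, 2, 0, 3, 0]
y1 = [3, 1, 4, 1, 3, 2]
n, n1 = len(y0), 3

# The estimand: the average treatment effect computed from the full table.
ate = Fraction(sum(a1 - a0 for a0, a1 in zip(y0, y1)), n)

# Enumerate every assignment with exactly n1 treated units and compute the
# difference-in-means estimate that would be observed under each of them.
estimates = []
for treated in combinations(range(n), n1):
    t = set(treated)
    treated_mean = Fraction(sum(y1[i] for i in t), n1)
    control_mean = Fraction(sum(y0[i] for i in range(n) if i not in t), n - n1)
    estimates.append(treated_mean - control_mean)

# All assignments are equally likely, so the expectation is the simple average.
expected_dim = sum(estimates) / len(estimates)

print(len(estimates))     # 20
print(expected_dim, ate)  # 1 1
```

<p>Individual estimates vary across assignments (the assignment \((1, 1, 0, 1, 0, 0)\) gave 0 above), but their average over the design matches the ATE, which is what unbiasedness means here.</p>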
<p>Much more can be said about the properties of \(\hat{\tau}^{DiM}\) in completely
randomized experiments — future posts will explore these properties in greater
detail.</p>
<h2 id="references">References</h2>
<p>There are many fantastic books on the topic of causal inference, but we have found the following to be useful:</p>
<p>Imbens, G. W., & Rubin, D. B. (2015). <em>Causal inference in statistics, social, and biomedical sciences.</em> Cambridge University Press.</p>About the authors/about/Fri, 15 Jan 2021 00:00:00 +0000/about/<p>This (growing) collection of essays about causal inference and experimental
design is being written by Guillaume Basse and Iav Bojinov. It is our attempt at
explaining some ideas from our work — and more broadly, from our field — in
an accessible way.</p>
<h2 id="guillaume-basse">Guillaume Basse</h2>
<p>I am an Assistant Professor in the <a href="https://msande.stanford.edu/">MS&E</a> and <a href="https://statistics.stanford.edu/">Statistics departments at Stanford</a>.
My research focuses on Causal Inference and Design of Experiments in the presence
of interference. I got my PhD in Statistics at Harvard in 2018, under the supervision
of Edo Airoldi, then spent a year as a postdoctoral fellow in the Statistics
Department at UC Berkeley where I was advised by Peng Ding. Before coming to the US
I attended the Ecole Centrale Paris, where I studied Applied Mathematics and
Engineering. I have lived in France, Israel, the US and Senegal, where I was born.
My personal website is <a href="https://web.stanford.edu/~gbasse/">here</a>.</p>
<h2 id="iav-bojinov">Iav Bojinov</h2>
<p>I am an assistant professor of business administration in the <a href="https://www.hbs.edu/faculty/units/tom/Pages/default.aspx">Technology and Operations Management</a> unit at Harvard Business School and a faculty affiliate in the <a href="https://statistics.fas.harvard.edu/">Department of Statistics at Harvard University</a>. My research interest is at the interface of causal inference, experimental design, and large-scale computing with the overall goal of democratizing statistical methods in order to help firms innovate and grow. Currently, I am actively pursuing three related research areas: design and analysis of experiments in complex settings, demystifying the value and limitation of experimentation, and understanding the role of data science in the modern AI organization. My personal website is <a href="https://ibojinov.com/">here</a>.</p>