Causal Inference

For many people in strategy, causal inference is a bogeyman. I think this is partly because they associate it with what economists do, or with what reviewers use to criticize their papers.

The whole point of social science, and perhaps all of science except for classification tasks, is to make causal statements about the way the world works. Otherwise, we’re not doing science; we’re merely telling stories, which may or may not be true.

First, there is causality itself, which is a conceptual and theoretical problem: we make statements about the way the world works, and these statements often take the form of cause-and-effect relationships.

In parallel, we need to convince others that our arguments — causal arguments — should be believed. This is where causal inference comes in. Our data analysis should be convincing enough for readers to believe our causal claims about how the world works.

Most of our causal statements are in the form of X causing Y. Our regressions are also in this form — but what we can get from our regressions is correlation, not causation. The empirical task, then, is to convince people that your causal statement can indeed be believed.

I use a very simple framework to think about how one can become more convincing in data analysis. There are essentially two problems that need to be addressed. The fact that there are only two doesn't mean they are easy problems to tackle.

Consider the following X-Y relationship: A/B testing increases startup performance. The first problem, temporal ordering, is easy to solve: performance should be measured after the adoption of A/B testing, not at the same time, and not in the reverse order, where A/B testing adoption happens after performance is measured.

The second problem is much more challenging, and a variety of empirical techniques are used to address it. It can be summarized as the “all else equal” problem.

This challenge is twofold. First, there may be an unobserved factor, call it M, that leads to both increased performance and the adoption of A/B testing. This unobserved factor is often not just one thing but many, such as a startup's strategic focus on growth: that focus leads to both higher performance and the adoption of A/B testing. The correlation between A/B testing and performance is then not causal but driven by the focus on growth, and the coefficient on A/B testing is not a “causal estimate,” because it cannot be attributed purely to the impact of A/B testing independent of a growth orientation. Second, A/B testing may be one of many technologies adopted by the startup, and those technologies may collectively affect startup performance.
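To make the confounding story concrete, here is a minimal simulation sketch in Python (using numpy and statsmodels; the variable names and the data-generating process are invented for illustration, not drawn from any real study). An unobserved growth focus drives both A/B testing adoption and performance, so a naive regression produces a large coefficient on A/B testing even though its true effect is zero here:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Purely illustrative data-generating process: an unobserved growth focus (M)
# raises both the adoption of A/B testing and performance, while A/B testing
# itself has a true effect of zero.
growth_focus = rng.normal(size=n)
ab_testing = (growth_focus + rng.normal(size=n) > 0).astype(float)
performance = 2.0 * growth_focus + rng.normal(size=n)

# Naive regression of performance on A/B testing alone:
# the coefficient is well above zero, purely from confounding.
naive = sm.OLS(performance, sm.add_constant(ab_testing)).fit()
print(naive.params)

# If the confounder were observable, adding it as a control would remove the
# bias and bring the coefficient on A/B testing back to roughly zero.
controlled = sm.OLS(
    performance, sm.add_constant(np.column_stack([ab_testing, growth_focus]))
).fit()
print(controlled.params)
```

The second regression previews the next point: controlling for the confounder removes the bias, but only because we could observe it in the simulation.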

In fact, there are as many confounding stories that muddy your causal interpretation of the impact of A/B testing on startup performance as there are readers of your paper — probably more.

In the olden days, you could get by with “control variables”: you would account for these confounders by including them in the model.

This is a nice way to start, but by no means convincing to the modern reader, because many of the things that may drive both the adoption of A/B testing and performance may not be observable.

How do you adjust for unobservable sources of heterogeneity that may bias your coefficient?

One approach used by much of the literature is the “fixed effect.” This requires longitudinal data: you measure performance over time, and your X variable also varies over time. In the case of A/B testing, there are periods when firms do not use A/B testing and periods when they do. The fixed effect accounts for unobserved heterogeneity by removing, from both X and Y, the differences across firms that are constant over time, such as a firm's baseline performance and its baseline propensity to adopt A/B testing.
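Here is a minimal sketch of the fixed-effects logic on simulated panel data (Python, with pandas and statsmodels; the firms, periods, and effect sizes are all invented). Pooled OLS is biased upward by a fixed firm trait; adding firm dummies, one common way to implement the fixed effect, absorbs that trait:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_firms, n_periods = 500, 6

# Illustrative panel: each firm has a fixed, unobserved trait that raises both
# its propensity to adopt A/B testing and its performance. The true effect of
# A/B testing is 0.5.
firm_trait = rng.normal(size=n_firms)
rows = []
for i in range(n_firms):
    for t in range(n_periods):
        ab = float(rng.random() < 0.2 + 0.5 * (firm_trait[i] > 0))
        perf = 1.5 * firm_trait[i] + 0.5 * ab + rng.normal()
        rows.append({"firm": i, "period": t, "ab_testing": ab, "performance": perf})
panel = pd.DataFrame(rows)

# Pooled OLS is biased upward by the firm-level trait.
pooled = smf.ols("performance ~ ab_testing", data=panel).fit()
print(pooled.params["ab_testing"])

# Firm fixed effects (here, firm dummies) absorb anything constant within a
# firm over time, and the coefficient moves back toward the true 0.5.
fe = smf.ols("performance ~ ab_testing + C(firm)", data=panel).fit()
print(fe.params["ab_testing"])
```

Using firm dummies is equivalent to demeaning each firm's X and Y over time, which is the "removing constant differences" intuition described above.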

Fixed effects do a decent job, and in some situations, they are quite effective in dealing with a large amount of unobserved heterogeneity. In fact, you can see in our paper that the inclusion of fixed effects substantially shrinks the coefficient on A/B testing, suggesting that our concerns about bias are legitimate.

However, there’s another issue — time-varying unobserved heterogeneity. Firms are not bundles of fixed traits; their characteristics change over time. As a result, if there is a change in strategy, for instance, to become more data-driven, and this corresponds to more adoption of A/B testing, then the fixed effect does not account for this bias.

If this is the case, you have to account for this unobserved time-varying heterogeneity. The standard way to do so is an instrumental variables approach. In essence, you are looking for a source of variation in the adoption of A/B testing that is as good as random: it should be correlated with the adoption of A/B testing but not with any other channel through which performance can be improved.
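Here is a minimal sketch of the instrumental variables logic, again on simulated data (Python with statsmodels; the instrument and the data-generating process are invented, and the two stages are written out by hand only to show the mechanics; a real application should use a dedicated IV routine so the standard errors are right):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10000

# Illustrative setup: an unobserved, time-varying push to become data-driven (u)
# raises both A/B testing adoption and performance. The instrument z shifts
# adoption but, by construction here, affects performance only through adoption
# (the exclusion restriction). The true effect of A/B testing is 0.5.
u = rng.normal(size=n)
z = rng.normal(size=n)
ab_testing = (0.8 * z + u + rng.normal(size=n) > 0).astype(float)
performance = 0.5 * ab_testing + 1.0 * u + rng.normal(size=n)

# Naive OLS is biased upward by u.
print(sm.OLS(performance, sm.add_constant(ab_testing)).fit().params[1])

# Two-stage least squares, written out by hand to show the mechanics.
stage1 = sm.OLS(ab_testing, sm.add_constant(z)).fit()
ab_hat = stage1.fittedvalues
stage2 = sm.OLS(performance, sm.add_constant(ab_hat)).fit()
print(stage2.params[1])  # close to the true 0.5
```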

You can read more about instrumental variables here.

There are other methods to account for unobserved heterogeneity. These include techniques like difference-in-differences or regression discontinuity designs. A good book on these approaches is the very readable “Mostly Harmless Econometrics” by Angrist and Pischke.
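As a small illustration of the first of these, here is a sketch of a two-period difference-in-differences on simulated data (Python with pandas and statsmodels; all numbers invented). Treated and control firms are allowed to differ in levels; the design identifies the effect from the difference in their changes over time:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000

# Illustrative 2x2 difference-in-differences: some firms adopt A/B testing
# between period 0 and period 1, the rest never do.
treated = rng.integers(0, 2, size=n)
rows = []
for i in range(n):
    base = rng.normal() + 1.0 * treated[i]      # permanent level difference
    for post in (0, 1):
        effect = 0.5 * treated[i] * post        # true treatment effect = 0.5
        rows.append({
            "firm": i, "post": post, "treated": int(treated[i]),
            "performance": base + 0.3 * post + effect + rng.normal(scale=0.5),
        })
df = pd.DataFrame(rows)

# The interaction term is the difference-in-differences estimate.
did = smf.ols("performance ~ treated * post", data=df).fit()
print(did.params["treated:post"])  # close to the true 0.5
```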

What is the gold standard?

It is the randomized controlled trial. Even RCTs have their challenges, but I'm often struck by how little even sophisticated empiricists understand about why randomization works. The basic idea is balance.

In a sense, the approaches above all help you convince the reader that all else is equal. However, there is always a doubt that all else might not be equal. The most convincing way is perhaps to assign the treatment, X, to subjects such that, by construction and in expectation, all else is equal.

Randomization does this: as the sample size grows, the probability that the sample is balanced between those who adopt A/B testing and those who do not increases.

Here is a layperson’s interpretation:

Let’s examine the effect of randomization on balance below.

Let’s say we can measure five different characteristics of startups, X1 through X5, and there are another five, X6 through X10, that we cannot measure.

Now, let’s randomly assign who gets A/B testing and who does not.

Because we used a randomization technique, in expectation there should be no correlation between any of the characteristics X1 through X5 and whether a firm adopts A/B testing. Furthermore, even though the researcher only observes X1 through X5, we know that the unobserved characteristics X6 through X10 should not be correlated with A/B testing either.

Now, if we observe a correlation between A/B testing and performance, we know that it cannot be driven by a correlation with any of these other variables, observed or unobserved.
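Here is a small simulation sketch of that balance argument (Python with numpy and pandas; the ten characteristics are invented). Even though assignment ignores every characteristic, and precisely because it does, adopters and non-adopters end up with nearly identical means on all of them, including the ones we pretend are unobserved:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 20000

# Ten illustrative startup characteristics: pretend X1-X5 are observed
# and X6-X10 are not. The researcher never uses any of them for assignment.
X = pd.DataFrame(rng.normal(size=(n, 10)),
                 columns=[f"X{i}" for i in range(1, 11)])

# Randomly assign A/B testing, ignoring every characteristic.
ab_testing = rng.integers(0, 2, size=n)

# Balance check: the difference in means between adopters and non-adopters is
# close to zero for all ten characteristics, including the "unobserved" ones.
balance = X[ab_testing == 1].mean() - X[ab_testing == 0].mean()
print(balance.round(3))
```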

In a broader sense, causal inference is about getting closer and closer to the “all else equal” condition. Randomization does this in the most rigorous and convincing way, but it can also be achieved with other methods.