Linear Regression
In the previous section, we provided a broad framework for thinking about data analysis for your research paper. In this section, our attention turns to the workhorse of the analytics world: the simple and elegant linear regression. We all learn about linear regression in our statistics classes. Oftentimes, we learn about all the ways that real data aren't "linear" and how we need the bells and whistles of other methods to account for the shortcomings of the linear regression model. Increasingly, however, researchers have realized that linear regression is a powerful tool and a great first step for any data analysis in a research paper. Of course, Poisson regression, logistic regression, and negative binomial models are all good and important, but you can go a long way with just linear regression.
Let’s begin by thinking about linear regression as a simple equation:
Y = b0 + b1(x1) + e
In this model, Y is your dependent variable — the outcome you care about. Beta 0 (b0) is the baseline estimate of that outcome when x1 = 0 (the intercept). Beta 1 (b1) represents how much the outcome increases or decreases with each one-unit change in x1. Finally, e (epsilon) is the part of Y that cannot be explained by b0 + b1(x1).
In the world of strategy, this is your baseline model.
We are often interested in understanding the impact of some mechanism, behavior, treatment, or state of the world on some outcome that we care about. The former is the X. The latter is the Y.
For instance, if I take all my papers, the vast majority of them can be put into this simple X, Y framework. Here’s a list of these papers, and I’ve tried to boil all of them down into the basic X and Ys:
Specialization and career dynamics: Y = career success; X = specialization
The mechanics of social capital and academic performance: Y = academic performance; X = academic ability of peers
Peers and network growth: Y = network growth; X = the network structures/size of randomly assigned peers
And so on.
Now, this estimation can be run very simply in most software. The code to run a linear regression in Stata is simply:
reg y x
In R, the code is just as simple:
out = lm(y ~ x)
summary(out)
In order to use these two commands in Stata and R, you need a dataset where one column is the Y and the other column is the X. The rows are each of your observations in your dataset.
Here is an example dataset that just has these two columns.
Let’s run our code to see what our estimates of the beta coefficients are.
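To make this concrete without Stata or R at hand, here is a minimal sketch in Python (standard library only) of what those commands do under the hood: a two-column dataset, one Y and one X, and the closed-form OLS estimates of the two beta coefficients. The true values (b0 = 2.0, b1 = 0.5) and the sample size are invented for illustration.

```python
import random
from statistics import mean

random.seed(42)

# Simulate a two-column dataset: each row is one observation (x, y).
# True model: Y = 2.0 + 0.5 * X + noise (made-up coefficients).
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]              # the X column
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]   # the Y column

# Closed-form simple OLS: b1 = cov(x, y) / var(x); b0 = ybar - b1 * xbar
xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

print(round(b0, 2), round(b1, 2))  # estimates should land near 2.0 and 0.5
```

This is exactly the estimation that `reg y x` in Stata or `lm(y ~ x)` in R performs (those commands also report standard errors and fit statistics, which the sketch omits).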
Obviously, most data analysis tasks for a real research paper will not be so simple. In fact, the only time when a simple regression of the form estimated above makes empirical sense is when X is some treatment indicator that was randomly assigned, and Y is an outcome that happens in the future, after X has been randomly assigned.
Otherwise, the regression cannot be interpreted in the straightforward manner that we are taught in our first statistics class. Correlation does not equal causation. We will return to this in the next section of this chapter.
However, even for an experiment, this basic regression is not so interesting. In strategy especially, we are interested in what is often called “heterogeneity.” Simply put, heterogeneity is just the idea that some firms or individuals will benefit more or less from a given mechanism, or the impact of a treatment may be bigger or smaller for some subset of subjects versus another.
In order to account for this heterogeneity, we expand our linear regression model from above to include two new terms, b2(z1) and b3(x1*z1). The regression model now looks like:
Y = b0 + b1(x1) + b2(z1) + b3(x1*z1) + e
With this updated model, we can now evaluate heterogeneity in the impact of X1 on Y, depending on how subjects vary on some dimension Z1. The term b2 estimates how the outcome varies with pre-existing differences across observations on Z1. For instance, suppose we are interested in the impact of A/B testing on startup performance: X1 indicates whether a firm uses A/B testing (for the sake of argument, assume A/B testing is randomly assigned), and Z1 indicates whether the startup is located in Silicon Valley. Then b1 tells us the impact of A/B testing on startup performance (Y), and b2 tells us the difference in performance between startups located inside and outside Silicon Valley.
Beta 3 (b3) is the heterogeneity coefficient. It tells us whether startups that use A/B testing benefit more if they are from Silicon Valley. If b3 is positive and significant, we can interpret it as: startups based in Silicon Valley benefit more from A/B testing than those located outside it.
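When both X1 and Z1 are binary indicators, as in the A/B-testing example, b3 has an especially transparent form: it equals the difference-in-differences of the four cell means. The sketch below (Python standard library; all effect sizes are invented for illustration) simulates such data and recovers b3 directly from the cell means.

```python
import random
from statistics import mean
from itertools import product

random.seed(7)

def outcome(x1, z1):
    # Invented true model: Y = 3 + 1.0*X1 + 0.5*Z1 + 1.5*X1*Z1 + noise,
    # so the true heterogeneity coefficient b3 is 1.5.
    return 3 + 1.0 * x1 + 0.5 * z1 + 1.5 * x1 * z1 + random.gauss(0, 1)

# Simulate 500 startups in each (X1, Z1) cell and take cell means.
cells = {(x1, z1): [outcome(x1, z1) for _ in range(500)]
         for x1, z1 in product([0, 1], repeat=2)}
m = {k: mean(v) for k, v in cells.items()}

# b3 = (effect of A/B testing inside the Valley)
#    - (effect of A/B testing outside the Valley)
b3 = (m[(1, 1)] - m[(0, 1)]) - (m[(1, 0)] - m[(0, 0)])
print(round(b3, 2))  # estimate should be close to the true 1.5
```

Reading the interaction this way, as a comparison of the treatment effect across two groups, is often the clearest way to explain b3 to a reader.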
For the most part, these two equations are the primary equations most research papers in strategy are built around. They are estimated in a variety of different ways. Perhaps the most important variant arises when there is a time subscript on your observations.
Here is the second equation with time subscripts added:
Y(t) = b0 + b1(x1(t-1)) + b2(z1(t-1)) + b3(x1(t-1)*z1(t-1)) + e
The dataset that goes along with this will also be a little different from the one above. Instead of a single row per unit with the variables Y, X1, and Z1, we will have multiple observations per unit (for instance, per firm), indexed by a time variable t. To keep track of each firm, we will also have something like a firm ID.
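A quick sketch of that "long" panel layout, with one row per firm-year and the one-period lag of X1 that the equation above uses (firm names and values are made up):

```python
# One row per firm-year: (firm_id, year, y, x1)
rows = [
    ("A", 2019, 4.1, 0),
    ("A", 2020, 4.9, 1),
    ("A", 2021, 5.6, 1),
    ("B", 2019, 3.0, 1),
    ("B", 2020, 3.8, 0),
]

# Build x1(t-1) within each firm: look up last year's x1 for the same firm.
x1_by_key = {(f, yr): x1 for f, yr, _, x1 in rows}
panel = [(f, yr, y, x1, x1_by_key.get((f, yr - 1)))  # None in a firm's first year
         for f, yr, y, x1 in rows]

for row in panel:
    print(row)
```

The lag is constructed within firm, never across firms, which is why the firm ID matters: firm B's 2019 value must not become firm A's lag.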
Now, more sophisticated researchers may be asking, “None of this really applies to me because I do not run experiments. As a result, my X1 and Z1 are not randomly assigned, and therefore, I can’t just interpret beta-1 as causal.” And that’s right. In the next section, we will talk about how to think about causal inference in as clear a way as possible.
In a sense, causal inference consists of thinking through two challenges to inference: reverse causality and unobserved heterogeneity. The first can often be addressed purely through the structure of the data, for instance by measuring X before Y occurs, as in the lagged specification above. In the next section, we will go through standard procedures for addressing these two challenges in weaker and stronger ways.
I like to think about this as a gradient of control.
What causal inference is trying to do is make sure that your story is as consistent with your data analysis as possible. It is not “finding an instrument,” but rather ruling out as many alternative explanations for your effect as possible.