VRIN Data
The vast majority of strategy research is empirical, relying on data that captures some aspect of reality and analysis that finds regular patterns in the data to support claims about how the world works.
A great empirical paper is based on great data and great analysis. While we spend a lot of time in our PhD programs discussing what constitutes “great analysis,” we rarely talk about what makes data great.
Here’s my definition: Great data allows you to uniquely provide the most convincing support for your paper’s claims.
It might be helpful to use a standard strategy framework, like VRIN, to organize our thoughts on what makes data great.
This framework helps us consider whether a firm’s resource or capability is a credible source of competitive advantage. Since data is a resource for a researcher and their paper, it is worth asking whether the framework applies to data as well.
Let’s begin by recapping what VRIN-O means:
V = Valuable
R = Rare
I = Inimitable
N = Non-substitutable
O = Organization
Now, let’s think through each of these categories with respect to a dataset.
VALUABLE
First comes “value.” In the standard strategy framework, a resource or capability is valuable if it allows the firm to increase customers’ willingness-to-pay or reduce its costs relative to competitors. In the context of research data, I’ll replace willingness-to-pay with willingness-to-publish. Examples of data value include:
- Measurement: Observing phenomena or testing theories that are otherwise difficult to observe.
- Causality: Being able to make convincing cause-and-effect claims about a set of phenomena.
- Generalizability: Making statements about a broader (and important) population.
- Detail: Related to measurement, but allowing you to trace a step-by-step causal chain or document important, nuanced outcomes and contingencies.
- Long-term: Related to both measurement and detail, allowing you to show the long-term impact of a phenomenon.
It’s important to note that you cannot achieve all of these things simultaneously; you have to make trade-offs. For instance, generalizability usually comes at the expense of clean causal inference, and vice versa, while detailed measurement often sacrifices generalizability as well. These trade-offs are ultimately driven by cost: data that is simultaneously causal, detailed, and representative is usually prohibitively expensive to collect.
Once you understand the value of your data, you can match it to the goals of your paper. For example, if your paper’s goal is to make a causal claim about a theory you’re testing, focus on what you can and cannot say with your data. Reviewers often get confused about what you’re trying to do, so don’t attempt too many things at once; doing so may confuse or even infuriate your reader. Like customers, readers are more likely to appreciate your paper when they know what it can and cannot do. Don’t oversell.
RARE
The second component is R (rarity). Are your data rare? Consider, for example, the National Longitudinal Survey of Youth (NLSY). You can easily download this data from the NLSY website and analyze it, just like anyone else. The NLSY is the opposite of rare: there are potentially hundreds or even thousands of researchers trying to test their theories using this data. Often their theories and your theories are the same, meaning that for any paper written using NLSY data, the data is not a competitive advantage because it is not rare. Similarly, when you run an experiment with undergraduates in a laboratory setting, or online using platforms like Mechanical Turk or Prolific, other people can easily do the same. These data are valuable but not rare.
INIMITABLE
Use of inimitable in the wild:
“I am the one thing in life I can control. I am inimitable; I am an original. I’m not falling behind or running late. I’m not standing still: I am lying in wait.”
Lin-Manuel Miranda, Hamilton: An American Musical (Aaron Burr)
Third, consider whether the data is inimitable. You can possess rare data that is nonetheless easy to copy. For instance, a nationally representative survey of workers run on Qualtrics might ask about their hiring process or interest in entrepreneurship. While such a dataset is “rare,” as only the researchers who conducted the study possess it, other researchers with sufficient funding can recreate the same (or approximately the same) data by fielding the survey again with the same research firm. Fortunately, academia offers a decent first-mover advantage, so if others replicate your work, they should probably cite you for it. Another example of imitable data is information scraped from platforms like Yelp or eBay. Initially, such data may have been rare for the researchers who developed the scraping technique, but it ceased to be rare once others learned how to do it.
What, then, constitutes inimitable data? Given the drive for open science, data should not be entirely inimitable. One example of inimitable data comes from Robert Sapolsky of Stanford University, who collected blood measurements from baboons over several decades. It would be impossible for someone else to recreate this data, as it was collected over an extended period, and we cannot go back in time. In the social sciences, similarly hard-to-imitate datasets are often historical: researchers delve into national archives, collect data from text documents, and hand-code them.
NON-SUBSTITUTABLE
Next, consider whether your data is non-substitutable, meaning that others cannot make the same types of claims using another dataset. A substitute need not be perfect to threaten your advantage; for instance, many providers offer data on academic publications, including Dimensions, Web of Science, and OpenAlex. Some are better, more detailed, or cover a longer time horizon, but for the majority of research needs these datasets can easily be substituted for one another.
A non-substitutable dataset might be internal data from a company like Uber. It is challenging to study the impact of a ridesharing platform without access to internal company data, so a researcher with such access can make more credible claims about how the platform works than someone without it. Luigi Zingales offers an interesting discussion of this trade-off: while privileged access benefits the researcher, it may not necessarily be good for science. Another example is the detailed IRS data that researchers such as Thomas Piketty have assembled over the years.
ORGANIZATION
Finally, I believe the most crucial element in all of this is organization. Take the example of patent datasets. Many researchers build tremendous capabilities around this data through their investments in cleaning it, organizing it, and adding features that were not available in the original datasets. These capabilities include merging patents with other data, such as firm characteristics, stock prices, and the scientific publications the patents draw on. With this infrastructure in place, a researcher can produce a paper at a much lower cost than someone entering the field without such substantial investments in data organization.
Organization can enable you to do two things: first, increase the value of your data; second, create rarity, inimitability, and non-substitutability, leading to a protected data advantage.
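To make this concrete, here is a stylized sketch of what that one-time organizational investment might look like in code. All file and column names are hypothetical; the point is that the expensive step is building the linked panel, and every subsequent project reuses it cheaply.

```python
# A stylized sketch of the "organization" investment (all file and column
# names here are hypothetical). The expensive, one-time step is building a
# linked panel; every later project reuses it at near-zero cost.
import pandas as pd

patents = pd.read_csv("patents_clean.csv")         # patent_id, firm_id, grant_year, ...
firms = pd.read_csv("firm_characteristics.csv")    # firm_id, industry, employees, ...
cites = pd.read_csv("patent_paper_citations.csv")  # patent_id, paper_doi, ...

# Link each patent to the firm that owns it and the science it builds on.
panel = (
    patents
    .merge(firms, on="firm_id", how="left")
    .merge(cites, on="patent_id", how="left")
)

panel.to_parquet("patent_firm_science_panel.parquet")  # the reusable asset
```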
THINKING ABOUT DATA AS A COMPETITIVE ADVANTAGE
Now, you can use the standard strategy framework of VRIN-O to test whether the data or data capabilities you are investing in are worthwhile.
One could imagine assessing your paper and asking: will this data be a competitive advantage in the publication process? Qualitatively, you can only say yes if you are uniquely able to answer the research question or defend your claims with the data you have.
Now, I’d like to reframe the question animating this section by considering your data as a resource that you share rather than hoard. As researchers, we all aim to publish our work in the best journals possible and, hopefully, receive citations. Data is an input in the paper production process. One way to create and capture value is to cultivate high-quality datasets that no one else has. Another is to provide other researchers with inputs that they couldn’t obtain elsewhere, or that would cost them too much time and effort to create themselves. By dramatically lowering the cost of an essential input, you create value for other researchers – they publish with your data – and capture value because they cite you and you receive credit.
There are two more broad ideas related to data that are worth mentioning. Data costs money, and there are two costs associated with building a dataset: fixed and variable. Most datasets involve both, but the cost of scaling a dataset grows at different rates depending on the size of the marginal cost.
For instance, when using public data, most costs are fixed: downloading, cleaning, and storing the data. The marginal costs are then very low. Once you build a dataset of patents matched to firms and scientific articles, you can reuse it as many times as you can come up with a good research question it can answer. As a result, you can spread the fixed costs over many projects, reducing the average cost of the data component of any given research paper.
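A toy calculation (all numbers invented) makes the amortization logic explicit: the average data cost per paper falls quickly as the same fixed investment serves more projects.

```python
# Toy amortization of a fixed data investment (all numbers are invented).
def avg_data_cost_per_paper(fixed: float, marginal_per_paper: float, n_papers: int) -> float:
    """Average data cost per paper when one fixed investment serves n papers."""
    return fixed / n_papers + marginal_per_paper

# E.g., a $50,000 cleaning-and-matching effort, plus ~$500 of
# project-specific data work for each paper that reuses it:
for n in (1, 3, 10):
    print(f"{n:>2} papers -> ${avg_data_cost_per_paper(50_000, 500, n):,.0f} per paper")
# 1 papers -> $50,500 | 3 papers -> $17,167 | 10 papers -> $5,500
```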
In contrast, experiments have both high fixed and high marginal costs. The fixed costs come from building the technical infrastructure to run the experiment and from investments in creating surveys and interventions. Then there is the marginal cost of obtaining each subject, which is most visible in a field experiment, where you have to incentivize people to participate: for every observation, you must recruit, pay, and accommodate a subject.
Sometimes the fixed costs are very low because someone else has borne them for you. For instance, if you are running a Mechanical Turk experiment, the infrastructure is already there, and the cost of building a survey in a tool like Qualtrics and deploying it to your sample is relatively low. Most of the expense comes from the marginal cost of running the study, with each subject paid a specific amount to complete it.
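A back-of-the-envelope budget (the rates below are hypothetical) shows how the marginal term dominates an online experiment:

```python
# Back-of-the-envelope budget for an online experiment (hypothetical rates).
n_subjects = 1_000
pay_per_subject = 2.50   # payment to each participant
platform_fee = 0.20      # e.g., a 20% platform surcharge on payments
fixed_setup = 300.00     # survey building, piloting, etc.

total = fixed_setup + n_subjects * pay_per_subject * (1 + platform_fee)
print(f"${total:,.2f}")  # $3,300.00 -- nearly all of it marginal cost
```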
It is helpful to understand the economics of your underlying data. If you have a high-fixed-cost dataset, consider how you can spread that cost over multiple papers. If you have a high-marginal-cost dataset, make every observation count: extract the maximum value from each data point by reducing measurement error, measuring all the variables you need, and so on. If you have a dataset with both high fixed and high marginal costs, such as a field experiment, think about how to reuse your fixed-cost investments in a future experiment or leverage the sample for another research paper.
It is important to note that when writing papers, your goal is to create reader surplus. This surplus represents the new knowledge and insight that readers can use in their own scientific endeavors: it helps them write better papers, make more important or interesting observations about the world, and, more practically, get published.
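To put the analogy in rough consumer-surplus terms (V and P here are stand-in symbols, not measured quantities): reader surplus = V − P, where V is the value a reader derives from your paper and P is the “price” they pay to engage with it.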
P, the price, is often reflected in the placement of your article within the world of journal publishing. If your work creates low value, the price individuals are willing to pay for your article – their willingness to accept it into a top-tier journal – is lower. In other words, the perceived value of your article determines its potential for publication in prestigious journals.