Writing Research Code in R

CreateData.R/do transforms your raw data into the derived format that you will use to conduct your analyses. CreateTables.R/do takes your derived data as input and conducts the analysis that will eventually go into your final tables for the paper, both for your main manuscript and the appendix. CreateFigures.R/do also takes your derived data as input and conducts analyses that will eventually go into your final figures for the paper, both for your manuscript and your appendix.

Good code has three characteristics: it must be complete, concise, and clear.

Complete: This means that anybody trying to replicate your results should be able to take the raw data and turn it into the final tables, including notes and comments, that eventually go into your paper.

Concise: This means that your code is parsimonious: you produce the results (both intermediate and final) with the least amount of code possible. This requires knowing good programming practices, and it will often make your code run faster.

Clear: This means that someone not involved in the production of your analysis can understand every line of code and what it does, both individually and together. Clarity requires two things: good comments and documentation, and good variable naming practices.

This section and the next are meant to help you write code that is complete, concise, and clear. To be honest, this is a skill that you will learn through practice. Because coding was not a big part of strategy research when I got my PhD, it took me a while to learn how to do all of these things, and I am still learning every time I work with younger people who have much better coding hygiene than I do.

To begin, I like to have three code files for every project. Both Stata and R, in addition to letting you enter commands directly into a console, also let you save your code in script files. In R, this code is stored in a .R file, and in Stata, in a .do file.

These are text files. In writing your code, you will begin with the CreateData file.

This file imports all the raw data that you will be using for your analysis, and converts it into the data set that will ultimately be what you analyze in the CreateTables and CreateFigures code files.

At the top of the code file, you should always begin with a header. The header contains information about the file, the author, the date the file was originally created, and any significant changes to the code over time. This lets you see when the file was created and whether it is the correct file to use for a given analysis you wish to conduct.

Here’s an example of a header in R. Note that this is commented text: a comment is indicated by a hash symbol (#) in R, and by an asterisk (*) or // in Stata.
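A minimal sketch of such a header (the project name, dates, and change note are placeholders you would fill in):

# ------------------------------------------------------------------
# File:     CreateData.R
# Project:  [project name]
# Author:   [your name]
# Created:  [YYYY-MM-DD]
# Purpose:  Import the raw data, clean it, and save the derived
#           data set used by CreateTables.R and CreateFigures.R
# Changes:  [YYYY-MM-DD] - [description of any significant change]
# ------------------------------------------------------------------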

The first thing you should do after you create your file is to save it in the code directory. The filename should be CreateData.R. Note that the filename is written as a single word, with each new word starting with a capital letter: CreateData.R is made up of “Create” and “Data.” This convention is called camel case.

The next step in your CreateData file is to set the working directory. As you recall from our previous discussion on creating your research computing environment, the project directory has several subdirectories, including: canonical, derived, tables, figures, code, and notes.
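A minimal sketch, assuming the project lives in a folder called myProject in your home directory (the path is a placeholder you would replace with your own):

# Set the working directory to the project root; all other paths
# are then relative to it (canonical/, derived/, tables/, figures/)
setwd("~/research/myProject")   # placeholder path; replace with your own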

After your working directory is set, you will want to load the relevant packages (installing them first if you have not already). The beauty of R is that it has thousands of packages that you can use to analyze your data.
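For instance, assuming the project uses dplyr and readr for data work (swap in whichever packages your project actually needs):

# Install once, then comment this line out
# install.packages(c("dplyr", "readr"))

# Load the packages at the top of every session
library(dplyr)
library(readr)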

Once your packages are installed and loaded, there are four steps that follow. Not all data sets will require each of these steps, but it is useful to know what they do.

The first step is to import the data. In many cases, your data will not be in one file but in multiple files. You will need to merge the data to create the final data set that you will analyze. In some cases, you may need to create multiple data sets depending on the unit of analysis (for instance, you may analyze your data at the firm level, the firm-year level, or the division-firm-year level). We will discuss this choice at a different point in time.

After you import the data, you will transform it into an intermediate file; this step often involves data cleaning, variable creation, aggregation, and transformation.

R has many import functions, which often take the form read.XXX (for instance, read.table, read.csv, and so on). Sometimes data sets come in more complicated formats, such as JSON or XML. In the example below, I show how to import a comma-separated values (.csv) file; many of my projects start with data in this text-based format.

Save the imported data in a data object (in R, typically a data frame).
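A minimal sketch, assuming a raw file called firms.csv in the canonical directory (the file name and its columns are hypothetical):

# Import the raw CSV and store it in a data frame
firms_raw <- read.csv("canonical/firms.csv", stringsAsFactors = FALSE)

# Quick sanity checks on the import
str(firms_raw)
head(firms_raw)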

Next, clean and transform the data and create the relevant variables. Then, save this file in the “derived” directory as an intermediate data structure. It is important to ensure that you understand how this file will be integrated or merged into the other intermediate files that you will create.
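Continuing the hypothetical example above, the cleaning step and the save to the derived directory might look like this (the variables sales and firm_id, and the log transformation, are assumptions for illustration):

# Clean the raw data and create the analysis variables
firms_clean <- firms_raw[!is.na(firms_raw$sales), ]          # drop rows with missing sales
firms_clean$log_sales <- log(firms_clean$sales)              # example derived variable
firms_clean$firm_id   <- as.character(firms_clean$firm_id)   # consistent key for later merges

# Save the intermediate file in the derived directory
saveRDS(firms_clean, file = "derived/firms_clean.rds")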

This process of importing, cleaning, and saving an intermediate file is repeated for every raw input file you have.

Once all the intermediate files are created, you can merge them together into a single data frame.
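A sketch, assuming two hypothetical intermediate files that share a firm_id key:

# Load the intermediate files
firms_clean   <- readRDS("derived/firms_clean.rds")
patents_clean <- readRDS("derived/patents_clean.rds")

# Merge them on the common identifier into the final analysis data frame
analysis_df <- merge(firms_clean, patents_clean,
                     by = "firm_id", all.x = TRUE)   # keep all firms, even those without patent data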

Finally, save your code, and save your final data frame as a single file that you can load when you begin to conduct your analysis.

In R, this file is saved as a .RData file.
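For example (analysis_df and the file name are placeholders):

# Save the final data frame so the analysis files can load it directly
save(analysis_df, file = "derived/analysisData.RData")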

The next file you need to create is your CreateTables file.

This file is structured in a similar manner to the CreateData file. You should begin with the header, followed by loading the packages relevant to your data analysis. Next, set the working directory and then load the data file that contains the data frame you will use to conduct your analysis.
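A sketch of the top of CreateTables.R, keeping the placeholder names from above and assuming fixest for estimation and stargazer for table output (your packages may differ):

# (Header goes here, as in CreateData.R)

# Packages used for the analysis and the tables
library(fixest)      # assumed estimation package
library(stargazer)   # assumed table-export package

# Working directory and data
setwd("~/research/myProject")         # placeholder path
load("derived/analysisData.RData")    # loads analysis_df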

I tend to organize this file into sections based on the tables I would like to produce. As discussed earlier, a paper should consist of five main tables:

Table 1: This is your summary statistics table.

Table 2: If you are running an experiment, this is your balance test table, or if you are compelled by old-fashioned reviewers, this is your correlation table.

Table 3: This is the table that tests your main claim or hypothesis.

Table 4: This table focuses on testing “mechanism” or secondary implications.

Table 5: This table focuses on heterogeneity.
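As an illustration, the Table 1 section of the file might look like the following sketch, which uses stargazer to write summary statistics for the placeholder analysis_df to the tables directory; your variables and preferred table package may well differ.

# ---- Table 1: Summary statistics ---------------------------------
stargazer(analysis_df,
          type  = "text",
          title = "Summary statistics",
          out   = "tables/table1_summary.txt")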

Finally, we can create the CreateFigures.R file.

Unlike tables, figures do not come in a standard set for a research paper.

Perhaps the most common figures include:

Figure 1: A histogram.

Figure 2: A coefficient plot.

Figure 3: An event study plot.
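For instance, Figure 1 could be produced with ggplot2 along these lines (the variable log_sales and the output file name are placeholders):

library(ggplot2)

# ---- Figure 1: Histogram ------------------------------------------
fig1 <- ggplot(analysis_df, aes(x = log_sales)) +
  geom_histogram(bins = 30) +
  labs(x = "Log sales", y = "Number of firms")

ggsave("figures/figure1_histogram.pdf", plot = fig1, width = 6, height = 4)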