This page is part of the Econometric Computing Learning Resource (ECLR).

Introduction

In this exercise we will document how to assemble a data-set that is not nicely presented as a spreadsheet. The data we will be preparing here is a country level economic growth dataset with a large number of covariates. It contains data for 138 countries and many covariates.

At the time of writing this the data were available from this website by The University of Bristol. From that website you can download a zip file containing the actual datafiles and a text file that provides excellent information on the variables. The data dictionary is also available from here (readme.txt) and the zip file from here (BARLEE.zip).

Once you downloaded the zip file you should extract the zip file into a folder called “BARLEE”. We will now set our working directory to that folder.

## [1] "C:/Rcode/ECLR/data/BARLEE"
setwd("..YOUR_CORRECT_PATH_TO../BARLEE")

Let’s see what the files are called in this directory.

filelist <- list.files("C:/Rcode/ECLR/data/BARLEE")   # Here I use my path, you will have to adjust
filelist
##  [1] "bmp.prn"                  "codes.prn"               
##  [3] "enrolh.prn"               "enrolp.prn"              
##  [5] "enrols.prn"               "fert.prn"                
##  [7] "fert1.prn"                "gdpsh4.prn"              
##  [9] "gdpsh5.prn"               "gdpwb.prn"               
## [11] "geegde.prn"               "govwb.prn"               
## [13] "GrowthData_colnames.csv"  "GrowthData_complete.csv" 
## [15] "GrowthData_countries.csv" "gvxdxe.prn"              
## [17] "high.prn"                 "highc.prn"               
## [19] "human.prn"                "hyr.prn"                 
## [21] "invpub.prn"               "no.prn"                  
## [23] "pinstab.prn"              "politics.prn"            
## [25] "pop.prn"                  "ppp.prn"                 
## [27] "pri.prn"                  "pric.prn"                
## [29] "pyr.prn"                  "readme.txt"              
## [31] "sec.prn"                  "secc.prn"                
## [33] "syr.prn"                  "teacher.prn"             
## [35] "trade.prn"
filelist_prn <- filelist[grepl(".prn",filelist)]   # this extracts only the filenames with ".prn" in the name

All but one file have the extension “.prn” the only other file is “readme.txt” which is the data dictionary. You may wonder what format the “.prn” files are. You will find that information in the data dictionary:

” The files are plain text (ASCII) files. Data on each variable are listed by the order of SHCODE (see Appendix 1). The missing observations are denoted by ‘NA’.”

The variable SHCODE is also described in the data dictionary as a numeric variable representing countries (e.g. SHCODE = 2 represents Angola). We have learned from that the datafiles are ASCII files. This means that they are basically text files and you should be able to open them with any text editor (try!).

Here is the plan for the remainder. We shall first import the data for one of the files to figure out how to do that and then we will try and automate this process for all files.

Prototype

We will first attempt to master the import of one file and then try to automate this process as much as possible for the others.

When you open these (here “fert.prn”) you will see something like this:

There is a number of things you can learn from that.

  1. There are no variable names
  2. Numbers seem to be separated by spaces
  3. Missing numbers are represented by “NA”
  4. There seem to be regular linebreaks (one line displaying over two lines on the screen)
  5. Counting the number of values in one line gives 23 variables
  6. There are 138 lines in that file (you have to trust me or check in your text editor), hence the number of lines correspond to the number of countries as anticipated above.

So, it looks as if this file contains 23 pieces of information (variables) for 138 countries. In the data dictionary you will find that this file should contain the following variables: “fert60 fert65 fert70 fert75 fert80 fert85 mort60 mort65 mort70 mort75 mort80 mort85 lifee060 lifee065 lifee070 lifee075 lifee080 lifee085 gpop1 gpop2 gpop3 gpop4 gpop5”. Indeed these are 23.

Let’s import the data into R. A bit of web-searching (search for “R import ASCII file”) will reveal that there are several functions that can be used for that purpose. For instance (but not only) read_delim and read_table. Let’s try the first (and recall that you can use ?read_delim to figure out what options are available.)

fert1 <- read_delim("C:/Rcode/ECLR/data/BARLEE/fert.prn", delim = " ", col_names = FALSE, na = "NA")
dim(fert1)
## [1] 139  69

This has imported one row too many and way too many variables. Let’s look at the first few rows to see what happened.

head(fert1)
## # A tibble: 6 × 69
##   X1       X2    X3 X4       X5    X6    X7    X8    X9   X10   X11   X12   X13
##   <chr> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ""     7.34  NA   NA     7.44  NA    NA    7.42    NA  NA    7.26    NA NA   
## 2 ""    NA     NA   NA    NA     NA    NA   NA       NA  NA   NA       NA NA   
## 3 ""     6.79  NA   NA     6.84  NA    NA    6.85    NA  NA    6.64    NA NA   
## 4 ""     6.82  NA   NA    NA      6.9  NA   NA       NA   6.9 NA       NA  6.78
## 5 ""     6.36  NA   NA     6.35  NA    NA    6.39    NA  NA    6.46    NA NA   
## 6 ""    NA      6.4 NA    NA     NA     6.4 NA       NA  NA    6.4     NA NA   
## # ℹ 56 more variables: X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>,
## #   X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>,
## #   X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>,
## #   X31 <dbl>, X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>,
## #   X37 <dbl>, X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>,
## #   X43 <dbl>, X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
## #   X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>, …

We indicated that a space separates values from each other. If you look at the actual text file you will see that there are typically 3 spaces between entries. This explains the extra variables and missing values. Let’s re-import the data but indicate that three spaces separate values.

fert <- read_delim("C:/Rcode/ECLR/data/BARLEE/fert.prn", delim = "   ", col_names = FALSE, na = "NA")
head(fert)
## # A tibble: 6 × 24
##   X1         X2    X3    X4    X5    X6    X7     X8     X9    X10    X11    X12
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 " 7.34"  7.44  7.42  7.26  6.87 NA     6.21  0.166  0.154  0.139  0.12   0.098
## 2 ""      NA    NA    NA    NA    NA    NA    NA     NA     NA     NA     NA    
## 3 " 6.79"  6.84  6.85  6.64 NA     6.5  NA     6.5    0.183  0.166  0.155  0.138
## 4 " 6.82" NA     6.9  NA     6.9   6.78  6.82 NA      5.76   0.118  0.112  0.101
## 5 " 6.36"  6.35  6.39  6.46 NA     6.5  NA     6.5    0.217  0.193  0.17   0.159
## 6 "  6.4" NA     6.4  NA     6.4   6.46 NA     6.5   NA      6.68   0.151  0.142
## # ℹ 12 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
## #   X18 <dbl>, X19 <dbl>, X20 <chr>, X21 <dbl>, X22 <chr>, X23 <chr>, X24 <chr>

We still have one row and one variable too many and there still seem to be extra missing values. This is a typical and frustrating problem and there are several routes to solving this problem available.

  1. Import whole rows -> replace any number of spaces with only one space -> separate by space
  2. Use another import function to see whether life becomes easier.

Option 2 certainly sounds more attractive. Let’s try the read_table function:

fert <- read_table("C:/Rcode/ECLR/data/BARLEE/fert.prn", col_names = FALSE, na = "NA")
head(fert)
## # A tibble: 6 × 23
##   X1       X2    X3    X4    X5    X6     X7     X8     X9    X10    X11    X12
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 7.34   7.44  7.42  7.26  6.87  6.21  0.166  0.154  0.139  0.12   0.098  0.08 
## 2 <NA>  NA    NA    NA    NA    NA    NA     NA     NA     NA     NA     NA    
## 3 6.79   6.84  6.85  6.64  6.5   6.5   0.183  0.166  0.155  0.138  0.124  0.118
## 4 6.82   6.9   6.9   6.78  6.82  5.76  0.118  0.112  0.101  0.087  0.078  0.071
## 5 6.36   6.35  6.39  6.46  6.5   6.5   0.217  0.193  0.17   0.159  0.152  0.142
## 6 6.4    6.4   6.4   6.46  6.5   6.68  0.151  0.142  0.137  0.132  0.126  0.117
## # ℹ 11 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
## #   X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>

This seems to work like a charm. There is still one row too many.

tail(fert)
## # A tibble: 6 × 23
##   X1         X2    X3    X4    X5    X6     X7    X8     X9    X10    X11    X12
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 "6.28"   6.24  6.05  6.15  5.91  5.66  0.163  0.14  0.115  0.093  0.078  0.067
## 2  <NA>   NA    NA    NA    NA    NA    NA     NA    NA     NA     NA     NA    
## 3  <NA>   NA    NA    NA    NA    NA    NA     NA    NA     NA     NA     NA    
## 4  <NA>   NA    NA    NA    NA    NA    NA     NA    NA     NA     NA     NA    
## 5  <NA>   NA    NA    NA    NA    NA    NA     NA    NA     NA     NA     NA    
## 6 "\u001… NA    NA    NA    NA    NA    NA     NA    NA     NA     NA     NA    
## # ℹ 11 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
## #   X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>

You should compare to the text file and you will realise that the last row (row 139) does not contain any data, so we can just delete it and only keep rows 1 to 138.

fert <- fert[1:138,]

Now we will have to add the country name and the variable names. By default the variables were named x1 to X23. We can add the real variable names as follows:

cnames <- "fert60 fert65 fert70 fert75 fert80 fert85 mort60 mort65 mort70 mort75 mort80 mort85 lifee060 lifee065 lifee070 lifee075 lifee080 lifee085 gpop1 gpop2 gpop3 gpop4 gpop5"
cnames2 <- str_split(cnames," ")[[1]]

The vector cnames can be copied straight from the data dictionary. The second line separates the names into a list of 23 names. We can then name the columns of the fert object.

colnames(fert) <- cnames2

Now we need to add the country information. In the data dictionary you will find a table of the 138 countries. We have created a csv file of that table which is available from here GrowthData_countries.csv.

data_all <- read.csv("C:/Rcode/ECLR/data/BARLEE/GrowthData_countries.csv")

As we know that the rows in fert are in the same order as the countries in data_all (we learn this from the data dictionary) we can just attach the columns from fert to data_all.

data_all <- cbind(data_all, fert)

From prototype to production

That was quite a lot of work to get the data from one file into R. Altogether we have 31 files of that type. If you do repeated things you can often do them in a loop. Let’s think about whether we can do that here. For every file we would have to do the following

  1. Define the file name, filename
  2. Import filename the read_table function
  3. Change the colnames
  4. Add the columns to data_all

We already have a list of the filenames (filelist_prn) from which we can extract the right filename and hence we can automate step 1. Once we have done that step 2 is straightforward. However, step 3 requires the list of variable names. Above we manually copied them from the data dictionary. But we don’t really want to do that manually. To automate this step we will need to save the variable names in another csv file.

We have done exactly that in GrowthData_colnames.csv.

colnames_list <- read.csv("C:/Rcode/ECLR/data/BARLEE/GrowthData_colnames.csv")
colnames_list$Data <- trimws(colnames_list$Data) # removes leading and trailing spaces

Look at colnames_list to find that the first column (ASCII.File) has filenames and the 2nd column a list of all the variable names (Data).

Now we are in a position to automate the job. We start with a fresh data_all.

data_all <- read.csv("C:/Rcode/ECLR/data/BARLEE/GrowthData_countries.csv")

Now we run a loop through all the names in filelist_prn and for each of these files we move through steps 1 to 4.

for (i in filelist_prn) {
  # Step 1
  filename <- i
  fileandpath <- paste0("C:/Rcode/ECLR/data/BARLEE/",i) # adjust this to your path
  
  # Step 2
  # only import first 138 rows
  temp <- read_table(fileandpath, col_names = FALSE, na = "NA", n_max = 138)
  
  # Step 3
  temp.cnames <- colnames_list[colnames_list$ASCII.File == i,2]
  temp.cnames2 <- str_split(temp.cnames," ")[[1]]
  colnames(temp) <- temp.cnames2
  
  # Step 4
  # only attach cols corresponding to a variable name
  data_all <- cbind(data_all, temp[,1:length(temp.cnames2)])
}

The above code has a few elements that were added after examining results in details.

  • For most datafiles 139 lines were imported with the last line being an empty line. Therefore we included the n_max = 138 option into the read_table function, knowing that each file contains 138 rows of data only.
  • Some variables imported an extra column (i.e. one column more than there were variable names). These were all NAs. Therefore we attached temp[,1:length(temp.cnames2)] to data_all.

Once you hve done this you can save the resulting object data_all in a new csv file which you can then upload to work from.

write.csv(data_all, "C:/Rcode/ECLR/data/BARLEE/GrowthData_complete.csv")

Summary

This exercise demonstrated a piece of data management. It is not typical in its details as each data project will have its own challenges. But there are a few common features you are likely to encounter:

  • The need to merge data from different sources into one data frame.
  • The need to consult a data dictionary (or similar) to understand the data structure. The one provided here is excellent.
  • The opportunity to automate the data upload from multiple files with similar structure.
  • The need to deal with slight irregularities in the data files (like the extra rows or columns).

Each project will provide different challenges and solutions but the way of thinking demonstrated here will translate to many projects. You should, as we have done here, write a script or rmarkdown code which does the data handling with clear explanations. Once you have done that work you can put this code aside and keep working straight from the file you created, here GrowthData_complete.csv.