This page is part of the Econometric Computing Learning Resource (ECLR).
In this exercise we will document how to assemble a data-set that is not nicely presented as a spreadsheet. The data we will be preparing here is a country level economic growth dataset with a large number of covariates. It contains data for 138 countries and many covariates.
At the time of writing this the data were available from this website by The University of Bristol. From that website you can download a zip file containing the actual datafiles and a text file that provides excellent information on the variables. The data dictionary is also available from here (readme.txt) and the zip file from here (BARLEE.zip).
Once you downloaded the zip file you should extract the zip file into a folder called “BARLEE”. We will now set our working directory to that folder.
## [1] "C:/Rcode/ECLR/data/BARLEE"
setwd("..YOUR_CORRECT_PATH_TO../BARLEE")
Let’s see what the files are called in this directory.
filelist <- list.files("C:/Rcode/ECLR/data/BARLEE") # Here I use my path, you will have to adjust
filelist
## [1] "bmp.prn" "codes.prn"
## [3] "enrolh.prn" "enrolp.prn"
## [5] "enrols.prn" "fert.prn"
## [7] "fert1.prn" "gdpsh4.prn"
## [9] "gdpsh5.prn" "gdpwb.prn"
## [11] "geegde.prn" "govwb.prn"
## [13] "GrowthData_colnames.csv" "GrowthData_complete.csv"
## [15] "GrowthData_countries.csv" "gvxdxe.prn"
## [17] "high.prn" "highc.prn"
## [19] "human.prn" "hyr.prn"
## [21] "invpub.prn" "no.prn"
## [23] "pinstab.prn" "politics.prn"
## [25] "pop.prn" "ppp.prn"
## [27] "pri.prn" "pric.prn"
## [29] "pyr.prn" "readme.txt"
## [31] "sec.prn" "secc.prn"
## [33] "syr.prn" "teacher.prn"
## [35] "trade.prn"
filelist_prn <- filelist[grepl(".prn",filelist)] # this extracts only the filenames with ".prn" in the name
All but one file have the extension “.prn” the only other file is “readme.txt” which is the data dictionary. You may wonder what format the “.prn” files are. You will find that information in the data dictionary:
” The files are plain text (ASCII) files. Data on each variable are listed by the order of SHCODE (see Appendix 1). The missing observations are denoted by ‘NA’.”
The variable SHCODE
is also described in the data
dictionary as a numeric variable representing countries
(e.g. SHCODE = 2
represents Angola). We have learned from
that the datafiles are ASCII files. This means that they are basically
text files and you should be able to open them with any text editor
(try!).
Here is the plan for the remainder. We shall first import the data for one of the files to figure out how to do that and then we will try and automate this process for all files.
We will first attempt to master the import of one file and then try to automate this process as much as possible for the others.
When you open these (here “fert.prn”) you will see something like this:
There is a number of things you can learn from that.
So, it looks as if this file contains 23 pieces of information (variables) for 138 countries. In the data dictionary you will find that this file should contain the following variables: “fert60 fert65 fert70 fert75 fert80 fert85 mort60 mort65 mort70 mort75 mort80 mort85 lifee060 lifee065 lifee070 lifee075 lifee080 lifee085 gpop1 gpop2 gpop3 gpop4 gpop5”. Indeed these are 23.
Let’s import the data into R. A bit of web-searching (search for “R
import ASCII file”) will reveal that there are several functions that
can be used for that purpose. For instance (but not only)
read_delim
and read_table
. Let’s try the first
(and recall that you can use ?read_delim
to figure out what
options are available.)
fert1 <- read_delim("C:/Rcode/ECLR/data/BARLEE/fert.prn", delim = " ", col_names = FALSE, na = "NA")
dim(fert1)
## [1] 139 69
This has imported one row too many and way too many variables. Let’s look at the first few rows to see what happened.
head(fert1)
## # A tibble: 6 × 69
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## <chr> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "" 7.34 NA NA 7.44 NA NA 7.42 NA NA 7.26 NA NA
## 2 "" NA NA NA NA NA NA NA NA NA NA NA NA
## 3 "" 6.79 NA NA 6.84 NA NA 6.85 NA NA 6.64 NA NA
## 4 "" 6.82 NA NA NA 6.9 NA NA NA 6.9 NA NA 6.78
## 5 "" 6.36 NA NA 6.35 NA NA 6.39 NA NA 6.46 NA NA
## 6 "" NA 6.4 NA NA NA 6.4 NA NA NA 6.4 NA NA
## # ℹ 56 more variables: X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>,
## # X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>,
## # X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>,
## # X31 <dbl>, X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>,
## # X37 <dbl>, X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>,
## # X43 <dbl>, X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
## # X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>, …
We indicated that a space separates values from each other. If you look at the actual text file you will see that there are typically 3 spaces between entries. This explains the extra variables and missing values. Let’s re-import the data but indicate that three spaces separate values.
fert <- read_delim("C:/Rcode/ECLR/data/BARLEE/fert.prn", delim = " ", col_names = FALSE, na = "NA")
head(fert)
## # A tibble: 6 × 24
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " 7.34" 7.44 7.42 7.26 6.87 NA 6.21 0.166 0.154 0.139 0.12 0.098
## 2 "" NA NA NA NA NA NA NA NA NA NA NA
## 3 " 6.79" 6.84 6.85 6.64 NA 6.5 NA 6.5 0.183 0.166 0.155 0.138
## 4 " 6.82" NA 6.9 NA 6.9 6.78 6.82 NA 5.76 0.118 0.112 0.101
## 5 " 6.36" 6.35 6.39 6.46 NA 6.5 NA 6.5 0.217 0.193 0.17 0.159
## 6 " 6.4" NA 6.4 NA 6.4 6.46 NA 6.5 NA 6.68 0.151 0.142
## # ℹ 12 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
## # X18 <dbl>, X19 <dbl>, X20 <chr>, X21 <dbl>, X22 <chr>, X23 <chr>, X24 <chr>
We still have one row and one variable too many and there still seem to be extra missing values. This is a typical and frustrating problem and there are several routes to solving this problem available.
Option 2 certainly sounds more attractive. Let’s try the
read_table
function:
fert <- read_table("C:/Rcode/ECLR/data/BARLEE/fert.prn", col_names = FALSE, na = "NA")
head(fert)
## # A tibble: 6 × 23
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.34 7.44 7.42 7.26 6.87 6.21 0.166 0.154 0.139 0.12 0.098 0.08
## 2 <NA> NA NA NA NA NA NA NA NA NA NA NA
## 3 6.79 6.84 6.85 6.64 6.5 6.5 0.183 0.166 0.155 0.138 0.124 0.118
## 4 6.82 6.9 6.9 6.78 6.82 5.76 0.118 0.112 0.101 0.087 0.078 0.071
## 5 6.36 6.35 6.39 6.46 6.5 6.5 0.217 0.193 0.17 0.159 0.152 0.142
## 6 6.4 6.4 6.4 6.46 6.5 6.68 0.151 0.142 0.137 0.132 0.126 0.117
## # ℹ 11 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
## # X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>
This seems to work like a charm. There is still one row too many.
tail(fert)
## # A tibble: 6 × 23
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "6.28" 6.24 6.05 6.15 5.91 5.66 0.163 0.14 0.115 0.093 0.078 0.067
## 2 <NA> NA NA NA NA NA NA NA NA NA NA NA
## 3 <NA> NA NA NA NA NA NA NA NA NA NA NA
## 4 <NA> NA NA NA NA NA NA NA NA NA NA NA
## 5 <NA> NA NA NA NA NA NA NA NA NA NA NA
## 6 "\u001… NA NA NA NA NA NA NA NA NA NA NA
## # ℹ 11 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>, X17 <dbl>,
## # X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>
You should compare to the text file and you will realise that the last row (row 139) does not contain any data, so we can just delete it and only keep rows 1 to 138.
fert <- fert[1:138,]
Now we will have to add the country name and the variable names. By
default the variables were named x1
to X23
. We
can add the real variable names as follows:
cnames <- "fert60 fert65 fert70 fert75 fert80 fert85 mort60 mort65 mort70 mort75 mort80 mort85 lifee060 lifee065 lifee070 lifee075 lifee080 lifee085 gpop1 gpop2 gpop3 gpop4 gpop5"
cnames2 <- str_split(cnames," ")[[1]]
The vector cnames
can be copied straight from the data
dictionary. The second line separates the names into a list of 23 names.
We can then name the columns of the fert
object.
colnames(fert) <- cnames2
Now we need to add the country information. In the data dictionary you will find a table of the 138 countries. We have created a csv file of that table which is available from here GrowthData_countries.csv.
data_all <- read.csv("C:/Rcode/ECLR/data/BARLEE/GrowthData_countries.csv")
As we know that the rows in fert
are in the same order
as the countries in data_all
(we learn this from the data
dictionary) we can just attach the columns from fert
to
data_all.
data_all <- cbind(data_all, fert)
That was quite a lot of work to get the data from one file into R. Altogether we have 31 files of that type. If you do repeated things you can often do them in a loop. Let’s think about whether we can do that here. For every file we would have to do the following
filename
filename
the read_table
functiondata_all
We already have a list of the filenames (filelist_prn
)
from which we can extract the right filename and hence we can automate
step 1. Once we have done that step 2 is straightforward. However, step
3 requires the list of variable names. Above we manually copied them
from the data dictionary. But we don’t really want to do that manually.
To automate this step we will need to save the variable names in another
csv file.
We have done exactly that in GrowthData_colnames.csv.
colnames_list <- read.csv("C:/Rcode/ECLR/data/BARLEE/GrowthData_colnames.csv")
colnames_list$Data <- trimws(colnames_list$Data) # removes leading and trailing spaces
Look at colnames_list
to find that the first column
(ASCII.File
) has filenames and the 2nd column a list of all
the variable names (Data
).
Now we are in a position to automate the job. We start with a fresh
data_all
.
data_all <- read.csv("C:/Rcode/ECLR/data/BARLEE/GrowthData_countries.csv")
Now we run a loop through all the names in filelist_prn
and for each of these files we move through steps 1 to 4.
for (i in filelist_prn) {
# Step 1
filename <- i
fileandpath <- paste0("C:/Rcode/ECLR/data/BARLEE/",i) # adjust this to your path
# Step 2
# only import first 138 rows
temp <- read_table(fileandpath, col_names = FALSE, na = "NA", n_max = 138)
# Step 3
temp.cnames <- colnames_list[colnames_list$ASCII.File == i,2]
temp.cnames2 <- str_split(temp.cnames," ")[[1]]
colnames(temp) <- temp.cnames2
# Step 4
# only attach cols corresponding to a variable name
data_all <- cbind(data_all, temp[,1:length(temp.cnames2)])
}
The above code has a few elements that were added after examining results in details.
n_max = 138
option
into the read_table
function, knowing that each file
contains 138 rows of data only.temp[,1:length(temp.cnames2)]
to
data_all
.Once you hve done this you can save the resulting object
data_all
in a new csv file which you can then upload to
work from.
write.csv(data_all, "C:/Rcode/ECLR/data/BARLEE/GrowthData_complete.csv")
This exercise demonstrated a piece of data management. It is not typical in its details as each data project will have its own challenges. But there are a few common features you are likely to encounter:
Each project will provide different challenges and solutions but the
way of thinking demonstrated here will translate to many projects. You
should, as we have done here, write a script or rmarkdown code which
does the data handling with clear explanations. Once you have done that
work you can put this code aside and keep working straight from the file
you created, here GrowthData_complete.csv
.