In this walkthrough you will be exposed to some of the most basic, but also most relevant (and sometimes challenging), tasks in data analysis. In particular we will use real-life data to understand whether natural disasters are becoming more frequent (potentially due to climate change) and whether their increased frequency disproportionately affects those countries that actually contributed least to climate change.
This is written as if you have never been exposed to coding in R (or any other language). If you have, then many of the initial sections will look very familiar to you. But this walkthrough is also not meant to be a training guide to all the things we do. It is meant as an indication of what is possible and a demonstration of how data analysts actually work.
Importantly, this walkthrough will also show some of the unglamorous tasks of an applied economist/data scientist, such as dealing with error messages and finding out how to solve coding issues.
Before you do each task, you need to prepare your workspace first. Here we assume that you have successfully installed first R and then RStudio.
Step 1. Create a folder.
Create a folder called RWork on your computer. Remember where this folder is located on your computer (you will need the entire path to that folder later).
Step 2. Download the data and save it in the RWork folder.
We shall use data from a database of disasters (natural or technical, although we will focus on natural disasters). This database is maintained by The Centre for Research on the Epidemiology of Disasters (CRED). Go to the website and click on "Access the data" and follow the registration instructions. Use your university email and indicate that you are accessing the data from an academic institution. This will give you access to the data. After confirming your registration from your email account you will get to a data download screen and you should use the following settings (note that you should choose the data between 1900-2024):
After clicking download you should find that an EXCEL worksheet was downloaded. Make sure you give this worksheet a sensible name. On this occasion I named it "natural.xlsx" and, importantly, saved it in the RWork folder you created in Step 1.
Step 3. Open R and set your working directory.
We will be loading these data (information on more than 17,000 disasters) into RStudio and then perform some exploratory data analysis. For this to happen we will have to instruct R to load the data. But before we do that it is useful to let R know which folder we wish to work from. Of course this will be the RWork folder we created in Step 1.
Menu bar → Click Session → Set Working Directory → Choose Directory → Select the RWork folder you created in Step 1
Alternatively, you can set the working directory through code. This is where you need to know the entire path of your folder. The command used to set the working directory is setwd().
Example. Suppose I create a folder called RWork on my desktop and select the folder RWork when setting the working directory. Here is the code for different types of users.
setwd("~/Desktop/RWork") # for Mac users
setwd("C:/RWork") # for Windows users
This is an example. You may have given your directory a different name. You should replace this path with the actual path leading to your folder. Always use "/" (forward slash) even if on your computer folders are separated by a backslash.
After executing this line of code (CTRL + ENTER on Windows, CMD + ENTER on a Mac, or clicking the Run button while your cursor is in the line) you are ready to get data into the software.
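If you are unsure which folder R is currently working from, you can always check with the getwd() function from base R (a quick sanity check, nothing more):
getwd() # prints the current working directory; it should end in /RWork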
Step 4. Open an R script file before you start with your analysis.
Click on the little white sheet with the white plus over a green circle to open an R script:
An R script file is just a text file, which helps us to save our code.
Step 5. Type the code in the R script file.
Somewhere underneath, into your script, type the line below. Then have your cursor in that line and either press the run button or type CTRL + ENTER (or COMMAND + ENTER on a Mac).
5+2 # Summation
Here, R has calculated 5+2 and gives you the result 7. Also note that the text after "#" is a comment and will be ignored by R.
There are a number of reasons why you should be adding comments to your code. By far the most important is to communicate with your future self. Imagine you are writing code for a project which has a deadline in three weeks. However, you will have to interrupt the exciting data work for a week to finish an essay. When you return to your code in a week's time you will have forgotten why you did certain things, unless you have added sufficient comments to your code explaining to your future self why you did them.
Do add comments liberally.
a <- sqrt(5677170409)
You have just created a new variable, called a, and the value this variable has taken is \(\sqrt{5677170409}=75347\). You should be able to see that variable listed in the Environment tab (top right). So far this is really not much more than a calculator (with a function to save results).
You can now use the variable a in a new calculation. In the next line type
b <- exp(a-76*1000)
You should find that the variable b is now a very small, but positive, number. By the way, R is case sensitive: if you try to enter b <- exp(A-76*1000) you should get an error message (Error: object 'A' not found) in the Console as R doesn't know what A is.
Whenever you execute a line of code you will see that line of code copied in the Console (typically visible in the lower left hand corner of your RStudio screen). If the code you have executed produces any result to print (like when you calculated 5+2) the result will also be shown. In fact you could type lines of code straight into the Console and press ENTER to execute a line, and you will do that often, especially if you just want to check or try something.
The Console is also the place where you will see error messages displayed. You are sort of communicating with R via the Console.
BUT, if you close RStudio and then re-open it the
next day, then everything you typed into the Console will be lost. Yes,
gone. Just imagine the two hours of work you did to create some amazing
analysis and beautiful plots, gone. Therefore all lines of code that
contribute to your data work and you may want to repeat have to be in
the script file. Everything you put into a script and saved will still
be there the next time you open that script file. And by clicking the
Source
button R will automatically execute
your entire analysis.
Step 6. Save your work as an R script
Save the file in .R format. The R script will contain all your commands you have used in the current session and allows you to replicate what we have done in the future.
Click the disk icon. In the new window, name the R script file and save it in the folder you created in the step 1 (e.g. RWork).
These scripts are like a written
recipe of your work which allows you to reproduce your work. If you use
R scripts, you do not have to save your workspace image.
Revision question
Which of the following is not good practice (and you will regret doing it)?
The R software comes with a lot of functionality as you download it to your computer. As it is open-source software, anyone can add functionality to R. This added functionality comes in bundles of code that are called "packages". To use any of this added functionality, you first need to install the relevant package and then activate (or load) it in your session. Once the package is activated, you can access and use the functions it contains.
For instance, if we intend to use the read_excel() function for importing Excel data into R, it is essential to note that read_excel() is stored in a package called readxl. Therefore, to make use of read_excel(), you must first install and activate the readxl package. Once activated, you will have access to the read_excel() function within the readxl package for importing data.
Two ways to install packages in R.
The easiest way to install packages is through the RStudio package window (that's the pane in the bottom right corner of the screenshot below).
Click on the Packages tab and you will see all the packages that are already installed. To activate them, tick the box next to the package name.
If the required package is not there, you can install it via the Install tab. In the newly opened window, just type in the R package name and click 'install'. To update the installed R packages click on the Update tab that is just next to the Install tab.
You can also use the command line to install, activate and update packages. Assume you want to install the R package readxl:
install.packages("readxl")
A package only needs installing once on your computer. It is for this
reason that there is no need to save the
install.packages("readxl")
code in your script.
After the installation we activate/load the package with:
library(readxl)
This needs doing every time you start a new R session, and therefore the library(readxl) line should be saved in your script.
Key point:
Install package: just do once
Activate package: do it every time you need it
This means that you have installed the
readxl
package this time. In future
instances, there is no need to reinstall it. However, you will need to
activate it by typing the command library(readxl)
every
time you want to use the read_excel()
function stored in
it.
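If you want to avoid accidentally re-installing, a common pattern (shown here only as an optional sketch, not something you need for this walkthrough) is to install the package only if it is not yet on your computer and then load it:
if (!requireNamespace("readxl", quietly = TRUE)) { # check whether readxl is already installed
  install.packages("readxl")                       # install it only if it is missing
}
library(readxl)                                    # activate the package for this session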
We will also use functionality of the tidyverse
and
ggplot2
packages. In case you have not yet installed these
packages on your computer you can install these with the following
command:
install.packages(c("tidyverse", "ggplot2"))
Once these two packages are installed, activate/load them in your code. For each task, replace the missing part XXXX to make the code work.
library(XXXX)
XXXX(XXXX)
Once the readxl library is activated, we can use its
read_excel()
function to import the "natural.xlsx"
dataset. We then store the imported dataset in an object named
emdat
.
emdat <- read_excel("natural.xlsx", guess_max = 15000)
# for now ignore what guess_max = 15000 does
It is extremely important to be able to read code and understand what it does. This is especially true in a world where we increasingly do not have to write code from scratch but have helpers at our hand that deliver code. But you need to remember that you are the quality control and are responsible for the code doing what you want it to do.
Here we are again using the assignment operator <- which tells R to define a new object, here emdat, with whatever is on the right hand side of the assignment operator. On the right hand side we are using the read_excel function.
Functions are the bread and butter of working in R. Hidden behind read_excel is a lot of code that achieves a particular purpose. Usually that code needs some input variable(s). These input variables are delivered in the parentheses directly after the function name. How many and what inputs are needed depends on the function. You can find details on that if you call the help function. Here you would type ?read_excel into the Console and the help text will appear in the Help pane (typically in the bottom right of your RStudio screen).
The first input is the path to the EXCEL file we wish to import. Note that the default path used is the working directory we set previously. As we saved the datasheet into that directory we only need to give the file name. The function takes that EXCEL file and converts it into a dataframe (think of it as the equivalent of a spreadsheet in R) which can then be used in R. Once you have executed this line you should see emdat in the Environment tab.
What does the second input (guess_max = 15000
) actually
do? (More detail on that later)
While the purpose of this may still be a little murky, the help function should give you enough information to answer the question.
You can immediately see from the Environment tab that the object has 17,288 observations (or rows) and 46 variables (or columns).
If you click on the little spreadsheet symbol in the
emdat
line in the environment you can actually see the
spreadsheet. It should look a little like this
## # A tibble: 6 × 46
## DisNo. Historic `Classification Key` `Disaster Group` `Disaster Subgroup`
## <chr> <chr> <chr> <chr> <chr>
## 1 1900-0003-… Yes nat-met-sto-tro Natural Meteorological
## 2 1900-0006-… Yes nat-hyd-flo-flo Natural Hydrological
## 3 1900-0007-… Yes nat-bio-epi-vir Natural Biological
## 4 1900-0008-… Yes nat-geo-vol-ash Natural Geophysical
## 5 1900-0009-… Yes nat-geo-ear-gro Natural Geophysical
## 6 1900-9001-… Yes nat-cli-dro-dro Natural Climatological
## # ℹ 41 more variables: `Disaster Type` <chr>, `Disaster Subtype` <chr>,
## # `External IDs` <chr>, `Event Name` <chr>, ISO <chr>, Country <chr>,
## # Subregion <chr>, Region <chr>, Location <chr>, Origin <chr>,
## # `Associated Types` <chr>, `OFDA/BHA Response` <chr>, Appeal <chr>,
## # Declaration <chr>, `AID Contribution ('000 US$)` <dbl>, Magnitude <dbl>,
## # `Magnitude Scale` <chr>, Latitude <dbl>, Longitude <dbl>,
## # `River Basin` <chr>, `Start Year` <dbl>, `Start Month` <dbl>, …
It is often useful to find out what sort of variables are contained in the dataset. There are two useful commands.
names(emdat)
## [1] "DisNo."
## [2] "Historic"
## [3] "Classification Key"
## [4] "Disaster Group"
## [5] "Disaster Subgroup"
## [6] "Disaster Type"
## [7] "Disaster Subtype"
## [8] "External IDs"
## [9] "Event Name"
## [10] "ISO"
## [11] "Country"
## [12] "Subregion"
## [13] "Region"
## [14] "Location"
## [15] "Origin"
## [16] "Associated Types"
## [17] "OFDA/BHA Response"
## [18] "Appeal"
## [19] "Declaration"
## [20] "AID Contribution ('000 US$)"
## [21] "Magnitude"
## [22] "Magnitude Scale"
## [23] "Latitude"
## [24] "Longitude"
## [25] "River Basin"
## [26] "Start Year"
## [27] "Start Month"
## [28] "Start Day"
## [29] "End Year"
## [30] "End Month"
## [31] "End Day"
## [32] "Total Deaths"
## [33] "No. Injured"
## [34] "No. Affected"
## [35] "No. Homeless"
## [36] "Total Affected"
## [37] "Reconstruction Costs ('000 US$)"
## [38] "Reconstruction Costs, Adjusted ('000 US$)"
## [39] "Insured Damage ('000 US$)"
## [40] "Insured Damage, Adjusted ('000 US$)"
## [41] "Total Damage ('000 US$)"
## [42] "Total Damage, Adjusted ('000 US$)"
## [43] "CPI"
## [44] "Admin Units"
## [45] "Entry Date"
## [46] "Last Update"
str(emdat)
## tibble [17,288 × 46] (S3: tbl_df/tbl/data.frame)
## $ DisNo. : chr [1:17288] "1900-0003-USA" "1900-0006-JAM" "1900-0007-JAM" "1900-0008-JPN" ...
## $ Historic : chr [1:17288] "Yes" "Yes" "Yes" "Yes" ...
## $ Classification Key : chr [1:17288] "nat-met-sto-tro" "nat-hyd-flo-flo" "nat-bio-epi-vir" "nat-geo-vol-ash" ...
## $ Disaster Group : chr [1:17288] "Natural" "Natural" "Natural" "Natural" ...
## $ Disaster Subgroup : chr [1:17288] "Meteorological" "Hydrological" "Biological" "Geophysical" ...
## $ Disaster Type : chr [1:17288] "Storm" "Flood" "Epidemic" "Volcanic activity" ...
## $ Disaster Subtype : chr [1:17288] "Tropical cyclone" "Flood (General)" "Viral disease" "Ash fall" ...
## $ External IDs : chr [1:17288] NA NA NA NA ...
## $ Event Name : chr [1:17288] NA NA "Gastroenteritis" NA ...
## $ ISO : chr [1:17288] "USA" "JAM" "JAM" "JPN" ...
## $ Country : chr [1:17288] "United States of America" "Jamaica" "Jamaica" "Japan" ...
## $ Subregion : chr [1:17288] "Northern America" "Latin America and the Caribbean" "Latin America and the Caribbean" "Eastern Asia" ...
## $ Region : chr [1:17288] "Americas" "Americas" "Americas" "Asia" ...
## $ Location : chr [1:17288] "Galveston (Texas)" "Saint James" "Porus" NA ...
## $ Origin : chr [1:17288] NA NA NA NA ...
## $ Associated Types : chr [1:17288] "Avalanche (Snow, Debris)" NA NA NA ...
## $ OFDA/BHA Response : chr [1:17288] "No" "No" "No" "No" ...
## $ Appeal : chr [1:17288] "No" "No" "No" "No" ...
## $ Declaration : chr [1:17288] "No" "No" "No" "No" ...
## $ AID Contribution ('000 US$) : num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ Magnitude : num [1:17288] 220 NA NA NA 5.9 NA NA NA 7.9 NA ...
## $ Magnitude Scale : chr [1:17288] "Kph" "Km2" "Vaccinated" NA ...
## $ Latitude : num [1:17288] NA NA NA NA 40.3 NA NA NA 40.5 NA ...
## $ Longitude : num [1:17288] NA NA NA NA 43.1 ...
## $ River Basin : chr [1:17288] NA NA NA NA ...
## $ Start Year : num [1:17288] 1900 1900 1900 1900 1900 ...
## $ Start Month : num [1:17288] 9 1 1 7 7 NA NA NA 8 4 ...
## $ Start Day : num [1:17288] 8 6 13 7 12 NA NA NA 10 8 ...
## $ End Year : num [1:17288] 1900 1900 1900 1900 1900 ...
## $ End Month : num [1:17288] 9 1 1 7 7 NA NA NA 8 4 ...
## $ End Day : num [1:17288] 8 6 13 7 12 NA NA NA 10 8 ...
## $ Total Deaths : num [1:17288] 6000 300 30 30 140 1250000 11000 200000 18 1000 ...
## $ No. Injured : num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ No. Affected : num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ No. Homeless : num [1:17288] NA NA NA NA NA NA NA NA 24 NA ...
## $ Total Affected : num [1:17288] NA NA NA NA NA NA NA NA 24 NA ...
## $ Reconstruction Costs ('000 US$) : num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ Reconstruction Costs, Adjusted ('000 US$): num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ Insured Damage ('000 US$) : num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ Insured Damage, Adjusted ('000 US$) : num [1:17288] NA NA NA NA NA NA NA NA NA NA ...
## $ Total Damage ('000 US$) : num [1:17288] 30000 NA NA NA NA NA NA NA NA NA ...
## $ Total Damage, Adjusted ('000 US$) : num [1:17288] 1131126 NA NA NA NA ...
## $ CPI : num [1:17288] 2.65 2.65 2.65 2.65 2.65 ...
## $ Admin Units : chr [1:17288] NA NA NA NA ...
## $ Entry Date : chr [1:17288] "2004-10-18" "2003-07-01" "2003-07-01" "2003-07-01" ...
## $ Last Update : chr [1:17288] "2023-10-17" "2023-09-25" "2023-09-25" "2023-09-25" ...
The str command not only gives you a list of variable names but also tells you the variable type. The first variable (DisNo.) is of type chr, which is basically a text or character variable. It is a unique identifier for a disaster. Most variables are of type chr or num, which stands for numerical variables. Data types are important as there are some things you cannot do with certain variable types. For instance you couldn't multiply chr variables, as multiplication with text is not defined.
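A tiny illustration of why variable types matter (try these in the Console; the second line is commented out because it would produce an error):
3 * 2          # multiplying numbers works fine
# "three" * 2  # would fail with: Error ... non-numeric argument to binary operator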
When importing data we used the option "guess_max = 15000". You are now in a position to understand what this option does. Rerun the import command above with that option switched off (i.e. deleting it) and see whether you can see a difference in the variable types. By default R will use the first 1000 entries of a variable to guess what the best datatype is. Here the first 1000 observations are all very old disasters and for some variables (like reconstruction costs) all these observations are missing (NA). That means that R cannot guess a good data type from these initial 1000 observations. To allow R to guess the right datatype we will force R to guess the datatype from the first 15000 rows. It is for this reason that it is super useful to look at the datatypes and confirm that they are the data types we expect.
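If you want to see the difference for yourself, here is a minimal sketch of such a check (emdat_default is just a name chosen for this comparison, and the exact types R guesses may differ slightly on your machine):
emdat_default <- read_excel("natural.xlsx")              # default: types guessed from the first 1000 rows
class(emdat_default$`Reconstruction Costs ('000 US$)`)   # likely logical, as the first 1000 rows are all NA
class(emdat$`Reconstruction Costs ('000 US$)`)           # numeric, thanks to guess_max = 15000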
There is another small issue which is best fixed at the beginning. As we walk through our project we will need to address particular variables. When variable names have spaces in them, this becomes just a little bit more annoying in R and hence we apply the following line.
names(emdat) <- make.names(names(emdat))
Check out what this line did and once you see the difference, try and understand how the above line achieved this.
Understanding what particular lines of code do is super important. There are several ways to do that:
* The line above changes the names stored in emdat. So look at the names before you run that line (run names(emdat) in the Console) and look at the names again after you run the line above. You should see the difference. (This is really the best way to start.)
* Call up the help for the make.names function and read through the info (run ?make.names in the Console).
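To see what make.names() does to a name with spaces and special characters, you can try it on a single name in the Console (a small sketch):
make.names("AID Contribution ('000 US$)") # returns "AID.Contribution...000.US.."
This is why the renaming code below refers to names like "AID.Contribution...000.US..": every space and special character has been replaced by a dot.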
in the Console)Some of the variable names are very long and it may be useful to shorten them.
names(emdat)[names(emdat) == "AID.Contribution...000.US.."] <- "AID.Contr"
names(emdat)[names(emdat) == "Reconstruction.Costs...000.US.."] <- "Rec.Cost"
names(emdat)[names(emdat) == "Reconstruction.Costs..Adjusted...000.US.."] <- "Rec.Cost.adj"
names(emdat)[names(emdat) == "Insured.Damage...000.US.."] <- "Ins.Damage"
names(emdat)[names(emdat) == "Insured.Damage..Adjusted...000.US.."] <- "Ins.Damage.adj"
names(emdat)[names(emdat) == "Total.Damage...000.US.."] <- "Total.Damage"
names(emdat)[names(emdat) == "Total.Damage..Adjusted...000.US.."] <- "Total.Damage.adj"
Let's look at a particular example here:
names(emdat)[names(emdat) == "AID.Contribution...000.US.."] <- "AID.Contr"
You should always first find the assignment operator <-. What is on the right of it, here the text "AID.Contr", will be put into the place indicated on the left hand side. In this example we are not creating a new object on the left hand side, but we are replacing a value of an object that already exists.
That object that exists is names(emdat). That by itself is the list of column names in emdat. But in this line we do not want the whole list of 46 variable names to be replaced with just one name; rather, we wish to replace a particular one with "AID.Contr", leaving the other 45 unchanged. So how do we tell R which of the 46 names we want changed? This is what happens inside the square brackets. There we use names(emdat) == "AID.Contribution...000.US..", which is basically saying that we should replace the variable name that is currently set to "AID.Contribution...000.US..".
Formally this happens by creating a logical vector with 46 entries, 45 of which are FALSE and one is TRUE. But for now that is not important.
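If you want to see that mechanism in isolation, here is a tiny made-up example (the vector x has nothing to do with emdat):
x <- c("a", "b", "c")
x == "b"           # FALSE TRUE FALSE, a logical vector
x[x == "b"] <- "B" # only the position where the comparison is TRUE is replaced
x                  # "a" "B" "c"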
Let's look at a particular natural catastrophe, the one with
indicator DisNo. == "2019-0209-CHN"
.
emdat[emdat$DisNo. == "2019-0209-CHN",] # this uses [rows, cols]; as cols is empty it picks all columns
## # A tibble: 1 × 46
## DisNo. Historic Classification.Key Disaster.Group Disaster.Subgroup
## <chr> <chr> <chr> <chr> <chr>
## 1 2019-0209-CHN No nat-hyd-flo-flo Natural Hydrological
## # ℹ 41 more variables: Disaster.Type <chr>, Disaster.Subtype <chr>,
## # External.IDs <chr>, Event.Name <chr>, ISO <chr>, Country <chr>,
## # Subregion <chr>, Region <chr>, Location <chr>, Origin <chr>,
## # Associated.Types <chr>, OFDA.BHA.Response <chr>, Appeal <chr>,
## # Declaration <chr>, AID.Contr <dbl>, Magnitude <dbl>, Magnitude.Scale <chr>,
## # Latitude <dbl>, Longitude <dbl>, River.Basin <chr>, Start.Year <dbl>,
## # Start.Month <dbl>, Start.Day <dbl>, End.Year <dbl>, End.Month <dbl>, …
This is a flooding incident that happened in China in late June 2019. 300 people died and 4.5 million were affected according to the information in the database. Here is a BBC news article on this heavy rain incident.
You can learn much more about the structure of the database, how the data are collected and checked and the variable descriptions from this documentation page.
So far we haven't done any analysis on the data, but let's start to
use the power of the software. Let's say you want to look at some
summary statistics for some variables, for instance you are interested
in how many people are affected by the natural catastrophes in the
database (Total.Affected
).
summary(emdat$Total.Affected)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 704 6000 692757 60000 330000000 4486
You can see that the natural disasters registered here affect on average almost 700,000 people, with a maximum of 330 million. But almost 4,500 events have no information on the number of affected people. Especially early events do not have such information.
We can also see a breakdown of the type of incidents that have been
recorded. This is recorded in the Disaster.Subgroup
variable (for all incidents here
Disaster.Group == "Natural"
).
table(emdat$Disaster.Subgroup)
##
## Biological Climatological Extra-terrestrial Geophysical
## 1610 1289 1 1941
## Hydrological Meteorological
## 6897 5550
We find different categories with hydrological events (e.g. heavy rainfalls) being the most common.
The two lines of code above illustrate some important coding syntax. We used the functions summary and table. Each of these functions did something different, but both needed some input to do their magic (handed to the function inside the ( )). For the summary function the input was emdat$Total.Affected. It created numerical summary statistics as that variable is a numeric variable. What does that input mean? In words: "Use the dataset emdat but only use the variable Total.Affected." The summary function then calculated the summary stats for this variable.
A similar interpretation applies to
table(emdat$Disaster.Subgroup)
with the difference that the
table
function is being applied.
PRACTICE: Identify the Extra-terrestrial incident in the dataset and see whether you can find any related news reports.
If you wish to find a particular incident you can use the same structure of a code line as we did to display the particular event above. You will just have to change the variable name by which you filter and the value it should take accordingly.
emdat[emdat$Disaster.Subgroup == "Extra-terrestrial",] # this uses [rows, cols]; as cols is empty it picks all columns
## # A tibble: 1 × 46
## DisNo. Historic Classification.Key Disaster.Group Disaster.Subgroup
## <chr> <chr> <chr> <chr> <chr>
## 1 2013-0060-RUS No nat-ext-imp-col Natural Extra-terrestrial
## # ℹ 41 more variables: Disaster.Type <chr>, Disaster.Subtype <chr>,
## # External.IDs <chr>, Event.Name <chr>, ISO <chr>, Country <chr>,
## # Subregion <chr>, Region <chr>, Location <chr>, Origin <chr>,
## # Associated.Types <chr>, OFDA.BHA.Response <chr>, Appeal <chr>,
## # Declaration <chr>, AID.Contr <dbl>, Magnitude <dbl>, Magnitude.Scale <chr>,
## # Latitude <dbl>, Longitude <dbl>, River.Basin <chr>, Start.Year <dbl>,
## # Start.Month <dbl>, Start.Day <dbl>, End.Year <dbl>, End.Month <dbl>, …
Note that we did not use an assignment operator here. So we did not define a new object. We merely extracted some information and displayed it.
An alternative way to achieve the same (and there are always different ways to achieve the same thing) is:
emdat %>% filter(Disaster.Subgroup == "Extra-terrestrial")
## # A tibble: 1 × 46
## DisNo. Historic Classification.Key Disaster.Group Disaster.Subgroup
## <chr> <chr> <chr> <chr> <chr>
## 1 2013-0060-RUS No nat-ext-imp-col Natural Extra-terrestrial
## # ℹ 41 more variables: Disaster.Type <chr>, Disaster.Subtype <chr>,
## # External.IDs <chr>, Event.Name <chr>, ISO <chr>, Country <chr>,
## # Subregion <chr>, Region <chr>, Location <chr>, Origin <chr>,
## # Associated.Types <chr>, OFDA.BHA.Response <chr>, Appeal <chr>,
## # Declaration <chr>, AID.Contr <dbl>, Magnitude <dbl>, Magnitude.Scale <chr>,
## # Latitude <dbl>, Longitude <dbl>, River.Basin <chr>, Start.Year <dbl>,
## # Start.Month <dbl>, Start.Day <dbl>, End.Year <dbl>, End.Month <dbl>, …
How did we extract that info here? We took the emdat object and sent it (%>%) to the filter function. The filter function then filtered out the event(s) in the "Extra-terrestrial" subgroup. This type of operation is functionality that comes with the tidyverse package. The %>% operator is also called the piping operator.
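The pipe simply hands whatever is on its left to the function on its right. As a small sketch, the following two lines therefore produce exactly the same output:
summary(emdat$Total.Affected)      # function(argument) style
emdat$Total.Affected %>% summary() # piped style, same result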
PRACTICE: Find the disaster event by which 330000000 people were affected.
Complete the code below to extract the event.
emdat %>% XXXX(XXXX == XXXX)
You should find that the code extracts event "2015-9618-IND".
## # A tibble: 1 × 46
## DisNo. Historic Classification.Key Disaster.Group Disaster.Subgroup
## <chr> <chr> <chr> <chr> <chr>
## 1 2015-9618-IND No nat-cli-dro-dro Natural Climatological
## # ℹ 41 more variables: Disaster.Type <chr>, Disaster.Subtype <chr>,
## # External.IDs <chr>, Event.Name <chr>, ISO <chr>, Country <chr>,
## # Subregion <chr>, Region <chr>, Location <chr>, Origin <chr>,
## # Associated.Types <chr>, OFDA.BHA.Response <chr>, Appeal <chr>,
## # Declaration <chr>, AID.Contr <dbl>, Magnitude <dbl>, Magnitude.Scale <chr>,
## # Latitude <dbl>, Longitude <dbl>, River.Basin <chr>, Start.Year <dbl>,
## # Start.Month <dbl>, Start.Day <dbl>, End.Year <dbl>, End.Month <dbl>, …
So, the natural catastrophe that affected more than 300 million people was a drought in the Indian subcontinent.
As I prepared this walk-through I saw the following tweet.
Seeing this I wanted you to understand what this means, as indeed this is a crucial type of operation.
In order to better understand what type of disasters the different subgroups represent we can look at the following bit of code, which uses piping:
table.1 <- emdat %>%
group_by(Disaster.Subgroup, Disaster.Type) %>%
summarise(n = n(),Avg.Affected = mean(Total.Affected, na.rm = TRUE)) %>%
print()
## # A tibble: 15 × 4
## # Groups: Disaster.Subgroup [6]
## Disaster.Subgroup Disaster.Type n Avg.Affected
## <chr> <chr> <int> <dbl>
## 1 Biological Animal incident 1 5
## 2 Biological Epidemic 1514 38517.
## 3 Biological Infestation 95 640440
## 4 Climatological Drought 796 5504330.
## 5 Climatological Glacial lake outburst flood 5 29520.
## 6 Climatological Wildfire 488 57910.
## 7 Extra-terrestrial Impact 1 301491
## 8 Geophysical Earthquake 1617 176272.
## 9 Geophysical Mass movement (dry) 46 1217.
## 10 Geophysical Volcanic activity 278 44036.
## 11 Hydrological Flood 6049 779832.
## 12 Hydrological Mass movement (wet) 848 37817.
## 13 Meteorological Extreme temperature 696 701919.
## 14 Meteorological Fog 1 NaN
## 15 Meteorological Storm 4853 426788.
It is important to understand what the above operation does, or in other words it is an important skill to be able to read what a line of code does. One way to get support with this is to ask an LLM like ChatGPT to help you read that code.
After running the code above you can see that there were 95 infestations (biological) in the dataset.
This is one piped command (displayed over several lines for readability). To run it you merely have to put your cursor somewhere into that command and press the Run button (or press CTRL + ENTER).
So what did this code do? Here is the code again with comments to explain:
table.1 <- # Define a new object table.1
emdat %>% # Take the emdat object and "send it" (%>%) to the next function
group_by(Disaster.Subgroup, Disaster.Type) %>%
# Take what was piped and group the data
# by the variables Disaster.Subgroup, Disaster.Type and pipe
# the result to the next function
summarise(n = n(), Avg.Affected = mean(Total.Affected, na.rm = TRUE)) %>%
# Take what was piped and summarise each group by calculating
# the number of observations in each group
# the average number of people affected
# and pipe the result to the next function
print() # print what was piped to you
This is like a data production line. Just imagine how much work this would have been in Excel (although pivot tables get close to the convenience).
Now I want you to produce a similar table but group the data by
Region
and calculate the average aid received
(AID.Contr
) and the average total estimated damage
(Total.Damage
). Copy and paste the previous table and
adjust what needs adjusting.
## # A tibble: 5 × 4
## Region n Avg.Aid Avg.Dam
## <chr> <int> <dbl> <dbl>
## 1 Africa 3188 7890. 183091.
## 2 Americas 4304 24436. 1251189.
## 3 Asia 6907 28062. 769896.
## 4 Europe 2174 1882. 633294.
## 5 Oceania 715 3566. 378723.
In the above we used the mean
function. One of the
inputs into the function was the variable for which we wanted to
calculate the mean, but we also added na.rm = TRUE
. Try to
figure out what this did. Is it important?
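If you want a hint, try the following toy example in the Console (the vector is made up):
mean(c(1, 2, NA))               # returns NA, a single missing value "poisons" the mean
mean(c(1, 2, NA), na.rm = TRUE) # returns 1.5, the mean of the non-missing values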
This is the end of Task 2, where we imported the database with natural disasters.
Let's compare this to working in EXCEL, which is perhaps the software you would have used for such work previously. Imagine it is again the end of the day and you are closing down your computer. You could have done all that work in EXCEL and may have produced great tables. Next morning you get up and you realise that you actually did need to know the original variable names. Well, in EXCEL you would not know anymore what they are, as you changed them. You would have to go back to the original datafile (perhaps even download it again). Here, we still have the original datafile and you can just go to the code where you changed the names and adjust that.
Or you wanted to create summary tables using a different grouping variable. In EXCEL you would have to go through the process of creating the table again. Here, as everything is saved in a script, you just go to the line where you set the grouping variable, change it and press the Source button, as that will then re-run your entire code.
Above we already presented some useful summary statistics, such as the mean or maximum for a numeric variable, or frequencies for categorical variables. However, often the best way to convey a story is to use graphs. Here we are using the extremely powerful ggplot function. This is not the place to actually teach you how this works, but just to whet your appetite.
As we are thinking about the impact of climate change there is a lot of talk about natural disasters becoming more frequent. Let us count the disasters of different types every year and then show this information as a time-series plot. This should demonstrate whether there is indeed an increasing number of natural disasters in the database.
table.3 <- emdat %>%
group_by(Disaster.Type,Start.Year) %>%
summarise(n = n())
Have a look at table.3 to understand the structure of that object.
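For instance you could run the following in the Console (just one way of inspecting it):
head(table.3) # one row per Disaster.Type / Start.Year combination, with the count n
str(table.3)  # shows the variable types and dimensions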
The following code produces a line graph.
plot.ts <- ggplot(table.3, aes(y=n, x=Start.Year, colour = Disaster.Type)) +
geom_line()
plot.ts
This is not really a session to learn the details of graphing in R. But as you are curious, here is a quick run-through of what the code does:
* plot.ts <- : a new object called plot.ts is created from the code on the right hand side of <-.
* ggplot(table.3, ) : tells R to create a plot using the data in table.3, although at this stage it doesn't know yet what kind of plot.
* aes(y=n, x=Start.Year, colour = Disaster.Type) : basically tells R what should be on the axes of the plot, Start.Year on the horizontal axis (x), the number of disasters per year (n) on the vertical axis, and that the data should be separated (and shown in different colours) by Disaster.Type. But we still have not instructed R what type of plot to create.
* + geom_line() : this is where we instruct R to actually create a line plot with the earlier parameters (try geom_point() to see the difference). This concludes the plot creation.
* plot.ts : this just instructs R to actually display the plot we saved in plot.ts.
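If you are curious about the geom_point() alternative mentioned above, this sketch produces the same plot with points instead of lines (plot.points is just a name chosen for this example):
plot.points <- ggplot(table.3, aes(y = n, x = Start.Year, colour = Disaster.Type)) +
  geom_point() # same aesthetics, but each year/type count is drawn as a point
plot.points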
You can learn more about how to create great graphs with ggplot from this page on ECLR.
We can immediately see that some of the events have seen massive increases. However, we need to be aware that there are issues with relying on this dataset. Some of these are discussed on the data's website, in particular what they call the time bias, reflecting the insight that the increase in number of disasters may partly be driven by increased data collection.
Reconsidering the plot above, it is apparent that it is on the messy side as some categories are very small and make seeing the core information more difficult. So we will filter out event types that have only a small number of observations overall.
table(emdat$Disaster.Type)
##
## Animal incident Drought
## 1 796
## Earthquake Epidemic
## 1617 1514
## Extreme temperature Flood
## 696 6049
## Fog Glacial lake outburst flood
## 1 5
## Impact Infestation
## 1 95
## Mass movement (dry) Mass movement (wet)
## 46 848
## Storm Volcanic activity
## 4853 278
## Wildfire
## 488
We want to pick the types that have more than 200 incidents in the database. What we need is a list of names of event types that meet this criterion. The following achieves that
sel.types <- c( "Storm", "Flood", "Epidemic", "Volcanic activity", "Earthquake",
"Drought" , "Mass movement (wet)" ,"Wildfire", "Extreme temperature" )
However, it is good to learn how to do this in an automatic way, as these types of problems can be so big that it is impractical to do this manually.
sel.types2 <- emdat %>% group_by(Disaster.Type) %>%
summarise(n = n()) %>%
filter(n > 199) %>%
pull(Disaster.Type) %>%
print()
## [1] "Drought" "Earthquake" "Epidemic"
## [4] "Extreme temperature" "Flood" "Mass movement (wet)"
## [7] "Storm" "Volcanic activity" "Wildfire"
You can again check with ChatGPT to ensure that you understand what these steps achieved.
Why did we need a list of disaster types we want to focus on? The
reason is that with that list in hand we can easily filter out the data
we want to concentrate on. To do that we use the filter function, in
particular we want to filter (i.e. keep) the data that relate to
disaster types that are contained in sel.types2
.
table.3 <- table.3 %>% filter(Disaster.Type in sel.types2)
Have you received an error message?
Try to figure out what the problem is. Dealing with error messages is a bread and butter activity for anyone who works with data in the way we do here. Note the following about error messages:
Here are some tips on how to deal with error messages:
Can you fix the code? You will find an important hint here.
Before this operation table.3
had 864 rows of data. If
that has reduced to 802 rows then you have achieved the correct
result.
Now we can re-run the graph.
plot.ts <- ggplot(table.3, aes(y=n, x=Start.Year, colour = Disaster.Type))+
geom_line()
plot.ts
It is still slightly messy. There are two ways to improve the readability of the graph. We could smooth the data series to get a more obvious picture of the time trend; we will not do this here as it would require a bit more work. Another option is to only display data from 1970 onwards, as it does appear that before that date the activity stays approximately constant.
Use the filter
function to remove all data up to 1969
and then re-plot the graph
table.3 <- table.3 XXXX filter(Start.Year > XXXX)
plot.ts <- XXXX(table.3, aes(y = XXXX, XXXX = XXXX, XXXX = Disaster.Type))+
XXXX()
plot.ts
There are perhaps still too many lines to see clearly how these different types develop. There is a fairly straightforward way to visually disentangle these.
plot.ts <- plot.ts + facet_wrap(vars(Disaster.Type))
plot.ts
The two disaster types that have seen a very obvious increase are floods and storms, but wildfires, wet mass movements and extreme temperatures have also seen a steady increase (note that all graphs are on identical scales). All of these can plausibly be linked to climate change. The two geological events (earthquakes and volcanic eruptions) have not seen an obvious increase.
It is often said that the effects of climate change predominantly hit the countries that are not actually responsible for the bulk of the greenhouse gas (GHG) emissions that cause rising temperatures. The disaster database only tells us which countries are affected by natural disasters. We will now link the data from the disaster dataset with another dataset from which we can get information on GHG emissions. Linking datasets is an extremely important activity for anyone dealing with real-life data, an activity deserving a place in such an introduction.
That dataset is the EDGAR dataset maintained by the European Union. We are downloading the 2024 version of the dataset, in particular the "EDGAR Total GHG in CO2eq" version.
When you download this you will get a zip file and you should extract the contents of the zip file into your working directory. One of the files is an EXCEL file called "EDGAR_AR5_GHG_1970_2023.xlsx". Let us load this file.
## # A tibble: 6 × 1
## `Conditions of use and how to cite`
## <chr>
## 1 <NA>
## 2 © European Union 2024, European Commission, Joint Research Centre (JRC), EDGA…
## 3 <NA>
## 4 All emissions, except for CO2 emissions from fuel combustion, are from the ED…
## 5 IEA-EDGAR CO2 (v3) data are based on data from IEA (2023) Greenhouse Gas Emis…
## 6 <NA>
edgardat <- read_excel("EDGAR_AR5_GHG_1970_2023.xlsx")
head(edgardat)
This does not look like a proper datafile and you may actually have received an error message. What has gone wrong? Not all mistakes generate an error message. Sometimes R will do something, but it is not what we want to achieve.
Before loading a datafile into R it is actually important to understand the structure of the datafile. Hence we should open the file in EXCEL and look at it to understand its structure.
As you look at the file you will see that there are several tabs with
data and by default the read_excel
function imports the
first. But that is not the one with the data we want. The sheet that
contains the data we want to look at is called "TOTALS BY COUNTRY"
(sheet = "TOTALS BY COUNTRY"
). On that sheet we notice
something else. The data sheet only starts in row 10 with the variable
names. So when we import this we want to ignore the first 9 rows
(skip = 9
). We need to instruct R to do that. Before you
retry the import make sure that you close the spreadsheet.
## # A tibble: 6 × 59
## IPCC_annex C_group_IM24_sh Country_code_A3 Name Substance Y_1970 Y_1971
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Non-Annex_I Rest Central Amer… ABW Aruba GWP_100_… 3.67e1 4.15e1
## 2 Non-Annex_I India + AFG Afgh… GWP_100_… 1.54e4 1.54e4
## 3 Non-Annex_I Southern_Africa AGO Ango… GWP_100_… 1.90e4 1.89e4
## 4 Non-Annex_I Rest Central Amer… AIA Angu… GWP_100_… 3.36e0 3.40e0
## 5 Int. Aviation Int. Aviation AIR Int.… GWP_100_… 1.72e5 1.72e5
## 6 Non-Annex_I Central Europe ALB Alba… GWP_100_… 8.22e3 8.17e3
## # ℹ 52 more variables: Y_1972 <dbl>, Y_1973 <dbl>, Y_1974 <dbl>, Y_1975 <dbl>,
## # Y_1976 <dbl>, Y_1977 <dbl>, Y_1978 <dbl>, Y_1979 <dbl>, Y_1980 <dbl>,
## # Y_1981 <dbl>, Y_1982 <dbl>, Y_1983 <dbl>, Y_1984 <dbl>, Y_1985 <dbl>,
## # Y_1986 <dbl>, Y_1987 <dbl>, Y_1988 <dbl>, Y_1989 <dbl>, Y_1990 <dbl>,
## # Y_1991 <dbl>, Y_1992 <dbl>, Y_1993 <dbl>, Y_1994 <dbl>, Y_1995 <dbl>,
## # Y_1996 <dbl>, Y_1997 <dbl>, Y_1998 <dbl>, Y_1999 <dbl>, Y_2000 <dbl>,
## # Y_2001 <dbl>, Y_2002 <dbl>, Y_2003 <dbl>, Y_2004 <dbl>, Y_2005 <dbl>, …
edgardat <- read_excel("EDGAR_AR5_GHG_1970_2023.xlsx", sheet = "TOTALS BY COUNTRY", skip = 9)
head(edgardat)
This has worked! Let's have a look at the structure of the
edgardat
object. We have 225 rows of data, one for each
country. The real data are saved in columns which have the years in the
title. So the structure is like this
Country | Y_1970 | Y_1971 | ... | Y_2023 |
---|---|---|---|---|
Aruba | 36.7 | 41.5 | ... | 561.5 |
Afghanistan | 15437 | 15364 | ... | 29460.1 |
Angola | 18999 | 18866 | ... | 67700.8 |
... | ... | ... | ... | ... |
where the numbers are the CO2 equivalent GHG emissions for a particular country in a particular year.
This way of presenting data is sometimes called a wide format. It is not the best format to have your data in if you want to work in R (or any other language). We really want the data in what is called the long format:
Country | Year | Value | Variable |
---|---|---|---|
Aruba | 1970 | 36.7 | GHGem |
Aruba | 1971 | 41.5 | GHGem |
Aruba | ... | ... | ... |
Aruba | 2023 | 561.5 | GHGem |
Afghanistan | 1970 | 15437 | GHGem |
Afghanistan | 1971 | 15364 | GHGem |
Afghanistan | ... | ... | ... |
Afghanistan | 2023 | 29460.1 | GHGem |
Angola | 1970 | 18999 | GHGem |
Angola | 1971 | 18866 | GHGem |
Angola | ... | ... | ... |
Angola | 2023 | 67700.8 | GHGem |
... | ... | ... | ... |
If you only keep one variable you could do without the Variable column, but with that variable you can also keep several variables in the same structure. Basically each observation (Country-Year-Variable) has its own row.
The question is then how to translate the current wide version into a long version.
edgardat.long <- edgardat %>%
pivot_longer(
cols = starts_with("Y_"), # select all columns starting with "Y_"
names_to = "Year", # new column name for year
values_to = "GHG.em") %>% # new column name for values
mutate(Year = sub("Y_", "", Year)) # remove the "Y_" prefix
head(edgardat.long)
## # A tibble: 6 × 7
## IPCC_annex C_group_IM24_sh Country_code_A3 Name Substance Year GHG.em
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Non-Annex_I Rest Central America ABW Aruba GWP_100_A… 1970 36.7
## 2 Non-Annex_I Rest Central America ABW Aruba GWP_100_A… 1971 41.5
## 3 Non-Annex_I Rest Central America ABW Aruba GWP_100_A… 1972 52.4
## 4 Non-Annex_I Rest Central America ABW Aruba GWP_100_A… 1973 57.4
## 5 Non-Annex_I Rest Central America ABW Aruba GWP_100_A… 1974 56.6
## 6 Non-Annex_I Rest Central America ABW Aruba GWP_100_A… 1975 70.7
The above did the trick! This is not really the place to explain the above in detail. Suffice to say that this is a reasonably common operation, such that applied economists will know that the function to use is called pivot_longer. However, unless you do this on a very regular basis, you are unlikely to remember all the details of how to apply this function by heart.
This is where ChatGPT (or another LLM) can become an excellent coding partner. You can check out this interaction with ChatGPT to see how a well formulated query can deliver the solution. As you can see from this example you will have to give ChatGPT very clear instruction which includes a clear understanding of what it is you want.
Now we have a file in which we have a measure of annual GHG emissions for all countries in all years. Among these countries we have for instance China (which is massive) but also Luxembourg (which is really small). We know that we should not really compare the total GHG emissions of these two countries.
edgardat.long %>% filter(Name %in% c("China", "Luxembourg"), Year == "2023") %>%
select(Name, Year, GHG.em)
## # A tibble: 2 × 3
## Name Year GHG.em
## <chr> <chr> <dbl>
## 1 China 2023 15943987.
## 2 Luxembourg 2023 7861.
Unsurprisingly China's emissions are much much larger. When dealing with issues like this we will compare per-capita statistics, i.e. we want to compare CO2 emissions per person living in a country. In order to be able to do that we first have to import population data for the countries we have.
Annual population data are available from various sources. The "WorldBank_population.csv" file contains such data, from the World Bank. Let's import these.
## Country.Name Country.Code Indicator.Name Indicator.Code
## 1 Aruba ABW Population, total SP.POP.TOTL
## 2 Africa Eastern and Southern AFE Population, total SP.POP.TOTL
## 3 Afghanistan AFG Population, total SP.POP.TOTL
## 4 Africa Western and Central AFW Population, total SP.POP.TOTL
## 5 Angola AGO Population, total SP.POP.TOTL
## 6 Albania ALB Population, total SP.POP.TOTL
## X1960 X1961 X1962 X1963 X1964 X1965 X1966
## 1 54922 55578 56320 57002 57619 58190 58694
## 2 130075728 133534923 137171659 140945536 144904094 149033472 153281203
## 3 9035043 9214083 9404406 9604487 9814318 10036008 10266395
## 4 97630925 99706674 101854756 104089175 106388440 108772632 111246953
## 5 5231654 5301583 5354310 5408320 5464187 5521981 5581386
## 6 1608800 1659800 1711319 1762621 1814135 1864791 1914573
## X1967 X1968 X1969 X1970 X1971 X1972 X1973
## 1 58990 59069 59052 58950 58781 58047 58299
## 2 157704381 162329396 167088245 171984985 177022314 182126556 187524135
## 3 10505959 10756922 11017409 11290128 11567667 11853696 12157999
## 4 113795019 116444636 119203521 122086536 125072948 128176494 131449942
## 5 5641807 5702699 5763685 5852788 5991102 6174262 6388528
## 6 1965598 2022272 2081695 2135479 2187853 2243126 2296752
## X1974 X1975 X1976 X1977 X1978 X1979 X1980
## 1 58349 58295 58368 58580 58776 59191 59909
## 2 193186642 198914573 204802976 210680842 217074286 223974122 230792729
## 3 12469127 12773954 13059851 13340756 13611441 13655567 13169311
## 4 134911581 138569918 142337272 146258576 150402616 154721711 159166518
## 5 6613367 6842947 7074664 7317829 7576734 7847207 8133872
## 6 2350124 2404831 2458526 2513546 2566266 2617832 2671997
## X1981 X1982 X1983 X1984 X1985 X1986 X1987
## 1 60563 61276 62228 62901 61728 59931 59159
## 2 238043099 245822010 253644643 261458202 269450407 277621771 286067346
## 3 11937581 10991378 10917982 11190221 11426852 11420074 11387818
## 4 163762473 168585118 173255157 177880746 182811038 187889141 193104347
## 5 8435607 8751648 9082983 9425917 9779120 10139450 10497858
## 6 2726056 2784278 2843960 2904429 2964762 3022635 3083605
## X1988 X1989 X1990 X1991 X1992 X1993 X1994
## 1 59331 60443 62753 65896 69005 73685 77595
## 2 294498625 302939121 311748681 320442961 329082707 338324002 347441809
## 3 11523298 11874088 12045660 12238879 13278974 14943172 16250794
## 4 198485027 204062274 209566031 215178709 221191375 227246778 233360104
## 5 10861291 11238562 11626360 12023529 12423712 12827135 13249764
## 6 3142336 3227943 3286542 3266790 3247039 3227287 3207536
## X1995 X1996 X1997 X1998 X1999 X2000 X2001
## 1 79805 83021 86301 88451 89659 90588 91439
## 2 356580375 366138524 375646235 385505757 395750933 406156661 416807868
## 3 17065836 17763266 18452091 19159996 19887785 20130327 20284307
## 4 239801875 246415446 253207584 260297834 267506298 274968446 282780717
## 5 13699778 14170973 14660413 15159370 15667235 16194869 16747208
## 6 3187784 3168033 3148281 3128530 3108778 3089027 3060173
## X2002 X2003 X2004 X2005 X2006 X2007 X2008
## 1 92074 93128 95138 97635 99405 100150 100917
## 2 427820358 439173286 450928044 463076637 475606210 488580707 502070763
## 3 21378117 22733049 23560654 24404567 25424094 25909852 26482622
## 4 290841795 299142845 307725100 316588476 325663158 334984176 344586109
## 5 17327699 17943712 18600423 19291161 20015279 20778561 21578655
## 6 3051010 3039616 3026939 3011487 2992547 2970017 2947314
## X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 101604 101838 102591 104110 105675 106807 107906
## 2 516003448 530308387 544737983 559609961 575202699 590968990 607123269
## 3 27466101 28284089 29347708 30560034 31622704 32792523 33831764
## 4 354343844 364358270 374790143 385360349 396030207 406992047 418127845
## 5 22414773 23294825 24218352 25177394 26165620 27160769 28157798
## 6 2927519 2913021 2905195 2900401 2895092 2889104 2880703
## X2016 X2017 X2018 X2019 X2020 X2021 X2022
## 1 108727 108735 108908 109203 108587 107700 107310
## 2 623369401 640058741 657801085 675950189 694446100 713090928 731821393
## 3 34700612 35688935 36743039 37856121 39068979 40000412 40578842
## 4 429454743 440882906 452195915 463365429 474569351 485920997 497387180
## 5 29183070 30234839 31297155 32375632 33451132 34532429 35635029
## 6 2876101 2873457 2866376 2854191 2837849 2811666 2777689
## X2023 X2024
## 1 107359 107624
## 2 750503764 769294618
## 3 41454761 42647492
## 4 509398589 521764076
## 5 36749906 37885849
## 6 2745972 2714617
popdat <- read.csv("WorldBank_population.csv")
head(popdat)
This is another dataset in wide format (international institutions
love data in wide format). Use the same approach as above to turn this
dataset into a new object (popdat.long
) which presents the
data in long format
If you get it right you should be able to see the following data.
head(popdat.long)
## # A tibble: 6 × 6
## Country.Name Country.Code Indicator.Name Indicator.Code Year pop
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Aruba ABW Population, total SP.POP.TOTL 1960 54922
## 2 Aruba ABW Population, total SP.POP.TOTL 1961 55578
## 3 Aruba ABW Population, total SP.POP.TOTL 1962 56320
## 4 Aruba ABW Population, total SP.POP.TOTL 1963 57002
## 5 Aruba ABW Population, total SP.POP.TOTL 1964 57619
## 6 Aruba ABW Population, total SP.POP.TOTL 1965 58190
You should start by copying and pasting the previous conversion code
and trying to adjust it for the structure of popdat
:
popdat.long <- popdat %>%
XXXX(
cols = starts_with(XXXX), # select all columns starting with "X"
names_to = XXXX, # new column name for year
XXXX = "pop" ) %>% # new column name for values
mutate(Year = sub(XXXX, "", XXXX)) # remove the "X" prefix
This is the complete code:
popdat.long <- popdat %>%
pivot_longer(
cols = starts_with("X"), # select all columns starting with "X"
names_to = "Year", # new column name for year
values_to = "pop" ) %>% # new column name for values
mutate(Year = sub("X", "", Year)) # remove the "X" prefix
In order to calculate per-capita emissions we now need to bring the population and GHG emission data together. What we will do is merge the population data into the edgardat.long dataset, and once we have done that we can calculate the per-capita emissions. Before we apply the merge function we strip popdat.long down to the two variables on the basis of which we will merge ("Country.Code" and "Year") and the variable which we want to merge into edgardat.long, i.e. "pop".
popdat.long <- popdat.long %>% select(Country.Code, Year, pop)
edgardat.long <- merge(edgardat.long, popdat.long, # lists the two objects to merge
by.x = c("Country_code_A3","Year"), # lists the variables on which to match in the first object (x)
by.y = c("Country.Code","Year")) # lists the variables on which to match in the second object (y)
You will see that in edgardat.long
we now also have the
pop
variable. The way we used the merge
function it only kept those country-year observations for which we have
population data (check ?merge
to see in the help function
how you could change that).
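For instance, ?merge documents an all.x argument. The following sketch (commented out, as we do not use it here) shows how the merge above could have been written to keep every row of edgardat.long, with pop set to NA where no population data exist:
# edgardat.long <- merge(edgardat.long, popdat.long,
#                        by.x = c("Country_code_A3", "Year"),
#                        by.y = c("Country.Code", "Year"),
#                        all.x = TRUE) # keep unmatched rows from the first object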
Now we can calculate the per-capita GHG emissions. The data in the variable GHG.em are measured in the unit Gg, or gigagrams, which is equivalent to \(10^9\) grams = 1000 tonnes. The new GHG.em.pc variable will therefore be measured in tonnes per person.
edgardat.long <- edgardat.long %>% mutate(GHG.em.pc = 1000*GHG.em/pop)
Now we can compare the per-capita GHG emissions of China and Luxembourg.
edgardat.long %>% filter(Name %in% c("China", "Luxembourg"), Year == "2023") %>%
select(Name, Year, GHG.em, pop, GHG.em.pc)
## Name Year GHG.em pop GHG.em.pc
## 1 China 2023 15943986.553 1410710000 11.30210
## 2 Luxembourg 2023 7860.933 666430 11.79559
As it turns out, they are very similar.
When you do such calculations you should always check that the results are actually plausible and you haven't made any mistakes. This is especially important when you are manipulating units of measurement as we did here. An easy way to check the calculation we did here is to go to Google or your preferred LLM and check whether the above is about right. For instance, ask what the current GHG emissions per capita in Luxembourg are. It turns out that our calculation is in the correct ballpark!
In fact the question we will really ask is whether natural disasters happen predominantly in countries that are not responsible for large GHG emissions.
Let's think about what we would want to look at. In the end we will want to look at a scatter plot with the following axes:
* the average per-capita GHG emissions of a country, which we obtain by taking the data in edgardat.long and averaging across all years by country;
* a measure of how exposed a country is to natural disasters, which we will construct from the emdat data.
So let's calculate these.
country.info <- edgardat.long %>% group_by(Country_code_A3, Name) %>%
summarise(avg.GHG.em.pc = mean(GHG.em.pc))
We also change the name of the Country_code_A3 variable in country.info to ISO, which then matches the variable name of the three-letter country code we get from the emdat data.
names(Country.Info)[names(country.info) = "Country_code_A3"] <- "ISO"
When you run this code you should see an error message. In fact there are two errors in that line of code. Fix them to make sure that the variable is actually renamed. Check afterwards by running names(country.info) in the Console to confirm that the country code variable is now named ISO.
## [1] "ISO" "Name" "avg.GHG.em.pc"
We basically need to calculate two measures of exposure to natural disasters (those that can potentially be linked to the effects of climate change), a measure for the 70s, the base measure, and then a recent measure.
Let us look back at the information in the emdat database by looking at a particular event. In particular we need to understand how events that affect multiple countries are recorded. To investigate this we will attempt to identify Hurricane Sandy (see the Wikipedia entry on Hurricane Sandy) which, in late October 2012, affected several countries in the Caribbean and then the US. So let's filter by what we know.
test <- emdat %>% filter(Start.Year == 2012, Start.Month == 10, Disaster.Subgroup == "Meteorological")
There are 14 events selected and if you look at the data you will see
that in the Event.Name
variable there is actually an
indication which events are related to Sandy. So we can actually filter
on that.
test <- emdat %>% filter(Event.Name == "Hurricane Sandy") %>% select(DisNo., Event.Name, ISO, Country, Start.Year, Start.Month, Start.Day) %>% print()
## # A tibble: 6 × 7
## DisNo. Event.Name ISO Country Start.Year Start.Month Start.Day
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2012-0410-BHS Hurricane Sandy BHS Bahamas 2012 10 24
## 2 2012-0410-CUB Hurricane Sandy CUB Cuba 2012 10 22
## 3 2012-0410-DOM Hurricane Sandy DOM Dominica… 2012 10 24
## 4 2012-0410-HTI Hurricane Sandy HTI Haiti 2012 10 24
## 5 2012-0410-JAM Hurricane Sandy JAM Jamaica 2012 10 24
## 6 2012-0410-USA Hurricane Sandy USA United S… 2012 10 28
As you can see, for each affected country we get a separate event. This will greatly facilitate our measure of exposure to climate change as we are calculating country level exposures.
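You can also see that the six records share the disaster number prefix 2012-0410, with the country code appended. Assuming this numbering convention holds throughout the database, the same records could be picked out via the DisNo. variable; a sketch:
# the six country-level records of Hurricane Sandy share the prefix "2012-0410-"
emdat %>% filter(startsWith(DisNo., "2012-0410-")) %>%
  select(DisNo., Event.Name, ISO, Country)   # should return the same six rows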
What we did here is perhaps one of the most important activities an applied economist or data scientist will ever have to do. UNDERSTAND THE DATA. This will involve looking at particular data examples. Investigate what variables are available, understand what the unit of observation is, understand the sample of data available to you, be knowledgeable about the limitations the data come with.
There is no general recipe for doing that. You will have to be curious and ask critical questions. Here we knew that we should find a particular event (Hurricane Sandy) in the data and we went to find it. If you can see how something you already know is represented in the data, then this can be a great path to understanding the data.
On top of that looking at summary statistics and graphical representations is often a great path to improved data understanding.
In order to calculate exposure we begin by calculating the number of
events that affected a particular country between 1970 and 1979. But we
will only count the events that could potentially be explained by
climate change, i.e. those in the following
Disaster.Subgroup
: Climatological, Hydrological and
Meteorological.
sel.dis.types <- c("Climatological", "Hydrological", "Meteorological")   # disaster subgroups potentially linked to climate change
Country.info.dis.base <- emdat %>%
  filter(Disaster.Subgroup %in% sel.dis.types,
         Start.Year > 1969,
         Start.Year < 1980) %>%      # events starting in 1970 to 1979
  group_by(ISO) %>%
  summarise(base = n())              # count of events per country
Note, we are explicitly not claiming that any individual event here is the result of climate change. We can see that we have counts of events for 124 countries. That means that a good number of countries did not have any such disasters recorded during that decade.
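A quick way to confirm this count:
nrow(Country.info.dis.base)   # should be 124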
You could quite rightly ask why we don't just take the disaster count in countries in 1970 and compare that to 2023. Why do we need to calculate averages?
The reason is that disasters are a lumpy occurrence. There can be a particular year with many bad storms being followed by a year without. And then it matters whether 1970 was one or the other. Climate measures should always be taken over longer periods and not just a year.
And in fact we could also consider aggregating regionally across countries.
Now we do the same, but for the most recent years (2014-2023).
Country.info.dis.recent <- emdat %>%
filter(XXXX %in% sel.dis.types,
Start.Year > XXXX,
XXXX < XXXX) %>%
group_by(ISO,Region) XXXX
summarise(recent = XXXX)
Here you will know that you got it right if
Country.info.dis.recent
has 201 observations and 3
variables and the first few rows look like this.
head(Country.info.dis.recent)
## # A tibble: 6 × 3
## # Groups: ISO [6]
## ISO Region recent
## <chr> <chr> <int>
## 1 AFG Asia 49
## 2 AGO Africa 20
## 3 AIA Americas 1
## 4 ALB Europe 9
## 5 ARE Asia 1
## 6 ARG Americas 32
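A quick way to check the dimensions once you have filled in the blanks:
dim(Country.info.dis.recent)   # should return 201 and 3 (201 rows, 3 columns)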
Note that here we also added Region as a grouping variable. This isn't strictly necessary, but it adds Region to Country.info.dis.recent and we will make good use of that variable when we plot the data.
The first indication that the number of disasters has increased is that we are now counting events for more than 200 countries. However, do remember that recency bias is one of the biases potentially contained in the database. Another bias may be that reporting in more advanced countries is generally better. Being conscious of these points we will still proceed by merging the two data objects we just created. We will merge these variables into the country.info dataframe in which we already saved the avg.GHG.em.pc variable.
country.info <- merge(country.info, Country.info.dis.base, all.x = TRUE)
Recall that the merge function requires inputs, at the very least the two dataframes that are to be merged. In the previous use we also specified the by.x and by.y options. We don't do that here. Instead we are using the all.x = TRUE option, which was not used previously. Try and figure out why this is.
You should start by looking at the function documentation, which you can access by typing ?merge into the Console.
The answers are to be found by consulting the function's documentation. This can also be found here.
In there you can find the following information on the by.x and by.y options:
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
So when we use the by.x and by.y options we tell R explicitly which variables it should use to find matches. If they are not specified, as in this case, then R automatically matches on the variables that have identical names in both dataframes.
names(country.info)
## [1] "ISO" "Name" "avg.GHG.em.pc" "base"
names(Country.info.dis.base)
## [1] "ISO" "base"
Here that is the ISO
variable.
The all.x = TRUE option is included here, but not in the previous use. The help function offers the following information:
all.x; logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
So when we did not use this option, the resulting merged file only had rows for units of observation (here countries) that contained information in both datasets. Now, by setting all.x = TRUE, we ensure that all rows from the first dataframe (x) are kept in the resulting file, whether or not they found a match in the second dataframe.
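To make this concrete, here is a tiny made-up illustration (the dataframes df1 and df2 are invented for this example only):
df1 <- data.frame(ISO = c("AAA", "BBB", "CCC"), x = 1:3)
df2 <- data.frame(ISO = c("AAA", "CCC"), y = c(10, 30))
merge(df1, df2)                 # only AAA and CCC are kept (rows matched in both)
merge(df1, df2, all.x = TRUE)   # BBB is also kept, with NA in the y column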
And now we add the recent variable to the country.info dataframe. Use exactly the same settings as in the merge we just performed.
XXXX <- merge(XXXX)
You should still see 203 rows in country.info
, now with
6 variables. Confirm that the first few lines look like this.
head(country.info)
## ISO Name avg.GHG.em.pc base Region recent
## 1 ABW Aruba 3.3557689 NA <NA> NA
## 2 AFG Afghanistan 0.9381111 5 Asia 49
## 3 AGO Angola 3.0963688 NA Africa 20
## 4 ALB Albania 3.2126632 NA Europe 9
## 5 ARE United Arab Emirates 47.5906985 NA Asia 1
## 6 ARG Argentina 8.3201143 9 Americas 32
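Again, a quick check of the dimensions:
dim(country.info)   # should return 203 and 6 (203 rows, 6 columns)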
And now we can calculate a growth rate of the number of incidents (nat.dis.gr) and the log difference (nat.dis.dlog). The latter can be preferable for a number of reasons.
country.info <- country.info %>% mutate(nat.dis.gr = (recent-base)/base,
nat.dis.dlog = log(recent)-log(base))
You should again look at your calculations and ensure that everything looks sensible.
country.info %>% filter(Name %in% c("China", "Luxembourg", "Mexico"))
## ISO Name avg.GHG.em.pc base Region recent nat.dis.gr nat.dis.dlog
## 1 CHN China 5.415383 4 Asia 197 48.250000 3.896909
## 2 LUX Luxembourg 29.809649 NA Europe 4 NA NA
## 3 MEX Mexico 5.399160 13 Americas 54 3.153846 1.424035
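You can verify the China row by hand from the counts in the table:
(197 - 4) / 4        # growth rate: 48.25
log(197) - log(4)    # log difference: approx. 3.897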
There are a number of things we can learn from this.
In the first instance you have to be open about your doubts when you communicate any analysis based on these data. Often you can do very little to change the data and openness is the best approach. Let the user of your analysis decide whether the data quality means that your analysis should not be trusted.
As we may be somewhat dubious especially about the event count in the base period, there are a number of numerical precautions you could take. Perhaps we could move the base period to the 80s, or, when we look at increased exposure, we may rely more on the log difference (nat.dis.dlog). For the actual growth rate the dubious number, the base-period count, goes into the denominator and therefore has a massive numerical influence. The log difference is less susceptible to erratic movements in the base period.
Despite the above reservations we are now plotting the scatter plot we indicated earlier we wanted to plot.
plot.scatter <- ggplot(country.info, aes(y=nat.dis.gr, x=avg.GHG.em.pc, colour = Region))+
geom_point()
plot.scatter
There are a couple of issues with this plot: a handful of countries with very high per capita emissions stretch the horizontal axis, and the axis labels and title are not informative. Let's fix them:
plot.scatter <- plot.scatter +
xlim(0,30) +
labs(x = "p.c. GHG emissions", y = "Increase in nat. dis. exposure",
title ="GHG emissions and natural disasters")
plot.scatter
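A small aside: xlim() actually removes the observations that fall outside the chosen range before plotting (R will warn you about the removed rows). If you merely want to zoom in without discarding any points, coord_cartesian() is an alternative; a minimal sketch (the object name plot.scatter.zoom is just for illustration):
plot.scatter.zoom <- ggplot(country.info, aes(y = nat.dis.gr, x = avg.GHG.em.pc, colour = Region)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 30)) +   # zooms the view, keeps all points in the data
  labs(x = "p.c. GHG emissions", y = "Increase in nat. dis. exposure",
       title = "GHG emissions and natural disasters")
plot.scatter.zoom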
There does seem to be some negative correlation. The countries with the highest GHG emissions do not have the largest exposure to increases in natural disasters. The largest increases seem to come from countries with lowest GHG emissions. But do recall that we ought to be somewhat careful about the growth data, especially if they come from countries with potentially dubious base year data. Looking at the log difference will protect us to a certain degree.
plot.scatter2 <- ggplot(country.info, aes(y=nat.dis.dlog, x=avg.GHG.em.pc, colour = Region)) +
geom_point() +
xlim(0,30) +
labs(x = "p.c. GHG emissions", y = "Increase in nat. dis. exposure",
title ="GHG emissions and natural disasters")
plot.scatter2
Now we can see that there is no obvious relationship between the two variables. But do note that almost all countries here experience a significant increase in exposure to natural disasters that can potentially be linked to climate change, irrespective of how much they contribute to the problem.
So the above analysis does not say that there are no effects of climate change. On the contrary, we do observe a global increase in the number of natural disasters. But they seem to be hitting high and low polluters to a similar extent.
What the analysis here cannot show is that poorer countries (usually the ones with lower emissions) are most likely less well equipped to deal with the consequences of more frequent natural disasters. But that would be a new and equally interesting analysis.
Before you leave RStudio you need to save your script file. That means that tomorrow you can just come back, open the script file and continue to work.
In this workthrough we presented a graphical representation of the relationship between GHG emission behaviour and increased exposure to natural disasters at a country level. In the process of producing this graphical evidence we did a number of common tasks for any data scientist.
We also learned a number of things about working as an applied economist or data scientist.