Let’s start by loading a data file. We are using a data file with the wages and working hours of 753 women, as well as other variables. These data are in the file mroz.csv.
Make sure that this file is saved in a data subfolder of your working directory.
mydata <- read.csv("data/mroz.csv", na.strings = ".") # Reads mroz.csv from the data subfolder of the working directory
You may wonder what the option na.strings = "." does. It tells R that any . entry in the spreadsheet is actually a missing observation. R then labels it as such (NA).
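To see the effect of na.strings on a small scale, here is a minimal sketch. It does not use mroz.csv; instead it reads a tiny made-up CSV (the csv_text string and the toy data frame are hypothetical) directly from a character string via the text argument of read.csv.

```r
# Toy CSV text (hypothetical data): the second wage is recorded as "."
csv_text <- "id,wage\n1,3.35\n2,.\n3,4.55"

# Read the text as if it were a file, treating "." as a missing value
toy <- read.csv(text = csv_text, na.strings = ".")

toy$wage          # the "." entry now appears as NA
is.na(toy$wage)   # TRUE only for the second observation
class(toy$wage)   # the column is read as a number
```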
Let’s check which variables we have in this data file.
names(mydata)
## [1] "inlf" "hours" "kidslt6" "kidsge6" "age" "educ"
## [7] "wage" "repwage" "hushrs" "husage" "huseduc" "huswage"
## [13] "faminc" "mtr" "motheduc" "fatheduc" "unem" "city"
## [19] "exper" "nwifeinc" "lwage" "expersq"
The variable hours indicates the number of hours worked by an individual and the variable wage represents the estimated hourly wage earned (in 1975). You can find short variable descriptions from a link in the Datasets page (see above). You need to understand what data types the variables represent (check str(mydata) to confirm the R data types). Common variable types are int for integer numbers, num for any numerical variables, chr for text variables and factor for categorical variables.
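As an illustration of these four types, the sketch below builds a small made-up data frame (toy and all of its columns are hypothetical, not taken from mroz.csv) with one column of each type and then inspects it.

```r
# A small made-up data frame with one column of each common type
toy <- data.frame(
  kids   = c(1L, 0L, 2L),                         # int: whole numbers
  wage   = c(3.35, 1.39, 4.55),                   # num: any numbers
  name   = c("Ann", "Bea", "Cat"),                # chr: text
  region = factor(c("north", "south", "north")),  # factor: categories
  stringsAsFactors = FALSE  # keeps name as text in R versions before 4.0
)

str(toy)            # compact overview of all columns
sapply(toy, class)  # just the data type of each column
```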
str(mydata)
## 'data.frame': 753 obs. of 22 variables:
## $ inlf : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hours : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ kidslt6 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ kidsge6 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ age : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educ : int 12 12 12 12 14 12 16 12 12 12 ...
## $ wage : num 3.35 1.39 4.55 1.1 4.59 ...
## $ repwage : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hushrs : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ husage : int 34 30 40 53 32 57 37 53 52 43 ...
## $ huseduc : int 12 9 12 10 12 11 12 8 4 12 ...
## $ huswage : num 4.03 8.44 3.58 3.54 10 ...
## $ faminc : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ mtr : num 0.722 0.661 0.692 0.781 0.622 ...
## $ motheduc: int 12 7 12 7 12 14 14 3 7 7 ...
## $ fatheduc: int 7 7 7 7 14 7 7 3 7 7 ...
## $ unem : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : int 0 1 0 0 1 1 0 0 0 0 ...
## $ exper : int 14 5 15 6 7 33 11 35 24 21 ...
## $ nwifeinc: num 10.9 19.5 12 6.8 20.1 ...
## $ lwage : num 1.2102 0.3285 1.5141 0.0921 1.5243 ...
## $ expersq : int 196 25 225 36 49 1089 121 1225 576 441 ...
You can see that all data are coded as numerical or integer variables. Which data type a variable has is important. For instance, if you want to make any calculations with a variable (e.g. add it to something, calculate a logarithm, etc.), then this will only work if R has recognised the variable as a number (num or int). If a variable was a text (chr) variable, you would be unable to use it in calculations.
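The following sketch illustrates this with a made-up text vector (wage_chr is hypothetical). It assumes that the text entries look like numbers, in which case as.numeric() can convert them.

```r
# A toy text vector (hypothetical values); "." stands in for a missing entry
wage_chr <- c("3.35", "1.39", ".")

# Calculations on text fail; tryCatch() captures the error message
result <- tryCatch(log(wage_chr), error = function(e) conditionMessage(e))
result

# Converting to numbers makes calculations possible; "." cannot be
# converted and becomes NA (R warns about this, silenced here)
wage_num <- suppressWarnings(as.numeric(wage_chr))
log(wage_num)
```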
When you import data, R automatically recognises the data types. We will see below how you can change data types. Go back to where you imported the data (the read.csv command line). Re-run this command without the option that told R what missing data are (na.strings = "."). Check what data type the wage variable has. Can you calculate with that variable?
One of the things you would usually like to do at the beginning of your work is to get some summary statistics for all variables in the dataset.
summary(mydata)
## inlf hours kidslt6 kidsge6
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median : 288.0 Median :0.0000 Median :1.000
## Mean :0.5684 Mean : 740.6 Mean :0.2377 Mean :1.353
## 3rd Qu.:1.0000 3rd Qu.:1516.0 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :1.0000 Max. :4950.0 Max. :3.0000 Max. :8.000
##
## age educ wage repwage
## Min. :30.00 Min. : 5.00 Min. : 0.1282 Min. :0.00
## 1st Qu.:36.00 1st Qu.:12.00 1st Qu.: 2.2626 1st Qu.:0.00
## Median :43.00 Median :12.00 Median : 3.4819 Median :0.00
## Mean :42.54 Mean :12.29 Mean : 4.1777 Mean :1.85
## 3rd Qu.:49.00 3rd Qu.:13.00 3rd Qu.: 4.9708 3rd Qu.:3.58
## Max. :60.00 Max. :17.00 Max. :25.0000 Max. :9.98
## NA's :325
## hushrs husage huseduc huswage
## Min. : 175 Min. :30.00 Min. : 3.00 Min. : 0.4121
## 1st Qu.:1928 1st Qu.:38.00 1st Qu.:11.00 1st Qu.: 4.7883
## Median :2164 Median :46.00 Median :12.00 Median : 6.9758
## Mean :2267 Mean :45.12 Mean :12.49 Mean : 7.4822
## 3rd Qu.:2553 3rd Qu.:52.00 3rd Qu.:15.00 3rd Qu.: 9.1667
## Max. :5010 Max. :60.00 Max. :17.00 Max. :40.5090
##
## faminc mtr motheduc fatheduc
## Min. : 1500 Min. :0.4415 Min. : 0.000 Min. : 0.000
## 1st Qu.:15428 1st Qu.:0.6215 1st Qu.: 7.000 1st Qu.: 7.000
## Median :20880 Median :0.6915 Median :10.000 Median : 7.000
## Mean :23081 Mean :0.6789 Mean : 9.251 Mean : 8.809
## 3rd Qu.:28200 3rd Qu.:0.7215 3rd Qu.:12.000 3rd Qu.:12.000
## Max. :96000 Max. :0.9415 Max. :17.000 Max. :17.000
##
## unem city exper nwifeinc
## Min. : 3.000 Min. :0.0000 Min. : 0.00 Min. :-0.02906
## 1st Qu.: 7.500 1st Qu.:0.0000 1st Qu.: 4.00 1st Qu.:13.02504
## Median : 7.500 Median :1.0000 Median : 9.00 Median :17.70000
## Mean : 8.624 Mean :0.6428 Mean :10.63 Mean :20.12896
## 3rd Qu.:11.000 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:24.46600
## Max. :14.000 Max. :1.0000 Max. :45.00 Max. :96.00000
##
## lwage expersq
## Min. :-2.0542 Min. : 0
## 1st Qu.: 0.8165 1st Qu.: 16
## Median : 1.2476 Median : 81
## Mean : 1.1902 Mean : 178
## 3rd Qu.: 1.6036 3rd Qu.: 225
## Max. : 3.2189 Max. :2025
## NA's :325
Look at the summary statistics. How many missing values are there in the wage variable?
When you have missing observations in a variable, this can complicate some operations. Let’s demonstrate this by calculating the mean wage value. You can do this in R by using the mean function.
mean(mydata$wage)
## [1] NA
As you can see, you get the value NA. The reason for this is that R has detected missing values in the wage variable. You may ask why R does not just ignore missing observations. There is a good reason for this: if there are missing values, you as the data analyst should think carefully about why they are missing.
We could create a temporary subset of the data with only those observations which have missing data in the wage variable:
temp <- mydata[is.na(mydata$wage),]
What do all the observations in temp have in common? Look at the hours variable.
If you decided that it was ok to ignore the missing observations in your calculation, then you could tell R to do so by adding the na.rm = TRUE option to the mean function.
mean(mydata$wage,na.rm = TRUE)
## [1] 4.177682
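The same behaviour can be reproduced on a small scale. The sketch below uses a made-up vector (wage here is a hypothetical toy vector, not the column from mydata) so that you can check each result by hand.

```r
# Toy wage vector (made-up values) with two missing observations
wage <- c(3.35, NA, 4.55, 1.10, NA)

sum(is.na(wage))          # counts the missing values
mean(wage)                # NA, because missing values are present
mean(wage, na.rm = TRUE)  # mean of the three observed values only
```

Here is.na() returns TRUE for each missing entry, and summing these counts them because R treats TRUE as 1.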
Try a few other summary statistics like the standard deviation (sd()), the variance (var()), the median (median()), the inter-quartile range (IQR()), the minimum (min()) or the maximum (max()). Also try to figure out how to calculate a percentile (say the 15th percentile) in R. You may have to search the internet (“How to calculate a percentile in R”) to find out. Searching the internet is one of the most important programming techniques!
A similar issue arises when you are calculating a summary statistic which involves more than one variable. Say you wanted to calculate correlations between the educ, wage and huswage variables. Look at the difference in results for the following three commands.
cor(mydata[c("educ","wage","huswage")])
## educ wage huswage
## educ 1.0000000 NA 0.2849361
## wage NA 1 NA
## huswage 0.2849361 NA 1.0000000
cor(mydata[c("educ","wage","huswage")],use = "complete")
## educ wage huswage
## educ 1.0000000 0.3419544 0.3030052
## wage 0.3419544 1.0000000 0.2158857
## huswage 0.3030052 0.2158857 1.0000000
cor(mydata[c("educ","wage","huswage")],use = "pairwise")
## educ wage huswage
## educ 1.0000000 0.3419544 0.2849361
## wage 0.3419544 1.0000000 0.2158857
## huswage 0.2849361 0.2158857 1.0000000
Use the information in the help function (type ?cor in the Console) to figure out why the results are different.
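If you want to experiment before or after reading the help page, a small made-up example can help. The sketch below uses a hypothetical data frame toy in which only one variable (w) has missing values, and it spells out the two options in full ("complete.obs" and "pairwise.complete.obs"; R accepts abbreviations of these). It then recomputes one correlation by hand for each option.

```r
# Toy data (made-up values): only w has missing observations
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  w = c(NA, 1, NA, 2, 3, 5)
)

cor_complete <- cor(toy, use = "complete.obs")
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Rows without any missing value in any of the three variables
ok <- complete.cases(toy)

# "complete.obs": a and b are correlated using only the complete rows
cor_complete["a", "b"]
cor(toy$a[ok], toy$b[ok])

# "pairwise.complete.obs": a and b are correlated using all rows,
# because neither of these two variables has a missing value
cor_pairwise["a", "b"]
cor(toy$a, toy$b)
```

Compare the pairs of numbers and relate them to which entries change between the second and third correlation tables above.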
You already learned above how to get summary statistics for all variables in the dataset. Especially when you have many variables, the output of summary(mydata) will be quite cumbersome. You will therefore often want to calculate certain summary statistics for some variables only.
summary(mydata[c("hours","husage")])
## hours husage
## Min. : 0.0 Min. :30.00
## 1st Qu.: 0.0 1st Qu.:38.00
## Median : 288.0 Median :46.00
## Mean : 740.6 Mean :45.12
## 3rd Qu.:1516.0 3rd Qu.:52.00
## Max. :4950.0 Max. :60.00
cor(mydata[c("hours","husage")])
## hours husage
## hours 1.00000000 -0.03108875
## husage -0.03108875 1.00000000
mydata.sub0 <- subset(mydata,select=c("hours","husage"))
summary(mydata.sub0)
## hours husage
## Min. : 0.0 Min. :30.00
## 1st Qu.: 0.0 1st Qu.:38.00
## Median : 288.0 Median :46.00
## Mean : 740.6 Mean :45.12
## 3rd Qu.:1516.0 3rd Qu.:52.00
## Max. :4950.0 Max. :60.00
cor(mydata.sub0)
## hours husage
## hours 1.00000000 -0.03108875
## husage -0.03108875 1.00000000
It is important to understand that there is not one correct way to do things, but rather there are many correct ways of doing things.
An issue you will often encounter is that you want to do some analysis, but not on all the observations you have in your dataset. Let’s say you wish to calculate the correlation between the hours a woman works (hours) and the hours her husband works (hushrs). But you want to do that only for those women who actually work more than 100 hours a year.
There are multiple ways of doing this (and also check out the Data Summary using Tidyverse page for an elegant alternative way of doing so).
mydata.sub1 <- subset(mydata, hours > 100, select = c("hours","hushrs"))
cor(mydata.sub1)
## hours hushrs
## hours 1.00000000 -0.02351165
## hushrs -0.02351165 1.00000000
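As one illustration of another way, the sketch below compares subset() with square-bracket indexing on a small made-up data frame (toy and its values are hypothetical, not taken from mroz.csv).

```r
# Toy data (made-up values)
toy <- data.frame(
  hours  = c(0, 1610, 50, 1980, 456),
  hushrs = c(2708, 2310, 3072, 1920, 2000),
  age    = c(32, 30, 35, 34, 31)
)

# Way 1: subset() with a row condition and a column selection
sub_a <- subset(toy, hours > 100, select = c("hours", "hushrs"))

# Way 2: square brackets, rows before the comma and columns after it
sub_b <- toy[toy$hours > 100, c("hours", "hushrs")]

sub_a
all.equal(sub_a, sub_b)  # both ways select the same rows and columns
```

The two ways agree here because the condition contains no missing values. If it did, subset() would drop those rows, whereas square-bracket indexing would return rows filled with NA for them.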
Above we learned how to calculate summary statistics for numeric variables. If you have categorical variables, like years of education (educ) or number of kids younger than 6 years old (kidslt6), you actually want to see the number of observations in the categories. You can do this with the table function.
table(mydata$educ)
##
## 5 6 7 8 9 10 11 12 13 14 15 16 17
## 4 6 8 30 25 44 43 381 44 51 14 57 46
As you can see, the largest group is the women with 12 years of education, which is equivalent to high school education.
If you wanted to see this information in terms of proportions, you could do this as follows:
prop.table(table(mydata$educ))
##
## 5 6 7 8 9 10
## 0.005312085 0.007968127 0.010624170 0.039840637 0.033200531 0.058432935
## 11 12 13 14 15 16
## 0.057104914 0.505976096 0.058432935 0.067729084 0.018592297 0.075697211
## 17
## 0.061088977
Now you can see that just over 50% have a high school education as their highest qualification.
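The output of prop.table contains proportions between 0 and 1. If you prefer actual percentages, one possible approach is to multiply by 100 and round. The sketch below does this for a made-up vector (educ here is a hypothetical toy vector, not the column from mydata).

```r
# Toy vector of years of education (made-up values)
educ <- c(12, 12, 12, 16, 10, 12, 16, 10)

counts <- table(educ)           # number of observations per category
props  <- prop.table(counts)    # proportions, which sum to 1
pct    <- round(100 * props, 1) # percentages with one decimal place

pct
```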
Create the same type of table to figure out what percentage of women in the sample have no children younger than 6 years. You should find that it is just over 80%.
##
## 0 1 2 3
## 0.804780876 0.156706507 0.034528552 0.003984064
There are a lot of digits in this output. You can tell R to show fewer digits. Run the following line and re-run the table with proportions to see the difference.
options(digits = 2)
Often you may want to create such tables for two rather than only one variable, say both for educ and for kidslt6. We call tables like this contingency tables or sometimes cross tabs.
table(mydata[c("educ","kidslt6")]) # counts
## kidslt6
## educ 0 1 2 3
## 5 4 0 0 0
## 6 6 0 0 0
## 7 5 2 1 0
## 8 28 2 0 0
## 9 20 5 0 0
## 10 37 5 2 0
## 11 36 6 1 0
## 12 308 65 8 0
## 13 36 4 2 2
## 14 40 8 3 0
## 15 9 2 2 1
## 16 43 12 2 0
## 17 34 7 5 0
Now, presenting this in terms of percentages is even more important.
prop.table(table(mydata[c("educ","kidslt6")]) ) # percentages
## kidslt6
## educ 0 1 2 3
## 5 0.0053 0.0000 0.0000 0.0000
## 6 0.0080 0.0000 0.0000 0.0000
## 7 0.0066 0.0027 0.0013 0.0000
## 8 0.0372 0.0027 0.0000 0.0000
## 9 0.0266 0.0066 0.0000 0.0000
## 10 0.0491 0.0066 0.0027 0.0000
## 11 0.0478 0.0080 0.0013 0.0000
## 12 0.4090 0.0863 0.0106 0.0000
## 13 0.0478 0.0053 0.0027 0.0027
## 14 0.0531 0.0106 0.0040 0.0000
## 15 0.0120 0.0027 0.0027 0.0013
## 16 0.0571 0.0159 0.0027 0.0000
## 17 0.0452 0.0093 0.0066 0.0000
You can see that almost 41% of women in the sample had 12 years of education and no children younger than 6 (educ == 12 and kidslt6 == 0).
When you show percentages/proportions with multiple variables, you will often want to see conditional probabilities. Say, given that a woman has high school education (educ == 12), what is the probability of that woman having no children younger than 6 (kidslt6 == 0)?
The following line of code does give you that result, but what should you replace the “XXXX” with? With “1” or with “2”? Use the help function to understand (?prop.table).
prop.table(table(mydata[c("educ","kidslt6")]), margin = XXXX ) # conditional percentages
When you get it right you should get the following table:
## kidslt6
## educ 0 1 2 3
## 5 1.000 0.000 0.000 0.000
## 6 1.000 0.000 0.000 0.000
## 7 0.625 0.250 0.125 0.000
## 8 0.933 0.067 0.000 0.000
## 9 0.800 0.200 0.000 0.000
## 10 0.841 0.114 0.045 0.000
## 11 0.837 0.140 0.023 0.000
## 12 0.808 0.171 0.021 0.000
## 13 0.818 0.091 0.045 0.045
## 14 0.784 0.157 0.059 0.000
## 15 0.643 0.143 0.143 0.071
## 16 0.754 0.211 0.035 0.000
## 17 0.739 0.152 0.109 0.000
The probability that a woman with 12 years of education has no children younger than 6 years old is 80.8%.
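To see what such a conditional proportion means, you can also compute it by hand: divide each cell of the table by the total of its row. The sketch below does this for made-up data (the toy vectors educ and kidslt6 are hypothetical and only borrow the variable names), so each number is easy to verify.

```r
# Toy data (made-up values): four women with 12 and two with 16 years of education
educ    <- c(12, 12, 12, 12, 16, 16)
kidslt6 <- c(0, 0, 0, 1, 0, 1)

tab <- table(educ, kidslt6)  # counts: rows are educ, columns are kidslt6

# Divide every cell by its row total. R recycles the vector of row
# totals down the columns, so each cell is matched with its own row total.
cond <- tab / rowSums(tab)

cond
rowSums(cond)  # every row now sums to 1
```

In this toy table, three of the four women with 12 years of education have no young children, so the conditional proportion in that cell is 0.75.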
Here you have learned how to calculate summary statistics for particular variables in a dataset, how to restrict your calculation to particular rows in the dataset and, importantly, how to produce contingency tables.