This week we will continue learning the fundamentals of R. We are going to introduce the pipe operator and the use of R libraries.
We often put one function inside another function. For example, suppose you are trying to find the mean of your exam scores.
# Your exam scores:
c(65, 68, 71, 74, 53, 58)
[1] 65 68 71 74 53 58
# Mean of the exam scores:
mean(c(65, 68, 71, 74, 53, 58))
[1] 64.83333
In the example above, you combine your exam scores using c() and then put this into mean(). In other words, the output of c() becomes the input of mean().
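If you prefer, you can avoid nesting altogether by storing the intermediate result in an object first. A minimal sketch; the object name scores is our own choice:
# store the scores first, then pass them to mean()
scores <- c(65, 68, 71, 74, 53, 58)
mean(scores)
[1] 64.83333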
Sometimes we need to repeatedly put functions inside other functions, creating long chains. In such cases, reading and writing code can become cumbersome.
For example, let’s say that only the highest five exam scores are going to be considered.
# Your exam scores:
c(65, 68, 71, 74, 53, 58)
[1] 65 68 71 74 53 58
# Your exam scores sorted:
sort(c(65, 68, 71, 74, 53, 58), decreasing = T)
[1] 74 71 68 65 58 53
# Your exam scores sorted and highest five selected
head(sort(c(65, 68, 71, 74, 53, 58), decreasing = T), n = 5)
[1] 74 71 68 65 58
# the mean of the highest five
mean(head(sort(c(65, 68, 71, 74, 53, 58), decreasing = T), n = 5))
[1] 67.2
This is a four-step chain, which is visualised below.
combine → sort highest to lowest → take the highest five exam scores → take the mean
We can use the operator |> to avoid writing one function inside another. |> is called the pipe operator.
c(65, 68, 71, 74, 53, 58) |> sort(decreasing = T) |> head(n = 5) |> mean()
[1] 67.2
This makes writing and reading the code easier.
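The pipe simply passes whatever is on its left as the first argument of the function on its right, so x |> f() is the same as f(x). Longer chains are often split over several lines, one step per line; here is the chain above restyled that way:
# the same chain, one step per line
c(65, 68, 71, 74, 53, 58) |>
  sort(decreasing = T) |>
  head(n = 5) |>
  mean()
[1] 67.2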
R comes with many functions, but you can write your own functions as well.
For example, mean() calculates the mean of a numerical vector. You could also calculate it by taking the sum of the numbers and dividing it by n, the number of observations.
# some temperature readings of a room
temps <- c(19, 21, 17, 24, 15)
# mean of temperatures using the mean function
mean(temps)
[1] 19.2
# calculating the mean by 'hand'
sum(temps)/length(temps)
[1] 19.2
We can create a function that takes a vector of numbers, sums them up and divides it by the length of the vector.
function(x) {
  sum(x) / length(x)
}
Here x is a placeholder for the input. The function has one input, x, which is summed up and divided by its length.
# keep this function as my_function
my_function <- function(x) {
  sum(x) / length(x)
}
Let’s use our tailor-made function.
my_function(temps)
[1] 19.2
As you can see, it gives exactly the same result.
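Because x is only a placeholder, my_function() works on any numerical vector, not just temps. For example, applying it to the exam scores from earlier reproduces the mean we calculated at the start:
# my_function works on any numerical vector
my_function(c(65, 68, 71, 74, 53, 58))
[1] 64.83333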
Functions can have more than one input.
# another function called my_other_function
my_other_function <- function(input_one, input_two) {
  input_one * input_two / (input_one + input_two)
}
my_other_function(10, 40)
[1] 8
# my_other_function(10, 40) is equivalent to:
(10 * 40) / (10 + 40)
[1] 8
my_other_function takes two inputs, input_one and input_two; their product is divided by their sum. Mathematically, my_other_function is equivalent to the function below:
\[ f(x,y) = \frac{x \times y}{x + y} \]
For example, let’s use 10 and 40 as the respective inputs.
# calculating by typing
(10 * 40) / (10 + 40)
[1] 8
# calculating by using my_other_function
my_other_function(10, 40)
[1] 8
Recall that the mode is the most frequent observation. R does not have a built-in function to display the mode. Instead we can create our own function.
Let’s create some toy data and store the fruits I ate last week.
# fruits I ate last week:
fruits <- c("banana", "apple", "banana", "mango", "banana", "watermelon", "apple")
Let’s see the frequencies.
# frequencies of fruits
table(fruits)
fruits
apple banana mango watermelon
2 3 1 1
It is easy to see that banana is the mode because it has the highest frequency.
But this is not enough, because I want to generalize it. To ask R to bring me the mode, I need to construct the following command:
table(fruits)[ table(fruits) == max(table(fruits)) ]
banana
3
This is hard to process. Let’s unpack it, starting from the inside.
# maximum of the table
max(table(fruits))
[1] 3
Now I know the frequency of the most common value. Next, let’s ask R whether each category has the highest frequency.
table(fruits) == max(table(fruits))
fruits
apple banana mango watermelon
FALSE TRUE FALSE FALSE
For each category, R checks if the category is the most common. Finally, I will tell R to bring only the most common category.
table(fruits)[ table(fruits) == max(table(fruits)) ]
banana
3
If you write this in plain English, we are telling R to make a frequency table of fruits, and from that table bring only the category that is most common. You can generalize this.
# home-made function for finding the mode
mode_function <- function(x){
  table(x)[ table(x) == max(table(x)) ]
}
Let’s use it.
mode_function(fruits)
banana
3
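A nice side effect of writing the function this way is that it returns every category that ties for the highest frequency. A quick sketch with made-up data:
# with a tie, all modes are returned
mode_function(c("tea", "tea", "coffee", "coffee", "water"))
coffee    tea
     2      2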
The R community is large and active. Many clever people are creating their own additions to R, just like we created mode_function(). For example, there is an addition to R called DescTools, which comes with a function that calculates the mode of the data.
These additions are called libraries. You can think of them as mini-applications within R.
Let’s install and use DescTools.
install.packages("DescTools")
You need to install a package only once, but you need to load it in each session.
library(DescTools)
# Mode of fruits (using DescTools Mode() function)
Mode(fruits)
[1] "banana"
attr(,"freq")
[1] 3
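Mode() also works on numerical vectors; a small sketch with made-up numbers:
# mode of a numerical vector
Mode(c(1, 2, 2, 3, 3, 3))
[1] 3
attr(,"freq")
[1] 3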
Recall that we discussed the approval ratings of recent US Presidents. Let’s use this data. It is available here and also on Blackboard.
Let’s remove everything from your environment and import the data.
# remove everything from the environment
rm(list = ls())
# load the data
df <- read.csv("data/approval_data.csv")
Visually inspect the data on your screen:
View(df)
This is a time-series cross-sectional dataset, meaning that there are multiple cases measured over a time period. It is cross-sectional because there are multiple US Presidents. It is time-series because there are measurements over time. The unit of observation is President-Time, making it a time-series cross-sectional dataset.
Let’s continue exploring the dataset.
# check the variable names
names(df)
[1] "president" "start_date" "end_date" "approving"
[5] "disapproving" "unsure_nodata"
# see the names of Presidents and how often each appears
table(df$president)
Barack Obama Bill Clinton Donald Trump George Bush Joe Biden W Bush
418 216 141 113 44 282
Let’s see the summary of all variables.
# summary of all variables
summary(df)
president start_date end_date approving
Length:1214 Length:1214 Length:1214 Min. :25.00
Class :character Class :character Class :character 1st Qu.:42.00
Mode :character Mode :character Mode :character Median :48.00
Mean :50.28
3rd Qu.:57.00
Max. :90.00
disapproving unsure_nodata
Min. : 6.0 Min. : 1.000
1st Qu.:36.0 1st Qu.: 4.000
Median :46.0 Median : 6.000
Mean :43.6 Mean : 6.124
3rd Qu.:52.0 3rd Qu.: 8.000
Max. :71.0 Max. :43.000
Note that it is pooled: it is shown for all Presidents in the data. For example, let’s see the five-point summary for approval rate.
# summary of the approving
summary(df$approving)
Min. 1st Qu. Median Mean 3rd Qu. Max.
25.00 42.00 48.00 50.28 57.00 90.00
This is the summary not for a specific President, but for all Presidents in the dataset. It is called pooled because data for all presidents are ‘pooled’ together for analysis regardless of who they are.
It is a good way to check the data for inconsistencies, but it is not really good for analysis because it gives the descriptive summary for all Presidents at once. I want to break it down by President.
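One quick base R way to do that is to subset the data before summarising. A sketch for a single President; the figures shown can be read off the grouped summary table later in this section:
# base R: summary of approval ratings for one President only
summary(df$approving[df$president == "Joe Biden"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.0 39.0 41.0 42.8 43.0 57.0
This works, but repeating it for all six Presidents would be tedious; the tidyverse approach below scales much better.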
Still, let’s continue exploring inconsistencies. For example, the approving, disapproving, and unsure ratings cannot be more than 100 or less than 0, and they need to add up to roughly 100.
# summary of the approving
summary(df$approving)
Min. 1st Qu. Median Mean 3rd Qu. Max.
25.00 42.00 48.00 50.28 57.00 90.00
# summary of the disapproving
summary(df$disapproving)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.0 36.0 46.0 43.6 52.0 71.0
# summary of unsure
summary(df$unsure_nodata)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 6.000 6.124 8.000 43.000
# are there any inconsistent values?
# add them up:
table(df$approving + df$disapproving + df$unsure_nodata)
99 100 101
2 1209 3
A few of the summations add up to 99 or 101, but these are most likely because of rounding. The data look good and reliable.
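If you want to look at the handful of rows that do not add up to exactly 100, you can pick them out with a logical subset; a quick sketch (output not shown):
# inspect rows where the three ratings do not sum to 100
df[df$approving + df$disapproving + df$unsure_nodata != 100, ]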
Next, we want to create a full descriptive summary for each US President. Let’s say we want to compare their means, medians, and standard deviations.
I want to have the table below:
| president | approve_n | approve_min | approve_q1 | approve_q2 | approve_q3 | approve_max | approve_mean | approve_sd |
|---|---|---|---|---|---|---|---|---|
| Barack Obama | 418 | 40 | 45 | 47.0 | 50.75 | 67 | 47.97 | 5.37 |
| Bill Clinton | 216 | 37 | 50 | 57.0 | 60.25 | 73 | 55.49 | 7.62 |
| Donald Trump | 141 | 34 | 38 | 40.0 | 42.00 | 49 | 40.38 | 3.12 |
| George Bush | 113 | 29 | 50 | 65.0 | 73.00 | 89 | 61.42 | 15.01 |
| Joe Biden | 44 | 36 | 39 | 41.0 | 43.00 | 57 | 42.80 | 5.84 |
| W Bush | 282 | 25 | 37 | 50.5 | 62.00 | 90 | 51.35 | 15.87 |
Tidyverse is the name of a set of packages developed by the people behind RStudio. It is commonly used for data wrangling, data analysis, and visualisation. It simplifies base R by providing fast, easy-to-use functions. You can check their website: https://www.tidyverse.org/
They are especially useful for grouped data. We are going to use the group_by and summarize functions to calculate descriptive statistics by US President.
Tidyverse should be installed on UEA computers in the IT Lab. If you are using your own laptop, you need to install it once.
# using your own laptop: install tidyverse once
install.packages("tidyverse")
After installation, you can call the library.
library(tidyverse)
Time to group by President. I am going to use the pipe operator to make writing easier. Let’s calculate the mean approval rating for each US President.
# simplest way to find the mean for grouped data
df |> group_by(president) |> summarize(mean(approving))
# A tibble: 6 × 2
president `mean(approving)`
<chr> <dbl>
1 Barack Obama 48.0
2 Bill Clinton 55.5
3 Donald Trump 40.4
4 George Bush 61.4
5 Joe Biden 42.8
6 W Bush 51.3
summarize() is a tidyverse function, while summary() is a base R function. The two functions are different.
This is a good start, but we can do better. mean(approving) looks awkward as a column name, so let’s rename it. We can also make the code more readable by writing each step on its own line.
# a nicer version for finding the mean for each president
df |>
  group_by(president) |>
  summarize(approval_mean = mean(approving))
# A tibble: 6 × 2
president approval_mean
<chr> <dbl>
1 Barack Obama 48.0
2 Bill Clinton 55.5
3 Donald Trump 40.4
4 George Bush 61.4
5 Joe Biden 42.8
6 W Bush 51.3
This is an excellent start! We can see that George Bush has the highest mean approval rating whereas Donald Trump has the lowest. What about the highest and lowest approval ratings ever? These correspond to the minimum and maximum, which we can also calculate.
# mean, max and min
df |>
  group_by(president) |>
  summarize(
    approval_mean = mean(approving),
    approval_min = min(approving),
    approval_max = max(approving)
  )
# A tibble: 6 × 4
president approval_mean approval_min approval_max
<chr> <dbl> <int> <int>
1 Barack Obama 48.0 40 67
2 Bill Clinton 55.5 37 73
3 Donald Trump 40.4 34 49
4 George Bush 61.4 29 89
5 Joe Biden 42.8 36 57
6 W Bush 51.3 25 90
George W Bush has both the maximum and minimum approval ratings. If you know early 21st-century US history, this is most likely because of the 9/11 attacks (see rally ’round the flag effect) on the one hand, and the failures of the Iraq War and the 2008 financial crisis (see Great Recession) on the other.
We can also see less obvious statistics, such as highest minimum (Barack Obama with 40) and lowest maximum (Donald Trump with 49).
Recall that for a full descriptive summary, we need to show the number of observations, the minimum, the quartiles (Q1, the median, and Q3), the maximum, the mean, and the standard deviation.
Now that we have a good grip of the code, we can easily add these features.
# full grouped summary
df |>
  group_by(president) |>
  summarize(
    approve_n = length(approving),
    approve_min = min(approving),
    approve_q1 = quantile(approving, probs = 0.25),
    approve_q2 = quantile(approving, probs = 0.50),
    approve_q3 = quantile(approving, probs = 0.75),
    approve_max = max(approving),
    approve_mean = mean(approving),
    approve_sd = sd(approving)
  )
# A tibble: 6 × 9
president approve_n approve_min approve_q1 approve_q2 approve_q3 approve_max
<chr> <int> <int> <dbl> <dbl> <dbl> <int>
1 Barack Oba… 418 40 45 47 50.8 67
2 Bill Clint… 216 37 50 57 60.2 73
3 Donald Tru… 141 34 38 40 42 49
4 George Bush 113 29 50 65 73 89
5 Joe Biden 44 36 39 41 43 57
6 W Bush 282 25 37 50.5 62 90
# ℹ 2 more variables: approve_mean <dbl>, approve_sd <dbl>
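As an aside, dplyr has a helper called n() that counts the rows in each group, so approve_n = n() is an idiomatic alternative to length(approving); a sketch:
# idiomatic row count per group with n()
df |>
  group_by(president) |>
  summarize(approve_n = n())
# A tibble: 6 × 2
president approve_n
<chr> <int>
1 Barack Obama 418
2 Bill Clinton 216
3 Donald Trump 141
4 George Bush 113
5 Joe Biden 44
6 W Bush 282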
You can take this summary into a different object. Let’s call this sum_app (or any other name of your choice).
sum_app <- df |>
  group_by(president) |>
  summarize(
    n = length(approving),
    min = min(approving),
    q1 = quantile(approving, probs = 0.25),
    median = quantile(approving, probs = 0.50),
    q3 = quantile(approving, probs = 0.75),
    max = max(approving),
    mean = mean(approving),
    sd = sd(approving)
  )
Now that it is in an object, we can write it into a csv file and use it in our report.
# write approval summary into a file:
write.csv(sum_app, "data/approval_summary_bypres.csv", row.names = F)
After doing this, I realized that the mean and standard deviation have too many decimal places! What do I do? Just wrap round() around the mean and sd columns.
# it is enough to have two decimal places for mean and sd
sum_app$mean <- round(sum_app$mean, 2)
sum_app$sd <- round(sum_app$sd, 2)
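Alternatively, since the tidyverse is already loaded, both columns can be rounded in one step with mutate() and across(); a sketch of this variant:
# tidyverse alternative: round both columns at once
sum_app <- sum_app |>
  mutate(across(c(mean, sd), ~ round(.x, 2)))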
# write it again
# write approval summary into a file:
write.csv(sum_app, "data/approval_summary_bypres.csv", row.names = F)