4  Functions, Libraries and Operators

This week we will continue learning the fundamentals of R. We are going to introduce the pipe operator and the use of R libraries.

4.1 Pipe operator

We often put one function inside another function. For example, suppose you are trying to find the mean of your exam scores.

# Your exam scores:
c(65,68,71,74, 53, 58)
[1] 65 68 71 74 53 58
# Mean of the exam score:
mean(c(65,68,71,74, 53, 58))
[1] 64.83333

In the example above, you combine your exam scores using c() and then put this into mean(). In other words, the output of c() becomes the input of mean().

Sometimes we need to repeatedly put functions inside of other functions to create long chains. In such cases, reading and writing code might become cumbersome.

For example, let’s say that only the highest five exam scores are going to be considered.

# Your exam scores:
c(65,68,71,74, 53, 58)
[1] 65 68 71 74 53 58
# Your exam scores sorted: 
sort(c(65,68,71,74, 53, 58), decreasing = T)
[1] 74 71 68 65 58 53
# Your exam scores sorted and highest five selected
head(sort(c(65,68,71,74, 53, 58), decreasing = T), n = 5)
[1] 74 71 68 65 58
# the mean of highest five
mean(head(sort(c(65,68,71,74, 53, 58), decreasing = T), n = 5))
[1] 67.2

This is a four-step chain, which is visualised below.

flowchart LR
  A(combine) --> B(sort highest to lowest) --> C(take the highest five exam scores) --> D(take the mean)

We can use the operator |> to avoid writing one function inside another. |> is called the pipe operator: it takes the result of the expression on its left and passes it as the first argument to the function on its right.

c(65,68,71,74, 53, 58) |> sort(decreasing = T) |> head(n = 5) |> mean()
[1] 67.2

This makes writing and reading the code easier.
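To see that the two styles are interchangeable, here is a small sketch (using the same made-up exam scores) that computes the result both ways and compares them:

```r
# nested style: read from the inside out
nested <- mean(head(sort(c(65, 68, 71, 74, 53, 58), decreasing = TRUE), n = 5))

# piped style: read from left to right
piped <- c(65, 68, 71, 74, 53, 58) |>
  sort(decreasing = TRUE) |>
  head(n = 5) |>
  mean()

# both give the same result
identical(nested, piped)
```

Note that the native pipe |> requires R 4.1 or later.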

4.2 Creating functions

R comes with many functions, but you can write your own functions as well.

For example, mean() calculates the mean of a numerical vector. You could also calculate it by taking the sum of numbers and dividing it by n.

# some temperature readings of a room
temps <- c(19, 21, 17, 24, 15)

# mean of temperatures using the mean function
mean(temps)
[1] 19.2
# calculating the mean by 'hand'
sum(temps)/length(temps)
[1] 19.2

We can create a function that takes a vector of numbers, sums them up and divides it by the length of the vector.

function(x) {
  sum(x) / length(x)
}

Here x is a placeholder for the input. The function has one input, x, which is summed up and divided by its length.

# keep this function as my_function 
my_function <- function(x) {
  sum(x) / length(x)
}

Let’s use our tailor-made function.

my_function(temps)
[1] 19.2

As you can see, it gives exactly the same result.
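As a quick sanity check (a sketch with an arbitrary made-up vector), we can compare my_function() against the built-in mean():

```r
my_function <- function(x) {
  sum(x) / length(x)
}

# compare against the built-in mean() on an arbitrary vector
check <- c(3.5, 7.25, 10, -2)
all.equal(my_function(check), mean(check))
```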

Functions can have more than one input.

# another function called my_other_function
my_other_function <- function(input_one, input_two) {
  input_one * input_two / (input_one + input_two)
}



my_other_function takes two inputs, input_one and input_two; their product is divided by their sum. Mathematically, my_other_function is equivalent to the function below:

\[ f(x,y) = \frac{x \times y}{x + y} \]

For example, let’s put 10 and 40 for respective inputs.

# calculating by typing
(10 * 40) / (10 + 40)
[1] 8
# calculating by using my_other_function
my_other_function(10, 40)
[1] 8

4.2.1 Challenge: finding the mode

Recall that the mode is the most frequent observation. R does not have a built-in function that returns the statistical mode. Instead, we can create our own function.

Let’s create a small dataset storing the fruits I ate last week.

# fruits I ate last week:
fruits <- c("banana", "apple", "banana", "mango", "banana", "watermelon", "apple")

Let’s see the frequencies.

# frequencies of fruits
table(fruits)
fruits
     apple     banana      mango watermelon 
         2          3          1          1 

It is easy to see that banana is the mode because it has the highest frequency.

But this is not enough, because I want to generalize it. To ask R to return the mode, I need to construct the following command:

table(fruits)[ table(fruits) == max(table(fruits)) ] 
banana 
     3 

This is hard to process. Let’s unpack it, starting from the inside.

# maximum of the table
max(table(fruits))
[1] 3

Now I know the frequency of the most common value. Next, let’s ask R which category has the highest frequency.

table(fruits) == max(table(fruits))
fruits
     apple     banana      mango watermelon 
     FALSE       TRUE      FALSE      FALSE 

For each category, R checks if the category is the most common. Finally, I will tell R to return only the most common category.

table(fruits)[ table(fruits) == max(table(fruits)) ] 
banana 
     3 

If you write this in plain English, we are telling R to make a frequency table of fruits, and from that table return only the most common category. You can generalize this into a function.

# home-made function for finding the mode
mode_function <- function(x) {
  table(x)[ table(x) == max(table(x)) ]
}

Let’s use it.

mode_function(fruits)
banana 
     3 
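One nice property of this function (shown here in a slightly restructured sketch that computes the table only once) is that when there is a tie, it returns every category with the highest frequency:

```r
# same logic as mode_function, computing the table once
mode_function <- function(x) {
  tab <- table(x)
  tab[tab == max(tab)]
}

# a made-up vector with a tie between "a" and "b"
tied <- c("a", "a", "b", "b", "c")
mode_function(tied)   # returns both "a" and "b", each with frequency 2
```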

4.3 Libraries

The R community is large and active. Many clever people create their own additions to R, just as we created mode_function(). For example, there is an addition called DescTools, which comes with a function that calculates the mode of the data.

These additions are called libraries. You can think of them as mini-applications within R.

Let’s install and use DescTools.

install.packages("DescTools")

You need to install a package only once, but you need to load it with library() in each session.

library(DescTools)
# Mode of fruits (using DescTools Mode() function)
Mode(fruits)
[1] "banana"
attr(,"freq")
[1] 3

4.4 US Presidents Approval Data

Recall that we discussed the approval ratings of recent US Presidents. Let’s use this data. It is available here and also on Blackboard.

Let’s remove everything from your environment and import the data.

# remove everything from the environment
rm(list = ls())

# load the data
df <- read.csv("data/approval_data.csv")

Visually see the data on your screen:

# View(df)
View(df)

This is a time-series cross-sectional dataset, meaning that there are multiple cases measured over a time period. It is cross-sectional because there are multiple US Presidents. It is time-series because there are measurements over time. The unit of observation is the President-time pair, which is what makes it time-series cross-sectional.

Let’s continue exploring the dataset.

# check the variable names
names(df)
[1] "president"     "start_date"    "end_date"      "approving"    
[5] "disapproving"  "unsure_nodata"
# see the names of the Presidents and the number of observations for each
table(df$president)

Barack Obama Bill Clinton Donald Trump  George Bush    Joe Biden       W Bush 
         418          216          141          113           44          282 

Let’s see the summary of all variables.

# summary of all variables
summary(df)
  president          start_date          end_date           approving    
 Length:1214        Length:1214        Length:1214        Min.   :25.00  
 Class :character   Class :character   Class :character   1st Qu.:42.00  
 Mode  :character   Mode  :character   Mode  :character   Median :48.00  
                                                          Mean   :50.28  
                                                          3rd Qu.:57.00  
                                                          Max.   :90.00  
  disapproving  unsure_nodata   
 Min.   : 6.0   Min.   : 1.000  
 1st Qu.:36.0   1st Qu.: 4.000  
 Median :46.0   Median : 6.000  
 Mean   :43.6   Mean   : 6.124  
 3rd Qu.:52.0   3rd Qu.: 8.000  
 Max.   :71.0   Max.   :43.000  

Note that this summary is pooled: it is computed over all Presidents in the data. For example, let’s see the five-point summary for the approval rate.

# summary of the approving
summary(df$approving)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.00   42.00   48.00   50.28   57.00   90.00 

This is the summary not for a specific President, but for all Presidents in the dataset. It is called pooled because data for all presidents are ‘pooled’ together for analysis regardless of who they are.

It is a good way to check the data for inconsistencies, but it is not ideal for analysis because it gives the descriptive summary for all Presidents at once. I want to break it down by President.

Still, let’s continue exploring inconsistencies. For example, the approval, disapproval, and unsure percentages cannot be more than 100 or less than 0, and the three should add up to roughly 100.

# summary of the approving
summary(df$approving)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.00   42.00   48.00   50.28   57.00   90.00 
# summary of the disapproving
summary(df$disapproving)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    6.0    36.0    46.0    43.6    52.0    71.0 
# summary of unsure
summary(df$unsure_nodata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   6.000   6.124   8.000  43.000 
# are there any inconsistent values? 
# add them up:
table(df$approving + df$disapproving + df$unsure_nodata)

  99  100  101 
   2 1209    3 

A few of the sums add up to 99 or 101, but this is most likely because of rounding. The data look good and reliable.
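The same check can be sketched on a toy data frame (hypothetical values, with the same three percentage columns), flagging the rows that do not sum to exactly 100:

```r
# toy data frame with the same three percentage columns (made-up values)
toy <- data.frame(
  approving     = c(50, 45, 60),
  disapproving  = c(44, 50, 30),
  unsure_nodata = c(6, 5, 11)
)

# row totals and the rows that deviate from 100
totals <- toy$approving + toy$disapproving + toy$unsure_nodata
which(totals != 100)   # row 3 sums to 101
```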

4.5 Five-point summary for each US President

Next, we want to create a five point summary for each US President. Let’s say we want to compare their means, medians, and standard deviations.

I want to have the table below:

president    approve_n approve_min approve_q1 approve_q2 approve_q3 approve_max approve_mean approve_sd
Barack Obama       418          40         45       47.0      50.75          67        47.97       5.37
Bill Clinton       216          37         50       57.0      60.25          73        55.49       7.62
Donald Trump       141          34         38       40.0      42.00          49        40.38       3.12
George Bush        113          29         50       65.0      73.00          89        61.42      15.01
Joe Biden           44          36         39       41.0      43.00          57        42.80       5.84
W Bush             282          25         37       50.5      62.00          90        51.35      15.87

4.5.1 Dealing with grouped data using tidyverse

Tidyverse is the name of a set of packages developed by the people behind RStudio. It is commonly used for data wrangling, data analysis, and visualisation. These packages simplify base R by providing fast, easy-to-use and useful functions. You can check their website: https://www.tidyverse.org/

They are especially useful for grouped data. We are going to use the group_by() and summarize() functions to calculate descriptive statistics by US President.

Tidyverse should be installed on UEA computers in the IT Lab. If you are using your own laptop, you need to install it once.

# using your own laptop: install tidyverse once
install.packages("tidyverse")

After installation, you can call the library.

library(tidyverse)

Time to group by President. I am going to use the pipe operator to make writing easier. Let’s calculate the mean of approval rating for each US President.

# simplest way to find the mean for grouped data
df |> group_by(president) |>  summarize(mean(approving))
# A tibble: 6 × 2
  president    `mean(approving)`
  <chr>                    <dbl>
1 Barack Obama              48.0
2 Bill Clinton              55.5
3 Donald Trump              40.4
4 George Bush               61.4
5 Joe Biden                 42.8
6 W Bush                    51.3
Note

summarize() is a tidyverse function. summary() is a base R function. The two functions are different.

This is a good start, but we can do better. The column name mean(approving) looks awkward; let’s rename it. We can also make our code more readable.

# a nicer version for finding the mean for each President
df |> 
  group_by(president) |>  
  summarize(approval_mean = mean(approving))
# A tibble: 6 × 2
  president    approval_mean
  <chr>                <dbl>
1 Barack Obama          48.0
2 Bill Clinton          55.5
3 Donald Trump          40.4
4 George Bush           61.4
5 Joe Biden             42.8
6 W Bush                51.3

This is an excellent start! We can see that George Bush has the highest mean approval rating whereas Donald Trump has the lowest mean approval rating. What about the highest and lowest approval ratings ever? These correspond to the minimum and maximum. We can also calculate these.

# mean, max and min
df |> 
  group_by(president) |>  
  summarize(
    approval_mean = mean(approving),
    approval_min  = min(approving),
    approval_max  = max(approving)
  )
# A tibble: 6 × 4
  president    approval_mean approval_min approval_max
  <chr>                <dbl>        <int>        <int>
1 Barack Obama          48.0           40           67
2 Bill Clinton          55.5           37           73
3 Donald Trump          40.4           34           49
4 George Bush           61.4           29           89
5 Joe Biden             42.8           36           57
6 W Bush                51.3           25           90

George W Bush (W Bush in our data) has both the maximum and minimum approval ratings. If you know early 21st-century US history, this is most likely because of the 9/11 attacks (see the rally ’round the flag effect) and the Iraq War and the 2008 financial crisis (see the Great Recession).

We can also see less obvious statistics, such as highest minimum (Barack Obama with 40) and lowest maximum (Donald Trump with 49).
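Those less obvious statistics can also be pulled out programmatically with which.max() and which.min(). Here is a sketch using named vectors copied from the grouped summary above:

```r
# minimum and maximum approval by President, taken from the summary above
mins <- c("Barack Obama" = 40, "Bill Clinton" = 37, "Donald Trump" = 34,
          "George Bush" = 29, "Joe Biden" = 36, "W Bush" = 25)
maxs <- c("Barack Obama" = 67, "Bill Clinton" = 73, "Donald Trump" = 49,
          "George Bush" = 89, "Joe Biden" = 57, "W Bush" = 90)

# highest minimum and lowest maximum
names(mins)[which.max(mins)]   # "Barack Obama"
names(maxs)[which.min(maxs)]   # "Donald Trump"
```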

Recall that for a full descriptive summary, we need to show:

  • n (the number of observations)
  • minimum
  • first quartile
  • median
  • third quartile
  • maximum
  • mean
  • standard deviation

Now that we have a good grip of the code, we can easily add these features.

# full grouped summary 
df |> 
  group_by(president) |>
  summarize(
    approve_n   = length(approving),
    approve_min = min(approving),
    approve_q1  = quantile(approving, probs = 0.25),
    approve_q2  = quantile(approving, probs = 0.50),
    approve_q3  = quantile(approving, probs = 0.75),
    approve_max = max(approving),
    approve_mean = mean(approving),
    approve_sd = sd(approving)
  )
# A tibble: 6 × 9
  president   approve_n approve_min approve_q1 approve_q2 approve_q3 approve_max
  <chr>           <int>       <int>      <dbl>      <dbl>      <dbl>       <int>
1 Barack Oba…       418          40         45       47         50.8          67
2 Bill Clint…       216          37         50       57         60.2          73
3 Donald Tru…       141          34         38       40         42            49
4 George Bush       113          29         50       65         73            89
5 Joe Biden          44          36         39       41         43            57
6 W Bush            282          25         37       50.5       62            90
# ℹ 2 more variables: approve_mean <dbl>, approve_sd <dbl>

You can store this summary in a separate object. Let’s call it sum_app (or any other name of your choice).

sum_app <- df |> 
  group_by(president) |>
  summarize(
    n   = length(approving),
    min = min(approving),
    q1  = quantile(approving, probs = 0.25),
    median  = quantile(approving, probs = 0.50),
    q3  = quantile(approving, probs = 0.75),
    max = max(approving),
    mean = mean(approving),
    sd = sd(approving)
  )

Now that it is in an object, we can write it to a csv file and use it in our report.

# write approval summary into a file:
write.csv(sum_app, "data/approval_summary_bypres.csv", row.names = F)

After doing this, I realized that the mean and standard deviation have too many decimal places! What do I do? Just wrap mean and sd in round().

# it is enough to have two decimal points for mean and sd
sum_app$mean <- round(sum_app$mean, 2)
sum_app$sd <- round(sum_app$sd, 2)

# write it again
# write approval summary into a file:
write.csv(sum_app, "data/approval_summary_bypres.csv", row.names = F)