Functions and Packages

Author

Dr. Mohammad Nasir Abdullah

1. Introduction to Functions

A function is a self-contained block of code that encapsulates a specific task or a group of related tasks. Functions take some inputs, perform their task, and then return an output. R comes with numerous built-in functions, and it also allows you to create your own, known as user-defined functions.

What is a function?

A function in R is a piece of code written to carry out a specified task. It takes some input, processes it, and returns a result. Functions reduce redundancy, make code more readable, and make debugging easier.

Types of Functions:

  1. Built-in Functions: Pre-defined functions that are included in R.

    – Example: sum(), mean(), print()

  2. User-Defined Functions: Functions created by the user for a specific task.

  3. Anonymous Functions: Functions without a name, used for short tasks.
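
As a rough illustration of the three types, the sketch below uses one of each. The name square is made up for this sketch and does not come from any package.

```r
# Built-in function: sum() ships with R
built_in_result <- sum(c(1, 2, 3))

# User-defined function: square() is a hypothetical example name
square <- function(x) {
  x^2
}
user_defined_result <- square(4)

# Anonymous function: defined inline and passed straight to sapply()
anonymous_result <- sapply(c(1, 2, 3), function(x) x * 10)

print(built_in_result)
print(user_defined_result)
print(anonymous_result)
```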

Why use functions?

Modularity: Break down complex tasks into smaller, manageable parts.

Re-usability: Write once, use many times.

Clarity: Make your code more understandable and easier to maintain.

Anatomy of a function

A typical function in R has the following components:

Name: Identifies the function and is used to call it.

Parameters: Inputs that are passed into the function.

Body: The code block that performs the task.

Return Value: The output of the function.
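
The four components can be seen in a small sketch. The function rectangle_area and its parameter names are hypothetical; formals() and body() are base R helpers that expose the parameters and the body of a function.

```r
# Name: rectangle_area (hypothetical); Parameters: width and height
rectangle_area <- function(width, height) {
  # Body: the code that performs the task
  area <- width * height
  # Return value: handed back to the caller
  return(area)
}

names(formals(rectangle_area))  # the parameter names
body(rectangle_area)            # the body as an R expression
rectangle_area(3, 4)            # calling the function by its name
```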

Creating a simple function

Creating a function in R involves specifying the function name, the parameters, the operations to be performed, and the return value. The examples below walk through each of these steps, starting from a simple function and moving to functions that work on whole datasets.

Here is the basic syntax for creating a function in R:

function_name <- function(parameters) {
  # Body of the function
  return(result)
}

function_name: The name used to call the function.

parameters: The variables that are input to the function.

result: The output of the function.
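
Two details of this syntax may be worth a quick sketch: when return() is omitted, R returns the value of the last evaluated expression, and a parameter can be given a default value. The function name power and its default exponent are illustrative assumptions, not part of the original examples.

```r
# power() is a hypothetical example; exponent defaults to 2
power <- function(base, exponent = 2) {
  # No explicit return(): the last evaluated expression is returned
  base^exponent
}

power(3)                       # uses the default exponent
power(3, exponent = 3)         # overrides the default
power(exponent = 2, base = 5)  # named arguments can appear in any order
```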

Example 1: Adding two numbers

add_numbers <- function(a, b) {
  #store the result as 'total' to avoid reusing the name of the built-in sum()
  total <- a + b
  return(total)
}

#to use the function
add_numbers(2,6)
[1] 8

Example 2: Calculating mean, median and mode

#create a data
data <- c(2,3,4,5,6,7,8,9,10,11, 11, 11, 12,11)

#calculate mean
mean_data <- mean(data)
cat("mean of the data: ", mean_data, "\n") 
mean of the data:  7.857143 

#calculate median
median_data <- median(data)
cat("median of the data: ", median_data, "\n") 
median of the data:  8.5 

#calculate mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode_data <- get_mode(data)
cat("mode of the data: ", mode_data, "\n") 
mode of the data:  11 
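
Note that get_mode() returns a single value: when several values tie for the highest count, which.max() picks the first one it meets. A possible variant that returns every tied value is sketched below; the name get_all_modes is hypothetical.

```r
# Hypothetical variant of get_mode() that keeps all values tied for the top count
get_all_modes <- function(v) {
  uniqv <- unique(v)
  # Count how many times each unique value occurs in v
  counts <- tabulate(match(v, uniqv))
  # Keep every unique value whose count equals the maximum count
  uniqv[counts == max(counts)]
}

get_all_modes(c(1, 1, 2, 2, 3))     # two values share the highest count
get_all_modes(c(2, 3, 11, 11, 11))  # a single mode
```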

Example 3: calculating mean, median, and mode for mtcars dataset

#function to calculate mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

#creating main function to calculate mean, median and mode for all numerical variables
stat_mtcars <- function(dataset) {
  #filter out non-numeric columns
  numeric_data <- dataset[sapply(dataset, is.numeric)]
  
  #Apply functions to each column
  means <- sapply(numeric_data, mean, na.rm = TRUE)
  medians <- sapply(numeric_data, median, na.rm = TRUE)
  modes <- sapply(numeric_data, get_mode)
  
  #combine results into a data frame
  results <- data.frame(
    mean = means, 
    median = medians,
    mode = modes
  )
  
  return(results)
  
}


#use the function
stat_mtcars(mtcars)
           mean  median   mode
mpg   20.090625  19.200  21.00
cyl    6.187500   6.000   8.00
disp 230.721875 196.300 275.80
hp   146.687500 123.000 110.00
drat   3.596563   3.695   3.92
wt     3.217250   3.325   3.44
qsec  17.848750  17.710  17.02
vs     0.437500   0.000   0.00
am     0.406250   0.000   0.00
gear   3.687500   4.000   3.00
carb   2.812500   2.000   4.00

Example 4: Convert numeric data to factor variables

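convert_factor <- function(dataset, column_names) {
  #stop early if any requested column is not in the dataset
  missing_cols <- setdiff(column_names, names(dataset))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  #convert each requested column to a factor
  for (col in column_names) {
    dataset[[col]] <- as.factor(dataset[[col]])
  }
  return(dataset)
}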

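#test the function
#note: this stores a modified copy named mtcars in the workspace; the later examples use this copy
mtcars <- convert_factor(mtcars, c("cyl", "vs", "am", "gear", "carb"))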

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
 $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
 $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Example 5: Calculating mean, variance, standard deviation, median, and IQR

stat_cal <- function(dataset){
  
  #filter out the non-numerical variables
  numeric_data <- dataset[sapply(dataset, is.numeric)]
  
  #Apply function to each column
  means <- sapply(numeric_data, mean, na.rm = TRUE)
  variance <- sapply(numeric_data, var, na.rm = TRUE)
  sds <- sapply(numeric_data, sd, na.rm = TRUE)
  medians <- sapply(numeric_data, median, na.rm = TRUE)
  iqrs <- sapply(numeric_data, IQR, na.rm = TRUE)
  
  #create a data.frame from results
  result <- data.frame(
    Means = means, 
    Variances = variance, 
    StandardDeviation = sds,
    Medians = medians,
    IQRs = iqrs
  )
  
  return(result)
}


#test the function (after Example 4, cyl, vs, am, gear and carb are factors, so only six numeric columns remain)
stat_cal(mtcars)
          Means    Variances StandardDeviation Medians      IQRs
mpg   20.090625 3.632410e+01         6.0269481  19.200   7.37500
disp 230.721875 1.536080e+04       123.9386938 196.300 205.17500
hp   146.687500 4.700867e+03        68.5628685 123.000  83.50000
drat   3.596563 2.858814e-01         0.5346787   3.695   0.84000
wt     3.217250 9.573790e-01         0.9784574   3.325   1.02875
qsec  17.848750 3.193166e+00         1.7869432  17.710   2.00750

Example 6: Detect Missing Values

detect_missing_values <- function(dataset) { 
  missing_counts <- numeric(ncol(dataset))
  names(missing_counts) <- colnames(dataset)
  
  for (i in seq_len(ncol(dataset))) {
    missing_counts[i] <- sum(is.na(dataset[[i]]))
  }
  
  #Filter out columns with no missing values
  missing_counts <- missing_counts[missing_counts>0]
  return(missing_counts)
  
}


#to test this function
sample_data <- data.frame(
  A = c(1,2,NA, 4,5),
  B = c(NA, 2,3,4,5),
  C = c(1,2,3,4,5)
)

detect_missing_values(sample_data)
A B 
1 1 

detect_missing_values(mtcars)
named numeric(0)

#let's create missing values in a copy of the mtcars dataset
data1 <- mtcars
data1[c(1,2,3,4), c(5,4,3,2)] <- NA

detect_missing_values(data1)
 cyl disp   hp drat 
   4    4    4    4 
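
The same counts can also be obtained without an explicit loop. The sketch below is one possible vectorised alternative, not a replacement for the function above; the name detect_missing_values_vec is hypothetical.

```r
# Hypothetical loop-free variant: is.na() on a data frame gives a logical matrix,
# and colSums() counts the TRUE values in each column
detect_missing_values_vec <- function(dataset) {
  missing_counts <- colSums(is.na(dataset))
  # Keep only the columns that have at least one missing value
  missing_counts[missing_counts > 0]
}

sample_data <- data.frame(
  A = c(1, 2, NA, 4, 5),
  B = c(NA, 2, 3, 4, 5),
  C = c(1, 2, 3, 4, 5)
)

detect_missing_values_vec(sample_data)
```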

2. Introduction to Packages

A package in R is a collection of functions, sample data, and documentation bundled together. By using packages, you can leverage the work of others to perform complex tasks with just a few lines of code.

Why Use Packages?

Enhanced Functionality: Packages provide additional functions to perform a wide variety of tasks.

Efficiency: Save time and effort by using pre-written and tested code.

Community Support: Benefit from the extensive and vibrant R community.

Installing packages

You can install packages from CRAN (the Comprehensive R Archive Network), from other repositories, or from local files.

#installing the 'dplyr' package from CRAN 
install.packages("dplyr")
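
A package only needs to be installed once per R installation. One common pattern, sketched here under the assumption that a CRAN mirror is configured, is to install only when the package is missing. The helper name ensure_package is hypothetical; requireNamespace() and install.packages() are base R functions.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Reached only when the package is missing; needs a configured CRAN mirror
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so this call never triggers an installation
ensure_package("stats")
```

Calling ensure_package("dplyr") would follow the same path, installing dplyr only on the first run.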

Loading packages

After installing a package, you need to load it into the R environment to use its functions.

#loading the 'dplyr' package 
library(dplyr)

Using package functions

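After loading a package, you can call its functions like any other function in R. You can also prefix the call with the package name and ::, as in the example below. This form works even when the package is installed but not loaded, and it makes clear which package a function comes from (dplyr's filter() shares its name with filter() in the stats package).

#using the 'filter' function from 'dplyr' to keep only the rows where mpg is greater than 20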
dplyr::filter(mtcars, mpg > 20)
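
For comparison, the same row selection can be written in base R with logical indexing. This is only a sketch to show what the filter step does; it uses the built-in mtcars dataset, and the variable name high_mpg is made up.

```r
# Base R equivalent of the filter step: keep rows where mpg is greater than 20
high_mpg <- mtcars[mtcars$mpg > 20, ]

nrow(high_mpg)   # number of cars that pass the condition
head(high_mpg)   # first few matching rows
```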

To see the list of functions available in a package

ls(getNamespace("dplyr"))
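
ls(getNamespace(...)) lists every object in the package namespace, including internal helpers that are not meant to be called directly. To see only the exported objects, getNamespaceExports() can be used instead. The sketch below uses the stats package, which ships with R, so that it runs even where dplyr is not installed.

```r
# All objects in the namespace, internal helpers included
all_objects <- ls(getNamespace("stats"))

# Only the objects the package exports for users
exported <- getNamespaceExports("stats")

"median" %in% exported   # median() is a user-facing function of stats
length(exported)         # number of exported objects
length(all_objects)      # number of objects in the namespace
```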

To see the documentation of the package

help(package="dplyr")

Exercise

  1. Basic Statistics:

    a. Load the iris dataset. Compute the mean, median and standard deviation for the Sepal.Length and Sepal.Width columns.

    b. Using the mtcars dataset, determine which car model has the highest miles per gallon (mpg).

  2. Data manipulation:

    a. From the mtcars dataset, filter only those rows where the number of cylinders (cyl) is 4.

    b. Using the iris dataset, group the data by species and compute the average Sepal.Length for each group.

  3. Custom Functions:

    a. Write a function that takes a data frame and a column name as input and returns the range (min to max) of that column.

    b. Develop a function that accepts a data frame and returns a list of the columns that have missing values, along with the count of missing values in each.

  4. Data Cleaning:

    a. Identify and replace any negative values in the Sepal.Length column of the iris dataset with the mean value of the column.

    b. Using any dataset of your choice with missing values, impute the missing values using the median of the respective columns.