function_name <- function(parameters) {
  # Body of the function
  return(result)
}
Functions and Packages
1. Introduction to Functions
A function is a self-contained block of code that encapsulates a specific task or a related group of tasks. Functions take some inputs, perform their task, and then return an output. R comes with numerous built-in functions, and it also allows you to create your own, known as user-defined functions.
What is a function?
A function in R is a piece of code written to carry out a specified task. It takes some input, processes it, and returns a result. Functions help reduce redundancy, make code more readable, and make debugging easier.
Types of Functions:
Built-in Functions: Pre-defined functions that are included in R.
– Examples: sum(), mean(), print()
User-Defined Functions: Functions created by the user for a specific task.
Anonymous Functions: Functions without a name, used for short tasks.
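The three kinds of function can be seen side by side in a short sketch. The names values and square are hypothetical and chosen only for illustration.

```r
# Illustrative input; any numeric vector would do
values <- c(1, 2, 3, 4)

# Built-in function: mean() ships with R
builtin_result <- mean(values)

# User-defined function: square() is a hypothetical name for this sketch
square <- function(x) {
  return(x^2)
}
user_result <- square(values)

# Anonymous function: defined inline and never given a name
anonymous_result <- sapply(values, function(x) x + 1)

print(builtin_result)
print(user_result)
print(anonymous_result)
```

The anonymous function exists only for the duration of the sapply() call, which is why it suits short, one-off tasks.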
Why use functions?
Modularity: Break down complex tasks into smaller, manageable parts.
Re-usability: Write once, use many times.
Clarity: Make your code more understandable and easier to maintain.
Anatomy of a function
A typical function in R has the following components:
• Name: Identifies the function and is used to call it.
• Parameters: Inputs that are passed into the function.
• Body: The code block that performs the task.
• Return Value: The output of the function.
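The four components can be pointed out on a small function. This is a minimal sketch: area_rectangle and its parameters are hypothetical names, while formals() and body() are base R helpers for inspecting a function.

```r
# area_rectangle is a hypothetical example function
area_rectangle <- function(height, width) {   # name and parameters
  area <- height * width                      # body
  return(area)                                # return value
}

# formals() lists the parameters; body() shows the code block
param_names <- names(formals(area_rectangle))
print(param_names)
print(body(area_rectangle))

# Calling the function by its name produces the return value
area <- area_rectangle(3, 4)
print(area)
```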
Creating a simple function
Creating a function in R involves specifying the function name, parameters, the operations to be performed, and the return value. Below, we delve deeper into creating simple functions, providing various examples and explaining each step in detail.
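The basic syntax for creating a function in R assigns a function definition to a name, in the form function_name <- function(parameters) { body; return(result) }, where: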
• function_name: The name used to call the function.
• parameters: The variables that are input to the function.
• result: The output of the function.
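Filling in the template with concrete names gives a working function. In this sketch, to_fahrenheit and to_fahrenheit_short are hypothetical names. The second version relies on the fact that R returns the value of the last evaluated expression when return() is omitted, which is the style get_mode uses in Example 2.

```r
# Explicit return, following the template
to_fahrenheit <- function(celsius) {
  result <- celsius * 9 / 5 + 32
  return(result)
}

# Implicit return: the value of the last evaluated expression is returned
to_fahrenheit_short <- function(celsius) {
  celsius * 9 / 5 + 32
}

explicit_value <- to_fahrenheit(100)
implicit_value <- to_fahrenheit_short(100)
print(explicit_value)
print(implicit_value)
```

Both calls give the same value; an explicit return() is mostly a matter of readability in short functions.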
Example 1: Adding two numbers
add_numbers <- function(a, b) {
  sum <- a + b
  return(sum)
}
#to use the function
add_numbers(2,6)
[1] 8
Example 2: Calculating mean, median and mode
#create the data
data <- c(2,3,4,5,6,7,8,9,10,11, 11, 11, 12,11)
#calculate mean
mean_data <- mean(data)
cat("mean of the data: ", mean_data, "\n")
mean of the data: 7.857143
#calculate median
median_data <- median(data)
cat("median of the data: ", median_data, "\n")
median of the data: 8.5
#calculate mode (base R has no built-in function for the statistical mode)
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode_data <- get_mode(data)
cat("mode of the data: ", mode_data, "\n")
mode of the data: 11
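Because which.max() returns only the first position of the maximum, get_mode reports a single value even when several values are tied for the highest count. A hedged variant that returns every tied value could look like this; get_all_modes and tied_data are hypothetical names introduced for the sketch.

```r
# get_all_modes is a hypothetical helper, not part of base R
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  # Keep every value whose count equals the maximum count
  uniqv[counts == max(counts)]
}

# 1 and 2 both appear twice, so both are modes
tied_data <- c(1, 1, 2, 2, 3)
all_modes <- get_all_modes(tied_data)
print(all_modes)
```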
Example 3: calculating mean, median, and mode for mtcars dataset
#function to calculate mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
#creating main function to calculate mean, median and mode for all numerical variables
stat_mtcars <- function(dataset) {
  #filter out non-numeric columns
  numeric_data <- dataset[sapply(dataset, is.numeric)]

  #Apply functions to each column
  means <- sapply(numeric_data, mean, na.rm=TRUE)
  medians <- sapply(numeric_data, median, na.rm=TRUE)
  modes <- sapply(numeric_data, get_mode)

  #combine results into a data frame
  results <- data.frame(
    mean = means,
    median = medians,
    mode = modes
  )
  return(results)
}
#use the function
stat_mtcars(mtcars)
           mean  median   mode
mpg   20.090625  19.200  21.00
cyl    6.187500   6.000   8.00
disp 230.721875 196.300 275.80
hp   146.687500 123.000 110.00
drat   3.596563   3.695   3.92
wt     3.217250   3.325   3.44
qsec  17.848750  17.710  17.02
vs     0.437500   0.000   0.00
am     0.406250   0.000   0.00
gear   3.687500   4.000   3.00
carb   2.812500   2.000   4.00
Example 4: Convert numeric data to factor variables
convert_factor <- function(dataset, column_names) {
  #convert each requested column to a factor
  for(col in column_names){
    dataset[[col]] <- as.factor(dataset[[col]])
  }
  return(dataset)
}
#test the function (this stores a modified copy of mtcars in the workspace)
mtcars <- convert_factor(mtcars, c("cyl","vs", "am", "gear", "carb"))
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
$ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
$ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
$ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
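convert_factor assumes that every name in column_names exists in the dataset. A hedged variant can check this first and report the problem clearly. In this sketch, convert_factor_checked and toy are hypothetical names, and the loop is replaced by lapply(), which is one common alternative rather than the only way to write it.

```r
# convert_factor_checked is a hypothetical variant of convert_factor
convert_factor_checked <- function(dataset, column_names) {
  # Stop early if any requested column is absent
  missing_cols <- setdiff(column_names, names(dataset))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  # lapply() converts each selected column; assigning back keeps the data frame
  dataset[column_names] <- lapply(dataset[column_names], as.factor)
  return(dataset)
}

# Small illustrative data frame
toy <- data.frame(id = 1:4, group = c(1, 2, 1, 2))
toy <- convert_factor_checked(toy, "group")
str(toy)
```

Calling convert_factor_checked(toy, "nope") stops with an error that names the missing column.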
Example 5: Calculating mean, variance, standard deviation, median, and IQR
stat_cal <- function(dataset){
  #filter out the non-numerical variables
  numeric_data <- dataset[sapply(dataset, is.numeric)]

  #Apply function to each column
  means <- sapply(numeric_data, mean, na.rm=TRUE)
  variance <- sapply(numeric_data, var, na.rm=TRUE)
  sds <- sapply(numeric_data, sd, na.rm=TRUE)
  medians <- sapply(numeric_data, median, na.rm=TRUE)
  iqrs <- sapply(numeric_data, IQR, na.rm=TRUE)

  #create a data.frame from results
  result <- data.frame(
    Means = means,
    Variances = variance,
    StandardDeviation = sds,
    Medians = medians,
    IQRs = iqrs
  )
  return(result)
}
#test the function (cyl, vs, am, gear and carb were converted to factors in Example 4, so they are excluded)
stat_cal(mtcars)
          Means    Variances StandardDeviation Medians      IQRs
mpg   20.090625 3.632410e+01         6.0269481  19.200   7.37500
disp 230.721875 1.536080e+04       123.9386938 196.300 205.17500
hp   146.687500 4.700867e+03        68.5628685 123.000  83.50000
drat   3.596563 2.858814e-01         0.5346787   3.695   0.84000
wt     3.217250 9.573790e-01         0.9784574   3.325   1.02875
qsec  17.848750 3.193166e+00         1.7869432  17.710   2.00750
Example 6: Detect Missing Values
detect_missing_values <- function(dataset) {
  #one counter per column, named after the column
  missing_counts <- numeric(ncol(dataset))
  names(missing_counts) <- colnames(dataset)
  for(i in seq_len(ncol(dataset))){
    missing_counts[i] <- sum(is.na(dataset[[i]]))
  }
  #Filter out columns with no missing values
  missing_counts <- missing_counts[missing_counts>0]
  return(missing_counts)
}
#to test this function
sample_data <- data.frame(
  A = c(1,2,NA, 4,5),
  B = c(NA, 2,3,4,5),
  C = c(1,2,3,4,5)
)
detect_missing_values(sample_data)
A B
1 1
detect_missing_values(mtcars)
named numeric(0)
#let's create missing values in a copy of the mtcars dataset
data1 <- mtcars
data1[c(1,2,3,4), c(5,4,3,2)] <- NA
detect_missing_values(data1)
 cyl disp   hp drat
   4    4    4    4
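The loop in detect_missing_values can also be written in vectorized form. The sketch below assumes the same input and output conventions; count_missing is a hypothetical name, while is.na() and colSums() are base R functions.

```r
# count_missing is a hypothetical name for this vectorized variant
count_missing <- function(dataset) {
  # is.na() on a data frame gives a logical matrix; colSums() counts TRUE per column
  missing_counts <- colSums(is.na(dataset))
  # Keep only the columns that contain missing values
  missing_counts[missing_counts > 0]
}

sample_data <- data.frame(
  A = c(1, 2, NA, 4, 5),
  B = c(NA, 2, 3, 4, 5),
  C = c(1, 2, 3, 4, 5)
)
missing_result <- count_missing(sample_data)
print(missing_result)
```

The explicit loop makes each step visible, which is useful when learning; the vectorized form is shorter and avoids managing the index by hand.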
2. Introduction to Packages
A package in R is a collection of functions, sample data, and documentation bundled together. By using packages, you can leverage the work of others to perform complex tasks with just a few lines of code.
Why Use Packages?
• Enhanced Functionality: Packages provide additional functions to perform a wide variety of tasks.
• Efficiency: Save time and effort by using pre-written and tested code.
• Community Support: Benefit from the extensive and vibrant R community.
Installing packages
You can install packages from CRAN (the Comprehensive R Archive Network), from other repositories, or from local files.
#installing the 'dplyr' package from CRAN
install.packages("dplyr")
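Running install.packages() on every execution of a script is unnecessary once the package is present. One common pattern, sketched here under the assumption that a CRAN mirror is configured, is to install only when the package is missing. install_if_missing is a hypothetical helper name; requireNamespace() is a base R function.

```r
# install_if_missing is a hypothetical helper name
install_if_missing <- function(pkg) {
  # requireNamespace() returns FALSE, rather than raising an error,
  # when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing
stats_available <- install_if_missing("stats")
print(stats_available)
```

Replacing "stats" with "dplyr" would install dplyr only on the first run.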
Loading packages
After installing a package, you need to load it into the R environment to use its functions.
#loading the 'dplyr' package
library(dplyr)
Using package functions
After loading a package, you can use its functions by calling them like any other function in R.
#using the 'filter' function from 'dplyr' to filter rows in a data frame.
dplyr::filter(mtcars, mpg > 20)
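The package::function form calls a function through its namespace, so it works even when the package has not been attached with library(). The sketch below uses the stats and utils packages, which ship with R, so that it runs without extra installation; the variable names are illustrative.

```r
# Call functions through their namespace without attaching the package
median_value <- stats::median(c(3, 1, 2))
first_rows <- utils::head(mtcars, 3)

print(median_value)
print(nrow(first_rows))
```

The prefix also removes ambiguity when two packages define the same name: attaching dplyr masks stats::filter, so writing dplyr::filter or stats::filter states which one is meant.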
To see the list of objects defined in a package's namespace (functions and other objects, including internal ones):
ls(getNamespace("dplyr"))
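When only the user-facing functions are of interest, getNamespaceExports() restricts the listing to exported objects. The sketch uses the stats package so that it runs on any R installation; swapping in "dplyr" works the same way once dplyr is installed.

```r
# All objects defined in the namespace, including internal helpers
all_objects <- ls(getNamespace("stats"))

# Only the objects the package exports for users
exported <- getNamespaceExports("stats")

print(length(all_objects))
print(length(exported))
print("median" %in% exported)
```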
To see the documentation of the package
help(package="dplyr")
Exercise
Basic Statistics:
a. Load the iris dataset. Compute the mean, median and standard deviation for the Sepal.Length and Sepal.Width columns.
b. Using the mtcars dataset, determine which car model has the highest miles per gallon (mpg).
Data manipulation:
a. From the mtcars dataset, filter only those rows where the number of cylinders (cyl) is 4.
b. Using the iris dataset, group the data by species and compute the average Sepal.Length for each group.
Custom Functions:
a. Write a function that takes a data frame and a column name as input and returns the range (min to max) of that column.
b. Develop a function that accepts a data frame and returns a list of columns that have missing values along with the count of missing values.
Data Cleaning:
a. Identify and replace any negative values in the Sepal.Length column of the iris dataset with the mean value of the column.
b. Using any dataset of your choice with missing values, impute the missing values using the median of the respective columns.