Apply Family
The apply family of functions in R is a cornerstone of efficient data manipulation and analysis. These functions allow users to perform repetitive tasks over arrays, lists, and data frames in a more concise and readable manner than traditional loop structures. In this chapter, we will delve into the practical applications of these functions using the Penguins and Flights datasets, with a few additional mapply examples based on mtcars.
Understanding the Apply Family
The apply family comprises several functions, each tailored for specific types of data and operations:
apply: Used for arrays and matrices, apply is ideal for applying a function to the rows or columns of a matrix.
lapply and sapply: These functions are designed for lists and vectors. lapply returns a list, while sapply simplifies the output to a vector, matrix, or array when possible.
vapply: Similar to sapply, but with a pre-specified type of return value, making it safer and more predictable.
tapply: Ideal for group-wise operations, tapply applies a function to subsets of a vector defined by one or more grouping factors.
mapply: A multivariate version of sapply, mapply applies a function to multiple arguments in parallel.
Understanding these functions is crucial for data scientists and statisticians working with R, as they significantly enhance the efficiency and readability of data analysis code.
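To make the differences in return type concrete, here is a minimal sketch on a small hypothetical list (the object x and its values are made up for illustration and are not part of the chapter's datasets):

```r
x <- list(a = 1:3, b = 4:6)

res_list <- lapply(x, sum)               # always a list
res_vec  <- sapply(x, sum)               # simplified to a named vector
res_safe <- vapply(x, sum, integer(1))   # like sapply, but the result type is declared

str(res_list)
res_vec
res_safe
```

All three calls compute the same sums; they differ only in the structure that is returned and in how strictly that structure is checked.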
Using the apply function
The apply function in R is a fundamental tool in data manipulation and analysis. It is used to apply a function to the rows or columns of a matrix or, more generally, to a higher-dimensional array. The objective of apply is to make code that operates over arrays more concise. It is particularly useful for performing summary statistics, transformations, and custom operations without the need for explicit loops.
Objective
The primary objective of apply is to simplify repetitive operations across rows or columns of data structures like matrices and arrays. By using apply, one can write cleaner R code that is easier to read and maintain.
General use of the function
apply(X, MARGIN, FUN, ..., simplify = TRUE)
Arguments
X: an array, including a matrix.
MARGIN: a vector giving the subscripts which the function will be applied over. E.g., for a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
FUN: the function to be applied: see ‘Details’ in ?apply. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.
...: optional arguments to FUN.
simplify: a logical indicating whether results should be simplified if possible.
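The effect of MARGIN is easiest to see on a tiny matrix. The following is a minimal sketch; the matrix m is a made-up example, not one of the chapter's datasets:

```r
# A small 2 x 3 matrix (filled column by column)
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)   # MARGIN = 2: one result per column

# MARGIN = c(1, 2): FUN is applied to every individual cell
doubled <- apply(m, c(1, 2), function(v) v * 2)

row_sums
col_sums
doubled
```

With MARGIN = c(1, 2) the result keeps the shape of the input, which is why doubled is again a 2 x 3 matrix.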
Example 1:
Use apply to calculate measures such as the mean and standard deviation for the numeric variables in the dataset.
library(palmerpenguins) # to use the penguins dataset
penguins <- na.omit(penguins) # remove rows with missing data
# Mean values for selected columns
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, mean)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
43.99279 17.16486 200.96697 4207.05706
# Standard deviation for selected columns
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, sd)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
5.468668 1.969235 14.015765 805.215802
Example 2:
apply can also be used for computing medians and interquartile ranges.
# Median values
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, median)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
44.5 17.3 197.0 4050.0
# Interquartile ranges
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, IQR)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
9.1 3.1 23.0 1225.0
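The examples above pass only numeric columns to apply. This matters because apply first converts a data frame to a matrix, and a matrix can hold only one type. The sketch below illustrates the effect on a hypothetical toy data frame (df and its columns are made up for illustration):

```r
df <- data.frame(score = c(1.5, 2.5, 3.5),
                 group = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# apply() coerces the whole data frame to a character matrix,
# so even the numeric column is seen as character
classes_seen <- apply(df, 2, class)
classes_seen

# Selecting the numeric columns first avoids the problem
num_cols  <- vapply(df, is.numeric, logical(1))
col_means <- apply(df[, num_cols, drop = FALSE], 2, mean)
col_means
```

Selecting columns by name, as done with the penguins data, achieves the same thing when the numeric columns are known in advance.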
Example 3:
Assume we have a dataset where respondents have answered a series of questions, each rated on a numerical scale (like 1 to 5). Our goal is to calculate the total score for each respondent.
# Simulating a questionnaire dataset
set.seed(123) # For reproducible results
questionnaire <- as.data.frame(matrix(sample(1:5, 100, replace = TRUE), nrow = 20))
colnames(questionnaire) <- paste0("Q", 1:5)
# Calculating row sums
total_scores <- apply(questionnaire, 1, sum)
# Adding total scores to the dataset
questionnaire$total_score <- total_scores
# Display first 5 observations
library(dplyr)
slice_head(questionnaire,n=5)
Q1 Q2 Q3 Q4 Q5 total_score
1 3 2 5 5 4 19
2 3 1 5 5 2 16
3 2 3 4 3 2 14
4 2 4 5 1 4 16
5 3 1 2 4 4 14
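For the specific case of row totals, base R also offers rowSums. The following sketch rebuilds the simulated questionnaire with the same seed and checks that both approaches agree; the names totals_apply and totals_rowsums are introduced here only for illustration:

```r
set.seed(123)  # same seed as in the example above
questionnaire <- as.data.frame(matrix(sample(1:5, 100, replace = TRUE), nrow = 20))
colnames(questionnaire) <- paste0("Q", 1:5)

totals_apply   <- apply(questionnaire, 1, sum)
totals_rowsums <- rowSums(questionnaire)

# Both approaches give the same totals (names are ignored in the comparison)
all.equal(unname(totals_apply), unname(totals_rowsums))
```

apply remains the more general tool, since it accepts any function, while rowSums and colSums cover only sums.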
Example 4:
Calculate the average of all numerical variables in the flights dataset.
library(nycflights13)
# Select all numerical variables in the flights dataset
num_flights <- flights %>%
  select(where(is.numeric))
# Calculate the average of all numerical variables in the dataset
apply(num_flights, 2, function(x) mean(x, na.rm = TRUE))
year month day dep_time sched_dep_time
2013.000000 6.548510 15.710787 1349.109947 1344.254840
dep_delay arr_time sched_arr_time arr_delay flight
12.639070 1502.054999 1536.380220 6.895377 1971.923620
air_time distance hour minute
150.686460 1039.912604 13.180247 26.230100
Using the sapply function
sapply is part of the apply family in R, and it’s a user-friendly version of lapply because it tries to simplify the output. If the function applied to each element of the list or vector returns a single number, sapply will return a vector. If it returns a vector of the same length (greater than one) for each element, sapply will return a matrix.
The primary objective of sapply (Simplified Apply) in R is to apply a function over a list, vector, or data frame column-wise and simplify the output to the most appropriate data structure. Unlike lapply, which always returns a list, sapply attempts to simplify the result to a vector or matrix, making the output more concise and often easier to work with. This simplification is particularly useful when the applied function returns a single value per element (such as a sum or mean), as it avoids the overhead of dealing with list outputs.
sapply is often used when you need to:
Perform an operation on each element of a list or vector and expect a simplified result.
Generate summary statistics or perform transformations on dataset columns.
Apply user-defined or built-in functions across elements and desire a compact and readable output.
General Formula of sapply
The general syntax of sapply in R is as follows:
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
X: This is the object (list, vector, or data frame) over which the function is to be applied.
FUN: The function that needs to be applied to each element of X. This can be a standard function or a user-defined function.
...: Additional arguments to FUN.
simplify: If TRUE, sapply will try to simplify the output to a vector or matrix. If FALSE, the output will be the same as that of lapply (a list).
USE.NAMES: If TRUE and if X is character, the output will have named elements corresponding to X.
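The simplification rules described above can be sketched on two small hypothetical lists (same_len and diff_len are made-up objects used only for illustration):

```r
same_len <- list(a = 1:3, b = 4:6)
diff_len <- list(a = 1:3, b = 1:2)

as_vector <- sapply(same_len, sum)     # one value per element -> named vector
as_matrix <- sapply(same_len, range)   # two values per element -> 2 x 2 matrix
as_list   <- sapply(diff_len, rev)     # lengths differ -> falls back to a list
kept_list <- sapply(same_len, sum, simplify = FALSE)  # same shape as lapply()

as_vector
as_matrix
str(as_list)
```

Because the returned structure depends on what FUN happens to return, it is worth checking the result when the input data can vary.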
Example 5
Calculate the average departure delay by carrier
# by using dplyr
flights %>%
  group_by(carrier) %>%
  summarize(mean_dep = mean(dep_delay, na.rm = TRUE))
# A tibble: 16 × 2
carrier mean_dep
<chr> <dbl>
1 9E 16.7
2 AA 8.59
3 AS 5.80
4 B6 13.0
5 DL 9.26
6 EV 20.0
7 F9 20.2
8 FL 18.7
9 HA 4.90
10 MQ 10.6
11 OO 12.6
12 UA 12.1
13 US 3.78
14 VX 12.9
15 WN 17.7
16 YV 19.0
# by using sapply
sapply(split(flights$dep_delay, flights$carrier),
       function(x) mean(x, na.rm = TRUE))
9E AA AS B6 DL EV F9 FL
16.725769 8.586016 5.804775 13.022522 9.264505 19.955390 20.215543 18.726075
HA MQ OO UA US VX WN YV
4.900585 10.552041 12.586207 12.106073 3.782418 12.869421 17.711744 18.996330
Example 6
Count the total number of flights for each month
# by using dplyr
flights %>%
  group_by(month) %>%
  summarize(count1 = length(month))
# A tibble: 12 × 2
month count1
<int> <int>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
#by using sapply
sapply(split(flights$month, flights$month), length)
1 2 3 4 5 6 7 8 9 10 11 12
27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268 28135
Example 7
Find the maximum arrival delay for each destination
# using dplyr
flights %>%
  group_by(dest) %>%
  summarize(max1 = max(arr_delay, na.rm = TRUE))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `max1 = max(arr_delay, na.rm = T)`.
ℹ In group 52: `dest = "LGA"`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf
# A tibble: 105 × 2
dest max1
<chr> <dbl>
1 ABQ 153
2 ACK 221
3 ALB 328
4 ANC 39
5 ATL 895
6 AUS 349
7 AVL 228
8 BDL 266
9 BGR 238
10 BHM 291
# ℹ 95 more rows
#using sapply
# Find the maximum arrival delay for each destination
sapply(split(flights$arr_delay, flights$dest),
function(x) max(x, na.rm = TRUE))
Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
ABQ ACK ALB ANC ATL AUS AVL BDL BGR BHM BNA BOS BQN BTV BUF BUR
153 221 328 39 895 349 228 266 238 291 364 422 208 396 396 247
BWI BZN CAE CAK CHO CHS CLE CLT CMH CRW CVG DAY DCA DEN DFW DSM
851 154 224 433 228 331 469 744 1127 189 989 292 384 834 598 322
DTW EGE EYW FLL GRR GSO GSP HDN HNL HOU IAD IAH ILM IND JAC JAX
674 266 45 405 340 444 312 43 1272 376 577 783 143 350 175 336
LAS LAX LEX LGA LGB MCI MCO MDW MEM MHT MIA MKE MSN MSP MSY MTJ
852 784 -22 -Inf 302 456 744 422 332 335 878 441 364 915 780 101
MVY MYR OAK OKC OMA ORD ORF PBI PDX PHL PHX PIT PSE PSP PVD PWM
168 207 318 262 364 1109 451 449 850 325 393 372 182 17 285 310
RDU RIC ROC RSW SAN SAT SAV SBN SDF SEA SFO SJC SJU SLC SMF SNA
430 463 387 272 769 681 443 53 357 444 1007 226 371 847 271 189
SRQ STL STT SYR TPA TUL TVC TYS XNA
474 802 330 391 931 262 220 281 330
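The warning and the -Inf for LGA arise because every arrival delay for that destination is missing, so max has nothing left after na.rm = TRUE. One possible way to handle this, sketched here on hypothetical toy data (delay, dest, and the helper safe_max are made up for illustration), is to return NA for such groups:

```r
# Hypothetical toy data: group "c" has only missing values
delay <- c(10, 25, NA, 5, NA, NA)
dest  <- c("a", "a", "b", "b", "c", "c")

# Return NA instead of -Inf (and no warning) when a group has no observed values
safe_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

max_delay <- sapply(split(delay, dest), safe_max)
max_delay
```

Whether NA or -Inf is the better placeholder depends on how the result is used downstream; NA tends to be easier to spot and filter.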
Example 8
Calculate the proportion of delayed flights for each origin airport
# Function to calculate proportion of delayed flights
prop_delayed <- function(x) sum(x > 0, na.rm = TRUE) / length(x)
# Calculate the proportion of delayed flights for each origin airport
sapply(split(flights$dep_delay, flights$origin), prop_delayed)
EWR JFK LGA
0.4362229 0.3777083 0.3218933
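Note that prop_delayed divides by length(x), which counts flights with a missing departure delay in the denominator. The sketch below, using a hypothetical vector of delays, contrasts this with a proportion taken over observed values only:

```r
dep_delay <- c(5, -2, NA, 10)   # hypothetical delays, one of them missing

# Denominator includes the missing value: 2 delayed out of 4 rows
prop_all_rows <- sum(dep_delay > 0, na.rm = TRUE) / length(dep_delay)

# Denominator excludes the missing value: 2 delayed out of 3 observed
prop_observed <- mean(dep_delay > 0, na.rm = TRUE)

prop_all_rows
prop_observed
```

Both definitions are defensible; the point is to choose one deliberately, since the two can differ noticeably when many values are missing.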
Example 9
Compute the standard deviation of air time for flights to each destination
# Calculate the standard deviation of air time for each destination
sapply(split(flights$air_time, flights$dest),
function(x) sd(x, na.rm = TRUE))
ABQ ACK ALB ANC ATL AUS AVL BDL
19.291371 8.127495 3.084754 14.672009 9.812387 18.224186 7.379748 3.285493
BGR BHM BNA BOS BQN BTV BUF BUR
3.329300 10.497089 10.960876 4.948552 9.163295 3.957284 5.231471 18.643140
BWI BZN CAE CAK CHO CHS CLE CLT
4.350574 12.667898 8.299650 4.780790 5.159485 8.353413 6.932541 8.897082
CMH CRW CVG DAY DCA DEN DFW DSM
6.596712 6.319128 8.523668 7.379901 6.460277 15.462580 17.233628 13.113565
DTW EGE EYW FLL GRR GSO GSP HDN
7.853928 17.374784 11.518846 12.274766 8.601846 6.195388 8.130337 11.765334
HNL HOU IAD IAH ILM IND JAC JAX
21.682888 18.139940 5.630869 16.731984 7.220584 9.404597 11.178423 10.200820
LAS LAX LEX LGA LGB MCI MCO MDW
17.243012 18.291075 NA NA 18.219866 14.370087 10.666273 9.406482
MEM MHT MIA MKE MSN MSP MSY MTJ
12.861058 3.074606 11.163614 9.201886 10.510813 11.751595 14.898290 13.770689
MVY MYR OAK OKC OMA ORD ORF PBI
2.823029 8.194051 16.149086 19.307846 13.968264 10.226233 5.101740 11.261183
PDX PHL PHX PIT PSE PSP PVD PWM
16.257123 6.556883 18.580863 6.738083 9.325991 15.940350 3.288825 3.913012
RDU RIC ROC RSW SAN SAT SAV SBN
6.201977 5.435177 5.241903 10.893647 19.176512 17.716757 8.709996 6.932211
SDF SEA SFO SJC SJU SLC SMF SNA
9.401284 15.630948 17.234986 16.476307 10.388108 14.837516 14.813419 19.889615
SRQ STL STT SYR TPA TUL TVC TYS
11.192370 11.503000 11.009211 4.530559 11.189950 17.257618 6.195816 8.734937
XNA
16.005388
Using the lapply function
lapply (List Apply) is a function in R that applies a function over the elements of a list or vector and returns a list. The primary objective of lapply is to perform operations iteratively over list elements or vector elements, applying the same function to each element. Unlike sapply, lapply always returns a list, regardless of the output of the function being applied. This makes lapply particularly useful when you expect the output to be complex or when you need to retain the structure of the original data.
General Formula of lapply
The general syntax of lapply in R is as follows:
lapply(X, FUN, ...)
X: The object (list or vector) over which the function is to be applied.
FUN: The function to be applied to each element of X. It can be a predefined function or a user-defined function.
...: Additional arguments to FUN.
lapply returns a list of the same length as X, with each element being the result of applying FUN to the corresponding element of X.
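A common pattern is to let lapply build one result per group and combine the pieces afterwards. The following is a minimal sketch on hypothetical values (air_time and carrier here are short made-up vectors, not the flights columns):

```r
air_time <- c(50, 60, 70, 200, 220)     # hypothetical values
carrier  <- c("AA", "AA", "AA", "UA", "UA")

# lapply() keeps one list element per group, each holding a length-2 range
ranges <- lapply(split(air_time, carrier), range)

# The list elements can later be bound into a matrix, one row per group
range_table <- do.call(rbind, ranges)
colnames(range_table) <- c("min", "max")
range_table
```

Keeping the intermediate list makes it easy to inspect individual groups before deciding how to combine them.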
Example 10
Creating a list of unique carriers for each month
# Group flights by month and get unique carriers for each month
lapply(split(flights, flights$month),
function(x) unique(x$carrier))
$`1`
[1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
[16] "OO"
$`2`
[1] "US" "UA" "B6" "AA" "EV" "FL" "MQ" "DL" "WN" "9E" "VX" "AS" "F9" "HA" "YV"
$`3`
[1] "B6" "US" "UA" "AA" "EV" "MQ" "DL" "9E" "FL" "WN" "VX" "AS" "F9" "HA" "YV"
$`4`
[1] "US" "UA" "AA" "B6" "MQ" "EV" "DL" "FL" "WN" "VX" "AS" "9E" "F9" "HA" "YV"
$`5`
[1] "VX" "US" "AA" "UA" "B6" "MQ" "EV" "DL" "WN" "FL" "AS" "9E" "F9" "HA" "YV"
$`6`
[1] "B6" "US" "UA" "AA" "EV" "DL" "WN" "FL" "MQ" "VX" "AS" "9E" "HA" "F9" "YV"
[16] "OO"
$`7`
[1] "B6" "AA" "VX" "US" "UA" "DL" "EV" "MQ" "WN" "9E" "AS" "FL" "HA" "YV" "F9"
$`8`
[1] "B6" "US" "UA" "AA" "DL" "EV" "FL" "WN" "MQ" "9E" "VX" "AS" "HA" "YV" "F9"
[16] "OO"
$`9`
[1] "B6" "UA" "AA" "EV" "DL" "MQ" "WN" "FL" "US" "VX" "AS" "9E" "HA" "F9" "YV"
[16] "OO"
$`10`
[1] "US" "UA" "AA" "B6" "EV" "DL" "MQ" "FL" "WN" "9E" "VX" "AS" "F9" "YV" "HA"
$`11`
[1] "B6" "US" "UA" "AA" "DL" "WN" "EV" "FL" "MQ" "9E" "VX" "AS" "F9" "HA" "YV"
[16] "OO"
$`12`
[1] "B6" "US" "UA" "AA" "EV" "DL" "WN" "9E" "VX" "AS" "MQ" "F9" "FL" "HA" "YV"
Example 11
Generate summaries of departure delays for each origin airport
# Generate a summary of departure delays for each origin airport
lapply(split(flights, flights$origin),
function(x) summary(x$dep_delay))
$EWR
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-25.00 -4.00 -1.00 15.11 15.00 1126.00 3239
$JFK
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-43.00 -5.00 -1.00 12.11 10.00 1301.00 1863
$LGA
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-33.00 -6.00 -3.00 10.35 7.00 911.00 3153
Example 12
Count the number of flights to each destination
# Count the number of flights for each destination
lapply(split(flights, flights$dest), nrow)
Example 13
Calculate the average air time for flights for each carrier
# Calculate the average air time for each carrier
lapply(split(flights, flights$carrier),
function(x) mean(x$air_time, na.rm = TRUE))
$`9E`
[1] 86.7816
$AA
[1] 188.8223
$AS
[1] 325.6178
$B6
[1] 151.1772
$DL
[1] 173.6888
$EV
[1] 90.07619
$F9
[1] 229.5991
$FL
[1] 101.1439
$HA
[1] 623.0877
$MQ
[1] 91.18025
$OO
[1] 83.48276
$UA
[1] 211.7914
$US
[1] 88.5738
$VX
[1] 337.0023
$WN
[1] 147.8248
$YV
[1] 65.74081
Example 14
Find the maximum arrival delay for each month
# Find the maximum arrival delay for each month
lapply(split(flights, flights$month),
function(x) max(x$arr_delay, na.rm = TRUE))
$`1`
[1] 1272
$`2`
[1] 834
$`3`
[1] 915
$`4`
[1] 931
$`5`
[1] 875
$`6`
[1] 1127
$`7`
[1] 989
$`8`
[1] 490
$`9`
[1] 1007
$`10`
[1] 688
$`11`
[1] 796
$`12`
[1] 878
Using the vapply function
vapply is a function in R that serves a similar purpose to sapply and lapply, but with an added layer of safety and predictability. The primary objective of vapply is to apply a function over the elements of a vector or list and return a vector or array with a pre-specified type and size. This makes vapply particularly useful when you want to ensure the consistency and type safety of the output, avoiding surprises or errors that might arise from unexpected output types.
General Formula of vapply
The general syntax of vapply in R is as follows:
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
X: The object (list, vector, or expression) over which the function is to be applied.
FUN: The function to apply to each element of X.
FUN.VALUE: A (template) value indicating the type and size of the output to expect from FUN. It ensures that the output has this structure.
...: Additional arguments to FUN.
USE.NAMES: If TRUE and X is named, the names are preserved.
vapply is safer than sapply because it requires you to specify the expected output type, so a result that does not match the template raises an error immediately instead of silently producing an unexpected structure.
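The difference in strictness can be sketched with a function whose output length varies. The list values and the helper first_two below are hypothetical and serve only as an illustration:

```r
values <- list(a = 1:3, b = 4L)

first_two <- function(v) head(v, 2)   # returns 2 values for a, only 1 for b

# sapply() silently falls back to a list because the lengths differ
loose <- sapply(values, first_two)

# vapply() stops with an error because b does not yield integer(2)
strict <- tryCatch(
  vapply(values, first_two, integer(2)),
  error = function(e) e
)

is.list(loose)
inherits(strict, "error")
```

The tryCatch wrapper is used here only so that the script keeps running; in real code the error would normally be left to surface.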
Example 15
Calculate the average departure delay for each carrier, ensuring a numeric vector output
# Calculate the average departure delay for each carrier
vapply(split(flights$dep_delay, flights$carrier),
function(x) mean(x, na.rm = TRUE), numeric(1))
9E AA AS B6 DL EV F9 FL
16.725769 8.586016 5.804775 13.022522 9.264505 19.955390 20.215543 18.726075
HA MQ OO UA US VX WN YV
4.900585 10.552041 12.586207 12.106073 3.782418 12.869421 17.711744 18.996330
Example 16
Count the total number of flights for each month, with the output as an integer vector
# Count the number of flights for each month
vapply(split(flights$month, flights$month), length, integer(1))
1 2 3 4 5 6 7 8 9 10 11 12
27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268 28135
Example 17
Find the maximum arrival delay for each destination, ensuring a numeric vector output
# Find the maximum arrival delay for each destination
vapply(split(flights$arr_delay, flights$dest),
function(x) max(x, na.rm = TRUE),
numeric(1))
Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
ABQ ACK ALB ANC ATL AUS AVL BDL BGR BHM BNA BOS BQN BTV BUF BUR
153 221 328 39 895 349 228 266 238 291 364 422 208 396 396 247
BWI BZN CAE CAK CHO CHS CLE CLT CMH CRW CVG DAY DCA DEN DFW DSM
851 154 224 433 228 331 469 744 1127 189 989 292 384 834 598 322
DTW EGE EYW FLL GRR GSO GSP HDN HNL HOU IAD IAH ILM IND JAC JAX
674 266 45 405 340 444 312 43 1272 376 577 783 143 350 175 336
LAS LAX LEX LGA LGB MCI MCO MDW MEM MHT MIA MKE MSN MSP MSY MTJ
852 784 -22 -Inf 302 456 744 422 332 335 878 441 364 915 780 101
MVY MYR OAK OKC OMA ORD ORF PBI PDX PHL PHX PIT PSE PSP PVD PWM
168 207 318 262 364 1109 451 449 850 325 393 372 182 17 285 310
RDU RIC ROC RSW SAN SAT SAV SBN SDF SEA SFO SJC SJU SLC SMF SNA
430 463 387 272 769 681 443 53 357 444 1007 226 371 847 271 189
SRQ STL STT SYR TPA TUL TVC TYS XNA
474 802 330 391 931 262 220 281 330
Example 18
Calculate the proportion of delayed flights for each origin airport, with the output as a numeric vector
# Function to calculate the proportion of delayed flights
prop_delayed <- function(x) sum(x > 0, na.rm = TRUE) / length(x)
# Calculate the proportion of delayed flights for each origin airport
vapply(split(flights$dep_delay, flights$origin),
       prop_delayed,
       numeric(1))
EWR JFK LGA
0.4362229 0.3777083 0.3218933
Example 19
Compute the standard deviation of air time for flights for each carrier, ensuring a numeric vector output
# Calculate the standard deviation of air time for each carrier
vapply(split(flights$air_time, flights$carrier),
function(x) sd(x, na.rm = TRUE),
numeric(1))
9E AA AS B6 DL EV F9 FL
42.90986 81.68095 16.16666 89.64308 84.82097 39.82934 15.16282 23.94281
HA MQ OO UA US VX WN YV
20.68882 30.63775 35.18382 101.02968 75.17153 20.88173 55.52888 19.71549
Using the tapply function
The tapply function in R is designed for applying a function to subsets of a vector and then combining the results. The primary objective of tapply is to perform grouped data analysis, where you need to apply a function to subsets of data defined by factors or categorical variables. It is particularly useful for summarizing data across groups, such as calculating group means, sums, or other summary statistics.
General Formula of tapply
The general syntax of tapply in R is as follows:
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
X: A vector or an object which can be coerced to a vector. This is the data to be divided into groups and analyzed.
INDEX: One or more factors, or list of factors, by which to split X. Each factor must have the same length as X.
FUN: The function to be applied to each subset of X.
...: Additional arguments to FUN.
default: The value used for groups (cells) that contain no data when the result is simplified to an array.
simplify: If TRUE and FUN returns scalars, the result is an array; if FALSE, the result is always a list.
tapply returns an array (if simplify is TRUE and the output fits into an array) or a list of values obtained by applying FUN to each subset of X.
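The role of a list of factors and of the default argument can be sketched on a few hypothetical values (delay, month, and carrier below are short made-up vectors, not the flights columns):

```r
delay   <- c(10, 20, 30, 40)           # hypothetical values
month   <- c("1", "1", "2", "2")
carrier <- c("AA", "UA", "AA", "AA")   # no UA flights in month 2

# Two grouping factors give a month-by-carrier matrix; the empty cell is NA
by_group <- tapply(delay, list(month, carrier), sum)

# The default argument controls what is stored in empty cells
by_group_zero <- tapply(delay, list(month, carrier), sum, default = 0)

by_group
by_group_zero
```

Using default = 0 is reasonable for sums and counts, while NA is usually the safer choice for statistics such as means or medians.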
Example 20
Calculating the average departure delay for each carrier
# Calculate the average departure delay by carrier
tapply(flights$dep_delay, flights$carrier,
function(x) mean(x, na.rm = TRUE))
9E AA AS B6 DL EV F9 FL
16.725769 8.586016 5.804775 13.022522 9.264505 19.955390 20.215543 18.726075
HA MQ OO UA US VX WN YV
4.900585 10.552041 12.586207 12.106073 3.782418 12.869421 17.711744 18.996330
Example 21
Counting the total number of flights for each month
# Count the total number of flights per month
tapply(flights$flight, flights$month, length)
1 2 3 4 5 6 7 8 9 10 11 12
27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268 28135
Example 22
Finding the maximum arrival delay for each origin airport
# Find the maximum arrival delay by origin airport
tapply(flights$arr_delay, flights$origin,
function(x) max(x, na.rm = TRUE))
EWR JFK LGA
1109 1272 915
Example 23
Calculating the proportion of delayed flights (arrival delay > 0) for each destination
# Calculate the proportion of delayed flights by destination
tapply(flights$arr_delay > 0, flights$dest,
function(x) mean(x, na.rm = TRUE))
ABQ ACK ALB ANC ATL AUS AVL BDL
0.4212598 0.3939394 0.4401914 0.6250000 0.4719368 0.4077146 0.4559387 0.3495146
BGR BHM BNA BOS BQN BTV BUF BUR
0.3826816 0.4498141 0.4497041 0.3157369 0.4673423 0.4055777 0.3901532 0.4486486
BWI BZN CAE CAK CHO CHS CLE CLT
0.3947836 0.4571429 0.8207547 0.5368171 0.4130435 0.4324030 0.4037324 0.4269416
CMH CRW CVG DAY DCA DEN DFW DSM
0.4302465 0.4776119 0.4510067 0.4517513 0.4393590 0.4494351 0.3428708 0.4971319
DTW EGE EYW FLL GRR GSO GSP HDN
0.3606467 0.4299517 0.5882353 0.4380936 0.5192308 0.4477212 0.4822785 0.5714286
HNL HOU IAD IAH ILM IND JAC JAX
0.3409415 0.4325492 0.4412038 0.4069160 0.3831776 0.4204947 0.8095238 0.4700724
LAS LAX LEX LGA LGB MCI MCO MDW
0.3541667 0.3723325 0.0000000 NaN 0.3615734 0.4848806 0.3970072 0.4655901
MEM MHT MIA MKE MSN MSP MSY MTJ
0.4454330 0.4431330 0.3325282 0.4868955 0.5125899 0.3997691 0.4029610 0.3571429
MVY MYR OAK OKC OMA ORD ORF PBI
0.2761905 0.3620690 0.3754045 0.6380952 0.4736842 0.3741398 0.4274756 0.4424233
PDX PHL PHX PIT PSE PSP PVD PWM
0.4135618 0.4354315 0.3973079 0.3932993 0.4972067 0.3333333 0.5055866 0.4313811
RDU RIC ROC RSW SAN SAT SAV SBN
0.4368082 0.5098039 0.4007634 0.4003427 0.4056848 0.3899848 0.4753004 0.4000000
SDF SEA SFO SJC SJU SLC SMF SNA
0.4519928 0.3266409 0.3750854 0.3780488 0.3784861 0.3398613 0.5319149 0.2943350
SRQ STL STT SYR TPA TUL TVC TYS
0.3913405 0.4391598 0.3281853 0.3954306 0.4117727 0.6564626 0.3684211 0.5311419
XNA
0.4596774
Example 24
Computing the median air time for flights for each combination of month and carrier
# Compute the median air time by month and carrier
tapply(flights$air_time, list(flights$month, flights$carrier),
       median, na.rm = TRUE)
9E AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN YV
1 69 171.5 344.0 149 153 88 244.0 119 638.0 89 132.0 202 82 352.0 129 51.0
2 68 181.0 325.0 152 154 87 235.5 118 615.5 86 NA 201 80 346.0 122 49.5
3 73 179.0 328.0 147 151 84 229.0 109 646.0 82 NA 195 77 346.0 119 47.0
4 77 180.0 325.5 146 149 89 237.0 111 630.0 84 NA 198 76 344.5 120 79.0
5 77 169.0 314.0 136 141 84 217.5 105 615.0 78 NA 193 73 327.0 113 77.0
6 83 168.0 323.5 139 143 86 223.0 105 607.0 80 84.5 194 74 330.0 117 75.0
7 83 159.0 319.0 135 140 87 217.5 104 606.0 79 NA 193 74 321.0 113 77.0
8 85 156.0 317.0 134 139 87 217.0 105 614.0 80 69.0 195 74 329.0 116 78.5
9 85 159.0 313.5 132 137 84 215.0 102 605.0 78 68.0 189 73 326.0 112 79.0
10 90 163.0 317.0 137 142 88 227.0 108 619.0 81 NA 195 77 338.0 122 77.5
11 94 166.0 334.0 145 148 92 235.0 113 634.0 84 157.0 202 81 350.0 127 49.0
12 94 173.0 340.0 151 155 96 241.0 117 628.0 88 NA 207 86 353.0 133 84.0
Using the mapply function
mapply (Multivariate Apply) in R is a function designed to apply a function to multiple arguments (vectors or lists) simultaneously. The primary objective of mapply is to extend the capabilities of sapply and lapply by allowing the application of functions over multiple arguments. This is particularly useful when you have parallel arrays or lists and you want to apply a function to corresponding elements of each array.
General Formula of mapply
The general syntax of mapply in R is as follows:
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN: The function to be applied.
...: Arguments to vectorize over, typically vectors or lists of equal length (shorter ones are recycled).
MoreArgs: A list of other arguments to FUN.
SIMPLIFY: If TRUE, mapply tries to simplify the result to a vector, matrix, or array; if FALSE, the result is a list.
USE.NAMES: Logical; if TRUE and if the first ... argument has names, the result uses these names; if it is a character vector, its values are used as names.
mapply can be seen as a multivariate version of sapply. It applies FUN to the first elements of each argument in ..., then to the second elements, third elements, and so on.
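The element-by-element pairing described above can be sketched as follows. The helper add and the argument named factor are hypothetical names introduced only for this illustration:

```r
# FUN is called as add(1, 4), add(2, 5), add(3, 6)
add  <- function(x, y) x + y
sums <- mapply(add, 1:3, 4:6)

# Arguments that stay constant across calls go into MoreArgs
scaled <- mapply(function(x, y, factor) (x + y) * factor,
                 1:3, 4:6, MoreArgs = list(factor = 10))

# Results of different lengths cannot be simplified, so a list is returned
repeated <- mapply(rep, 1:3, 3:1)

sums
scaled
str(repeated)
```

As with sapply, the shape of the result depends on what FUN returns, so SIMPLIFY = FALSE is a reasonable choice when a list is always wanted.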
Example 25
Calculating the speed (distance divided by air time) for each flight
# Function to calculate speed (miles per minute, since distance is in miles and air_time in minutes)
calculate_speed <- function(distance, air_time) distance / air_time
# Calculate speed for each flight
head(mapply(calculate_speed, flights$distance, flights$air_time))
[1] 6.167401 6.237885 6.806250 8.612022 6.568966 4.793333
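Since arithmetic operators in R already work element-wise, a simple calculation like this one does not strictly need mapply. The sketch below compares both approaches on a few hypothetical sample values (the short vectors distance and air_time stand in for the flights columns):

```r
distance <- c(1400, 1416, 1089)   # miles (hypothetical sample values)
air_time <- c(227, 227, 160)      # minutes

calculate_speed <- function(distance, air_time) distance / air_time

speed_mapply <- mapply(calculate_speed, distance, air_time)
speed_vector <- distance / air_time   # operators already work element-wise

all.equal(speed_mapply, speed_vector)
```

mapply becomes the more natural tool when the function is not vectorized, for example when it contains if statements that expect a single value.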
Example 26
Creating a date object for each flight
# Function to combine year, month, and day into a date
combine_date <- function(year, month, day) as.Date(paste(year, month, day, sep = "-"))
# Create a date for each flight
head(mapply(combine_date, flights$year, flights$month, flights$day))
[1] 15706 15706 15706 15706 15706 15706
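The output shows numbers rather than dates because simplifying the results drops the Date class, leaving the number of days since 1970-01-01. The following sketch, using hypothetical year, month, and day vectors, shows the effect and one way to keep the class:

```r
year  <- c(2013, 2013)
month <- c(1, 12)
day   <- c(1, 31)

combine_date <- function(year, month, day) as.Date(paste(year, month, day, sep = "-"))

# mapply() simplifies the results and drops the Date class,
# leaving the number of days since 1970-01-01
as_numbers <- mapply(combine_date, year, month, day)

# as.Date() and paste() are vectorized, so calling them directly keeps the class
as_dates <- as.Date(paste(year, month, day, sep = "-"))

class(as_numbers)
class(as_dates)
```

Another option, if mapply is preferred, is to set SIMPLIFY = FALSE and work with the resulting list of Date objects.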
Example 27
Comparing scheduled arrival time with actual arrival time to find the difference
# Function to calculate time difference
time_difference <- function(scheduled, actual) actual - scheduled
# Calculate the difference in times (values are HHMM clock times, so the result is not in minutes)
head(mapply(time_difference, flights$sched_arr_time, flights$arr_time))
[1] 11 20 73 -18 -25 12
Example 28
Checking if both departure and arrival times are delayed
# Function to check if both times are delayed
both_delayed <- function(dep_delay, arr_delay) dep_delay > 0 & arr_delay > 0
# Check if both departure and arrival times are delayed for each flight
head(mapply(both_delayed, flights$dep_delay, flights$arr_delay))
[1] TRUE TRUE TRUE FALSE FALSE FALSE
The remaining examples use the mtcars dataset.
Example 29
Convert the mpg (miles per gallon) values to kilometers per liter (1 mile = 1.60934 km, 1 gallon = 3.78541 liters)
# Function to convert miles per gallon to kilometers per liter
mpg_to_kmpl <- function(mpg) mpg * 1.60934 / 3.78541
# Apply the conversion to the mpg column
mapply(mpg_to_kmpl, mtcars$mpg)
[1] 8.928000 8.928000 9.693257 9.098057 7.950171 7.695086 6.079543
[8] 10.373486 9.693257 8.162743 7.567543 6.972343 7.354971 6.462171
[15] 4.421486 4.421486 6.249600 13.774628 12.924343 14.412343 9.140571
[22] 6.589714 6.462171 5.654400 8.162743 11.606400 11.053714 12.924343
[29] 6.717257 8.375314 6.377143 9.098057
Example 30
Compute the power-to-weight ratio for each car (horsepower to weight). Horsepower is in hp and weight is in wt (1000 lbs)
# Function to calculate power-to-weight ratio (hp per 1000 lbs)
power_to_weight <- function(hp, wt) hp / wt
# Apply the function to hp and wt columns
mapply(power_to_weight, mtcars$hp, mtcars$wt)
[1] 41.98473 38.26087 40.08621 34.21462 50.87209 30.34682 68.62745 19.43574
[9] 30.15873 35.75581 35.75581 44.22604 48.25737 47.61905 39.04762 39.63864
[17] 43.03087 30.00000 32.19814 35.42234 39.35091 42.61364 43.66812 63.80208
[25] 45.51365 34.10853 42.52336 74.68605 83.28076 63.17690 93.83754 39.20863
Example 31
Estimate the time to accelerate from 0 to 100 km/h based on qsec (1/4 mile time). This is a very rough estimate: the formula simply rescales qsec by 100 / (1/4 × 1.60934), so the results preserve the ranking of the cars but should not be read as times in seconds.
# Function to estimate time to 100 km/h
time_to_100 <- function(qsec) qsec * (100 / (1/4 * 1.60934))
# Apply the function to the qsec column
mapply(time_to_100, mtcars$qsec)
[1] 4091.118 4230.306 4625.499 4831.794 4230.306 5025.663 3937.018 4970.982
[9] 5691.774 4548.448 4697.578 4324.754 4374.464 4473.884 4468.913 4429.145
[17] 4329.725 4839.251 4603.129 4946.127 4973.467 4193.023 4299.899 3830.142
[25] 4237.762 4697.578 4150.770 4200.480 3603.962 3852.511 3628.817 4623.013
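As an alternative sketch, one could assume constant acceleration from a standing start over the quarter mile. Under that assumption (which is a simplification, and not the formula used in the example above), the distance d covered in time t satisfies d = a * t^2 / 2, which gives an estimate in seconds. The names quarter_mile_m, v_target, and time_to_100_const are hypothetical:

```r
quarter_mile_m <- 0.25 * 1609.344   # 1/4 mile in metres
v_target       <- 100 / 3.6         # 100 km/h in metres per second

# Constant acceleration from rest: d = a * t^2 / 2, so a = 2 * d / t^2,
# and the time to reach v_target is v_target / a
time_to_100_const <- function(qsec) v_target * qsec^2 / (2 * quarter_mile_m)

est_seconds <- vapply(mtcars$qsec, time_to_100_const, numeric(1))
head(est_seconds)
```

Real cars do not accelerate at a constant rate, so these values are still only indicative, but they are at least expressed in seconds.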
Example 32
Create a full name for each car by combining the row names (car brand and model)
# Function to create a full name (splits the name on spaces and joins the parts again)
full_name <- function(name) paste(strsplit(name, " ")[[1]], collapse = " ")
# Apply the function to row names
mapply(full_name, rownames(mtcars))
Mazda RX4 Mazda RX4 Wag Datsun 710
"Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
Hornet 4 Drive Hornet Sportabout Valiant
"Hornet 4 Drive" "Hornet Sportabout" "Valiant"
Duster 360 Merc 240D Merc 230
"Duster 360" "Merc 240D" "Merc 230"
Merc 280 Merc 280C Merc 450SE
"Merc 280" "Merc 280C" "Merc 450SE"
Merc 450SL Merc 450SLC Cadillac Fleetwood
"Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
Lincoln Continental Chrysler Imperial Fiat 128
"Lincoln Continental" "Chrysler Imperial" "Fiat 128"
Honda Civic Toyota Corolla Toyota Corona
"Honda Civic" "Toyota Corolla" "Toyota Corona"
Dodge Challenger AMC Javelin Camaro Z28
"Dodge Challenger" "AMC Javelin" "Camaro Z28"
Pontiac Firebird Fiat X1-9 Porsche 914-2
"Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
Lotus Europa Ford Pantera L Ferrari Dino
"Lotus Europa" "Ford Pantera L" "Ferrari Dino"
Maserati Bora Volvo 142E
"Maserati Bora" "Volvo 142E"