Apply Family

Author

Dr. Mohammad Nasir Abdullah

The apply family of functions in R is a cornerstone of concise data manipulation and analysis. These functions let you perform repetitive tasks over arrays, lists, and data frames more concisely and readably than explicit loops. This chapter works through practical applications of these functions using the penguins dataset (from the palmerpenguins package), the flights dataset (from the nycflights13 package), and the built-in mtcars dataset.

Understanding the Apply Family

The apply family comprises several functions, each tailored for specific types of data and operations:

  • apply: Used for arrays and matrices; apply is ideal for applying a function to the rows or columns of a matrix.

  • lapply and sapply: These functions are designed for lists and vectors. lapply always returns a list, while sapply simplifies the output to a vector or matrix when possible.

  • vapply: Similar to sapply, but with a pre-specified type and length for each return value, making it safer and more predictable.

  • tapply: Ideal for group-wise operations; tapply applies a function to subsets of a vector defined by one or more grouping factors.

  • mapply: A multivariate version of sapply; mapply applies a function to the corresponding elements of several arguments in parallel.

Understanding these functions is crucial for data scientists and statisticians working with R, as they significantly enhance the efficiency and readability of data analysis code.
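As a quick orientation, the sketch below calls each member of the family once on small toy objects. The objects `scores` and `m` are hypothetical and exist only for illustration; they are not part of the datasets used in this chapter.

```r
# Toy data (hypothetical, for illustration only)
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))
m <- matrix(1:6, nrow = 2)

col_sums  <- apply(m, 2, sum)                  # one value per column of the matrix
as_list   <- lapply(scores, mean)              # always returns a list
as_vector <- sapply(scores, mean)              # simplified to a named numeric vector
checked   <- vapply(scores, mean, numeric(1))  # same result, but the type is enforced
by_group  <- tapply(c(10, 20, 30, 40),         # values ...
                    c("x", "y", "x", "y"),     # ... summed within each group
                    sum)
pairwise  <- mapply(function(u, v) u + v, 1:3, 4:6)  # element-wise over two vectors
```

The main difference between the calls is the shape of what comes back: a list from lapply, a simplified vector from sapply and vapply, and a result indexed by group from tapply.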

Using apply function

The apply function applies a function to the rows or columns of a matrix or, more generally, over the margins of a higher-dimensional array. It is particularly useful for computing summary statistics, transformations, and custom operations without writing explicit loops. When apply is given a data frame, it first converts it to a matrix, so the selected columns should all be of the same type (for example, all numeric).

Objective

The primary objective of apply is to simplify repetitive operations across rows or columns of data structures like matrices and arrays. By using apply, one can write cleaner and more efficient R code, which is easier to read and maintain.

General use of the function

apply(X, MARGIN, FUN, ..., simplify = TRUE)

Arguments

  • X: An array, including a matrix.

  • MARGIN: A vector giving the subscripts which the function will be applied over. For a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.

  • FUN: The function to be applied. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.

  • ...: Optional arguments to FUN.

  • simplify: A logical indicating whether results should be simplified if possible.
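To make the MARGIN argument concrete, here is a minimal sketch on a small named matrix. The matrix `m` and its row and column names are made up for illustration.

```r
# A 2 x 3 toy matrix with hypothetical row and column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_totals <- apply(m, 1, sum)                       # MARGIN = 1: one result per row
col_totals <- apply(m, 2, sum)                       # MARGIN = 2: one result per column
doubled    <- apply(m, c(1, 2), function(x) x * 2)   # MARGIN = c(1, 2): every cell
col_ranges <- apply(m, 2, range)                     # FUN returns 2 values, so the
                                                     # result has 2 rows and 3 columns
```

When FUN returns more than one value, as range does here, apply stacks the results as columns, one per row or column of the input.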

Example 1:

Use apply to calculate summary measures such as the mean and standard deviation for the numeric variables in the dataset.

library(palmerpenguins) #to use penguins dataset
penguins <- na.omit(penguins) #to remove rows with missing data

# Mean values for selected columns
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, mean)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.99279          17.16486         200.96697        4207.05706 
# Standard deviation for selected columns
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, sd)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         5.468668          1.969235         14.015765        805.215802 

Example 2:

apply can also be used for computing medians and interquartile ranges.

# Median values
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, median)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
             44.5              17.3             197.0            4050.0 
# Interquartile ranges
apply(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 2, IQR)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
              9.1               3.1              23.0            1225.0 

Example 3:

Assume we have a dataset where respondents have answered a series of questions, each rated on a numerical scale (like 1 to 5). Our goal is to calculate the total score for each respondent.

# Simulating a questionnaire dataset
set.seed(123)  # For reproducible results
questionnaire <- as.data.frame(matrix(sample(1:5, 100, replace = TRUE), nrow = 20))
colnames(questionnaire) <- paste0("Q", 1:5)

# Calculating row sums
total_scores <- apply(questionnaire, 1, sum)

# Adding total scores to the dataset
questionnaire$total_score <- total_scores

# Display first 5 observations
library(dplyr)
slice_head(questionnaire,n=5)
  Q1 Q2 Q3 Q4 Q5 total_score
1  3  2  5  5  4          19
2  3  1  5  5  2          16
3  2  3  4  3  2          14
4  2  4  5  1  4          16
5  3  1  2  4  4          14

Example 4:

Calculate the average of all numerical variables in the flights dataset.

library(nycflights13)
#selecting all numerical variables in the flights dataset
num_flights <- flights %>% 
                   select(where(is.numeric))


# Calculate the average of all numerical variables in the dataset
apply(num_flights, 2, function(x) mean(x, na.rm = TRUE))
          year          month            day       dep_time sched_dep_time 
   2013.000000       6.548510      15.710787    1349.109947    1344.254840 
     dep_delay       arr_time sched_arr_time      arr_delay         flight 
     12.639070    1502.054999    1536.380220       6.895377    1971.923620 
      air_time       distance           hour         minute 
    150.686460    1039.912604      13.180247      26.230100 

Using sapply function

sapply is part of the apply family in R, and it is a user-friendly version of lapply because it tries to simplify the output. If the function returns a single value for each element of the list or vector, sapply returns a vector. If it returns a vector of the same length (greater than one) for each element, sapply returns a matrix with one column per element. Otherwise, it returns a list, just as lapply does.

The primary objective of sapply (Simplified Apply) in R is to apply a function over a list, vector, or data frame column-wise and simplify the output to the most appropriate data structure. Unlike lapply, which always returns a list, sapply attempts to simplify the result to a vector or matrix, making the output more concise and often easier to work with. This simplification is particularly useful when the applied function returns a single value per element (such as a sum or mean), as it avoids the overhead of dealing with list outputs.

sapply is often used when you need to:

  1. Perform an operation on each element of a list or vector and expect a simplified result.

  2. Generate summary statistics or perform transformations on dataset columns.

  3. Apply user-defined or built-in functions across elements and desire a compact and readable output.

General Formula of sapply

The general syntax of sapply in R is as follows:

sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
  • X: This is the object (list, vector, or data frame) over which the function is to be applied.

  • FUN: The function that needs to be applied to each element of X. This can be a standard function or a user-defined function.

  • ...: Additional arguments to FUN.

  • simplify: If TRUE, sapply will try to simplify the output to a vector or matrix. If FALSE, the output will be the same as that of lapply (a list).

  • USE.NAMES: If TRUE and if X is character, the output will have named elements corresponding to X.
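The simplification rules can be seen on a small list. This is a minimal sketch; the list `x` is hypothetical.

```r
x <- list(a = 1:3, b = 4:6)

as_vector <- sapply(x, sum)                      # one value per element: named vector
as_matrix <- sapply(x, range)                    # two values per element: 2 x 2 matrix
as_list   <- sapply(x, function(v) v[v > 2])     # lengths differ (1 and 3): a list
no_simpl  <- sapply(x, sum, simplify = FALSE)    # simplification switched off: a list
```

Because the return type depends on what FUN happens to produce, code that must always receive a vector may be better served by vapply.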

Example 5

Calculate the average departure delay by carrier

#by using dplyr method 

flights %>%    
  group_by(carrier) %>%   
  summarize(mean_dep = mean(dep_delay, na.rm = TRUE))
# A tibble: 16 × 2
   carrier mean_dep
   <chr>      <dbl>
 1 9E         16.7 
 2 AA          8.59
 3 AS          5.80
 4 B6         13.0 
 5 DL          9.26
 6 EV         20.0 
 7 F9         20.2 
 8 FL         18.7 
 9 HA          4.90
10 MQ         10.6 
11 OO         12.6 
12 UA         12.1 
13 US          3.78
14 VX         12.9 
15 WN         17.7 
16 YV         19.0 
#by using sapply 

sapply(split(flights$dep_delay, flights$carrier),
       function(x) mean(x, na.rm = TRUE))
       9E        AA        AS        B6        DL        EV        F9        FL 
16.725769  8.586016  5.804775 13.022522  9.264505 19.955390 20.215543 18.726075 
       HA        MQ        OO        UA        US        VX        WN        YV 
 4.900585 10.552041 12.586207 12.106073  3.782418 12.869421 17.711744 18.996330 

Example 6

Count the total number of flights for each month

#by using dplyr methods

flights %>%
  group_by(month) %>%
  summarize(count1 = length(month))
# A tibble: 12 × 2
   month count1
   <int>  <int>
 1     1  27004
 2     2  24951
 3     3  28834
 4     4  28330
 5     5  28796
 6     6  28243
 7     7  29425
 8     8  29327
 9     9  27574
10    10  28889
11    11  27268
12    12  28135
#by using sapply
sapply(split(flights$month, flights$month), length)
    1     2     3     4     5     6     7     8     9    10    11    12 
27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268 28135 

Example 7

Find the maximum arrival delay for each destination

#using dplyr 
flights %>%
  group_by(dest) %>%
  summarize(max1 = max(arr_delay, na.rm = TRUE))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `max1 = max(arr_delay, na.rm = TRUE)`.
ℹ In group 52: `dest = "LGA"`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf
# A tibble: 105 × 2
   dest   max1
   <chr> <dbl>
 1 ABQ     153
 2 ACK     221
 3 ALB     328
 4 ANC      39
 5 ATL     895
 6 AUS     349
 7 AVL     228
 8 BDL     266
 9 BGR     238
10 BHM     291
# ℹ 95 more rows
#using sapply
# Find the maximum arrival delay for each destination
sapply(split(flights$arr_delay, flights$dest), 
                                     function(x) max(x, na.rm = TRUE))
Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
 ABQ  ACK  ALB  ANC  ATL  AUS  AVL  BDL  BGR  BHM  BNA  BOS  BQN  BTV  BUF  BUR 
 153  221  328   39  895  349  228  266  238  291  364  422  208  396  396  247 
 BWI  BZN  CAE  CAK  CHO  CHS  CLE  CLT  CMH  CRW  CVG  DAY  DCA  DEN  DFW  DSM 
 851  154  224  433  228  331  469  744 1127  189  989  292  384  834  598  322 
 DTW  EGE  EYW  FLL  GRR  GSO  GSP  HDN  HNL  HOU  IAD  IAH  ILM  IND  JAC  JAX 
 674  266   45  405  340  444  312   43 1272  376  577  783  143  350  175  336 
 LAS  LAX  LEX  LGA  LGB  MCI  MCO  MDW  MEM  MHT  MIA  MKE  MSN  MSP  MSY  MTJ 
 852  784  -22 -Inf  302  456  744  422  332  335  878  441  364  915  780  101 
 MVY  MYR  OAK  OKC  OMA  ORD  ORF  PBI  PDX  PHL  PHX  PIT  PSE  PSP  PVD  PWM 
 168  207  318  262  364 1109  451  449  850  325  393  372  182   17  285  310 
 RDU  RIC  ROC  RSW  SAN  SAT  SAV  SBN  SDF  SEA  SFO  SJC  SJU  SLC  SMF  SNA 
 430  463  387  272  769  681  443   53  357  444 1007  226  371  847  271  189 
 SRQ  STL  STT  SYR  TPA  TUL  TVC  TYS  XNA 
 474  802  330  391  931  262  220  281  330 
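The warning and the -Inf above arise because one group contains only missing values, so nothing is left after na.rm = TRUE removes them. One way to handle this, sketched below under the assumption that NA is the preferred result for such groups, is a small wrapper around max. The helper name `safe_max` and the toy list `delays` are hypothetical.

```r
# Hypothetical helper: return NA instead of -Inf when every value is missing
safe_max <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values explicitly
  if (length(x) == 0) NA_real_ else max(x)
}

# Toy groups (made up): the second group has no observed delays
delays <- list(AAA = c(5, NA, 12), BBB = c(NA_real_, NA_real_))
result <- sapply(delays, safe_max)
```

The same wrapper can be passed to sapply, vapply, or tapply in place of the anonymous function that calls max directly.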

Example 8

Calculate the proportion of delayed flights for each origin airport

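# Function to calculate the proportion of delayed flights.
# Flights with a missing delay are not counted as delayed,
# but they are still included in the denominator.
prop_delayed <- function(x) sum(x > 0, na.rm = TRUE) / length(x)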

# Calculate the proportion of delayed flights for each origin airport
sapply(split(flights$dep_delay, flights$origin), prop_delayed)
      EWR       JFK       LGA 
0.4362229 0.3777083 0.3218933 

Example 9

Compute the standard deviation of air time for flights to each destination

# Calculate the standard deviation of air time for each destination
sapply(split(flights$air_time, flights$dest), 
                               function(x) sd(x, na.rm = TRUE))
      ABQ       ACK       ALB       ANC       ATL       AUS       AVL       BDL 
19.291371  8.127495  3.084754 14.672009  9.812387 18.224186  7.379748  3.285493 
      BGR       BHM       BNA       BOS       BQN       BTV       BUF       BUR 
 3.329300 10.497089 10.960876  4.948552  9.163295  3.957284  5.231471 18.643140 
      BWI       BZN       CAE       CAK       CHO       CHS       CLE       CLT 
 4.350574 12.667898  8.299650  4.780790  5.159485  8.353413  6.932541  8.897082 
      CMH       CRW       CVG       DAY       DCA       DEN       DFW       DSM 
 6.596712  6.319128  8.523668  7.379901  6.460277 15.462580 17.233628 13.113565 
      DTW       EGE       EYW       FLL       GRR       GSO       GSP       HDN 
 7.853928 17.374784 11.518846 12.274766  8.601846  6.195388  8.130337 11.765334 
      HNL       HOU       IAD       IAH       ILM       IND       JAC       JAX 
21.682888 18.139940  5.630869 16.731984  7.220584  9.404597 11.178423 10.200820 
      LAS       LAX       LEX       LGA       LGB       MCI       MCO       MDW 
17.243012 18.291075        NA        NA 18.219866 14.370087 10.666273  9.406482 
      MEM       MHT       MIA       MKE       MSN       MSP       MSY       MTJ 
12.861058  3.074606 11.163614  9.201886 10.510813 11.751595 14.898290 13.770689 
      MVY       MYR       OAK       OKC       OMA       ORD       ORF       PBI 
 2.823029  8.194051 16.149086 19.307846 13.968264 10.226233  5.101740 11.261183 
      PDX       PHL       PHX       PIT       PSE       PSP       PVD       PWM 
16.257123  6.556883 18.580863  6.738083  9.325991 15.940350  3.288825  3.913012 
      RDU       RIC       ROC       RSW       SAN       SAT       SAV       SBN 
 6.201977  5.435177  5.241903 10.893647 19.176512 17.716757  8.709996  6.932211 
      SDF       SEA       SFO       SJC       SJU       SLC       SMF       SNA 
 9.401284 15.630948 17.234986 16.476307 10.388108 14.837516 14.813419 19.889615 
      SRQ       STL       STT       SYR       TPA       TUL       TVC       TYS 
11.192370 11.503000 11.009211  4.530559 11.189950 17.257618  6.195816  8.734937 
      XNA 
16.005388 

Using lapply function

lapply (List Apply) is a function in R used to apply a function over elements of a list or vector and returns a list. The primary objective of lapply is to perform operations iteratively over list elements or vector elements, applying the same function to each element. Unlike sapply, lapply always returns a list, regardless of the output of the function being applied. This makes lapply particularly useful when you expect the output to be complex or when you need to retain the structure of the original data.

General Formula of lapply

The general syntax of lapply in R is as follows:

lapply(X, FUN, ...)
  • X: The object (list or vector) over which the function is to be applied.

  • FUN: The function to be applied to each element of X. It can be a predefined function or a user-defined function.

  • ...: Additional arguments to FUN.

lapply returns a list of the same length as X, with each element being the result of applying FUN to the corresponding element of X.
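Because lapply keeps every result as a separate list element, it works well when FUN returns more than one value. The sketch below uses a made-up list `groups` and shows one common way to combine the pieces afterwards with do.call and rbind.

```r
# Toy groups (hypothetical)
groups <- list(g1 = c(2, 4, 6), g2 = c(1, 3))

# Each result is a named vector with two entries
stats <- lapply(groups, function(v) c(n = length(v), mean = mean(v)))

# Bind the list elements row-wise into a matrix with one row per group
combined <- do.call(rbind, stats)
```

The list names become the row names of `combined`, and the names inside each result become its column names.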

Example 10

Creating a list of unique carriers for each month

# Group flights by month and get unique carriers for each month
lapply(split(flights, flights$month), 
                    function(x) unique(x$carrier))
$`1`
 [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
[16] "OO"

$`2`
 [1] "US" "UA" "B6" "AA" "EV" "FL" "MQ" "DL" "WN" "9E" "VX" "AS" "F9" "HA" "YV"

$`3`
 [1] "B6" "US" "UA" "AA" "EV" "MQ" "DL" "9E" "FL" "WN" "VX" "AS" "F9" "HA" "YV"

$`4`
 [1] "US" "UA" "AA" "B6" "MQ" "EV" "DL" "FL" "WN" "VX" "AS" "9E" "F9" "HA" "YV"

$`5`
 [1] "VX" "US" "AA" "UA" "B6" "MQ" "EV" "DL" "WN" "FL" "AS" "9E" "F9" "HA" "YV"

$`6`
 [1] "B6" "US" "UA" "AA" "EV" "DL" "WN" "FL" "MQ" "VX" "AS" "9E" "HA" "F9" "YV"
[16] "OO"

$`7`
 [1] "B6" "AA" "VX" "US" "UA" "DL" "EV" "MQ" "WN" "9E" "AS" "FL" "HA" "YV" "F9"

$`8`
 [1] "B6" "US" "UA" "AA" "DL" "EV" "FL" "WN" "MQ" "9E" "VX" "AS" "HA" "YV" "F9"
[16] "OO"

$`9`
 [1] "B6" "UA" "AA" "EV" "DL" "MQ" "WN" "FL" "US" "VX" "AS" "9E" "HA" "F9" "YV"
[16] "OO"

$`10`
 [1] "US" "UA" "AA" "B6" "EV" "DL" "MQ" "FL" "WN" "9E" "VX" "AS" "F9" "YV" "HA"

$`11`
 [1] "B6" "US" "UA" "AA" "DL" "WN" "EV" "FL" "MQ" "9E" "VX" "AS" "F9" "HA" "YV"
[16] "OO"

$`12`
 [1] "B6" "US" "UA" "AA" "EV" "DL" "WN" "9E" "VX" "AS" "MQ" "F9" "FL" "HA" "YV"

Example 11

Generate summaries of departure delays for each origin airport

# Generate a summary of departure delays for each origin airport
lapply(split(flights, flights$origin), 
                function(x) summary(x$dep_delay))
$EWR
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -25.00   -4.00   -1.00   15.11   15.00 1126.00    3239 

$JFK
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -43.00   -5.00   -1.00   12.11   10.00 1301.00    1863 

$LGA
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -33.00   -6.00   -3.00   10.35    7.00  911.00    3153 

Example 12

Count the number of flights to each destination

# Count the number of flights for each destination
lapply(split(flights, flights$dest), nrow)

Example 13

Calculate the average air time for flights for each carrier

# Calculate the average air time for each carrier
lapply(split(flights, flights$carrier), 
                      function(x) mean(x$air_time, na.rm = TRUE))
$`9E`
[1] 86.7816

$AA
[1] 188.8223

$AS
[1] 325.6178

$B6
[1] 151.1772

$DL
[1] 173.6888

$EV
[1] 90.07619

$F9
[1] 229.5991

$FL
[1] 101.1439

$HA
[1] 623.0877

$MQ
[1] 91.18025

$OO
[1] 83.48276

$UA
[1] 211.7914

$US
[1] 88.5738

$VX
[1] 337.0023

$WN
[1] 147.8248

$YV
[1] 65.74081

Example 14

Find the maximum arrival delay for each month

# Find the maximum arrival delay for each month
lapply(split(flights, flights$month), 
                   function(x) max(x$arr_delay, na.rm = TRUE))
$`1`
[1] 1272

$`2`
[1] 834

$`3`
[1] 915

$`4`
[1] 931

$`5`
[1] 875

$`6`
[1] 1127

$`7`
[1] 989

$`8`
[1] 490

$`9`
[1] 1007

$`10`
[1] 688

$`11`
[1] 796

$`12`
[1] 878

Using vapply function

vapply is a function in R that serves a similar purpose to sapply and lapply, but with an added layer of safety and predictability. The primary objective of vapply is to apply a function over the elements of a vector or list and return a vector or array whose element type and length are specified in advance. This makes vapply particularly useful when you want to guarantee the type of the output, avoiding surprises that might arise from unexpected results.

General Formula of vapply

The general syntax of vapply in R is as follows:

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
  • X: The object (list, vector, or expression) over which the function is to be applied.

  • FUN: The function to apply to each element of X.

  • FUN.VALUE: A (template) value indicating the type and size of the output to expect from FUN. It ensures that the output has this structure.

  • ...: Additional arguments to FUN.

  • USE.NAMES: If TRUE and X is named, the names are preserved.

vapply is safer than sapply because it requires you to specify the expected type and length of each result. If any result does not match FUN.VALUE, vapply stops with an error instead of silently returning a different structure.
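The difference in behaviour can be sketched on a list whose elements are not all numeric. The list `x` is hypothetical, and tryCatch is used only so that the error can be captured and inspected rather than stopping the script.

```r
# Toy list (made up): one numeric element and one character element
x <- list(a = c(1.5, 2), b = c("u", "v"))

# sapply quietly coerces everything to character
loose <- sapply(x, function(v) v[1])

# vapply refuses, because the second result is not a single number
strict <- tryCatch(
  vapply(x, function(v) v[1], numeric(1)),
  error = function(e) e
)
```

Here `loose` is a character vector, which may go unnoticed until a later calculation fails, whereas `strict` holds the error raised at the point where the unexpected type first appears.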

Example 15

Calculate the average departure delay for each carrier, ensuring a numeric vector output

# Calculate the average departure delay for each carrier
vapply(split(flights$dep_delay, flights$carrier),
       function(x) mean(x, na.rm = TRUE), numeric(1))
       9E        AA        AS        B6        DL        EV        F9        FL 
16.725769  8.586016  5.804775 13.022522  9.264505 19.955390 20.215543 18.726075 
       HA        MQ        OO        UA        US        VX        WN        YV 
 4.900585 10.552041 12.586207 12.106073  3.782418 12.869421 17.711744 18.996330 

Example 16

Count the total number of flights for each month, with the output as an integer vector

# Count the number of flights for each month
vapply(split(flights$month, flights$month), length, integer(1))
    1     2     3     4     5     6     7     8     9    10    11    12 
27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268 28135 

Example 17

Find the maximum arrival delay for each destination, ensuring a numeric vector output

# Find the maximum arrival delay for each destination
vapply(split(flights$arr_delay, flights$dest), 
             function(x) max(x, na.rm = TRUE), 
              numeric(1))
Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
 ABQ  ACK  ALB  ANC  ATL  AUS  AVL  BDL  BGR  BHM  BNA  BOS  BQN  BTV  BUF  BUR 
 153  221  328   39  895  349  228  266  238  291  364  422  208  396  396  247 
 BWI  BZN  CAE  CAK  CHO  CHS  CLE  CLT  CMH  CRW  CVG  DAY  DCA  DEN  DFW  DSM 
 851  154  224  433  228  331  469  744 1127  189  989  292  384  834  598  322 
 DTW  EGE  EYW  FLL  GRR  GSO  GSP  HDN  HNL  HOU  IAD  IAH  ILM  IND  JAC  JAX 
 674  266   45  405  340  444  312   43 1272  376  577  783  143  350  175  336 
 LAS  LAX  LEX  LGA  LGB  MCI  MCO  MDW  MEM  MHT  MIA  MKE  MSN  MSP  MSY  MTJ 
 852  784  -22 -Inf  302  456  744  422  332  335  878  441  364  915  780  101 
 MVY  MYR  OAK  OKC  OMA  ORD  ORF  PBI  PDX  PHL  PHX  PIT  PSE  PSP  PVD  PWM 
 168  207  318  262  364 1109  451  449  850  325  393  372  182   17  285  310 
 RDU  RIC  ROC  RSW  SAN  SAT  SAV  SBN  SDF  SEA  SFO  SJC  SJU  SLC  SMF  SNA 
 430  463  387  272  769  681  443   53  357  444 1007  226  371  847  271  189 
 SRQ  STL  STT  SYR  TPA  TUL  TVC  TYS  XNA 
 474  802  330  391  931  262  220  281  330 

Example 18

Calculate the proportion of delayed flights for each origin airport, with the output as a numeric vector

# Function to calculate the proportion of delayed flights
prop_delayed <- function(x) sum(x > 0, na.rm = TRUE) / length(x)

# Calculate the proportion of delayed flights for each origin airport
vapply(split(flights$dep_delay, flights$origin), 
                prop_delayed, numeric(1))
      EWR       JFK       LGA 
0.4362229 0.3777083 0.3218933 

Example 19

Compute the standard deviation of air time for flights for each carrier, ensuring a numeric vector output

# Calculate the standard deviation of air time for each carrier
vapply(split(flights$air_time, flights$carrier), 
                function(x) sd(x, na.rm = TRUE), 
                 numeric(1))
       9E        AA        AS        B6        DL        EV        F9        FL 
 42.90986  81.68095  16.16666  89.64308  84.82097  39.82934  15.16282  23.94281 
       HA        MQ        OO        UA        US        VX        WN        YV 
 20.68882  30.63775  35.18382 101.02968  75.17153  20.88173  55.52888  19.71549 

Using tapply function

The tapply function in R is designed for applying a function to subsets of a vector and then combining the results. The primary objective of tapply is to perform grouped data analysis, where you need to apply a function to subsets of data defined by factors or categorical variables. It is particularly useful for summarizing data across groups, such as calculating group means, sums, or other summary statistics.

General Formula of tapply

The general syntax of tapply in R is as follows:

tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
  • X: A vector or an object which can be coerced to a vector. This is the data to be divided into groups and analyzed.

  • INDEX: A factor, or a list of one or more factors, by which to split X. Each factor must have the same length as X.

  • FUN: The function to be applied to each subset of X.

  • ...: Additional arguments to FUN.

  • default: The default value to be used if a subset of X is empty.

  • simplify: If TRUE and FUN returns scalars, the result is a vector; if FALSE, the result is always a list.

tapply returns an array (if simplify is TRUE and the output fits into an array) or a list of values obtained by applying FUN to each subset of X.
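A minimal sketch of one-way and two-way grouping follows, including what happens to a combination of groups that has no data. The vectors `value`, `g1`, and `g2` are made up for illustration.

```r
# Toy data (hypothetical): five values with two grouping variables
value <- c(10, 20, 30, 40, 50)
g1    <- c("a", "a", "b", "b", "b")
g2    <- c("x", "x", "x", "y", "y")

one_way <- tapply(value, g1, mean)                        # one mean per level of g1
two_way <- tapply(value, list(g1, g2), sum)               # rows = g1, columns = g2
filled  <- tapply(value, list(g1, g2), sum, default = 0)  # empty cells become 0
```

The combination of "a" and "y" never occurs, so that cell is NA in `two_way` and 0 in `filled`.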

Example 20

Calculating the average departure delay for each carrier

# Calculate the average departure delay by carrier
tapply(flights$dep_delay, flights$carrier, 
                        function(x) mean(x, na.rm = TRUE))
       9E        AA        AS        B6        DL        EV        F9        FL 
16.725769  8.586016  5.804775 13.022522  9.264505 19.955390 20.215543 18.726075 
       HA        MQ        OO        UA        US        VX        WN        YV 
 4.900585 10.552041 12.586207 12.106073  3.782418 12.869421 17.711744 18.996330 

Example 21

Counting the total number of flights for each month

# Count the total number of flights per month
tapply(flights$flight, flights$month, length)
    1     2     3     4     5     6     7     8     9    10    11    12 
27004 24951 28834 28330 28796 28243 29425 29327 27574 28889 27268 28135 

Example 22

Finding the maximum arrival delay for each origin airport

# Find the maximum arrival delay by origin airport
tapply(flights$arr_delay, flights$origin, 
                   function(x) max(x, na.rm = TRUE))
 EWR  JFK  LGA 
1109 1272  915 

Example 23

Calculating the proportion of delayed flights (arrival delay > 0) for each destination

# Calculate the proportion of delayed flights by destination
tapply(flights$arr_delay > 0, flights$dest, 
                   function(x) mean(x, na.rm = TRUE))
      ABQ       ACK       ALB       ANC       ATL       AUS       AVL       BDL 
0.4212598 0.3939394 0.4401914 0.6250000 0.4719368 0.4077146 0.4559387 0.3495146 
      BGR       BHM       BNA       BOS       BQN       BTV       BUF       BUR 
0.3826816 0.4498141 0.4497041 0.3157369 0.4673423 0.4055777 0.3901532 0.4486486 
      BWI       BZN       CAE       CAK       CHO       CHS       CLE       CLT 
0.3947836 0.4571429 0.8207547 0.5368171 0.4130435 0.4324030 0.4037324 0.4269416 
      CMH       CRW       CVG       DAY       DCA       DEN       DFW       DSM 
0.4302465 0.4776119 0.4510067 0.4517513 0.4393590 0.4494351 0.3428708 0.4971319 
      DTW       EGE       EYW       FLL       GRR       GSO       GSP       HDN 
0.3606467 0.4299517 0.5882353 0.4380936 0.5192308 0.4477212 0.4822785 0.5714286 
      HNL       HOU       IAD       IAH       ILM       IND       JAC       JAX 
0.3409415 0.4325492 0.4412038 0.4069160 0.3831776 0.4204947 0.8095238 0.4700724 
      LAS       LAX       LEX       LGA       LGB       MCI       MCO       MDW 
0.3541667 0.3723325 0.0000000       NaN 0.3615734 0.4848806 0.3970072 0.4655901 
      MEM       MHT       MIA       MKE       MSN       MSP       MSY       MTJ 
0.4454330 0.4431330 0.3325282 0.4868955 0.5125899 0.3997691 0.4029610 0.3571429 
      MVY       MYR       OAK       OKC       OMA       ORD       ORF       PBI 
0.2761905 0.3620690 0.3754045 0.6380952 0.4736842 0.3741398 0.4274756 0.4424233 
      PDX       PHL       PHX       PIT       PSE       PSP       PVD       PWM 
0.4135618 0.4354315 0.3973079 0.3932993 0.4972067 0.3333333 0.5055866 0.4313811 
      RDU       RIC       ROC       RSW       SAN       SAT       SAV       SBN 
0.4368082 0.5098039 0.4007634 0.4003427 0.4056848 0.3899848 0.4753004 0.4000000 
      SDF       SEA       SFO       SJC       SJU       SLC       SMF       SNA 
0.4519928 0.3266409 0.3750854 0.3780488 0.3784861 0.3398613 0.5319149 0.2943350 
      SRQ       STL       STT       SYR       TPA       TUL       TVC       TYS 
0.3913405 0.4391598 0.3281853 0.3954306 0.4117727 0.6564626 0.3684211 0.5311419 
      XNA 
0.4596774 

Example 24

Computing the median air time for flights for each combination of month and carrier

# Compute the median air time by month and carrier
tapply(flights$air_time, list(flights$month, flights$carrier), 
                 median, na.rm = TRUE)
   9E    AA    AS  B6  DL EV    F9  FL    HA MQ    OO  UA US    VX  WN   YV
1  69 171.5 344.0 149 153 88 244.0 119 638.0 89 132.0 202 82 352.0 129 51.0
2  68 181.0 325.0 152 154 87 235.5 118 615.5 86    NA 201 80 346.0 122 49.5
3  73 179.0 328.0 147 151 84 229.0 109 646.0 82    NA 195 77 346.0 119 47.0
4  77 180.0 325.5 146 149 89 237.0 111 630.0 84    NA 198 76 344.5 120 79.0
5  77 169.0 314.0 136 141 84 217.5 105 615.0 78    NA 193 73 327.0 113 77.0
6  83 168.0 323.5 139 143 86 223.0 105 607.0 80  84.5 194 74 330.0 117 75.0
7  83 159.0 319.0 135 140 87 217.5 104 606.0 79    NA 193 74 321.0 113 77.0
8  85 156.0 317.0 134 139 87 217.0 105 614.0 80  69.0 195 74 329.0 116 78.5
9  85 159.0 313.5 132 137 84 215.0 102 605.0 78  68.0 189 73 326.0 112 79.0
10 90 163.0 317.0 137 142 88 227.0 108 619.0 81    NA 195 77 338.0 122 77.5
11 94 166.0 334.0 145 148 92 235.0 113 634.0 84 157.0 202 81 350.0 127 49.0
12 94 173.0 340.0 151 155 96 241.0 117 628.0 88    NA 207 86 353.0 133 84.0

Using mapply function

mapply (Multivariate Apply) in R is a function designed to apply a function to multiple arguments (vectors or lists) simultaneously. The primary objective of mapply is to extend the capabilities of sapply and lapply by allowing the application of functions over multiple arguments. This is particularly useful when you have parallel arrays or lists and you want to apply a function to corresponding elements of each array.

General Formula of mapply

The general syntax of mapply in R is as follows:

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
  • FUN: The function to be applied.

  • ...: Arguments to vectorize over (vectors or lists). Shorter arguments are recycled to the length of the longest.

  • MoreArgs: A list of other arguments to FUN.

  • SIMPLIFY: If TRUE, mapply tries to simplify the result to an array; if FALSE, the result is a list.

  • USE.NAMES: Logical; if TRUE and if ... has names, the result will preserve these names.

mapply can be seen as a multivariate version of sapply. It applies FUN to the first elements of each argument in ..., then to the second elements, third elements, and so on.
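The element-by-element pairing, and the role of MoreArgs, can be sketched as follows. All inputs here are toy values chosen for illustration.

```r
# Pairs 1 with 3, 2 with 2, 3 with 1; results differ in length, so a list is returned
repeated <- mapply(rep, 1:3, 3:1)

# MoreArgs holds arguments that stay fixed for every call
scaled <- mapply(function(x, y, scale) (x + y) * scale,
                 1:3, 4:6,
                 MoreArgs = list(scale = 10))

# Names on the first argument are carried over to the result
labelled <- mapply(function(n, v) paste(n, v),
                   c(a = "x", b = "y"), c("1", "2"))
```

Arguments passed through ... are iterated over together, while anything in MoreArgs is handed to FUN unchanged on every call.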

Example 25

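Calculating the speed (distance divided by air time) for each flight. In the flights dataset, distance is in miles and air_time is in minutes, so the result is in miles per minute.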

# Function to calculate speed
calculate_speed <- function(distance, air_time) distance / air_time

# Calculate speed for each flight
head(mapply(calculate_speed, flights$distance, flights$air_time))
[1] 6.167401 6.237885 6.806250 8.612022 6.568966 4.793333
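Division in R is already vectorized, so for a function like this one the same result can be obtained without mapply. The sketch below compares the two approaches on a few toy values; the vectors `distance` and `air_time` are stand-ins, not the full dataset.

```r
calculate_speed <- function(distance, air_time) distance / air_time

# Toy vectors (hypothetical stand-ins for the dataset columns)
distance <- c(1400, 1416, 1089)
air_time <- c(227, 227, 160)

via_mapply <- mapply(calculate_speed, distance, air_time)  # one call per element
vectorized <- calculate_speed(distance, air_time)          # a single call
```

mapply is most useful when FUN cannot accept whole vectors; when it can, the direct call is shorter and avoids one function call per row.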

Example 26

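Creating a date for each flight. Note that mapply simplifies its result to a plain numeric vector, so the Date class is dropped and each value prints as the number of days since 1970-01-01 (15706 corresponds to 2013-01-01).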

# Function to combine year, month, and day into a date
combine_date <- function(year, month, day) as.Date(paste(year, month, day, sep = "-"))

# Create a date for each flight
head(mapply(combine_date, flights$year, flights$month, flights$day))
[1] 15706 15706 15706 15706 15706 15706
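When mapply simplifies its result, the Date class is lost and only the underlying day counts remain. The sketch below shows two ways to end up with real Date values, using made-up year, month, and day vectors: restoring the class afterwards, or skipping mapply because paste and as.Date are already vectorized.

```r
combine_date <- function(year, month, day) as.Date(paste(year, month, day, sep = "-"))

# Toy inputs (hypothetical)
year  <- c(2013, 2013)
month <- c(1, 12)
day   <- c(1, 31)

raw <- mapply(combine_date, year, month, day)   # plain numbers: days since 1970-01-01

restored   <- as.Date(raw, origin = "1970-01-01")   # option 1: put the class back
vectorized <- combine_date(year, month, day)        # option 2: call the function once
```

Both options give the same dates; the vectorized call is usually the simpler choice.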

Example 27

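Comparing the scheduled arrival time with the actual arrival time to find the difference. Both columns store clock times as integers in HHMM format (for example, 923 means 9:23), so the raw difference is not a number of minutes whenever the two times fall in different hours.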

# Function to calculate time difference
time_difference <- function(scheduled, actual) actual - scheduled

# Calculate the difference in times
head(mapply(time_difference, flights$sched_arr_time, flights$arr_time))
[1]  11  20  73 -18 -25  12
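Because the arrival times are stored as HHMM integers, subtracting them directly mixes hours and minutes. A minimal sketch of a conversion to minutes since midnight is shown below; the helper name `hhmm_to_minutes` and the toy vectors are hypothetical, and the sketch does not handle arrivals that cross midnight.

```r
# Hypothetical helper: convert an HHMM integer (e.g. 923) to minutes since midnight
hhmm_to_minutes <- function(t) (t %/% 100) * 60 + t %% 100

# Toy values in HHMM format (made up for illustration)
sched  <- c(819, 830, 850)
actual <- c(830, 850, 923)

naive      <- actual - sched                                     # mixes hours and minutes
in_minutes <- hhmm_to_minutes(actual) - hhmm_to_minutes(sched)   # true minute difference
```

For the third pair, the naive difference is 73, while the actual gap between 8:50 and 9:23 is 33 minutes.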

Example 28

Checking if both departure and arrival times are delayed

# Function to check if both times are delayed
both_delayed <- function(dep_delay, arr_delay) dep_delay > 0 & arr_delay > 0

# Check if both departure and arrival times are delayed for each flight
head(mapply(both_delayed, flights$dep_delay, flights$arr_delay))
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

Now using the mtcars dataset

Example 29

Convert the mpg (miles per gallon) values to kilometers per liter (1 mile = 1.60934 km, 1 gallon = 3.78541 liters)

# Function to convert miles per gallon to kilometers per liter
mpg_to_kmpl <- function(mpg) mpg * 1.60934 / 3.78541

# Apply the conversion to the mpg column
mapply(mpg_to_kmpl, mtcars$mpg)
 [1]  8.928000  8.928000  9.693257  9.098057  7.950171  7.695086  6.079543
 [8] 10.373486  9.693257  8.162743  7.567543  6.972343  7.354971  6.462171
[15]  4.421486  4.421486  6.249600 13.774628 12.924343 14.412343  9.140571
[22]  6.589714  6.462171  5.654400  8.162743 11.606400 11.053714 12.924343
[29]  6.717257  8.375314  6.377143  9.098057

Example 30

Compute the power-to-weight ratio for each car (horsepower to weight). Horsepower is in hp and weight is in wt (1000 lbs)

# Function to calculate power-to-weight ratio (hp per 1000 lbs)
power_to_weight <- function(hp, wt) hp / wt

# Apply the function to hp and wt columns
mapply(power_to_weight, mtcars$hp, mtcars$wt)
 [1] 41.98473 38.26087 40.08621 34.21462 50.87209 30.34682 68.62745 19.43574
 [9] 30.15873 35.75581 35.75581 44.22604 48.25737 47.61905 39.04762 39.63864
[17] 43.03087 30.00000 32.19814 35.42234 39.35091 42.61364 43.66812 63.80208
[25] 45.51365 34.10853 42.52336 74.68605 83.28076 63.17690 93.83754 39.20863

Example 31

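Rescale qsec (the quarter-mile time in seconds) by the constant factor 100 / (0.25 × 1.60934), where 0.25 × 1.60934 is the quarter-mile distance in kilometres. This is a purely linear rescaling of qsec: the results are in the thousands and should not be read as a number of seconds to reach 100 km/h.

# Function to rescale the quarter-mile time (not a physical 0-100 km/h time)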
time_to_100 <- function(qsec) qsec * (100 / (1/4 * 1.60934))

# Apply the function to the qsec column
mapply(time_to_100, mtcars$qsec)
 [1] 4091.118 4230.306 4625.499 4831.794 4230.306 5025.663 3937.018 4970.982
 [9] 5691.774 4548.448 4697.578 4324.754 4374.464 4473.884 4468.913 4429.145
[17] 4329.725 4839.251 4603.129 4946.127 4973.467 4193.023 4299.899 3830.142
[25] 4237.762 4697.578 4150.770 4200.480 3603.962 3852.511 3628.817 4623.013
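For comparison, a physically motivated estimate can be sketched under the assumption of constant acceleration from a standing start. With distance d covered in time t, the acceleration is a = 2d / t², and the time to reach speed v is v / a. The function name `time_to_100_const_accel` is hypothetical, and real cars do not accelerate at a constant rate, so this is only a rough illustration.

```r
# Hypothetical helper: seconds to reach 100 km/h, assuming constant acceleration
time_to_100_const_accel <- function(qsec) {
  d <- 0.25 * 1609.34     # quarter mile in metres
  v <- 100 / 3.6          # 100 km/h in metres per second
  a <- 2 * d / qsec^2     # acceleration implied by d = a * t^2 / 2
  v / a                   # time needed to reach v at that acceleration
}

# A quarter-mile time of 16.46 seconds gives an estimate of a little over 9 seconds
estimate <- time_to_100_const_accel(16.46)
```

The function is vectorized, so it can be applied to a whole column of quarter-mile times in a single call.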

Example 32

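Rebuild the full name of each car from its row name (car brand and model). The function splits the name into words and joins them again with single spaces, so each name is returned unchanged; because the input is a character vector, mapply also uses it to name the result.

# Function to split a name into words and rejoin them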
full_name <- function(name) paste(strsplit(name, " ")[[1]], collapse = " ")

# Apply the function to row names
mapply(full_name, rownames(mtcars))
            Mazda RX4         Mazda RX4 Wag            Datsun 710 
          "Mazda RX4"       "Mazda RX4 Wag"          "Datsun 710" 
       Hornet 4 Drive     Hornet Sportabout               Valiant 
     "Hornet 4 Drive"   "Hornet Sportabout"             "Valiant" 
           Duster 360             Merc 240D              Merc 230 
         "Duster 360"           "Merc 240D"            "Merc 230" 
             Merc 280             Merc 280C            Merc 450SE 
           "Merc 280"           "Merc 280C"          "Merc 450SE" 
           Merc 450SL           Merc 450SLC    Cadillac Fleetwood 
         "Merc 450SL"         "Merc 450SLC"  "Cadillac Fleetwood" 
  Lincoln Continental     Chrysler Imperial              Fiat 128 
"Lincoln Continental"   "Chrysler Imperial"            "Fiat 128" 
          Honda Civic        Toyota Corolla         Toyota Corona 
        "Honda Civic"      "Toyota Corolla"       "Toyota Corona" 
     Dodge Challenger           AMC Javelin            Camaro Z28 
   "Dodge Challenger"         "AMC Javelin"          "Camaro Z28" 
     Pontiac Firebird             Fiat X1-9         Porsche 914-2 
   "Pontiac Firebird"           "Fiat X1-9"       "Porsche 914-2" 
         Lotus Europa        Ford Pantera L          Ferrari Dino 
       "Lotus Europa"      "Ford Pantera L"        "Ferrari Dino" 
        Maserati Bora            Volvo 142E 
      "Maserati Bora"          "Volvo 142E"