Data Summarization

Author

Dr. Mohammad Nasir Abdullah

Data Summarization

Data summarization is an essential process in statistical programming, which involves reducing and simplifying large datasets into more manageable and understandable forms. This process is crucial as it helps in highlighting the key aspects of the data by extracting important patterns, trends, and relationships. In the realm of statistical analysis, summarization is not just about making the data smaller or simpler; it’s about capturing the essence of the data in a way that is both informative and useful for analysis.

Significance in Large Datasets

The advent of big data has made data summarization more important than ever. With the sheer volume of data available today, it’s practically impossible to analyze every individual data point. Summarization helps in distilling large datasets to a form where they can be easily interpreted and analyzed. This process not only saves time and computational resources but also aids in making data-driven decisions more effectively.

For instance, summarizing sales data over years can reveal trends and patterns that might not be evident when looking at daily sales figures. Similarly, summarizing survey data can help in quickly understanding the general opinion or trend, without getting lost in the myriad of individual responses.

Types of Summarization Techniques

  1. Numerical Summarization: This involves using statistical measures to summarize the key characteristics of numerical data. Techniques include calculating measures of central tendency (like mean, median, and mode) and measures of variability or spread (like range, variance, and standard deviation). These techniques are fundamental in providing a quick snapshot of the data’s overall distribution and central values.

  2. Categorical Summarization: When dealing with categorical (or qualitative) data, summarization often involves understanding the frequency or occurrence of different categories. Techniques include creating frequency tables, cross-tabulations, and using measures like mode. This type of summarization is particularly useful in understanding the distribution of categorical variables, like customer categories or product types.

  3. Visual Summarization: Visual representations of data, such as histograms, bar charts, box plots, and scatter plots, provide an intuitive way to summarize and understand complex datasets. These techniques are invaluable in revealing patterns, trends, outliers, and relationships in data that might not be obvious in textual or numerical summaries.

In R, these summarization techniques are supported by a variety of functions and packages, making it an ideal environment for both basic and advanced data summarization tasks. This chapter will delve into these techniques, providing the reader with the knowledge and tools to effectively summarize and interpret large datasets using R.

Summarizing Numerical Data

Introduction to Normality Testing

Normality testing is a fundamental step in statistical analysis, particularly when deciding which statistical methods to apply. Many statistical techniques assume that the data follows a normal (or Gaussian) distribution. However, real-world data often deviates from this idealized distribution. Therefore, assessing the normality of your dataset is crucial before applying techniques that assume normality, such as parametric tests.

In R, there are several methods to test for normality, including graphical methods like Q-Q plots and statistical tests like the Shapiro-Wilk test. Let’s explore how to perform these tests in R with an example dataset.

Graphical Methods to Assess Normality

1) Histogram

library(palmerpenguins)
library(ggplot2)

data <- na.omit(penguins$body_mass_g)  # Remove NA values

num_bins_sturges <- 1 + log2(length(data))  # calculate the number of bins using Sturges' rule

ggplot(data.frame(data), aes(x = data)) +
    geom_histogram(bins = round(num_bins_sturges)) +
    ggtitle("Histogram with Sturges' Rule") + theme_light()

Based on the histogram, the data appear slightly skewed to the right.

2) Boxplot

ggplot(penguins, aes(body_mass_g)) + geom_boxplot() +
  ggtitle("A Boxplot showing distribution of weight of Penguins in grams") + 
  theme_minimal() + 
  scale_y_continuous(limits=c(-1,1))

The boxplot likewise shows that the data are slightly skewed to the right.

3) Kurtosis and Skewness

Kurtosis is a measure of how flat or peaked a sample (or population) is relative to a normal distribution; it reflects how concentrated the values are in the center of the distribution. Skewness, on the other hand, measures the horizontal symmetry of a sample relative to a normal distribution. The skewness coefficient can be positive (right skew), zero (no skew), or negative (left skew).

Steps for examining sample normality using skewness and kurtosis:

1) Determine sample mean and standard deviation
2) Determine sample kurtosis and skewness

Formula for Skewness:

$$\text{skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

Formula for Standard Error of Skewness:

$$SE_{\text{skewness}} = \sqrt{\frac{6\,n\,(n-1)}{(n-2)(n+1)(n+3)}}$$

Formula for Kurtosis (excess kurtosis):

$$\text{kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

Formula for Standard Error of Kurtosis:

$$SE_{\text{kurtosis}} = 2\, SE_{\text{skewness}} \sqrt{\frac{n^2 - 1}{(n-3)(n+5)}}$$

where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, and $n$ the sample size.


3) Calculate the standard error of the kurtosis and the standard error of the skewness
4) Calculate Z-Score for the kurtosis and Z-Score for the skewness

Formula for Z-Score of Skewness:

$$Z_{\text{skewness}} = \frac{\text{skewness} - 0}{SE_{\text{skewness}}}$$

Formula for Z-Score of Kurtosis:

$$Z_{\text{kurtosis}} = \frac{\text{kurtosis} - 0}{SE_{\text{kurtosis}}}$$


5) Compare the Z-Score to the critical region obtained from the normal distribution

  • The Z-score is used to examine how closely the sample approximates a normal distribution.

  • Z-score values for kurtosis and skewness must fall between -2 and 2 to pass the normality assumption for alpha=0.05.

  • Z-score for Skewness: This score indicates how many standard deviations the skewness of the dataset is from the mean skewness of a normally distributed dataset. A higher absolute value of the z-score indicates a greater degree of asymmetry.

  • Z-score for Kurtosis: Similarly, this score tells us how many standard deviations the kurtosis of the dataset is from the mean kurtosis of a normal distribution. It reflects the “tailedness” or the peak height of the distribution.

library(e1071)
library(dplyr)


penguins %>%
  filter(!is.na(body_mass_g)) %>%
  summarize(skewness = skewness(body_mass_g, type=2, na.rm=T), 
           se_skewness = sqrt((6 * length(body_mass_g) * (length(body_mass_g) - 1)) / ((length(body_mass_g) - 2) * (length(body_mass_g) + 1) * (length(body_mass_g) + 3))), 
           kurtosis =  kurtosis(body_mass_g, type = 2, na.rm=T),
           se_kurtosis = 2 * se_skewness * sqrt((length(body_mass_g)^2 - 1) / ((length(body_mass_g) - 3) * (length(body_mass_g) + 5))),
           
          Z_Skewness = ((skewness-0)/se_skewness),
          Z_Kurtosis = ((kurtosis - 0)/se_kurtosis))
# A tibble: 1 × 6
  skewness se_skewness kurtosis se_kurtosis Z_Skewness Z_Kurtosis
     <dbl>       <dbl>    <dbl>       <dbl>      <dbl>      <dbl>
1    0.470       0.132   -0.719       0.263       3.57      -2.73

Interpretation Guidelines

  • Z-score between -2 and 2: This range is generally considered to indicate that the skewness or kurtosis is not significantly different from what would be expected in a normal distribution. The data can be considered approximately normal in this aspect.

  • Z-score less than -2: This suggests a significant deviation from normality in a negative direction. For skewness, it would mean a left skew (tail is on the left side). For kurtosis, it indicates a platykurtic distribution (less peaked than a normal distribution).

  • Z-score greater than 2: This indicates a significant deviation in a positive direction. For skewness, it would imply a right skew (tail is on the right side). For kurtosis, it suggests a leptokurtic distribution (more peaked than a normal distribution).

Here the Z-score for skewness exceeds 2, indicating that the data are skewed to the right, and the Z-score for kurtosis is less than -2, indicating a platykurtic distribution (less peaked than a normal distribution).

4) Q-Q plots

Q-Q plots are scatterplots created by plotting two sets of quantiles against each other: the quantiles from the dataset and the quantiles of a normal distribution. If the data are normally distributed, the points in the Q-Q plot will approximately lie on a straight line.

Interpreting the Q-Q Plot

  • Linearity: In a Q-Q plot, if the data are normally distributed, the points will approximately lie on a straight line. The closer the points are to the line (qqline), the more normal the distribution is.

  • Departures from Linearity: Deviations from the line can indicate departures from normality:

    • Left Tail (lower end of the plot): If the points fall below the line at the lower end, it suggests a left-skewed distribution (a heavier, longer left tail than normal).

    • Right Tail (upper end of the plot): If the points deviate above the line at the upper end, it indicates a right-skewed distribution (heavier right tail than normal).

    • Center of the Plot: If the points deviate from the line in the middle of the plot, it suggests a difference in the median or a distribution with different kurtosis from normal (either more peaked or flatter than a normal distribution).

#Create Q-Q plot with informative labels
qqnorm(penguins$body_mass_g,
       main = 'Q-Q Plot for Normality', 
       xlab = 'Theoretical Dist', 
       ylab = 'Sample Dist')

#Adding straight diagonal reference line to plot
qqline(penguins$body_mass_g, col = 'steelblue')

Another way to construct a Q-Q plot is with the ggpubr package:

library(ggpubr)
ggqqplot(penguins$body_mass_g)

By examining how the points in the Q-Q plot deviate from the reference line, we can infer if and how the body_mass_g distribution deviates from a normal distribution. Remember, a Q-Q plot provides a visual assessment and should be used in conjunction with other methods for a comprehensive analysis of normality.

5) P-P Plot

A P-P (Probability-Probability) plot is a graphical technique used to assess how closely two probability distributions agree with each other. Specifically, it compares the cumulative distribution function (CDF) of a sample data set against the CDF of a theoretical distribution, which is often a normal distribution in many applications. The P-P plot is particularly useful for evaluating the goodness of fit of a distribution model to observed data.

How a P-P Plot Works:

  1. Theoretical Distribution: This is often a standard distribution like a normal distribution, but it could be any specified distribution. The cumulative probabilities from this distribution are plotted on one axis (usually the x-axis).

  2. Sample Data Distribution: The observed data’s cumulative probabilities are calculated and plotted on the other axis (usually the y-axis).

  3. Plotting: On a P-P plot, each point represents a cumulative probability from the sample data against the corresponding cumulative probability from the theoretical distribution. If the sample data perfectly follow the theoretical distribution, all the points will lie on the diagonal line.

Interpreting a P-P Plot:

  • Linearity: If the points on a P-P plot lie close to the diagonal line (45-degree line), this indicates that the sample data follow the theoretical distribution closely.

  • Deviations from Linearity:

    • Both Ends Deviate: If the points deviate from the diagonal at both the lower and upper tails, it suggests that the sample distribution has heavier or lighter tails than the theoretical distribution.

    • Middle Deviates: If the points deviate in the middle of the plot but are closer at the tails, this could indicate a discrepancy in the central part of the distribution, like a difference in variance between the sample and theoretical distributions.

  • S-shaped Curve: An S-shaped curve in the P-P plot indicates that one of the distributions is more skewed compared to the other.

library(CircStats)
data1 <- penguins %>% 
  filter(!is.na(body_mass_g))

pp.plot(data1$body_mass_g, ref.line=TRUE)

         mu     kappa
1 0.3731475 0.2059357
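Note that pp.plot() from CircStats is designed for circular data: it compares the sample against a von Mises distribution, which is why it reports the estimates mu and kappa above. To compare the sample against a normal distribution specifically, a P-P plot can also be built directly with base R. The sketch below is our own construction (not a package function): it plots the empirical cumulative probabilities against the cumulative probabilities of a normal distribution with the sample's mean and standard deviation.

# A minimal normal P-P plot built with base R (a sketch)
x <- sort(data1$body_mass_g)                          # ordered sample (NAs already removed)
theoretical <- pnorm(x, mean = mean(x), sd = sd(x))   # normal CDF evaluated at each observation
empirical <- ppoints(length(x))                       # empirical cumulative probabilities

plot(theoretical, empirical,
     xlab = "Theoretical cumulative probability",
     ylab = "Empirical cumulative probability",
     main = "Normal P-P plot of body_mass_g")
abline(0, 1, col = "steelblue")                       # 45-degree reference line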

6) Statistical Test

  1. Shapiro-Wilk Test

    • The Shapiro-Wilk test is widely used for testing normality. This test checks the null hypothesis that the data were drawn from a normal distribution.

    • It is generally preferred for small sample sizes (< 50 samples), but can be used for larger ones.

shapiro.test(penguins$body_mass_g)

    Shapiro-Wilk normality test

data:  penguins$body_mass_g
W = 0.95921, p-value = 3.679e-08
library(broom)
tidy(shapiro.test(data1$body_mass_g))
# A tibble: 1 × 3
  statistic      p.value method                     
      <dbl>        <dbl> <chr>                      
1     0.959 0.0000000368 Shapiro-Wilk normality test
  2. Kolmogorov-Smirnov Test (K-S Test)

    • The K-S test compares the empirical distribution function of the data with the expected distribution function for a normal distribution.

    • This test is more sensitive towards the center of the distribution than the tails.

#removing missing value
data1 <- penguins %>%
  filter(!is.na(body_mass_g))
# Standardizing the data
standardized_data <- (data1$body_mass_g - mean(data1$body_mass_g)) / sd(data1$body_mass_g)

# Kolmogorov-Smirnov Test
ks.test(standardized_data, "pnorm")

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  standardized_data
D = 0.10408, p-value = 0.00121
alternative hypothesis: two-sided
  3. Anderson-Darling Test

    • Similar to the K-S test, the Anderson-Darling test is a test of goodness of fit.

    • This test gives more weight to the tails than the K-S test and is thus more sensitive to outliers.

library(nortest)
ad.test(penguins$body_mass_g)

    Anderson-Darling normality test

data:  penguins$body_mass_g
A = 4.543, p-value = 2.757e-11
  4. Lilliefors Test

    • A modification of the K-S test, the Lilliefors test is specifically designed for testing normality.

    • It is particularly useful when the mean and variance of the distribution are unknown.

library(nortest)
lillie.test(penguins$body_mass_g)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  penguins$body_mass_g
D = 0.10408, p-value = 1.544e-09
  5. Jarque-Bera Test

    • The Jarque-Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.

    • It is more commonly used in econometrics.

library(tseries)
#removing missing value
data1 <- penguins %>%
  filter(!is.na(body_mass_g))

jarque.bera.test(data1$body_mass_g)

    Jarque Bera Test

data:  data1$body_mass_g
X-squared = 20.014, df = 2, p-value = 4.508e-05
  6. D’Agostino Skewness Test

    • D’Agostino’s K-squared test is a statistical method used to test the hypothesis that a given sample comes from a normally distributed population, focusing particularly on the skewness of the distribution.

    • Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable.

    • Focus on Skewness: The test specifically examines the skewness of the distribution, which is the degree to which it leans to one side. In a perfectly normal distribution, the skewness would be zero, indicating symmetry.

    • Test Statistic: The test involves calculating a standardized skewness measure and then squaring it to get the test statistic. This statistic essentially measures how far the sample’s skewness deviates from that expected in a normal distribution.

    • Sample Size Consideration: The calculation of the test statistic incorporates the sample size, as skewness can be more variable in smaller samples. The test is adjusted to account for this variability.

    • P-Value: The test produces a p-value, which indicates the probability of observing the given degree of skewness (or more extreme) if the null hypothesis of normality is true. A low p-value (typically less than 0.05) would lead to rejecting the null hypothesis, suggesting the distribution is not normal with respect to skewness.

    • Applicability: While D’Agostino’s K-squared test is robust for various sample sizes, it is particularly useful for larger samples where the approximation of the p-value becomes more accurate.

    • Interpretation: It’s important to interpret the results within the context of your data. Even if a dataset fails this test, it might still be approximately normal enough for certain statistical methods that assume normality. Conversely, a pass does not guarantee all aspects of normality (like kurtosis).

library(moments)

#removing missing value
data1 <- penguins %>%
  filter(!is.na(body_mass_g))

moments::agostino.test(data1$body_mass_g)

    D'Agostino skewness test

data:  data1$body_mass_g
skew = 0.46826, z = 3.44537, p-value = 0.0005703
alternative hypothesis: data have a skewness
  7. Pearson’s Chi-Square Normality Test

    • The Pearson Chi-Square normality test, often referred to as the Pearson’s Chi-Squared test for goodness-of-fit, is a statistical test used to determine if a sample comes from a population with a specific distribution, in this case, a normal distribution.

    • Basic Concept: The test compares the observed frequencies in each category/bin of the data to the frequencies expected if the data followed a normal distribution. The bins are typically intervals of values in your data set.

    • Test Statistic: The test statistic is computed by summing the squared differences between the observed and expected frequencies, divided by the expected frequencies.

    • Degrees of Freedom: The degrees of freedom for the test are typically the number of bins minus the number of parameters estimated (mean and standard deviation for a normal distribution) and minus one.

    • P-Value: The test yields a p-value which indicates the probability of observing the data if it actually comes from a normal distribution. A low p-value (typically less than 0.05) suggests that the data do not come from a normal distribution.

    • Cautions: The test can be sensitive to the choice of bins and is generally less powerful for detecting deviations from normality than other tests like the Shapiro-Wilk test. It is more suitable for large sample sizes.

library(nortest)
pearson.test(penguins$body_mass_g)

    Pearson chi-square normality test

data:  penguins$body_mass_g
P = 59.947, p-value = 2.087e-06
  8. Shapiro-Francia Normality Test

    • The Shapiro-Francia normality test is a modification of the Shapiro-Wilk test for normality. It is particularly useful for testing the normality of distributions that are suspected of having light tails. Compared to the Shapiro-Wilk test, the Shapiro-Francia test is less sensitive to deviations in the tails of the distribution.

    • Focus on Light Tails: The test is specifically tailored to detect departures from normality due to light tails.

    • Test Statistic: Similar to the Shapiro-Wilk test, the Shapiro-Francia test computes a test statistic that measures the closeness of the data to the normal distribution. However, it uses a different weighting scheme that makes it less sensitive to the tails.

    • Interpretation: As with other normality tests, a low p-value (typically less than 0.05) suggests that the data do not come from a normal distribution.

library(nortest)
sf.test(penguins$body_mass_g)

    Shapiro-Francia normality test

data:  penguins$body_mass_g
W = 0.96126, p-value = 4.027e-07
  9. Cramer-von Mises Normality Test

    • The Cramér-von Mises criterion is a goodness-of-fit test used in statistics to test the hypothesis that a given set of observations is drawn from a specified distribution. In the context of testing for normality, the Cramér-von Mises test compares the empirical cumulative distribution function (ECDF) of the sample data against the CDF of a normal distribution.

    • Test Statistic: The Cramér-von Mises test statistic measures the sum of squared differences between the ECDF of the sample and the CDF of the theoretical normal distribution.

    • Sensitivity: Unlike tests that may focus more on the tails (like the Anderson-Darling test) or the center of the distribution (like the Kolmogorov-Smirnov test), the Cramér-von Mises test is a more general test of goodness-of-fit and doesn’t give special weight to any part of the distribution.

    • P-Value: The test yields a p-value, which can be used to determine whether to reject the null hypothesis that the data are drawn from a normal distribution. A low p-value (usually less than 0.05) suggests that the data are not normally distributed.

library(nortest)
cvm.test(penguins$body_mass_g)

    Cramer-von Mises normality test

data:  penguins$body_mass_g
W = 0.75983, p-value = 2.474e-08

Exercise 1

1) Apply the Lilliefors test to dep_delay across all flights in nycflights13. What does the result indicate about the distribution of departure delays?

2) Assess whether the body mass (body_mass_g) of these penguins follows a normal distribution for each species. [palmerpenguins]

3) Perform the Jarque-Bera test on flipper_length_mm for penguins on each island (island). Which island exhibits the most normal-like distribution in terms of flipper length?

Basic Statistical Summarisation

Data summarization is a crucial aspect of statistical analysis, providing a way to describe and understand large datasets through a few summary statistics. Among the key concepts in data summarization are measures of central tendency, position, and dispersion. Each of these measures gives different insights into the nature of the data. [for continuous/numerical data only]

1. Measures of Central Tendency

Measures of central tendency describe the center point or typical value of a dataset. The most common measures are:

  • Mean (Arithmetic Average): The sum of all values divided by the number of values. It’s sensitive to outliers and can be skewed by them.
# calculating mean of numerical variable
mean(penguins$body_mass_g, na.rm = TRUE)
[1] 4201.754
  • Median: The middle value when the data is sorted in ascending order. It’s less affected by outliers and skewness and provides a better central value for skewed distributions.
#calculating median of numerical variable
median(penguins$body_mass_g, na.rm=TRUE)
[1] 4050
  • Mode: The most frequently occurring value in the dataset. There can be more than one mode in a dataset (bimodal, multimodal). Useful in understanding the most common value, especially for categorical data.
as.numeric(names(sort(table(penguins$body_mass_g), decreasing = TRUE)[1]))
[1] 3800
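Base R has no built-in function for the statistical mode, so a small helper function can make the idea reusable; the sketch below (the name get_mode is our own) also returns every value tied for the highest frequency in multimodal data.

# A reusable mode helper (a sketch): returns all values with the maximum frequency
get_mode <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values
  counts <- table(x)                 # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]
  type.convert(modes, as.is = TRUE)  # convert back from character where possible
}

get_mode(penguins$body_mass_g)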

2. Measures of Position

Measures of position describe how data points fall in relation to the distribution or to each other. These include:

  • Percentiles: Values below which a certain percentage of the data falls. For example, the 25th percentile (or 1st quartile) is the value below which 25% of the data lies.
quantile(penguins$body_mass_g, na.rm=TRUE)
  0%  25%  50%  75% 100% 
2700 3550 4050 4750 6300 
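Specific percentiles can be requested through the probs argument; for example, the 5th, 10th, and 90th percentiles of body mass:

# Selected percentiles of body mass
quantile(penguins$body_mass_g, probs = c(0.05, 0.10, 0.90), na.rm = TRUE)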
  • Quartiles: Special percentiles that divide the dataset into four equal parts. The median is the second quartile.
#1st Quartile
quantile(penguins$body_mass_g, 0.25, na.rm=TRUE)
 25% 
3550 
#2nd Quartile
quantile(penguins$body_mass_g, 0.50, na.rm=TRUE) #same as median
 50% 
4050 
#3rd Quartile
quantile(penguins$body_mass_g, 0.75, na.rm=TRUE)
 75% 
4750 
  • Interquartile Range (IQR): The range between the first and third quartiles (25th and 75th percentiles). It represents the middle 50% of the data and is a measure of variability that’s not influenced by outliers.
IQR(penguins$body_mass_g, na.rm=TRUE)
[1] 1200

3. Measures of Dispersion

Measures of dispersion or variability tell us about the spread of the data points in a dataset:

  • min: To find the minimum value in the variable
min(penguins$body_mass_g, na.rm = TRUE)
[1] 2700
  • max: To find the maximum value in the variable
max(penguins$body_mass_g, na.rm=TRUE)
[1] 6300
  • length: to find how many observations there are in the variable (note that this count includes missing values)
length(penguins$body_mass_g)
[1] 344
  • Range: The difference between the highest and lowest values. It’s simple but can be heavily influenced by outliers.
max(penguins$body_mass_g, na.rm=TRUE) - min(penguins$body_mass_g, na.rm=TRUE)
[1] 3600
  • Variance: The average of the squared differences from the mean. It gives a sense of the spread of the data, but it’s not in the same unit as the data.
var(penguins$body_mass_g, na.rm=TRUE)
[1] 643131.1
  • Standard Deviation (SD): The square root of the variance. It’s in the same units as the data and describes how far data points tend to deviate from the mean.
sd(penguins$body_mass_g, na.rm=TRUE)
[1] 801.9545
  • Coefficient of Variation (CV): The ratio of the standard deviation to the mean. It’s a unitless measure of relative variability, useful for comparing variability across datasets with different units or means.
sd(penguins$body_mass_g, na.rm=TRUE) / mean(penguins$body_mass_g, na.rm=TRUE) * 100
[1] 19.08618

The summary() Function

The summary() function in R is a generic function used to produce result summaries of various model and data objects. When applied to a data frame, it provides a quick overview of the statistical properties of each column. The function is particularly useful for getting a rapid sense of the data, especially during the initial stages of data analysis.

Key Features of summary() in R:

  1. Applicability to Different Objects: The summary() function can be used on different types of objects in R, including vectors, data frames, and model objects. The output format varies depending on the type of object.

  2. Default Output for Data Frames: For a data frame, summary() typically returns the following statistics for each column:

    • For numeric variables: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum.

    • For factor variables: Counts for each level, and NA count if there are missing values.

  3. Handling of Missing Values: The function includes NA values in its output, providing a count of missing values, which is crucial for data cleaning and preprocessing.

  4. Customization: The behavior of summary() can be customized for user-defined classes (S3 or S4) in R. This means that when you create a new type of object, you can also define what summary() should return when applied to objects of this type.

  5. Use in Exploratory Data Analysis (EDA): It is often used as a preliminary step in EDA to get a sense of the data distribution, identify possible outliers, and detect missing values.

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
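As noted in point 4 of the list above, summary() is a generic function, so its behaviour can be customized for your own classes. A minimal S3 sketch (the class name penguin_report and its contents are invented purely for illustration):

# Define a simple object with a custom class
penguin_report <- structure(list(weights = na.omit(penguins$body_mass_g)),
                            class = "penguin_report")

# Define what summary() should do for this class
summary.penguin_report <- function(object, ...) {
  cat("Penguin body mass (g):", length(object$weights), "birds,",
      "mean =", round(mean(object$weights), 1), ", sd =", round(sd(object$weights), 1), "\n")
  invisible(object)
}

summary(penguin_report)  # dispatches to summary.penguin_report()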

Notes:

  • While summary() provides a quick and useful overview, it’s often just a starting point for data analysis. Depending on the results, you might need more detailed analysis, such as specific statistical tests or detailed data visualizations.

  • The function is particularly handy for quickly checking data after importation, allowing for a rapid assessment of data quality, structure, and potential areas that may require further investigation.

Data Summarization on Matrices

  1. rowMeans(x):

    • Purpose: Calculates the mean of each row in a matrix x.

    • Use Case: Useful when you need to find the average across different variables (columns) for each observation (row).

    • Output: Returns a numeric vector containing the mean of each row.

data <- penguins
# Selecting only numeric columns for the matrix
numeric_data <- data[, sapply(data, is.numeric)]
numeric_matrix <- as.matrix(na.omit(numeric_data))

# Apply summarization functions
# Means of each row
row_means <- rowMeans(numeric_matrix)
  2. colMeans(x):

    • Purpose: Computes the mean of each column in a matrix x.

    • Use Case: Used when you want to find the average value of each variable (column) across all observations (rows).

    • Output: Produces a numeric vector with the mean of each column.

# Means of each column
colMeans(numeric_matrix)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02924 
  3. rowSums(x):

    • Purpose: Calculates the sum of each row in a matrix x.

    • Use Case: Helpful for aggregating data across multiple variables (columns) for each individual observation (row).

    • Output: Returns a numeric vector containing the sum of each row.

# Sums of each row
row_sums <- rowSums(numeric_matrix)
  4. colSums(x):

    • Purpose: Computes the sum of each column in a matrix x.

    • Use Case: Useful for aggregating data for each variable (column) across all observations (rows).

    • Output: Generates a numeric vector with the sum of each column.

# Sums of each column
colSums(numeric_matrix)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
          15021.3            5865.7           68713.0         1437000.0 
             year 
         686746.0 

Detecting how many missing values

To detect how many missing values there are in a variable:

table(is.na(penguins$body_mass_g))

FALSE  TRUE 
  342     2 

The TRUE count is the number of missing values for body_mass_g; FALSE is the number of non-missing observations.
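Equivalently, sum(is.na()) counts the missing values in one variable directly, and colSums(is.na()) reports the count for every column of the data frame at once:

# Count missing values in one variable
sum(is.na(penguins$body_mass_g))

# Count missing values in every column
colSums(is.na(penguins))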

Apply function

We can calculate summary statistics for several variables simultaneously using sapply(); this function returns a numerical summary measure for every numeric column. The apply family of functions will be discussed in another lecture.

Mean

sapply(numeric_data, mean, na.rm=TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02907 

Standard Deviation

sapply(numeric_data, sd, na.rm=TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
        5.4595837         1.9747932        14.0617137       801.9545357 
             year 
        0.8183559 

Sum / total in a column

sapply(numeric_data, sum, na.rm=T)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
          15021.3            5865.7           68713.0         1437000.0 
             year 
         690762.0 

Basically, we can substitute any numerical summary function as the second argument.
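We can also pass our own function to sapply() to obtain several measures for every numeric column in a single call; a brief sketch:

# Several summary measures per numeric column in one sapply() call
sapply(numeric_data, function(x) c(mean   = mean(x, na.rm = TRUE),
                                   sd     = sd(x, na.rm = TRUE),
                                   median = median(x, na.rm = TRUE)))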

Summary Statistics by group

In data analysis, when a categorical variable is present, we often want to compute numerical summaries within each category. For example, we may want the mean and standard deviation by gender (male and female) rather than for the whole dataset.

Example: calculating the mean and standard deviation of mpg by number of cylinders (mtcars dataset)

#Testing the normality of Miles per Gallon variable by Number of cylinders
#Selecting 4 cylinders
cyl4 <- mtcars %>%
  filter(cyl==4)
lillie.test(cyl4$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  cyl4$mpg
D = 0.16784, p-value = 0.5226
#Selecting 6 cylinders
cyl6 <- mtcars%>%
  filter(cyl==6)
lillie.test(cyl6$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  cyl6$mpg
D = 0.23502, p-value = 0.2863
#Selecting 8 cylinders
cyl8 <- mtcars%>%
  filter(cyl==8)
lillie.test(cyl8$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  cyl8$mpg
D = 0.16305, p-value = 0.3983
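Instead of filtering each cylinder group manually, the same test can be run for every group in one step with group_by() and summarise(); a sketch, assuming dplyr and nortest are loaded:

# Lilliefors test p-value for mpg within each cylinder group
mtcars %>%
  group_by(cyl) %>%
  summarise(lillie_p = lillie.test(mpg)$p.value)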

Since the normality assumption is met in each group, we calculate the mean and standard deviation of mpg by cyl.

mtcars %>%
  group_by(factor(cyl)) %>%
  summarise(mean_mpg = mean(mpg),
            sd_mpg = sd(mpg))
# A tibble: 3 × 3
  `factor(cyl)` mean_mpg sd_mpg
  <fct>            <dbl>  <dbl>
1 4                 26.7   4.51
2 6                 19.7   1.45
3 8                 15.1   2.56
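The same grouped summary can be extended to several variables at once with across(); a sketch, assuming dplyr version 1.0 or later:

# Mean and SD of mpg and wt by number of cylinders
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, wt), list(mean = mean, sd = sd)))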

Example: calculating the median and IQR, as we would for non-normally distributed data (mtcars dataset). The mpg values are actually consistent with normality in each transmission group, but we treat them as non-normal here for the sake of demonstration.

#checking normality
Spe1 <- mtcars %>% 
  filter(am==0)
lillie.test(Spe1$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  Spe1$mpg
D = 0.087337, p-value = 0.9648
Spe2 <- mtcars %>% 
  filter(am==1)
lillie.test(Spe2$mpg)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  Spe2$mpg
D = 0.14779, p-value = 0.6107
mtcars %>%
  group_by(am) %>%
  filter(!is.na(am)) %>%
  summarise(median_mpg = median(mpg),
            IQR_mpg = IQR(mpg))
# A tibble: 2 × 3
     am median_mpg IQR_mpg
  <dbl>      <dbl>   <dbl>
1     0       17.3    4.25
2     1       22.8    9.4 

To present these statistical measures in a table:

Table 1: Descriptive statistics

Factor                 mean (sd)       median (IQR)
Number of Cylinders
  4                    26.67 (4.51)
  6                    19.70 (1.45)
  8                    15.10 (2.56)
Transmission
  Automatic                            17.30 (4.25)
  Manual                               22.80 (9.40)

Several useful packages for numerical data

1) Hmisc package

library(Hmisc)
Hmisc::describe(penguins$bill_length_mm)
penguins$bill_length_mm 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2      164        1    43.92    6.274    35.70    36.60 
     .25      .50      .75      .90      .95 
   39.23    44.45    48.50    50.80    51.99 

lowest : 32.1 33.1 33.5 34   34.1, highest: 55.1 55.8 55.9 58   59.6

2) pastecs package

library(pastecs)
stat.desc(penguins$bill_length_mm, desc=TRUE, p=0.95)
     nbr.val     nbr.null       nbr.na          min          max        range 
3.420000e+02 0.000000e+00 2.000000e+00 3.210000e+01 5.960000e+01 2.750000e+01 
         sum       median         mean      SE.mean CI.mean.0.95          var 
1.502130e+04 4.445000e+01 4.392193e+01 2.952205e-01 5.806825e-01 2.980705e+01 
     std.dev     coef.var 
5.459584e+00 1.243020e-01 

3) psych package

library(psych)
psych::describe(penguins$bill_length_mm)
   vars   n  mean   sd median trimmed  mad  min  max range skew kurtosis  se
X1    1 342 43.92 5.46  44.45   43.91 7.04 32.1 59.6  27.5 0.05    -0.89 0.3

We can also describe the whole dataset:

psych::describe(penguins)
                  vars   n    mean     sd  median trimmed    mad    min    max
species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
                   range  skew kurtosis    se
species*             2.0  0.16    -1.73  0.05
island*              2.0  0.61    -0.91  0.04
bill_length_mm      27.5  0.05    -0.89  0.30
bill_depth_mm        8.4 -0.14    -0.92  0.11
flipper_length_mm   59.0  0.34    -1.00  0.76
body_mass_g       3600.0  0.47    -0.74 43.36
sex*                 1.0 -0.02    -2.01  0.03
year                 2.0 -0.05    -1.51  0.04

4) skimr package

library(skimr)
skim(penguins$bill_length_mm)
Data summary
Name penguins$bill_length_mm
Number of rows 344
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁

We can also describe the whole dataset:

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Summarizing Categorical Data

Summarizing categorical data is an essential part of data analysis, especially when dealing with survey results, demographic information, or any data where variables are qualitative rather than quantitative. The goal is to gain insights into the distribution of categories, identify patterns, and make inferences about the population being studied.

Key Concepts in Summarizing Categorical Data:

  1. Frequency Counts:

    • The most basic form of summarization for categorical data is to count the number of occurrences of each category.

    • In R, the table() function is commonly used for this purpose.

# To tabulate categorical data
table(penguins$species)

   Adelie Chinstrap    Gentoo 
      152        68       124 
  2. Proportions and Percentages:

    • Converting frequency counts into proportions or percentages provides a clearer understanding of the data relative to the whole.

    • This is particularly useful when comparing groups of different sizes.

#to get relative frequency by group
prop.table(table(penguins$species))

   Adelie Chinstrap    Gentoo 
0.4418605 0.1976744 0.3604651 

To combine the frequency and percentage values in one table:

freq1 <- as.data.frame(table(penguins$species))
percent1 <- as.data.frame(prop.table(table(penguins$species))*100)
names(percent1)[2] <- "percentage"
cbind(freq1, percent1[2])
       Var1 Freq percentage
1    Adelie  152   44.18605
2 Chinstrap   68   19.76744
3    Gentoo  124   36.04651
combine <- cbind(freq1, round(percent1[2],2))
combine
       Var1 Freq percentage
1    Adelie  152      44.19
2 Chinstrap   68      19.77
3    Gentoo  124      36.05
  3. Cross-tabulation:

    • Cross-tabulation (or contingency tables) involves summarizing two or more categorical variables simultaneously, making it possible to observe the relationship between them.

    • In R, this can be achieved using the table() function with multiple variables.

# to tabulate the cross tabulation between species and island
table(penguins$species, penguins$island)
           
            Biscoe Dream Torgersen
  Adelie        44    56        52
  Chinstrap      0    68         0
  Gentoo       124     0         0
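Row or column proportions for a cross-tabulation can be obtained by passing the table to prop.table() with a margin argument (1 for row proportions, 2 for column proportions):

# Row proportions: distribution of islands within each species
prop.table(table(penguins$species, penguins$island), margin = 1)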

Several Useful packages for categorical data analysis

1) janitor package 

library(janitor)
tabyl(penguins$species, sort=TRUE)
 penguins$species   n   percent
           Adelie 152 0.4418605
        Chinstrap  68 0.1976744
           Gentoo 124 0.3604651

2) epiDisplay package

library(epiDisplay)
tab1(penguins$species, sort.group = "decreasing", 
     cum.percent = TRUE)

penguins$species : 
          Frequency Percent Cum. percent
Adelie          152    44.2         44.2
Gentoo          124    36.0         80.2
Chinstrap        68    19.8        100.0
  Total         344   100.0        100.0

It also produces a bar chart of the frequencies.

3) summarytools package

library(summarytools)
summarytools::freq(penguins$species)
Frequencies  
penguins$species  
Type: Factor  

                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------- ------ --------- -------------- --------- --------------
         Adelie    152     44.19          44.19     44.19          44.19
      Chinstrap     68     19.77          63.95     19.77          63.95
         Gentoo    124     36.05         100.00     36.05         100.00
           <NA>      0                               0.00         100.00
          Total    344    100.00         100.00    100.00         100.00

4) gmodels package

library(gmodels)
CrossTable(penguins$species, format="SPSS") #will return as spss format

   Cell Contents
|-------------------------|
|                   Count |
|             Row Percent |
|-------------------------|

Total Observations in Table:  344 

          |    Adelie  | Chinstrap  |    Gentoo  | 
          |-----------|-----------|-----------|
          |      152  |       68  |      124  | 
          |   44.186% |   19.767% |   36.047% | 
          |-----------|-----------|-----------|

 

This package can also be used for cross-tabulation, for example species by sex:

library(gmodels)
CrossTable(penguins$species, penguins$sex,
           format="SPSS", 
           expected = T, #expected value
           prop.r = T, #row total
           prop.c = F, #column total
           prop.t = F, #overall total
           prop.chisq = F, #chi-square contribution of each cell
           chisq = T, #the results of a chi-square
           fisher = F, #the result of a Fisher Exact test
           mcnemar = F) #the result of McNemar test

   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
|             Row Percent |
|-------------------------|

Total Observations in Table:  333 

                 | penguins$sex 
penguins$species |   female  |     male  | Row Total | 
-----------------|-----------|-----------|-----------|
          Adelie |       73  |       73  |      146  | 
                 |   72.342  |   73.658  |           | 
                 |   50.000% |   50.000% |   43.844% | 
-----------------|-----------|-----------|-----------|
       Chinstrap |       34  |       34  |       68  | 
                 |   33.694  |   34.306  |           | 
                 |   50.000% |   50.000% |   20.420% | 
-----------------|-----------|-----------|-----------|
          Gentoo |       58  |       61  |      119  | 
                 |   58.964  |   60.036  |           | 
                 |   48.739% |   51.261% |   35.736% | 
-----------------|-----------|-----------|-----------|
    Column Total |      165  |      168  |      333  | 
-----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  0.04860717     d.f. =  2     p =  0.9759894 


 
       Minimum expected frequency: 33.69369 

Let’s try a real example

TB Incidence

We will use the TB incidence dataset, which you can download from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx

library(readxl)
url <- "https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx"
destfile <- "tb_incidence.xlsx"
curl::curl_download(url, destfile)
data1 <- read_excel(destfile)

1) Rename the first column to “country”.

library(dplyr)
data1 <- rename(data1, country = 'TB incidence, all forms (per 100 000 population per year)' )
names(data1[1])
[1] "country"

2) Find the mean for all variables before year 2000.

avg <- dplyr::select(data1, starts_with("1")) 
colnames(avg)
 [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
colMeans(avg, na.rm=T)
    1990     1991     1992     1993     1994     1995     1996     1997 
105.5797 107.6715 108.3140 110.3188 111.9662 114.1981 115.3527 118.8792 
    1998     1999 
121.5169 125.0435 

3) Create a new variable to represent the mean of TB incidence before year 2000 for all observations.

data1$before_2000_avg <- rowMeans(avg, na.rm=T)
names(data1)
 [1] "country"         "1990"            "1991"            "1992"           
 [5] "1993"            "1994"            "1995"            "1996"           
 [9] "1997"            "1998"            "1999"            "2000"           
[13] "2001"            "2002"            "2003"            "2004"           
[17] "2005"            "2006"            "2007"            "before_2000_avg"
head(data1[, c("country", "before_2000_avg")])
# A tibble: 6 × 2
  country        before_2000_avg
  <chr>                    <dbl>
1 Afghanistan              168  
2 Albania                   26.3
3 Algeria                   41.8
4 American Samoa             8.5
5 Andorra                   28.8
6 Angola                   225. 
new <- data1 %>%
  dplyr::select("country", "before_2000_avg")

4) Summarize the data by mean and standard deviation

summary(new)
   country          before_2000_avg 
 Length:208         Min.   :  3.50  
 Class :character   1st Qu.: 26.45  
 Mode  :character   Median : 61.20  
                    Mean   :113.88  
                    3rd Qu.:175.20  
                    Max.   :637.10  
                    NA's   :1       
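summary() reports the quartiles and the mean but not the standard deviation, so a short dplyr call (a sketch) gives the mean and standard deviation requested in this step:

# Mean and standard deviation of the pre-2000 average TB incidence
new %>%
  summarise(mean_incidence = mean(before_2000_avg, na.rm = TRUE),
            sd_incidence   = sd(before_2000_avg, na.rm = TRUE))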

Youth Tobacco

Import the dataset from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv

library(readr)
data2 <- read_csv("https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv")

1) Explore these variables and take note of the frequency of each category (MeasureDesc, Gender, and Response).

table(data2$MeasureDesc)

              Percent of Current Smokers Who Want to Quit 
                                                     1205 
Quit Attempt in Past Year Among Current Cigarette Smokers 
                                                     1041 
                                           Smoking Status 
                                                     3783 
                                              User Status 
                                                     3765 
table(data2$Gender)

 Female    Male Overall 
   3256    3256    3282 
table(data2$Response)

 Current     Ever Frequent 
    2514     2520     2514 
library(summarytools)
summarytools::freq(data2$MeasureDesc)
Frequencies  
data2$MeasureDesc  
Type: Character  

                                                                  Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
--------------------------------------------------------------- ------ --------- -------------- --------- --------------
                    Percent of Current Smokers Who Want to Quit   1205     12.30          12.30     12.30          12.30
      Quit Attempt in Past Year Among Current Cigarette Smokers   1041     10.63          22.93     10.63          22.93
                                                 Smoking Status   3783     38.63          61.56     38.63          61.56
                                                    User Status   3765     38.44         100.00     38.44         100.00
                                                           <NA>      0                               0.00         100.00
                                                          Total   9794    100.00         100.00    100.00         100.00
summarytools::freq(data2$Gender)
Frequencies  
data2$Gender  
Type: Character  

                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------- ------ --------- -------------- --------- --------------
       Female   3256     33.24          33.24     33.24          33.24
         Male   3256     33.24          66.49     33.24          66.49
      Overall   3282     33.51         100.00     33.51         100.00
         <NA>      0                               0.00         100.00
        Total   9794    100.00         100.00    100.00         100.00
summarytools::freq(data2$Response)
Frequencies  
data2$Response  
Type: Character  

                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------- ------ --------- -------------- --------- --------------
       Current   2514     33.31          33.31     25.67          25.67
          Ever   2520     33.39          66.69     25.73          51.40
      Frequent   2514     33.31         100.00     25.67          77.07
          <NA>   2246                              22.93         100.00
         Total   9794    100.00         100.00    100.00         100.00

2) Filter for MeasureDesc == "Smoking Status", Gender == "Overall", and Response == "Current".

sub_yts <- data2 %>% 
  filter(MeasureDesc == "Smoking Status", 
        Gender == "Overall", 
        Response == "Current") %>% 
  dplyr::select (YEAR, LocationDesc, Data_Value)

head(sub_yts)
# A tibble: 6 × 3
   YEAR LocationDesc Data_Value
  <dbl> <chr>             <dbl>
1  2015 Arizona             3.2
2  2015 Connecticut         0.8
3  2015 Connecticut         5.6
4  2015 Georgia            10.8
5  2015 Hawaii              3  
6  2015 Hawaii              7.4

3) Summarize Data_Value by YEAR.

dplyr::summarize(group_by(sub_yts, YEAR), 
          year_avg=mean(Data_Value, na.rm=T))
# A tibble: 17 × 2
    YEAR year_avg
   <dbl>    <dbl>
 1  1999    20.5 
 2  2000    19.9 
 3  2001    15.7 
 4  2002    16.8 
 5  2003    13.2 
 6  2004    13.9 
 7  2005    14.1 
 8  2006    14.1 
 9  2007    13.0 
10  2008    12.2 
11  2009    11.7 
12  2010    12.3 
13  2011    11.8 
14  2012     9.95
15  2013     7.78
16  2014     7.16
17  2015     6.58

Exercise 2

Using the nycflights13 package and the flights dataset, answer the questions below:

1) Use a Lilliefors test to check whether the distribution of flight durations (air_time) is normally distributed. What do the test results suggest?

2) Create a histogram of air_time. Does the histogram visually confirm the findings from the test in Question 1?

3) Calculate the skewness and kurtosis of dep_delay. How do these values help in understanding the nature of departure delays?

4) Generate a boxplot showing arrival delays (arr_delay) for each carrier. Which carrier shows the greatest variability in arrival delays?

5) Calculate the average departure delay for each month. Which month has the highest average delay?

6) Identify the top five busiest airports by the number of departures. Use a bar plot to visualize these counts.

7) Calculate the proportion of flights to each destination (dest). Which are the top three most common destinations?

8) Perform a Chi-square test to check for independence between carrier and dest. What does the result imply about the relationship between these two variables?

9) Analyze the mean and median dep_delay and arr_delay for each airline (carrier). Use these statistics to determine which airlines, on average, have the longest and shortest delays. What does this suggest about the performance of different airlines?

10) Divide the day into four time periods (e.g., Morning, Afternoon, Evening, Night) and calculate the number of flights that depart in each time period. Which time of day is busiest for departures? Create a bar plot to visualize this information.