library(palmerpenguins)
library(ggplot2)
data <- na.omit(penguins$body_mass_g) # Remove NA values
num_bins_sturges <- 1 + log2(length(data)) # calculating number of bins using Sturges' rule
ggplot(data.frame(data), aes(x = data)) +
geom_histogram(bins = round(num_bins_sturges)) +
ggtitle("Histogram with Sturges' Rule") + theme_light()
Data Summarization
Data summarization is an essential process in statistical programming, which involves reducing and simplifying large datasets into more manageable and understandable forms. This process is crucial as it helps in highlighting the key aspects of the data by extracting important patterns, trends, and relationships. In the realm of statistical analysis, summarization is not just about making the data smaller or simpler; it’s about capturing the essence of the data in a way that is both informative and useful for analysis.
Significance in Large Datasets
The advent of big data has made data summarization more important than ever. With the sheer volume of data available today, it’s practically impossible to analyze every individual data point. Summarization helps in distilling large datasets to a form where they can be easily interpreted and analyzed. This process not only saves time and computational resources but also aids in making data-driven decisions more effectively.
For instance, summarizing sales data over years can reveal trends and patterns that might not be evident when looking at daily sales figures. Similarly, summarizing survey data can help in quickly understanding the general opinion or trend, without getting lost in the myriad of individual responses.
Types of Summarization Techniques
Numerical Summarization: This involves using statistical measures to summarize the key characteristics of numerical data. Techniques include calculating measures of central tendency (like mean, median, and mode) and measures of variability or spread (like range, variance, and standard deviation). These techniques are fundamental in providing a quick snapshot of the data’s overall distribution and central values.
Categorical Summarization: When dealing with categorical (or qualitative) data, summarization often involves understanding the frequency or occurrence of different categories. Techniques include creating frequency tables, cross-tabulations, and using measures like mode. This type of summarization is particularly useful in understanding the distribution of categorical variables, like customer categories or product types.
Visual Summarization: Visual representations of data, such as histograms, bar charts, box plots, and scatter plots, provide an intuitive way to summarize and understand complex datasets. These techniques are invaluable in revealing patterns, trends, outliers, and relationships in data that might not be obvious in textual or numerical summaries.
In R, these summarization techniques are supported by a variety of functions and packages, making it an ideal environment for both basic and advanced data summarization tasks. This chapter will delve into these techniques, providing the reader with the knowledge and tools to effectively summarize and interpret large datasets using R.
Summarizing Numerical Data
Introduction to Normality Testing.
Normality testing is a fundamental step in statistical analysis, particularly when deciding which statistical methods to apply. Many statistical techniques assume that the data follows a normal (or Gaussian) distribution. However, real-world data often deviates from this idealized distribution. Therefore, assessing the normality of your dataset is crucial before applying techniques that assume normality, such as parametric tests.
In R, there are several methods to test for normality, including graphical methods like Q-Q plots and statistical tests like the Shapiro-Wilk test. Let’s explore how to perform these tests in R with an example dataset.
Graphical Methods to Assess Normality of a Distribution
1) Histogram
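As a quick illustration, a histogram of body_mass_g can be drawn as below (a minimal sketch; the bin width of 250 g is an arbitrary choice, and the Sturges'-rule code at the start of this chapter shows one way to choose the number of bins more systematically):

library(palmerpenguins)
library(ggplot2)

# Histogram of penguin body mass (sketch; binwidth chosen arbitrarily)
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250, na.rm = TRUE) +
  ggtitle("Histogram of Penguin Body Mass (g)") +
  theme_minimal()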
Based on the histogram, we can conclude that the data are slightly skewed to the right.
2) Boxplot
ggplot(penguins, aes(body_mass_g)) + geom_boxplot() +
ggtitle("A Boxplot showing distribution of weight of Penguins in grams") +
theme_minimal() +
scale_y_continuous(limits=c(-1,1))
Based on the boxplot, we can clearly see that the data are slightly skewed to the right.
3) Kurtosis and Skewness
Kurtosis is a measure of a sample or population that identifies how flat or peaked it is with respect to a normal distribution. It refers to how concentrated the values are in the center of the distribution. Skewness, on the other hand, measures the horizontal symmetry of a sample with respect to a normal distribution. The skewness coefficient can be positive (right skew), zero (no skew), or negative (left skew).
Steps examining sample normality by skewness and kurtosis:
1) Determine sample mean and standard deviation
2) Determine sample kurtosis and skewness
Formula for Skewness (the sample-adjusted, type-2 estimator used by e1071::skewness(..., type = 2)):
$$G_1 = \frac{\sqrt{n(n-1)}}{n-2}\cdot\frac{m_3}{m_2^{3/2}}, \qquad m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^k$$
Formula for Standard Error of Skewness:
$$SE_{skewness} = \sqrt{\frac{6\,n\,(n-1)}{(n-2)(n+1)(n+3)}}$$
Formula for Kurtosis (type-2 excess kurtosis):
$$G_2 = \frac{n-1}{(n-2)(n-3)}\left[(n+1)\frac{m_4}{m_2^{2}} - 3(n-1)\right]$$
Formula for Standard Error of Kurtosis (as computed in the code below):
$$SE_{kurtosis} = \sqrt{\frac{4\,SE_{skewness}^{2}}{n}}$$
3) Calculate the standard error of the kurtosis and the standard error of the skewness
4) Calculate Z-Score for the kurtosis and Z-Score for the skewness
Formula for Z-Score Skewness:
$$Z_{skewness} = \frac{G_1 - 0}{SE_{skewness}}$$
Formula for Z-Score Kurtosis:
$$Z_{kurtosis} = \frac{G_2 - 0}{SE_{kurtosis}}$$
5) Compare the Z-Score to the critical region obtained from the normal distribution
The Z-score is used to examine how closely the sample approximates a normal distribution.
To satisfy the normality assumption at alpha = 0.05, the Z-scores for both skewness and kurtosis must fall between -2 and 2.
Z-score for Skewness: This score indicates how many standard deviations the skewness of the dataset is from the mean skewness of a normally distributed dataset. A higher absolute value of the z-score indicates a greater degree of asymmetry.
Z-score for Kurtosis: Similarly, this score tells us how many standard deviations the kurtosis of the dataset is from the mean kurtosis of a normal distribution. It reflects the “tailedness” or the peak height of the distribution.
library(e1071)
library(dplyr)
penguins %>%
  filter(!is.na(body_mass_g)) %>%
  summarize(skewness = skewness(body_mass_g, type = 2, na.rm = TRUE),
            se_skewness = sqrt((6 * length(body_mass_g) * (length(body_mass_g) - 1)) / ((length(body_mass_g) - 2) * (length(body_mass_g) + 1) * (length(body_mass_g) + 3))),
            kurtosis = kurtosis(body_mass_g, type = 2, na.rm = TRUE),
            se_kurtosis = sqrt((4 * se_skewness^2) / (length(body_mass_g))),
            Z_Skewness = ((skewness - 0) / se_skewness),
            Z_Kurtosis = ((kurtosis - 0) / se_kurtosis))
# A tibble: 1 × 6
skewness se_skewness kurtosis se_kurtosis Z_Skewness Z_Kurtosis
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.470 0.132 -0.719 0.0143 3.57 -50.4
Interpretation Guidelines
Z-score between -2 and 2: This range is generally considered to indicate that the skewness or kurtosis is not significantly different from what would be expected in a normal distribution. The data can be considered approximately normal in this aspect.
Z-score less than -2: This suggests a significant deviation from normality in a negative direction. For skewness, it would mean a left skew (tail is on the left side). For kurtosis, it indicates a platykurtic distribution (less peaked than a normal distribution).
Z-score greater than 2: This indicates a significant deviation in a positive direction. For skewness, it would imply a right skew (tail is on the right side). For kurtosis, it suggests a leptokurtic distribution (more peaked than a normal distribution).
The Z-score for skewness is above 2, indicating that the data are skewed to the right, and the Z-score for kurtosis is below -2, indicating a platykurtic distribution (less peaked than a normal distribution).
4) Q-Q plots
Q-Q plots are scatterplots created by plotting two sets of quantiles against each other: the quantiles from the dataset and the quantiles of a normal distribution. If the data are normally distributed, the points in the Q-Q plot will approximately lie on a straight line.
Interpreting the Q-Q Plot
Linearity: In a Q-Q plot, if the data are normally distributed, the points will approximately lie on a straight line. The closer the points are to the reference line (qqline), the more normal the distribution is.
Departures from Linearity: Deviations from the line can indicate departures from normality:
Left Tail (lower end of the plot): If the points deviate below the line at the lower end, it suggests a left-skewed distribution (lighter left tail than normal).
Right Tail (upper end of the plot): If the points deviate above the line at the upper end, it indicates a right-skewed distribution (heavier right tail than normal).
Center of the Plot: If the points deviate from the line in the middle of the plot, it suggests a difference in the median or a distribution with different kurtosis from normal (either more peaked or flatter than a normal distribution).
# Create Q-Q plot with informative labels
qqnorm(penguins$body_mass_g,
       main = 'Q-Q Plot for Normality',
       xlab = 'Theoretical Dist',
       ylab = 'Sample dist')
# Adding straight diagonal reference line to the plot
qqline(penguins$body_mass_g, col = 'steelblue')
Another way to construct a Q-Q plot:
library(ggpubr)
ggqqplot(penguins$body_mass_g)
By examining how the points in the Q-Q plot deviate from the reference line, we can infer if and how the body_mass_g
distribution deviates from a normal distribution. Remember, a Q-Q plot provides a visual assessment and should be used in conjunction with other methods for a comprehensive analysis of normality.
5) p-p Plot
A P-P (Probability-Probability) plot is a graphical technique used to assess how closely two probability distributions agree with each other. Specifically, it compares the cumulative distribution function (CDF) of a sample data set against the CDF of a theoretical distribution, which is often a normal distribution in many applications. The P-P plot is particularly useful for evaluating the goodness of fit of a distribution model to observed data.
How a P-P Plot Works:
Theoretical Distribution: This is often a standard distribution like a normal distribution, but it could be any specified distribution. The cumulative probabilities from this distribution are plotted on one axis (usually the x-axis).
Sample Data Distribution: The observed data’s cumulative probabilities are calculated and plotted on the other axis (usually the y-axis).
Plotting: On a P-P plot, each point represents a cumulative probability from the sample data against the corresponding cumulative probability from the theoretical distribution. If the sample data perfectly follow the theoretical distribution, all the points will lie on the diagonal line.
Interpreting a P-P Plot:
Linearity: If the points on a P-P plot lie close to the diagonal line (45-degree line), this indicates that the sample data follow the theoretical distribution closely.
Deviations from Linearity:
Both Ends Deviate: If the points deviate from the diagonal at both the lower and upper tails, it suggests that the sample distribution has heavier or lighter tails than the theoretical distribution.
Middle Deviates: If the points deviate in the middle of the plot but are closer at the tails, this could indicate a discrepancy in the central part of the distribution, like a difference in variance between the sample and theoretical distributions.
S-shaped Curve: An S-shaped curve in the P-P plot indicates that one of the distributions is more skewed compared to the other.
library(CircStats)
data1 <- penguins %>%
  filter(!is.na(body_mass_g))
pp.plot(data1$body_mass_g, ref.line = TRUE)
mu kappa
1 0.3731475 0.2059357
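The pp.plot() function from CircStats compares the data against a von Mises distribution (hence the mu and kappa estimates above). A P-P plot against a normal reference can also be built by hand; the following is a minimal sketch, reusing the data1 object created above:

# Manual P-P plot against a normal distribution (sketch)
x <- sort(data1$body_mass_g)
n <- length(x)
empirical_cdf <- (1:n) / n                               # empirical cumulative probabilities
theoretical_cdf <- pnorm(x, mean = mean(x), sd = sd(x))  # normal CDF at each observation
plot(theoretical_cdf, empirical_cdf,
     main = "P-P Plot against a Normal Distribution",
     xlab = "Theoretical cumulative probability",
     ylab = "Empirical cumulative probability")
abline(0, 1, col = "steelblue")                          # 45-degree reference line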
6) Statistical Test
Shapiro-Wilk Test
The Shapiro-Wilk test is widely used for testing normality. This test checks the null hypothesis that the data were drawn from a normal distribution.
It is generally preferred for small sample sizes (< 50 samples), but can be used for larger ones.
shapiro.test(penguins$body_mass_g)
Shapiro-Wilk normality test
data: penguins$body_mass_g
W = 0.95921, p-value = 3.679e-08
library(broom)
tidy(shapiro.test(data1$body_mass_g))
# A tibble: 1 × 3
statistic p.value method
<dbl> <dbl> <chr>
1 0.959 0.0000000368 Shapiro-Wilk normality test
Kolmogorov-Smirnov Test (K-S Test)
The K-S test compares the empirical distribution function of the data with the expected distribution function for a normal distribution.
This test is more sensitive towards the center of the distribution than the tails.
#removing missing value
data1 <- penguins %>%
  filter(!is.na(body_mass_g))

# Standardizing the data
standardized_data <- (data1$body_mass_g - mean(data1$body_mass_g)) / sd(data1$body_mass_g)
# Kolmogorov-Smirnov Test
ks.test(standardized_data, "pnorm")
Asymptotic one-sample Kolmogorov-Smirnov test
data: standardized_data
D = 0.10408, p-value = 0.00121
alternative hypothesis: two-sided
Anderson-Darling Test
Similar to the K-S test, the Anderson-Darling test is a test of goodness of fit.
This test gives more weight to the tails than the K-S test and is thus more sensitive to outliers.
library(nortest)
ad.test(penguins$body_mass_g)
Anderson-Darling normality test
data: penguins$body_mass_g
A = 4.543, p-value = 2.757e-11
Lilliefors Test
A modification of the K-S test, the Lilliefors test is specifically designed for testing normality.
It is particularly useful when the mean and variance of the distribution are unknown.
library(nortest)
lillie.test(penguins$body_mass_g)
Lilliefors (Kolmogorov-Smirnov) normality test
data: penguins$body_mass_g
D = 0.10408, p-value = 1.544e-09
Jarque-Bera Test
The Jarque-Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.
It is more commonly used in econometrics.
library(tseries)
#removing missing value
data1 <- penguins %>%
  filter(!is.na(body_mass_g))
jarque.bera.test(data1$body_mass_g)
Jarque Bera Test
data: data1$body_mass_g
X-squared = 20.014, df = 2, p-value = 4.508e-05
D’Agostino skewness Test
D’Agostino’s K-squared test is a statistical method used to test the hypothesis that a given sample comes from a normally distributed population, focusing particularly on the skewness of the distribution.
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable.
Focus on Skewness: The test specifically examines the skewness of the distribution, which is the degree to which it leans to one side. In a perfectly normal distribution, the skewness would be zero, indicating symmetry.
Test Statistic: The test involves calculating a standardized skewness measure and then squaring it to get the test statistic. This statistic essentially measures how far the sample’s skewness deviates from that expected in a normal distribution.
Sample Size Consideration: The calculation of the test statistic incorporates the sample size, as skewness can be more variable in smaller samples. The test is adjusted to account for this variability.
P-Value: The test produces a p-value, which indicates the probability of observing the given degree of skewness (or more extreme) if the null hypothesis of normality is true. A low p-value (typically less than 0.05) would lead to rejecting the null hypothesis, suggesting the distribution is not normal with respect to skewness.
Applicability: While D’Agostino’s K-squared test is robust for various sample sizes, it is particularly useful for larger samples where the approximation of the p-value becomes more accurate.
Interpretation: It’s important to interpret the results within the context of your data. Even if a dataset fails this test, it might still be approximately normal enough for certain statistical methods that assume normality. Conversely, a pass does not guarantee all aspects of normality (like kurtosis).
library(moments)
#removing missing value
data1 <- penguins %>%
  filter(!is.na(body_mass_g))
moments::agostino.test(data1$body_mass_g)
D'Agostino skewness test
data: data1$body_mass_g
skew = 0.46826, z = 3.44537, p-value = 0.0005703
alternative hypothesis: data have a skewness
Pearson’s Chi-Square Normality Test
The Pearson Chi-Square normality test, often referred to as the Pearson’s Chi-Squared test for goodness-of-fit, is a statistical test used to determine if a sample comes from a population with a specific distribution, in this case, a normal distribution.
Basic Concept: The test compares the observed frequencies in each category/bin of the data to the frequencies expected if the data followed a normal distribution. The bins are typically intervals of values in your data set.
Test Statistic: The test statistic is computed by summing the squared differences between the observed and expected frequencies, divided by the expected frequencies.
Degrees of Freedom: The degrees of freedom for the test are typically the number of bins minus the number of parameters estimated (mean and standard deviation for a normal distribution) and minus one.
P-Value: The test yields a p-value which indicates the probability of observing the data if it actually comes from a normal distribution. A low p-value (typically less than 0.05) suggests that the data do not come from a normal distribution.
Cautions: The test can be sensitive to the choice of bins and is generally less powerful for detecting deviations from normality than other tests like the Shapiro-Wilk test. It is more suitable for large sample sizes.
library(nortest)
pearson.test(penguins$body_mass_g)
Pearson chi-square normality test
data: penguins$body_mass_g
P = 59.947, p-value = 2.087e-06
Shapiro-Francia Normality Test
The Shapiro-Francia normality test is a modification of the Shapiro-Wilk test for normality. It is particularly useful for testing the normality of distributions that are suspected of having light tails. Compared to the Shapiro-Wilk test, the Shapiro-Francia test is less sensitive to deviations in the tails of the distribution.
Focus on Light Tails: The test is specifically tailored to detect departures from normality due to light tails.
Test Statistic: Similar to the Shapiro-Wilk test, the Shapiro-Francia test computes a test statistic that measures the closeness of the data to the normal distribution. However, it uses a different weighting scheme that makes it less sensitive to the tails.
Interpretation: As with other normality tests, a low p-value (typically less than 0.05) suggests that the data do not come from a normal distribution.
library(nortest)
sf.test(penguins$body_mass_g)
Shapiro-Francia normality test
data: penguins$body_mass_g
W = 0.96126, p-value = 4.027e-07
Cramer-Von Mises Normality Test
The Cramér-von Mises criterion is a goodness-of-fit test used in statistics to test the hypothesis that a given set of observations is drawn from a specified distribution. In the context of testing for normality, the Cramér-von Mises test compares the empirical cumulative distribution function (ECDF) of the sample data against the CDF of a normal distribution.
Test Statistic: The Cramér-von Mises test statistic measures the sum of squared differences between the ECDF of the sample and the CDF of the theoretical normal distribution.
Sensitivity: Unlike tests that may focus more on the tails (like the Anderson-Darling test) or the center of the distribution (like the Kolmogorov-Smirnov test), the Cramér-von Mises test is a more general test of goodness-of-fit and doesn’t give special weight to any part of the distribution.
P-Value: The test yields a p-value, which can be used to determine whether to reject the null hypothesis that the data are drawn from a normal distribution. A low p-value (usually less than 0.05) suggests that the data are not normally distributed.
library(nortest)
cvm.test(penguins$body_mass_g)
Cramer-von Mises normality test
data: penguins$body_mass_g
W = 0.75983, p-value = 2.474e-08
Exercise 1
1) Apply the Lilliefors test to dep_delay across all flights in nycflights13. What does the result indicate about the distribution of departure delays?
2) Assess whether the body mass (body_mass_g) of the penguins follows a normal distribution for each species. [palmerpenguins]
3) Perform the Jarque-Bera test on flipper_length_mm for penguins on each island (island). Which island exhibits the most normal-like distribution in terms of flipper length?
Basic Statistical Summarization
Data summarization is a crucial aspect of statistical analysis, providing a way to describe and understand large datasets through a few summary statistics. Among the key concepts in data summarization are measures of central tendency, position, and dispersion. Each of these measures gives different insights into the nature of the data. [for continuous/numerical data only]
1. Measures of Central Tendency
Measures of central tendency describe the center point or typical value of a dataset. The most common measures are:
- Mean (Arithmetic Average): The sum of all values divided by the number of values. It’s sensitive to outliers and can be skewed by them.
# calculating mean of numerical variable
mean(penguins$body_mass_g, na.rm = TRUE)
[1] 4201.754
- Median: The middle value when the data is sorted in ascending order. It’s less affected by outliers and skewness and provides a better central value for skewed distributions.
#calculating median of numerical variable
median(penguins$body_mass_g, na.rm=TRUE)
[1] 4050
- Mode: The most frequently occurring value in the dataset. There can be more than one mode in a dataset (bimodal, multimodal). Useful in understanding the most common value, especially for categorical data.
# Mode: the most frequently occurring body mass value
as.numeric(names(sort(table(penguins$body_mass_g), decreasing = TRUE)[1]))
[1] 3800
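Since base R has no built-in function for the statistical mode, a small helper function is sometimes convenient. The following is a minimal sketch; the name get_mode is our own:

# Hypothetical helper: returns the most frequent value(s) of a vector
get_mode <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values
  freq <- table(x)                        # frequency of each distinct value
  as.numeric(names(freq)[freq == max(freq)])
}
get_mode(penguins$body_mass_g)            # includes 3800, the value found by the one-liner above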
2. Measures of Position
Measures of position describe how data points fall in relation to the distribution or to each other. These include:
- Percentiles: Values below which a certain percentage of the data falls. For example, the 25th percentile (or 1st quartile) is the value below which 25% of the data lies.
quantile(penguins$body_mass_g, na.rm=TRUE)
0% 25% 50% 75% 100%
2700 3550 4050 4750 6300
- Quartiles: Special percentiles that divide the dataset into four equal parts. The median is the second quartile.
#1st Quartile
quantile(penguins$body_mass_g, 0.25, na.rm=TRUE)
25%
3550
#2nd Quartile
quantile(penguins$body_mass_g, 0.50, na.rm=TRUE) #same as median
50%
4050
#3rd Quartile
quantile(penguins$body_mass_g, 0.75, na.rm=TRUE)
75%
4750
- Interquartile Range (IQR): The range between the first and third quartiles (25th and 75th percentiles). It represents the middle 50% of the data and is a measure of variability that’s not influenced by outliers.
IQR(penguins$body_mass_g, na.rm=TRUE)
[1] 1200
3. Measures of Dispersion
Measures of dispersion or variability tell us about the spread of the data points in a dataset:
- min: To find the minimum value in the variable
min(penguins$body_mass_g, na.rm = TRUE)
[1] 2700
- max: To find the maximum value in the variable
max(penguins$body_mass_g, na.rm=TRUE)
[1] 6300
- length: to find how many observations in the variable
length(penguins$body_mass_g)
[1] 344
- Range: The difference between the highest and lowest values. It’s simple but can be heavily influenced by outliers.
max(penguins$body_mass_g, na.rm=TRUE) - min(penguins$body_mass_g, na.rm=TRUE)
[1] 3600
- Variance: The average of the squared differences from the mean. It gives a sense of the spread of the data, but it’s not in the same unit as the data.
var(penguins$body_mass_g, na.rm=TRUE)
[1] 643131.1
- Standard Deviation (SD): The square root of the variance. It’s in the same units as the data and describes how far data points tend to deviate from the mean.
sd(penguins$body_mass_g, na.rm=TRUE)
[1] 801.9545
- Coefficient of Variation (CV): The ratio of the standard deviation to the mean. It’s a unitless measure of relative variability, useful for comparing variability across datasets with different units or means.
sd(penguins$body_mass_g, na.rm=TRUE) / mean(penguins$body_mass_g, na.rm=TRUE) * 100
[1] 19.08618
The summary() Function
The summary() function in R is a generic function used to produce result summaries of various model and data objects. When applied to a data frame, it provides a quick overview of the statistical properties of each column. The function is particularly useful for getting a rapid sense of the data, especially during the initial stages of data analysis.
Key Features of summary() in R:
- Applicability to Different Objects: The summary() function can be used on different types of objects in R, including vectors, data frames, and model objects. The output format varies depending on the type of object.
- Default Output for Data Frames: For a data frame, summary() typically returns the following statistics for each column:
  - For numeric variables: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum.
  - For factor variables: Counts for each level, and NA count if there are missing values.
- Handling of Missing Values: The function includes NA values in its output, providing a count of missing values, which is crucial for data cleaning and preprocessing.
- Customization: The behavior of summary() can be customized for user-defined classes (S3 or S4) in R. This means that when you create a new type of object, you can also define what summary() should return when applied to objects of this type.
- Use in Exploratory Data Analysis (EDA): It is often used as a preliminary step in EDA to get a sense of the data distribution, identify possible outliers, and detect missing values.
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
Notes:
- While summary() provides a quick and useful overview, it's often just a starting point for data analysis. Depending on the results, you might need more detailed analysis, such as specific statistical tests or detailed data visualizations.
- The function is particularly handy for quickly checking data after importation, allowing for a rapid assessment of data quality, structure, and potential areas that may require further investigation.
Data Summarization on Matrices
rowMeans(x):
- Purpose: Calculates the mean of each row in a matrix x.
- Use Case: Useful when you need to find the average across different variables (columns) for each observation (row).
- Output: Returns a numeric vector containing the mean of each row.
data <- penguins

# Selecting only numeric columns for the matrix
numeric_data <- data[, sapply(data, is.numeric)]
numeric_matrix <- as.matrix(na.omit(numeric_data))

# Apply summarization functions
# Means of each row
row_means <- rowMeans(numeric_matrix)
colMeans(x):
- Purpose: Computes the mean of each column in a matrix x.
- Use Case: Used when you want to find the average value of each variable (column) across all observations (rows).
- Output: Produces a numeric vector with the mean of each column.
# Means of each column
colMeans(numeric_matrix)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
43.92193 17.15117 200.91520 4201.75439
year
2008.02924
rowSums(x):
- Purpose: Calculates the sum of each row in a matrix x.
- Use Case: Helpful for aggregating data across multiple variables (columns) for each individual observation (row).
- Output: Returns a numeric vector containing the sum of each row.

# Sums of each row
row_sums <- rowSums(numeric_matrix)
colSums(x):
- Purpose: Computes the sum of each column in a matrix x.
- Use Case: Useful for aggregating data for each variable (column) across all observations (rows).
- Output: Generates a numeric vector with the sum of each column.
# Sums of each column
colSums(numeric_matrix)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
15021.3 5865.7 68713.0 1437000.0
year
686746.0
Detecting how many missing values
To detect how many missing values are in a variable:
table(is.na(penguins$body_mass_g))
FALSE TRUE
342 2
The TRUE count gives the number of missing values for the variable body_mass_g.
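Two other quick ways to count missing values (a brief sketch using base R functions):

# Total number of missing values in a single variable
sum(is.na(penguins$body_mass_g))
# Number of missing values in every column of the data frame
colSums(is.na(penguins))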
Apply function
We can calculate summary statistics for several variables simultaneously by using sapply(); this function applies a numerical summary measure to every numeric column. We will discuss the apply family in another lecture.
Mean
sapply(numeric_data, mean, na.rm=TRUE)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
43.92193 17.15117 200.91520 4201.75439
year
2008.02907
Standard Deviation
sapply(numeric_data, sd, na.rm=TRUE)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
5.4595837 1.9747932 14.0617137 801.9545357
year
0.8183559
Sum / total in a column
sapply(numeric_data, sum, na.rm=T)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
15021.3 5865.7 68713.0 1437000.0
year
690762.0
Basically, we can substitute any other numerical summary function as the second argument, as the sketch below shows.
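For example, reusing the numeric_data object created earlier, the median or IQR of every numeric column can be obtained the same way (a sketch):

# Median of every numeric column
sapply(numeric_data, median, na.rm = TRUE)
# Interquartile range of every numeric column
sapply(numeric_data, IQR, na.rm = TRUE)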
Summary Statistics by group
In data analysis, when a categorical variable is present, we often want to compute the numerical measures within each category. For example, we may want the mean and standard deviation by gender (male and female) instead of summarizing them for the whole dataset; a quick sketch of this idea follows.
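For instance, a grouped summary of penguin body mass by sex might look like the sketch below; the mtcars example that follows walks through the full workflow, including per-group normality checks.

# Mean and standard deviation of body mass by sex (sketch)
penguins %>%
  filter(!is.na(sex), !is.na(body_mass_g)) %>%
  group_by(sex) %>%
  summarise(mean_mass = mean(body_mass_g),
            sd_mass = sd(body_mass_g))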
Example: calculating mean and standard deviation for the mtcars dataset
# Testing the normality of the Miles per Gallon variable by Number of cylinders
# Selecting 4 cylinders
cyl4 <- mtcars %>%
  filter(cyl == 4)
lillie.test(cyl4$mpg)
Lilliefors (Kolmogorov-Smirnov) normality test
data: cyl4$mpg
D = 0.16784, p-value = 0.5226
#Selecting 6 cylinders
cyl6 <- mtcars %>%
  filter(cyl == 6)
lillie.test(cyl6$mpg)
Lilliefors (Kolmogorov-Smirnov) normality test
data: cyl6$mpg
D = 0.23502, p-value = 0.2863
#Selecting 8 cylinders
cyl8 <- mtcars %>%
  filter(cyl == 8)
lillie.test(cyl8$mpg)
Lilliefors (Kolmogorov-Smirnov) normality test
data: cyl8$mpg
D = 0.16305, p-value = 0.3983
Calculating the mean and standard deviation of mpg by cyl, since the normality assumption is met:
mtcars %>%
  group_by(factor(cyl)) %>%
summarise(mean_mpg = mean(mpg),
sd_mpg = sd(mpg))
# A tibble: 3 × 3
`factor(cyl)` mean_mpg sd_mpg
<fct> <dbl> <dbl>
1 4 26.7 4.51
2 6 19.7 1.45
3 8 15.1 2.56
Example: calculating the median and IQR for non-normally distributed data (mtcars dataset). For the sake of demonstration we treat mpg as non-normal here, even though the tests below do not reject normality.
#checking normality
Spe1 <- mtcars %>%
  filter(am == 0)
lillie.test(Spe1$mpg)
Lilliefors (Kolmogorov-Smirnov) normality test
data: Spe1$mpg
D = 0.087337, p-value = 0.9648
Spe2 <- mtcars %>%
  filter(am == 1)
lillie.test(Spe2$mpg)
Lilliefors (Kolmogorov-Smirnov) normality test
data: Spe2$mpg
D = 0.14779, p-value = 0.6107
mtcars %>%
  group_by(am) %>%
  filter(!is.na(am)) %>%
summarise(median_mpg = median(mpg),
IQR_mpg = IQR(mpg))
# A tibble: 2 × 3
am median_mpg IQR_mpg
<dbl> <dbl> <dbl>
1 0 17.3 4.25
2 1 22.8 9.4
To present these statistical measures in a table:
Table 1: Descriptive statistics
Factor | mean (sd) | median (IQR) |
---|---|---|
Number of Cylinders | | |
4 | 26.67 (4.51) | |
6 | 19.70 (1.45) | |
8 | 15.10 (2.56) | |
Transmission | | |
Automatic | | 17.30 (4.25) |
Manual | | 22.80 (9.40) |
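A table like this can also be assembled directly in R; the following is a minimal sketch using knitr::kable, assuming the knitr package is installed and using the summary values computed above:

library(knitr)

# Assemble the descriptive statistics into a small data frame (sketch)
desc_tab <- data.frame(
  Factor = c("Number of Cylinders: 4", "Number of Cylinders: 6", "Number of Cylinders: 8",
             "Transmission: Automatic", "Transmission: Manual"),
  `mean (sd)` = c("26.67 (4.51)", "19.70 (1.45)", "15.10 (2.56)", "", ""),
  `median (IQR)` = c("", "", "", "17.30 (4.25)", "22.80 (9.40)"),
  check.names = FALSE
)
kable(desc_tab, caption = "Table 1: Descriptive statistics")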
Several useful packages for numerical data
1) Hmisc package
library(Hmisc)
Hmisc::describe(penguins$bill_length_mm)
penguins$bill_length_mm
n missing distinct Info Mean Gmd .05 .10
342 2 164 1 43.92 6.274 35.70 36.60
.25 .50 .75 .90 .95
39.23 44.45 48.50 50.80 51.99
lowest : 32.1 33.1 33.5 34 34.1, highest: 55.1 55.8 55.9 58 59.6
2) pastecs package
library(pastecs)
stat.desc(penguins$bill_length_mm, desc=TRUE, p=0.95)
nbr.val nbr.null nbr.na min max range
3.420000e+02 0.000000e+00 2.000000e+00 3.210000e+01 5.960000e+01 2.750000e+01
sum median mean SE.mean CI.mean.0.95 var
1.502130e+04 4.445000e+01 4.392193e+01 2.952205e-01 5.806825e-01 2.980705e+01
std.dev coef.var
5.459584e+00 1.243020e-01
3) psych package
library(psych)
psych::describe(penguins$bill_length_mm)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6 27.5 0.05 -0.89 0.3
We can also describe the whole dataset:
psych::describe(penguins)
vars n mean sd median trimmed mad min max
species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
body_mass_g 6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
range skew kurtosis se
species* 2.0 0.16 -1.73 0.05
island* 2.0 0.61 -0.91 0.04
bill_length_mm 27.5 0.05 -0.89 0.30
bill_depth_mm 8.4 -0.14 -0.92 0.11
flipper_length_mm 59.0 0.34 -1.00 0.76
body_mass_g 3600.0 0.47 -0.74 43.36
sex* 1.0 -0.02 -2.01 0.03
year 2.0 -0.05 -1.51 0.04
4) skimr package
library(skimr)
skim(penguins$bill_length_mm)
Name | penguins$bill_length_mm |
Number of rows | 344 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
We can also skim the whole dataset:
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Summarizing Categorical Data
Summarizing categorical data is an essential part of data analysis, especially when dealing with survey results, demographic information, or any data where variables are qualitative rather than quantitative. The goal is to gain insights into the distribution of categories, identify patterns, and make inferences about the population being studied.
Key Concepts in Summarizing Categorical Data:
Frequency Counts:
The most basic form of summarization for categorical data is to count the number of occurrences of each category.
In R, the table() function is commonly used for this purpose.
# To tabulate categorical data
table(penguins$species)
Adelie Chinstrap Gentoo
152 68 124
Proportions and Percentages:
Converting frequency counts into proportions or percentages provides a clearer understanding of the data relative to the whole.
This is particularly useful when comparing groups of different sizes.
#to get relative frequency by group
prop.table(table(penguins$species))
Adelie Chinstrap Gentoo
0.4418605 0.1976744 0.3604651
To combine the frequency and percentage values together:
freq1 <- as.data.frame(table(penguins$species))
percent1 <- as.data.frame(prop.table(table(penguins$species)) * 100)
names(percent1)[2] <- "percentage"
cbind(freq1, percent1[2])
Var1 Freq percentage
1 Adelie 152 44.18605
2 Chinstrap 68 19.76744
3 Gentoo 124 36.04651
combine <- cbind(freq1, round(percent1[2], 2))
combine
Var1 Freq percentage
1 Adelie 152 44.19
2 Chinstrap 68 19.77
3 Gentoo 124 36.05
Cross-tabulation:
Cross-tabulation (or contingency tables) involves summarizing two or more categorical variables simultaneously, making it possible to observe the relationship between them.
In R, this can be achieved using the table() function with multiple variables.
# to tabulate the cross tabulation between species and island
table(penguins$species, penguins$island)
Biscoe Dream Torgersen
Adelie 44 56 52
Chinstrap 0 68 0
Gentoo 124 0 0
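Row or column proportions of a cross-tabulation can be obtained by passing the table to prop.table() with a margin argument (a brief sketch):

# Row proportions: within each species, the share found on each island
prop.table(table(penguins$species, penguins$island), margin = 1)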
Several Useful packages for categorical data analysis
1) janitor package
library(janitor)
tabyl(penguins$species, sort=TRUE)
penguins$species n percent
Adelie 152 0.4418605
Chinstrap 68 0.1976744
Gentoo 124 0.3604651
2) epiDisplay package
library(epiDisplay)
tab1(penguins$species, sort.group = "decreasing",
cum.percent = TRUE)
penguins$species :
Frequency Percent Cum. percent
Adelie 152 44.2 44.2
Gentoo 124 36.0 80.2
Chinstrap 68 19.8 100.0
Total 344 100.0 100.0
It also produces a bar graph of the frequencies.
3) summarytools package
library(summarytools)
summarytools::freq(penguins$species)
Frequencies
penguins$species
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
--------------- ------ --------- -------------- --------- --------------
Adelie 152 44.19 44.19 44.19 44.19
Chinstrap 68 19.77 63.95 19.77 63.95
Gentoo 124 36.05 100.00 36.05 100.00
<NA> 0 0.00 100.00
Total 344 100.00 100.00 100.00 100.00
4) gmodels package
library(gmodels)
CrossTable(penguins$species, format="SPSS") #will return as spss format
Cell Contents
|-------------------------|
| Count |
| Row Percent |
|-------------------------|
Total Observations in Table: 344
| Adelie | Chinstrap | Gentoo |
|-----------|-----------|-----------|
| 152 | 68 | 124 |
| 44.186% | 19.767% | 36.047% |
|-----------|-----------|-----------|
This package can also be used for cross-tabulation tables, for example a tabulation of species by sex:
library(gmodels)
CrossTable(penguins$species, penguins$sex,
format="SPSS",
expected = T, #expected value
prop.r = T, #row total
prop.c = F, #column total
prop.t = F, #overall total
prop.chisq = F, #chi-square contribution of each cell
chisq = T, #the results of a chi-square
fisher = F, #the result of a Fisher Exact test
mcnemar = F) #the result of McNemar test
Cell Contents
|-------------------------|
| Count |
| Expected Values |
| Row Percent |
|-------------------------|
Total Observations in Table: 333
| penguins$sex
penguins$species | female | male | Row Total |
-----------------|-----------|-----------|-----------|
Adelie | 73 | 73 | 146 |
| 72.342 | 73.658 | |
| 50.000% | 50.000% | 43.844% |
-----------------|-----------|-----------|-----------|
Chinstrap | 34 | 34 | 68 |
| 33.694 | 34.306 | |
| 50.000% | 50.000% | 20.420% |
-----------------|-----------|-----------|-----------|
Gentoo | 58 | 61 | 119 |
| 58.964 | 60.036 | |
| 48.739% | 51.261% | 35.736% |
-----------------|-----------|-----------|-----------|
Column Total | 165 | 168 | 333 |
-----------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 0.04860717 d.f. = 2 p = 0.9759894
Minimum expected frequency: 33.69369
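For comparison, the same chi-square test of independence can be run directly with base R (a sketch):

# Chi-square test of independence between species and sex
chisq.test(table(penguins$species, penguins$sex))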
Let’s try a real example.
TB Incidence
We will use the TB incidence dataset, which you can download from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx
library(readxl)
<- "https://dataintror.s3.ap-southeast-1.amazonaws.com/tb_incidence.xlsx"
url <- "tb_incidence.xlsx"
destfile ::curl_download(url, destfile)
curl<- read_excel(destfile) data1
1) Rename the first column to “country”.
library(dplyr)
data1 <- rename(data1, country = 'TB incidence, all forms (per 100 000 population per year)')
names(data1[1])
[1] "country"
2) Find the mean for all variables before year 2000.
avg <- dplyr::select(data1, starts_with("1"))
colnames(avg)
[1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
colMeans(avg, na.rm=T)
1990 1991 1992 1993 1994 1995 1996 1997
105.5797 107.6715 108.3140 110.3188 111.9662 114.1981 115.3527 118.8792
1998 1999
121.5169 125.0435
3) Create a new variable to represent the mean of TB incidence before year 2000 for all observations.
data1$before_2000_avg <- rowMeans(avg, na.rm = TRUE)
names(data1)
[1] "country" "1990" "1991" "1992"
[5] "1993" "1994" "1995" "1996"
[9] "1997" "1998" "1999" "2000"
[13] "2001" "2002" "2003" "2004"
[17] "2005" "2006" "2007" "before_2000_avg"
head(data1[, c("country", "before_2000_avg")])
# A tibble: 6 × 2
country before_2000_avg
<chr> <dbl>
1 Afghanistan 168
2 Albania 26.3
3 Algeria 41.8
4 American Samoa 8.5
5 Andorra 28.8
6 Angola 225.
new <- data1 %>%
  dplyr::select("country", "before_2000_avg")
4) Summarize the data by mean and standard deviation
summary(new)
country before_2000_avg
Length:208 Min. : 3.50
Class :character 1st Qu.: 26.45
Mode :character Median : 61.20
Mean :113.88
3rd Qu.:175.20
Max. :637.10
NA's :1
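Note that summary() does not report the standard deviation; a short dplyr sketch that computes the mean and standard deviation explicitly:

new %>%
  summarise(mean_incidence = mean(before_2000_avg, na.rm = TRUE),
            sd_incidence = sd(before_2000_avg, na.rm = TRUE))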
Youth Tobacco
Import the dataset from this link: https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv
library(readr)
data2 <- read_csv("https://dataintror.s3.ap-southeast-1.amazonaws.com/Youth_Tobacco_Survey_YTS_Data.csv")
1) Explore these variables and take note of the frequency of each category (MeasureDesc, Gender, and Response).
table(data2$MeasureDesc)
Percent of Current Smokers Who Want to Quit
1205
Quit Attempt in Past Year Among Current Cigarette Smokers
1041
Smoking Status
3783
User Status
3765
table(data2$Gender)
Female Male Overall
3256 3256 3282
table(data2$Response)
Current Ever Frequent
2514 2520 2514
library(summarytools)
summarytools::freq(data2$MeasureDesc)
Frequencies
data2$MeasureDesc
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
--------------------------------------------------------------- ------ --------- -------------- --------- --------------
Percent of Current Smokers Who Want to Quit 1205 12.30 12.30 12.30 12.30
Quit Attempt in Past Year Among Current Cigarette Smokers 1041 10.63 22.93 10.63 22.93
Smoking Status 3783 38.63 61.56 38.63 61.56
User Status 3765 38.44 100.00 38.44 100.00
<NA> 0 0.00 100.00
Total 9794 100.00 100.00 100.00 100.00
summarytools::freq(data2$Gender)
Frequencies
data2$Gender
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
------------- ------ --------- -------------- --------- --------------
Female 3256 33.24 33.24 33.24 33.24
Male 3256 33.24 66.49 33.24 66.49
Overall 3282 33.51 100.00 33.51 100.00
<NA> 0 0.00 100.00
Total 9794 100.00 100.00 100.00 100.00
summarytools::freq(data2$Response)
Frequencies
data2$Response
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
-------------- ------ --------- -------------- --------- --------------
Current 2514 33.31 33.31 25.67 25.67
Ever 2520 33.39 66.69 25.73 51.40
Frequent 2514 33.31 100.00 25.67 77.07
<NA> 2246 22.93 100.00
Total 9794 100.00 100.00 100.00 100.00
2) Filter for MeasureDesc = "Smoking Status", Gender = "Overall", and Response = "Current".
sub_yts <- data2 %>%
  filter(MeasureDesc == "Smoking Status",
         Gender == "Overall",
         Response == "Current") %>%
  dplyr::select(YEAR, LocationDesc, Data_Value)
head(sub_yts)
# A tibble: 6 × 3
YEAR LocationDesc Data_Value
<dbl> <chr> <dbl>
1 2015 Arizona 3.2
2 2015 Connecticut 0.8
3 2015 Connecticut 5.6
4 2015 Georgia 10.8
5 2015 Hawaii 3
6 2015 Hawaii 7.4
3) We want to summarize Data_Value by YEAR.
dplyr::summarize(group_by(sub_yts, YEAR),
                 year_avg = mean(Data_Value, na.rm = TRUE))
# A tibble: 17 × 2
YEAR year_avg
<dbl> <dbl>
1 1999 20.5
2 2000 19.9
3 2001 15.7
4 2002 16.8
5 2003 13.2
6 2004 13.9
7 2005 14.1
8 2006 14.1
9 2007 13.0
10 2008 12.2
11 2009 11.7
12 2010 12.3
13 2011 11.8
14 2012 9.95
15 2013 7.78
16 2014 7.16
17 2015 6.58
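These yearly averages can also be visualised as a trend over time; a minimal ggplot2 sketch:

# Trend of the yearly average Data_Value (sketch)
sub_yts %>%
  group_by(YEAR) %>%
  summarise(year_avg = mean(Data_Value, na.rm = TRUE)) %>%
  ggplot(aes(x = YEAR, y = year_avg)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Data_Value by YEAR",
       x = "Year", y = "Average Data_Value")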
Exercise 2
Using the nycflights13 library and the flights dataset, answer the questions below:
1) Use a Lilliefors test to check if the distribution of flight durations (air_time) is normally distributed. What do the test results suggest?
2) Create a histogram of air_time. Does the histogram visually confirm the findings from the normality test in question 1?
3) Calculate the skewness and kurtosis of dep_delay. How do these values help in understanding the nature of departure delays?
4) Generate a boxplot showing arrival delays (arr_delay) for each carrier. Which carrier shows the greatest variability in arrival delays?
5) Calculate the average departure delay for each month. Which month has the highest average delay?
6) Identify the top five busiest airports by the number of departures. Use a bar plot to visualize these counts.
7) Calculate the proportion of flights to each destination (dest). Which are the top three most common destinations?
8) Perform a Chi-square test to check for independence between carrier and dest. What does the result imply about the relationship between these two variables?
9) Analyze the mean and median dep_delay and arr_delay for each airline (carrier). Use these statistics to determine which airlines, on average, have the longest and shortest delays. What does this suggest about the performance of different airlines?
10) Divide the day into four time periods (e.g., Morning, Afternoon, Evening, Night) and calculate the number of flights that depart in each time period. Which time of day is busiest for departures? Create a bar plot to visualize this information.