Data sets frequently consist of more than one column of data, where each column represents measurements of a single variable. Each row usually represents a single observation. This format is referred to as case-by-variable format.
Most data sets are stored in R as data frames. These resemble matrices, but each column has its own name and may hold a different type of data.
A data frame is one of the most commonly used data structures in R, especially for data analysis and statistical modelling. Conceptually, it can be thought of as a table or a spreadsheet, with rows representing observations and columns representing variables. A data frame is similar to a matrix, but with the added flexibility that different columns can contain different types of data (e.g., numeric, character, factor).
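The difference from a matrix can be seen in a small sketch. The vectors `name` and `age` below are illustrative values, not part of any data set used elsewhere in this section.

```r
# Illustrative values only
name <- c("Ali", "Abu", "Ahmad")
age  <- c(9, 6, 2)

# cbind() builds a matrix, which can hold only one type,
# so the numeric ages are coerced to character strings
m <- cbind(name, age)
class(m[, "age"])   # "character"

# A data frame keeps each column's own type
d <- data.frame(name, age)
class(d$age)        # "numeric"
```

Because the ages stay numeric in the data frame, calculations such as `mean(d$age)` work directly, whereas the matrix version would first need converting back with `as.numeric()`.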
Features:
Mixed Data Types: Unlike matrices, data frames can store different classes of objects in each column.
Column Names: Columns in a data frame can have names, which makes accessing and manipulating data easier and more intuitive.
Row Names: By default, rows have index names (from 1 to the number of rows), but these can also be explicitly set to other values.
Creation:
A data frame can be created using the data.frame() function:
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(9, 6, 2),
                 Score = c(82, 93, 92))
df
Name Age Score
1 Ali 9 82
2 Abu 6 93
3 Ahmad 2 92
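Row names can also be supplied at creation time through the `row.names` argument of data.frame(). The following is a minimal sketch that reuses the same values; the object name `df_named` is chosen here only for illustration.

```r
# Use the names as row labels instead of storing them in a column
df_named <- data.frame(Age = c(9, 6, 2),
                       Score = c(82, 93, 92),
                       row.names = c("Ali", "Abu", "Ahmad"))

df_named["Abu", ]          # look up a whole row by its name
df_named["Abu", "Score"]   # look up a single value: 93
```

Whether to keep identifiers as row names or as an ordinary column is a design choice. An ordinary column is usually easier to work with in functions such as merge().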
Indexing:
Columns can be accessed using the $ operator or double square brackets [[…]].
# Extract names from df
df$Name
[1] "Ali" "Abu" "Ahmad"
# Extract score from df
df[["Score"]]
[1] 82 93 92
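The two column accessors return the same vector, while single brackets with a column name return a one-column data frame instead. A short sketch, in which the variable `col` is a hypothetical name used only to show programmatic access:

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(9, 6, 2),
                 Score = c(82, 93, 92))

# $ and [[ ]] both return the column itself, as a vector
identical(df$Score, df[["Score"]])   # TRUE

# Single brackets with a column name keep the data frame wrapper
class(df["Score"])                   # "data.frame"

# [[ ]] accepts a column name stored in a variable, which $ does not
col <- "Age"
df[[col]]
```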
Rows can be accessed using single square brackets […].
# Extract the first row
df[1, ]
Name Age Score
1 Ali 9 82
# Extract the third row
df[3, ]
Name Age Score
3 Ahmad 2 92
# Extract rows where Score is greater than 90
df[df$Score > 90, ]
Name Age Score
2 Abu 6 93
3 Ahmad 2 92
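Row conditions can be combined with `&` (and) or `|` (or), and a second index after the comma selects columns. The sketch below rebuilds the same three-row data frame so that it runs on its own; the object name `older_high` is illustrative.

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(9, 6, 2),
                 Score = c(82, 93, 92))

# Rows with Score above 90 AND Age above 3, keeping only two columns
older_high <- df[df$Score > 90 & df$Age > 3, c("Name", "Score")]
older_high

# A negative index drops rows: everything except the first row
df[-1, ]

# subset() expresses the same kind of filter without repeating df$
subset(df, Score > 90)
```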
Useful functions:
head() and tail(): Display the first or last part of a data frame.
str(): Provides the structure of a data frame, showing the data type of each column and the first few entries.
summary(): Gives a statistical summary of all columns in a data frame.
dim(): Returns the dimensions (number of rows and columns) of a data frame.
rownames() and colnames(): Get or set the row or column names of a data frame.
merge(): Merges two data frames by common columns or row names.
1) head() and tail()
These functions display the first or last part of a data frame, respectively. By default, they show six rows.
# Create a sample data frame
df <- data.frame(Name = c("Ali", "Abu", "Ahmad", "Aminah", "Rosnah", "Rozanae", "Rohana"),
                 Age = c(25, 32, 29, 24, 27, 31, 23),
                 Score = c(85, 90, 93, 87, 78, 91, 82))
# Display the first few rows
head(df)
Name Age Score
1 Ali 25 85
2 Abu 32 90
3 Ahmad 29 93
4 Aminah 24 87
5 Rosnah 27 78
6 Rozanae 31 91
# Display the last few rows
tail(df)
Name Age Score
2 Abu 32 90
3 Ahmad 29 93
4 Aminah 24 87
5 Rosnah 27 78
6 Rozanae 31 91
7 Rohana 23 82
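The number of rows shown can be changed with the `n` argument. A brief sketch using the same seven-row data frame:

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad", "Aminah", "Rosnah", "Rozanae", "Rohana"),
                 Age = c(25, 32, 29, 24, 27, 31, 23),
                 Score = c(85, 90, 93, 87, 78, 91, 82))

head(df, n = 3)    # first three rows
tail(df, n = 2)    # last two rows
head(df, n = -5)   # a negative n means "all but the last 5 rows"
```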
2) str()
This function provides a concise display of the structure of an object, such as a data frame.
# Display the structure of df
str(df)
'data.frame': 7 obs. of 3 variables:
$ Name : chr "Ali" "Abu" "Ahmad" "Aminah" ...
$ Age : num 25 32 29 24 27 31 23
$ Score: num 85 90 93 87 78 91 82
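str() prints its report to the console. When the column types are needed as a value, for instance to pick out the numeric columns, one possible approach is to apply class() or is.numeric() to each column with sapply(). This is a sketch of that idea rather than part of str() itself; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(25, 32, 29),
                 Score = c(85, 90, 93),
                 stringsAsFactors = FALSE)

# One class per column, returned as a named character vector
col_types <- sapply(df, class)
col_types

# Names of the numeric columns only
numeric_cols <- names(df)[sapply(df, is.numeric)]
numeric_cols
```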
3) summary()
Gives a statistical summary of all columns in a data frame.
# Get a summary of df
summary(df)
     Name                Age            Score
 Length:7           Min.   :23.00   Min.   :78.00
 Class :character   1st Qu.:24.50   1st Qu.:83.50
 Mode  :character   Median :27.00   Median :87.00
                    Mean   :27.29   Mean   :86.57
                    3rd Qu.:30.00   3rd Qu.:90.50
                    Max.   :32.00   Max.   :93.00
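summary() also works on a single column, and the individual statistics can be computed directly with functions such as median() and mean(). A small sketch on the same data; the object name `s` is illustrative.

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad", "Aminah", "Rosnah", "Rozanae", "Rohana"),
                 Age = c(25, 32, 29, 24, 27, 31, 23),
                 Score = c(85, 90, 93, 87, 78, 91, 82))

# Summary of one numeric column
s <- summary(df$Score)
s

# Individual statistics, matching the entries in the summary
median(df$Score)
mean(df$Age)
range(df$Score)    # minimum and maximum together
```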
4) dim()
Returns the dimensions of an object.
# Get the dimensions of df (number of rows and columns)
dim(df)
[1] 7 3
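The two values returned by dim() are also available separately through nrow() and ncol(), which can read more clearly in code. A minimal sketch:

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(9, 6, 2),
                 Score = c(82, 93, 92))

nrow(df)   # number of rows (observations)
ncol(df)   # number of columns (variables)

# dim() returns the same two numbers as a vector: rows first, then columns
identical(dim(df), c(nrow(df), ncol(df)))
```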
5) rownames() and colnames()
Retrieve or set the row or column names of a data frame.
# Get row names of df
rownames(df)
[1] "1" "2" "3" "4" "5" "6" "7"
# Get column names of df
colnames(df)
[1] "Name" "Age" "Score"
# Set new row names for df
rownames(df) <- c("A", "B", "C", "D", "E", "F", "G")
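Once row names are set, they can be used for indexing, and column names can be changed in the same assignment style. The sketch below uses a smaller three-row data frame so that it runs on its own; the new column name "Mark" is a hypothetical choice.

```r
df <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                 Age = c(9, 6, 2),
                 Score = c(82, 93, 92))

rownames(df) <- c("A", "B", "C")
df["B", ]                                        # select a row by its name

# Rename a single column by matching its current name
colnames(df)[colnames(df) == "Score"] <- "Mark"
colnames(df)

# Assigning NULL restores the default row numbering
rownames(df) <- NULL
rownames(df)
```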
6) merge()
Merge two data frames by common columns or row names.
# Create another sample data frame
df2 <- data.frame(Name = c("Ali", "Abu", "Rosnah", "Rohana"),
                  Grade = c("A", "B", "A", "C"))
# Merge df and df2 by the "Name" column
merged_df <- merge(df, df2, by = "Name")
print(merged_df)
Name Age Score Grade
1 Abu 32 90 B
2 Ali 25 85 A
3 Rohana 23 82 C
4 Rosnah 27 78 A
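By default merge() keeps only the rows whose key appears in both data frames, which is why the names without a grade are missing from the result above. The `all`, `all.x` and `all.y` arguments change this. A sketch with two small data frames built only for this illustration:

```r
scores <- data.frame(Name = c("Ali", "Abu", "Ahmad"),
                     Score = c(85, 90, 93))
grades <- data.frame(Name = c("Ali", "Abu", "Rohana"),
                     Grade = c("A", "B", "C"))

# Default: only names present in both data frames
inner <- merge(scores, grades, by = "Name")

# all.x = TRUE: keep every row of the first data frame; missing grades become NA
left <- merge(scores, grades, by = "Name", all.x = TRUE)

# all = TRUE: keep rows from either data frame
full <- merge(scores, grades, by = "Name", all = TRUE)

left
```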
The built-in mtcars dataset comprises various specifications and details about different car models from the 1970s.
1. Quick Glance at the Dataset
First, let’s take a quick look at the mtcars dataset:
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
2. Structure of the Dataset (str())
Examining the structure of mtcars:
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
3. Summary of the Dataset (summary())
Providing a statistical summary:
summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
4. Dimensions of the Dataset (dim())
Checking the number of rows and columns:
dim(mtcars)
[1] 32 11
5. Column Names (colnames())
Retrieving the names of the columns:
colnames(mtcars) #same as names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
7. Subsetting Example
Extracting data for cars with 6 cylinders and horsepower (hp) greater than 150:
mtcars[mtcars$cyl == 6 & mtcars$hp > 150, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
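The logical condition can also be stored in a variable and reused, for example to list only the matching car names or to count the matches. A short sketch; the variable name `sel` is illustrative.

```r
# TRUE for each row that meets both conditions
sel <- mtcars$cyl == 6 & mtcars$hp > 150

rownames(mtcars)[sel]   # names of the matching cars
sum(sel)                # number of matches (TRUE counts as 1)

# subset() gives the same rows and can keep just a few columns
subset(mtcars, cyl == 6 & hp > 150, select = c(mpg, hp))
```

In mtcars the car models are stored as row names rather than in a column, which is why rownames() is used to retrieve them.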
Exercise 1:
Create a data frame named students with the following columns: Name, Age, Grade, and Subject. Populate it with at least 5 rows of sample data.
Display the structure of the students data frame using the str() function.
Add a new column to the students data frame named Attendance and populate it with sample data.
Exercise 2:
From the mtcars dataset, extract the mpg (miles per gallon) and hp (horsepower) columns and save them as a new data frame named car_specs.
Retrieve the first 6 rows of the car_specs data frame.
Create a subset of mtcars containing only cars with 6 cylinders (cyl).
Exercise 3:
Calculate the median horsepower (hp) of all cars in the mtcars dataset.
How many cars in the dataset have an automatic transmission (am column: 0 represents automatic, 1 represents manual)?
Which car model in the mtcars dataset has the highest miles per gallon (mpg)?
Exercise 4:
Extract and display all cars from mtcars with 4 cylinders (cyl).
How many cars in the mtcars dataset have more than 100 horsepower (hp) and weigh less than 3,000 lbs (the wt column is recorded in units of 1,000 lbs)?
Retrieve all car models from mtcars that have an automatic transmission and can cover more than 20 miles per gallon.
Exercise 5:
How many rows and columns are present in the mtcars dataset?
What are the names of all the columns in the dataset?
Display the last 8 rows of the dataset.
Exercise 6:
Calculate the median horsepower (hp) of all cars in the dataset.
How many cars in the dataset have an automatic transmission (am column: 0 represents automatic, 1 represents manual)?
Which car model has the highest miles per gallon (mpg)?
Exercise 7:
Extract and display all cars with 4 cylinders (cyl).
How many cars have more than 100 horsepower (hp) and weigh less than 3,000 lbs (the wt column is recorded in units of 1,000 lbs)?
Retrieve all car models that have an automatic transmission and can cover more than 20 miles per gallon.