Import and Export Data

Author

Dr. Mohammad Nasir Abdullah

Importing Data

Importing data is one of the first steps in any data science workflow, coming before analysis or modelling. To do any kind of analysis, you need well-organised data, so having a quick and error-free way to import data is essential. Likewise, after modelling or analysis in R, it is common to need to export the results, visualisations, or transformed datasets for sharing, reporting, or further processing in other tools.

Why is effective data import/export crucial?

  1. Data integrity: If something goes wrong during import, such as mishandling missing values, data types, or encoding, the analysis results can be wrong. Reliable import methods preserve the integrity of the data.
  2. Efficiency: Time spent fixing import and export problems is time not spent analysing the data itself. Smooth importing and exporting can greatly increase productivity.
  3. Versatility: Data is stored in many different ways, such as CSV files, Excel spreadsheets, databases, and more specialised formats like SPSS or SAS. Knowing how to import from and export to these different sources makes a data worker more flexible and adaptable.
  4. Collaboration: People who work with data often need to share their findings with collaborators and stakeholders. Reliable export tools make it easy to share results, datasets, or visualisations, which supports collaboration and decision-making.

In these notes, we will use demo.csv, demo.sav, demo.xls, Data Exercise.sav, and KAP1.dta. All the datasets can be downloaded from this link: https://dataintror.s3-ap-southeast-1.amazonaws.com/dataset+Introduction+to+R.zip

First, let us set up the environment. We need to identify the working directory and clear all objects in the environment.

1) To clear all objects in the R environment

rm(list=ls())

2) To check working directory

#To check current working directory
getwd()

#If needed, change the current working directory using:
setwd("C:/Users/Nasirthegreats/Documents")

1. Importing from a Comma-Separated file

First, we must make sure the original file meets these criteria (see the illustration after the list):

  1. The first row of the file should contain the variable names.
  2. Variable names should contain only letters, numbers, and underscores (for example: Age, Admission_date, Score_1).
  3. They should not include special characters such as spaces, @, $, or # (for example: “Admission Date”, #race, @prob).
  4. The data should begin at the second row, first column onwards.
  5. Everything that is not part of the data should be removed (for example: comments, labels, graphs, pivot tables).
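For illustration, a file that satisfies these criteria might begin like this (hypothetical rows shown for layout only, not the actual contents of demo.csv):

Age,Gender,Income,Score_1
34,Male,52000,78
29,Female,61000,85
41,Male,47000,69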

1st Method using read.table

To import the comma-separated file demo.csv, we can use read.table, telling it that the first row contains variable names (header = TRUE) and that fields are separated by commas (sep = ","):

mydata <- read.table("demo.csv", header = TRUE, sep = ",")

Next, we can check for the dimension of the dataset by using this code:

dim(mydata) #To look at the dimensions of the data.
[1] 6400   28
            #It displays rows and columns

We can also view some of the observations:

head(mydata) #It will show first 6 observations for all variables

head(mydata[1:4], n=10L)
#It will show first 10 observations (10L)
#For variable number 1 until number 4

2nd Method using read.csv

data2 <- read.csv("demo.csv")

3rd Method using readr

library(readr)
data3 <- read_csv("demo.csv")
Rows: 6400 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Gender
dbl (27): Age, Marital status, Address, Income, Income Category, Car price, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
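As the message suggests, the column-specification printout can be silenced by setting show_col_types = FALSE:

data3 <- read_csv("demo.csv", show_col_types = FALSE)
#suppresses the column specification message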

Comparing time to load in R

#to compare time to load in R
time1 <- proc.time()
data2 <- read.csv("demo.csv")
proc.time() - time1
   user  system elapsed 
   0.01    0.00    0.06 
time2 <- proc.time()
data3 <- read_csv("demo.csv")
Rows: 6400 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Gender
dbl (27): Age, Marital status, Address, Income, Income Category, Car price, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
proc.time()- time2
   user  system elapsed 
   0.08    0.02    0.13 

Another way to import data is fread() from the data.table package. This function uses parallel processing, which makes it very fast.

library(data.table)

time2 <- proc.time()
data21 <- fread("demo.csv")
proc.time()- time2
   user  system elapsed 
      0       0       0 
data21 <- fread("demo.csv", nrows = 12)

2. Importing from MS Excel file

One of the simplest ways to read an Excel file is to export it to a comma-delimited file and import it using the methods above. Alternatively, you can use the readxl package to access Excel files directly. The first row should contain variable/column names. As a start, we will read the first worksheet from the workbook demo.xls; open the original Excel file first and make sure the first row contains variable names and the file meets all the criteria mentioned in the previous section.

To start importing Excel files, we need to install a package named readxl.

install.packages("readxl")

Once the package is installed, load it by calling the library:

library(readxl)

Now we can begin to import the data from the MS Excel file:

mydata1 <- read_excel("demo.xls", sheet="demo")
#we need to specify the sheet name.
#In this case the sheet name is "demo"

The file name can end in either .xls or .xlsx. There is another way to import data from Excel without specifying the sheet name, but you need to know the position of the sheet you want; in this case, the data is in the first sheet.

mydata1 <- read_excel("demo.xls", 1)
#we need to specify the sheet number. 
#In this case the "demo" sheet is the first sheet.
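If you are not sure what the sheets are called, readxl's excel_sheets() lists the sheet names in a workbook:

excel_sheets("demo.xls")
#returns a character vector of sheet names, in this case "demo"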

3. Importing data from SPSS file

1st method using foreign package

To import a data file in SPSS format, we need to install the foreign package. With the foreign package, we are able to import data files from both SPSS and STATA.

install.packages("foreign")

Once the package is installed, load it by calling the library:

library(foreign)

Now, we can begin to import the data from the SPSS file:

data.sav <- read.spss("Data Exercise.sav", to.data.frame=TRUE) #SPSS
#Last option converts value labels to R factors.

2nd method using Hmisc package

There is another way to import an SPSS data file into R, using the Hmisc package:

install.packages("Hmisc")
library(Hmisc) #load the packages

Attaching package: 'Hmisc'
The following objects are masked from 'package:base':

    format.pval, units
mydata1 <- spss.get("Data Exercise.sav", use.value.labels = TRUE)

3rd method using haven package

To import an SPSS data file into R using the haven package:

library(haven)
data4 <- read_sav("demo.sav")

Comparing time to load in R

#to compare time to load in R
time1 <- proc.time()
data7 = read.spss("demo.sav", to.data.frame = TRUE)
Warning in read.spss("demo.sav", to.data.frame = TRUE): Undeclared level(s) 2
added in variable: marital
proc.time() - time1
   user  system elapsed 
   0.07    0.00    0.11 
time2 <- proc.time()
data8 <- read_sav("demo.sav")
proc.time()- time2
   user  system elapsed 
   0.03    0.00    0.03 

4. Importing data from STATA file

1st method using foreign package

In this example, we will use the foreign package to import the data:

library(foreign)

Now, we will import the STATA data file (.dta format) into R:

mydata <- read.dta("KAP1.dta")

It is actually quite fast to import a STATA data file into R.

2nd method using haven package

library(haven)
data10 <- read_dta("KAP1.dta")
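As with the CSV and SPSS files earlier, you can compare the loading times of the two methods; a sketch following the same proc.time() pattern used above:

time1 <- proc.time()
mydata <- read.dta("KAP1.dta")   #foreign
proc.time() - time1

time2 <- proc.time()
data10 <- read_dta("KAP1.dta")   #haven
proc.time() - time2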

5. Importing data from Internet / Web

The internet is a vast repository of data, ranging from structured datasets in CSV or Excel formats, to APIs returning JSON or XML data, to unstructured data in web pages. R provides a range of tools and packages that make importing this data straightforward.
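For example, JSON returned by a web API can be read with the jsonlite package; a minimal sketch, where the URL is a placeholder rather than a real endpoint:

library(jsonlite)
#fromJSON() accepts a URL and parses the JSON response,
#often into a data frame when the structure allows it
api_data <- fromJSON("https://example.com/api/records.json")   #hypothetical URL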

From the website https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data, we will download the data at https://data.cdc.gov/api/views/4juz-x2tp/rows.csv?accessType=DOWNLOAD; notice that this is a .csv file provided by the webpage.

my_data <- read_csv("https://data.cdc.gov/api/views/4juz-x2tp/rows.csv?accessType=DOWNLOAD")
Rows: 10600 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl  (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now, let us explore another dataset from data.gov.my: the number of births in Malaysia by date, from https://storage.data.gov.my/demography/births.csv

birth <- read_csv("https://storage.data.gov.my/demography/births.csv")
Rows: 37833 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): state
dbl  (1): births
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Another example from the same website: RON97 Petrol Price: https://storage.data.gov.my/commodities/fuelprice.csv

ron97 <- read.csv("https://storage.data.gov.my/commodities/fuelprice.csv")

6. Importing data from .parquet file

Parquet is a columnar storage file format optimized for analytics. Originating from the Hadoop ecosystem, Parquet is especially used with big data technologies because of its efficiency. Some features and advantages of Parquet include:

  1. Columnar Storage: By storing data column-wise, Parquet allows more efficient I/O operations, compression, and encoding schemes.
  2. Schema Evolution: Parquet supports complex nested data structures and allows schema evolution, meaning you can modify the schema over time without having to rewrite the entire dataset.
  3. Language and Framework Neutrality: Parquet can be used with any project in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language.
  4. Compression: Columnar storage also makes it possible to better compress data. Different compression schemes can be specified for different columns.

Reading a Parquet file in R

To read a Parquet file in R, you can use the arrow package, which provides a set of tools for working with Parquet and Arrow datasets.

Here’s how you can read a Parquet file into R using the arrow package:

install.packages("arrow")

Once the package is installed, load it into the R environment:

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp

Let us try to import a Parquet file from data.gov.my; this dataset contains exchange rates of the Malaysian ringgit against various currencies (including the Chinese renminbi).

data1 <- read_parquet("https://storage.data.gov.my/finsector/exchangerates.parquet")

head(data1)
# A tibble: 6 × 25
  date                myr_usd myr_rmb myr_sgd myr_eur myr_jpy myr_twd myr_thb
  <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 2003-12-01 08:00:00   0.263    2.18   0.453   0.220    28.8    8.36    10.5
2 2003-12-02 08:00:00   0.263    2.18   0.453   0.218    28.6    8.36    10.5
3 2003-12-03 08:00:00   0.264    2.18   0.453   0.218    28.6    8.36    10.5
4 2003-12-04 08:00:00   0.263    2.18   0.451   0.218    28.5    8.36    10.5
5 2003-12-05 08:00:00   0.264    2.18   0.452   0.217    28.4    8.36    10.5
6 2003-12-06 08:00:00   0.264    2.18   0.452   0.217    28.4    8.36    10.5
# ℹ 17 more variables: myr_idr <dbl>, myr_hkd <dbl>, myr_krw <dbl>,
#   myr_vnd <dbl>, myr_inr <dbl>, myr_aud <dbl>, myr_php <dbl>, myr_aed <dbl>,
#   myr_sar <dbl>, myr_try <dbl>, myr_gbp <dbl>, myr_brl <dbl>, myr_mxn <dbl>,
#   myr_bdt <dbl>, myr_chf <dbl>, myr_cad <dbl>, myr_rub <dbl>

Notice that the Parquet file is read in as a tibble.

7. Importing data from .zip file

readr can read a .csv inside a .zip file without manually unzipping it.

require(readr)
myData <- read_csv("zomato.csv.zip")
Multiple files in zip: reading 'zomato.csv'
Rows: 51717 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): url, address, name, online_order, book_table, rate, phone, locatio...
dbl  (1): votes
num  (1): approx_cost(for two people)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
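The message above shows that the archive contains more than one file and that readr chose zomato.csv. To pick a specific file from a multi-file archive yourself, base R's unz() connection can be used; a sketch assuming the same archive and file names:

myData <- read.csv(unz("zomato.csv.zip", "zomato.csv"))
#unz() opens a connection to one named file inside the zip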

Exporting data

The ability to export data efficiently and accurately is a crucial step in the data analysis process. As a bridge between data processing and data communication, exporting ensures that insights generated through meticulous analyses can be shared, interpreted, and acted upon across platforms and stakeholders.

R, a leading language in statistical computing and graphics, has carved out a reputation for its data manipulation capabilities. However, its prowess is not limited to just importing and transforming data. Its extensive suite of exporting tools ensures that users can save their work in a myriad of formats, be it traditional spreadsheets like Excel, databases like MySQL, or even more specialized formats like SPSS and SAS. This flexibility ensures that R integrates seamlessly into various data workflows, making it a preferred tool for many data professionals.

However, with flexibility comes complexity. The vast array of exporting options can be overwhelming for both new and experienced users. The choice of format, the specific package to use, and the nuances of each exporting function can impact the integrity and usability of the exported data. Hence, understanding these options and their implications becomes paramount.

In these notes, we aim to demystify the data exporting process in R. Through detailed explanations, examples, and best practices, we will guide readers through the diverse landscape of data exporting in R. Whether you are looking to share your findings with colleagues, create visualizations in another software package, or move large datasets to a different storage platform, this guide will equip you with the knowledge and skills to do so effectively.

1. Exporting data to a tab-delimited file (.txt)

We will try to export the object data1 to a .txt file. Note that write.table() is part of base R, so no extra package is needed.

write.table(data1, "data1.txt", sep = "\t")

2. Exporting data to a comma-separated file (.csv)

write.csv(data1, "data2.csv") #add row.names = FALSE to drop row names

3. Exporting data to Excel format

Method 1

library(openxlsx)
write.xlsx(data1, "data3.xlsx", sheetName = "Sheet1", 
           colNames = TRUE, rowNames = TRUE)
#openxlsx writes the modern .xlsx format, so the file
#extension should be .xlsx rather than .xls

Method 2

library(writexl)
write_xlsx(data1, "data.xlsx")

4. Exporting data to SPSS format

Method 1

library(foreign)
write.foreign(data1, "data2.txt", 
              "data2.sps", package = "SPSS")
#writes the data to data2.txt and an SPSS syntax
#file (not a .sav binary) to data2.sps

Method 2

library(haven)
write_sav(data1, "data.sav")

5. Exporting data to SAS format

Method 1

write.foreign(data1, "data2.txt", 
              "data2.sas", package = "SAS")

Method 2

write_sas(data1, "data_sas.sas7bdat")
Warning: `write_sas()` was deprecated in haven 2.5.2.
ℹ Please use `write_xpt()` instead.
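Following the deprecation notice, the current haven function is write_xpt(), which writes the SAS transport (XPT) format:

write_xpt(data1, "data_sas.xpt")
#writes a SAS transport file readable by SAS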

6. Exporting data to STATA format

Method 1

write.dta(data1, "data2.dta")

Method 2

write_dta(data1, "data2.dta")

7. Exporting data to RDS and RData

R's native formats are efficient for storing R objects:

saveRDS(data1, "data2.rds")
loaded_data <- readRDS("data2.rds")


#multiple objects with RData
save(data1, data2, file="data1and2.RData")
load("data1and2.RData")

Concluding Remarks

R offers an extensive set of tools for importing and exporting data. Depending on your needs and the tools you or your collaborators use, you can choose the appropriate method for importing and exporting. Always keep in mind the compatibility and constraints of your chosen data format.

Exercise

Question 1

a. Download a sample CSV file from the internet and import it into R. What function would you use from base R to achieve this?

b. Import an Excel file with multiple sheets into R. How would you access the second sheet specifically?

Question 2

a. Use R to read a table directly from a Wikipedia page of your choice (Hint: Consider the “rvest” package)

b. Given a URL that returns JSON data, how would you import this data into an R dataframe?

Question 3

a. Suppose you have a .sav file, which is an SPSS data file. Which R package would you consider to import this file?

b. How would you import a Parquet file into R?

Question 4

a. You have a dataframe df. Export it to a CSV file named “my_data.csv” without row names.

b. How would you export the same dataframe df to an Excel file with the sheet name “Dataset”?

Question 5

a. You are given the URL of a CSV file. Write a script to download this CSV, import it into R, make any transformation of your choice (e.g., filter, mutate), and then export the transformed dataset to an Excel file.

b. You have scraped multiple tables from different web pages into a list of dataframes in R. How would you export each dataframe in the list to its own sheet in a single Excel workbook?