The ‘tidyverse’ package in R is a very useful tool for data analysis, as it covers almost everything you need to analyze a dataset. It combines several big libraries, which makes it a huge library to learn. In this article, I will try to provide a good overview of the tidyverse library that gives you enough resources to perform a data analysis task well and also a solid base for further learning. It can also be used as a cheat sheet.
If you do not have years of experience, it can be useful to have a list of operations or data analysis ideas on a page in front of you. So, I tried to compile a good number of commonly used operations on this page to help myself and you.
The everyday data analysis packages that are included in the tidyverse package are:
ggplot2
dplyr
tidyr
readr
purrr
tibble
stringr
forcats
This article touches most of these packages, but it does not cover tibble. If you do not know what a tibble is, it is also a kind of data frame. I am not going into the details here. I used simple data frames for all the examples here.
I will start with some simple things and slowly move towards some more complex tasks.
Let’s start!
First import the tidyverse library.
library(tidyverse)
I will start with some simple functions of the stringr library, which are pretty self-explanatory. So, I will not explain them too much.
Converting a string to lower case:
x = "Happy New Year 2022"
str_to_lower(x)
Output:
[1] "happy new year 2022"
Converting a string to upper case:
str_to_upper(x)
Output:
[1] "HAPPY NEW YEAR 2022"
Combining several strings to one:
str_c("I ", "am ", "very ", "happy")
Output:
[1] "I am very happy"
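If the pieces do not already carry their own spaces, str_c can add the separator for you. Here is a small sketch of that; the ‘words’ vector is my own made-up example:

```r
library(stringr)

# A made-up character vector used only for this illustration
words <- c("I", "am", "very", "happy")

# `sep` is placed between the separate arguments of one call
sentence_sep <- str_c("I", "am", "very", "happy", sep = " ")

# `collapse` joins the elements of a single vector into one string
sentence_collapse <- str_c(words, collapse = " ")

sentence_sep
sentence_collapse
```

Both calls should give the same sentence as the example above.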
Taking a substring from each string in a vector:
Here I will make a vector of strings and then take only the first three letters of each string:
x = c("Apple", "Tears", "Romkom")
str_sub(x, 1, 3)
Output:
[1] "App" "Tea" "Rom"
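str_sub also accepts negative positions, which count from the end of each string. A minimal sketch with the same three strings (the variable name ‘last_three’ is my own):

```r
library(stringr)

x <- c("Apple", "Tears", "Romkom")

# -3 is the third character from the end, -1 is the last character
last_three <- str_sub(x, -3, -1)
last_three
# [1] "ple" "ars" "kom"
```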
For the next demonstrations, I will use a dataset called flights that is a part of the nycflights13 library.
library(nycflights13)
The library is imported. Now, you are ready to use the flights dataset. It is a big dataset, so it is not possible to show a screenshot here. Here are the columns of the dataset:
names(flights)
Output:
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
I will start with the ‘unite’ function that combines two columns together. The code block below unites the flight and carrier columns and makes a new column called ‘flight_carr’:
flights %>%
  unite("flight_carr", flight, carrier)
In the output, the new flight_carr column replaces the original flight and carrier columns, with the two values joined by an underscore.
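If you want a different separator, or want to keep the original columns, unite has the ‘sep’ and ‘remove’ arguments. This is only a sketch on a tiny made-up data frame (‘small_flights’ is my own stand-in, not the real flights data):

```r
library(tidyr)

# A tiny hypothetical stand-in for the flights data
small_flights <- data.frame(
  flight  = c(1545L, 1714L, 1141L),
  carrier = c("UA", "UA", "AA"),
  stringsAsFactors = FALSE
)

# sep = "-" changes the separator; remove = FALSE keeps flight and carrier
united <- unite(small_flights, "flight_carr", flight, carrier,
                sep = "-", remove = FALSE)
united
```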
The factor function is very useful when we have categorical data and need to use the categories as levels. For visualizations and for machine learning, it is often necessary to store categorical data as factors.
Here I am converting the carrier column to a factor and printing the unique carriers:
carr = factor(flights$carrier)
levels(carr)
Output:
[1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US"
[14] "VX" "WN" "YV"
Let’s see the count of each carrier:
fct_count(carr)
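When there are many levels, it can help to see the most frequent ones first. Here is a small sketch with a made-up factor (‘carr_small’ is my own example) that uses the ‘sort’ argument of fct_count and the fct_infreq function from forcats:

```r
library(forcats)

# A made-up factor with three carriers of different frequencies
carr_small <- factor(c("UA", "AA", "UA", "DL", "UA", "AA"))

# sort = TRUE puts the most frequent level first
counts <- fct_count(carr_small, sort = TRUE)
counts

# fct_infreq reorders the levels themselves by frequency
by_freq <- levels(fct_infreq(carr_small))
by_freq
# [1] "UA" "AA" "DL"
```

Reordered levels like this are handy in ggplot2, where bars follow the level order.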
The next one is the map_dbl function from the purrr package, which applies a function to each element of a list (or each column of a data frame) and returns the results as a numeric vector. Here I will take the ‘distance’ and ‘sched_arr_time’ columns and find the ‘mean’ of both of them:
map_dbl(flights[, c("distance", "sched_arr_time")], ~mean(.x))
Output:
distance sched_arr_time
1039.913 1536.380
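If a column has missing values, mean returns NA unless you pass na.rm = TRUE. A minimal sketch on a made-up data frame (‘scores’ is my own example) showing two ways to pass that argument through map_dbl:

```r
library(purrr)

# A made-up data frame; column a has one missing value
scores <- data.frame(a = c(1, 2, NA), b = c(10, 20, 30))

# Formula style: .x stands for the current column
col_means <- map_dbl(scores, ~ mean(.x, na.rm = TRUE))

# Extra arguments after the function are passed on to it
col_means_alt <- map_dbl(scores, mean, na.rm = TRUE)

col_means
#    a    b
#  1.5 20.0
```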
ggplot2 is a huge visualization library that comes with the tidyverse package as well. Here is an example:
ggplot(data = flights) +
aes(x = distance) +
geom_density(adjust = 1, fill = "#0c4c8a") +
theme_minimal()
I have a detailed tutorial on ggplot2 where you will find a collection of visualization techniques: “A Collection of Data Visualizations in ggplot2” on towardsdatascience.com.
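Let’s see some functions of the lubridate package. It is attached by library(tidyverse) from tidyverse 2.0.0 onward; with an older version, load it first with library(lubridate):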
dates = c("January 18, 2020", "May 19, 2020", "July 20, 2020")
mdy(dates)
Output:
[1] "2020-01-18" "2020-05-19" "2020-07-20"
Hour-minutes-seconds data:
x = c("11:03:07", "09:35:20", "09:18:32")
hms(x)
Output:
[1] "11H 3M 7S" "9H 35M 20S" "9H 18M 32S"
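Once a string is parsed into a date, you can pull out its parts and do date arithmetic. This is a small sketch; I am using ymd here instead of mdy so that the example does not depend on month names, and the variable names are my own:

```r
library(lubridate)

d <- ymd("2020-01-18")

# Extract the components of the date
year(d)   # 2020
month(d)  # 1
day(d)    # 18

# Add a period of 30 days
later <- d + days(30)
later
# [1] "2020-02-17"
```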
Let’s go back to our flights dataset. It has year, month, and day columns. We can make a date column from them and find the number of flights and the mean distance on each date.
flights %>%
group_by(date = make_date(year, month, day)) %>%
summarise(number_of_flights = n(), mean_distance = mean(distance, na.rm = TRUE))
How to take a subset of a data frame?
This code block takes a random sample of 15 rows from the flights dataset.
flights %>%
slice_sample(n = 15)
The following code block takes a random sample of 15% of the rows from the flights dataset:
flights %>%
slice_sample(prop = 0.15)
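Because the sample is random, you get different rows on every run. If you need the same rows each time, one common approach is to set the seed first. A minimal sketch on a made-up data frame (‘ids’ and the seed value are my own choices):

```r
library(dplyr)

# A made-up data frame with 200 rows
ids <- data.frame(id = 1:200)

# Setting the same seed before each call gives the same sample
set.seed(2022)
sample_a <- slice_sample(ids, n = 15)
set.seed(2022)
sample_b <- slice_sample(ids, n = 15)
identical(sample_a, sample_b)

# prop is a fraction of the rows: 25% of 200 rows is 50 rows
quarter <- slice_sample(ids, prop = 0.25)
nrow(quarter)
```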
Selecting specific columns from a large dataset
Here I am taking the origin, dest, carrier, and flight columns from the flights dataset:
flights %>%
select(origin, dest, carrier, flight)
What if I want most of the columns from this big flights dataset, except the time_hour and tailnum columns?
select(flights, -time_hour, -tailnum)
This code block will select all the columns of the flights dataset except the time_hour and tailnum columns.
Selecting columns by a part of their names can be helpful as well.
The following line of code selects all the columns that start with “air_”:
flights %>%
select(starts_with("air_"))
There is only one column that starts with “air_”.
These are the columns that end with “delay”:
flights %>%
select(ends_with("delay"))
The following columns contain “dep” in their names:
flights %>%
select(contains("dep"))
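These helpers can also be combined with & and !, or mixed with where() to pick columns by type. Here is a small sketch on a one-row made-up data frame (‘cols_demo’ is my own example that borrows a few column names from flights):

```r
library(dplyr)

# A made-up one-row data frame with a few flights-style column names
cols_demo <- data.frame(dep_time = 517, dep_delay = 2,
                        arr_delay = 11, carrier = "UA",
                        stringsAsFactors = FALSE)

# Columns that start with "dep" AND end with "delay"
both <- names(select(cols_demo, starts_with("dep") & ends_with("delay")))

# Everything that does NOT end with "delay"
no_delay <- names(select(cols_demo, !ends_with("delay")))

# All numeric columns
numeric_cols <- names(select(cols_demo, where(is.numeric)))

both          # "dep_delay"
no_delay      # "dep_time" "carrier"
numeric_cols  # "dep_time" "dep_delay" "arr_delay"
```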
How to filter the rows based on specific conditions?
Keeping the rows of data where ‘dest’ starts with “FL” and filtering out the rest of the data:
flights %>%
filter(str_detect(dest, "^FL"))
In the output, all the values in the ‘dest’ column start with ‘FL’.
Here is another example of the filter function. Keeping the rows where month == 2 and filtering out the rest.
filter(flights, month == 2)
This line of code will return the dataset where the month value is 2.
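filter can take more than one condition. Conditions separated by commas must all be true, while | means “or” and %in% checks against several values. A minimal sketch on a made-up data frame (‘trips’ is my own example):

```r
library(dplyr)

# A made-up data frame with four rows
trips <- data.frame(month = c(1, 2, 2, 3),
                    dest  = c("FLL", "IAH", "FLL", "MIA"),
                    stringsAsFactors = FALSE)

# Both conditions must hold: February AND destination FLL
feb_fll <- filter(trips, month == 2, dest == "FLL")

# month is either 1 or 3
jan_or_mar <- filter(trips, month %in% c(1, 3))

# Either condition may hold: February OR destination MIA
feb_or_mia <- filter(trips, month == 2 | dest == "MIA")

nrow(feb_fll)     # 1
nrow(jan_or_mar)  # 2
nrow(feb_or_mia)  # 3
```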
Using filter and select both in the same line of code
Selecting the columns origin, dest, distance, and arr_time where distance value is greater than 650:
select(filter(flights, distance > 650), origin, dest, distance, arr_time)
The same thing can be done using piping as well:
flights %>%
filter(distance > 650) %>%
select(origin, dest, distance, arr_time)
In the following example, I am selecting the flight and distance columns from the flights dataset and taking the average distance where the flight number is 1545.
flights %>%
select(flight, distance) %>%
filter(flight == 1545) %>%
summarise(avg_dist = mean(distance))
Creating New Columns Using the Existing Columns
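I am creating two new columns, arr_time_new and arr_time_old, by adding 20 to and subtracting 20 from the arr_time column using the mutate operation. Before that, I am using filter to remove the rows with missing (NA) values. Note that arr_time is stored as a number in HHMM format, so this is plain arithmetic on that number, not clock arithmetic.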
flights %>%
filter(!is.na(arr_time)) %>%
mutate(arr_time_new = arr_time + 20,
arr_time_old = arr_time - 20)
One last example of mutating. Here we will define a long-distance flight as one with a distance of 1000 miles or more and count how many flights are long-distance as per this definition.
flights %>%
mutate(long_distance = (distance >= 1000)) %>%
count(long_distance)
Counting the Las Vegas arrivals delayed by more than 20 minutes and the Seattle arrivals that were on time (delayed by 20 minutes or less). Rows that match neither condition get NA:
flights %>%
  mutate(dest = case_when(
    dest == "LAS" & arr_delay > 20 ~ "Las Vegas arrival - Delayed",
    dest == "SEA" & arr_delay <= 20 ~ "Seattle arrival - On time")) %>%
  count(dest)
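If you would rather give the remaining rows a label than leave them as NA, you can add a final catch-all condition. This is a sketch on a made-up data frame (‘arrivals’ and the "Other" label are my own choices):

```r
library(dplyr)

# A made-up data frame with four arrivals
arrivals <- data.frame(dest = c("LAS", "LAS", "SEA", "JFK"),
                       arr_delay = c(30, 5, 10, 0),
                       stringsAsFactors = FALSE)

labelled <- arrivals %>%
  mutate(status = case_when(
    dest == "LAS" & arr_delay > 20 ~ "Las Vegas arrival - Delayed",
    dest == "SEA" & arr_delay <= 20 ~ "Seattle arrival - On time",
    TRUE ~ "Other"   # catch-all for rows that matched nothing above
  )) %>%
  count(status)

labelled
```

Here the on-time Las Vegas flight and the JFK flight both fall into "Other".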
Replacing the Names and Counting
The origins are represented as EWR, LGA, and JFK. We will replace them with their full names, count the number of flights, and present them in sorted order:
flights %>%
mutate(origin = str_replace_all(origin, c(
"^EWR$" = "Newark International",
"^LGA$" = "LaGuardia Airport",
"^JFK$" = "John F. Kennedy International"
))) %>%
count(origin)
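When the codes are replaced as whole values, as they are here, a named lookup vector is another option that avoids regular expressions. A minimal sketch on a made-up data frame (‘airport_names’ and ‘origins’ are my own names):

```r
library(dplyr)

# Lookup vector: the names are the codes, the values are the full names
airport_names <- c(
  EWR = "Newark International",
  LGA = "LaGuardia Airport",
  JFK = "John F. Kennedy International"
)

# A made-up data frame with four origins
origins <- data.frame(origin = c("EWR", "JFK", "EWR", "LGA"),
                      stringsAsFactors = FALSE)

named <- origins %>%
  # Index the lookup vector by the code in each row
  mutate(origin_name = unname(airport_names[origin])) %>%
  count(origin_name, sort = TRUE)

named
```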
Summarizing Data Using Group By Function
I want to know the average distance of all the flights in each month. In this case, I will use the group_by function on the month column and find the mean distance using summarize on the distance column:
flights %>%
  group_by(month) %>%
  summarize(mean_distance = mean(distance, na.rm = TRUE))
The group_by function can be used on several variables, and more than one summary function can be used as well.
Here is an example. Another function called ‘arrange’ is also used here to get the dataset sorted by mean distance.
flights %>%
filter(!is.na(distance)) %>%
group_by(year, month) %>%
summarize(mean_distance = mean(distance),
min_distance = min(distance)) %>%
arrange(mean_distance)
Look, the dataset is arranged according to mean distance in ascending order.
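When you group by more than one variable, summarize keeps the result grouped by all but the last variable. If you want a plain ungrouped result, you can pass .groups = "drop", and desc() sorts in descending order. A small sketch on a made-up data frame (‘legs’ is my own example):

```r
library(dplyr)

# A made-up data frame with two months of flights
legs <- data.frame(year = c(2013, 2013, 2013, 2013),
                   month = c(1, 1, 2, 2),
                   distance = c(200, 400, 1000, 3000))

by_month <- legs %>%
  group_by(year, month) %>%
  summarize(mean_distance = mean(distance),
            min_distance = min(distance),
            .groups = "drop") %>%       # return an ungrouped result
  arrange(desc(mean_distance))          # largest mean distance first

by_month
```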
Count Function
Let’s count the flights by flight number:
flights %>%
count(flight)
Counting the number of flights of each origin and sorting them by the number of flights:
flights %>%
count(origin, sort = TRUE)
Let’s see how many flights there are for each origin-destination combination:
flights %>%
count(origin, dest)
Making a flight_path column that shows the origin -> destination combination and counting the flights on each flight path:
flights %>%
count(flight_path = str_c(origin, "->", dest), sort = TRUE)
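count can also add up a column instead of counting rows, using the ‘wt’ argument. A minimal sketch on a made-up data frame (‘short_legs’ is my own example):

```r
library(dplyr)

# A made-up data frame with three flights
short_legs <- data.frame(origin = c("EWR", "EWR", "JFK"),
                         distance = c(200, 400, 1000),
                         stringsAsFactors = FALSE)

# Number of rows per origin
flights_per_origin <- count(short_legs, origin)

# Sum of distance per origin; the result column is still called n
miles_per_origin <- count(short_legs, origin, wt = distance)

flights_per_origin
miles_per_origin
```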
Spread Function
To demonstrate the spread function, let’s prepare something first. I am computing the mean dep_delay for each origin-destination combination:
flg = flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(origin, dest) %>%
  summarize(mean_dep_delay = mean(dep_delay))
flg
Here is the use of spread where the key is the origin and value is the mean_dep_delay:
flg %>%
spread(key = origin, value=mean_dep_delay)
Look, the origin column is spread. There are three origins in this dataset, and they have become the columns of the spread dataset. For each destination, the mean dep_delay from each origin is shown here.
But the problem is that there are some missing (NA) values in this output dataset. We can fill them with zeros using this code:
flights_spread = flg %>%
spread(origin, mean_dep_delay, fill=0)
Please feel free to run this code yourself and see the result.
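In recent versions of tidyr, spread is marked as superseded and pivot_wider is the recommended replacement. Here is a sketch of the same idea with pivot_wider on a tiny made-up data frame (‘delays_long’ and its numbers are my own, not real delays):

```r
library(tidyr)

# A made-up long table: one row per origin-destination pair
delays_long <- data.frame(origin = c("EWR", "JFK", "EWR"),
                          dest = c("ATL", "ATL", "BOS"),
                          mean_dep_delay = c(10, 12, 8),
                          stringsAsFactors = FALSE)

# names_from plays the role of key, values_from the role of value,
# and values_fill replaces the missing combinations with 0
delays_wide <- pivot_wider(delays_long,
                           names_from = origin,
                           values_from = mean_dep_delay,
                           values_fill = 0)
delays_wide
```

There is no JFK to BOS row in the made-up data, so that cell is filled with 0.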
Gather Function
Using the gather function, I will gather the mean_dep_delay values from the columns ‘EWR’ through ‘LGA’ back into a single column:
flights_spread %>%
gather(key = "origin", value = "mean_dep_delay", EWR:LGA)
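Similarly, gather is marked as superseded in recent versions of tidyr, and pivot_longer is the recommended replacement. A minimal sketch on a tiny made-up wide table (‘wide_delays’ and its numbers are my own):

```r
library(tidyr)

# A made-up wide table: one column per origin
wide_delays <- data.frame(dest = c("ATL", "BOS"),
                          EWR = c(10, 8),
                          JFK = c(12, 0),
                          stringsAsFactors = FALSE)

# cols picks the columns to gather; names_to and values_to
# name the new key and value columns
long_delays <- pivot_longer(wide_delays,
                            cols = EWR:JFK,
                            names_to = "origin",
                            values_to = "mean_dep_delay")
long_delays
```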
That’s all in this article!
Conclusion
There is so much more that can be done with the tidyverse package. I tried to give an overview that touches quite a lot of things in different areas. Hope you will use this to do some interesting work yourself.