A Data Storytelling Project with Some Advanced Visualization in ggplot2

I believe data visualization is the most effective way to communicate data to people, so it is important for every data scientist or analyst to learn it. This article shares some visualizations in R using ggplot2, a rich library that includes a wide range of visualization techniques. I am still learning as much of it as I can myself.

A few weeks ago, I had to summarize this public dataset using some visualizations. Preparing the data for those visualizations also required a few data analysis techniques.

These are the libraries that are necessary for this project:

library(ggplot2)
library(psych)
library(dplyr)
library(tidyr)
library(ggpubr)
library(usmap)

First, I import the dataset. It shows US exports of agricultural products by state.

us_ag = read.csv("https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv")

These are the columns of this dataset:

names(us_ag)

Output:

 [1] "code"          "state"         "category"      "total.exports" "beef"
 [6] "pork"          "poultry"       "dairy"         "fruits.fresh"  "fruits.proc"
[11] "total.fruits"  "veggies.fresh" "veggies.proc"  "total.veggies" "corn"
[16] "wheat"         "cotton"

Let’s begin the real fun!

I thought it would be interesting to see if there is a correlation between the exports of the agricultural products. I will make a smaller dataset with the agricultural products only and make a correlation plot.

corr_data = us_ag[, c("pork", "beef", "poultry", "dairy", "total.fruits", 
"total.veggies", "corn", "wheat", "cotton")]

I use this corr_data dataset to make a correlation plot:

corPlot(corr_data, cex = 0.8, main="Correlation Between Variables",
        las = 2)

A few interesting correlations show up here. States that export a lot of corn also tend to export a lot of pork (correlation 0.77). Poultry and cotton are also fairly strongly correlated, at 0.61.
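To read exact values rather than estimating them from the plot, the correlation matrix can also be printed directly with base R's cor(). This is a minimal sketch, assuming a small made-up data frame (toy) in place of corr_data; the numbers are for illustration only.

```r
# Made-up export figures standing in for corr_data (not from the real dataset)
toy = data.frame(
  corn   = c(10, 20, 30, 40, 50),
  pork   = c(12, 18, 33, 41, 48),
  cotton = c(50, 10, 40, 20, 30)
)

# Pearson correlation between every pair of columns, rounded for readability
corr_matrix = round(cor(toy), 2)
print(corr_matrix)

# Look up a single pair by column name
corr_matrix["corn", "pork"]
```

With the real data, the same call would be cor(corr_data), and a single pair can be looked up by name in the same way.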

The next plot is a diverging plot that shows whether each state’s total exports are below or above average.

Some data preparation is required before doing that. First, I need to standardize the total exports column.

us_ag$exports_z = round((us_ag$total.exports - mean(us_ag$total.exports))/sd(us_ag$total.exports), 2)

Next, I label each state as ‘below average’ or ‘above average’: if the exports_z value is less than zero, the state is below average; otherwise, it is above average. Then I sort the dataset by exports_z and turn state into a factor with that ordering, so the plot keeps the sorted order on the axis.

us_ag$export_level = ifelse(us_ag$exports_z < 0, "below", "above")
us_ag_sorted = us_ag[order(us_ag$exports_z),]
us_ag_sorted$state = factor(us_ag_sorted$state, levels=us_ag_sorted$state)
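As a quick sanity check of the standardization and labeling steps, here is a minimal sketch on a made-up vector (exports and its values are hypothetical, not taken from the dataset). It also shows that base R's scale() gives the same z-scores.

```r
# Hypothetical export totals for five states (illustrative values only)
exports = c(100, 200, 300, 400, 1500)

# z-score: distance from the mean, measured in standard deviations
z = (exports - mean(exports)) / sd(exports)

# Negative z means below the average, everything else counts as above
level = ifelse(z < 0, "below", "above")

# scale() centers and scales in one step; as.numeric() drops its matrix shape
z_scaled = as.numeric(scale(exports))

print(data.frame(exports, z = round(z, 2), level))
```

One large value pulls the mean up, so four of the five toy states end up below average. The same effect is worth keeping in mind when reading the real plot.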

Data is prepared for the diverging plot. Below is the code block to make the diverging plot:

{r fig.height=11, fig.width=6}
ggplot(us_ag_sorted, aes(x = state, y = exports_z, label=exports_z)) + 
  geom_point(stat = 'identity', aes(col=export_level), size=7) + 
  scale_color_manual(name="Exports",
                     labels = c("Above Average", "Below Average"),
                     values = c("above"="#00ba38", "below"="#f8766d"))+
  geom_text(color="white", size = 2) +
  labs(title="Diverging Bar Plot Showing Exports Level",
       subtitle = "Normalized total export",
       # x is the state aesthetic; coord_flip() below draws it on the vertical axis
       x = "State",
       y = "Normalized Total Exports") +
  ylim(-2, 5)+
  coord_flip()

Here is the plot:

For more detail, I also want to see the exports of each product per state. There are too many products to show usefully in a single plot, so I split them across several plots and combine them later.

The first plot will only include beef, pork, and poultry. Here is the code:

{r fig.height=4, fig.width=12}
us_ag %>% select(state, beef, pork, poultry) %>%
  pivot_longer(., cols = c(beef, pork, poultry),
               values_to="Val") %>%
  ggplot(aes(x = state, y = Val, color=name, alpha = 0.7))+
  geom_point(size = 4) + 
  scale_color_manual(values = c("beef" = "black", "pork" = "red", "poultry" = "green")) +
  geom_line(aes(group=name)) + 
  theme(axis.text.x = element_text(angle = 90)) + 
  guides(alpha="none")+
  labs(x = "State",
       y = "Exports",
       title = "Beef, Pork, and Poultry Exports of States")

Output plot:




As you can see, Iowa is the biggest exporter of pork and Texas is the biggest exporter of beef.
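The reshaping step in the plot code above is what lets a single color aesthetic cover three products. A minimal sketch of what pivot_longer() does, assuming a tiny made-up data frame (toy) rather than us_ag:

```r
library(tidyr)

# Two hypothetical states with two product columns (values are made up)
toy = data.frame(state = c("A", "B"),
                 beef  = c(1, 2),
                 pork  = c(3, 4))

# Each product column becomes rows: the old column name goes into "name"
# (the default for names_to) and the number goes into "Val"
long = pivot_longer(toy, cols = c(beef, pork), values_to = "Val")
print(long)
```

The result has one row per state and product, which is why the plot code can map color = name and group the lines with aes(group = name).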

Dairy, total fruits, and total vegetables are included in the next plot:

{r fig.height= 4, fig.width=14}
us_ag %>% select(state, total.fruits, total.veggies, dairy) %>%
  pivot_longer(., cols = c(total.fruits, total.veggies, dairy),
               values_to="Val") %>%
  ggplot(aes(state, Val, fill = name)) + geom_col(width = 0.9)+
  theme(axis.text.x = element_text(angle = 90))

Here is the plot:

California exports far more fruits, vegetables, and dairy than any other state.

Finally, corn, cotton, and wheat:

{r fig.height= 4, fig.width=14}
us_ag %>% select(state, corn, cotton, wheat) %>%
  pivot_longer(., cols = c(corn, cotton, wheat),
               values_to="Val") %>%
  ggplot(aes(state, Val, fill = name)) + geom_col(width = 0.9)+
  theme(axis.text.x = element_text(angle = 90))+
  labs(x = "State",
       y = "Exports",
       title = "Corn, Cotton, and Wheat Exports of States")

Output plot is here:

Texas is the biggest exporter of cotton; I visited a lot of cotton fields when I lived there. Iowa is the biggest exporter of corn and, as you can see in the picture, a few other states export a good amount of corn as well.

Putting all three of these plots together on one page will look good in a presentation. Here is the code for combining them:

{r fig.height=10, fig.width=14}
p1 = us_ag %>% select(state, beef, pork, poultry) %>%
  pivot_longer(., cols = c(beef, pork, poultry),
               values_to="Val") %>%
  ggplot(aes(x = state, y = Val, color=name, alpha = 0.7))+
  geom_point(size = 4) + 
  scale_color_manual(values = c("beef" = "black", "pork" = "red", "poultry" = "green")) +
  geom_line(aes(group=name)) + 
  theme(axis.text.x = element_text(angle = 90)) + 
  guides(alpha="none")+
  labs(x = "State",
       y = "Exports",
       title = "Beef, Pork, and Poultry Exports of States")

p2 = us_ag %>% select(state, total.fruits, total.veggies, dairy) %>%
  pivot_longer(., cols = c(total.fruits, total.veggies, dairy),
               values_to="Val") %>%
  ggplot(aes(state, Val, fill = name)) + geom_col(width = 0.9)+
  theme(axis.text.x = element_text(angle = 90))+
  labs(x = "State",
       y = "Exports",
       title = "Dairy, Fruits, and Vegetables Exports of States")

p3 = us_ag %>% select(state, corn, cotton, wheat) %>%
  pivot_longer(., cols = c(corn, cotton, wheat),
               values_to="Val") %>%
  ggplot(aes(state, Val, fill = name)) + geom_col(width = 0.9)+
  theme(axis.text.x = element_text(angle = 90))+
  labs(x = "State",
       y = "Exports",
       title = "Corn, Cotton, and Wheat Exports of States")

ggarrange(p1, p2, p3, nrow=3)

Here is the output plot:

For the next plot, I decided to show each state’s top export product on a US map. That requires latitude-longitude data, which this dataset does not include, so I downloaded some from Kaggle. You can find this lat-long data here. Please feel free to download and use it.




First I will merge this lat-long data with the main dataset:

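latlong = read.csv("statelatlong.csv")
# Rename "State" to "code" so it matches the key column in us_ag
latlong = rename(latlong, c("code" = "State"))
# Drop the City column, which is not needed here
latlong = subset(latlong, select=-c(City))
latlong1 = latlong[, c(3, 2, 1)]  # reordered copy; not used in the rest of the code
us_ag_ll = merge(latlong, us_ag, by = "code")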

This us_ag_ll dataset includes all the columns and also the lat-long data.
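One detail of merge() is worth keeping in mind when combining columns from different data frames later: by default (sort = TRUE) it sorts the result by the key column, so the merged rows may not be in the same order as the original data. A minimal sketch with made-up data frames (left and right are hypothetical, and the coordinates are rough illustrative values):

```r
# Rows listed in state-name order: Alabama, Alaska, Arizona
left  = data.frame(code = c("AL", "AK", "AZ"),
                   Latitude = c(32.8, 64.2, 34.0))
right = data.frame(code = c("AL", "AK", "AZ"),
                   beef = c(1, 2, 3))

# merge() matches rows on "code" and then sorts the result by that key
merged = merge(left, right, by = "code")
print(merged)

# The row order is now AK, AL, AZ, which differs from the input order
as.character(merged$code)
```

Because of this, a column computed from the unmerged data frame cannot safely be attached to the merged one by position. Computing it from the merged data frame, or merging it in by code, avoids the mismatch.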

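We need to find out which product is the top export for each state. For that, I will make a dataset with only the agricultural product columns. I take them from us_ag_ll rather than us_ag because merge() sorts its result by code, so the two datasets no longer share the same row order:

df = us_ag_ll[, c("beef", "pork", "poultry", "dairy", "total.fruits", "total.veggies", "corn", "wheat", "cotton")]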

The following code block finds out the maximum value and the name of the product with the maximum export value:

df$max_val = apply(X=df, MARGIN = 1, FUN = max)
df$maxExportItem= colnames(df)[apply(df,1,which.max)]
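The row-wise logic is easier to see on a tiny example. This sketch uses a made-up data frame (toy) and, as a small variation on the code above, restricts both apply() calls to the product columns so that helper columns such as max_val never take part in the comparison.

```r
# Hypothetical exports for two states across three products (made-up values)
toy = data.frame(beef = c(5, 1), corn = c(2, 9), wheat = c(3, 4))

# Remember the product columns before adding any helper columns
products = names(toy)

# MARGIN = 1 applies the function to each row
toy$max_val = apply(toy[, products], 1, max)

# which.max() returns the position of the largest value in the row;
# indexing products with it turns that position into a product name
toy$maxExportItem = products[apply(toy[, products], 1, which.max)]

print(toy)
```

If two products tie for the largest value in a row, which.max() returns the first one in column order.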

This maxExportItem column needs to be included in the main dataset us_ag_ll.

us_ag_ll$max_export_item = df$maxExportItem
us_ag_ll$max_val = df$max_val

We need to transform the latitude and longitude data to be able to use it in the map.

transformed = usmap_transform(us_ag_ll[, c("Longitude", "Latitude")])

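You can inspect the transformed dataset if you like.

The “Longitude” and “Latitude” columns have been transformed, and two new columns named “Longitude.1” and “Latitude.1” have been created in the transformed dataset. These names come from the usmap version used here; other versions may name the output columns differently, so check names(transformed) if the plot code cannot find them.

We will use “Longitude.1” and “Latitude.1” for the plot.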

I always think it is a good idea to put some values on the map. I will show the top export item and its export amount in each state, so we should prepare that label first.

us_ag_ll$Max_Export_Val = paste(us_ag_ll$max_export_item, us_ag_ll$max_val, sep="-")

All the data preparation is done, so let’s do the plotting! The color gradient represents the total exports, and the label on each state shows its top export item and that item’s export value.

{r fig.height=11}
plot_usmap(data = us_ag_ll, values = "total.exports", color = "blue")+
  geom_text(data=us_ag_ll, aes(x = transformed$Longitude.1,
                                y = transformed$Latitude.1, 
                                label = Max_Export_Val), size=3.5, color = "black")+
  scale_fill_continuous(low = "white", high = "blue", name = "Total Exports", label = scales::comma) + 
  labs(title = "Total Exports in Color in Each State", subtitle = "Top Exports Items and Export Values of Each State")+
  theme(legend.position = "bottom", plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 18))

Output:

The placement of the values on the map could probably be improved with better latitude-longitude data. I used what I found on Kaggle.

That’s all I wanted to share in this article!

Conclusion:

There are many different ways to approach a dataset and visualize it. If you are a learner like me and trying to upskill, please try your own ideas, and if you can improve these plots in your own way, feel free to share your code through a GitHub link in the comment section.




