## A Complete Beginner's Guide to Data Visualization with ggplot2

For R users, ggplot2 is the most popular visualization library, with a huge range of graphics available. It is simple to use and can produce complex plots quickly with short commands. If you work in R, there is little reason not to use ggplot2 for data visualization.

With so many options and graphics available, nobody can remember all of them, so it helps to have a cheat sheet or guide at hand. This article is an attempt at such a guide, covering some common types of plots from basic to advanced level.

For doing all the exercises here, I used RStudio. Hopefully, that’s good for you as well. Let’s start with the basics.

This is the command to install ggplot2 if you do not have it already:

`install.packages("ggplot2")`

After the installation is done, the library needs to be called like this:

`library("ggplot2")`

ggplot2 is now ready to be used. Time to do some cool plotting.

A basic scatter plot

R comes with a lot of built-in datasets for practice. I will use a few of those for the demonstrations in this article, starting with the very popular iris dataset.

`data(iris)`

Please use the head() command to see the first six rows and examine the dataset:

`head(iris)`

Here is how to plot a basic scatter plot in ggplot2. Sepal length will be on the x-axis and petal length on the y-axis.

`ggplot(iris, aes(x=Sepal.Length, y=Petal.Length))+geom_point()`

The code is pretty self-explanatory. The first input is the dataset itself. Then, inside the aesthetics function aes(), come the ‘x’ and ‘y’ parameters. Finally, a function defines the type of plot, in this case ‘geom_point()’. This function can take some style parameters that will be discussed later in this article.
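Because a ggplot call returns an object, the same plot can also be built up in steps. The sketch below is an illustration rather than part of the original walkthrough, and the names `base_plot` and `styled_plot` are made up for this example. It also shows fixed style values passed directly to geom_point(), outside aes(), so they apply to every point.

```r
library(ggplot2)

# Store the data and the aesthetic mapping without drawing anything yet
base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# Add the point layer; colour, size and alpha are fixed values here
# (set outside aes()), so every point gets the same style
styled_plot <- base_plot +
  geom_point(colour = "steelblue", size = 2, alpha = 0.6)

# Printing the object draws the plot
print(styled_plot)
```

Keeping the base in a variable makes it easy to try different geoms on the same mapping.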

Separate the Species by Color

The dataset contains several species. It will be interesting to see how the sepal length and the petal length vary across species. All it takes is a color parameter in the aesthetics. We will set the color to Species, so different species will have different colors.

`ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species))+geom_point()`

The difference is clear. Some people like to use different shapes to show the difference. Here is how to do it.

`ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species, shape=Species))+geom_point()`

More variables can be added using size and transparency. I will map the size parameter to sepal width and the alpha parameter to petal width. So the lower the petal width, the more transparent the point will be.

`ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species, shape=Species, size=Sepal.Width, alpha=Petal.Width))+geom_point()`

Too many legends! Sometimes less is more. We should get rid of some legends. Let’s get rid of alpha and the size legends. Only species will be enough. The ‘guides’ function will help getting rid of the legends.

`ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species, shape=Species, size=Sepal.Width, alpha=Petal.Width))+geom_point() +guides(size="none", alpha="none")`

Not every parameter works in every plot.

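Here I will show another way to combine several variables in the same plot. This time sepal width controls both the color and the size of the points, and guides(colour='legend') merges the two into a single legend: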

`ggplot(iris) + geom_point(aes(x=Sepal.Length, y=Petal.Length, colour=Sepal.Width, size = Sepal.Width, shape=Species))+guides(colour='legend')`

Histograms

As usual, it will be a great idea to start with the very basic histogram. Here is the distribution of sepal length.

`ggplot(iris, aes(x=Sepal.Length))+geom_histogram()`
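Run as written, geom_histogram() falls back to 30 bins and prints a message suggesting that you pick a value yourself. Here is a minimal sketch of doing that. The bin width of 0.25 is only an illustrative choice, and `hist_plot` and `bins` are names made up for this example.

```r
library(ggplot2)

# binwidth is given in the units of the x variable (centimetres for iris)
hist_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25)

# layer_data() returns the values ggplot2 computed for the layer,
# including the count in each bin
bins <- layer_data(hist_plot)
sum(bins$count)  # every row of iris falls into exactly one bin
```

Alternatively, the `bins` argument sets the number of bins instead of their width.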

Adding some color and one more variable makes the plot nicer looking and more informative at the same time:

`ggplot(iris, aes(x=Sepal.Length, fill=Species))+geom_histogram()`

Three colors show the distribution of three different species.

Here is another cool thing. Instead of showing another variable, it is also possible to use a condition as the fill color, such as petal length > 4.

`ggplot(iris, aes(x=Sepal.Length, fill=Petal.Length > 4))+geom_histogram()`

Conditions can be used in scatter plots as well. Please try it if you haven’t already.

There is one problem in this plot. The bars of the two groups are stacked on top of each other, which makes it hard to read each distribution on its own. That can be avoided by adding transparency and setting the position to ‘identity’.

`ggplot(iris, aes(x=Sepal.Length, fill=Petal.Length >4))+geom_histogram(position='identity', alpha=0.4 )`

See the hidden part now? Another solution is simply to place the bars side by side. For that, the position needs to be set to ‘dodge’.

`ggplot(iris, aes(x=Sepal.Length, fill=Petal.Length >4))+geom_histogram(position='dodge', alpha=0.7)`

Which one is the better option? Well, everyone has their own opinion and choices.

Bar Plot

I will use a different dataset that has more categorical variables: another built-in dataset named ‘mpg’, which ships with ggplot2.

`data(mpg)`

This dataset is wider than the previous one, so it is not possible to show its whole width here. Please run this code to see the first six rows and examine the dataset for yourself:

`head(mpg)`

This dataset shows the manufacturer, class, highway mileage, cylinders, years, model, and a few other variables on cars.

Here is the basic bar plot that shows the number of cars for each manufacturer in the dataset. One more function is introduced here, ‘theme’. Here I changed the angle of the text on the x-axis. We will see more uses of the ‘theme’ function later.

`ggplot(mpg, aes(x=manufacturer))+geom_bar()+theme(axis.text.x =element_text(angle = 90))`

The names of the manufacturers had to be rotated by 90 degrees because they would otherwise be too cluttered. In this plot, only the x value was needed, because ggplot2 calculates the count behind the scenes.
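To see what that automatic counting amounts to, the counts can also be computed by hand and drawn with geom_col(), which expects values that are already calculated. This is a sketch for comparison only, and the data frame name `counts` is made up for this example.

```r
library(ggplot2)

# Count the rows per manufacturer by hand
counts <- as.data.frame(table(manufacturer = mpg$manufacturer))

# geom_col() draws bars from precomputed values, so it needs both x and y
ggplot(counts, aes(x = manufacturer, y = Freq)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90))
```

The result should look the same as the geom_bar() version, apart from the y-axis label.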

What if you need the percentage instead of the count?

The function ‘after_stat’ needs to be used as a y parameter in aesthetics. Here is how:

`ggplot(mpg, aes(x=manufacturer, y = after_stat(100*count / sum(count))))+geom_bar()+theme(axis.text.x =element_text(angle = 90))+ labs(y="Percentage of cars")`

Notice that we also changed the label of the y-axis here. Otherwise, ggplot2 would use the percentage formula itself as the y-axis label.
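As a quick sanity check, the percentages that ggplot2 computes can be pulled out of the plot with layer_data() and compared with the same figures worked out in base R. This is only a sketch, and the names `pct_plot`, `bar_pct` and `manual_pct` are made up for this example.

```r
library(ggplot2)

pct_plot <- ggplot(mpg, aes(x = manufacturer,
                            y = after_stat(100 * count / sum(count)))) +
  geom_bar()

# Bar heights as computed by ggplot2
bar_pct <- layer_data(pct_plot)$y

# The same percentages worked out directly from the data
manual_pct <- as.numeric(prop.table(table(mpg$manufacturer)) * 100)

sum(bar_pct)  # the bars together account for all cars
```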

Adding some extra style and information

For this demonstration, let’s make a new plot. This time the x-axis will represent class and ‘hwy’ will be on the y-axis. This plot will also introduce the ‘stat’ parameter. In the ‘geom_bar’ function, we will set ‘stat’ to ‘summary’ and ‘fun’ to ‘mean’ so that each bar shows the mean ‘hwy’ of its class, and then add the individual data points on top of the bars with ‘geom_point’. Here is the complete code:

```
ggplot(mpg, aes(class, hwy))+
  geom_bar(stat='summary', fun="mean", fill='steelblue')+
  geom_point()
```

Those points show the spread of the data within each class as well.
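The bar heights here are plain group means, so they can be reproduced without ggplot2. The sketch below computes them with tapply() and reads the same values back from the bar layer. The names `class_means`, `mean_plot` and `bar_heights` are made up for this example.

```r
library(ggplot2)

# Mean highway mileage per class, computed directly
class_means <- tapply(mpg$hwy, mpg$class, mean)
round(class_means, 1)

# The same numbers as ggplot2 computes them for the bar layer
mean_plot <- ggplot(mpg, aes(class, hwy)) +
  geom_bar(stat = "summary", fun = "mean", fill = "steelblue")
bar_heights <- layer_data(mean_plot)$y
```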

Some people prefer jitter points instead of regular points. Here is an example. I will explain some more after the plot:

```
ggplot(mpg, aes(class, hwy))+
  # fun="mean" is spelled out so the bars clearly show the class means
  geom_bar(stat="summary", fun="mean", fill='steelblue', col='black')+
  geom_point(position=position_jitter(0.2), size=2, shape=15)+
  labs(
    title="hwy for Each Class",
    x = NULL,
    y = NULL
  )+
  # linewidth replaces size in ggplot2 3.4.0 and later
  theme(panel.grid=element_blank(),
        panel.background = element_rect(fill = 'white'),
        panel.border=element_rect(colour = 'white', fill=NA, linewidth=0.2))
```

I like my plots as minimal and clean as possible. Here, the labels of the x and y-axis are taken off, simply because they felt redundant. The title says “hwy for Each Class”, and the names on the x-axis clearly show that they are the classes, so what else could be on the y-axis? In the ‘labs’ function, x and y were set to NULL to remove the x and y labels.

Adding one more variable to the bar plot above

The above plot was simply ‘class’ vs ‘hwy’. Adding another categorical variable on the x-axis will make it more informative. I chose ‘cyl’ for this third variable.

```
ggplot(mpg, aes(class, hwy, fill=as.factor(cyl)))+
  geom_bar(stat="summary", fun="median", col="black",
           position="dodge")+
  geom_point(position=position_dodge(0.9))
```

Here the x-axis shows a separate bar for each ‘cyl’ value within each ‘class’. The bar heights are medians this time, since ‘fun’ is set to ‘median’. In the ‘2seater’ class, only 8-cylinder cars are available in this dataset, so that one bar takes the whole width. The ‘subcompact’ class, on the other hand, has all four ‘cyl’ values, so it shows four bars. As we discussed before, position = ‘dodge’ puts the bars side by side instead of stacking them on each other.
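A cross-tabulation is a quick way to check which ‘class’ and ‘cyl’ combinations exist, and therefore how many bars each class will get. This is a small base R sketch, and the name `class_by_cyl` is made up for this example.

```r
# mpg ships with ggplot2
library(ggplot2)

# Rows are classes, columns are cylinder counts;
# a zero means that combination has no cars and therefore no bar
class_by_cyl <- table(mpg$class, mpg$cyl)
class_by_cyl
```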

Boxplot

The boxplot is another commonly used visualization. It packs a good set of basic statistics about the data into one box. If you are totally new to boxplots, here is a tutorial that explains how to read the data from a boxplot. The plots there are done in Python, but the explanation of the boxplot is in plain English, so it should work for anyone. Here is the link.

#### Here is a basic boxplot:

```
ggplot(mpg, aes(class, hwy))+
  geom_boxplot(fill='steelblue', col="black")
```

The code is simple. It takes the dataset, x, and y parameters in the aesthetics function and I decided to put colors in the geom_boxplot function.
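The statistics behind each box can be read back from the plot with layer_data(). The sketch below pulls out the five values that define every box and cross-checks the medians against base R. The names `box_plot`, `box_stats` and `class_medians` are made up for this example.

```r
library(ggplot2)

box_plot <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(fill = "steelblue", col = "black")

# One row per box: whisker ends (ymin, ymax), the quartiles
# (lower, upper) and the median (middle)
box_stats <- layer_data(box_plot)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# Cross-check the medians against base R
class_medians <- tapply(mpg$hwy, mpg$class, median)
```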

Adding some more style to it:

```
ggplot(mpg, aes(class, hwy))+
  geom_point()+
  geom_boxplot(fill='steelblue', col="black", notch=TRUE)
```

The notches give a slightly more interesting view: they mark an approximate confidence interval around the median.

Jitter Plot

The jitter plot is actually a modified version of the scatter plot. It randomly spreads the dots a little bit or by a specified amount so that dots do not lie on each other.

Here is an example. I will put the class on the x-axis and the ‘hwy’ on the y-axis.

`ggplot(mpg, aes(x=class, y=hwy))+geom_jitter(width=0.2)`

If you do a scatter plot, those dots will be on a vertical straight line. Because it is a jitter plot those dots are a bit scattered. As the width parameter was set to 0.2, the dots were scattered in that range. Please feel free to change that ‘width’ parameter in the geom_jitter() function to see how the plot changes with different widths.
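Since the offsets are random, the plot changes slightly on every run, and by default geom_jitter() also moves the points a little vertically. One way to control both, sketched here with an arbitrary seed of 42, is to pass position_jitter() explicitly. The name `jitter_plot` is made up for this example.

```r
library(ggplot2)

# seed makes the random offsets repeatable; height = 0 keeps the
# hwy values exactly where they are and only spreads points sideways
jitter_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))

jitter_plot
```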

Add a red dot that will represent the mean of those points.

```
ggplot(mpg, aes(x=class, y=hwy))+
  geom_jitter(width=0.2)+
  stat_summary(aes(x=class, y=hwy), fun=mean, geom='point', colour='red', size=3)
```

Those red dots make the plot a little prettier and also more useful.

Facet Grids

Facets are very useful because they make it possible to compare data in a clearer way. I will explain some more after the first plot, because a picture is worth a thousand words.

I will use a different dataset for this demonstration. Let’s import the dataset here:

```
gap = read.csv("https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv")
```

I named the dataset ‘gap’. Please have a look at the first few rows of the dataset with head(gap). It is pretty self-explanatory.

Here is our first facet grid plot:

`ggplot(gap, aes(x=lifeExp, fill=continent))+ geom_histogram(bins=20)+ facet_wrap(~continent)`

So, this is what a facet grid is. Here we made histograms of life expectancy for each continent in the same plot with one line of code. You can put all the histograms in one plot as well using the techniques shown in the previous histogram section.

But this is another way. And it also looks very clean and clear. I want to remove the legends to save some space on the side. That way the picture will have more space. Legends look unnecessary anyway.

```
ggplot(gap, aes(x=lifeExp, fill=continent))+
  geom_histogram(bins=20)+
  facet_wrap(~continent)+
  theme_minimal()+
  theme(legend.position = 'none')+
  xlab("Life Expectancy")+
  ylab("count")
```

The parameter ‘legend.position’ was set to ‘none’ to remove the legends.

It can be set to ‘top’, ‘bottom’, ‘left’, or ‘right’ as well.

Another important function added in this plot is ‘theme_minimal()’, which gives the plot a slightly different background.

There are so many different themes available. Here are some of them:

theme_bw(),

theme_classic(),

theme_dark(),

theme_gray(),

theme_light(),

and many more.

Here is a scatter plot using the same continents data, where ‘gdpPercap’ goes on the x-axis and ‘lifeExp’ on the y-axis.

```
ggplot(gap)+
  geom_point(aes(x=gdpPercap, y=lifeExp))+
  facet_wrap(~continent)
```

By default, every panel shares the same scale. I think in most cases keeping the same scale helps, because it is easy to compare the data of different continents when the scale is the same. But when the data is as cluttered as it is here, it can look better to let each panel have its own scales.

```
ggplot(gap)+
  geom_point(aes(x=gdpPercap, y=lifeExp))+
  facet_wrap(~continent, scales='free')
```

Each individual scatter plot now has its own scale, which means different x and y-axis values.

It is also possible to free only the x-axis or only the y-axis by setting ‘scales’ to ‘free_x’ or ‘free_y’ in the facet_wrap function.
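As a small sketch of the single-axis option, the example below frees only the y-axis. It uses the built-in ‘mpg’ data faceted by ‘drv’ instead of the downloaded gapminder file, purely so that it runs without an internet connection. The name `free_y_plot` is made up for this example.

```r
library(ggplot2)

# Free only the y-axis: every panel keeps the same x range,
# but each gets its own y range
free_y_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~drv, scales = "free_y")

free_y_plot
```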

The ‘facet_wrap’ function can take more than one variable. Going back to our ‘mpg’ dataset, let’s make a faceted plot using both ‘year’ and ‘drv’ in facet_wrap.

```
ggplot(mpg)+
  geom_point(aes(x=displ, y=hwy))+
  facet_wrap(~year+drv)
```

Please feel free to experiment with it some more.
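For comparison, ggplot2 also has a separate facet_grid() function, which lays two variables out as the rows and columns of a grid instead of wrapping the panels. Here is a minimal sketch with the same two variables. The name `grid_plot` is made up for this example.

```r
library(ggplot2)

# year defines the rows and drv the columns of the grid
grid_plot <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(year ~ drv)

grid_plot
```

With two years and three drive types, this should produce a two-by-three grid of panels.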

The last example using the facet_wrap() that includes a lot of style ideas

This plot will use the ‘mpg’ dataset again. It is a bar plot showing the number of cars per manufacturer for individual years.

```
ggplot(mpg)+
  geom_bar(aes(y=manufacturer))+
  facet_wrap(~year)+
  labs(title = "Number of cars per manufacturer",
       x = NULL,
       y = NULL)+
  # start the bars at zero, keeping a little padding on the right
  scale_x_continuous(expand = expansion(mult = c(0, 0.05)))+
  theme_minimal()+
  theme(
    text = element_text(family = "Tahoma"),
    strip.text = element_text(face='bold', hjust=0),
    # white grid lines; linewidth replaces size in ggplot2 3.4.0 and later
    panel.grid.major = element_line('white', linewidth=0.6),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.ontop = TRUE
  )
```

The theme_minimal() was used here. But look, even if you use a built-in theme, you can add your own style to it. In this plot the x and y labels were taken off, as the title says it all. The minor panel grid is left blank and the major panel grid is set to ‘white’ with a width of 0.6. That adds a lot of style to the plot. Adding ‘panel.ontop = TRUE’ draws these panel elements on top of the bars. If you delete this line, those thin grid lines on the bars won’t show.

Including some statistical parameters

In this section, we will add a regression line and confidence band. For this demonstration, ‘displ’ will be placed on the x-axis and ‘hwy’ will be placed on the y-axis from the ‘mpg’ dataset. The color will be changed by ‘cyl’.

```
ggplot(mpg, aes(x= displ,
                y = hwy))+
  geom_point(aes(colour = cyl))+
  geom_smooth()+
  labs(x = 'displ',
       y = 'hwy',
       title = "displ vs hwy")+
  theme_minimal()
```

The ‘geom_smooth()’ function above adds a smoothed trend line (a loess curve by default for a dataset of this size) and a confidence band that shows the confidence interval at each point.
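When called with no arguments, geom_smooth() prints a message saying which method and formula it picked. A minimal sketch of spelling those out, and of switching the confidence band off with `se = FALSE`, could look like this. The name `smooth_plot` is made up for this example.

```r
library(ggplot2)

smooth_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = cyl)) +
  # State the method and formula explicitly instead of relying on
  # the defaults, and hide the confidence band
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  theme_minimal()

smooth_plot
```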

Scale transformation is a very common practice in statistics. I am not going into the details here because that is outside the scope of this article. But I will show a scale transformation in ggplot2 and, instead of the default smoother, use a linear regression line in the plot.

```
ggplot(mpg, aes(x= displ,
                y = hwy))+
  geom_point(aes(colour = cyl))+
  geom_smooth(method='lm')+
  scale_x_log10()+
  labs(x = 'displ',
       y = 'hwy',
       title = "displ vs hwy")+
  theme_minimal()
```

In this plot, a log10 transformation is applied to the x-axis and a linear regression line is introduced by setting method='lm' in the geom_smooth() function. Please feel free to try the log10 transformation on the y-axis as well.
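To see the numbers behind that straight line, the same kind of model can be fitted directly with lm(). This is a sketch for illustration, and the name `fit` is made up for this example.

```r
# mpg ships with ggplot2
library(ggplot2)

# Scale transformations are applied before statistics are computed,
# so the line in the plot corresponds to a fit of hwy on log10(displ)
fit <- lm(hwy ~ log10(displ), data = mpg)
coef(fit)  # intercept and slope on the log10 scale
```

The slope comes out negative: larger engines go with lower highway mileage.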

# Conclusion

The ggplot2 library has a wide range of graphics available, with a lot more geometries and a lot more styles than I could show in this article. If you are a beginner and finished all the exercises above, you have come a long way. At the same time, there is a long way to go. I will make more tutorials on ggplot2 in the future. I hope you are able to use some of these ideas in your work or projects.