Chapter 5 Bar Plots
5.1 Introduction
In this chapter, we will visualize categorical data using univariate and bivariate bar plots. More specifically, we will learn to:
- create
  - simple bar plot
  - stacked bar plot
  - grouped bar plot
- modify bar
  - direction
  - color
  - line color
  - width
  - labels
- modify axis range
- remove axes from the plot
- specify the line type of the X axis
- offset the Y axis
- modify legend
5.2 Bar Plot
A bar plot represents data as rectangular bars. The length of each bar is proportional to the value it represents. Bar plots can be either horizontal or vertical. In a vertical bar plot, the X axis represents the levels or categories of the variable and the Y axis represents their frequency/count.
A univariate bar plot represents a single categorical variable. The categories in the variable are represented on the X axis and their frequencies on the Y axis. In the below example, the cyl variable from the mtcars data set is visualized using a bar plot. The categories or levels are 4, 6 and 8, which represent the number of cylinders in the automobile and are shown on the X axis. The frequency for each type of cylinder is represented by the Y axis.
In R, bar plots can be created using either the plot() or the barplot() function. The two functions take different inputs. With plot(), we pass the variable itself, but it must first be converted to a factor. With barplot(), the input must be the counts or frequencies of the variable, which the table() function can generate. Let us use both the functions to create the bar plot:
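The plot() approach described above can be sketched as follows, assuming the built-in mtcars data set. The variable name cyl_factor is only illustrative and does not come from the original text.

```r
# Convert cyl to a factor so that plot() draws a bar plot
# (cyl_factor is an illustrative name)
cyl_factor <- as.factor(mtcars$cyl)
plot(cyl_factor)
```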
5.2.2 Using barplot function
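A minimal sketch of the barplot() approach, again assuming the built-in mtcars data set. The counts are generated with table() and passed straight to barplot().

```r
# barplot() expects counts, so tabulate cyl first
barplot(table(mtcars$cyl))
```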
If you observe carefully, the same plot is generated by both the functions. Before we explore the bar plots further, let us store the data in a new variable instead of using the table() function in every example:
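A sketch of this step is given here. The name cyl_freq is taken from the code used later in this chapter; printing it should give the counts for 4, 6 and 8 cylinders.

```r
# Store the frequency table once and reuse it in later examples
cyl_freq <- table(mtcars$cyl)
cyl_freq
```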
##
##  4  6  8
## 11  7 14
5.3 Horizontal or Vertical
Bar plots can be horizontal or vertical (which is the default). Use the horiz argument in the barplot() function to build a horizontal bar plot. In a horizontal bar plot the axes are flipped: the Y axis represents the categories and the X axis represents their counts/frequencies.
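A minimal sketch of a horizontal bar plot, assuming cyl_freq holds the cylinder counts from mtcars as defined earlier:

```r
cyl_freq <- table(mtcars$cyl)

# horiz = TRUE draws the bars horizontally; the default is FALSE
barplot(cyl_freq, horiz = TRUE)
```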
5.4 Bar Width
By default, all the bars in the plot have the same width and the same space between them. A specific category of the variable can be highlighted by increasing/decreasing the width of the bar representing it. In our example, we will increase the width of the bar that represents automobiles with 8 cylinders. The width argument is used to specify the width of the bars.
The width should be a vector with one value per bar, so its length should equal the number of categories of the variable. If the vector is shorter, R recycles the values across the bars.
5.4.2 Different Widths
In the first example below, the width of the third bar is twice the width of the other two bars.
In the second example, the width of the second bar is half the width of the first bar, and the width of the third bar is twice the width of the first bar.
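The two cases just described might be written as follows. This is a sketch that assumes cyl_freq holds the cylinder counts; the specific width values are chosen only to match the ratios in the text.

```r
cyl_freq <- table(mtcars$cyl)

# Third bar twice as wide as the other two
barplot(cyl_freq, width = c(1, 1, 2))

# Second bar half, and third bar twice, the width of the first
barplot(cyl_freq, width = c(2, 1, 4))
```

Only the relative sizes matter here, since the plot is rescaled to fit the bars.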
The space between the bars can be specified in a similar manner using the space argument in the barplot() function. In the below example, the space between the second and third bars is twice the space between the first and second bars.
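A sketch of the space argument, assuming cyl_freq as before. Each value of space is the gap left before the corresponding bar, expressed as a fraction of the average bar width, so the first value is the gap before the first bar.

```r
cyl_freq <- table(mtcars$cyl)

# Gap before bar 2 is 1 bar width; gap before bar 3 is 2 bar widths
barplot(cyl_freq, space = c(1, 1, 2))
```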
5.5 Labels
It is important to add appropriate labels to the bars in order to communicate properly. In our example, the bars represent automobiles with different numbers of cylinders, and the default labels are the numbers of cylinders themselves. To demonstrate how to add labels, we will change the labels from numbers to their corresponding words. The names.arg argument is used to add labels to the bars in a plot. Below is our example:
Specify a label for every bar in the plot; otherwise R will return an error.
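A minimal sketch of custom labels, assuming cyl_freq as before. The label text matches the words used in the combined example later in this chapter.

```r
cyl_freq <- table(mtcars$cyl)

# One label per bar, in the same order as the categories 4, 6 and 8
barplot(cyl_freq, names.arg = c('Four', 'Six', 'Eight'))
```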
5.6 Color
Let us add some color to the plots. In a bar plot, we can specify different colors for the bars and their borders. Use the col argument to add color to the bars.
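As a minimal sketch, a single color passed to col is applied to every bar. The color 'blue' is an arbitrary choice for illustration.

```r
cyl_freq <- table(mtcars$cyl)

# A single color is applied to every bar
barplot(cyl_freq, col = 'blue')
```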
5.6.2 Different colors for the bars
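To give each bar its own color, pass a vector with one color per bar. The sketch below assumes cyl_freq as before and uses the same three colors as the combined example later in this chapter.

```r
cyl_freq <- table(mtcars$cyl)

# One color per bar, in the order of the categories 4, 6 and 8
barplot(cyl_freq, col = c('blue', 'red', 'green'))
```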
What happens if we do not specify color for all the bars? The colors you specify are recycled.
5.6.3 Recycling colors
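Recycling can be illustrated with two colors for three bars. In this sketch the name two_cols is illustrative, and the rep_len() call is only there to show which color each bar ends up with.

```r
cyl_freq <- table(mtcars$cyl)

# Fewer colors than bars (two_cols is an illustrative name)
two_cols <- c('blue', 'red')
barplot(cyl_freq, col = two_cols)

# The colors after recycling: the third bar reuses the first color
rep_len(two_cols, length(cyl_freq))
```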
The border argument specifies the color of the border of the bars. The rules that apply to the col argument apply here as well. Below are the examples:
5.6.5 Different colors for the borders
What happens if we do not specify a border color for all the bars? As with col, the colors you specify are recycled.
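The border cases can be sketched in the same way as the col examples. The colors are arbitrary choices for illustration.

```r
cyl_freq <- table(mtcars$cyl)

# Same border color for all bars
barplot(cyl_freq, border = 'blue')

# One border color per bar
barplot(cyl_freq, border = c('blue', 'red', 'green'))

# Fewer colors than bars: the border colors are recycled
barplot(cyl_freq, border = c('blue', 'red'))
```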
5.7 Axes
In this section, we will learn to:
- remove axes from the plot
- specify the line type of the X axis
- offset the Y axis
5.7.1 Remove axes
The axes argument controls whether the value axis (the Y axis in a vertical bar plot) is drawn. It takes logical values as input and the default is TRUE. Set it to FALSE to remove that axis from the plot:
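A minimal sketch, assuming cyl_freq as before:

```r
cyl_freq <- table(mtcars$cyl)

# axes = FALSE suppresses the numeric (value) axis
barplot(cyl_freq, axes = FALSE)
```

The category labels under the bars are still drawn in this sketch; they are controlled separately by the axisnames argument of barplot().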
The line type of the X axis can be specified using the axis.lty argument. It applies only to the category axis, whose line is not drawn by default, and it does not modify the line type of the Y axis.
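A sketch of axis.lty, assuming cyl_freq as before. The value 2 (a dashed line) is the one used in the combined example later in this chapter.

```r
cyl_freq <- table(mtcars$cyl)

# Draw the category axis with a dashed line (lty = 2)
barplot(cyl_freq, axis.lty = 2)
```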
Though axis.lty does not change the line type of the Y axis, we can offset the Y axis using the offset argument, which shifts the bars relative to the axis. In the below example, we will offset the Y axis and you can observe that its minimum value is now 5 instead of 0.
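A sketch of the offset argument, assuming cyl_freq as before. The value 5 is inferred from the minimum axis value mentioned in the text rather than taken from the original code.

```r
cyl_freq <- table(mtcars$cyl)

# Shift every bar up by 5 units, so the bars start at 5 instead of 0
barplot(cyl_freq, offset = 5)
```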
You can similarly modify the range of the Y axis using the ylim argument, although modifying the range may not be very useful for bar plots.
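A sketch of ylim, assuming cyl_freq as before. The range of 0 to 20 is an arbitrary choice that leaves some room above the tallest bar (14).

```r
cyl_freq <- table(mtcars$cyl)

# Set the Y axis range explicitly as c(lower, upper)
barplot(cyl_freq, ylim = c(0, 20))
```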
5.8 Putting it all together…
Let us quickly revise what we have learnt so far and build a bar plot for visualizing the cyl variable in the mtcars data set:
# cyl_freq holds the cylinder counts created earlier in the chapter
cyl_freq <- table(mtcars$cyl)
barplot(cyl_freq, col = c('blue', 'red', 'green'),
        horiz = TRUE, width = c(1, 1, 2),
        names.arg = c('Four', 'Six', 'Eight'),
        axis.lty = 2, offset = 2)
5.8.1 Title & Axis Labels
The plot looks good, but someone who does not know the underlying data will find it difficult to understand what is being communicated. Let us add a title and labels for the axes.
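Titles and axis labels are set with the main, xlab and ylab arguments. The sketch below extends the combined example; the title and label text are illustrative assumptions, not the original author's wording. Because the bars are horizontal, the frequencies are on the X axis.

```r
cyl_freq <- table(mtcars$cyl)

barplot(cyl_freq, col = c('blue', 'red', 'green'),
        horiz = TRUE, width = c(1, 1, 2),
        names.arg = c('Four', 'Six', 'Eight'),
        axis.lty = 2, offset = 2,
        main = 'Bar Plot of Cylinders',  # illustrative title
        xlab = 'Frequency',              # value axis (bars are horizontal)
        ylab = 'Number of Cylinders')    # category axis
```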
5.9 Bivariate Bar Plots
A bivariate bar plot represents the cross table or two way table of two categorical variables. There are two types:
- Stacked Bar Plots
- Grouped Bar Plots
Before we look at bivariate bar plots, let us create a two way table of cyl (number of cylinders) and gear (number of gears) using the table() function:
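A sketch of this step, assuming the built-in mtcars data set. The name cyl_gear is the one used by the grouped bar plot code later in this chapter. The first call tabulates gear on its own, and the second builds the two way table.

```r
# Frequency of each number of gears
table(mtcars$gear)

# Two way table: rows are cylinders, columns are gears
cyl_gear <- table(mtcars$cyl, mtcars$gear)
cyl_gear
```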
##
##  3  4  5
## 15 12  5
##
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
The number of gears is represented by the columns in the table and the number of cylinders is represented by the rows.
5.9.1 Stacked Bar Plot
The bars in the plot represent the distribution of cyl for each level of the gear variable. The first bar represents the distribution of cylinders for automobiles with 3 gears. In terms of the two way table we saw earlier, the columns are the bars and the rows are the sections within each bar. Let us add some colors to the plot, as the default colors are not very intuitive. The colors will also allow us to clearly examine the distribution of cyl for the different levels of gear.
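A sketch of the stacked bar plot, assuming cyl_gear is the two way table of cyl and gear. The colors are assigned to the rows of the table in order, so blue, red and green correspond to 4, 6 and 8 cylinders.

```r
cyl_gear <- table(mtcars$cyl, mtcars$gear)

# Default stacked bar plot: one bar per column (gear),
# one section per row (cyl)
barplot(cyl_gear)

# Same plot with one color per row: 4, 6 and 8 cylinders
barplot(cyl_gear, col = c('blue', 'red', 'green'))
```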
If you carefully observe the table and the plot, within each bar (one bar for each number of gears):
- the blue sections represent the number of automobiles with 4 cylinders
- the red sections represent the number of automobiles with 6 cylinders
- the green sections represent the number of automobiles with 8 cylinders
We need to convey the above information in some way and will do that using the legend.text argument. It accepts either a logical value or a vector of labels, and by default no legend is drawn. When it is set to TRUE, a legend is added to the plot using the row names of the table. In the next example, we add a legend as well as other relevant information such as title and axis labels.
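A sketch of the stacked bar plot with a legend, assuming cyl_gear as before. The title and axis labels are borrowed from the grouped bar plot code in the next section.

```r
cyl_gear <- table(mtcars$cyl, mtcars$gear)

# legend.text = TRUE labels the legend with the row names (4, 6, 8)
barplot(cyl_gear, col = c('blue', 'red', 'green'),
        legend.text = TRUE,
        main = 'Gears vs Cylinders',
        xlab = 'Number of Gears', ylab = 'Frequency')
```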
5.9.2 Grouped Bar Plot
A grouped bar plot represents the same data as the stacked bar plot, but instead of being stacked, the bars are grouped and placed beside each other.
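# cyl_gear is the two way table of cyl and gear created earlier
cyl_gear <- table(mtcars$cyl, mtcars$gear)
barplot(cyl_gear, col = c('blue', 'red', 'green'),
        beside = TRUE, legend.text = TRUE,
        main = 'Gears vs Cylinders',
        xlab = 'Number of Gears', ylab = 'Frequency')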
The beside argument in the barplot() function is set to TRUE to build grouped bar plots. It takes logical values as input and the default value is FALSE. As you can observe from the plot, the bars are placed beside each other instead of being stacked.