Chapter 7 Histograms
In this chapter, we will learn to:
- create a bare bones histogram
- specify the number of bins/intervals
- represent frequency density on the Y axis
- add colors to the bars and the border
- add labels to the bars
A histogram is a plot that can be used to examine the shape and spread of continuous data. It looks very similar to a bar graph and can be used to detect outliers and skewness in data. The histogram graphically shows the following:
- center (location) of the data
- spread (dispersion) of the data
- presence of multiple modes
To construct a histogram, the data is split into intervals called bins. The intervals may or may not be of equal size. For each bin, the number of data points that fall into it is counted (the frequency). The Y axis of the histogram represents the frequency and the X axis represents the variable.
Before we learn how to create histograms, let us see how normal and skewed distributions look when represented by a histogram.
7.2.1 Normal Distribution
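As a rough illustration, the sketch below draws a histogram of simulated, approximately normal data. The seed, sample size, mean and standard deviation are arbitrary assumptions chosen for the example, not values from this chapter.

```r
# Simulated data (assumption): 1000 draws from a normal distribution
set.seed(123)
normal_data <- rnorm(1000, mean = 50, sd = 10)

# A roughly symmetric, bell-shaped histogram is expected
hist(normal_data, main = "Normal distribution")
```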
7.2.2 Skewed Distributions
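One way to see skewness is to simulate it. The sketch below assumes exponential data for a right (positive) skew and mirrors the same values to get a left (negative) skew; the seed, sample size and rate are arbitrary choices for illustration.

```r
# Simulated data (assumption): exponential draws have a long right tail
set.seed(123)
right_skewed <- rexp(1000, rate = 1)

# Mirroring the values moves the long tail to the left
left_skewed <- -right_skewed

# Draw the two histograms side by side
par(mfrow = c(1, 2))
hist(right_skewed, main = "Right (positive) skew")
hist(left_skewed, main = "Left (negative) skew")
par(mfrow = c(1, 1))
```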
Histograms are created using the hist() function in R. The minimum input required to create a bare bones histogram is a continuous variable. Below is an example:
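A minimal sketch of such a call, assuming the mpg column of the built-in mtcars data set (the variable that h$xname reports later in this section):

```r
# Bare bones histogram: the only input is a continuous variable
hist(mtcars$mpg)
```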
The hist() function returns details of the histogram, which can be accessed by assigning the histogram to a variable. Let us assign the above histogram to a variable h and use the $ symbol to access the details stored in the variable.
h <- hist(mtcars$mpg)

# break points of the intervals
h$breaks
## [1] 10 15 20 25 30 35

# frequency of the intervals
h$counts
## [1]  6 12  8  2  4

# frequency density
h$density
## [1] 0.0375 0.0750 0.0500 0.0125 0.0250

# mid points of the intervals
h$mids
## [1] 12.5 17.5 22.5 27.5 32.5

# variable name
h$xname
## [1] "mtcars$mpg"

# whether intervals are of equal size
h$equidist
## [1] TRUE
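The hist() function creates equidistant intervals by default. We can specify the number of bins using the breaks argument. R treats a single number as a suggestion only, so the resulting number of bins may differ slightly.
The plot below displays histograms with different numbers of bins: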
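A sketch of how such a comparison could be drawn, assuming the mtcars$mpg variable used earlier. The bin counts 4, 6, 8 and 10 are hypothetical values picked for illustration.

```r
# Arrange four histograms in a 2 x 2 grid
par(mfrow = c(2, 2))

# Hypothetical bin counts; each is passed to hist() as a suggestion
for (n in c(4, 6, 8, 10)) {
  hist(mtcars$mpg, breaks = n, main = paste("breaks =", n))
}

# Restore the default single-plot layout
par(mfrow = c(1, 1))
```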
If we want to create histograms with specific intervals, the breaks argument can be supplied with the intervals.
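For instance, the break points can be passed as a numeric vector. The values below are an assumption, chosen so that the class widths (8, 6, 6, 5) agree with the frequency density table in the next section.

```r
# Assumed break points giving intervals of unequal width
hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35))
```

The break points must span the full range of the data, otherwise hist() stops with an error.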
Notice that the Y axis no longer represents frequency. Because the intervals are not of equal width, hist() plots the frequency density instead. What is frequency density?
7.5.1 Frequency Density
Frequency Density = Relative Frequency / Class Width
Relative Frequency = Frequency / Total Observations
##   frequency class_width relative_frequency frequency_density
## 1        13           8            0.40625        0.05078125
## 2        12           6            0.37500        0.06250000
## 3         3           6            0.09375        0.01562500
## 4         4           5            0.12500        0.02500000
When each frequency density is multiplied by its class width, the products always sum to 1.
## [1] 1
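The table and the sum above can be reproduced with a short sketch. It assumes the break points 10, 18, 24, 30 and 35, which are consistent with the class widths and frequencies shown but are not stated in the text.

```r
# Assumed break points (consistent with the class widths 8, 6, 6, 5)
breaks <- c(10, 18, 24, 30, 35)

# Compute the histogram without drawing it
h <- hist(mtcars$mpg, breaks = breaks, plot = FALSE)

frequency          <- h$counts
class_width        <- diff(h$breaks)
relative_frequency <- frequency / sum(frequency)
frequency_density  <- relative_frequency / class_width

data.frame(frequency, class_width, relative_frequency, frequency_density)

# Each bar's area is density times width; the areas add up to 1
sum(frequency_density * class_width)
```

The computed frequency_density column should match h$density, the value hist() stores for each bin.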
We will return to frequency density shortly. Before we end this section, there is one more way to specify the intervals of a histogram: algorithms. The hist() function allows us to specify the following algorithms:
- Sturges (default)
- Scott
- Freedman-Diaconis (FD)
In the plot below, we examine how the algorithms work:
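A sketch of such a comparison, assuming mtcars$mpg as the data. The algorithm is selected by passing its name as a string to breaks; "Sturges", "Scott" and "FD" are the names hist() accepts.

```r
# Draw one histogram per algorithm, side by side
par(mfrow = c(1, 3))
hist(mtcars$mpg, breaks = "Sturges", main = "Sturges")
hist(mtcars$mpg, breaks = "Scott", main = "Scott")
hist(mtcars$mpg, breaks = "FD", main = "Freedman-Diaconis")
par(mfrow = c(1, 1))
```

As with a numeric value, the bin count each algorithm computes is only a suggestion, so the resulting histograms can look more alike than the formulas suggest.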
7.6 Frequency Distribution II
Let us come back to frequency density. If you want the Y axis of the histogram to represent frequency density instead of counts, set the freq argument to FALSE.
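A minimal sketch, again assuming mtcars$mpg as the data:

```r
# Y axis shows frequency density instead of counts
hist(mtcars$mpg, freq = FALSE)
```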
The same result can be achieved by using the probability argument, which is an alias for !freq. It takes only logical values as inputs, and for equal-width bins the default is FALSE. If set to TRUE, the Y axis will represent the frequency density instead of counts.
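To see the effect, the sketch below places the default histogram next to one drawn with probability = TRUE. The data (mtcars$mpg) and the panel titles are assumptions for illustration.

```r
# Left: counts (default for equal-width bins); right: frequency density
par(mfrow = c(1, 2))
hist(mtcars$mpg, main = "Counts")
hist(mtcars$mpg, probability = TRUE, main = "Density")
par(mfrow = c(1, 1))
```

The shape of the two histograms is identical; only the scale of the Y axis changes.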
To add colors to the bars of the histogram, use the col argument. If fewer colors are specified than there are bars, the colors are recycled. Below are a few examples:
7.7.1 Single Color
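A sketch with one color for all bars; the color name "blue" is an arbitrary choice.

```r
# Every bar is filled with the same color
hist(mtcars$mpg, col = "blue")
```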
7.7.2 Different Colors
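A sketch with one color per bar. It assumes the default five bins for mtcars$mpg; the color names are arbitrary.

```r
# Five bars, five colors (arbitrary choices)
bar_colors <- c("red", "blue", "green", "yellow", "brown")
hist(mtcars$mpg, col = bar_colors)
```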
7.7.3 Recycled Colors
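A sketch of recycling, assuming the default five bins for mtcars$mpg and two arbitrary colors. With fewer colors than bars, the colors repeat in order: red, blue, red, blue, red.

```r
# Two colors for five bars: the colors are reused in sequence
hist(mtcars$mpg, col = c("red", "blue"))
```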
7.8 Border Color
Colors can be specified for the borders of the histogram bars using the border argument.
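A minimal sketch that passes a single color to the border argument of hist(); the color is an arbitrary choice.

```r
# All bars get the same border color
hist(mtcars$mpg, border = "blue")
```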
7.8.1 Different Colors
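Border colors can also be given as a vector, one per bar. The sketch assumes the default five bins for mtcars$mpg; the color names are arbitrary.

```r
# One border color per bar (arbitrary choices)
border_colors <- c("red", "blue", "green", "yellow", "brown")
hist(mtcars$mpg, border = border_colors)
```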
In certain cases, we might want to show the frequency counts on the histogram bars, since it is easier to read the frequency of each bin when it is printed on top of the bar. We can add the counts using the labels argument, which can be set either to TRUE or to a character vector containing the label values. Let us look at both methods.
7.9.1 Method 1
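Set labels to TRUE and hist() prints the count of each bin above its bar. A minimal sketch, assuming mtcars$mpg as the data:

```r
# Counts are drawn on top of the bars
hist(mtcars$mpg, labels = TRUE)
```

If the label on the tallest bar is clipped, widening ylim is one way to make room.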
7.9.2 Method 2
Specify the label values in a character vector.
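A sketch of this method. The label values below are the bin counts reported by h$counts earlier in the chapter, written as strings; any other text of the same length would work too.

```r
# One label per bar, supplied as a character vector
bar_labels <- c("6", "12", "8", "2", "4")
hist(mtcars$mpg, labels = bar_labels)
```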