Hands-On Exploratory Data Analysis with Python

Histogram

Histogram plots are used to depict the distribution of any continuous variable. These types of plots are very popular in statistical analysis.
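
To make the idea concrete, the following minimal sketch counts by hand how many values fall into each equal-width interval (bin), which is the computation a histogram visualizes. The toy sample and the bin_counts helper are our own illustrative assumptions and are not part of any library or of the survey data used later:

```python
# Hypothetical toy sample, not taken from the survey data used later.
sample = [1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.0, 9.5]
num_bins = 3


def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


counts = bin_counts(sample, num_bins)
print(counts)
```

Each entry of counts becomes the height of one bar. An empty middle bin, as in this toy sample, shows up as a gap in the plot.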

Consider the following use case. A survey conducted during a vocational training session for developers had 100 participants, whose Python programming experience ranged from 0 to 20 years.

Let's import the required libraries and create the dataset:

import numpy as np
import matplotlib.pyplot as plt

# Create the dataset
yearsOfExperience = np.array([10, 16, 14, 5, 10, 11, 16, 14, 3, 14, 13, 19, 2, 5, 7, 3, 20,
       11, 11, 14, 2, 20, 15, 11, 1, 15, 15, 15, 2, 9, 18, 1, 17, 18,
       13, 9, 20, 13, 17, 13, 15, 17, 10, 2, 11, 8, 5, 19, 2, 4, 9,
       17, 16, 13, 18, 5, 7, 18, 15, 20, 2, 7, 0, 4, 14, 1, 14, 18,
        8, 11, 12, 2, 9, 7, 11, 2, 6, 15, 2, 14, 13, 4, 6, 15, 3,
        6, 10, 2, 11, 0, 18, 0, 13, 16, 18, 5, 14, 7, 14, 18])
yearsOfExperience

In order to plot the histogram chart, execute the following steps:

1. Plot the distribution of group experience:

nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins)

2. Add labels to the axes and a title:

plt.xlabel("Years of experience with Python Programming")
plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience in the vocational training session")

3. Draw a green vertical line in the graph at the average experience:

plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color='g')

4. Display the plot:

plt.show()

The preceding code generates the following histogram:

From the graph, we can see that the average experience of the participants is around 10 years, as marked by the green line. Can we improve the graph for better readability? One option is to normalize the histogram so that the bars represent a probability density (the bar areas sum to 1) rather than raw counts. In addition, we can overlay a normal distribution curve computed from the mean and standard deviation of the data to see how closely the data follows that pattern.

If you're not sure what a normal distribution is, we suggest you go through the references in Chapter 1, Exploratory Data Analysis Fundamentals. In a nutshell, the normal distribution, also referred to as the Gaussian distribution, is a probability distribution that is symmetrical about the mean, so that values near the mean are more frequent than values far from it. Enough theory; let's dive into the practice.
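
Both ideas can be checked numerically before plotting anything. The following is a minimal sketch that uses np.histogram to obtain the bin counts and bin edges without drawing a figure. The names counts, edges, widths, fractions, and density are our own illustrative choices:

```python
import numpy as np

# Same survey array as in the dataset step above.
yearsOfExperience = np.array([10, 16, 14, 5, 10, 11, 16, 14, 3, 14, 13, 19, 2, 5, 7, 3, 20,
       11, 11, 14, 2, 20, 15, 11, 1, 15, 15, 15, 2, 9, 18, 1, 17, 18,
       13, 9, 20, 13, 17, 13, 15, 17, 10, 2, 11, 8, 5, 19, 2, 4, 9,
       17, 16, 13, 18, 5, 7, 18, 15, 20, 2, 7, 0, 4, 14, 1, 14, 18,
        8, 11, 12, 2, 9, 7, 11, 2, 6, 15, 2, 14, 13, 4, 6, 15, 3,
        6, 10, 2, 11, 0, 18, 0, 13, 16, 18, 5, 14, 7, 14, 18])

# Raw counts per bin and the 21 edges that delimit the 20 bins.
counts, edges = np.histogram(yearsOfExperience, bins=20)
widths = np.diff(edges)

# Share of participants in each bin (sums to 1).
fractions = counts / counts.sum()

# Probability density: share divided by bin width, so bar areas sum to 1.
density = counts / (counts.sum() * widths)

# NumPy's own normalization, for comparison with the manual version.
density_np, _ = np.histogram(yearsOfExperience, bins=20, density=True)

print("mean experience:", yearsOfExperience.mean())
print("participants counted:", counts.sum())
print("area under density bars:", (density * widths).sum())
print("manual density matches NumPy:", np.allclose(density, density_np))
```

Because the data spans 0 to 20 and is split into 20 bins, each bin here is one year wide, so the density values coincide with the per-bin shares. With any other bin width, the two would differ by that width.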

To normalize the histogram, we can pass the density=1 parameter to the plt.hist function. Let's go through the code. Note that steps 1, 4, 5, and 6 differ from the preceding example. The rest of the code is the same:

1. Plot the distribution of group experience:

plt.figure(figsize=(10, 6))

nbins = 20
n, bins, patches = plt.hist(yearsOfExperience, bins=nbins, density=1)

2. Add labels to the axes and a title:

plt.xlabel("Years of experience with Python Programming")
plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience in the vocational training session")

3. Draw a green vertical line in the graph at the average experience:

plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color='g')

4. Compute the mean and standard deviation of the dataset:

mu = yearsOfExperience.mean()
sigma = yearsOfExperience.std()  # population standard deviation (NumPy default, ddof=0)

5. Add a best-fit line for the normal distribution:

y = ((1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * (1 / sigma * (bins - mu))**2))

6. Plot the normal distribution:

plt.plot(bins, y, '--')

7. Display the plot:

plt.show()

And the generated histogram with the normal distribution is as follows:

The preceding plot clearly shows that the data does not follow a normal distribution: many vertical bars lie well above or below the best-fit curve for a normal distribution. Perhaps you are wondering where the formula in step 5 of the preceding code comes from. Well, there is a little theory involved here. The curve is the probability density function of the normal (Gaussian) distribution with mean mu and standard deviation sigma, evaluated at the bin edges, which is given by ((1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * (1 / sigma * (bins - mu))**2)).
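
The expression above can also be wrapped in a small helper to confirm that it behaves like a probability density. In the following minimal sketch, normal_pdf is a hypothetical helper name of our own, and the values of mu and sigma are illustrative rather than the ones computed from the survey data:

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Gaussian probability density at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))


# Illustrative parameters, not the values computed from the survey data.
mu, sigma = 10.0, 6.0

# Evaluate the curve on a fine grid that extends 8 standard deviations each way.
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
y = normal_pdf(x, mu, sigma)

# The same expression as written in the text, for comparison.
book_expression = ((1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * (1 / sigma * (x - mu))**2))

# Trapezoidal rule written out by hand: the area under a density should be close to 1.
area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

print("matches the expression in the text:", np.allclose(y, book_expression))
print("area under the curve:", area)
print("symmetric about the mean:", np.isclose(normal_pdf(mu - 3, mu, sigma), normal_pdf(mu + 3, mu, sigma)))
```

Since the histogram drawn with density=1 also has a total bar area of 1, the bars and the curve share the same vertical scale, which is what makes the visual comparison in the plot meaningful.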