Learning Spark SQL

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field:
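A minimal sketch of this step, assuming the original DataFrame is named df and contains the bank marketing columns age, duration, campaign, previous, and deposit (the "term deposit subscribed" outcome); the case class name Call and these column names are our assumptions, not fixed by the text:

case class Call(age: Int, duration: Int, campaign: Int, prevCalls: Int, deposit: String)

import spark.implicits._

// Project the numeric fields plus the outcome field into a typed Dataset
val ds = df.select($"age", $"duration", $"campaign",
  $"previous".as("prevCalls"), $"deposit").as[Call]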

Next, we use describe() to compute the count, mean, stddev, min, and max values for the numeric columns in our Dataset. The describe() command gives us a quick way to sense-check our data. For example, we can check that the row count of each selected column matches the total number of records in the DataFrame (no null or invalid rows), that the average and range of values for the age column match our expectations, and so on. Based on the values of the means and standard deviations, you can select certain data elements for deeper analysis. For example, assuming a normal distribution, the mean and standard deviation values for age suggest that most values of age lie in the range of 30 to 50 years; for other columns, the standard deviation values may be indicative of a skew in the data (where the standard deviation is greater than the mean).
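Under the same assumptions, the summary statistics for the Dataset defined above can be obtained as follows:

// count, mean, stddev, min, and max for the selected numeric columns
ds.describe("age", "duration", "campaign", "prevCalls").show()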

Further, we can use the stat package to compute additional statistics, such as the covariance and Pearson's correlation coefficient. The covariance indicates the joint variability of two random variables. As we are in the EDA phase, these measures can give us indicators of how one variable varies vis-a-vis another. For example, the sign of the covariance indicates the direction of variability between the two variables. In the following example, the negative covariance between age and the duration of the last contact indicates that the two move in opposite directions, that is, as age increases, the duration decreases. The correlation gives a magnitude for the strength of the relationship between these two variables.
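A sketch under the same assumptions about df:

// Covariance and Pearson correlation between age and last-contact duration
val covAgeDur = df.stat.cov("age", "duration")
val corrAgeDur = df.stat.corr("age", "duration")
println(s"Covariance: $covAgeDur, Pearson correlation: $corrAgeDur")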

We can create cross tabulations, or crosstabs, between two variables to evaluate the interrelationships between them. For example, in the following example, we create a crosstab between age and marital status, represented as a contingency table. From the table, we can see, for a given age, the breakdown of the total number of individuals across the various marital statuses. We can also extract the items that occur most frequently in the data columns of the DataFrame. Here, we choose education level as the column and specify a support level of 0.3, that is, we want to find education levels that occur with a frequency greater than 0.3 (observed 30% of the time, at a minimum) in the DataFrame. Lastly, we can also compute approximate quantiles of the numeric columns in the DataFrame. Here, we compute them for the age column with specified quantile probabilities of 0.25, 0.5, and 0.75 (a value of 0 is the minimum, 1 is the maximum, and 0.5 is the median).
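The three operations can be sketched as follows, again assuming the DataFrame df from earlier; the exact output depends on the data:

// Contingency table of age vs. marital status
val ageMarital = df.stat.crosstab("age", "marital")
ageMarital.show(5)

// Education levels occurring with at least 30% frequency
val freqEdu = df.stat.freqItems(Seq("education"), 0.3)
freqEdu.show()

// Approximate quantiles of age; a relative error of 0.0 gives exact quantiles
val quantiles = df.stat.approxQuantile("age", Array(0.25, 0.5, 0.75), 0.0)
println(quantiles.mkString(", "))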

Next, we use the typed aggregation functions to summarize our data to understand it better. In the following statement, we aggregate the results by whether a term deposit was subscribed, computing the total number of customers contacted, the average number of calls made per customer, the average duration of the calls, and the average number of previous calls made to such customers. The results are rounded to two decimal places:
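A minimal sketch using the typed aggregators in org.apache.spark.sql.expressions.scalalang.typed (available in the Spark 2.x line; deprecated in later versions), assuming the Call Dataset ds defined earlier. The round2 helper is our own addition for the two-decimal rounding:

import org.apache.spark.sql.expressions.scalalang.typed

// Hypothetical helper to round an average to two decimal places
def round2(d: Double): Double = math.round(d * 100) / 100.0

val byDeposit = ds.groupByKey(_.deposit)
  .agg(
    typed.count[Call](_.age),              // total customers contacted
    typed.avg[Call](_.campaign.toDouble),  // average calls per customer
    typed.avg[Call](_.duration.toDouble),  // average call duration
    typed.avg[Call](_.prevCalls.toDouble)) // average previous calls
  .map { case (deposit, cnt, avgCampaign, avgDuration, avgPrev) =>
    (deposit, cnt, round2(avgCampaign), round2(avgDuration), round2(avgPrev)) }

byDeposit.show()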

Similarly, executing the following statement gives the same summary grouped by the customer's age:
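Along the same lines, a sketch grouping by age instead of by the subscription outcome, reusing the round2 helper from above:

val byAge = ds.groupByKey(_.age)
  .agg(
    typed.count[Call](_.age),
    typed.avg[Call](_.campaign.toDouble),
    typed.avg[Call](_.duration.toDouble),
    typed.avg[Call](_.prevCalls.toDouble))
  .map { case (age, cnt, avgCampaign, avgDuration, avgPrev) =>
    (age, cnt, round2(avgCampaign), round2(avgDuration), round2(avgPrev)) }

byAge.orderBy("_1").show(5)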

Having gained a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers.