Learning Spark SQL

Executing other miscellaneous processing steps

If required, we can execute a few additional steps to cleanse the data further, study more aggregations, or convert the data to a type-safe structure, and so on.

We can drop the time column and aggregate each day's readings using aggregation functions such as sum and average. Here, we rename the aggregated columns with a d prefix to indicate daily values.
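A minimal sketch of this aggregation step, assuming the cleansed DataFrame is named finalDf and contains date, time, and numeric reading columns such as Global_active_power and Global_reactive_power (the input DataFrame and column names are assumptions based on the household electric consumption dataset):

```scala
import org.apache.spark.sql.functions.{sum, avg}

// Drop the time column and aggregate the readings per day.
// Aggregated columns are renamed with a "d" prefix for daily values.
val finalDayDf1 = finalDf
  .drop("time")
  .groupBy("date")
  .agg(
    sum("Global_active_power").alias("d_active_power_sum"),
    avg("Global_active_power").alias("d_active_power_avg"),
    sum("Global_reactive_power").alias("d_reactive_power_sum"),
    avg("Global_reactive_power").alias("d_reactive_power_avg")
  )
```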

We display a few sample records from this DataFrame:

scala> finalDayDf1.show(5)

Here, we group the readings by year and month, then count the number of readings and display the counts for each month. The count for the first month is low because data was captured for only half of that month.
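This monthly count could be sketched as follows, assuming a date column from which the year and month can be extracted with Spark SQL's built-in year() and month() functions:

```scala
import org.apache.spark.sql.functions.{year, month, count, col}

// Group the daily readings by year and month and count them.
finalDayDf1
  .groupBy(
    year(col("date")).alias("year"),
    month(col("date")).alias("month"))
  .agg(count("*").alias("numReadings"))
  .orderBy("year", "month")
  .show()
```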

We can also convert our DataFrame to a Dataset using a case class, as follows:
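A sketch of that conversion, using a hypothetical case class whose field names and types mirror the daily DataFrame's columns (the field names shown are assumptions; they must match the actual column names exactly for the as[] conversion to succeed):

```scala
// Hypothetical case class matching the daily aggregated columns.
case class DailyReading(
  date: java.sql.Date,
  d_active_power_sum: Double,
  d_active_power_avg: Double
)

// The implicit encoders are needed for the typed conversion.
import spark.implicits._

val finalDayDs = finalDayDf1.as[DailyReading]
```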

At this stage, we have completed all the steps for pre-processing the household electric consumption Dataset. We now shift our focus to processing the weather Dataset.