Learning Spark SQL

Pre-processing of the household electric consumption Dataset

Create a case class called HouseholdEPC to represent one record of household electric power consumption.
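A minimal sketch of such a case class follows. It assumes the input is the UCI "Individual household electric power consumption" file, which has nine semicolon-separated columns; the short field names and the sample values are illustrative assumptions, not taken from the text above.

```scala
// Assumed schema: one field per column of the household power consumption file.
// Field names are hypothetical abbreviations of the column names.
case class HouseholdEPC(
  date: String,    // Date
  time: String,    // Time
  gap: Double,     // Global_active_power
  grp: Double,     // Global_reactive_power
  voltage: Double, // Voltage
  gi: Double,      // Global_intensity
  sm1: Double,     // Sub_metering_1
  sm2: Double,     // Sub_metering_2
  sm3: Double      // Sub_metering_3
)

// Illustrative record built by hand to check that the fields line up.
val sample = HouseholdEPC("16/12/2006", "17:24:00", 4.216, 0.418, 234.84, 18.4, 0.0, 1.0, 17.0)
println(sample.voltage)
```

Keeping the date and time as strings defers any timestamp parsing to a later step; the numeric readings are typed as Double so they can be aggregated directly.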

Read the input Dataset into an RDD and count the number of rows in it.
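The read-and-count step might look like the sketch below. So that it runs without the real file, it first writes a tiny stand-in file to a temporary location; the file contents, the variable names, and the local master setting are all assumptions for illustration. With the real Dataset you would pass its path to textFile instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("HouseholdEPC")
  .master("local[*]")
  .getOrCreate()

// Stand-in input (assumed layout): a header line followed by three data rows.
val sampleText = Seq(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000",
  "28/4/2007;00:21:00;?;?;?;?;?;?;"
).mkString("\n")
val path = Files.createTempFile("household_power_sample", ".txt")
Files.write(path, sampleText.getBytes(StandardCharsets.UTF_8))

// Each line of the file becomes one String element of the RDD.
val hhEPCRdd = spark.sparkContext.textFile(path.toUri.toString)
val rowCount = hhEPCRdd.count()
println(rowCount)
```

Note that the count at this stage still includes the header line and any rows with missing values.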

Next, remove the header and all rows containing missing values (represented as ? characters in the input).
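One way to sketch this cleanup is with two filters: one that drops the header line and one that drops any line containing a question mark. The in-memory sample lines and variable names below are assumptions standing in for the RDD read from the real file.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HouseholdEPC")
  .master("local[*]")
  .getOrCreate()

// Made-up stand-in for the raw RDD[String]: header, two good rows, one row with gaps.
val raw = spark.sparkContext.parallelize(Seq(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "28/4/2007;00:21:00;?;?;?;?;?;?;",
  "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000"
))

// The header is the first line; drop every line equal to it.
val header = raw.first()
val data = raw.filter(line => line != header)

// Drop rows in which any field is the missing-value marker "?".
val cleaned = data.filter(line => !line.contains("?"))
println(cleaned.count())
```

Filtering on the whole line is enough here because a question mark never appears in a valid numeric, date, or time field.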

In the next step, convert the RDD[String] to an RDD of the case class we defined earlier, and then convert that RDD to a DataFrame of HouseholdEPC objects.
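The conversion can be sketched as a split on the field separator followed by toDF(). The semicolon separator, the field names, and the sample rows are assumptions; the case class is repeated here only so that the snippet stands on its own.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical field names; see the column comments for the assumed mapping.
case class HouseholdEPC(date: String, time: String, gap: Double, grp: Double,
  voltage: Double, gi: Double, sm1: Double, sm2: Double, sm3: Double)

val spark = SparkSession.builder()
  .appName("HouseholdEPC")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // brings toDF() into scope for RDDs of case classes

// Made-up stand-in for the cleaned RDD[String] (no header, no "?" rows).
val cleaned = spark.sparkContext.parallelize(Seq(
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000"
))

// Split each line into its nine fields and build one HouseholdEPC per row.
val hhEPCClassRdd = cleaned.map(_.split(";")).map { p =>
  HouseholdEPC(p(0).trim, p(1).trim, p(2).toDouble, p(3).toDouble,
    p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble)
}

// Column names and types are inferred from the case class by reflection.
val hhEPCDF = hhEPCClassRdd.toDF()
hhEPCDF.printSchema()
```

The toDouble calls would throw on a "?" field, which is why the rows with missing values need to be removed before this step.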

Display a few sample records in the DataFrame, and count its rows to verify that the total matches the number of rows you expect from your input Dataset.
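A short sketch of this check: show the first few records, then compare the DataFrame's row count with the count of the RDD it was built from. The two hand-built records and all names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical field names, repeated so the snippet is self-contained.
case class HouseholdEPC(date: String, time: String, gap: Double, grp: Double,
  voltage: Double, gi: Double, sm1: Double, sm2: Double, sm3: Double)

val spark = SparkSession.builder()
  .appName("HouseholdEPC")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up stand-in for the RDD of parsed records.
val hhEPCClassRdd = spark.sparkContext.parallelize(Seq(
  HouseholdEPC("16/12/2006", "17:24:00", 4.216, 0.418, 234.84, 18.4, 0.0, 1.0, 17.0),
  HouseholdEPC("16/12/2006", "17:25:00", 5.36, 0.436, 233.63, 23.0, 0.0, 1.0, 16.0)
))
val hhEPCDF = hhEPCClassRdd.toDF()

// Print up to five records in tabular form.
hhEPCDF.show(5)

// The DataFrame should hold exactly as many rows as the cleaned RDD.
val expectedRows = hhEPCClassRdd.count()
val actualRows = hhEPCDF.count()
println(s"expected=$expectedRows actual=$actualRows")
```

On the full Dataset, the expected figure is the raw line count minus the header and minus the rows that contained missing values.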