Learning Spark SQL

Identifying missing data

Missing data can occur in Datasets for reasons ranging from negligence to respondents refusing to provide a specific data point. Whatever the cause, missing data is a common occurrence in real-world Datasets. It can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies to deal with it.

In this section, we count the records with missing data fields in our sample Dataset. To simulate missing data, we edit our sample Dataset by replacing fields containing "unknown" values with empty strings.
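
As a minimal sketch of this edit, the replacement can be expressed in plain Scala and pasted into the Scala REPL or spark-shell. The sample rows, the column names, and the semicolon delimiter are assumptions made for illustration; they are not taken from the actual file.

```scala
// Hypothetical sample rows in a semicolon-delimited layout (assumed, not the real file)
val rawLines = Seq(
  "age;job;marital;education",
  "56;housemaid;married;basic.4y",
  "57;services;married;unknown",
  "41;unknown;unknown;unknown"
)

// Blank out a field only when its entire value is "unknown"
val editedLines = rawLines.map { line =>
  line.split(";", -1)                        // limit -1 keeps trailing empty fields
    .map(field => if (field == "unknown") "" else field)
    .mkString(";")
}

editedLines.foreach(println)
```

Matching on the whole field, rather than replacing the substring anywhere in the line, avoids touching values that merely contain the word "unknown".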

First, we create a DataFrame/Dataset from our edited file.
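
One possible way to do this is sketched here, assuming a semicolon-delimited file with a header row. The temporary file, its contents, and the column names are hypothetical stand-ins for the edited file, written out only so that the snippet runs on its own.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply reuses it
val spark = SparkSession.builder()
  .appName("missing-data-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical stand-in for the edited file: a few rows with empty fields
val path = Files.createTempFile("edited-sample", ".csv")
Files.write(path,
  ("age;job;marital;education\n" +
   "56;housemaid;married;basic.4y\n" +
   "57;services;married;\n" +
   "41;;;\n").getBytes("UTF-8"))

// Read the file into a DataFrame, taking column names from the header row
val df = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("inferSchema", "true")
  .csv(path.toUri.toString)

df.printSchema()
df.show()
```

With the default reader options, unquoted empty fields are typically loaded as null rather than as empty strings, so it is worth checking how the missing values appear in `df.show()` before counting them.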

Next, two statements give us a count of the rows in which certain fields have missing data.
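
The counting step might look like the following sketch. It uses a small in-memory DataFrame so that it runs on its own; the column names, the sample values, and the `isMissing` helper are hypothetical, and a field is treated as missing when it is either null or an empty string.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

// In spark-shell a session already exists; getOrCreate() simply reuses it
val spark = SparkSession.builder()
  .appName("missing-data-counts")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical rows: None becomes null, Some("") is an empty string
val sample = Seq[(Int, Option[String], Option[String], Option[String])](
  (56, Some("housemaid"), Some("married"), Some("basic.4y")),
  (57, Some("services"), Some("married"), Some("")),
  (41, None, Some(""), None)
).toDF("age", "job", "marital", "education")

// Hypothetical helper: a field is missing if it is null or blank
def isMissing(name: String): Column =
  col(name).isNull || trim(col(name)) === ""

// Two counts: rows missing one field, and rows missing either of two fields
val missingEducation = sample.filter(isMissing("education")).count()
val missingJobOrMarital = sample.filter(isMissing("job") || isMissing("marital")).count()

println(s"Rows with missing education: $missingEducation")
println(s"Rows with missing job or marital: $missingJobOrMarital")
```

Checking for both null and the empty string keeps the counts stable regardless of how the reader represented the blank fields when the file was loaded.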

In Chapter 4, Using Spark SQL for Data Munging, we will look at effective ways of dealing with missing data. In the next section, we will compute some basic statistics for our sample Dataset to improve our understanding of the data.