
Using Spark SQL for basic data analysis
Processing and visualizing large data interactively is challenging: queries can take a long time to execute, and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computation and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark can handle petabytes of data and provides a set of versatile programming interfaces and libraries, including SQL, Scala, Python, Java, and R APIs, as well as libraries for distributed statistics and machine learning.
For data that fits on a single machine, there are many good tools available, such as R and MATLAB. However, if the data does not fit on a single machine, if it is complicated to move the data to that machine, or if a single computer cannot easily process it, then this section offers some useful tools and techniques for data exploration.
In this section, we will go through some basic data exploration exercises to understand a sample Dataset. We will use a Dataset containing data related to the direct marketing campaigns of a Portuguese banking institution, which were conducted via phone calls to customers. We'll use the bank-additional-full.csv file, which contains 41,188 records and 20 input fields, ordered by date (from May 2008 to November 2010). The Dataset was contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
- As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use the :paste command to paste the initial set of statements into your Spark shell session (press Ctrl+D to exit paste mode), as shown:

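The following is a minimal sketch of what such a schema definition and read might look like. The column names and types are assumptions based on the standard layout of the UCI bank-additional-full.csv file (semicolon-delimited, with a header row, 20 input fields, and the target column y), and the file path is a placeholder to adjust for your environment:

import org.apache.spark.sql.types._

// Schema for the 20 input fields plus the target column "y".
// The file's dotted header names (e.g. emp.var.rate) are replaced
// here with underscore-separated names.
val recordSchema = new StructType()
  .add("age", IntegerType).add("job", StringType)
  .add("marital", StringType).add("education", StringType)
  .add("default", StringType).add("housing", StringType)
  .add("loan", StringType).add("contact", StringType)
  .add("month", StringType).add("day_of_week", StringType)
  .add("duration", IntegerType).add("campaign", IntegerType)
  .add("pdays", IntegerType).add("previous", IntegerType)
  .add("poutcome", StringType).add("emp_var_rate", DoubleType)
  .add("cons_price_idx", DoubleType).add("cons_conf_idx", DoubleType)
  .add("euribor3m", DoubleType).add("nr_employed", DoubleType)
  .add("y", StringType)

// Read the CSV with the explicit schema; this file uses a semicolon delimiter.
val df = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .schema(recordSchema)
  .csv("file:///path/to/bank-additional-full.csv")   // placeholder path
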
- After the DataFrame has been created, we first verify the number of records:

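A quick sanity check in the shell, assuming the df DataFrame created above; for the full file the count should be 41,188:

// Count the records read from the CSV file (expected: 41188 for bank-additional-full.csv)
df.count()
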
- We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows:

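A sketch of such a case class and the typed conversion is shown next; the field names and types mirror the schema assumed earlier. In the Spark shell the required implicits are already in scope, while a standalone application would also need import spark.implicits._:

// Case class whose fields mirror the assumed schema, so each row can be decoded into a Call
case class Call(age: Int, job: String, marital: String, education: String,
  default: String, housing: String, loan: String, contact: String,
  month: String, day_of_week: String, duration: Int, campaign: Int,
  pdays: Int, previous: Int, poutcome: String, emp_var_rate: Double,
  cons_price_idx: Double, cons_conf_idx: Double, euribor3m: Double,
  nr_employed: Double, y: String)

// Convert the untyped DataFrame into a strongly-typed Dataset[Call]
val ds = df.as[Call]
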
In the next section, we will begin our data exploration by identifying missing data in our Dataset.