
Visualizing data with Apache Zeppelin
Typically, we generate many graphs to verify our hunches about the data, and most of these quick-and-dirty graphs used during EDA are ultimately discarded. Exploratory data visualization is critical for data analysis and modeling, yet we often skip it with large data simply because it is hard. For instance, browsers typically cannot handle millions of data points. Hence, we have to summarize, sample, or model our data before we can effectively visualize it.
Traditionally, BI tools provided extensive aggregation and pivoting features to visualize data. However, these tools typically relied on nightly jobs to summarize large volumes of data; the summarized data was then downloaded and visualized on practitioners' workstations. Spark can eliminate many of these batch jobs and support interactive data visualization instead.
In this section, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin:
- You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command:
Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-all aurobindosarkar$
bin/zeppelin-daemon.sh start
- You should see the following message:
Zeppelin start [ OK ]
- You should be able to see the Zeppelin home page at http://localhost:8080/:

- Click on the Create new note link and specify a path and name for your notebook, as shown:
- In the next step, we paste the same code used at the beginning of this chapter to create a DataFrame for our sample Dataset:
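That code is the listing introduced earlier in the chapter. As a minimal, hypothetical sketch (the file path, delimiter, and column names are assumptions, not the chapter's actual values), a Zeppelin paragraph that loads a delimited file into a DataFrame could look like this:

%spark
// Hypothetical example: load a delimited file into a DataFrame.
// Adjust the path, delimiter, and options to match the chapter's sample Dataset.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .csv("/path/to/sample.csv")
df.printSchema()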

- We can execute typical DataFrame operations, as follows:
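For instance (a representative sketch; the age column is an assumption carried through the remaining examples), the following paragraph inspects the schema and computes a few simple summaries:

%spark
// Inspect the schema and a few rows, then compute simple summaries.
df.printSchema()
df.show(5)
df.describe("age").show()
df.groupBy("age").count().orderBy("age").show(10)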

- Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements can be charted by clicking on the required chart type. Here, we create bar charts as an illustrative example of summarizing and visualizing data:
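As a hedged sketch (the view name sample_df and the age column are illustrative), we first register the DataFrame as a temporary view from the Spark interpreter and then query it from a %sql paragraph; Zeppelin displays the result with chart-type buttons (table, bar chart, pie chart, and so on) above it:

%spark
// Expose the DataFrame to the %sql interpreter as a temporary view.
df.createOrReplaceTempView("sample_df")

%sql
-- Count rows per age; click the bar chart icon above the result to plot it.
SELECT age, count(*) AS cnt
FROM sample_df
GROUP BY age
ORDER BY age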

- We can create a scatter plot, as shown in the following figure:
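Any query that returns two numeric columns can drive the scatter plot; choose the scatter chart type and assign one column to each axis (the balance column below is an assumption):

%sql
-- Hypothetical numeric columns: map one to the x axis and the other to the y axis.
SELECT age, balance
FROM sample_df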

You can also read off the coordinate values of each plotted point:

- Additionally, we can create a textbox that accepts input values to make the experience interactive. In the following figure, we create a textbox that accepts different values for the age parameter, and the bar chart is updated accordingly:
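This relies on Zeppelin's dynamic forms: embedding ${formName=defaultValue} in a %sql paragraph renders a text box and substitutes its value into the query. A sketch with an assumed form name and default follows:

%sql
-- ${maxAge=30} renders a text box with a default value of 30; changing the
-- value re-runs the query and refreshes the bar chart.
SELECT age, count(*) AS cnt
FROM sample_df
WHERE age <= ${maxAge=30}
GROUP BY age
ORDER BY age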

- Similarly, we can also create drop-down lists where the user can select the appropriate option:
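The drop-down variant uses Zeppelin's select form syntax, ${name=default,option1|option2|...}; the column and options below are illustrative assumptions:

%sql
-- Renders a drop-down list; the selected value is substituted into the WHERE clause.
SELECT age, count(*) AS cnt
FROM sample_df
WHERE marital = '${maritalStatus=married,married|single|divorced}'
GROUP BY age
ORDER BY age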

The table of values or the chart then gets updated automatically:

We will explore more advanced visualizations using Spark SQL and SparkR in Chapter 8, Using Spark SQL with SparkR. In the next section, we will explore the methods used to generate samples from our data.