Learning Spark SQL

Processing multiple input data files

In the next few steps, we initialize a set of variables defining the directory that contains the input files, along with an empty RDD. We also create a list of the file names in the input HDFS directory. In the following example, we work with the files contained in a single directory; however, the technique can easily be extended across all 20 newsgroup sub-directories.
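A minimal sketch of this setup, assuming a spark-shell session (so sc, the SparkContext, is already in scope) and an illustrative HDFS path that you should replace with your own, could look like this:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Directory containing the input files (placeholder path; adjust to your environment).
val inputDir = "hdfs://localhost:9000/20_newsgroups/alt.atheism"

// Empty RDD that will eventually hold the (word, count) rows from all files.
var allRows: RDD[(String, Int)] = sc.emptyRDD[(String, Int)]

// List of file names in the input HDFS directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
val fileNames = fs.listStatus(new Path(inputDir)).map(_.getPath.toString)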

Next, we write a function to compute the word counts for each file and collect the results in an ArrayBuffer:
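The listing below is a minimal sketch of such a function, building on the variables from the previous step; the helper name computeWordCounts and the simple non-word-character tokenization are illustrative choices rather than fixed requirements:

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Compute (word, count) pairs for a single input file.
def computeWordCounts(filePath: String): RDD[(String, Int)] = {
  sc.textFile(filePath)
    .flatMap(_.split("""\W+"""))        // split lines on non-word characters
    .filter(_.nonEmpty)                 // drop empty tokens
    .map(word => (word.toLowerCase, 1))
    .reduceByKey(_ + _)
}

// ArrayBuffer that collects the per-file word count RDDs.
val wordCountRDDs = ArrayBuffer[RDD[(String, Int)]]()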

We have included a print statement to display the file names as they are picked up for processing, as follows:
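For example, a driver-side loop along these lines prints each file name and collects the per-file results (again, a sketch built on the names introduced above):

for (fileName <- fileNames) {
  println("Processing file: " + fileName)   // show the file being picked up
  wordCountRDDs += computeWordCounts(fileName)
}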

We add the rows into a single RDD using the union operation:
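With the per-file RDDs collected, a single call along the following lines combines them (the variable name allWordCounts is illustrative):

// Combine all per-file RDDs into a single RDD in one operation.
val allWordCounts = sc.union(wordCountRDDs)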

We could have directly executed the union step as each file is processed, as follows:
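That alternative would look roughly like this, reusing the empty RDD declared earlier as an accumulator:

// Union each file's word counts into the accumulator as the file is processed.
for (fileName <- fileNames) {
  println("Processing file: " + fileName)
  allRows = allRows.union(computeWordCounts(fileName))
}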

However, using RDD.union() creates a new step in the lineage graph, requiring an extra set of stack frames for each new RDD. This can easily lead to a StackOverflowError. Instead, we use SparkContext.union(), which executes the union operation all at once without the extra memory overhead.

We can cache and print sample rows from our output RDD as follows:
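For instance, using the illustrative names from above:

// Cache the combined RDD and print a few sample (word, count) pairs.
allWordCounts.cache()
allWordCounts.take(10).foreach(println)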

In the next section, we show you ways of filtering out stop words. For simplicity, we focus only on well-formed words in the text. However, you can easily add conditions to filter out special characters and other anomalies in the data using String functions and regexes (for a detailed example, refer to Chapter 9, Developing Applications with Spark SQL).
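As a simple illustration of the latter, a regex-based filter such as the following sketch (the pattern keeps only lowercase alphabetic tokens, matching the lowercasing done earlier) can be applied to the combined RDD:

// Keep only tokens made up entirely of alphabetic characters.
val wellFormedWords = allWordCounts.filter { case (word, _) => word.matches("[a-z]+") }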