Learning Spark SQL

Munging textual data

In this section, we explore data munging techniques for typical text analysis situations. Many text analysis tasks require computing word counts, removing stop words, stemming, and so on. We also explore how to process multiple files, one at a time, from HDFS directories.

First, we import all the classes that will be used in this section: