
Sampling with the RDD API

In this section, we use RDDs for creating stratified samples with and without replacement.

First, we create an RDD from our DataFrame:
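The original listing is not reproduced here, so the following is only a minimal sketch of this step. It assumes a local SparkSession and a tiny, hypothetical DataFrame with columns named recordType and value, which stand in for the chapter's real dataset. Each row is mapped to a (key, value) pair, because the by-key sampling methods operate on pair RDDs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and made-up data, used purely for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-from-dataframe-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical DataFrame; the column names are not taken from the book.
val df = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))
  .toDF("recordType", "value")

// Convert to an RDD of (key, value) pairs, keyed by record type.
val keyedRDD = df.rdd.map { row =>
  (row.getAs[String]("recordType"), row.getAs[Int]("value"))
}

keyedRDD.take(5).foreach(println)
```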

We can specify the fraction of each record type to include in our sample, as illustrated:
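As a rough sketch, the fractions can be expressed as a Map from key to sampling fraction. The keys and values below are hypothetical; in practice, the keys should match the keys of the pair RDD being sampled.

```scala
// Hypothetical record types and sampling fractions (not from the book).
val fractions: Map[String, Double] = Map(
  "A" -> 0.25, // keep about a quarter of the "A" records
  "B" -> 0.5,  // keep about half of the "B" records
  "C" -> 1.0   // keep all of the "C" records
)

// When sampling without replacement, each fraction should lie in [0, 1].
require(fractions.values.forall(f => f >= 0.0 && f <= 1.0))
```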

In the following illustration, we use the sampleByKey and sampleByKeyExact methods to create our samples. The former produces an approximate sample, while the latter produces an exact one. The first parameter specifies whether the sample is generated with or without replacement:
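The two calls might look like the following sketch. The data, the fractions, and the seed value are assumptions made for illustration. Both methods take the withReplacement flag first, then the per-key fractions, then an optional seed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and synthetic keyed data.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("stratified-sampling-sketch")
  .getOrCreate()

// Hypothetical population: 100 records of type "A" and 50 of type "B".
val keyedRDD = spark.sparkContext.parallelize(
  (1 to 100).map(i => ("A", i)) ++ (1 to 50).map(i => ("B", i))
)

val fractions = Map("A" -> 0.25, "B" -> 0.5)

// Approximate stratified sample, drawn without replacement.
val approxSample =
  keyedRDD.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)

// Exact stratified sample, drawn without replacement. This needs extra
// passes over the data, so it is the more expensive of the two.
val exactSample =
  keyedRDD.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)
```

Passing true as the first argument would instead sample with replacement, in which case a record can appear more than once in the result.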

Next, we print the total number of records in the population and in each of the samples. Notice that sampleByKeyExact returns the exact number of records implied by the specified fractions, whereas sampleByKey only approximates it:
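A sketch of this comparison follows, again on synthetic data with an arbitrary seed. According to the Spark documentation, the exact sample should contain math.ceil(fraction * count) records for each key, while the approximate sample's size can vary from run to run and from seed to seed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and synthetic keyed data.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-counts-sketch")
  .getOrCreate()

// Hypothetical population: 100 records of type "A" and 50 of type "B".
val keyedRDD = spark.sparkContext.parallelize(
  (1 to 100).map(i => ("A", i)) ++ (1 to 50).map(i => ("B", i))
)
val fractions = Map("A" -> 0.25, "B" -> 0.5)

val approxSample = keyedRDD.sampleByKey(false, fractions, 42L)
val exactSample = keyedRDD.sampleByKeyExact(false, fractions, 42L)

// Overall sizes of the population and of each sample.
println(s"Population:         ${keyedRDD.count()}")
println(s"Approximate sample: ${approxSample.count()}")
println(s"Exact sample:       ${exactSample.count()}")

// Per-key counts make the difference between the two methods visible.
println(s"Approximate, by key: ${approxSample.countByKey()}")
println(s"Exact, by key:       ${exactSample.countByKey()}")
```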

The sample method creates a random sample containing approximately the specified fraction of records; the size of the result is not guaranteed to match the fraction exactly. Next, we create a sample with replacement, containing roughly 10% of the total records:
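A minimal sketch of such a call is shown next. The population of 1,000 integers and the seed are assumptions. With replacement enabled, the fraction is the expected number of times each record is chosen, so the sample size is only approximately 10% of the population.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and a synthetic population of 1,000 integers.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("simple-sample-sketch")
  .getOrCreate()

val population = spark.sparkContext.parallelize(1 to 1000)

// withReplacement = true: the same record may be selected more than once.
val sampleWithReplacement =
  population.sample(withReplacement = true, fraction = 0.1, seed = 42L)

println(s"Population size: ${population.count()}")
println(s"Sample size:     ${sampleWithReplacement.count()}")
```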

Other statistical operations, such as hypothesis testing, random data generation, and visualizing probability distributions, will be covered in later chapters. In the next section, we will explore our data using Spark SQL to create pivot tables.