takeSample

The takeSample method fetches a random sample of elements from an RDD and returns it to the Driver program as a list. One variant of takeSample() takes two parameters. The first, withReplacement, is a Boolean value: when it is true, an element can be picked more than once, so the same value may appear several times in the sample; when it is false, each element of the RDD is picked at most once. The second parameter is the number of elements to sample:

//takeSample()
JavaRDD<Integer> intRDD = sparkContext.parallelize(Arrays.asList(1, 4, 3));
// withReplacement = true, sample size = 4
List<Integer> takeSampleTrueList = intRDD.takeSample(true, 4);
for (Integer intVal : takeSampleTrueList) {
    System.out.println("The take sample vals for true are : " + intVal);
}

The output has four elements even though the RDD contains only three. This happens because withReplacement is set to true, so the entire RDD is considered for every element that is drawn, and the same value can be picked more than once. Now let's try the same example with withReplacement set to false:

//takeSample()
JavaRDD<Integer> intRDD = sparkContext.parallelize(Arrays.asList(1, 4, 3));
// withReplacement = false, sample size = 4
List<Integer> takeSampleFalseList = intRDD.takeSample(false, 4);
for (Integer intVal : takeSampleFalseList) {
    System.out.println("The take sample vals for false are : " + intVal);
}

In this scenario, the list contains only three elements even though the sample size was set to four. When withReplacement is false, no element of the RDD is picked more than once, so the sample can be smaller than the size requested.
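The difference between the two modes can be sketched in plain Java without a Spark cluster. The class and the sample helper below are hypothetical names introduced only for illustration; this is not how Spark implements takeSample(), just a minimal sketch of the behavior described above, assuming the data fits in an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java illustration of the withReplacement semantics.
// This is NOT Spark's implementation; all names here are hypothetical.
final class ReplacementSketch {

    // Draws up to num elements from data using the supplied random generator.
    static <T> List<T> sample(List<T> data, boolean withReplacement, int num, Random rng) {
        List<T> result = new ArrayList<>();
        if (data.isEmpty() || num <= 0) {
            return result;
        }
        if (withReplacement) {
            // Every draw considers the whole dataset, so values may repeat
            // and the result always has exactly num elements.
            for (int i = 0; i < num; i++) {
                result.add(data.get(rng.nextInt(data.size())));
            }
        } else {
            // Each element can be used only once, so the result is capped
            // at data.size() even when num is larger.
            List<T> shuffled = new ArrayList<>(data);
            Collections.shuffle(shuffled, rng);
            result.addAll(shuffled.subList(0, Math.min(num, shuffled.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 4, 3);
        Random rng = new Random();
        // Four elements, possibly with repeated values.
        System.out.println("with replacement:    " + sample(data, true, 4, rng));
        // Only three elements, each value appearing once.
        System.out.println("without replacement: " + sample(data, false, 4, rng));
    }
}
```

With the three-element list used in this section, the first call returns four values and the second returns three, which mirrors the sizes reported for the two takeSample() calls.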

Another variant of the takeSample() method adds a third parameter called seed. When the seed is kept the same, sampling an RDD that holds the same dataset returns the same list of values no matter how many times it is run; that is, multiple executions with the same seed value return the same result. This is very useful for reproducing test scenarios, because the sampled data stays the same:

//takeSample()
JavaRDD<Integer> intRDD = sparkContext.parallelize(Arrays.asList(1, 4, 3));
// withReplacement = true, sample size = 4, seed = 9
List<Integer> takeSampleTrueSeededList = intRDD.takeSample(true, 4, 9);
for (Integer intVal : takeSampleTrueSeededList) {
    System.out.println("The take sample vals with seed are : " + intVal);
}
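Why a fixed seed gives repeatable samples can be shown with java.util.Random alone. The sketch below is an illustration under stated assumptions, not Spark's sampling code: the class and the sampleWithSeed helper are hypothetical, and the helper simply draws with replacement from an in-memory list using a generator created from the seed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Plain-Java illustration of seeded sampling.
// This is NOT Spark's implementation; all names here are hypothetical.
final class SeedSketch {

    // Draws num elements with replacement, using a generator built from seed.
    static List<Integer> sampleWithSeed(List<Integer> data, int num, long seed) {
        List<Integer> result = new ArrayList<>();
        if (data.isEmpty()) {
            return result;
        }
        // A generator created from the same seed produces the same sequence,
        // so the same indexes are picked on every run.
        Random rng = new Random(seed);
        for (int i = 0; i < num; i++) {
            result.add(data.get(rng.nextInt(data.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 4, 3);
        List<Integer> first = sampleWithSeed(data, 4, 9L);
        List<Integer> second = sampleWithSeed(data, 4, 9L);
        // Prints "same seed, same sample: true".
        System.out.println("same seed, same sample: " + first.equals(second));
    }
}
```

Calling the helper twice with the same list and seed yields equal lists, which is the property that makes seeded sampling convenient for repeatable tests.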