Filters

DataFrame also supports Filters, which can be used to quickly filter the DataFrame rows to generate new DataFrames. The Filters enable very important transformations of the data to narrow down the DataFrame to our use case. For example, if all you want is to analyze the state of California, then using filter API performs the elimination of non-matching rows on every partition of data, thus improving the performance of the operations.

Let's look at the execution plan for the filtering of the DataFrame to only consider the state of California.

scala> statesDF.filter("State == 'California'").explain(true)

== Parsed Logical Plan ==
'Filter ('State = California)
+- Relation[State#0,Year#1,Population#2] csv

== Analyzed Logical Plan ==
State: string, Year: int, Population: int
Filter (State#0 = California)
+- Relation[State#0,Year#1,Population#2] csv

== Optimized Logical Plan ==
Filter (isnotnull(State#0) && (State#0 = California))
+- Relation[State#0,Year#1,Population#2] csv

== Physical Plan ==
*Project [State#0, Year#1, Population#2]
+- *Filter (isnotnull(State#0) && (State#0 = California))
+- *FileScan csv [State#0,Year#1,Population#2] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/salla/states.csv], PartitionFilters: [], PushedFilters: [IsNotNull(State), EqualTo(State,California)], ReadSchema: struct<State:string,Year:int,Population:int>

Now that we can seen the execution plan, let's now execute the filter command, as follows:

scala> statesDF.filter("State == 'California'").show
+----------+----+----------+
| State|Year|Population|
+----------+----+----------+
|California|2010| 37332685|
|California|2011| 37676861|
|California|2012| 38011074|
|California|2013| 38335203|
|California|2014| 38680810|
|California|2015| 38993940|
|California|2016| 39250017|
+----------+----+----------+
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.254.133