Splitting and caching the data

Here, we split the data into a training and testing set, as follows:

>>> df_train, df_test = df.randomSplit([0.7, 0.3], 42)

Here, 70% of the samples are used for training and the remaining 30% for testing, with a random seed specified, as always, for reproducibility.

Before we perform any heavy lifting (such as model learning) on the training set, df_train, it is good practice to cache the object. In Spark, caching and persistence are optimization techniques that reduce computation overhead by saving the intermediate results of RDD or DataFrame operations in memory and/or on disk. Without caching or persistence, whenever an intermediate DataFrame is needed, it is recomputed from scratch according to how it was originally derived. Depending on the storage level, persistence behaves differently (a sketch of setting the level explicitly follows the list):

  • MEMORY_ONLY: The object is stored only in memory. If it does not fit in memory, the partitions that do not fit are recomputed each time they are needed.
  • DISK_ONLY: The object is kept only on disk. Persisted partitions are read directly from disk without being recalculated.
  • MEMORY_AND_DISK: The object is stored in memory, and possibly on disk as well. If the full object does not fit in memory, the partitions that do not fit are stored on disk, instead of being recalculated every time they are needed. This is the default storage level for DataFrame caching and persistence in Spark. It takes advantage of both the fast retrieval of in-memory storage and the high accessibility and capacity of disk storage.

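The storage level can be chosen explicitly by calling persist instead of cache. As a minimal sketch (not run here, since we will stick with the default level via the cache method below), persisting df_train on disk only would look as follows:

>>> from pyspark import StorageLevel
>>> df_train.persist(StorageLevel.DISK_ONLY)
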
In PySpark, caching is simple. All that is required is a call to the cache method.

Let's cache both the training and testing DataFrames. Note that cache is lazy, so we also call count, an action, to actually materialize the cache:

>>> df_train.cache()
DataFrame[label: int, C1: string, banner_pos: string, site_id: string, site_domain: string, site_category: string, app_id: string, app_domain: string, app_category: string, device_model: string, device_type: string, device_conn_type: string, C14: string, C15: string, C16: string, C17: string, C18: string, C19: string, C20: string, C21: string]
>>> df_train.count()
28297027
>>> df_test.cache()
DataFrame[label: int, C1: string, banner_pos: string, site_id: string, site_domain: string, site_category: string, app_id: string, app_domain: string, app_category: string, device_model: string, device_type: string, device_conn_type: string, C14: string, C15: string, C16: string, C17: string, C18: string, C19: string, C20: string, C21: string]
>>> df_test.count()
12131940

Now, we have the training and testing data ready for downstream analysis.
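Once the downstream analysis is complete, the cached data can be released with unpersist to free up memory and disk, as in the following sketch:

>>> df_train.unpersist()
>>> df_test.unpersist()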
