Loading click logs

To train a model on massive click logs, we first need to load the data into Spark. We do so by taking the following steps:

First, we spin up the PySpark shell by using the following command:

./bin/pyspark --master local[*] --driver-memory 20G

Here, we specify a large driver memory as we are dealing with a dataset of more than 6 GB.

Next, we start a Spark session with an application named CTR:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("CTR") \
...     .getOrCreate()

Then, we load the click log data from the train file into a DataFrame object. Note that the data loading function, spark.read.csv, accepts a custom schema, which guarantees that the data is loaded with the expected types. Without one, every column is read as a string by default, or the types are guessed from the data if inferSchema=True is set. So first, we define the schema:

>>> from pyspark.sql.types import StructField, StringType, \
...     StructType, IntegerType
>>> schema = StructType([
...     StructField("id", StringType(), True),
...     StructField("click", IntegerType(), True),
...     StructField("hour", IntegerType(), True),
...     StructField("C1", StringType(), True),
...     StructField("banner_pos", StringType(), True),
...     StructField("site_id", StringType(), True),
...     StructField("site_domain", StringType(), True),
...     StructField("site_category", StringType(), True),
...     StructField("app_id", StringType(), True),
...     StructField("app_domain", StringType(), True),
...     StructField("app_category", StringType(), True),
...     StructField("device_id", StringType(), True),
...     StructField("device_ip", StringType(), True),
...     StructField("device_model", StringType(), True),
...     StructField("device_type", StringType(), True),
...     StructField("device_conn_type", StringType(), True),
...     StructField("C14", StringType(), True),
...     StructField("C15", StringType(), True),
...     StructField("C16", StringType(), True),
...     StructField("C17", StringType(), True),
...     StructField("C18", StringType(), True),
...     StructField("C19", StringType(), True),
...     StructField("C20", StringType(), True),
...     StructField("C21", StringType(), True),
... ])

Each field of the schema contains the name of the column (such as id, click, and hour), the data type (such as integer and string), and whether missing values are allowed (they are allowed for every column in this case).
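As a more compact alternative, the same schema can be written as a DDL-formatted string, which spark.read.csv also accepts for its schema argument. The following is a minimal sketch rather than part of the original steps: the helper name build_ddl_schema and the INTEGER_COLUMNS set are made up for illustration, and the column names are copied from the schema above.

```python
# Column names in file order, copied from the schema definition above.
COLUMNS = [
    "id", "click", "hour", "C1", "banner_pos", "site_id", "site_domain",
    "site_category", "app_id", "app_domain", "app_category", "device_id",
    "device_ip", "device_model", "device_type", "device_conn_type",
    "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21",
]

# Only these columns are integers; every other column is a string.
INTEGER_COLUMNS = {"click", "hour"}


def build_ddl_schema(columns, integer_columns):
    """Return a DDL schema string such as 'id STRING, click INT'."""
    parts = []
    for name in columns:
        type_name = "INT" if name in integer_columns else "STRING"
        parts.append(f"{name} {type_name}")
    return ", ".join(parts)


ddl_schema = build_ddl_schema(COLUMNS, INTEGER_COLUMNS)
print(ddl_schema)
```

Passing schema=ddl_schema to spark.read.csv should give the same column names and types as the StructType version, at the cost of less explicit control over each field.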

With the defined schema, we create a DataFrame object:

>>> df = spark.read.csv("file://path_to_file/train", schema=schema,
...                     header=True)

Remember to replace path_to_file with the absolute path of where the train data file is located. The file:// prefix indicates that the data is read from a local file. Another prefix, hdfs://, is used for data stored in HDFS.
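Rather than typing the file:// path by hand, it can be derived from a local path with the Python standard library. This is a minimal sketch: to_file_uri is a hypothetical helper name, and "train" is only a placeholder for wherever the file actually lives.

```python
from pathlib import Path


def to_file_uri(local_path):
    """Convert a local path into an absolute file:// URI."""
    # resolve() makes the path absolute; as_uri() adds the file:// prefix
    # and percent-encodes characters that are not URI-safe.
    return Path(local_path).resolve().as_uri()


train_uri = to_file_uri("train")  # "train" is a placeholder file name
print(train_uri)
```

The resulting string can then be passed as the first argument of spark.read.csv in place of the hand-written path.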

We will now double-check the schema as follows:

>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- click: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- C1: string (nullable = true)
 |-- banner_pos: string (nullable = true)
 |-- site_id: string (nullable = true)
 |-- site_domain: string (nullable = true)
 |-- site_category: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- app_domain: string (nullable = true)
 |-- app_category: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_ip: string (nullable = true)
 |-- device_model: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- device_conn_type: string (nullable = true)
 |-- C14: string (nullable = true)
 |-- C15: string (nullable = true)
 |-- C16: string (nullable = true)
 |-- C17: string (nullable = true)
 |-- C18: string (nullable = true)
 |-- C19: string (nullable = true)
 |-- C20: string (nullable = true)
 |-- C21: string (nullable = true)

We also check the number of rows as follows:

>>> df.count()
40428967
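One way to cross-check this number without Spark is to count the lines of the raw file and subtract the header. The sketch below assumes one record per line (no quoted fields containing line breaks). count_data_rows is a hypothetical helper, and the tiny temporary file with made-up values merely stands in for the real train file.

```python
import os
import tempfile


def count_data_rows(path, has_header=True):
    """Count the lines in a text file, excluding the header if present."""
    with open(path, "rb") as f:
        # Iterating over a file yields one line at a time, so the whole
        # file never has to fit in memory.
        total = sum(1 for _ in f)
    return max(total - 1, 0) if has_header else total


# Demonstrate on a stand-in file: one header line and three made-up records.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,click,hour\n1,0,100\n2,1,101\n3,0,102\n")
    sample_path = tmp.name

sample_rows = count_data_rows(sample_path)
os.remove(sample_path)
print(sample_rows)
```

Run against the full train file, the result should agree with df.count() as long as the file was loaded with header=True.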

Also, we need to drop several columns that provide little information. We will use the following code to do that:

>>> df = df.drop('id').drop('hour').drop('device_id').drop('device_ip')

We rename the click column to label, as it will be referred to frequently in the downstream operations:

>>> df = df.withColumnRenamed("click", "label")

Let's look at the current columns in the DataFrame object:

>>> df.columns
['label', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
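The same result can be reasoned about without Spark. The following sketch mirrors the drop and rename steps on a plain Python list; the variable names are illustrative only, and the column names are copied from the schema defined earlier.

```python
# Column names in file order, copied from the schema defined earlier.
all_columns = [
    "id", "click", "hour", "C1", "banner_pos", "site_id", "site_domain",
    "site_category", "app_id", "app_domain", "app_category", "device_id",
    "device_ip", "device_model", "device_type", "device_conn_type",
    "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21",
]

dropped = {"id", "hour", "device_id", "device_ip"}  # columns removed with drop()
renamed = {"click": "label"}                        # old name -> new name

# Keep the original order, skip dropped columns, and apply the rename.
expected_columns = [
    renamed.get(name, name) for name in all_columns if name not in dropped
]
print(len(expected_columns), expected_columns[:3])
```

Comparing expected_columns with df.columns is a cheap check to run before moving on to the downstream steps.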
