Combining multiple variables – feature interaction

Among all the features of the click log data, some are very weak signals in themselves. For example, gender itself doesn't tell you much about whether someone will click an ad, and the device model itself doesn't provide much information either. However, by combining multiple features, we can create a stronger synthesized signal. Feature interaction is introduced for this purpose. For numerical features, it usually generates new features by multiplying two or more of them together. We can also define whatever combination rules we want. For example, we can generate an additional feature, income/person, by dividing one original feature, household income, by another, household size.
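As a minimal sketch of the income/person rule in plain Python, consider the following. The records, the column names, and the helper add_income_per_person are made up for illustration and are not part of the click log dataset:

```python
# Toy household records; names and values are hypothetical.
households = [
    {"household_income": 300, "household_size": 2},
    {"household_income": 100, "household_size": 1},
    {"household_income": 400, "household_size": 4},
    {"household_income": 300, "household_size": 5},
]


def add_income_per_person(rows):
    """Return copies of the rows with an extra 'income_per_person' feature."""
    enriched_rows = []
    for row in rows:
        new_row = dict(row)  # copy, so the original records stay untouched
        new_row["income_per_person"] = (
            row["household_income"] / row["household_size"]
        )
        enriched_rows.append(new_row)
    return enriched_rows


enriched = add_income_per_person(households)
for row in enriched:
    print(row)
```

Any other rule, such as a product of two columns, can be plugged in the same way by changing the expression inside the loop.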

For categorical features, feature interaction becomes an AND operation on two or more features. For example, we can generate an additional feature, gender:site_domain, from two original features, gender and site_domain.

We then use one-hot encoding to transform the string values. If gender takes two values and site_domain takes four, then on top of the six one-hot encoded features (two from gender and four from site_domain), the interaction between gender and site_domain adds eight further features (two times four).
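The AND interaction and the resulting one-hot columns can be sketched in plain Python as follows. The category values, the sample rows, and the helpers interact and one_hot are assumptions made for illustration only:

```python
from itertools import product

# Hypothetical category values: two genders and four site domains.
genders = ["female", "male"]
site_domains = ["cnn.com", "abc.com", "nbc.com", "facebook.com"]

rows = [
    {"gender": "male", "site_domain": "cnn.com"},
    {"gender": "female", "site_domain": "abc.com"},
    {"gender": "male", "site_domain": "nbc.com"},
]


def interact(row, cols):
    """AND interaction: join the values of the given columns with ':'."""
    return ":".join(row[c] for c in cols)


def one_hot(value, categories):
    """One-hot encode a value against an ordered list of categories."""
    return [1 if value == c else 0 for c in categories]


# Every possible gender:site_domain combination (two times four).
combos = [f"{g}:{d}" for g, d in product(genders, site_domains)]

encoded = [one_hot(interact(r, ["gender", "site_domain"]), combos) for r in rows]

print(len(genders) + len(site_domains), "original one-hot columns")
print(len(combos), "interaction columns")
for row, vec in zip(rows, encoded):
    print(interact(row, ["gender", "site_domain"]), vec)
```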

Let's now apply feature interaction to our click prediction project. We take two features, C14 and C15, as an example of AND interaction:

  1. First, we import RFormula, the transformer that supports feature interaction, from PySpark:
>>> from pyspark.ml.feature import RFormula

An RFormula model takes in a formula that describes how features interact. For instance, y ~ a + b means it predicts y from the input features a and b; y ~ a + b + a:b means it predicts y based on features a and b and the interaction term a AND b; y ~ a + b + c + a:b means it predicts y based on features a, b, and c and the interaction term a AND b.
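Formula strings of this shape are easy to assemble programmatically. The following is a minimal sketch; build_formula is a hypothetical helper written for this illustration, not part of PySpark, and it only produces the string that would later be passed to RFormula:

```python
def build_formula(label, features, interactions=()):
    """Build an R-style formula string such as 'y ~ a + b + a:b'.

    label:        name of the target column
    features:     names of the individual input features
    interactions: groups of feature names to combine with ':'
    """
    terms = list(features)
    for group in interactions:
        terms.append(":".join(group))
    return f"{label} ~ " + " + ".join(terms)


formula_plain = build_formula("y", ["a", "b"])
formula_inter = build_formula("y", ["a", "b", "c"], interactions=[("a", "b")])
print(formula_plain)
print(formula_inter)
```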

  2. We need to define an interaction formula accordingly:
>>> cat_inter = ['C14', 'C15']
>>> cat_no_inter = [c for c in categorical if c not in cat_inter]
>>> concat = '+'.join(categorical)
>>> interaction = ':'.join(cat_inter)
>>> formula = "label ~ " + concat + '+' + interaction
>>> print(formula)
label ~ C1+banner_pos+site_id+site_domain+site_category+app_id+app_domain+app_category+device_model+device_type+device_conn_type+C14+C15+C16+C17+C18+C19+C20+C21+C14:C15

  3. Now, we can initialize a feature interactor with this formula:
>>> interactor = RFormula(
... formula=formula,
... featuresCol="features",
... labelCol="label").setHandleInvalid("keep")

Again, calling setHandleInvalid("keep") here makes sure the transformation won't crash if a new categorical value occurs.

  4. Use the defined feature interactor to fit and transform the input DataFrame:
>>> interactor.fit(df_train).transform(df_train).select("features").show()
+--------------------+
|            features|
+--------------------+
|(54930,[5,7,3527,...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,788,4...|
|(54930,[5,7,1271,...|
|(54930,[5,7,1271,...|
|(54930,[5,7,1271,...|
|(54930,[5,7,1271,...|
|(54930,[5,7,1532,...|
|(54930,[5,7,4366,...|
|(54930,[5,7,14,45...|
+--------------------+
only showing top 20 rows

More than 20,000 features are added to the feature space due to the interaction term of C14 and C15.
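The growth follows the same "two by four" arithmetic as before: a full cross of two categorical features needs one column per combination of their distinct values. Here is a rough sketch of that count in plain Python, using made-up values in place of the real C14 and C15 columns. The helpers interaction_width and observed_pairs are hypothetical, and the exact dimension Spark reports may differ depending on how it encodes each column:

```python
def interaction_width(rows, cols):
    """Columns needed for a full cross: product of the distinct-value counts."""
    width = 1
    for c in cols:
        width *= len({row[c] for row in rows})
    return width


def observed_pairs(rows, cols):
    """Number of value combinations that actually appear in the data."""
    return len({tuple(row[c] for c in cols) for row in rows})


# Toy stand-ins for the C14 and C15 columns (values are made up).
rows = [
    {"C14": "a", "C15": "x"},
    {"C14": "a", "C15": "y"},
    {"C14": "b", "C15": "x"},
    {"C14": "c", "C15": "x"},
    {"C14": "a", "C15": "x"},
]

full_cross = interaction_width(rows, ["C14", "C15"])
seen = observed_pairs(rows, ["C14", "C15"])
print(full_cross, "columns for the full cross;", seen, "combinations observed")
```

The gap between the two numbers shows why interaction terms between high-cardinality features produce very sparse vectors: most of the added columns are never set.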

  5. Again, we chain the feature interactor and classification model together into a pipeline to better organize the entire workflow:
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
>>> # Logistic regression without regularization
>>> classifier = LogisticRegression(maxIter=20, regParam=0.000,
...                                 elasticNetParam=0.000)
>>> # Feature interaction first, then classification
>>> stages = [interactor, classifier]
>>> pipeline = Pipeline(stages=stages)
>>> model = pipeline.fit(df_train)
>>> predictions = model.transform(df_test)
>>> predictions.cache()
>>> # Evaluate the predictions on the testing set with AUC
>>> ev = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
...                                    metricName="areaUnderROC")
>>> print(ev.evaluate(predictions))
0.7490392990518315

An AUC of 74.90%, obtained with the additional interaction between features C14 and C15, is a marginal improvement over the 74.89% obtained without any interaction.
