Chapter 7. Online Advertising Networks

In this chapter, we’ll examine building an online advertising network that connects advertisers and media websites. Advertisers provide the ads for display, with each ad designed for a particular ad zone. Media sites, on the other hand, provide content pages for display with various regions marked for serving ads. When the media site displays a page, it makes a request to the ad network for one or more ads to display in its ad zones.

As part of the ad serving, the ad network records the number of pageviews of each ad in order to track statistics for the ad, which may then also be used to bill the advertiser.

Solution Overview

This solution is structured as a progressive refinement of the ad network, starting out with the basic data storage requirements and adding more advanced features to the schema to support more advanced ad targeting. The key performance criterion for this solution is the latency between receiving an ad request and returning the (targeted) ad to be displayed.

Design 1: Basic Ad Serving

A basic ad-serving algorithm consists of the following steps:

  1. The network receives a request for an ad, specifying at a minimum the site_id and zone_id to be served.
  2. The network consults its inventory of ads available to display and chooses an ad based on various business rules.
  3. The network returns the actual ad to be displayed, recording the pageview for the ad as well.

This design uses the site_id and zone_id submitted with the ad request, as well as information stored in the ad inventory collection, to make the ad targeting decisions. Later examples will build on this, allowing more advanced ad targeting.

Schema Design

A very basic schema for storing ads available to be served consists of a single collection, ad.zone:

{
  _id: ObjectId(...),
  site_id: 'cnn',
  zone_id: 'banner',
  ads: [
    { campaign_id: 'mercedes:c201204_sclass_4',
      ad_unit_id: 'banner23a',
      ecpm: 250 },
    { campaign_id: 'mercedes:c201204_sclass_4',
      ad_unit_id: 'banner23b',
      ecpm: 250 },
    { campaign_id: 'bmw:c201204_eclass_1',
      ad_unit_id: 'banner12',
      ecpm: 200 },
    ... ]
}

Note that for each site-zone combination you’ll be storing a list of ads, sorted by their eCPM values.

eCPM, CPM, CPC, CTR, etc.

The world of online advertising is full of somewhat cryptic acronyms. Most of the decisions made by the ad network in this chapter will be based on the eCPM, or effective cost per mille. This is a synthetic measure meant to allow comparison between CPM (cost per mille) ads, which are priced based on the number of impressions, and CPC (cost per click) ads, which are priced per click.

The eCPM of a CPM ad is just the CPM. Calculating the eCPM of a CPC ad is fairly straightforward, based on the CTR (click-through rate), which is defined as the number of clicks per ad impression. The formula for eCPM for a CPC ad then is:

eCPM = CPC × CTR × 1000

In this chapter, we’ll assume that the eCPM is already known for each ad, though you’ll obviously need to calculate it in a real ad network.

Operation: Choose an Ad to Serve

The query we’ll use to choose which ad to serve selects a compatible ad and sorts by the advertiser’s ecpm bid in order to maximize the ad network’s profits:

from itertools import groupby
from random import choice

def choose_ad(site_id, zone_id):
    site = db.ad.zone.find_one({
        'site_id': site_id, 'zone_id': zone_id}) 1
    if site is None: return None
    if len(site['ads']) == 0: return None
    ecpm_groups = groupby(site['ads'], key=lambda ad:ad['ecpm']) 2
    ecpm, ad_group = ecpm_groups.next()
    return choice(list(ad_group))3
1

First, we find a compatible site and zone for the ad request.

2

Next, we group the ads based on their eCPM. This step requires that the ads already be sorted by descending eCPM.

3

Finally, we randomly select an ad from the most expensive ad group.

In order to execute the ad choice with the lowest latency possible, we need to maintain a compound index on site_id, zone_id:

>>> db.ad.zone.ensure_index([
...     ('site_id', 1),
...     ('zone_id', 1) ])

Operation: Make an Ad Campaign Inactive

One case we need to deal with is making a campaign inactive. This may happen for a variety of reasons. For instance, the campaign may have reached its end date or exhausted its budget for the current time period. In this case, the logic is fairly straightforward:

def deactivate_campaign(campaign_id):
    db.ad.zone.update(
      { 'ads.campaign_id': campaign_id },
      {' $pull': { 'ads', { 'campaign_id': campaign_id } } },
      multi=True)

This update statement first selects only those ad zones that had available ads from the given campaign_id and then uses the $pull modifier to remove them from rotation.

To execute the multiupdate quickly, we’ll keep an index on the ads.campaign_id field:

>>> db.ad.zone.ensure_index('ads.campaign_id')

Sharding Concerns

In order to scale beyond the capacity of a single replica set, you will need to shard the ad.zone collection. To maintain the lowest possible latency (and the highest possible throughput) in the ad selection operation, the shard key needs to be chosen to allow MongoDB to route the ad.zone query to a single shard. In this case, a good approach is to shard on the site_id, zone_id combination:

>>> db.command('shardcollection', 'dbname.ads.ad.zone', {
...     'key': {'site_id': 1, 'zone_id': 1} })
{ "collectionsharded": "dbname.ads.ad.zone", "ok": 1 }

Design 2: Adding Frequency Capping

One problem with the logic described in Design 1: Basic Ad Serving is that it will tend to display the same ad over and over again until the campaign’s budget is exhausted. To mitigate this, advertisers may wish to limit the frequency with which a given user is presented a particular ad. This process is called frequency capping and is an example of user profile targeting in advertising.

In order to perform frequency capping (or any type of user targeting), the ad network needs to maintain a profile for each visitor, typically implemented as a cookie in the user’s browser. This cookie, effectively a user_id, is then transmitted to the ad network when logging impressions, clicks, conversions, etc., as well as the ad-serving decision. This section focuses on how that profile data impacts the ad-serving decision.

Schema Design

In order to use the user profile data, we need to store it. In this case, it’s stored in a collection ad.user:

{
  _id: 'cookie_value',
  advertisers: {
    mercedes: {
      impressions: [
        { date: ISODateTime(...),
          campaign: 'c201204_sclass_4',
          ad_unit_id: 'banner23a',
          site_id: 'cnn',
          zone_id: 'banner' } },
        ... ],
      clicks: [
        { date: ISODateTime(...),
          campaign: 'c201204_sclass_4',
          ad_unit_id: 'banner23a',
          site_id: 'cnn',
          zone_id: 'banner' } },
      ... ],
    bmw: [ ... ],
    ...
  }
}

There are a few things to note about the user profile:

  • All data is embedded in a single profile document. When you need to query this data (detailed next), you don’t necessarily know which advertiser’s ads you’ll be showing, so it’s a good practice to embed all advertisers in a single document.
  • The event information is grouped by event type within an advertiser, and sorted by timestamp. This allows rapid lookups of a stream of a particular type of event.

Operation: Choose an Ad to Serve

The query we’ll use to choose which ad to serve now needs to iterate through ads in order of profitability and select the “best” ad that also satisfies the advertiser’s targeting rules (in this case, the frequency cap):

from itertools import groupby
from random import shuffle
from datetime import datetime, timedelta

def choose_ad(site_id, zone_id, user_id):
   site = db.ad.zone.find_one({
        'site_id': site_id, 'zone_id': zone_id}) 1
   if site is None or len(site['ads']) == 0: return None
   ads = ad_iterator(site['ads']) 2
   user = db.ad.user.find_one({'user_id': user_id}) 3
   if user is None:
       # any ad is acceptable for an unknown user
       return ads.next()
   for ad in ads: 4
       advertiser_id = ad['campaign_id'].split(':', 1)[0]
       if ad_is_acceptable(ad, user[advertiser_id]):
           return ad
   return None
1

Here we once again find all ads that are targeted to that particular site and ad zone.

2

Next, we have factored out a Python generator that will iterate through all the ads in order of profitability.

3

Now we load the user profile for the given user. If there is no profile, we return the first ad in the iterator.

4

Finally, we iterate through each of the ads and check it using the ad_is_acceptable function.

Here’s our ad_iterator generator:

def ad_iterator(ads):
    '''Find available ads, sorted by ecpm, with random sort for ties'''
    ecpm_groups = groupby(ads, key=lambda ad:ad['ecpm'])
    for ecpm, ad_group in ecpm_groups:
        ad_group = list(ad_group)
        shuffle(ad_group)
        for ad in ad_group: yield ad

This generator yields the ads in an order that both maximizes profitability and randomly shuffles ads of the same eCPM. Finally, here’s our ad filter ad_is_acceptable:

def ad_is_acceptable(ad, profile):
    '''Returns False if the user has seen the ad today'''
    threshold = datetime.utcnow() - timedelta(days=1)
    for event in reversed(profile['impressions']):
        if event['timestamp'] < threshold: break
        if event['detail']['ad_unit_id'] == ad['ad_unit_id']:
            return False
    return True

This function examines all the user’s ad impressions for the current day and rejects an ad that has been displayed to that user.

In order to retrieve the user profile with the lowest latency possible, there needs to be an index on the _id field, which MongoDB supplies by default.

Sharding

When sharding the ad.user collection, choosing the _id field as a shard key allows MongoDB to route queries and updates to the user profile:

>>> db.command('shardcollection', 'dbname.ads.ad.user', {
...     'key': {'_id': 1 } })
{ "collectionsharded": "dbname.ads.ad.user", "ok": 1 }

Design 3: Keyword Targeting

Where frequency capping in the previous section is an example of user profile targeting, you may also wish to perform content targeting so that the user receives relevant ads for the particular page being viewed. The simplest example of this is targeting ads at the result of a search query. In this case, a list of keywords is sent to the choose_ad() call along with the site_id, zone_id, and user_id.

Schema Design

In order to choose relevant ads, we’ll need to expand the ad.zone collection to store relevant keywords for each ad:

{
  _id: ObjectId(...),
  site_id: 'cnn',
  zone_id: 'search',
  ads: [
    { campaign_id: 'mercedes:c201204_sclass_4',
      ad_unit_id: 'search1',
      keywords: [ 'car', 'luxury', 'style' ],
      ecpm: 250 },
    { campaign_id: 'mercedes:c201204_sclass_4',
      ad_unit_id: 'search2',
      keywords: [ 'car', 'luxury', 'style' ],
      ecpm: 250 },
    { campaign_id: 'bmw:c201204_eclass_1',
      ad_unit_id: 'search1',
      keywords: [ 'car', 'performance' ],
      ecpm: 200 },
    ... ]
}

Operation: Choose a Group of Ads to Serve

In the approach described here, we’ll choose a number of ads that match the keywords used in the search, so the following code has been tweaked to return an iterator over ads in descending order of preference. We’ve also modified the ad_iterator to take the list of keywords as a second parameter:

def choose_ads(site_id, zone_id, user_id, keywords):
   site = db.ad.zone.find_one({
        'site_id': site_id, 'zone_id': zone_id})
   if site is None: return []
   ads = ad_iterator(site['ads'], keywords)
   user = db.ad.user.find_one({'user_id': user_id})
   if user is None: return ads
   for ad in ads:
       advertiser_id = ad['campaign_id'].split(':', 1)[0]
       if ad_is_acceptable(ad, user[advertiser_id]):
           yield ad
   return None

Our ad_iterator method has been modified to allow us to score ads based on both their eCPM as well as their relevance:

def ad_iterator(ads, keywords):
  '''Find available ads, sorted by score, with random sort for ties'''
  keywords = set(keywords)
  scored_ads = [
    (ad_score(ad, keywords), ad) for ad in ads ]
  score_groups = groupby(
    sorted(scored_ads), key=lambda score, ad: score)
  for score, ad_group in score_groups:
      ad_group = list(ad_group)
      shuffle(ad_group)
      for ad in ad_group: yield ad

def ad_score(ad, keywords):
  '''Compute a desirability score based on the ad eCPM and keywords'''
  matching = set(ad['keywords']).intersection(keywords) return
  ad['ecpm'] * math.log( 1.1 + len(matching))

def ad_is_acceptible(ad, profile):
  # same as above

The main thing to note in the preceding code is that ads must now be sorted according to some score, which in this case is computed based on a combination of the ecpm of the ad as well as the number of keywords matched. More advanced use cases may boost the importance of various keywords, but this goes beyond the scope of this use case. One thing to keep in mind is that because the ads are now being sorted at display time, there may be performance issues if a large number of ads are competing for the same display slot.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.221.144