In this chapter, we’ll examine building an online advertising network that connects advertisers and media websites. Advertisers provide the ads for display, with each ad designed for a particular ad zone. Media sites, on the other hand, provide content pages for display with various regions marked for serving ads. When the media site displays a page, it makes a request to the ad network for one or more ads to display in its ad zones.
As part of the ad serving, the ad network records the number of pageviews of each ad in order to track statistics for the ad, which may then also be used to bill the advertiser.
This solution is structured as a progressive refinement of the ad network, starting out with the basic data storage requirements and adding more advanced features to the schema to support more advanced ad targeting. The key performance criterion for this solution is the latency between receiving an ad request and returning the (targeted) ad to be displayed.
A basic ad-serving algorithm consists of the following steps:
site_id
and zone_id
to be served.
This design uses the site_id
and zone_id
submitted with the ad
request, as well as information stored in the ad inventory collection,
to make the ad targeting decisions. Later examples will build on this,
allowing more advanced ad targeting.
A very basic schema for storing ads available to be served consists of a
single collection, ad.zone
:
{
_id
:
ObjectId
(...),
site_id
:
'cnn'
,
zone_id
:
'banner'
,
ads
:
[
{
campaign_id
:
'mercedes:c201204_sclass_4'
,
ad_unit_id
:
'banner23a'
,
ecpm
:
250
},
{
campaign_id
:
'mercedes:c201204_sclass_4'
,
ad_unit_id
:
'banner23b'
,
ecpm
:
250
},
{
campaign_id
:
'bmw:c201204_eclass_1'
,
ad_unit_id
:
'banner12'
,
ecpm
:
200
},
...
]
}
Note that for each site-zone combination you’ll be storing a list of ads, sorted by their eCPM values.
The world of online advertising is full of somewhat cryptic acronyms. Most of the decisions made by the ad network in this chapter will be based on the eCPM, or effective cost per mille. This is a synthetic measure meant to allow comparison between CPM (cost per mille) ads, which are priced based on the number of impressions, and CPC (cost per click) ads, which are priced per click.
The eCPM of a CPM ad is just the CPM. Calculating the eCPM of a CPC ad is fairly straightforward, based on the CTR (click-through rate), which is defined as the number of clicks per ad impression. The formula for eCPM for a CPC ad then is:
eCPM = CPC × CTR × 1000
In this chapter, we’ll assume that the eCPM is already known for each ad, though you’ll obviously need to calculate it in a real ad network.
The query we’ll use to choose which ad to serve selects a compatible ad
and sorts by the advertiser’s ecpm
bid in order to maximize the ad
network’s profits:
from
itertools
import
groupby
from
random
import
choice
def
choose_ad
(
site_id
,
zone_id
):
site
=
db
.
ad
.
zone
.
find_one
({
'site_id'
:
site_id
,
'zone_id'
:
zone_id
})
if
site
is
None
:
return
None
if
len
(
site
[
'ads'
])
==
0
:
return
None
ecpm_groups
=
groupby
(
site
[
'ads'
],
key
=
lambda
ad
:
ad
[
'ecpm'
])
ecpm
,
ad_group
=
ecpm_groups
.
next
()
return
choice
(
list
(
ad_group
))
In order to execute the ad choice with the lowest latency possible,
we need to maintain a compound index on site_id
, zone_id
:
>>>
db
.
ad
.
zone
.
ensure_index
([
...
(
'site_id'
,
1
),
...
(
'zone_id'
,
1
)
])
One case we need to deal with is making a campaign inactive. This may happen for a variety of reasons. For instance, the campaign may have reached its end date or exhausted its budget for the current time period. In this case, the logic is fairly straightforward:
def
deactivate_campaign
(
campaign_id
):
db
.
ad
.
zone
.
update
(
{
'ads.campaign_id'
:
campaign_id
},
{
' $pull'
:
{
'ads'
,
{
'campaign_id'
:
campaign_id
}
}
},
multi
=
True
)
This update statement first selects only those ad zones that had
available ads from the given campaign_id
and then uses the $pull
modifier to remove them from rotation.
To execute the multiupdate quickly, we’ll keep an index on the
ads.campaign_id
field:
>>>
db
.
ad
.
zone
.
ensure_index
(
'ads.campaign_id'
)
In order to scale beyond the capacity of a single replica set, you will
need to shard the ad.zone
collection. To maintain the lowest possible
latency (and the highest possible throughput) in the ad selection operation, the
shard key needs to be chosen to allow MongoDB to route the ad.zone
query to a
single shard.
In this case, a good approach is to shard on the site_id
, zone_id
combination:
>>>
db
.
command
(
'shardcollection'
,
'dbname.ads.ad.zone'
,
{
...
'key'
:
{
'site_id'
:
1
,
'zone_id'
:
1
}
})
{ "collectionsharded": "dbname.ads.ad.zone", "ok": 1 }
One problem with the logic described in Design 1: Basic Ad Serving is that it will tend to display the same ad over and over again until the campaign’s budget is exhausted. To mitigate this, advertisers may wish to limit the frequency with which a given user is presented a particular ad. This process is called frequency capping and is an example of user profile targeting in advertising.
In order to perform frequency capping (or any type of user targeting),
the ad network needs to maintain a profile for each visitor, typically
implemented as a cookie in the user’s browser. This cookie, effectively
a user_id
, is then transmitted to the ad network when logging
impressions, clicks, conversions, etc., as well as the ad-serving
decision. This section focuses on how that profile data impacts the ad-serving decision.
In order to use the user profile data, we need to store it. In this
case, it’s stored in a collection ad.user
:
{
_id
:
'cookie_value'
,
advertisers
:
{
mercedes
:
{
impressions
:
[
{
date
:
ISODateTime
(...),
campaign
:
'c201204_sclass_4'
,
ad_unit_id
:
'banner23a'
,
site_id
:
'cnn'
,
zone_id
:
'banner'
}
},
...
],
clicks
:
[
{
date
:
ISODateTime
(...),
campaign
:
'c201204_sclass_4'
,
ad_unit_id
:
'banner23a'
,
site_id
:
'cnn'
,
zone_id
:
'banner'
}
},
...
],
bmw
:
[
...
],
...
}
}
There are a few things to note about the user profile:
The query we’ll use to choose which ad to serve now needs to iterate through ads in order of profitability and select the “best” ad that also satisfies the advertiser’s targeting rules (in this case, the frequency cap):
from
itertools
import
groupby
from
random
import
shuffle
from
datetime
import
datetime
,
timedelta
def
choose_ad
(
site_id
,
zone_id
,
user_id
):
site
=
db
.
ad
.
zone
.
find_one
({
'site_id'
:
site_id
,
'zone_id'
:
zone_id
})
if
site
is
None
or
len
(
site
[
'ads'
])
==
0
:
return
None
ads
=
ad_iterator
(
site
[
'ads'
])
user
=
db
.
ad
.
user
.
find_one
({
'user_id'
:
user_id
})
if
user
is
None
:
# any ad is acceptable for an unknown user
return
ads
.
next
()
for
ad
in
ads
:
advertiser_id
=
ad
[
'campaign_id'
]
.
split
(
':'
,
1
)[
0
]
if
ad_is_acceptable
(
ad
,
user
[
advertiser_id
]):
return
ad
return
None
Here we once again find all ads that are targeted to that particular site and ad zone.
Next, we have factored out a Python generator that will iterate through all the ads in order of profitability.
Now we load the user profile for the given user. If there is no profile, we return the first ad in the iterator.
Finally, we iterate through each of the ads and check it using the
ad_is_acceptable
function.
Here’s our ad_iterator
generator:
def
ad_iterator
(
ads
):
'''Find available ads, sorted by ecpm, with random sort for ties'''
ecpm_groups
=
groupby
(
ads
,
key
=
lambda
ad
:
ad
[
'ecpm'
])
for
ecpm
,
ad_group
in
ecpm_groups
:
ad_group
=
list
(
ad_group
)
shuffle
(
ad_group
)
for
ad
in
ad_group
:
yield
ad
This generator yields the ads in an order that both maximizes profitability and
randomly shuffles ads of the same eCPM. Finally, here’s our ad filter
ad_is_acceptable
:
def
ad_is_acceptable
(
ad
,
profile
):
'''Returns False if the user has seen the ad today'''
threshold
=
datetime
.
utcnow
()
-
timedelta
(
days
=
1
)
for
event
in
reversed
(
profile
[
'impressions'
]):
if
event
[
'timestamp'
]
<
threshold
:
break
if
event
[
'detail'
][
'ad_unit_id'
]
==
ad
[
'ad_unit_id'
]:
return
False
return
True
This function examines all the user’s ad impressions for the current day and rejects an ad that has been displayed to that user.
In order to retrieve the user profile with the lowest latency possible,
there needs to be an index on the _id
field, which MongoDB supplies by
default.
Where frequency capping in the previous section is an example of user profile targeting,
you may also wish to perform content targeting so that the user receives
relevant ads for the particular page being viewed. The simplest example
of this is targeting ads at the result of a search query. In this case,
a list of keywords
is sent to the choose_ad()
call along with the
site_id
, zone_id
, and user_id
.
In order to choose relevant ads, we’ll need to expand the ad.zone
collection to store relevant keywords for each ad:
{
_id
:
ObjectId
(...),
site_id
:
'cnn'
,
zone_id
:
'search'
,
ads
:
[
{
campaign_id
:
'mercedes:c201204_sclass_4'
,
ad_unit_id
:
'search1'
,
keywords
:
[
'car'
,
'luxury'
,
'style'
],
ecpm
:
250
},
{
campaign_id
:
'mercedes:c201204_sclass_4'
,
ad_unit_id
:
'search2'
,
keywords
:
[
'car'
,
'luxury'
,
'style'
],
ecpm
:
250
},
{
campaign_id
:
'bmw:c201204_eclass_1'
,
ad_unit_id
:
'search1'
,
keywords
:
[
'car'
,
'performance'
],
ecpm
:
200
},
...
]
}
In the approach described here, we’ll choose a number of ads that match
the keywords used in the search, so the following code has been tweaked to
return an iterator over ads in descending order of preference. We’ve also
modified the ad_iterator
to take the list of keywords as a second parameter:
def
choose_ads
(
site_id
,
zone_id
,
user_id
,
keywords
):
site
=
db
.
ad
.
zone
.
find_one
({
'site_id'
:
site_id
,
'zone_id'
:
zone_id
})
if
site
is
None
:
return
[]
ads
=
ad_iterator
(
site
[
'ads'
],
keywords
)
user
=
db
.
ad
.
user
.
find_one
({
'user_id'
:
user_id
})
if
user
is
None
:
return
ads
for
ad
in
ads
:
advertiser_id
=
ad
[
'campaign_id'
]
.
split
(
':'
,
1
)[
0
]
if
ad_is_acceptable
(
ad
,
user
[
advertiser_id
]):
yield
ad
return
None
Our ad_iterator
method has been modified to allow us to score ads based on both
their eCPM as well as their relevance:
def
ad_iterator
(
ads
,
keywords
):
'''Find available ads, sorted by score, with random sort for ties'''
keywords
=
set
(
keywords
)
scored_ads
=
[
(
ad_score
(
ad
,
keywords
),
ad
)
for
ad
in
ads
]
score_groups
=
groupby
(
sorted
(
scored_ads
),
key
=
lambda
score
,
ad
:
score
)
for
score
,
ad_group
in
score_groups
:
ad_group
=
list
(
ad_group
)
shuffle
(
ad_group
)
for
ad
in
ad_group
:
yield
ad
def
ad_score
(
ad
,
keywords
):
'''Compute a desirability score based on the ad eCPM and keywords'''
matching
=
set
(
ad
[
'keywords'
])
.
intersection
(
keywords
)
return
ad
[
'ecpm'
]
*
math
.
log
(
1.1
+
len
(
matching
))
def
ad_is_acceptible
(
ad
,
profile
):
# same as above
The main thing to note in the preceding code is that ads must now be sorted
according to some score
, which in this case is computed based on a
combination of the ecpm
of the ad as well as the number of keywords
matched. More advanced use cases may boost the importance of various
keywords, but this goes beyond the scope of this use case.
One thing to keep in mind is that because the ads are now being sorted at
display time, there may be performance issues if a large number
of ads are competing for the same display slot.
3.146.221.144