In this chapter, we’ll explore how you could use MongoDB to store the social graph for a social networking site. We’ll look at storing and grouping followers as well as how to publish events to different followers with different privacy settings.
Our solution assumes a directed social graph where a user can choose whether or not to follow another user. Additionally, the user can designate “circles” of users with which to share updates, in order to facilitate fine-grained control of privacy. The solution presented here is designed in such a way as to minimize the number of documents that must be loaded in order to display any given page, even at the expense of complicating updates.
The particulars of what type of data we want to host on a social network obviously depend on the type of social network we’re designing, and is largely beyond the scope of this use case. In particular, the main variables that you would have to consider in adapting this use case to your particular situation are:
In the solution presented here, we’ll use two main “independent” collections and three “dependent” collections to store user profile data and posts.
The first collection, social.user
, stores the social graph information
for a given user along with the user’s profile data:
{
_id
:
'T4Y...AC'
,
// base64-encoded ObjectId
name
:
'Rick'
,
profile
:
{
...
age
,
location
,
interests
,
etc
.
...
},
followers
:
{
"T4Y...AD"
:
{
name
:
'Jared'
,
circles
:
[
'python'
,
'authors'
]
},
"T4Y...AF"
:
{
name
:
'Bernie'
,
circles
:
[
'python'
]
},
"T4Y...AI"
:
{
name
:
'Meghan'
,
circles
:
[
'python'
,
'speakers'
]
},
...
],
circles
:
{
"10gen"
:
{
"T4Y...AD"
:
{
name
:
'Jared'
},
"T4Y...AE"
:
{
name
:
'Max'
},
"T4Y...AF"
:
{
name
:
'Bernie'
},
"T4Y...AH"
:
{
name
:
'Paul'
},
...
},
...}
4
},
blocked
:
[
'gh1...0d'
]
}
There are a few things to note about this schema:
ObjectId
for the _id
field, we’ll use
a base64-encoded version. Although we can use raw ObjectId
values as keys,
we can’t use them to “reach inside” a document in a query or an update. By
base64-encoding the _id
values, we can use queries or updates that include
the _id
value like circles.10gen.T4Y...AD
.
followers
and
circles
collections. While this is technically redundant, having the
bidirectional connections is useful both for displaying the user’s
followers on the profile page, as well as propagating posts to other
users, as we’ll see later.
profile
subdocument, allowing us to evolve the profile’s schema as necessary
without worrying about name collisions with other parts of the schema that
need to remain fixed for social graph operations.
Of course, to make the network interesting, it’s necessary to add
various types of posts. We’ll put these in the social.post
collection:
{
_id
:
ObjectId
(...),
by
:
{
id
:
"T4Y...AE"
,
name
:
'Max'
},
circles
:
[
'*public*'
],
type
:
'status'
,
ts
:
ISODateTime
(...),
detail
:
{
text
:
'Loving MongoDB'
},
comments
:
[
{
by
:
{
id
:
"T4Y...AG"
,
name
:
'Dwight'
},
ts
:
ISODateTime
(...),
text
:
'Right on!'
},
...
all
comments
listed
...
]
}
Here, the post stores minimal author information (by
), the post
type
, a timestamp ts
, post details detail
(which vary by post
type), and a comments
array. In this case, the schema embeds all
comments on a post as a time-sorted flat array. For a more in-depth
exploration of the other approaches to storing comments, refer back to
Storing Comments.
A couple of points are worthy of further discussion:
by
property to display the author name and a link to the author profile. If
a user wants more detail on a particular author, we can fetch this
information as they request it. Storing minimal information like this
helps keep the document small (and therefore fast.)
circles
property;
any user that is part of one of the listed circles can view the post.
The special values *public*
and *circles*
allow the user to
share a post with the whole world or with any users in any of the
posting user’s circles, respectively.
detail
field. Isolating this polymorphic information into a
subdocument is a good practice, helping to identify which parts
of the document are common to all posts and which can vary. In this
case, we would store different data for a photo post versus a status
update, while still keeping the metadata (_id
, by
, circles
,
type
, ts
, and comments
) the same.
In addition to independent collections, for optimal
performance we’ll need to create a few dependent collections that will
be used to cache information for display. The first of these collections
is the social.wall
collection, and is intended to display a “wall”
containing posts created by or directed to a particular user. The format
of the social.wall
collection follows:
{
_id
:
ObjectId
(...),
user_id
:
"T4Y...AE"
,
month
:
'201204'
,
posts
:
[
{
id
:
ObjectId
(...),
ts
:
ISODateTime
(...),
by
:
{
id
:
"T4Y...AE"
,
name
:
'Max'
},
circles
:
[
'*public*'
],
type
:
'status'
,
detail
:
{
text
:
'Loving MongoDB'
},
comments_shown
:
3
,
comments
:
[
{
by
:
{
id
:
"T4Y...AG"
,
name
:
'Dwight'
,
ts
:
ISODateTime
(...),
text
:
'Right on!'
},
...
only
last
3
comments
listed
...
]
},
{
id
:
ObjectId
(...),
s
ts
:
ISODateTime
(...),
by
:
{
id
:
"T4Y...AE"
,
name
:
'Max'
},
circles
:
[
'*circles*'
],
type
:
'checkin'
,
detail
:
{
text
:
'Great office!'
,
geo
:
[
40.724348
,
-
73.997308
],
name
:
'10gen Office'
,
photo
:
'http://....'
},
comments_shown
:
1
,
comments
:
[
{
by
:
{
id
:
"T4Y...AD"
,
name
:
'Jared'
},
ts
:
ISODateTime
(...),
text
:
'Wrong coast!'
},
...
only
last
1
comment
listed
...
]
},
{
id
:
ObjectId
(...),
ts
:
ISODateTime
(...),
by
:
{
id
:
"T4Y...g9"
,
name
:
'Rick'
},
circles
:
[
'10gen'
],
type
:
'status'
,
detail
:
{
text
:
'So when do you crush Oracle?'
},
comments_shown
:
2
,
comments
:
[
{
by
:
{
id
:
"T4Y...AE"
,
name
:
'Max'
},
ts
:
ISODateTime
(...),
text
:
'Soon... ;-)'
},
...
only
last
2
comments
listed
...
]
},
...
]
}
There are a few things to note about this schema:
social.post
collection for full details.
social.wall
documents for each
social.user
document, one wall document per month. This allows the
system to keep a “page” of recent posts in the initial page view,
fetching older months if requested.
by
properties store only the minimal author
information for display, helping to keep this document small.
$size
query operator does not allow inequality comparisons.
The other dependent collection we’ll use is social.news
, posts from
people the user follows. This schema includes much of the same
information as the social.wall
information, so this document has
been abbreviated for clarity:
{
_id
:
ObjectId
(...),
user_id
:
"T4Y...AE"
,
month
:
'201204'
,
posts
:
[
...
]
}
Since these schemas optimize for read performance at the possible expense of write performance, a production system should provide a queueing system for processing updates that may take longer than the desired web request latency.
The most common operation on a social network is probably the display of a
particular user’s news feed, followed by a user’s wall posts. Because the
social.news
and social.wall
collections are optimized for these
operations, the query is fairly straightforward. Since these two
collections share a schema, viewing the posts for a news feed or a wall
are actually quite similar operations, and can be supported by the same
code:
def
get_posts
(
collection
,
user_id
,
month
=
None
):
spec
=
{
'user_id'
:
viewed_user_id
}
if
month
is
not
None
:
spec
[
'month'
]
=
{
'$lte'
:
month
}
cur
=
collection
.
find
(
spec
)
cur
=
cur
.
sort
(
'month'
,
-
1
)
for
page
in
cur
:
for
post
in
reversed
(
page
[
'posts'
]):
yield
page
[
'month'
],
post
The function get_posts
will retrieve all the posts on a
particular user’s wall or news feed in reverse-chronological order. Some
special handling is required to efficiently achieve the
reverse-chronological ordering:
posts
within a month are actually stored in chronological order,
so the order of these posts must be reversed before displaying.
month
argument,
passing this in as an $lte
expression in the query. This can be substantially
faster than using a .skip()
argument to our .find()
.
month
argument to be
used in any subsequent calls to get_posts
.
There is one other issue that needs to be considered in selecting posts
for display: filtering posts for display. In order to choose posts for display,
we’ll need to use some filter functions on the posts
generated by get_posts
. The first of these filters is used to
determine whether to show a post when the user is viewing his or her own
wall:
def
visible_on_own_wall
(
user
,
post
):
'''if poster is followed by user, post is visible'''
for
circle
,
users
in
user
[
'circles'
]
.
items
():
if
post
[
'by'
][
'id'
]
in
users
:
return
True
return
False
In addition to the user’s wall, our social network provides an “incoming” page that contains all posts directed toward a user regardless of whether that poster is followed by the user. In this case, we need to use the block list to filter posts:
def
visible_on_own_incoming
(
user
,
post
):
'''if poster is not blocked by user, post is visible'''
return
post
[
'by'
][
'id'
]
not
in
user
[
'blocked'
]
When viewing a news feed or another user’s wall, the permission check is
a bit different based on the post’s circles
property:
def
visible_post
(
user
,
post
):
if
post
[
'circles'
]
==
[
'*public*'
]:
# public posts always visible
return
True
circles_user_is_in
=
set
(
user
[
'followers'
]
.
get
(
post
[
'by'
][
'id'
],
[]))
if
not
circles_user_is_in
:
# user is not circled by poster; post is invisible
return
False
if
post
[
'circles'
]
==
[
'*circles*'
]:
# post is public to all followed users; post is visible
return
True
for
circle
in
post
[
'circles'
]:
if
circle
in
circles_user_is_in
:
# User is in a circle receiving this post
return
True
return
False
In order to quickly retrieve the pages in the desired order, we’ll need
an index on user_id
, month
in both the social.news
and
social.wall
collections.
>>>
for
collection
in
(
'db.social.news'
,
'db.social.wall'
):
...
collection
.
ensure_index
([
...
(
'user_id'
,
1
),
...
(
'month'
,
-
1
)])
Other than viewing walls and news feeds, creating new posts is the next
most common action taken on social networks. To create a comment by
user
on a given post
containing the given text
, we’ll need to
execute code similar to the following:
from
datetime
import
datetime
def
comment
(
user
,
post_id
,
text
):
ts
=
datetime
.
utcnow
()
month
=
ts
.
strfime
(
'%Y%m'
)
comment
=
{
'by'
:
{
'id'
:
user
[
'id'
],
'name'
:
user
[
'name'
]
}
'ts'
:
ts
,
'text'
:
text
}
# Update the social.posts collection
db
.
social
.
post
.
update
(
{
'_id'
:
post_id
},
{
'$push'
:
{
'comments'
:
comment
}
}
)
# Update social.wall and social.news collections
db
.
social
.
wall
.
update
(
{
'posts.id'
:
post_id
},
{
'$push'
:
{
'comments'
:
comment
},
'$inc'
:
{
'comments_shown'
:
1
}
},
upsert
=
True
,
multi
=
True
)
db
.
social
.
news
.
update
(
{
'posts.id'
:
_id
},
{
'$push'
:
{
'comments'
:
comment
},
'$inc'
:
{
'comments_shown'
:
1
}
},
upsert
=
True
,
multi
=
True
)
The preceding code can actually result in an unbounded number of comments
being inserted into the social.wall
and social.news
collections. To
compensate for this, we need to periodically run the following update
statement to truncate the number of displayed comments and keep the size
of the news and wall documents manageable:
COMMENTS_SHOWN
=
3
def
truncate_extra_comments
():
db
.
social
.
news
.
update
(
{
'posts.comments_shown'
:
{
'$gt'
:
COMMENTS_SHOWN
}
},
{
'$pop'
:
{
'posts.$.comments'
:
-
1
},
'$inc'
:
{
'posts.$.comments_shown'
:
-
1
}
},
multi
=
True
)
db
.
social
.
wall
.
update
(
{
'posts.comments_shown'
:
{
'$gt'
:
COMMENTS_SHOWN
}
},
{
'$pop'
:
{
'posts.$.comments'
:
-
1
},
'$inc'
:
{
'posts.$.comments_shown'
:
-
1
}
},
multi
=
True
)
In order to efficiently execute the updates to the social.news
and social.wall
collections just shown, we need to be able to quickly
locate both of the following document types:
To quickly execute these updates, then, we need to create the following indexes:
>>>
for
collection
in
(
db
.
social
.
news
,
db
.
social
.
wall
):
...
collection
.
ensure_index
(
'posts.id'
)
...
collection
.
ensure_index
(
'posts.comments_shown'
)
Creating a new post fills out the content-creation activities on a social network:
from
datetime
import
datetime
def
post
(
user
,
dest_user
,
type
,
detail
,
circles
):
ts
=
datetime
.
utcnow
()
month
=
ts
.
strfime
(
'%Y%m'
)
post
=
{
'ts'
:
ts
,
'by'
:
{
id
:
user
[
'id'
],
name
:
user
[
'name'
]
},
'circles'
:
circles
,
'type'
:
type
,
'detail'
:
detail
,
'comments'
:
[]
}
# Update global post collection
db
.
social
.
post
.
insert
(
post
)
# Copy to dest user's wall
if
user
[
'id'
]
not
in
dest_user
[
'blocked'
]:
append_post
(
db
.
social
.
wall
,
[
dest_user
[
'id'
]],
month
,
post
)
# Copy to followers' news feeds
if
circles
==
[
'*public*'
]:
dest_userids
=
set
(
user
[
'followers'
]
.
keys
())
else
:
dest_userids
=
set
()
if
circles
==
[
'*circles*'
]:
circles
=
user
[
'circles'
]
.
keys
()
for
circle
in
circles
:
dest_userids
.
update
(
user
[
'circles'
][
circle
])
append_post
(
db
.
social
.
news
,
dest_userids
,
month
,
post
)
The basic sequence of operations in this code is as follows:
social.post
collection.
Updating a particular wall or group of news feeds is then accomplished
using the append_post
function:
def
append_post
(
collection
,
dest_userids
,
month
,
post
):
collection
.
update
(
{
'user_id'
:
{
'$in'
:
sorted
(
dest_userids
)
},
'month'
:
month
},
{
'$push'
:
{
'posts'
:
post
}
},
multi
=
True
)
In order to quickly update the social.wall
and social.news
collections, we once again need an index on both user_id
and
month
. This time, however, the ideal order on the indexes is
month
, user_id
. This is due to the fact that updates to these
collections will always be for the current month; having month appear
first in the index makes the index right-aligned, requiring
significantly less memory to store the active part of the index.
However, in this case, since we already have an index
user_id
, month
, which must be in that order to enable sorting on
month
, adding a second index is unnecessary, and would end up
actually using more RAM to maintain two indexes. So even though this
particular operation would benefit from having an index on month
,
user_id
, it’s best to leave out any additional indexes here.
In a social network, maintaining the social graph is an infrequent
but essential operation. To add a user other
to the current
user self
’s circles, we’ll need to run the following function:
def
circle_user
(
self
,
other
,
circle
):
circles_path
=
'circles.
%s
.
%s
'
%
(
circle
,
other
[
'_id'
])
db
.
social
.
user
.
update
(
{
'_id'
:
self
[
'_id'
]
},
{
'$set'
:
{
circles_path
:
{
'name'
:
other
[
'name'
]}
}
})
follower_circles
=
'followers.
%s
.circles'
%
self
[
'_id'
]
follower_name
=
'followers.
%s
.name'
%
self
[
'_id'
]
db
.
social
.
user
.
update
(
{
'_id'
:
other
[
'_id'
]
},
{
'$push'
:
{
follower_circles
:
circle
},
'$set'
:
{
follower_name
:
self
[
'name'
]
}
})
Note that in this solution, previous posts of the other
user are not
added to the self
user’s news feed or wall. To actually include these
past posts would be an expensive and complex operation, and goes beyond
the scope of this use case.
Of course, we must also support removing users from circles:
def
uncircle_user
(
self
,
other
,
circle
):
circles_path
=
'circles.
%s
.
%s
'
%
(
circle
,
other
[
'_id'
])
db
.
social
.
user
.
update
(
{
'_id'
:
self
[
'_id'
]
},
{
'$unset'
:
{
circles_path
:
1
}
})
follower_circles
=
'followers.
%s
.circles'
%
self
[
'_id'
]
db
.
social
.
user
.
update
(
{
'_id'
:
other
[
'_id'
]
},
{
'$pull'
:
{
follower_circles
:
circle
}
})
# Special case -- 'other' is completely uncircled
db
.
social
.
user
.
update
(
{
'_id'
:
other
[
'_id'
],
follower_circles
:
{
'$size'
:
0
}
},
{
'$unset'
:
{
'followers.'
+
self
[
'_id'
}
}})
In both the circling and uncircling cases, the _id
is included in the
update queries, so no additional indexes are required.
In order to scale beyond the capacity of a single replica set, we
need to shard each of the collections mentioned previously. Since the
social.user
, social.wall
, and social.news
collections contain
documents that are specific to a given user, the user’s _id
field is
an appropriate shard key:
>>>
db
.
command
(
'shardcollection'
,
'dbname.social.user'
,
{
...
'key'
:
{
'_id'
:
1
}
}
)
{ "collectionsharded": "dbname.social.user", "ok": 1 }
>>>
db
.
command
(
'shardcollection'
,
'dbname.social.wall'
,
{
...
'key'
:
{
'user_id'
:
1
}
}
)
{ "collectionsharded": "dbname.social.wall", "ok": 1 }
>>>
db
.
command
(
'shardcollection'
,
'dbname.social.news'
,
{
...
'key'
:
{
'user_id'
:
1
}
}
)
{ "collectionsharded": "dbname.social.news", "ok": 1 }
It turns out that using the posting user’s _id
is actually not the
best choice for a shard key for social.post
. This is due to the fact
that queries and updates to this table are done using the _id
field,
and sharding on by.id
, while tempting, would require these updates to
be broadcast to all shards. To shard the social.post
collection on
_id
, then, we need to execute the following command:
>>>
db
.
command
(
'shardcollection'
,
'dbame.social.post'
,
{
...
'key'
:
{
'_id'
:
1
}
}
)
{ "collectionsharded": "dbname.social.post", "ok": 1 }
18.118.138.195