Chapter 8. Social Networking

In this chapter, we’ll explore how you could use MongoDB to store the social graph for a social networking site. We’ll look at storing and grouping followers as well as how to publish events to different followers with different privacy settings.

Solution Overview

Our solution assumes a directed social graph where a user can choose whether or not to follow another user. Additionally, the user can designate “circles” of users with which to share updates, in order to facilitate fine-grained control of privacy. The solution presented here is designed in such a way as to minimize the number of documents that must be loaded in order to display any given page, even at the expense of complicating updates.

The particulars of what type of data we want to host on a social network obviously depend on the type of social network we’re designing, and is largely beyond the scope of this use case. In particular, the main variables that you would have to consider in adapting this use case to your particular situation are:

What data should be in a user profile?
This may include gender, age, interests, relationship status, and so on for a “personal” social network, or may include resume-type data for a more “business-oriented” social network.
What type of updates are allowed?
Again, depending on what flavor of social network you are designing, you may wish to allow posts such as status updates, photos, links, check-ins, and polls, or you may wish to restrict your users to links and status updates.

Schema Design

In the solution presented here, we’ll use two main “independent” collections and three “dependent” collections to store user profile data and posts.

Independent Collections

The first collection, social.user, stores the social graph information for a given user along with the user’s profile data:

{
  _id: 'T4Y...AC', // base64-encoded ObjectId
  name: 'Rick',
  profile: { ... age, location, interests, etc. ... },
  followers: {
    "T4Y...AD": { name: 'Jared', circles: [ 'python', 'authors'] },
    "T4Y...AF": { name: 'Bernie', circles: [ 'python' ] },
    "T4Y...AI": { name: 'Meghan', circles: [ 'python', 'speakers' ] },
    ...
  ],
  circles: {
    "10gen": {
      "T4Y...AD": { name: 'Jared' },
      "T4Y...AE": { name: 'Max' },
      "T4Y...AF": { name: 'Bernie' },
      "T4Y...AH": { name: 'Paul' },
      ...  },
    ...}
4  },
  blocked: ['gh1...0d']
}

There are a few things to note about this schema:

  • Rather than using a “raw” ObjectId for the _id field, we’ll use a base64-encoded version. Although we can use raw ObjectId values as keys, we can’t use them to “reach inside” a document in a query or an update. By base64-encoding the _id values, we can use queries or updates that include the _id value like circles.10gen.T4Y...AD.
  • We’re storing the social graph bidirectionally in the followers and circles collections. While this is technically redundant, having the bidirectional connections is useful both for displaying the user’s followers on the profile page, as well as propagating posts to other users, as we’ll see later.
  • In addition to the normal “positive” social graph, this schema above stores a block list that contains an array of user IDs for posters whose posts never appear on the user’s wall or news feed.
  • The particular profile data stored for the user is isolated into the profile subdocument, allowing us to evolve the profile’s schema as necessary without worrying about name collisions with other parts of the schema that need to remain fixed for social graph operations.

Of course, to make the network interesting, it’s necessary to add various types of posts. We’ll put these in the social.post collection:

{
   _id: ObjectId(...),
   by: { id: "T4Y...AE", name: 'Max' },
   circles: [ '*public*' ],
   type: 'status',
   ts: ISODateTime(...),
   detail: {
      text: 'Loving MongoDB' },
   comments: [
      { by: { id:"T4Y...AG", name: 'Dwight' },
        ts: ISODateTime(...),
        text: 'Right on!' },
        ... all comments listed ... ]
}

Here, the post stores minimal author information (by), the post type, a timestamp ts, post details detail (which vary by post type), and a comments array. In this case, the schema embeds all comments on a post as a time-sorted flat array. For a more in-depth exploration of the other approaches to storing comments, refer back to Storing Comments.

A couple of points are worthy of further discussion:

  • Author information is truncated; just enough is stored in each by property to display the author name and a link to the author profile. If a user wants more detail on a particular author, we can fetch this information as they request it. Storing minimal information like this helps keep the document small (and therefore fast.)
  • The visibility of the post is controlled via the circles property; any user that is part of one of the listed circles can view the post. The special values *public* and *circles* allow the user to share a post with the whole world or with any users in any of the posting user’s circles, respectively.
  • Different types of posts may contain different types of data in the detail field. Isolating this polymorphic information into a subdocument is a good practice, helping to identify which parts of the document are common to all posts and which can vary. In this case, we would store different data for a photo post versus a status update, while still keeping the metadata (_id, by, circles, type, ts, and comments) the same.

Dependent Collections

In addition to independent collections, for optimal performance we’ll need to create a few dependent collections that will be used to cache information for display. The first of these collections is the social.wall collection, and is intended to display a “wall” containing posts created by or directed to a particular user. The format of the social.wall collection follows:

{
   _id: ObjectId(...),
   user_id: "T4Y...AE",
   month: '201204',
   posts: [
      { id: ObjectId(...),
        ts: ISODateTime(...),
        by: { id: "T4Y...AE", name: 'Max' },
        circles: [ '*public*' ],
        type: 'status',
        detail: { text: 'Loving MongoDB' },
        comments_shown: 3,
        comments: [
          { by: { id: "T4Y...AG", name: 'Dwight',
            ts: ISODateTime(...),
            text: 'Right on!' },
          ... only last 3 comments listed ...
            ]
      },
      { id: ObjectId(...),s
        ts: ISODateTime(...),
        by: { id: "T4Y...AE", name: 'Max' },
        circles: [ '*circles*' ],
        type: 'checkin',
        detail: {
          text: 'Great office!',
          geo: [ 40.724348,-73.997308 ],
          name: '10gen Office',
          photo: 'http://....' },
        comments_shown: 1,
        comments: [
          { by: { id: "T4Y...AD", name: 'Jared' },
            ts: ISODateTime(...),
            text: 'Wrong coast!' },
          ... only last 1 comment listed ...
            ]
      },
      { id: ObjectId(...),
        ts: ISODateTime(...),
        by: { id: "T4Y...g9", name: 'Rick' },
        circles: [ '10gen' ],
        type: 'status',
        detail: {
          text: 'So when do you crush Oracle?' },
        comments_shown: 2,
        comments: [
          { by: { id: "T4Y...AE", name: 'Max' },
            ts: ISODateTime(...),
            text: 'Soon... ;-)' },
          ... only last 2 comments listed ...
            ]
      },
      ...
   ]
}

There are a few things to note about this schema:

  • Each post is listed with an abbreviated number of comments (three might be typical). This is to keep the size of the document reasonable. If we need to display more comments on a post, we’d perform a secondary query on the social.post collection for full details.
  • There are actually multiple social.wall documents for each social.user document, one wall document per month. This allows the system to keep a “page” of recent posts in the initial page view, fetching older months if requested.
  • Once again, the by properties store only the minimal author information for display, helping to keep this document small.
  • The number of comments on each post is stored to allow later updates to find posts with more than a certain number of comments since the $size query operator does not allow inequality comparisons.

The other dependent collection we’ll use is social.news, posts from people the user follows. This schema includes much of the same information as the social.wall information, so this document has been abbreviated for clarity:

{
   _id: ObjectId(...),
   user_id: "T4Y...AE",
   month: '201204',
   posts: [ ... ]
}

Operations

Since these schemas optimize for read performance at the possible expense of write performance, a production system should provide a queueing system for processing updates that may take longer than the desired web request latency.

Viewing a News Feed or Wall Posts

The most common operation on a social network is probably the display of a particular user’s news feed, followed by a user’s wall posts. Because the social.news and social.wall collections are optimized for these operations, the query is fairly straightforward. Since these two collections share a schema, viewing the posts for a news feed or a wall are actually quite similar operations, and can be supported by the same code:

def get_posts(collection, user_id, month=None):
    spec = { 'user_id': viewed_user_id }
    if month is not None:
        spec['month'] = {'$lte': month}
    cur = collection.find(spec)
    cur = cur.sort('month', -1)
    for page in cur:
        for post in reversed(page['posts']):
            yield page['month'], post

The function get_posts will retrieve all the posts on a particular user’s wall or news feed in reverse-chronological order. Some special handling is required to efficiently achieve the reverse-chronological ordering:

  • The posts within a month are actually stored in chronological order, so the order of these posts must be reversed before displaying.
  • As a user pages through her wall, it’s preferable to avoid fetching the first few months from the server each time. To achieve this, the preceding code specifies the first month to fetch in the month argument, passing this in as an $lte expression in the query. This can be substantially faster than using a .skip() argument to our .find().
  • Rather than only yielding the post itself, the post’s month is also yielded from the generator. This provides the month argument to be used in any subsequent calls to get_posts.

There is one other issue that needs to be considered in selecting posts for display: filtering posts for display. In order to choose posts for display, we’ll need to use some filter functions on the posts generated by get_posts. The first of these filters is used to determine whether to show a post when the user is viewing his or her own wall:

def visible_on_own_wall(user, post):
    '''if poster is followed by user, post is visible'''
    for circle, users in user['circles'].items():
        if post['by']['id'] in users: return True
    return False

In addition to the user’s wall, our social network provides an “incoming” page that contains all posts directed toward a user regardless of whether that poster is followed by the user. In this case, we need to use the block list to filter posts:

def visible_on_own_incoming(user, post):
    '''if poster is not blocked by user, post is visible'''
    return post['by']['id'] not in user['blocked']

When viewing a news feed or another user’s wall, the permission check is a bit different based on the post’s circles property:

def visible_post(user, post):
    if post['circles'] == ['*public*']:
        # public posts always visible
        return True
    circles_user_is_in = set(
        user['followers'].get(post['by']['id'], []))
    if not circles_user_is_in:
        # user is not circled by poster; post is invisible
        return False
    if post['circles'] == ['*circles*']:
        # post is public to all followed users; post is visible
        return True
    for circle in post['circles']:
        if circle in circles_user_is_in:
            # User is in a circle receiving this post
            return True
    return False

In order to quickly retrieve the pages in the desired order, we’ll need an index on user_id, month in both the social.news and social.wall collections.

>>> for collection in ('db.social.news', 'db.social.wall'):
...     collection.ensure_index([
...         ('user_id', 1),
...         ('month', -1)])

Commenting on a Post

Other than viewing walls and news feeds, creating new posts is the next most common action taken on social networks. To create a comment by user on a given post containing the given text, we’ll need to execute code similar to the following:

from datetime import datetime

def comment(user, post_id, text):
    ts = datetime.utcnow()
    month = ts.strfime('%Y%m')
    comment = {
        'by': { 'id': user['id'], 'name': user['name'] }
        'ts': ts,
        'text': text }
    # Update the social.posts collection
    db.social.post.update(
        { '_id': post_id },
        { '$push': { 'comments': comment } } )
    # Update social.wall and social.news collections
    db.social.wall.update(
        { 'posts.id': post_id },
        { '$push': { 'comments': comment },
          '$inc': { 'comments_shown': 1 } },
        upsert=True,
        multi=True)
    db.social.news.update(
        { 'posts.id': _id },
        { '$push': { 'comments': comment },
          '$inc': { 'comments_shown': 1 } },
        upsert=True,
        multi=True)

The preceding code can actually result in an unbounded number of comments being inserted into the social.wall and social.news collections. To compensate for this, we need to periodically run the following update statement to truncate the number of displayed comments and keep the size of the news and wall documents manageable:

COMMENTS_SHOWN = 3

def truncate_extra_comments():
    db.social.news.update(
        { 'posts.comments_shown': { '$gt': COMMENTS_SHOWN } },
        { '$pop': { 'posts.$.comments': -1 },
          '$inc': { 'posts.$.comments_shown': -1 } },
        multi=True)
    db.social.wall.update(
        { 'posts.comments_shown': { '$gt': COMMENTS_SHOWN } },
        { '$pop': { 'posts.$.comments': -1 },
          '$inc': { 'posts.$.comments_shown': -1 } },
        multi=True)

In order to efficiently execute the updates to the social.news and social.wall collections just shown, we need to be able to quickly locate both of the following document types:

  • Documents containing a given post
  • Documents containing posts displaying too many comments

To quickly execute these updates, then, we need to create the following indexes:

>>> for collection in (db.social.news, db.social.wall):
...     collection.ensure_index('posts.id')
...     collection.ensure_index('posts.comments_shown')

Creating a New Post

Creating a new post fills out the content-creation activities on a social network:

from datetime import datetime

def post(user, dest_user, type, detail, circles):
    ts = datetime.utcnow()
    month = ts.strfime('%Y%m')
    post = {
        'ts': ts,
        'by': { id: user['id'], name: user['name'] },
        'circles': circles,
        'type': type,
        'detail': detail,
        'comments': [] }
    # Update global post collection
    db.social.post.insert(post)
    # Copy to dest user's wall
    if user['id'] not in dest_user['blocked']:
        append_post(db.social.wall, [dest_user['id']], month, post)
    # Copy to followers' news feeds
    if circles == ['*public*']:
        dest_userids = set(user['followers'].keys())
    else:
        dest_userids = set()
        if circles == [ '*circles*' ]:
            circles = user['circles'].keys()
        for circle in circles:
            dest_userids.update(user['circles'][circle])
    append_post(db.social.news, dest_userids, month, post)

The basic sequence of operations in this code is as follows:

  1. The post is first saved into the “system of record,” the social.post collection.
  2. The recipient’s wall is updated with the post.
  3. The news feeds of everyone who is circled in the post is updated with the post.

Updating a particular wall or group of news feeds is then accomplished using the append_post function:

def append_post(collection, dest_userids, month, post):
    collection.update(
        { 'user_id': { '$in': sorted(dest_userids) },
          'month': month },
        { '$push': { 'posts': post } },
        multi=True)

In order to quickly update the social.wall and social.news collections, we once again need an index on both user_id and month. This time, however, the ideal order on the indexes is month, user_id. This is due to the fact that updates to these collections will always be for the current month; having month appear first in the index makes the index right-aligned, requiring significantly less memory to store the active part of the index.

However, in this case, since we already have an index user_id, month, which must be in that order to enable sorting on month, adding a second index is unnecessary, and would end up actually using more RAM to maintain two indexes. So even though this particular operation would benefit from having an index on month, user_id, it’s best to leave out any additional indexes here.

Maintaining the Social Graph

In a social network, maintaining the social graph is an infrequent but essential operation. To add a user other to the current user self’s circles, we’ll need to run the following function:

def circle_user(self, other, circle):
    circles_path = 'circles.%s.%s' % (circle, other['_id'])
    db.social.user.update(
        { '_id': self['_id'] },
        { '$set': { circles_path: { 'name': other['name' ]} } })
    follower_circles = 'followers.%s.circles' % self['_id']
    follower_name = 'followers.%s.name' % self['_id']
    db.social.user.update(
        { '_id': other['_id'] },
        { '$push': { follower_circles: circle },
          '$set': { follower_name: self['name'] } })

Note that in this solution, previous posts of the other user are not added to the self user’s news feed or wall. To actually include these past posts would be an expensive and complex operation, and goes beyond the scope of this use case.

Of course, we must also support removing users from circles:

def uncircle_user(self, other, circle):
    circles_path = 'circles.%s.%s' % (circle, other['_id'])
    db.social.user.update(
        { '_id': self['_id'] },
        { '$unset': { circles_path: 1 } })
    follower_circles = 'followers.%s.circles' % self['_id']
    db.social.user.update(
        { '_id': other['_id'] },
        { '$pull': { follower_circles: circle } })
    # Special case -- 'other' is completely uncircled
    db.social.user.update(
        { '_id': other['_id'], follower_circles: {'$size': 0 } },
        { '$unset': { 'followers.' + self['_id' } }})

In both the circling and uncircling cases, the _id is included in the update queries, so no additional indexes are required.

Sharding

In order to scale beyond the capacity of a single replica set, we need to shard each of the collections mentioned previously. Since the social.user, social.wall, and social.news collections contain documents that are specific to a given user, the user’s _id field is an appropriate shard key:

>>> db.command('shardcollection', 'dbname.social.user', {
...     'key': {'_id': 1 } } )
{ "collectionsharded": "dbname.social.user", "ok": 1 }
>>> db.command('shardcollection', 'dbname.social.wall', {
...     'key': {'user_id': 1 } } )
{ "collectionsharded": "dbname.social.wall", "ok": 1 }
>>> db.command('shardcollection', 'dbname.social.news', {
...     'key': {'user_id': 1 } } )
{ "collectionsharded": "dbname.social.news", "ok": 1 }

It turns out that using the posting user’s _id is actually not the best choice for a shard key for social.post. This is due to the fact that queries and updates to this table are done using the _id field, and sharding on by.id, while tempting, would require these updates to be broadcast to all shards. To shard the social.post collection on _id, then, we need to execute the following command:

>>> db.command('shardcollection', 'dbame.social.post', {
... 'key': {'_id':1 } } )
{ "collectionsharded": "dbname.social.post", "ok": 1 }
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.138.195