Inside Microsoft: Managing the Windows Live Messenger Service Infrastructure

Windows Live Messenger is a Win32-based communication product offered by Microsoft within the Windows Live suite of clients and services. Windows Live Messenger is perhaps one of the most well known of all instant messaging services on the Internet, and it remains a dominant player against AOL, Google, and others. For many years, Windows Live Messenger was provided under the MSN brand of services until being updated and relaunched under the Windows Live brand in 2007. At present, Windows Live Messenger is the most used free instant messaging service in the world and is delivered to more than 320 million active users worldwide. The key components of the application vary by version, but the latest client includes online and offline instant messaging to address book contacts, peer-to-peer file sharing, video chat, and SMS messaging features.

Windows Live Messenger is more than just a simple instant messaging client application. The end-to-end client and supporting service is a complex combination of a peer-to-peer and client/server architecture that supports billions of instant messages per day, which represents the daily communications of millions of people worldwide. The client is supported by one of the most scalable, heavily used Internet-based communication platforms in the world. This service infrastructure delivers all the necessary plumbing to enable the various communication scenarios supported by the Messenger client application, including integration with several mobile phone platforms via SMS and instant messaging, as well as peer-to-peer file sharing and interoperability with other services like Yahoo! Messenger.

Engineering Principles

The Windows Live Messenger service infrastructure team has developed a great deal of experience over the years in the delivery of a high-quality and reliable communication platform to a global audience. The team members recognize the complexity of the engineering challenges they face and have spent years perfecting their practices. They adhere to a set of guiding principles that govern how they address scalability and manageability issues in the delivery of their service to market. These principles are outlined below.

Design to scale out

As you might imagine, the Windows Live Messenger service infrastructure components participate in various types of communication scenarios. These scenarios, such as instant messaging or SMS, are represented as separate functions within the Windows Live Messenger application and generally operate independently. This separation of functionality allows the service to be designed to scale by "service role," which means that specific areas of functionality are designed to scale linearly on their own rather than as one application. The team believes that, as applications achieve massive scale, there are no benefits to running the entire service on a single machine and scaling out the number of machines performing all tasks. Separating functionality into "roles" allows those individual roles to scale more granularly if necessary. The key to success, the team indicates, is to balance simplicity with scale. It is important to separate basic functionality of the application for scalability purposes but not to trend toward an approach that is too granular.
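A minimal sketch of what sizing "by service role" can look like, assuming each role has a measured load and a per-machine capacity. The role names, the numbers, and the function name machines_per_role are hypothetical illustrations, not details of the Messenger service.

```python
import math
from typing import Dict


def machines_per_role(load: Dict[str, float],
                      capacity: Dict[str, float]) -> Dict[str, int]:
    """Size each role from its own load, independently of the other roles.

    load:     current demand per role (hypothetical unit: requests/second)
    capacity: demand one machine of that role can serve (same unit)
    """
    return {role: math.ceil(load[role] / capacity[role]) for role in load}


# Hypothetical figures: the busy role grows without forcing the quiet one to.
plan = machines_per_role({"im": 1000, "sms": 50}, {"im": 100, "sms": 100})
```

Because each role is sized from its own numbers, a surge in one scenario adds machines only to that role.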

Design every aspect of the application to fail gracefully

A critically important philosophy for the Windows Live Messenger team is that each aspect of the application be designed to fail gracefully. Application developers assume that key dependencies could be unavailable or that infrastructure may go offline during critical processing. The team believes in keeping the application interaction model as stateless as possible so that, when failures do occur, clean failover can happen and users may continue uninterrupted. To accomplish this, the team has applied strategies for caching data for certain features to insulate against failures in dependency services. In other cases, the team has moved to more asynchronous calls during application startup, where the team can dynamically choose which features to make available when certain service roles are experiencing problems. These practices allow the end-to-end service to appear completely healthy and available, instead of abruptly showing exceptions to the user when a small feature area is experiencing issues.
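The two strategies described above, caching data to insulate against a failing dependency and choosing which features to offer based on role health, can be sketched roughly as follows. This is an illustration under assumptions: CachedFallback, available_features, and the role and feature names are hypothetical and do not come from the Messenger code base.

```python
from typing import Callable, Dict, List, Optional


class CachedFallback:
    """Wrap a dependency call and serve the last good value when it fails."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch              # hypothetical call to a dependency
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._fetch(key)
        except Exception:
            # Dependency is unavailable: return the last value seen, if any,
            # rather than surfacing the exception to the user.
            return self._cache.get(key)
        self._cache[key] = value
        return value


def available_features(role_health: Dict[str, bool],
                       feature_roles: Dict[str, str]) -> List[str]:
    """Return the features whose backing service role is currently healthy."""
    return sorted(feature for feature, role in feature_roles.items()
                  if role_health.get(role, False))
```

A client built this way would hide a feature whose role is unhealthy and keep showing cached data elsewhere, so a small failure does not surface as an error.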

Automate key manageability tasks

Managing a service the size of Messenger makes even simple manageability tasks challenging. Actions such as deploying a service upgrade or monitoring the health of the live site can be extremely difficult and expensive from an operational perspective when there are thousands of servers to manage. To address these challenges, the Windows Live Messenger team incorporated an application[1] that handles a number of live site management tasks. This application monitors and manages each server in the Windows Live Messenger service. It automates the health monitoring, deployment, and maintenance of the servers so that only a few operations engineers are required to manage the service.

The infrastructure management application works by continuously monitoring each machine’s health until it detects a failure. When the failure is detected, the software runs through a series of diagnostic steps, beginning with a simple reboot, and gradually tries to bring the server back online to receive traffic. In the most severe of circumstances, it will even re-image the machine with a new operating system and application install. If it is unsuccessful at trying to get the Windows Live Messenger application up and running again, it ultimately assumes that a hardware failure has occurred and removes the machine from service. Because the management application is in complete control at all times, application developers are forced to write their code to be responsive to application or complete server failures, thus ensuring a fully automated operational approach to handling service issues.
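The escalation described above (try the mildest remedy first, re-check health, and give up only after the most severe step fails) can be sketched as a short loop. This is a hedged illustration, not the actual management application: the function repair, the step names, and the callables passed in are all hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], None]]


def repair(server: str,
           is_healthy: Callable[[str], bool],
           steps: List[Step]) -> str:
    """Apply repair steps in order of severity until the server is healthy.

    Returns "healthy" if nothing was needed, the name of the step that fixed
    the server, or "removed_from_service" if every step failed (treated here
    as a presumed hardware failure).
    """
    if is_healthy(server):
        return "healthy"
    for name, action in steps:
        action(server)                 # e.g. reboot, then re-image
        if is_healthy(server):
            return name
    return "removed_from_service"
```

Passing the steps in as an ordered list keeps the escalation policy separate from the loop, so a new step can be added without changing the control flow.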

In addition to health monitoring, this management software also handles the deployment of new versions of the Windows Live Messenger service in a fully automated fashion. Because the software has the necessary control and information about the server topology, it also bears the responsibility of deploying and configuring the application across the server infrastructure. Given the amount of hardware servicing the users of the Windows Live Messenger service, an automated deployment that eliminates both human intervention and human error greatly reduces the cost of managing the service and improves the quality and predictability of the deployment.

Continuously evaluate and plan infrastructure capacity

Understanding the usage and resource consumption patterns of your application is extremely beneficial to being able to adequately forecast and plan for hardware needs. The Windows Live Messenger team believes in having a strong and repeatable process for evaluating the capacity and resource consumption of all aspects of its service, so it can plan in advance to acquire and deploy new hardware to meet the user load. The team accomplishes this by first establishing the key resource measurement for each component of the service. This is generally accomplished by mapping specific server performance counters to application components and profiling each component’s resource consumption. Each component is then evaluated on a monthly basis, or after a major service upgrade, to understand growth trends. The end-to-end application infrastructure is then examined holistically on a quarterly basis, and the team makes the determination whether to scale out the service with additional machines. This continuous evaluation and planning cycle allows the team to stay ahead of the growth curve, thus ensuring that the service never reaches a critical threshold that negatively affects users.
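As a rough illustration of the monthly trend evaluation, the sketch below projects a single resource counter forward in a straight line and estimates how many months remain before it crosses a planning threshold. The linear model, the function names, and the sample figures are assumptions made for the example; the source does not say how the team models growth.

```python
import math
from typing import List, Optional


def monthly_growth(samples: List[float]) -> float:
    """Average month-over-month change in a utilization counter."""
    if len(samples) < 2:
        raise ValueError("need at least two monthly samples")
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def months_until_threshold(samples: List[float],
                           threshold: float) -> Optional[int]:
    """Whole months until the linear trend reaches the threshold.

    Returns 0 if the latest sample is already at or above the threshold,
    and None if the counter is flat or shrinking.
    """
    current = samples[-1]
    if current >= threshold:
        return 0
    growth = monthly_growth(samples)
    if growth <= 0:
        return None
    return math.ceil((threshold - current) / growth)


# Hypothetical CPU utilization (%) for one component over four months.
lead_time = months_until_threshold([40, 45, 50, 55], threshold=80)
```

With these made-up samples the counter grows by 5 points a month, so the estimate leaves 5 months in which to acquire and deploy hardware.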

Plan to recover from outages

Despite the best efforts of the Windows Live Messenger service infrastructure team, service outages can occur from time to time. For all the effort that goes into protecting the health, availability, and reliability of the live site, nearly all services will eventually go down. Ironically, most teams spend enormous effort trying to prevent an outage but little effort planning how to recover from one. The Windows Live Messenger team has found that recovering from an outage effectively requires both technology- and process-based solutions. By combining the two, the team maximizes its ability to return all of its users to full service. Experience has shown the team that the flood of users returning to the service after a critical outage often causes the service to become unresponsive and go down again. The team therefore recommends having a process and a technical solution that allow for an incremental restoration of usage or traffic to the service after any significant downtime. This approach helps avoid additional downtime because components of the service are ramped back up to full capacity gradually.
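One way to picture an incremental restoration is an admission gate that lets a growing fraction of users reconnect as time passes after a restart. The sketch below is an assumption-laden illustration: the linear ramp, the 100-bucket hashing scheme, and the names admitted_fraction and should_admit are hypothetical and are not described in the source.

```python
import zlib


def admitted_fraction(minutes_since_restart: float,
                      ramp_minutes: float) -> float:
    """Fraction of users to admit, rising linearly from 0.0 to 1.0."""
    if ramp_minutes <= 0:
        return 1.0
    return max(0.0, min(1.0, minutes_since_restart / ramp_minutes))


def should_admit(user_id: str, fraction: float) -> bool:
    """Admit a stable subset of users for a given fraction.

    Each user hashes to a fixed bucket in 0..99, so a user admitted at a
    lower fraction stays admitted as the fraction grows.
    """
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < fraction * 100
```

Hashing the user ID rather than drawing a random number keeps the decision consistent across retries, so a rejected client does not get in simply by reconnecting repeatedly.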

As illustrated above, the Windows Live Messenger service infrastructure team has developed a great deal of experience in the delivery of a high-quality and reliable Internet-based communication platform. The principles highlighted above represent just a few of the best practices acquired by the team over the many years that this service has been available. Even though some of these practices are most effective when the infrastructure is very large, several apply universally to services of all sizes. It is important for application developers to learn from the experiences of mega-scale Internet services like Windows Live Messenger and apply the knowledge gained to other applications of varying sizes.
