Chapter 19. Media Servers and Conferencing

In previous chapters (5 and 17), we saw how SIP application servers can be used to build SIP services. The service functionality was achieved through intelligent manipulation of the SIP dialogs, which in turn changed the conversation space. A call transfer application is an example of such a service.

Nevertheless, not all services delivered through SIP involve just the manipulation of SIP dialogs. In fact, many applications also imply some processing at the media level. Take, for instance, a voice-mail service, which requires playing an announcement and recording a message. Or think about services that require user input in the form of DTMF.[1] Conferencing services, too, imply specific media processing so as to enable the mixing of different streams. All these services have in common the fact that, in addition to the manipulation of the SIP dialogs (i.e., manipulation of the signaling), they also involve some specific processing at the media level. Media-processing functions include, but are not limited to:

  • playing media, such as speech or video.

  • recording media, such as speech or video.

  • prompting for and collecting media input from the user.

  • mixing media streams.

  • converting text to speech.

  • recognizing speech.

  • transcoding media.

This chapter is devoted to describing different types of media services used in the context of Internet communications, and the architectures defined to support these media services. We will begin the chapter by looking at the basic media services and how to implement them by using SIP.

After the basic services, we will also tackle more-advanced media services such as advanced multimedia conferencing. This is quite a complex topic in itself, whose full analysis would deserve a dedicated book. Therefore, we will just touch on the key principles. References are given so that interested readers can dive more deeply into the subject.

Advanced conferencing often requires the use of fully-featured media server control protocols. In the last section of this chapter, we will look at this currently hot topic in the industry and in the standardization committees.

Basic Media Services

By basic media services, we refer to three different types of functions:

  • Playing announcements or video messages.

  • Prompting and collecting information (user interaction).

  • Mixing media (basic conferencing).

These are basic media services that can be combined to provide interesting applications. Many of today’s existing SIP applications use some or all of them.

Announcements are media played to the user. Think, for instance, of the case where Alice calls John, and she gets a user-busy announcement because John is engaged in another call. Another example could be an absence-reason service: whenever Alice calls John outside working hours, an announcement is played to her indicating that he is not available just now, and giving information about the time of day when he can take calls again.

User interaction basically consists of prompting the user for some information—for example, in an announcement—and then collecting the user’s response. For instance, a call to a company’s number might result in an announcement being played that gives us several options depending on the department we want to speak with. The user selects the desired option, typically by pressing some keys on his or her IP phone, and the application connects the user to the right destination. Also, as part of a basic user interaction service, it is common to have capabilities for recording the media input from the user. For instance, consider a basic voice-mail application that asks the calling user to leave a message. In order to implement this service, the voice mail would need the capability to record the media produced by the user. Media recording is also considered a media service.

Multiparty communication (i.e., conferencing) is one of the most complex topics in the general area of communication services. SIP can support many models of multiparty communications. Broadly speaking, we can classify conferences in SIP into three main groups:

  • loosely coupled conferences.

  • fully distributed multiparty conferences.

  • tightly coupled conferences.

Loosely coupled conferences make use of multicast media groups. In this type of conference, there is no signaling relationship between the participants, and there is no central point of control for the conference. Each participant subscribes to a particular multicast address where they receive the RTP streams from the rest of the participants. Participants also address their RTP streams to the multicast address. SIP may be used just to inform users of the multicast conference address, but other mechanisms, such as email, web pages, or the Session Announcement Protocol (SAP), are also available for that purpose. SAP is specified in [RFC 2974].

In fully distributed multiparty conferences, each participant maintains a signaling relationship with the other participants using SIP. There is no central point of control; it is completely distributed among the participants.

Tightly coupled conferences are characterized by the existence of a central point of control. Each participant establishes a signaling relationship to this central point, which provides a number of functions and may also perform media mixing.

In Figure 19.1, the architecture for the three different conferencing models is shown.

Figure 19.1. 

In this book, we will tackle only tightly coupled conferences, which is the most commonly used method. Therefore, from now on, we will use the term “conference” just to refer to “tightly coupled conferences.” In this first section, we consider basic conferencing (that is, basic tightly coupled conferencing) as one of the three key basic media services.

Basic conferencing amounts, more or less, to a simple media-mixing function. It provides basic functions to create a conference, for a new user to join the conference, and for users to leave the conference. It does not address features such as floor control, gain control, muting, subconferences, and so on. These features are part of an enhanced conferencing service that will be examined in later sections.

Architecture for Basic Media Services

In addition to the manipulation of SIP dialogs according to some service logic (call control), many SIP applications also require media handling. From the functional perspective, we could consider that these applications are made up of two entities:

  • The service logic, which triggers and drives the manipulation of the session dialogs. The service logic resides in the control plane.

  • The media-handling functions (playing announcements, detecting DTMF, mixing streams, transcoding, and so on). The media-handling functions reside in the media plane.

In practice, this split is not just functional but also physical, given that media manipulation is a quite specific task that may require special types of hardware and software resources. In such a physical split, there are two types of servers (application platforms):

  • The application server, which hosts the service logic to manipulate the dialogs.

  • The media server, which is capable of media processing.

The presence of an application server may or may not be needed depending on the complexity of the service. For very simple applications, such as just playing an announcement at call setup, an architecture such as the one depicted in Figure 19.2 might be enough, whereas, for richer applications, there needs to be a separate application server, as in Figure 19.3. We will consider the latter architecture as the reference in order to implement basic (and also advanced) media services.

Figure 19.2. 

Figure 19.3. 

The way this architecture works is quite simple. In order to apply a service, the call (control plane) needs to be routed to the application server (AS). The application server then executes the service logic and decides to invoke a media service on the media server. In order to invoke the media service (e.g., an announcement), the application server manipulates the Request-URI so that, instead of identifying a user, it identifies a service in the media server, and routes the call toward the media server. The media server receives the call, looks at the Request-URI, and determines which media service needs to be invoked.

The concept of addressing services as if they were users was introduced in [RFC 3087]. The utilization of this concept in order to invoke basic media services is described in [RFC 4240].

[RFC 4240] defines a format for the Request-URI so as to use it as a service indicator at the media server. More specifically, we take advantage of the fact that the standard SIP URI has a user part, but media servers do not have users. Therefore, we can use the user part in the SIP URI as a service indicator. In addition to the user part, it may also be necessary to add some other service-related information in the form of URI parameters.
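To make the addressing scheme concrete, the following Python sketch composes such a service Request-URI. The helper function, host names, and parameter names are illustrative, not taken from [RFC 4240]:

```python
def service_uri(service, host, **params):
    """Build a SIP Request-URI whose user part names a media service.

    Additional service-related information travels as URI parameters.
    """
    uri = f"sip:{service}@{host}"
    for name, value in params.items():
        uri += f";{name}={value}"
    return uri

# An announcement invocation, as in the example later in this section:
print(service_uri("annc", "mediaserver.ocean.com",
                  play="file://fileserver.ocean.com/welcome.wav"))
# sip:annc@mediaserver.ocean.com;play=file://fileserver.ocean.com/welcome.wav
```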

In the next section, we will see how this is accomplished for the three basic media services described previously.

Implementation

[RFC 4240] defines a way to offer SIP-based basic media services using the simple architecture depicted in the previous section. It defines specific formats in the Request-URI that enable the invocation of the three basic services in media servers:

  • announcements.

  • user interaction.

  • basic conferences.

Announcements

In order to invoke an announcement, the user part in the Request-URI is set to “annc.” In addition, there must also be a “play” URI parameter that specifies the resource or announcement sequence to be played. There are also a number of other optional parameters that can specify aspects such as the number of repetitions, the maximum duration, the language, and so forth.

The following URI identifies an announcement service (annc) at the media server (mediaserver.ocean.com), and gives the location, on a file server (fileserver.ocean.com), of the media file to be played (welcome.wav).

sip:annc@mediaserver.ocean.com;play=file://fileserver.ocean.com/welcome.wav
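On the receiving side, the media server must extract the service indicator and the parameters from a URI of this form. The following Python fragment is a simplified illustration of that step (not a complete SIP URI parser, which would also need to handle escaping and other URI components):

```python
def parse_service_uri(uri):
    """Split a media-service Request-URI into (service, host, parameters)."""
    addr, _, param_str = uri.removeprefix("sip:").partition(";")
    user, _, host = addr.partition("@")
    params = {}
    for item in param_str.split(";") if param_str else []:
        name, _, value = item.partition("=")
        params[name] = value
    return user, host, params

service, host, params = parse_service_uri(
    "sip:annc@mediaserver.ocean.com;play=file://fileserver.ocean.com/welcome.wav")
assert service == "annc"
assert host == "mediaserver.ocean.com"
assert params["play"] == "file://fileserver.ocean.com/welcome.wav"
```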

User Interaction

The user interaction service is identified by the service indicator “dialog” contained in the user part of the Request-URI. In addition, the mandatory “voicexml” URI parameter must be present. It indicates the location of the VoiceXML script that needs to be executed. In practical deployments, the VoiceXML script may reside in the application server.

VoiceXML is an XML language used to describe voice (and now also video) interactions. The present book does not explain VoiceXML in detail. Interested readers are referred to the VoiceXML specification at [W3C_VOICEXML] for more information on the subject.

Nevertheless, in order to give the reader an idea of what VoiceXML looks like, we next show a simple example VoiceXML script taken from the VoiceXML specification. It asks the user for a choice of drink, and then submits that choice to a server script:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
        http://www.w3.org/TR/voicexml20/vxml.xsd"
      version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>

An example of a Request-URI that causes the invocation of a user interaction service (dialog) at a media server (mediaserver.ocean.com) could be:

sip:dialog@mediaserver.ocean.com;voicexml=file://fileserver.ocean.com/dialog.vxml

Basic Conferences

Basic conferencing provides mainly a simple media mixing service. A mixing service receives a number of RTP streams, combines them, and sends back the combination. Figure 19.4 shows the mixing of three incoming streams into the media server. For simplicity, only the media plane is shown.

Figure 19.4. 

The user part in the Request-URI is again used to identify the particular media service: “conf.” However, in this case, in addition to indicating that we want to use a conference, it is also necessary to identify the particular conference instance (i.e., the mixing instance) because many conference instances may exist at the media server. That is achieved by also including a unique identifier for the conference, separated from the service indicator by an “=” sign.

An example of a Request-URI that invokes a conferencing service might be:

sip:conf=uniqueid1234@mediaserver.ocean.com

When the first INVITE request arrives at the media server, if a conference device associated with this URI does not yet exist, a mixing session is created, which includes the seizing of a conference device. Subsequent INVITE requests for the same conference (i.e., with the same unique identifier) cause the media server to join them into the existing conference. If a user wants to abandon the conference, he or she just needs to send a BYE within the session established with the media server.

When the last participant leaves the conference, the mixing session is destroyed in the media server.
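The conference lifecycle just described can be modeled in a few lines. The Python sketch below is a toy model of the media server’s bookkeeping, not an actual media-server implementation:

```python
class MediaServer:
    def __init__(self):
        self.conferences = {}          # conference id -> set of participants

    def invite(self, conf_id, participant):
        if conf_id not in self.conferences:
            self.conferences[conf_id] = set()    # seize a conference device
        self.conferences[conf_id].add(participant)

    def bye(self, conf_id, participant):
        conf = self.conferences.get(conf_id)
        if conf is None:
            return
        conf.discard(participant)
        if not conf:                             # last participant left:
            del self.conferences[conf_id]        # destroy the mixing session

ms = MediaServer()
ms.invite("conf=uniqueid1234", "john")    # first INVITE creates the conference
ms.invite("conf=uniqueid1234", "alice")   # later INVITEs join the same mix
ms.bye("conf=uniqueid1234", "john")
ms.bye("conf=uniqueid1234", "alice")      # conference destroyed
assert "conf=uniqueid1234" not in ms.conferences
```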

Examples

Figure 19.5 shows the call flow for a simple absence reason service. John calls Alice, but Alice is in a meeting, so she has configured her absence reason[2] service to play the following announcement: “I am in a meeting until 10h00.”

Figure 19.5. 

The call is routed from John to the application server.[3] The application server knows that Alice is in a meeting (because Alice configured such information), so it invokes an announcement service in the media server. The file that contains the announcement to be played is called “meeting.wav.”[4]

Figure 19.6 shows an example of a basic SIP conference. Participants make calls to a generic URI identifying a public conference, and the application server actually selects a particular identifier for the conference and modifies the Request-URI accordingly. John’s INVITE request creates the conference, because his request is the first one. Subsequent INVITE requests to the same generic URI cause the respective participants to be joined to the conference.

Figure 19.6. 

About KPML and the User Interaction Framework

In the previous section, we saw that the basic media services architecture can be used to implement basic user interaction based on VoiceXML. There is currently work in progress in the IETF [draft-ietf-sipping-app-interaction-framework] to define a more generic framework for enabling the interaction of users with applications. This interaction is modeled through the concept of a “user interface.” The framework defines two types of user interfaces: presentation-free UI and presentation-capable UI. The former is a UI that cannot prompt the user with information, whereas the latter can.

VoiceXML, for instance, may be used to enable a presentation-capable UI because it allows for both “collect information” and “prompt and collect information.” On the other hand, there are cases where only a presentation-free UI is available. An example of such an interface could be a gateway or media server that can just collect DTMF. In order to enable the collection of DTMF through presentation-free UIs, [RFC 4730] defines the Key Press Markup Language (KPML) event package. KPML is an XML-based language that allows us to describe DTMF events. In order to receive notifications of DTMF events, an application server might subscribe to the KPML event package at a media server. When the media server detects a DTMF tone in the media stream, it sends a NOTIFY request to the AS, including in the body a KPML document that describes the DTMF event produced.
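As an illustration, the fragment below builds the kind of KPML document a media server might place in a NOTIFY body after detecting DTMF. The element and attribute names paraphrase [RFC 4730] and should be checked against the RFC’s normative schema:

```python
import xml.etree.ElementTree as ET

def kpml_response(digits, code="200", text="OK"):
    # Attribute names paraphrase RFC 4730; consult the RFC before relying on them.
    resp = ET.Element("kpml-response", {
        "xmlns": "urn:ietf:params:xml:ns:kpml-response",
        "version": "1.0",
        "code": code,
        "text": text,
        "digits": digits,      # the DTMF digits the media server detected
    })
    return ET.tostring(resp, encoding="unicode")

body = kpml_response("1")      # e.g., the caller pressed "1"
assert 'digits="1"' in body
```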

KPML is a relatively recent standard. It might find acceptance in the realm of TDM/IP media gateways. When it comes to media servers, the current industry trend seems to go in the direction of implementing fully-featured media server control protocols such as the ones we will describe in the last section of this chapter.

The application interaction framework does not tackle only media-based interaction. Actually, it is generic enough to accommodate any type of interaction, be it through media, web forms, or any other means. A thorough description of the framework is outside the scope of this book.

Enhanced Conferencing

The basic conference service depicted in the previous section offers very limited functionality. [RFC 4245] gives a high-level view of more-advanced requirements for tightly coupled conferences. These requirements will be extended by other forthcoming IETF specifications that will focus on particular areas. The requirements in [RFC 4245] include:

  • conference creation.

  • conference termination.

  • dial-in: participants dial into the conference.

  • dial-out: the “conference” calls the participants.

  • third-party invitation: a user can invite other users to the conference.

  • participant’s removal: the “conference” can remove a participant.

  • conference state dissemination (inform participants about conference info: number of participants, who the chair is, and so on).

  • sidebar conferences: conferences within the conference—that is, a subgroup of participants can talk to each other without being heard by the other conference participants.

[RFC 4245] states that some of these requirements may be fulfilled by using SIP signaling, whereas others might need other means.

In order to meet these advanced requirements, the IETF, within the SIPPING Working Group, has defined a SIP conferencing framework, which is described in [RFC 4353]. This informational RFC proposes a SIP-centric framework and architecture to address the general requirements stated in [RFC 4245]. Most of the conferencing functions are, in this framework, implemented using the SIP protocol. Although it is stated that other functions will need non-SIP mechanisms, these mechanisms are not specified in [RFC 4353].

The SIP conferencing framework in [RFC 4353] represents an important advance compared with the basic conferencing functionality of [RFC 4240], described in Section 19.1 “Basic Media Services.” However, additional requirements, such as those described in [RFC 4376] (requirements for floor control) and [RFC 4597] (conferencing scenarios), together with the need for more-powerful conference management mechanisms, are driving the work, in the IETF XCON Working Group, on a new conferencing framework. The XCON conferencing framework, although still work in progress, is defining a more abstract model that could comply with the broadest set of conferencing requirements and is not necessarily SIP-centric. A lot of functions in the XCON framework can be achieved by using SIP (or other signaling protocols such as H.323, Jabber, ISUP, etc.), but there is also room in this model for other protocols in order to implement conference control or floor control.

In the next sections, we will review both conferencing frameworks.

Framework for Conferencing with SIP

RFC 4353 presents a general SIP-centric architectural model and terminology in order to address tightly coupled conferencing services.

The key element in this architecture is called the focus. The focus is a functional element that represents a SIP UA responsible for maintaining a SIP signaling relationship with each participant in the conference, and making sure, through the use of some mixers under its control, that the media is properly distributed among the participants.

For instance, in dial-in scenarios, the participants direct the session establishment signaling toward the focus, whereas, in dial-out scenarios, it is the focus establishing a SIP dialog with the participants.

There can only be one focus in a conference, but the focus can use more than one mixer. Figure 19.7 shows a functional architecture with just one mixer.

Figure 19.7. 

The focus can also, additionally, incorporate the functions of a conference notification service, accepting subscriptions from the participants and notifying them as soon as the conference state changes. The conference state represents general information about the conference—such as who is actually in the conference, who the chairperson is, what type of media each participant is using, and so on. [RFC 4575] defines a conference event package for this purpose. In order to subscribe to the conference state, a participant would send a SIP SUBSCRIBE message (see Chapter 15) toward the notification server. The notification server would then inform the participant about the conference state (e.g., who is currently in the conference) by sending back a SIP NOTIFY that reflects the current status. As soon as the status changes (e.g., new participants join or leave the conference), the participant would receive new NOTIFY messages.
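To illustrate, the following Python fragment parses a much-simplified conference-info document such as the one a NOTIFY might carry. The element names follow [RFC 4575], but a real document conveys far more state (media, endpoints, roles, and so on):

```python
import xml.etree.ElementTree as ET

NOTIFY_BODY = """\
<conference-info xmlns="urn:ietf:params:xml:ns:conference-info"
                 entity="sip:conf1234@conference.ocean.com">
  <users>
    <user entity="sip:john@ocean.com"/>
    <user entity="sip:alice@ocean.com"/>
  </users>
</conference-info>"""

NS = {"ci": "urn:ietf:params:xml:ns:conference-info"}
root = ET.fromstring(NOTIFY_BODY)
# Extract who is currently in the conference:
participants = [u.get("entity") for u in root.findall("ci:users/ci:user", NS)]
print(participants)   # ['sip:john@ocean.com', 'sip:alice@ocean.com']
```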

In addition to the focus, the model also includes a conference policy. The conference policy contains the rules that guide the operation of the conference. For instance, a simple rule might list the participants allowed into the conference, which the focus should check in order to authorize any attempt to join. There may also be more-complex rules. The conference policy is stored in the conference policy database, and is accessed (read/write) through the conference policy server using non-SIP-specific means.

Figure 19.8 shows all the elements in the architecture and the interfaces between them.

Figure 19.8. 

All the previous architecture diagrams show functional architectures. When grouping the functional entities into physical elements, there are various options. Figure 19.9 shows one such option in which the focus and conference policy elements are implemented in an application server and the mixer is a function of a media server. The interface between application server and media server for advanced conferencing applications will be discussed in subsequent sections.

Figure 19.9. 

In Figure 19.10, another possible mapping is shown where all the functions sit on a conferencing server.

Figure 19.10. 

Next are a couple of examples that highlight some of the advanced conferencing use cases that can be implemented by using this framework. An extensive list of use cases supported by this architecture is contained in “SIP Conferencing for User Agents” [RFC 4579]. This RFC shows how the SIP protocol can be used to implement most of the conferencing features in the framework.

For the purpose of the following examples, and in order to focus the reader’s attention on the signaling between the participants and the focus, the mixer and the focus appear as a single entity in the figures. Also, both examples assume that no participant is subscribed to the conference event package. The NOTIFY messages that appear in the call flow belong to the implicit subscription created by the REFER method, and give information about the status of the referred request (see Chapter 17 for an explanation of the REFER procedure).

Example 1: Dial-out to a New Participant

In this example, John, who is already participating in the conference, requests that the focus add Alice to the conference. All the steps in the scenario are implemented using SIP. The request from John is conveyed in a REFER method that, if accepted by the focus, will cause it to invite (dial-out) Alice to the conference. The focus will instruct the mixer to bring the new media from Alice into the conference and distribute it to the rest of participants. The REFER method would be addressed to a SIP URI that identifies the conference (Conf-ID), and would contain a Refer-To header that includes Alice’s SIP URI. This example is shown in Figure 19.11.

Figure 19.11. 
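The essential contents of John’s REFER can be sketched as follows. The Python helper and the URIs are invented for illustration, and other mandatory SIP headers (Via, From, To, CSeq, and so on) are omitted for brevity; [RFC 4579] contains normative examples:

```python
def build_refer(conference_uri, refer_to, referred_by):
    """Compose the request line and the key headers of a REFER request."""
    return (
        f"REFER {conference_uri} SIP/2.0\r\n"
        f"Refer-To: <{refer_to}>\r\n"
        f"Referred-By: <{referred_by}>\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )

msg = build_refer("sip:conf1234@conference.ocean.com",
                  "sip:alice@ocean.com",
                  "sip:john@ocean.com")
print(msg.splitlines()[0])   # REFER sip:conf1234@conference.ocean.com SIP/2.0
```

For the removal scenario in the next example, the Refer-To URI would additionally carry the method to invoke, e.g., a Refer-To of sip:alice@ocean.com;method=BYE.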

Example 2: Focus Removes a Participant

In this example, we assume that both John and Alice are connected to the conference. Then John asks the focus to remove Alice from the conference. He sends to the focus a REFER message that, if it is accepted, will cause the focus to send a BYE on the session it has with Alice. Then the focus will instruct the mixer to rearrange the way media is distributed among participants in order to reflect the new situation. The REFER method would be addressed to a SIP URI that identifies the conference (Conf-ID), and would contain a Refer-To header that includes Alice’s SIP URI and an explicit indication of the method to be invoked, which is BYE in this case, so as to cause the termination of the session with Alice. This example is depicted in Figure 19.12.

Figure 19.12. 

XCON Framework

Additional Requirements

As we stated before, the XCON framework for centralized conferencing [draft-ietf-xcon-framework] is being defined in order to cope with additional conferencing requirements such as enhanced conference management or floor control. The core requirements of advanced conferences are defined by [RFC 4245], and additional conferencing requirements are provided in [RFC 4376] and [RFC 4597].

Enhanced Conference Management

In previous sections, we saw how a conference could be created using the architecture for basic media services or the SIP conferencing framework. Basically, when the first user establishes a call toward the conference server, the conference is created. This simple model allows for the creation of ad hoc and unmanaged conferences. However, there are cases where we need more control over the conference. For instance, we may want to create a scheduled conference or a recurring conference. We may also want to specify, at conference creation time, the maximum number of participants for the conference, or general information about the conference (subject, and so on) that might be queried by the participants, and so forth.

We also saw in previous sections how new participants might be added to the conference. The offered functionality allows just simple addition or deletion of participants. There are cases where there is a need to have more flexibility to manipulate participants. For instance, we might want to be able to define different roles for the conference participants (e.g., administrator, chairperson, moderator, participant, observer, and so on). We could then, for instance, add a new participant to the conference, specifying his or her role, which implies a certain level of privileges, and so on.

Another interesting application of enhanced conference management is the advanced manipulation of the media associated with the participants. For instance, the chairman of a conference might want to mute some participants whose background noise is very high. Or he or she might want to alter the gain associated with a media stream from one participant. In another example, the moderator of a video conference might want to change the video layout (i.e., the way the video media is combined by the mixer) from single view to dual view, and so on.

Another type of functionality that requires enhanced conference management is the creation and manipulation of sidebars. Sidebars are conferences within a conference. Imagine that John and Alice are participating in a large conference with other people. At one point in time, John and Alice want to exchange views on what is being said, but they do not want the other participants to listen to what they say to each other. John and Alice might create a sidebar with just themselves as participants. While the sidebar is active, they can talk to each other at the same time that they continue receiving the media from the main conference.

Sidebars can be used in many other scenarios. Think, for instance, of a call center application. Frequently, in this type of application, there is a requirement for having a supervisor listen to the conversation between an agent and a customer in order to do an evaluation of the agent. In more-advanced scenarios, the supervisor can also talk to the agent and give him or her instructions while both of them receive the audio from the customer. The customer would not hear what the supervisor says. These “observing and coaching” scenarios can be easily implemented by creating a particular sidebar between supervisor and agent.

All the previous scenarios are just some examples of functionalities that are enabled by enhanced conferencing management. As we will see in the next section, the XCON framework outlines a separate, non-SIP protocol between conference clients and conference systems in order to enable enhanced conference control.

Floor Control

In order to understand what floor control means, let us think of an “analyst briefing” conferencing scenario. The conference call has a panel of speakers who are allowed to talk in the main conference. In addition to the panel speakers, there are also a number of analysts who are not allowed to speak unless they have the floor (these are called floor participants). If they want to speak, they need to make a floor request to the floor chair (that is, the entity that manages the floor). The floor chair will grant or deny the request, and inform the floor participants about their status/position in the floor’s queue.

Floor control represents an advanced conferencing requirement that was not supported either in the architecture for basic media services or in the SIP conferencing framework.

Requirements for floor control are covered in [RFC 4376].

As we will see in the next section, the XCON framework outlines a separate, non-SIP protocol between conference clients and conference systems in order to enable floor control.

Media Services for Enhanced Conferencing

Enhanced conferencing also requires other media-related features. It is common, for instance, that whenever new users join the conference, they are asked to say their name, which is then recorded and played to the rest of the participants. Also, when a participant leaves a conference, an announcement is usually played to all the conference participants indicating who left.

Another example could be the “whisper” functionality. This refers to a message targeted to a specific user or users—for example, when only the conference chair receives a warning that there are only five minutes left in the conference.

Therefore, the media services required for enhanced conferencing are:

  • recording and playing participant names to the full conference.

  • playing an announcement to a single user or to a conference mix.

  • collecting DTMF from specific participants.

Architecture

Let us now look at the XCON architecture, which supports the previous requirements.

The XCON framework defines some functional elements and outlines the interfaces between them. The framework is not SIP-centric, but SIP can be used as the call-signaling protocol within the framework and also for conference event notification.

The XCON framework is built around the fundamental concept of a conference object. The conference object represents the conference, and encapsulates the conference data throughout the different phases of a conference (creation, reservation, active, completed, and so on). The conference object is accessed through a number of servers:

  • The conference control server.

  • The floor control server.

  • The focus.

  • The notification server.

The conference participants include a specific client for each of these servers. Communication between clients and servers takes place via a number of protocols, as depicted in Figure 19.13.

Though not reflected in the architecture, the model also supports the existence of conference policies that define the set of rights, permissions, and limitations pertaining to operations being performed on a conference object.
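As a rough illustration of the conference object idea, the following Python sketch models one conference record moving through its phases. All field names, phase names, and identifiers are our own invention, not defined by XCON:

```python
from dataclasses import dataclass, field

@dataclass
class ConferenceObject:
    """One record encapsulating the conference data across its phases."""
    conference_id: str
    phase: str = "created"            # e.g., created, reserved, active, completed
    participants: dict = field(default_factory=dict)   # participant URI -> role
    sidebars: list = field(default_factory=list)       # nested sub-conferences

    def add_participant(self, uri, role="participant"):
        self.participants[uri] = role

# The servers listed above would all read and modify the same object:
conf = ConferenceObject("conf1234")
conf.add_participant("sip:john@ocean.com", role="chair")
conf.phase = "active"
assert conf.participants["sip:john@ocean.com"] == "chair"
```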

Next we describe the main elements in the architecture.

Figure 19.13. 

Conference Control

The conference control protocol provides for data manipulation and state retrieval from the conference object. It allows us to create/delete/modify conferences, add/delete users, add/delete/modify media, put participants on mute, alter the gain of media streams, assign roles to participants, create sidebars, and so forth. The XCON framework does not specify a concrete conference control protocol. At the time of writing, there is not yet an IETF standard conference control protocol. An attempt to specify such a protocol was made in [draft-levin-xcon-cccp], which has now expired.

Much of the flexibility in the XCON architecture comes from the existence of a specific protocol for conference control. In the architecture for basic media services and in the SIP conferencing framework, there was no such protocol, and the conference management functions were performed by SIP. For instance, a conference was created when the first user joined; also, a participant was able to request that another user be joined to the conference by sending a REFER request to the focus. These capabilities are very limited in nature. Thanks to the use of a separate conference control protocol, much richer features can be offered.

Floor Control

Floor control refers to the capability to manage the access to shared resources—for instance, the determination of who in the audience has the right to talk, and so on. This may be accomplished through a separate, non-SIP protocol, the Binary Floor Control Protocol (BFCP), specified in [RFC 4582]. Again, Figure 19.13 shows specific client and server entities dedicated to floor control.

In the architecture for basic media services and in the SIP conferencing framework, there was no such protocol, and there was no way to offer floor control features.

The basic behavior of floor control is as follows:

  • A participant who wants to get access to the floor sends a floor request to the floor control server.

  • The floor control server checks with the floor chair—that is, with the entity (which might be the moderator) that is responsible for granting access to the floor.

  • The floor chair communicates its decision to the floor control server, which, in turn, communicates the decision to the floor participant.

  • At this point, the floor control server might send a notification to the rest of the floor participants to inform them about their position/status in the floor’s queue.

This procedure is depicted in Figure 19.14.

Figure 19.14. 
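The four-step behavior above can be sketched in a few lines of code. This is only an illustrative model: the function names, decision values, and notification scheme are invented for the example; the actual protocol is the binary BFCP of [RFC 4582].

```python
# Illustrative model of the floor request flow; all names and decision
# values here are invented (real deployments use BFCP [RFC 4582]).

events = []  # notifications emitted by the floor control server

def notify(participant, message):
    events.append((participant, message))

def request_floor(participant, chair_decides, queue):
    """Floor control server: consult the chair, then relay the decision."""
    decision = chair_decides(participant)  # step 2: check with the chair
    notify(participant, decision)          # step 3: relay the decision
    if decision == "queued":
        queue.append(participant)
        # Step 4: tell every waiting participant its queue position.
        for position, p in enumerate(queue, start=1):
            notify(p, "position %d in floor queue" % position)
    return decision

# A trivial chair: grant the floor while it is free, queue otherwise.
floor = {"taken": False}
def chair_decides(participant):
    if not floor["taken"]:
        floor["taken"] = True
        return "granted"
    return "queued"

queue = []
request_floor("alice", chair_decides, queue)  # alice gets the floor
request_floor("bob", chair_decides, queue)    # bob waits in the queue
```

In a real deployment the chair decision and the queue notifications would of course travel as protocol messages between the floor control server and the floor participants; here they are just function calls and a list.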

Focus

The focus in this framework has the same meaning as in the SIP conferencing framework, except that here it does not include the notification server. Another difference is that the call signaling protocol between the focus and the participant does not necessarily need to be SIP (it might be H.323, ISUP, and so on).

Conference Notification

The notification server is a separate entity in this framework, with functions similar to those in the SIP conferencing framework. The SIP event framework with the conference event package might be used for this function.

Mixer

The mixer is the entity that combines the different media inputs, and provides the media-handling capabilities needed to, for instance, mute participants, adjust voice gain, apply different video layouts, create sidebars, and so on.

Next we will see some examples of how the conference control protocol might be used. These examples do not imply the use of any particular conference control protocol; the intent is to let the reader understand the conference control functionality and its relation to the other entities in the XCON architecture.

Example 1: Adding a New Participant to the Conference

In this example, we will see a possible way to add a new participant to the conference by using the conference control protocol. The scenario is shown in Figure 19.15.

Figure 19.15. 

Let us assume that John is in a conference and wants to join Alice to it. He would send to the conference control server a conference control request to join Alice to the conference. In the conference control request, John includes an XML file that contains the requested configuration for the new participant. For instance, he could indicate the role with which Alice should be joined to the conference. The conference control server would authorize the request, and instruct the focus to generate the necessary call signaling to join Alice. Additionally, if there are other participants (e.g., Peter) in the conference who subscribed to the conference state, they would receive a notification from the notification server (which was informed by the focus) indicating that Alice was joined to the conference.
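Because no standard XCON conference control protocol existed at the time, the content of John's request can only be sketched. The XML vocabulary below (element and attribute names, the conference identifier, the URIs) is entirely hypothetical; it simply illustrates the kind of information such a request would carry.

```python
# Hypothetical 'add user' conference control request; element and
# attribute names are invented for illustration, not taken from any
# standard protocol.
import xml.etree.ElementTree as ET

def build_add_user_request(conference_id, user_uri, role):
    """Build the XML body John might send to the conference control server."""
    req = ET.Element("conf-control-request", {"conference-id": conference_id})
    user = ET.SubElement(req, "add-user", {"uri": user_uri})
    ET.SubElement(user, "role").text = role  # requested role for the new user
    return ET.tostring(req, encoding="unicode")

body = build_add_user_request("conf42@confs.ocean.com",
                              "sip:alice@ocean.com", "speaker")
```

On receiving such a body, the conference control server would authorize the request, update the conference object, and instruct the focus to generate the call signaling toward Alice.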

Example 2: Media Manipulation

In this example, we will see how a participant in the conference—John, who is the chairman—might cause another participant (Alice) to be muted because she is contributing a lot to the background noise, and she seems not to be listening to the conference. Additionally, if there are other participants (e.g., Peter) in the conference who subscribed to the conference state, they would receive a notification from the notification server indicating that Alice was put on mute. This example is shown in Figure 19.16.

Figure 19.16. 

Media Server Control

Motivation

As we saw in previous sections, enhanced conferencing applications introduce a quite broad set of new requirements that cannot be met by the simple architectural approach used for basic services. In the previous two sections, we saw two frameworks to cope with these stringent requirements. These frameworks define a number of functional entities and the interfaces between them. There are different ways to group these entities into physical elements. In some cases, all the elements (focus, notification server, mixer, and so on) are implemented in the same box, whereas, in other approaches, all the entities except for the mixer sit at an application server, and the mixer is part of a separate media server. A possible physical instantiation of the XCON architecture following this second approach is shown in Figure 19.17. In this picture, we see an interface between the application server and the media server.

Figure 19.17. 

In the architecture for basic services, we already saw an interface between the application server and the media server. That interface was SIP, and the Request-URI was used to signal the type of the requested service. Although this approach is valid for simple conferences, it is not valid for enhanced conferencing. Let us imagine that at one particular moment during the conference, we want to play an announcement to only some of the participants. Following the approach for basic services, the focus would have to use third-party call control and issue, on the one hand, re-INVITEs to all the participants that need to listen to the announcement, and, on the other, re-INVITEs to the media server indicating the requested resource (the announcement) in the Request-URI. This would be a cumbersome and inefficient way to implement the function. The only thing actually needed here is for the application server to tell the media server what is to be done (i.e., play announcement A to participants 1, 2, and 3 in conference Z), without modifying the established sessions.

There are many examples of features in enhanced conferencing scenarios that do not change the SIP dialogs or the sessions, but that do affect the media flow or the media processing in the server. For instance, imagine that we want to apply gain control to some of the participants in the conference, or that we want to create a sidebar conference. In order to implement these functions, we would just need a mechanism for the application server to instruct the media server to perform some action on the media without modifying the existing sessions.

So, the conclusion is that we also need to have the means to communicate other types of information in that interface so as to enable enhanced features.

These features would include:

  • in-conference user interaction.

  • creation of submixes.

  • modification of the mix.

  • recording the mix on a leg.

  • playing an announcement on a leg.

  • altering the gain for a particular leg.

  • muting a participant.

Now we will see what alternatives exist in order to convey the information between application server and media server needed to implement these functions.

Approaches

Protocols used between application server and media server for the sake of enabling these enhanced features are typically called media server control protocols. There is not, as of today, a unique standard for media server control. Several protocols compete in different deployments worldwide, and different companies push in slightly different directions. A Working Group called MEDIACTRL has recently been created in the IETF, tasked precisely with defining the requirements for media server control protocols and the protocol extensions needed to fulfill those requirements.

Knowing that this is a very hot and dynamic topic nowadays in the Internet, we will first describe some of the existing approaches for a media server control protocol. Then, in the next section, we will look at some of the possible trends for the future that are being discussed today in the MEDIACTRL Working Group.

First of all, there are two very different approaches to media server control. The first one, usually referred to as the “device control” approach, models the media server as an entity providing low-level functions such as mixing media streams, playing media, transcoding, detecting tones, and so forth, along with the capability of connecting media streams to those functions. This approach requires the application server to use a quite low-level protocol capable of indicating the actions to be performed on these resources. For instance, playing an announcement to a particular participant implies that the application server commands several actions on the media server: disconnecting the participant’s stream from the mixer, allocating a media player, connecting the media player to the participant’s stream, instructing the media player to start playing the media, waiting until the playing of the media has completed, releasing the media player resources, and then reconnecting the stream to the mixer.
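The sequence of low-level commands just described can be sketched as follows. The command verbs and resource names are illustrative only; they are not actual MEGACO/H.248 syntax.

```python
# Device-control command sequence for playing an announcement to one
# participant. Verbs and names are invented, not MEGACO/H.248 syntax.

def play_announcement_device_control(send, participant, mixer, announcement):
    """Issue the low-level commands needed under a device-control model."""
    send("disconnect", src=participant, dst=mixer)      # take leg off the mix
    player = send("allocate", resource="media-player")  # reserve a player
    send("connect", src=player, dst=participant)        # wire player to leg
    send("play", resource=player, media=announcement)   # start playback
    send("wait", resource=player, event="play-done")    # wait for completion
    send("release", resource=player)                    # free the player
    send("connect", src=participant, dst=mixer)         # back into the mix

commands = []
def send(verb, **args):
    # Stand-in for the control protocol: record each command issued.
    commands.append((verb, args))
    return "player-1" if verb == "allocate" else None

play_announcement_device_control(
    send, "leg-alice", "mixer-1",
    "http://announcements.ocean.com/welcome.wav")
```

Seven protocol interactions for a single announcement: this is the programming burden that the “server control” approach, described next, removes.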

The second approach is the “server control” approach. In this approach, the media server is modeled as an entity that provides high-level services such as playing announcements, interacting with the user, conferencing services, and so forth. In this case, the underlying media server resources are addressed using high-level application constructs.

The first model is exemplified by protocols such as MEGACO, and has gained considerable interest in the telecom domain, where it has been successfully used for controlling resources in TDM/IP gateways. The use of MEGACO to directly control media servers, on the other hand, has not yet found wide deployment. Telecom bodies such as 3GPP have proposed using it to control media servers within IMS as well, but that use has not yet become widespread either.

The second model has found wide acceptance in the Internet environment, where there was no legacy requirement for application servers to implement device control protocols, and where the preferred approach for fostering the rapid development of applications is to use application-level protocols.

The two models have pros and cons. The first model, being so low level, is flexible enough to meet any present or future requirement. On the other hand, it requires application servers to speak a new, complex protocol such as MEGACO. Moreover, it requires application developers to have an in-depth understanding of the low-level constructs and a different programming paradigm.

There is rough consensus in the Internet community to pursue the “server model” approach, and to reuse, to some extent, some of the functions existing in SIP (which anyhow needs to be present in the application servers) in order to handle the communication between application server and media server. There is also agreement in the Internet community to use an XML language in order to describe the control data exchanged between application server and media server. Virtually all the media server implementations in the Internet follow this approach.

More specifically, two different XML-based media server control protocols that use SIP as a transport cover virtually the entire market. Neither of them is an Internet standard.[5] These are:

  • MSCML: Media Server Control Markup Language [RFC 4722].

  • MSML: Media Server Markup Language [draft-saleem-msml].

They differ in the XML language itself used to define the control of the media flows in the server. Both of them use SIP messages (INVITE and INFO) to carry the XML content.

Another point of difference between the two approaches is the way SIP is used. There are two ways to do this. The first is to carry the XML content in SIP messages within the same dialogs that established the media sessions with the media server—that is to say, reuse the existing signaling relationship. The other way is to establish a separate SIP dialog, which sets up no media, to carry the control messages.

The approach used by MSCML is to use the dedicated separate SIP dialog to carry commands that affect all the participants in the conference while using the session control connections for sending commands specific to a participant. For instance, closing the conference or playing an announcement to all the participants will be realized by sending commands on the dedicated SIP dialog.

On the other hand, MSML allows the control messages to be sent on an individual connection even if they affect other participants. Therefore, the targets of MSML actions are not specified implicitly by the SIP dialog within which they are sent, but by specific identifiers carried in the XML data. MSML also supports a dedicated control connection, with no media, to carry the XML content in the body of SIP messages.
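The difference in addressing can be illustrated with a toy command. The XML below is simplified, invented syntax (not actual MSML); the point is only that the target leg is named explicitly in the data, so the command can travel on any SIP dialog.

```python
# Simplified, invented command syntax (not actual MSML): the target of
# the action is carried explicitly in the XML data, not implied by the
# SIP dialog the command arrives on.
import xml.etree.ElementTree as ET

def target_of(xml_body):
    """Return the explicit target identifier of a mute command."""
    return ET.fromstring(xml_body).find("mute").get("target")

# This could be sent on John's dialog, yet it addresses Alice's leg:
command = '<control><mute target="conn:alice-leg"/></control>'
print(target_of(command))  # the target comes from the data, not the dialog
```

Under the MSCML model, by contrast, muting Alice requires sending the command on the SIP dialog that corresponds to Alice, as the example below shows.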

Figure 19.18 shows an example with MSCML.

Figure 19.18. 

In this example, there is an already-established conference with three participants. Therefore, there exists one SIP signaling relationship between application server and media server for each participant. In addition, there is another SIP dialog between the application server and the media server for the dedicated control channel. Commands sent on that channel apply to all three participants, whereas commands sent on an individual dialog apply only to the corresponding participant.

In this example, the application server first plays an announcement to all the participants; therefore, the command is sent in an INFO message pertaining to the dedicated dialog (dialog 4). Later on, the application server mutes Alice, sending the mute command on just the SIP dialog that corresponds to Alice (dialog 2).

The XML content of the first INFO message (playing an announcement) might look like this:

<?xml version="1.0" encoding="utf-8"?>
<MediaServerControl version="1.0">
  <request>
    <play>
      <prompt>
        <audio url="http://announcements.ocean.com/welcome.wav"/>
      </prompt>
    </play>
  </request>
</MediaServerControl>

The XML content of the second INFO message (mute command) might look like this:

<?xml version="1.0" encoding="utf-8"?>
<MediaServerControl version="1.0">
  <request>
    <configure_leg mixmode="mute"/>
  </request>
</MediaServerControl>

Future Trends

Out of the heated discussion that exists today around the choice of the right protocol for media server control, there seems to be agreement on some points:

  • The use of SIP provides interesting capabilities for locating media servers, security, and so on.

  • Carrying media server control data in INFO messages is not the best approach. Its main drawback is interoperability: there are no specific semantics associated with INFO; the semantics are typically defined by the message body, and the SIP INFO method does not define any means by which the extensions contained in the message body can be used in an interoperable way.

  • Using an XML language to carry the media server control data is a flexible and convenient approach.

With these considerations in mind, an approach has been proposed to use a dedicated control channel to carry the control messages. Moreover, the control information is now not carried on the SIP messages themselves, but on a transport connection that is established using SIP—very much following the approach defined in [RFC 4145], which we saw in Chapter 10 for TCP-based media transport. Figure 19.19 depicts this approach.

Figure 19.19. 

This approach is called “A Control Framework for the Session Initiation Protocol,” and is still work in progress in the IETF [draft-boulton-sip-control-framework]. As its name implies, it defines just a framework for media server control: it specifies how the dedicated control channel is established using SIP, and the messages that are exchanged on the transport connection so established. More specifically, the framework defines several types of messages:

  • SYNCH

  • REPORT

  • CONTROL

  • K-ALIVE
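To give a feel for the dedicated control channel, the sketch below frames a CONTROL message carrying a control package command. The exact wire format is defined by the (still in-progress) framework draft; the layout, header names, and token values here are only loosely modeled on it and should be treated as illustrative.

```python
# Illustrative framing of a control-channel message: a request line,
# headers, a blank line, then the XML payload. Loosely modeled on the
# control framework draft; not an exact reproduction of its syntax.

def frame_message(msg_type, transaction_id, headers, body=""):
    """Frame a control-channel message as request line, headers, body."""
    lines = ["CFW %s %s" % (transaction_id, msg_type)]
    for name, value in headers.items():
        lines.append("%s: %s" % (name, value))
    lines.append("Content-Length: %d" % len(body))
    lines.append("")  # blank line separates headers from the body
    lines.append(body)
    return "\r\n".join(lines)

msg = frame_message(
    "CONTROL", "tx1234",
    {"Control-Package": "conference-control"},  # hypothetical package name
    body="<mute/>")
```

Note that none of this travels inside SIP messages: SIP is used only to set up the transport connection, and the framed messages then flow directly on that connection.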

The way these messages are used in order to deliver a specific functionality is left for additional extensions, called control packages, that build on top of the framework. Currently, there is ongoing work to define control packages for:

  • basic interactive voice response [draft-boulton-ivr-control-package],

  • advanced interactive voice response [draft-boulton-ivr-vxml-control-package],

  • conference control [draft-boulton-conference-control-package].

Other Media Services

Some enhanced media applications also require functions such as Text-to-Speech or Automatic Speech Recognition. Imagine that John calls a customer-service number. As soon as the call is established, he hears an announcement asking him to say what department he wants to talk to. John says, “Sales,” and the system applies some signal processing in order to translate the voice signal into text so that it can be used by the computer program that determines what to do next. Likewise, the computer program might have to dynamically build other announcements that are to be presented to the user as part of the execution of the interaction flow. Therefore, rather than having a recorded copy of all the possible combinations, it would be better if the server could automatically convert to speech a text string that has been constructed programmatically.

Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) are complex media functions that may be offered in media servers. These functions are highly specific, and therefore may be implemented physically separate from the media server itself. A possible protocol for allowing a client to control the ASR and TTS functions, called MRCP (Media Resource Control Protocol), was proposed in an informational Request For Comments [RFC 4463]. There is currently ongoing work to define MRCPv2 in [draft-ietf-speechsc-mrcpv2].

The MRCPv2 client is typically an application server or a media server. In Figure 19.20, we can see how the media server can act as client to a TTS/ASR server.

Figure 19.20. 

In a possible scenario, the media server would record a message from the User Agent and then pass the recorded message (e.g., in PCM-encoded format) to the ASR server, which would then translate it into text and send the text back to the media server.
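The round trip just described can be sketched abstractly. The client interface below is invented (it does not reproduce MRCP message syntax); a real exchange would use MRCP requests such as RECOGNIZE over an established control session.

```python
# Invented client interface for the ASR round trip; this is the logical
# flow only, not MRCP message syntax.

def recognize(asr_server, pcm_audio):
    """Media server acting as ASR client: send audio, get text back."""
    response = asr_server({"method": "RECOGNIZE", "audio": pcm_audio})
    return response["text"]

def toy_asr_server(request):
    # Stand-in for a real ASR engine: recognizes one known utterance.
    known = {b"pcm:sales-utterance": "sales"}
    return {"text": known.get(request["audio"], "")}

department = recognize(toy_asr_server, b"pcm:sales-utterance")
```

The text returned ("sales" in John's case) is then fed to the application logic that decides the next step in the interaction flow.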

Summary

We learned in this chapter how media services can be implemented in SIP architectures. We saw that this is an area that is advancing at a rapid pace, so we may see new architectures and ideas being proposed in the near future.

In the next chapter, we change the topic a bit and tackle aspects related to SIP identity. We will cover two aspects that are very important in communication scenarios: the capability to convey an authenticated identity of the caller, and the capability to hide the identity of the caller. Neither aspect is adequately covered in the core SIP specifications, so new SIP extensions have been defined to cope with them.



[1] DTMF stands for Dual-Tone Multifrequency.

[2] For instance, this information might be obtained from Microsoft Outlook.

[3] The actual mechanism by which the call is routed to the application server is not shown here. It might be through static configuration in the SIP proxy so that calls originated to a particular user are routed to the application server (by adding the application server URI to the Route header) or other mechanism.

[4] If a flexible announcement is played, then VoiceXML may be used. A flexible announcement may, for example, mention the time when you’re in a meeting.

[5] [RFC 4722] is just an informational RFC (non–Standards Track).
