Chapter 1. An Introduction to RTP

  • A Brief History of Audio/Video Networking

  • A Snapshot of RTP

  • Related Standards

  • Overview of an RTP Implementation

The Internet is changing: Static content is giving way to streaming video, text is being replaced by music and the spoken word, and interactive audio and video are becoming commonplace. These changes require new applications, and they pose new and unique challenges for application designers.

This book describes how to build these new applications: voice-over-IP, telephony, teleconferencing, streaming video, and webcasting. It looks at the challenges inherent in reliable delivery of audio and video across an IP network, and it explains how to ensure high quality in the face of network problems, as well as how to ensure that the system is secure. The emphasis is on open standards, in particular those devised by the Internet Engineering Task Force (IETF) and the International Telecommunication Union (ITU), rather than on proprietary solutions.

This chapter begins our examination of the Real-time Transport Protocol (RTP) with a brief look at the history of audio/video networking and an overview of RTP and its relation to other standards.

A Brief History of Audio/Video Networking

The idea of using packet networks—such as the Internet—to transport voice and video is not new. Experiments with voice over packet networks stretch back to the early 1970s. The first RFC on this subject—the Network Voice Protocol (NVP)1—dates from 1977. Video came later, but even so there is now more than ten years of experience with audio/video conferencing and streaming on the Internet.

Early Packet Voice and Video Experiments

The initial developers of NVP were researchers transmitting packet voice over the ARPANET, the predecessor to the Internet. The ARPANET provided a reliable-stream service (analogous to TCP), but this introduced too much delay, so an “uncontrolled packet” service was developed, akin to the modern UDP/IP datagrams used with RTP. NVP was layered directly over this uncontrolled packet service. Later the experiments were extended beyond the ARPANET to interoperate with the Packet Radio Network and the Atlantic Satellite Network (SATNET), running NVP over those networks.

All of these early experiments were limited to one or two voice channels at a time by the low bandwidth of the early networks. In the 1980s, the creation of the 3-Mbps Wideband Satellite Network enabled not only a larger number of voice channels but also the development of packet video. To access the one-hop, reserved-bandwidth, multicast service of the satellite network, a connection-oriented inter-network protocol called the Stream Protocol (ST) was developed. Both a second version of NVP, called NVP-II, and a companion Packet Video Protocol were transported over ST to provide a prototype packet-switched video teleconferencing service.

In 1989–1990, the satellite network was replaced with the Terrestrial Wideband Network and a research network called DARTnet, while ST evolved into ST-II. The packet video conferencing system was put into scheduled production to support geographically distributed meetings of network researchers and others at up to five sites simultaneously.

ST and ST-II were operated in parallel with IP at the inter-network layer but achieved only limited deployment on government and research networks. As an alternative, initial deployment of conferencing using IP began on DARTnet, enabling multiparty conferences with NVP-II transported over multicast UDP/IP. At the March 1992 meeting of the IETF, audio was transmitted across the Internet to 20 sites on three continents over multicast “tunnels”—the Mbone (which stands for “multicast backbone”)—extended from DARTnet. At that same meeting, development of RTP was begun.

Audio and Video on the Internet

Following from these early experiments, interest in video conferencing within the Internet community took hold in the early 1990s. At about this time, the processing power and multimedia capabilities of workstations and PCs became sufficient to enable the simultaneous capture, compression, and playback of audio and video streams. In parallel, development of IP multicast allowed the transmission of real-time data to any number of recipients connected to the Internet.

Video conferencing and multimedia streaming were obvious applications for multicast. Research groups developed tools such as vic and vat from the Lawrence Berkeley Laboratory,87 nevot from the University of Massachusetts, the INRIA video conferencing system, nv from Xerox PARC, and rat from University College London.77 These tools followed a new approach to conferencing, based on connectionless protocols, the end-to-end argument, and application-level framing.65,70,76 Conferences were minimally managed, with no admission or floor control, and the transport layer was thin and adaptive. Multicast was used both for wide-area data transmission and as an interprocess communication mechanism between applications on the same machine (for example, to exchange synchronization information between audio and video tools). The resulting collaborative environment consisted of loosely coupled applications and highly distributed participants.

The multicast conferencing (Mbone) tools had a significant impact: They led to widespread understanding of the problems inherent in delivering real-time media over IP networks, including the need for scalable solutions and for error and congestion control. They also directly influenced the development of several key protocols and standards.

RTP was developed by the IETF in the period 1992–1996, building on NVP-II and the protocol used in the original vat tool. The multicast conferencing tools used RTP as their sole data transfer and control protocol; accordingly, RTP not only includes facilities for media delivery, but also supports membership management, lip synchronization, and reception quality reporting.

In addition to RTP for transporting real-time media, other protocols had to be developed to coordinate and control the media streams. The Session Announcement Protocol (SAP)35 was developed to advertise the existence of multicast data streams. Announcements of sessions were themselves multicast, and any multicast-capable host could receive SAP announcements and learn what meetings and transmissions were happening. Within announcements, the Session Description Protocol (SDP)15 described the transport addresses, compression, and packetization schemes to be used by senders and receivers in multicast sessions. The limited deployment of multicast, together with the rise of the World Wide Web, has largely made the concept of a distributed multicast session directory obsolete, but SDP is still widely used today.
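To make this concrete, a minimal session description in SDP syntax might look as follows. This is a sketch loosely modeled on the examples in the SDP specification; the originator, session name, and addresses are hypothetical:

    v=0
    o=alice 2890844526 2890842807 IN IP4 192.0.2.10
    s=Example multicast seminar
    c=IN IP4 224.2.17.12/127
    t=2873397496 2873404696
    m=audio 49170 RTP/AVP 0
    m=video 51372 RTP/AVP 31

The c= line names the multicast group (with a TTL of 127), and the two m= lines advertise an audio stream on port 49170 using static RTP payload type 0 (G.711 µ-law audio) and a video stream on port 51372 using payload type 31 (H.261 video).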

Finally, the Mbone conferencing community led development of the Session Initiation Protocol (SIP).28 SIP was intended as a lightweight means of finding participants and initiating a multicast session with a specific set of participants. In its early incarnation, SIP included little in the way of call control and negotiation support because such features were not needed in the Mbone conferencing environment. It has since become a more comprehensive protocol, including extensive negotiation and control features.

ITU Standards

In parallel with the early packet voice work was the development of the Integrated Services Digital Network (ISDN)—the digital version of the plain old telephone system—and an associated set of video conferencing standards. These standards, based around ITU recommendation H.320, used circuit-switched links and so are not directly relevant to our discussion of packet audio and video. However, they did pioneer many of the compression algorithms used today (for example, H.261 video).

The growth of the Internet and the widespread deployment of local area networking equipment in the commercial world led the ITU to extend the H.320 series of protocols. Specifically, it sought to make the protocols suitable for “local area networks which provide a non-guaranteed quality of service,” a description that IP networks fit well. The result was the H.323 series of recommendations.

H.323 was first published in 199762 and has undergone several revisions since. It provides a framework consisting of media transport, call signaling, and conference control. The signaling and control functions are defined in ITU recommendations H.225.0 and H.245. Initially the signaling protocols focused principally on interoperating with ISDN conferencing using H.320, and as a result suffered from a cumbersome session setup process that was simplified in later versions of the standard. For media transport, the ITU working group adopted RTP. However, H.323 uses only the media transport functionality of RTP and makes little use of the control and reporting elements.

H.323 met with reasonable success in the marketplace, with several hardware and software products built to support the suite of H.323 technologies. Development experience led to complaints about its complexity, in particular the setup procedure of H.323 version 1 and the use of binary message formats for signaling. Some of these issues were addressed in later versions of H.323, but in the intervening period interest in alternatives grew.

One of those alternatives, which we have already touched on, was SIP. The initial SIP specification, published by the IETF in 1999,28 was the outcome of an academic research project that attracted little commercial interest at the time. SIP has since come to be seen as a replacement for H.323 in many quarters, and it is being applied to an increasingly varied set of applications, such as voice-over-IP and text messaging systems. In addition, it is under consideration for use in third-generation cellular telephony systems,115 and it has gathered considerable industry backing.

The ITU has more recently produced recommendation H.332, which combines a tightly coupled H.323 conference with a lightweight multicast conference. The result is useful for scenarios such as an online seminar, in which the H.323 part of the conference allows close interaction among a panel of speakers while a passive audience watches via multicast.

Audio/Video Streaming

In parallel with the development of multicast conferencing and H.323, the World Wide Web revolution took place, bringing glossy content and public acceptance to the Internet. Advances in network bandwidth and end-system capacity made possible the inclusion of streaming audio and video along with Web pages, with systems such as RealAudio and QuickTime leading the way. The growing market in such systems fostered a desire to devise a standard control mechanism for streaming content. The result was the Real-Time Streaming Protocol (RTSP),14 providing initiation and VCR-like control of streaming presentations; RTSP was standardized in 1998. RTSP builds on existing standards: It closely resembles HTTP in operation, and it can use SDP for session description and RTP for media transport.

A Snapshot of RTP

The key standard for audio/video transport in IP networks is the Real-time Transport Protocol (RTP), along with its associated profiles and payload formats. RTP aims to provide services useful for the transport of real-time media, such as audio and video, over IP networks. These services include timing recovery, loss detection and correction, payload and source identification, reception quality feedback, media synchronization, and membership management. RTP was originally designed for use in multicast conferences, using the lightweight sessions model. Since that time, it has proven useful for a range of other applications: in H.323 video conferencing, webcasting, and TV distribution; and in both wired and cellular telephony. The protocol has been demonstrated to scale from point-to-point use to multicast sessions with thousands of users, and from low-bandwidth cellular telephony applications to the delivery of uncompressed High-Definition Television (HDTV) signals at gigabit rates.
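Chapter 3 examines the protocol in detail, but it is worth noting how these services map onto the fields of the fixed RTP packet header. The following C sketch shows that mapping; the bit-field layout is for readability only, because real implementations must pack and parse the header explicitly to be portable across compilers.

    #include <stdint.h>

    /* Sketch of the fixed RTP header. Field widths are in bits; the
       in-memory layout of bit-fields is compiler-dependent, so portable
       code serializes these fields by hand. */
    typedef struct {
        unsigned int version:2;      /* protocol version, currently 2          */
        unsigned int padding:1;      /* padding octets follow the payload?     */
        unsigned int extension:1;    /* header extension present?              */
        unsigned int cc:4;           /* number of CSRC identifiers that follow */
        unsigned int marker:1;       /* marks events such as frame boundaries  */
        unsigned int payload_type:7; /* identifies the media encoding          */
        uint16_t     sequence;       /* loss detection and reordering          */
        uint32_t     timestamp;      /* media clock, for timing recovery       */
        uint32_t     ssrc;           /* synchronization source identifier      */
    } rtp_header_t;

The sequence number supports loss detection, the timestamp supports timing recovery and synchronization, and the payload type and SSRC fields provide payload and source identification.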

RTP was developed by the Audio/Video Transport working group of the IETF and has since been adopted by the ITU as part of its H.323 series of recommendations, and by various other standards organizations. The first version of RTP was completed in January 1996.6 RTP is not complete on its own: It must be combined with a profile that adapts it to a particular class of applications. An initial profile was defined along with the RTP specification,7 and several more profiles are under development. Profiles are accompanied by payload format specifications, each describing the transport of a particular media format. Development of RTP is ongoing, and a revision is nearing completion at the time of this writing.49,50

A detailed introduction to RTP is provided in Chapter 3, The Real-time Transport Protocol, and most of this book discusses the design of systems that use RTP and its various extensions.

Related Standards

In addition to RTP, a complete system typically requires the use of various other protocols and standards for session announcement, initiation, and control; media compression; and network transport.

Figure 1.1 shows how the negotiation and call control protocols, the media transport layer (provided by RTP), the compression-decompression algorithms (codecs), and the underlying network are related, according to both the IETF and ITU conferencing frameworks. The two parallel sets of call control and media negotiation standards use the same media transport framework. Likewise, the media codecs are common no matter how the session is negotiated and irrespective of the underlying network transport.

Figure 1.1. IETF and ITU Protocols for Audio/Video Transport on the Internet

The relation between these standards and RTP is outlined further in Chapter 3, The Real-time Transport Protocol. However, the main focus of this book is media transport, rather than signaling and control.

Overview of an RTP Implementation

As Figure 1.1 shows, the core of any system for delivery of real-time audio/video over IP is RTP: It provides the common media transport layer, independent of the signaling protocol and application. Before we look in more detail at RTP and the design of systems using RTP, it will be useful to have an overview of the responsibilities of RTP senders and receivers in a system.

Behavior of an RTP Sender

A sender is responsible for capturing and transforming audiovisual data for transmission, as well as for generating RTP packets. It may also participate in error correction and congestion control by adapting the transmitted media stream in response to receiver feedback. A diagram of the sending process is shown in Figure 1.2.

Figure 1.2. Block Diagram of an RTP Sender

Uncompressed media data—audio or video—is captured into a buffer, from which compressed frames are produced. Frames may be encoded in several ways depending on the compression algorithm used, and encoded frames may depend on both earlier and later data.

Compressed frames are loaded into RTP packets, ready for sending. If frames are large, they may be fragmented into several RTP packets; if they are small, several frames may be bundled into a single RTP packet. Depending on the error correction scheme in use, a channel coder may be used to generate error correction packets or to reorder packets before transmission.
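As a simple illustration of the fragmentation case, the following C sketch splits one encoded frame across as many packets as needed. The names send_rtp_packet and MAX_PAYLOAD are hypothetical stand-ins, and real payload formats define their own fragmentation rules and marker bit semantics.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_PAYLOAD 1400  /* hypothetical budget, below a typical Ethernet MTU */

    /* Stand-in for the real transmission routine. */
    static void send_rtp_packet(const uint8_t *payload, size_t len, int marker)
    {
        (void)payload; /* unused in this stub */
        printf("sent %zu payload bytes, marker=%d\n", len, marker);
    }

    /* Fragment one encoded frame into MAX_PAYLOAD-sized RTP packets.
       Setting the marker bit on the final fragment of a frame is a
       common, but payload-format-specific, convention. */
    static void packetize_frame(const uint8_t *frame, size_t frame_len)
    {
        size_t offset = 0;
        while (offset < frame_len) {
            size_t chunk = frame_len - offset;
            if (chunk > MAX_PAYLOAD)
                chunk = MAX_PAYLOAD;
            send_rtp_packet(frame + offset, chunk, offset + chunk == frame_len);
            offset += chunk;
        }
    }

The bundling case is the mirror image: several small frames share one packet, under rules set by the payload format in use.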

After the RTP packets have been sent, the buffered media data corresponding to those packets is eventually freed. The sender must not discard data that might be needed for error correction or for the encoding process. This requirement may mean that the sender must buffer data for some time after the corresponding packets have been sent, depending on the codec and error correction scheme used.

The sender is responsible for generating periodic status reports for the media streams it is generating, including those required for lip synchronization. It also receives reception quality feedback from other participants and may use that information to adapt its transmission.
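To give one concrete example of how those reports enable lip synchronization: each sender report pairs a reading of the sender's media clock with a reading of a common wallclock (NTP-format) time. A receiver can use that pair to convert any media timestamp into sender wallclock time and thereby align audio with video. A minimal sketch in C, with a hypothetical function name:

    #include <stdint.h>

    /* Convert an RTP media timestamp into sender wallclock time, given
       the (wallclock, media clock) pair from the most recent sender
       report. Applying this to each stream places audio and video on a
       common timeline, the basis of lip synchronization. */
    double rtp_to_wallclock(uint32_t rtp_ts,      /* timestamp to convert      */
                            uint32_t sr_rtp_ts,   /* media clock in the report */
                            double   sr_ntp_secs, /* wallclock in the report   */
                            double   clock_rate)  /* media clock rate, in Hz   */
    {
        /* unsigned subtraction then a signed cast handles timestamp wrap */
        int32_t delta = (int32_t)(rtp_ts - sr_rtp_ts);
        return sr_ntp_secs + (double)delta / clock_rate;
    }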

Behavior of an RTP Receiver

A receiver is responsible for collecting RTP packets from the network, correcting any losses, recovering the timing, decompressing the media, and presenting the result to the user. It also sends reception quality feedback, allowing the sender to adapt the transmission to the receiver, and it maintains a database of participants in the session. A possible block diagram for the receiving process is shown in Figure 1.3; implementations sometimes perform the operations in a different order depending on their needs.

Figure 1.3. Block Diagram of an RTP Receiver

The first step of the receive process is to collect packets from the network, validate them for correctness, and insert them into a sender-specific input queue. Packets are collected from the input queue and passed to an optional channel-coding routine to correct for loss. Following the channel coder, packets are inserted into a source-specific playout buffer. The playout buffer is ordered by timestamp, and the process of inserting packets into the buffer corrects any reordering induced during transport. Packets remain in the playout buffer until complete frames have been received, and they are additionally buffered to remove any variation in interpacket timing caused by the network. Calculation of the amount of delay to add is one of the most critical aspects in the design of an RTP implementation. Each packet is tagged with the desired playout time for the corresponding frame.
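A common input to that calculation is the interarrival jitter estimate that the RTP specification already requires receivers to maintain for reception quality reports. The following C sketch implements that estimator, assuming arrival times have been converted into units of the media clock; how the estimate is turned into a playout delay is a policy decision discussed later in this book.

    #include <stdint.h>

    /* Running interarrival jitter estimate, as specified for RTP
       reception quality reports. All values are in media clock units. */
    typedef struct {
        uint32_t last_transit;  /* (arrival - RTP timestamp) of previous packet */
        double   jitter;        /* smoothed estimate                            */
        int      have_previous;
    } jitter_state_t;

    void jitter_update(jitter_state_t *s, uint32_t arrival, uint32_t rtp_ts)
    {
        uint32_t transit = arrival - rtp_ts;
        if (!s->have_previous) {
            s->last_transit  = transit;
            s->have_previous = 1;
            return;
        }
        int32_t d = (int32_t)(transit - s->last_transit);
        if (d < 0)
            d = -d;
        s->last_transit = transit;
        /* first-order filter with gain 1/16, as in the RTP specification */
        s->jitter += ((double)d - s->jitter) / 16.0;
    }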

After their playout time is reached, packets are grouped to form complete frames, and any damaged or missing frames are repaired. Following any necessary repairs, frames are decoded (depending on the codec used, it may be necessary to decode the media before missing frames can be repaired). At this point there may be observable differences in the nominal clock rates of the sender and receiver. Such differences manifest themselves as drift in the value of the RTP media clock relative to the playout clock. The receiver must compensate for this clock skew to avoid gaps in the playout.
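A crude way to detect such skew, shown in the C sketch below with hypothetical names, is to compare how far the sender's media clock has advanced against how far the receiver's local clock has advanced over the same observation window. A ratio persistently different from 1.0 indicates skew that the receiver must absorb, for example by occasionally inserting or deleting a frame at an unobtrusive point in the stream.

    #include <stdint.h>

    /* Estimate clock skew as the ratio of sender media-clock advance to
       local-clock advance over an observation window. A result above 1.0
       means the sender's clock runs fast relative to ours, so the playout
       buffer will slowly fill unless the receiver compensates. */
    double estimate_skew(uint32_t rtp_first,   uint32_t rtp_last,   /* media clock */
                         double   local_first, double   local_last, /* seconds     */
                         double   clock_rate)                       /* Hz          */
    {
        double sender_elapsed = (int32_t)(rtp_last - rtp_first) / clock_rate;
        double local_elapsed  = local_last - local_first;
        return sender_elapsed / local_elapsed;
    }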

Finally, the media data is played out to the user. Depending on the media format and output device, it may be possible to play each stream individually—for example, presenting several video streams, each in its own window. Alternatively, it may be necessary to mix the media from all sources into a single stream for playout—for example, combining several audio sources for playout via a single set of speakers.

As is evident from this brief overview, the operation of an RTP receiver is complex, somewhat more so than that of a sender. This is largely due to the variability of IP networks: Much of the complexity comes from the need to compensate for packet loss and to recover the timing of a stream disrupted by jitter.

Summary

This chapter has introduced the protocols and standards for real-time delivery of multimedia over IP networks, in particular the Real-time Transport Protocol (RTP). The remainder of this book discusses the features and use of RTP in detail. The aim is to expand on the standards documents, explaining both the rationale behind the standards and possible implementation choices and their trade-offs.
