Chapter 7. Lip Synchronization

  • Sender Behavior

  • Receiver Behavior

  • Synchronization Accuracy

A multimedia session comprises several media streams, and in RTP each is transported via a separate RTP session. Because the delays associated with different encoding formats vary greatly, and because the streams are transported separately across the network, the media will tend to have different playout times. To present multiple media in a synchronized fashion, receivers must realign the streams as shown in Figure 7.1. This chapter describes how RTP provides the information needed to facilitate the synchronization of multiple media streams. The typical use for this technique is to align audio and video streams to provide lip synchronization, although the techniques described may be applied to the synchronization of any set of media streams.

Figure 7.1. Media Flows and the Need for Synchronization

A common question is why media streams are delivered separately, forcing the receiver to resynchronize them, when they could be delivered bundled together and presynchronized. The reasons include the desire to treat audio and video separately in the network, and the heterogeneity of networks, codecs, and application requirements.

It is often appropriate to treat audio and video differently at the transport level to reflect the preferences of the sender or receiver. In a video conference, for instance, the participants often favor audio over video. In a best-effort network, this preference may be reflected in differing amounts of error correction applied to each stream; in an integrated services network, using RSVP (Resource ReSerVation Protocol),11 this could correspond to reservations with differing quality-of-service (QoS) guarantees for audio and video; and in a Differentiated Services network,23,24 the audio and video could be assigned to different priority classes. If the different media types were bundled together, these options would either cease to exist or become considerably harder to implement. Similarly, if bundled transport were used, all receivers would have to receive all media; it would not be possible for some participants to receive only the audio, while others received both audio and video. This ability becomes an issue for multiparty sessions, especially those using multicast distribution.

However, even if it is appropriate to use identical QoS for all media, and even if all receivers want to receive all media, the properties of codecs and playout algorithms are such that some type of synchronization step is usually required. For example, audio and video decoders take different and varying amounts of time to decompress the media, perform error correction, and render the data for presentation. Also, the means by which the playout buffering delay is adjusted varies with the media format, as we learned in Chapter 6, Media Capture, Playout, and Timing. Each of these processes can affect the playout time and can result in loss of synchronization between audio and video.

The result is that some type of synchronization function is needed, even if the media are bundled for delivery. Given that, we may as well deliver the media separately, allowing them to be treated differently in the network, because doing so does not add significant complexity to the receiver.

With these issues in mind, we now turn to the synchronization process itself. There are two parts to this process: The sender must assign a common reference clock to the streams, and the receiver must resynchronize the media, undoing the timing disruption caused by the network. We discuss the sender and the receiver in turn, followed by some comments on the synchronization accuracy required for common applications.

Sender Behavior

The sender enables synchronization of media streams at the receiver by running a common reference clock and periodically announcing, through RTCP, the relationship between the reference clock time and the media stream time, as well as the identities of the streams to be synchronized. The reference clock runs at a fixed rate; correspondence points between the reference clock and the media stream allow the receiver to work out the relative timing relationship between the media streams. This process is shown in Figure 7.2.

Figure 7.2. Mapping Media Time Lines to a Common Clock at the Sender

The correspondence between reference clock and media clock is noted when each RTCP sender report packet is generated: A sampling of the reference clock, Treference, is included in the packet along with the corresponding RTP timestamp, TRTP = Treference × Raudio + Oaudio. The multiplication must be made modulo 2^32, to restrict the result to the range of the 32-bit RTP timestamp. The offset is calculated as Oaudio = Taudio – (Tavailable – Daudio_capture) × Raudio, where Taudio is the media clock timestamp of captured audio, Tavailable is the reference clock time at which that audio became available to the application, and Raudio, the nominal audio clock rate in hertz, is the conversion factor between the media and reference timelines. Operating system latencies can delay Tavailable and cause variation in the offset, which the application should filter, for example by choosing the minimum observed value. (The obvious changes to the formulae are made in the case of video.)
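
To make the calculation concrete, the following C sketch shows how a sender might derive the offset and the sender report timestamp. The function and variable names are illustrative, not part of any RTP API; unsigned 32-bit arithmetic provides the modulo-2^32 behavior, and fmod() keeps the floating-point conversion in range.

    #include <stdint.h>
    #include <math.h>

    /* Convert reference-clock seconds to media clock ticks, modulo 2^32.
     * fmod() keeps the double-to-uint32_t conversion within range. */
    static uint32_t to_media_units(double seconds, double rate)
    {
        return (uint32_t)fmod(seconds * rate, 4294967296.0);
    }

    /* Oaudio = Taudio - (Tavailable - Daudio_capture) * Raudio (mod 2^32) */
    uint32_t audio_offset(uint32_t t_audio, double t_available,
                          double d_capture, double rate)
    {
        return t_audio - to_media_units(t_available - d_capture, rate);
    }

    /* RTP timestamp to place in a sender report generated at reference
     * time t_reference: TRTP = Treference * Raudio + Oaudio (mod 2^32) */
    uint32_t sr_rtp_timestamp(double t_reference, double rate,
                              uint32_t o_audio)
    {
        return to_media_units(t_reference, rate) + o_audio;
    }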

Each application on the sender that is transmitting RTP streams needs access to the common reference clock, Treference, and must identify its media with reference to a canonical source identifier. The sending applications should be aware of the media capture delay—for example, Daudio_capture—because it can be significant and should be taken into account in the calculation and announcement of the relationship between reference clock times and media clock times.

The common reference clock is the “wall clock” time used by RTCP. It takes the form of an NTP-format timestamp, counting seconds and fractions of a second since midnight UTC (Coordinated Universal Time) on January 1, 1900.5 (Senders that have no knowledge of the wall clock time may use a system-specific clock such as “system uptime” to calculate NTP-format timestamps as an alternative; the choice of a reference clock does not affect synchronization, as long as it is done consistently for all media.) Senders periodically establish a correspondence between the media clock for each stream and the common reference clock; this is communicated to receivers via RTCP sender report packets as described in the section titled RTCP SR: Sender Reports in Chapter 5, RTP Control Protocol.
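
For concreteness, here is one way a sender might construct a 64-bit NTP-format timestamp from the system clock. This is a sketch, assuming a POSIX system; the 2,208,988,800-second constant is the span between the NTP epoch (January 1, 1900) and the UNIX epoch (January 1, 1970).

    #include <stdint.h>
    #include <sys/time.h>

    /* Seconds between the NTP epoch (1900) and the UNIX epoch (1970). */
    #define NTP_UNIX_OFFSET 2208988800UL

    /* Sample the system clock as a 64-bit NTP-format timestamp:
     * 32 bits of seconds and 32 bits of binary fraction. */
    void ntp_now(uint32_t *ntp_sec, uint32_t *ntp_frac)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        *ntp_sec  = (uint32_t)(tv.tv_sec + NTP_UNIX_OFFSET);
        /* Scale microseconds into a 32-bit binary fraction of a second. */
        *ntp_frac = (uint32_t)(((uint64_t)tv.tv_usec << 32) / 1000000);
    }

A sender using "system uptime" instead of wall clock time would simply substitute its own monotonic clock here; as noted above, receivers care only that the clock is common across media, not about its absolute value.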

In typical scenarios, there is no requirement for the sender or receiver to be synchronized to an external clock. In particular, although the wall clock time in RTCP sender report packets uses the format of an NTP timestamp, it is not required to be synchronized to an NTP time source. Sender and receiver clocks never have to be synchronized to each other. Receivers do not care about the absolute value of the NTP format timestamp in RTCP sender report packets, only that the clock is common between media, and of sufficient accuracy and stability to allow synchronization.

Synchronized clocks are required only when media streams generated by different hosts are being synchronized. An example would be multiple cameras giving different viewpoints on a scene, connected to separate hosts with independent network connections. In this instance the sending hosts need to use a time protocol or some other means to align their reference clocks to a common time base. RTP does not mandate any particular method of defining that time base, but the Network Time Protocol5 may be appropriate, depending on the degree of synchronization required. Figure 7.3 shows the requirements for clock synchronization when media streams from different hosts are to be synchronized at playout.

Figure 7.3. Synchronization of Media Generated by Different Hosts

The other requirement for synchronization is to identify sources that are to be synchronized. RTP does this by giving the related sources a shared name, so a receiver knows which streams it should attempt to synchronize and which are independent. Each RTP packet contains a synchronization source (SSRC) identifier to associate the source with a media time base. The SSRC identifier is chosen randomly and will not be the same for all the media streams to be synchronized (it may also change during a session if identifiers collide, as explained in Chapter 4, RTP Data Transfer Protocol). A mapping from SSRC identifiers to a persistent canonical name (CNAME) is provided by RTCP source description (SDES) packets. A sender should ensure that RTP sessions to be synchronized on playout have a common CNAME so that receivers know to align the media.

The canonical name is chosen algorithmically according to the user name and network address of the source host (see the section RTCP SDES: Source Description in Chapter 5, RTP Control Protocol). If multiple media streams are being generated by a single host, the task of ensuring that they have a common CNAME, and hence can be synchronized, is simple. If the goal is to synchronize media streams generated by several hosts—for example, if one host is capturing and transmitting audio while another transmits video—the choice of CNAME is less obvious because the default method in the RTP standard would require each host to use its own IP address as part of the CNAME. The solution is for the hosts to conspire in choosing a common CNAME for all streams that are to be synchronized, even if this means that some hosts use a CNAME that doesn't match their network address. The mechanism by which this conspiracy happens is not specified by RTP: One solution might be to use the lowest-numbered IP address of the hosts when constructing the CNAME; another might be for the audio to use the CNAME of the video host (or vice versa). This coordination would typically be provided by a session control protocol—for example, SIP or H.323—outside the scope of RTP. A session control protocol could also indicate which streams should be synchronized by a method that does not rely on the CNAME.

Receiver Behavior

A receiver is expected to determine which media streams should be synchronized, and to align their presentation, on the basis of the information conveyed to it in RTCP packets.

The first part of the process—determining which streams are to be synchronized—is straightforward. The receiver synchronizes those streams that the sender has given the same CNAME in their RTCP source description packets, as described in the previous section. Because RTCP packets are sent every few seconds, there may be a delay between receipt of the first data packet and receipt of the RTCP packet that indicates that particular streams are to be synchronized. The receiver can play out the media data during this time, but it is unable to synchronize them because it doesn't have the required information.

More complex is the actual synchronization operation, in which the receiver time-aligns audio and video for presentation. This operation is triggered by the reception of RTCP sender report packets containing the mapping between the media clock and a reference clock common to both media. Once this mapping has been determined for both audio and video streams, the receiver has the information needed to synchronize playout.

The first step of lip synchronization is to determine, for each stream to be synchronized, when the media data corresponding to a particular reference time is to be presented to the user. Because of differences in network behavior, or for other reasons, it is likely that data from two streams captured at the same instant will not be scheduled for presentation at the same time if the playout times are determined independently, according to the methods described in Chapter 6, Media Capture, Playout, and Timing. The playout time for one stream therefore has to be adjusted to match the other. This adjustment translates into an offset, added to the playout buffering delay of one stream, such that the media are played out in time alignment. Figures 7.4 and 7.5 illustrate the process.

Figure 7.4. Lip Synchronization at the Receiver

Figure 7.5. Mapping between Timelines to Achieve Lip Synchronization at the Receiver

The receiver first observes the mapping between the media clock and reference clock as assigned by the sender, for each media stream it is to synchronize. This mapping is conveyed to the receiver in periodic RTCP sender report packets, and because the nominal rate of the media clock is known from the payload format, the receiver can calculate the reference clock capture time for any data packet once it has received an RTCP sender report from that source. When an RTP data packet with media timestamp M is received, the corresponding reference clock capture time, TS (the RTP timestamp mapped to the reference timeline), can be calculated as follows:

TS = TSsr + (M – Msr) / R

where Msr is the media (RTP) timestamp in the last RTCP sender report packet, TSsr is the corresponding reference clock (NTP) timestamp in seconds (and fractions of a second) from the sender report packet, and R is the nominal media timestamp clock rate in hertz. (Note that this calculation is invalid if more than 2^32 ticks of the media clock have elapsed between Msr and M, but that this is not expected to occur in typical use.)
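
A minimal sketch of this mapping in C, with illustrative names; the cast to int32_t handles packets whose timestamps slightly precede the sender report, and keeps the result correct across a 32-bit wraparound:

    #include <stdint.h>

    /* Map RTP timestamp m onto the sender's reference timeline, given
     * the (m_sr, ts_sr) correspondence from the last RTCP sender report:
     * TS = TSsr + (M - Msr) / R */
    double rtp_to_reference(uint32_t m, uint32_t m_sr,
                            double ts_sr, double rate)
    {
        /* Difference modulo 2^32, reinterpreted as signed. */
        int32_t delta = (int32_t)(m - m_sr);
        return ts_sr + (double)delta / rate;
    }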

The receiver also calculates the presentation time for any particular packet, TR, according to its local reference clock. This is equal to the RTP timestamp of the packet, mapped to the receiver's reference clock timeline as described earlier, plus the playout buffering delay in seconds and any delay due to the decoding, mixing, and rendering processes. It is important to take into account all aspects of the delay until actual presentation on the display or loudspeaker, if accurate synchronization is to be achieved. In particular, the time taken to decode and render is often significant and should be accounted for as described in Chapter 6, Media Capture, Playout, and Timing.
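
As a sketch, under the assumption that the application can measure or estimate each delay term, the presentation time is simply the sum:

    /* Presentation time TR of a packet: its RTP timestamp mapped onto
     * the reference timeline (see the previous sketch), plus all delays
     * incurred before the media actually reaches the display or
     * loudspeaker. All names are illustrative. */
    double presentation_time(double ts_mapped, double playout_delay,
                             double decode_delay, double render_delay)
    {
        return ts_mapped + playout_delay + decode_delay + render_delay;
    }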

Once the capture and playout times are known according to the common reference timeline, the receiver can estimate the relative delay between media capture and playout for each stream. If data sampled at time TS according to the sender's reference clock is presented at time TR according to the receiver's reference clock, the difference between them, D = TRTS, is the relative capture-to-playout delay in seconds. Because the reference clocks at the sender and receiver are not synchronized, this delay includes an offset that is unknown but can be ignored because it is common across all streams and we are interested in only the relative delay between streams.

Once the relative capture-to-playout delay has been estimated for both audio and video streams, a synchronization delay, D = DaudioDvideo, is derived. If the synchronization delay is zero, the streams are synchronized. A nonzero value implies that one stream is being played out ahead of the other, and the synchronization delay gives the relative offset in seconds.

For the media stream that is ahead, the synchronization delay (in seconds) is multiplied by the nominal media clock rate, R, to convert it into media timestamp units, and then it is applied as a constant offset to the playout calculation for that media stream, delaying playout to match the other stream. The result is that packets for one stream reside longer in the playout buffer at the receiver, to compensate for either faster processing in other parts of the system or reduced network delay, and are presented at the same time as packets from the other stream.
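
One possible realization of these two steps is sketched below, with hypothetical names. It simply delays whichever stream is ahead; as discussed next, a receiver may prefer to split the adjustment between the streams differently.

    #include <stdint.h>

    /* d_audio, d_video: relative capture-to-playout delays in seconds,
     * each computed as D = TR - TS for that stream. */
    typedef struct {
        uint32_t audio_ticks;   /* extra audio playout delay, media units */
        uint32_t video_ticks;   /* extra video playout delay, media units */
    } sync_offsets;

    sync_offsets compute_sync_offsets(double d_audio, double d_video,
                                      double r_audio, double r_video)
    {
        sync_offsets off = { 0, 0 };
        double d_sync = d_audio - d_video;   /* seconds; 0 => in sync */
        if (d_sync > 0) {
            /* Audio takes longer from capture to playout, so video is
             * presented early: hold video back to match the audio. */
            off.video_ticks = (uint32_t)(d_sync * r_video);
        } else if (d_sync < 0) {
            /* Video takes longer: hold the audio back instead. */
            off.audio_ticks = (uint32_t)(-d_sync * r_audio);
        }
        return off;
    }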

The receiver can choose to adjust the playout of either audio or video, depending on its priorities, the relative delay of the streams, and the relative playout disturbance caused by an adjustment to each stream. With many common codecs, the video encoding and decoding times are the dominant factors, but audio is more sensitive to playout adjustments. In this case it may be appropriate to make an initial adjustment by delaying the audio to match the approximate video presentation time, followed by small adjustments to the video playout point to fine-tune the presentation. The relative priorities and delays may be different in other scenarios, depending on the codec, capture, and playout devices, and each application should make a choice based on its particular environment and delay budget.

The synchronization delay should be recalculated when the playout delay for any of the streams is adjusted because any change in the playout delay will affect the relative delay of the two streams. The offset should also be recalculated whenever a new mapping between media time and reference time, in the form of an RTCP sender report packet, is received. A robust receiver does not necessarily trust the sender to keep jitter out of the mapping provided in RTCP sender report packets, and it will filter the sequence of mappings to remove any jitter. An appropriate filter might track the minimum offset between media time and reference time, to avoid a common implementation problem in which the sender uses the media time of the previous data packet in the mapping, instead of the actual time when the sender report packet was generated.
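
Such a minimum-tracking filter might look like the following sketch, with illustrative names. It assumes the sender report's RTP timestamp has been unwrapped into 64 bits so that the offset stays continuous across 32-bit wraparound.

    #include <stdint.h>

    /* Track the minimum (reference - media) offset seen across sender
     * reports, guarding against senders that stamp the report with the
     * media time of the previous data packet. */
    typedef struct {
        double min_offset;   /* smallest offset observed, in seconds */
        int    valid;
    } mapping_filter;

    /* ts_sr: reference clock time from the sender report, in seconds.
     * m_sr_ext: the SR's RTP timestamp, unwrapped to 64 bits.
     * rate: nominal media clock rate in hertz. */
    double filter_mapping(mapping_filter *f, double ts_sr,
                          uint64_t m_sr_ext, double rate)
    {
        double offset = ts_sr - (double)m_sr_ext / rate;
        if (!f->valid || offset < f->min_offset) {
            f->min_offset = offset;
            f->valid      = 1;
        }
        return f->min_offset;   /* use when mapping media time onto TS */
    }

A robust implementation would also reset the filter if the mapping shifts by more than the expected jitter, for example, after the sender restarts and chooses a new offset.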

A change in the mapping offset causes a change in the playout point and may require either insertion or deletion of media data from the stream. As with any change to the playout point, such changes should be timed with care, to reduce the impact on the media quality. The issues discussed in the section titled Adapting the Playout Point in Chapter 6, Media Capture, Playout, and Timing, are relevant here.

Synchronization Accuracy

When you're implementing synchronization, the question of accuracy arises: What is the acceptable offset between streams that are supposed to be synchronized? There is no simple answer, unfortunately, because human perception of synchronization depends on what is being synchronized and on the task being performed. For example, the requirements for lip synchronization between audio and video are relatively lax and vary with the video quality and frame rate, whereas the requirements for synchronization of multiple related audio streams are strict.

If the goal is to synchronize a single audio track to a single video track, then synchronization accurate to a few tens of milliseconds is typically sufficient. Experiments with video conferencing suggest that synchronization errors on the order of 80 milliseconds to 100 milliseconds are below the limit of human perception (see Kouvelas et al. 1996,84 for example), although this clearly depends on the task being performed and on the picture quality and frame rate. Higher-quality pictures, and video with higher frame rates, make lack of synchronization more noticeable because it is easier to see lip motion. Similarly, if frame rates are low—less than approximately five frames per second—there is little need for lip synchronization because lip motion cannot be perceived as speech, although other visual cues may expose the lack of synchronization.

If the application is synchronizing several audio tracks—for example, the channels in a surround-sound presentation—requirements for synchronization are much stricter. In this scenario, playout must be accurate to a single sample; the slightest synchronization error will be noticeable because of the phase difference between the signals, which disrupts the apparent position of the source.

The information provided by RTP is sufficient for sample-accurate synchronization, if desired, provided that the sender has correctly calculated the mapping between media time and reference time contained in RTCP sender report packets, and provided that the receiver has an appropriate synchronization algorithm. Whether this is achievable in practice is a quality-of-implementation issue.

Summary

Synchronization—like many other features of RTP—is performed by the end systems, which must correct for the variability inherent in a best-effort packet network. This chapter has described how senders signal the time alignment of media, and the process by which receivers can resynchronize media streams. It has also discussed the synchronization accuracy required for lip synchronization between audio and video, and for synchronization among multiple audio streams.
