6
6.1 The Importance of Synchronization
Unlike analog audio, digital audio has a discrete-time structure, because it is a sampled signal in which the samples may be further grouped into frames and blocks having a certain time duration. If digital audio devices are to communicate with each other, or if digital signals are to be combined in any way, then they need to be synchronized to a common reference in order that the sampling frequencies of the devices are identical and do not drift with relation to each other. It is not enough for two devices to be running at nominally the same sampling frequency (say both at 44.1 kHz). Between the sampling clocks of professional audio equipment it is possible for differences in frequency of up to ±10 parts per million (ppm) to exist and even a very slow drift means that two devices are not truly synchronous. Consumer devices can exhibit an even greater range of sampling frequencies that are nominally the same.
The audible effect of a non-synchronous signal drifting with relation to a sync reference or another signal is usually a glitch or click occurring at the difference frequency between the signal and the reference, typically at an audio level around 50 dB below the signal, due to the repetition or dropping of samples. This will appear when attempting to mix two digital audio signals whose sampling rates differ by a small amount, or when attempting to decode a signal such as an unlocked consumer source by a professional system which is locked to a fixed reference. That said, it is not always easy to detect asynchronous operation by listening, even though sample slippage is occurring, as it depends on the nature of the audio signal at the time. Some systems may not operate at all if presented with asynchronous signals.
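The rate at which samples slip can be estimated from the frequency offset between the two clocks. The following is a minimal Python sketch, assuming that each slip corresponds to exactly one repeated or dropped sample; the function name is illustrative only.

```python
def slip_interval_seconds(nominal_hz: float, ppm_a: float, ppm_b: float) -> float:
    """Time between one-sample slips for two clocks offset by ppm_a and ppm_b."""
    freq_a = nominal_hz * (1 + ppm_a * 1e-6)
    freq_b = nominal_hz * (1 + ppm_b * 1e-6)
    difference_hz = abs(freq_a - freq_b)
    if difference_hz == 0:
        return float("inf")  # truly synchronous clocks never slip
    return 1.0 / difference_hz


# Two professional devices at opposite ends of a +/-10 ppm tolerance, 48 kHz nominal.
print(f"{slip_interval_seconds(48_000, +10, -10):.2f} s between slips")
```

Under these assumptions the worst-case professional pair slips roughly once a second, which is why nominally equal sampling frequencies are not sufficient.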
Furthermore, when digital audio is used with analog or digital video, the sampling rate of the audio needs to be locked to the video reference signal and to any timecode signals which may be used. In single studio operations the problem of ensuring lock to a common clock is not as great as it is in a multi-studio centre, or where digital audio signals arrive from remote locations. In distributed system cases either the remote signals must be synchronized to the local sample clock as they arrive, or the remote studio must somehow be fed with the same reference signal as the local studio. A number of approaches may be used to ensure that this happens and they will be explained in this chapter.
Another topic related to synchronization will also be examined here, and that is the importance of short-term clock stability in digitally interfaced audio systems. It is important to distinguish between clock stability requirements in interfacing and clock stability in convertors, although the two are related to some extent.
6.2 Choice of Sync Reference
6.2.1 AES Recommendations
AES recommendations for the synchronization of digital audio signals are documented in AES11-1997 [1]. They state that preferably all machines should be able to lock to a reference signal (DARS or ‘digital audio reference signal’), which should take the form of a standard two-channel interface signal (see section 4.3) whose sampling frequency is stable within a certain tolerance, and that all machines should have a separate input for such a synchronizing signal. If this procedure is not adopted then it is possible for a device to lock to the clock embedded in the channel code of the AES-format audio input signal – a technique known as ‘genlock’ synchronization. A third option is for the equipment to be synchronized by a master video reference from which a DARS can be derived.
In the AES11 standard, signals are considered synchronous when they have identical sampling frequencies, but phase errors are allowed to exist between the reference clock and received/transmitted digital signals in order to allow for effects such as cable propagation delays, phase-locked loop errors and other electrical effects. Input signal frame edges must lie within ±25% of the reference signal’s frame edge (taken as the leading edge of the ‘X’ preamble), and output signals within ±5% (see Figure 6.1), although tighter accuracy than this is preferable because otherwise an unacceptable build-up of delay may arise when equipment is cascaded. A phase error of ±25% of the frame period is actually quite considerable, corresponding to a timing difference of around 5 µs at 48 kHz, so the specification should be readily achievable in most circumstances. For example, a cable length of 58 metres typically gives rise to a delay of only one bit (0.32 µs at 48 kHz) in the interface signal. If a number of frames’ delay exists between the audio input and output of a device, this delay should be stated.
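The timing figures quoted above follow directly from the frame period. The sketch below is an illustrative Python calculation, assuming 64 bits per frame (two 32-bit subframes) in the two-channel interface; the function names are not taken from any standard.

```python
BITS_PER_FRAME = 64  # assumption: two 32-bit subframes per frame


def frame_period_us(fs_hz: float) -> float:
    """Duration of one interface frame (one sample period) in microseconds."""
    return 1e6 / fs_hz


def phase_window_us(fs_hz: float, fraction: float) -> float:
    """Allowed timing offset for a phase tolerance given as a fraction of a frame."""
    return fraction * frame_period_us(fs_hz)


def bit_period_us(fs_hz: float) -> float:
    """Duration of one bit cell in microseconds."""
    return frame_period_us(fs_hz) / BITS_PER_FRAME


fs = 48_000
print(f"input window  (25%): +/-{phase_window_us(fs, 0.25):.2f} us")
print(f"output window  (5%): +/-{phase_window_us(fs, 0.05):.2f} us")
print(f"one bit cell:        {bit_period_us(fs):.3f} us")
```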
The AES11 reference signal may either contain programme or not. If it does not contain programme it may be digital silence or simply the sync preamble with the rest of the frame inactive. Two grades of reference are specified: Grade 1, having a long-term frequency accuracy of ±1 ppm (part per million), and Grade 2, having a long-term accuracy of ±10 ppm. The Grade 2 signal conforms to the standard AES sample frequency recommendation for digital audio equipment (AES5-1984 [2]), and is intended for use within a single studio which has no immediate technical reason for greater accuracy, whereas Grade 1 is a tighter specification and is intended for the synchronization of complete studio centres (as well as single studios if required). Byte 4, bits 0 and 1 of the channel status data of the reference signal (see section 4.8) indicate the grade of reference in use (00 = default, 01 = Grade 1, 10 = Grade 2, 11 = reserved). It is also specified in this standard that the capture range of oscillators in devices designed to lock to external synchronizing signals should be ±2 ppm for Grade 1 and ±50 ppm for Grade 2. (Grade 1 reference equipment is only expected to lock to other Grade 1 references.)
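The grade table and capture ranges above can be collected into a small lookup. This is a hedged Python sketch: the two-character patterns are keyed exactly as written in the text, and mapping them onto a physical bit order within byte 4 is deliberately left to the caller. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceGrade:
    name: str
    accuracy_ppm: Optional[float]       # long-term frequency accuracy (+/-)
    capture_range_ppm: Optional[float]  # required oscillator capture range (+/-)


# Keys are the patterns as printed in the text (assumed, not verified, bit order).
GRADES = {
    "00": ReferenceGrade("default", None, None),
    "01": ReferenceGrade("Grade 1", 1.0, 2.0),
    "10": ReferenceGrade("Grade 2", 10.0, 50.0),
    "11": ReferenceGrade("reserved", None, None),
}


def within_capture_range(pattern: str, offset_ppm: float) -> bool:
    """True if a reference offset by offset_ppm falls inside the capture range."""
    grade = GRADES[pattern]
    if grade.capture_range_ppm is None:
        return False  # no capture range is stated for this pattern
    return abs(offset_ppm) <= grade.capture_range_ppm


print(GRADES["01"].name, within_capture_range("01", 3.0))
print(GRADES["10"].name, within_capture_range("10", 30.0))
```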
Despite the introduction of this standard there are still many devices that do not use AES3-format DARS inputs, preferring to rely on BNC-connected word clock, video sync reference signals or other proprietary synchronization clocks.
6.2.2 Other Forms of External Sync Reference
Currently, digital audio recorders are provided with a wide range of sync inputs and most systems may be operated in the external or internal sync modes. In the internal sync mode a system is locked to its own crystal oscillator, which in professional equipment should be accurate within ± 10 parts per million (ppm) if it conforms to AES recommended practice, but which in consumer equipment may be much less accurate than this.
In the external sync mode the system should lock to one of its sync inputs, which may either be selectable using a switch, or be selected automatically based on an order of priority depending on the mode of operation (for this the user should refer to the operations manual of the device concerned). Typical sync inputs are word clock (WCLK), which is normally a square-wave TTL-level signal (0–5 V) at the sampling rate, usually available on a BNC-type connector; ‘composite video’, which is a video reference signal consisting of either normal picture information or just ‘black and burst’ (a video signal with a blacked-out picture); or a proprietary sync signal such as the optional Alesis sync connection or the LRCK in the Tascam interface (see section 4.12). WCLK may be ‘daisy-chained’ (looped through) between devices in cases where the AES/EBU interface is not available. Digidesign’s ProTools system also uses a so-called ‘Super Clock’ signal at a multiple of 256 times the sampling rate for slaving devices together with low sampling jitter. This is a TTL level (0–5 V) signal on a BNC connector. In all cases one machine or source must be considered to be the ‘master’, supplying the sync reference to the whole system, and the others as ‘slaves’.
A ‘sync’ light is usually provided on the front panel of a device (or under a cover) to indicate good lock to an external or internal clock, and this may flash or go out if the system cannot lock to the clock concerned, perhaps because it has too much jitter, is at too low a level, conflicts with another clock, or because it is not at a sampling rate which can be accepted by the system.
Video sync (composite sync) is often used when a system is operated within a video environment, and where a digital recorder is to be referenced to the same sync reference as the video machines in a system. This is useful when synchronizing audio and video transports during recording and replay, using either a synchronizer or a video editor, when timecode is only used initially during the pre-roll, whereafter machines are released to their own video sync reference. It also allows for timecode to be recorded synchronously on the audio machine, since the timecode generator used to stripe the tape can also be locked to video syncs. The relationship between video frame rates and audio sampling frequencies is discussed further in section 6.6.1.
6.3 Distribution of Sync References
It is appropriate to consider digital audio as similar to video when approaching the subject of sync distribution, especially in large systems. Consequently, it is advisable to use a central high quality sync signal generator (the equivalent of a video sync pulse generator, or SPG), the output of which is made available widely around the studio centre, using digital distribution amplifiers (DDAs) to supply different outlets in a ‘star’ configuration. In video operations this is usually called ‘house sync’. Each digital device in the system may then be connected to this house sync signal and each should be set to operate in the external sync mode. Because long cable runs or poor quality interfaces can distort digital signals, resulting in timing jitter (see section 6.4.3), it may be advisable to install local reference signal generators slaved to the house sync master generator as a means of providing a ‘clean’ local reference within a studio. Until such time as AES11 reference inputs become available on audio products one might use word clock or a video reference signal instead.
Using the technique of central sync-signal distribution (see Figure 6.2) it becomes possible to treat all devices as slaves to the sync generator, rather than each one locking to the audio output of the previous one. In this case there is no ‘master’ machine and the sync generator acts as the ‘master’. It requires that all machines in the system operate at the same sampling rate, unless a sampling frequency convertor or signal synchronizer is used (see section 6.5).
Alternatively, in a small studio, it may be uneconomical or impractical to use a separate SPG. In such cases one device in the studio must be designated as the master. This device would then effectively act as the SPG, operating in the internal sync mode, with all other devices operating in the external sync mode and slaving to it (see Figure 6.3). If a digital mixer were to be used then it could be used as the SPG, but alternatively it would be possible to use a tape recorder, disk system or other device with a stable clock. In such a configuration it would be necessary either to use AES/EBU interfaces for all interconnection (in which case it would be possible to operate in the ‘genlock’ mode described above, where all devices derive a clock from their digital audio inputs) or to distribute a separate word clock or AES sync signal from the master. In the genlock configuration, the danger exists of timing errors being compounded as delays introduced by one device are passed serially down the signal chain.
In situations where sources are widely spread apart, perhaps even being fed in from remote sites (as might be the case in broadcast operations), the distribution of a single sync reference to all devices becomes very difficult or impossible. In such cases it may be necessary to retime external ‘wild’ feeds to the house sync before they can be connected to in-house equipment, and for this purpose a sampling frequency synchronizer should be used (see section 6.5).
6.4 Clock Accuracy Considerations
As mentioned above, there are a number of different considerations concerning clock accuracy and the importance of stability that apply at different points in the signal chain. There is the accuracy of the audio sample clock used for A/D and D/A conversion, which will have a direct effect on sound quality, and there is the accuracy of the external reference signal. There is also the question of timing stability in digital audio signals that have travelled over interconnects such as those described in this book, which may have suffered distortions of various kinds. Because the audio sample clock used for conversion in externally synchronized systems must be locked in some way either to the digital audio input clock or the sync reference, it is common for instabilities in either of these signals to affect the stability of the audio sample clock (although they need not). This depends on how the clock is extracted from the digital input signal and the nature of the timing error. Furthermore, clock instability resulting from distortion and interference in the digital interface makes the signal more difficult to decode.
6.4.1 Causes and Effects of Jitter on the Interface Signal
Timing irregularities may arise in signals transferred over a digital audio interface due to a number of factors. These may include bandwidth limitations of the interconnect and the effects of induced noise and other signals. Furthermore, the transmitted signal may already have some jitter, either because the source did not properly reject incoming jitter from an input signal, or because its own free-running clock was unstable. AES3 originally specified that data transitions on the interface should occur within 20 ns of an ideal jitter-free clock. In real products it is normal for interface transition jitter to be well below 20 ns, but when devices which pass on input jitter to their digital outputs are cascaded it is possible for the accumulated jitter to exceed this value after a number of stages. The degree to which a device passes on jitter is known as its jitter transfer function.
AES3-1997 is somewhat more specific in relation to jitter, specifying limits for output jitter, intrinsic jitter, jitter gain and input jitter tolerance. Output jitter is the sum of the intrinsic jitter of the device’s output and that passed through from any timing reference. The intrinsic jitter is specified to be no greater than 0.025 UI (unit interval) when measured using a standard high-pass measurement filter specified in the standard. A UI is specified as the smallest timing unit in the interface and there are 128 UIs per frame (the modulation method can involve transitions in the middle of a bit cell). Sinusoidal jitter gain (the ratio of jitter amplitude at the output of the device to that present at the sync signal input) should be no greater than 2 dB, again measured using a standard filter. Input jitter tolerance (the maximum jitter value below which a device should correctly decode input data) is specified as 0.25 UI peak-to-peak above 8 kHz rising to 10 UI below 200 Hz.
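Because the limits above are expressed in unit intervals, it can help to convert them to absolute time at a given sampling frequency. The following Python sketch assumes only the figure of 128 UI per frame stated above; the helper names are illustrative.

```python
UI_PER_FRAME = 128  # as stated in the text


def ui_ns(fs_hz: float) -> float:
    """Duration of one unit interval in nanoseconds."""
    return 1e9 / (UI_PER_FRAME * fs_hz)


def ui_to_ns(ui: float, fs_hz: float) -> float:
    """Convert a jitter amplitude in UI to nanoseconds."""
    return ui * ui_ns(fs_hz)


fs = 48_000
print(f"1 UI                        = {ui_ns(fs):.1f} ns")
print(f"intrinsic limit (0.025 UI)  = {ui_to_ns(0.025, fs):.2f} ns")
print(f"tolerance > 8 kHz (0.25 UI) = {ui_to_ns(0.25, fs):.1f} ns")
print(f"tolerance < 200 Hz (10 UI)  = {ui_to_ns(10, fs):.0f} ns")
```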
The received signal from a standard two-channel AES interface will have an eye pattern that depends on amplitude and timing irregularities (see Chapter 4). Amplitude errors will close the eye vertically and timing errors will close it horizontally. The limits for correct decoding are laid out in the specification of the interface. Nonetheless, some receivers are better than others at decoding data with a poor eye pattern and this has partly to do with the frequency response of the phase-locked loop in the receiver and its lock-in range (see below). It also depends on the part of the signal from which the decoder extracts its clock, since some transitions are decidedly more unstable than others when the link is poor. Decoders that are very tolerant of poor input signals may at the same time be bad at rejecting jitter. Although they may decode the signal, the resulting sound quality may be poor if the signal is converted within the device without further rejection of jitter and the device may pass on excessive jitter at its output.
Dunn [4] and Dunn and Hawksford [6] have both carried out simulations of the effects of link bandwidth reduction on standard two-channel interface signals. The important conclusions of their work are as follows. When a link suffers high-frequency loss there will be a reduction in amplitude of the shorter pulses and a slowing in rise and fall times at transitions, the effect of which is to delay the zero-crossing transition after short pulses less than the delay after longer pulses. This variable delay is effectively jitter and is solely a result of high-frequency loss. Figure 6.4 shows the HF loss model which was used in both studies (which, although simplistic, is considered a good starting point for analysis), the time constant of which is RC. Figure 6.5 shows a comparison between simulated bi-phase mark data at time constants of 200 ns and 50 ns, showing clearly that at 200 ns the shorter pulses are more attenuated than the longer (a time constant of 200 ns corresponds to a roll-off of 3 dB at 0.8 MHz, whereas 50 ns corresponds to a similar roll-off at around 3.18 MHz). Dunn’s results show clearly that for links with a bandwidth of less than around 3 MHz the jitter suffered by transitions in the main part of the AES subframe (the audio data time slots) is far greater than that suffered by the penultimate transition of the Y preamble, as shown in Figure 6.6. Dunn and Hawksford also provide convincing evidence that the jitter is highly correlated with the audio signal and is affected by the difference between the number of zeros and ones in the signal. This situation clearly becomes more critical at higher data rates than those originally specified.
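The roll-off frequencies quoted for the two time constants can be checked with the usual first-order relation f = 1/(2πRC). The sketch below assumes that the HF loss model is a simple first-order RC low-pass, which is what a single time constant RC suggests; the function name is illustrative.

```python
import math


def cutoff_hz(time_constant_s: float) -> float:
    """-3 dB frequency of a first-order RC low-pass with the given time constant."""
    return 1.0 / (2.0 * math.pi * time_constant_s)


for tau_ns in (200, 50):
    f_mhz = cutoff_hz(tau_ns * 1e-9) / 1e6
    print(f"RC = {tau_ns:>3} ns -> -3 dB at {f_mhz:.2f} MHz")
```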
6.4.2 Audio Sampling Frequency
It is important to separate the discussion of long-term sample frequency accuracy from stability in the short term. Short-term instability is called ‘jitter’ and long-term inaccuracy would manifest itself as drift (if in one direction only), or wow and flutter (if cyclically modulated) in extreme cases. Jitter will be covered in the next section.
For professional equipment, the nominal sampling frequency should be accurate to within ±10 ppm if it conforms to AES5 recommendations (although the standard only explicitly states this at 48 kHz). This corresponds to an allowable peak drift in the sample period of ±0.21 ns at a sampling frequency of 48 kHz, but implies nothing about the rate at which that modulation takes place. When a device ‘free runs’ it is locked to its own internal oscillator, which for fixed sampling frequencies is normally a crystal oscillator capable of high accuracy. However, in variable-speed modes a crystal oscillator cannot easily be used and some form of voltage-controlled or other oscillator may take its place, having a less stable frequency. In such cases a device is not expected to meet the stability requirements of AES5 or AES11.
In consumer equipment the sampling frequency is normally less carefully specified and controlled, often making it difficult to interconnect consumer and professional equipment without the use of a sampling frequency convertor or a synchronizer. IEC 60958 specifies three levels of sampling frequency accuracy: Level I (‘high’ accuracy) = ±50 ppm; Level II (normal accuracy) = ±1000 ppm; and Level III (variable pitch shifted clock mode), which is undefined except to say that it can only be received by special equipment and that the frequency range is likely to be ±12.5% of the nominal sampling frequency. Again nothing is said about the rate of sample clock modulation. Consumer sampling frequency and clock accuracy are indicated in bits 24–27 and 28–29 of channel status in the digital audio interface signal (see section 4.8). There is no such indication in professional channel status, except in AES11-type reference signals (see above).
At the professional limit of ±10 ppm, a nominal sample clock of 48 kHz could range over the limits 47 999.52 Hz to 48 000.48 Hz – a speed tolerance of ±0.001% – whereas a normal accuracy consumer device at the same nominal sampling frequency could be anything from 47 952 Hz to 48 048 Hz – a speed tolerance of ±0.1%.
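These limits are simple to reproduce for any nominal rate and tolerance. The following Python sketch is illustrative; the period deviation uses a first-order approximation, which is adequate at ppm-level tolerances.

```python
def frequency_limits_hz(nominal_hz: float, tolerance_ppm: float) -> tuple:
    """Lowest and highest frequency allowed by a +/- ppm tolerance."""
    delta = nominal_hz * tolerance_ppm * 1e-6
    return nominal_hz - delta, nominal_hz + delta


def period_deviation_ns(nominal_hz: float, tolerance_ppm: float) -> float:
    """Approximate peak change in sample period (ns) for a +/- ppm tolerance."""
    return (1e9 / nominal_hz) * tolerance_ppm * 1e-6


for label, ppm in (("professional, 10 ppm", 10), ("consumer Level II, 1000 ppm", 1000)):
    low, high = frequency_limits_hz(48_000, ppm)
    print(f"{label}: {low:.2f} Hz to {high:.2f} Hz")
print(f"period deviation at 10 ppm: +/-{period_deviation_ns(48_000, 10):.2f} ns")
```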
6.4.3 Sample Clock Jitter and Effects on Sound Quality
Short-term timing irregularities in sample clocks may affect sound quality in devices such as A/D and D/A convertors and sampling frequency convertors. This is due to modulation in the time domain of the sample instant (see section 2.7.4), resulting in low-level signal products within the audio spectrum. The important features of jitter are its peak amplitude and its rate, since the effect on sound quality is dependent on both of these factors taken together. Shelton [3], by calculating the rms signal-to-noise ratio resulting from random jitter, showed that timing irregularities as low as 5 ns may be significant for 16-bit digital audio systems over a range of signal frequencies and that the criteria are even more stringent at higher resolutions and at high frequencies. The effects are summarized in Figure 6.7.
When jitter is periodic rather than random, it results in the equivalent of ‘flutter’, and the effect when applied to the sample clock in the conversion of a sinusoidal audio signal is to produce sidebands on either side of the original audio signal due to phase modulation, whose spacing is equal to the jitter frequency. Julian Dunn [4] has shown that the level of the jitter sideband (Rj) with relation to the signal is given by:
$$R_j = 20 \log_{10}\left(\frac{J\,\omega_i}{4}\right)\ \text{dB}$$
where J is the peak-to-peak amplitude of the jitter (in seconds) and ωi is the audio signal frequency (in radians per second). Using this formula he shows that for sinusoidal jitter with an amplitude of 500 ps, a maximum level 20 kHz audio signal will produce sidebands at −96.1 dB relative to the amplitude of the tone.
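As a numerical check, the sketch below assumes the sideband relation Rj = 20 log10(J·ωi/4), with J in seconds and ωi in radians per second; this form reproduces the −96.1 dB figure quoted above. The function name is illustrative.

```python
import math


def jitter_sideband_db(jitter_pp_s: float, audio_hz: float) -> float:
    """Sideband level (dB re. the tone) for sinusoidal jitter of peak-to-peak J."""
    omega_i = 2.0 * math.pi * audio_hz  # audio frequency in rad/s
    return 20.0 * math.log10(jitter_pp_s * omega_i / 4.0)


print(f"{jitter_sideband_db(500e-12, 20_000):.1f} dB")  # 500 ps jitter, 20 kHz tone
```

Because the level depends on the product of J and ωi, halving either the jitter amplitude or the audio frequency lowers the sidebands by about 6 dB.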
What is important, though, is the audibility of jitter-induced products and Dunn [4, 5] attempted to calculate this based on an analysis of the resulting spectrum using accepted audibility curves, based on critical band masking theory, assuming that the audio signal is replayed at a high listening level (120 dB SPL). As shown in Figure 6.8, which plots jitter amplitude against jitter frequency (not audio frequency) for just-audible modulation noise on a worst-case audio signal, the jitter amplitude may in fact be very high (> 1 μs) at low jitter frequencies (up to around 250 Hz) because the sidebands will be masked at all audio frequencies, but the amount allowed falls sharply above this jitter frequency although it may still be up to ±10 ns at jitter frequencies up to 400 Hz.
The original version of AES11 specified tolerances for jitter on the sampling frequency clock, but this was dropped in the 1997 revision in favour of a statement to the effect that the clock tolerance requirements for A/D and D/A conversion would have to be more stringent than that for a Grade 1 reference signal in respect of random jitter and jitter modulation. This reinforces the point that sampling clock jitter and interface clock jitter are related but different problems. The effect of jitter on sampling clocks and the audible result thereof has become a large and complex subject and it is not proposed to deal with it further in this book, the primary purpose of which is to cover interfacing issues. A comprehensive study by Chris Dunn and Malcolm Hawksford [6] attempted to survey the effects of interface-induced jitter on different types of DAC and this paper warrants close study by those whose business it is to design high quality DACs with digital audio interfaces.
The implication of section 6.4.1 is that it is greatly preferable to derive a stable sample clock from one of the reliable preamble transitions than from the audio data slot transitions, although there is evidence that many devices do use the data transitions. One older interface receiver chip adjusted its PLL on every negative-going transition in the interface signal, for example. Since transitions in the audio data part of the subframe are not only more sensitive to line-induced jitter, but also determined by the audio signal, it is even possible for a signal on the B audio channel to modulate the sampling clock such that jitter sidebands appear in the A channel that are tonally related to the signal in the B channel.
The rejection of jitter by the receiver depends principally on the frequency response of its PLL and in general the narrower the response of the PLL and the lower its cut-off frequency the better the rejection of jitter. The problem with this is that such PLLs may not lock up as quickly as wide bandwidth PLLs and may not lock over a particularly wide frequency range. A solution suggested by Dunn and Hawksford is to use a wideband PLL in series with a low bandwidth version which switches in after conditions have stabilized. Alternatively RAM buffering may be used between the interface decoder and the convertor, the data being clocked out of the buffer to the convertor under control of a more stable clock source. In a hi-fi system it is possible that the stable clock source’s frequency could be independent of the incoming data clock, provided that the buffer was of sufficient size to accommodate the maximum timing error between the two over the duration of, say, a CD, but the better solution adopted by some manufacturers in two-box CD players is to generate a synchronizing clock in the convertor which is fed back to reference the speed of the CD transport and thus the rate of data coming over the digital interface (see Figure 6.9).
In a professional digital audio system where all devices in the system are to be locked to a common sampling frequency clock it would normally be necessary to lock any convertors to an external reference. An AES11-style reference signal derived from a central generator might have suffered similar degradations to an audio signal travelling between two devices and thus exhibit transition timing jitter. In such cases, especially in areas where high quality D/A conversion is required, it is advisable either to reclock the reference signal or to use a local high quality reference generator, slaved to the central AES11 generator, with which to clock the convertors.
6.5 Use and Function of Sampling Frequency Synchronizers
A sampling frequency synchronizer may be used to lock a digital audio signal to a reference signal. It could also perform the useful function of reclocking distorted remote signals to remove short-term timing errors (jitter). Three main approaches to sampling frequency synchronization are necessary, depending on the operational requirement. Each will be discussed in turn. These are:
(a) Frame alignment: to deal with signals which are of identical sampling frequency but which are more than 25% of a frame period out of phase with the reference.
(b) Buffering: to deal with signals of nominally the same sampling frequency but which are not locked to the same reference and thus drift slightly with relation to each other.
(c) Sampling frequency conversion: to deal with signals whose sampling frequencies differ by a larger amount than implied in (b), such as between 44.1 and 48 kHz, or between consumer and professional systems in which the consumer device’s sampling rate is nominally the same as the professional device’s but within a large tolerance (see section 6.4 above).
6.5.1 Frame Alignment
It may be necessary to correct for timing errors in signals that are synchronous to the master clock but have travelled long distances. These will have been delayed and so be out of phase with the reference. This is more properly referred to as ‘frame alignment’ and is only necessary when a signal is more than 25% of a frame period delayed with reference to the sync reference. Propagation delays are not great, however: for example, an AES/EBU signal must travel some 3.7 km down a typical cable before it is delayed by one sample period; thus it is most unlikely that such a situation will arise in real operational environments unless a large static phase error has been introduced in the sample clock of an incoming signal due to it having been cascaded through a number of devices operating in the genlock mode (see section 6.3).
In order to conform to AES recommendations, frame alignment should rephase the signal to bring it within ±5% of a frame period compared with the sync reference. Input signals less than ±25% adrift are also expected to be brought within this ±5% limit at the output, but this is normally performed within the device itself. Reframing of signals more than ±25% adrift may be performed within the device, but if not an external reframer would be required.
6.5.2 Buffering
For signals of nominally the same frequency but very slightly adrift it is possible to use a simple buffer store synchronizer, such as those described by Gilchrist [7] and also by Parker [8]. In this type of synchronizer, a typical block diagram of which is pictured in Figure 6.10, audio samples are written sequentially into successive addresses of a solid state memory configured in the FIFO manner. These samples are read out of the memory a short time later, clocked by the reference signal, the buffer providing a short-term store to accommodate the variation in input and output rates. If the output rate is slightly faster than the input rate then the buffer will gradually become empty and if it is slower than the input rate the buffer will gradually become full, requiring action at some point to avoid losing data or repeating samples because the buffer cannot be infinitely large. At such a time the read address is reset to the mid point of the buffer, resulting in a slight discontinuity in the audio signal. This discontinuity may be arranged to occur within silent passages of the programme, or alternatively a short crossfade may be introduced at the reset point to ‘hide’ the discontinuity.
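The mechanism described above can be illustrated with a toy model. This Python sketch is not taken from the cited designs: it assumes a circular memory that starts half full of digital silence and recentres the read address half a buffer behind the write address whenever the store becomes empty or full. Silence detection and crossfading are omitted, and the class name is hypothetical.

```python
class BufferStoreSynchronizer:
    """Toy FIFO synchronizer that recentres its read address on under/overflow."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.memory = [0] * capacity   # pre-filled with digital silence
        self.written = capacity // 2   # total samples written (starts half full)
        self.read_count = 0            # total samples read
        self.resets = 0

    def _recentre_if_needed(self) -> None:
        fill = self.written - self.read_count
        if fill <= 0 or fill >= self.capacity:
            # Reset the read address to half a buffer behind the write address.
            self.read_count = self.written - self.capacity // 2
            self.resets += 1

    def write(self, sample: int) -> None:
        """Store one sample on an input-clock tick."""
        self.memory[self.written % self.capacity] = sample
        self.written += 1
        self._recentre_if_needed()

    def read(self) -> int:
        """Fetch one sample on a reference-clock tick."""
        sample = self.memory[self.read_count % self.capacity]
        self.read_count += 1
        self._recentre_if_needed()
        return sample


# Input clock 1% faster than the reference: one extra write per 100 reads.
sync = BufferStoreSynchronizer(capacity=8)
value = 0
for tick in range(400):
    sync.write(value)
    value += 1
    sync.read()
    if tick % 100 == 99:
        sync.write(value)
        value += 1
print("resets:", sync.resets)
```

Between resets the output is a bit-for-bit copy of the input; each reset skips (or, on underflow, repeats) half a buffer of samples, which is the discontinuity the text refers to.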
Buffer store synchronizers have the advantage that most of the time the audio signal is copied bit for bit between input and output, with discontinuities only occurring once every so many minutes, depending on the size of the buffer and the discrepancy in input and output sampling rates. The larger the buffer the longer the gaps between buffer resets, or the greater the discrepancy between sampling rates which may be accommodated. The price for using a larger buffer is a longer delay between input and output, and this must be chosen with the operational requirement in mind. Using a buffer store capable of holding 480 samples, for example, a delay of around 5 ms would result, and buffer resets would occur every 8.3 minutes if the sampling frequencies were at the extremes of the AES5 tolerance of ±10 ppm.
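The trade-off between delay and reset interval can be explored numerically. In the Python sketch below, the 5 ms and 8.3 minute figures are reproduced under two assumptions that are not stated explicitly in the text: that the buffer idles half full (240 samples at 48 kHz), and that a reset is needed after 480 samples of accumulated slip at a relative offset of 20 ppm (one clock at +10 ppm, the other at −10 ppm). A different reset policy would give a different interval.

```python
def buffer_delay_ms(samples_held: int, fs_hz: float) -> float:
    """Throughput delay in milliseconds for a given number of stored samples."""
    return 1e3 * samples_held / fs_hz


def reset_interval_minutes(slip_samples: int, fs_hz: float,
                           relative_offset_ppm: float) -> float:
    """Minutes taken to accumulate slip_samples of drift between two clocks."""
    drift_samples_per_s = fs_hz * relative_offset_ppm * 1e-6
    return slip_samples / drift_samples_per_s / 60.0


print(f"delay:          {buffer_delay_ms(240, 48_000):.1f} ms")
print(f"reset interval: {reset_interval_minutes(480, 48_000, 20):.1f} min")
```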
6.5.3 Sampling Frequency Conversion
For signals whose sampling frequencies differ by too great an amount to be handled by a buffer store synchronizer it will be necessary to employ sampling frequency conversion. This can be used to convert digital interface signals from one rate to another (say from 44.1 to 48 kHz) without passing through the analog domain. Sampling frequency conversion is not truly a transparent process but modern convertors introduce minimal side effects.
The most basic form of sampling frequency conversion involves the translation of samples at one fixed rate to a new fixed rate, related by a simple fractional ratio. Fractional-ratio conversion involves the mathematical interpolation of samples at the new rate based on the values of samples at the old rate. Digital filtering is used to calculate the amplitudes of the new samples such that they are mathematically correct based on the impulse response of original samples, after low-pass filtering with an upper limit of the Nyquist frequency of the original sampling rate. A clock rate common to both sampling frequencies is used to control the interpolation process. Using this method, some output samples will coincide with input samples, but only a limited number of possibilities exist for the interval between input and output samples. Such a process is nominally jitter free.
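For a concrete case, the ratio and common clock rate for a 44.1 to 48 kHz conversion can be derived as follows. This Python sketch is illustrative only: it computes the relationships described above rather than performing any filtering, and the function names are assumptions.

```python
from fractions import Fraction
from math import gcd


def conversion_ratio(fs_in: int, fs_out: int) -> Fraction:
    """Output/input rate as a fraction in lowest terms."""
    return Fraction(fs_out, fs_in)


def common_clock_hz(fs_in: int, fs_out: int) -> int:
    """Lowest clock rate that is an integer multiple of both sampling rates."""
    return fs_in * fs_out // gcd(fs_in, fs_out)


def distinct_offsets(fs_in: int, fs_out: int) -> int:
    """Number of different timings of an output sample relative to the input grid."""
    clock = common_clock_hz(fs_in, fs_out)
    in_ticks = clock // fs_in    # common-clock ticks per input sample
    out_ticks = clock // fs_out  # common-clock ticks per output sample
    return len({(k * out_ticks) % in_ticks for k in range(in_ticks)})


print(conversion_ratio(44_100, 48_000))  # 160/147
print(common_clock_hz(44_100, 48_000))   # 7056000
print(distinct_offsets(44_100, 48_000))  # 160
```

The finite set of offsets is what allows a fixed bank of filter coefficients to be used, one set per offset.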
If the input and output sampling rates have a variable or non-simple relationship, output samples may be required at any interval in between input samples. This requires an interpolator with many more clock phases than for fractional-ratio conversion, the intention being to pick a clock phase which most closely corresponds with the desired output sample instant at which to calculate the necessary coefficient. There will clearly be a timing error, the audible result of which is equivalent to the effect of jitter, and this may be made smaller by increasing the number of possible interpolator phases. If the input sampling rate is continuously varied (as it might be in variable-speed searching or cueing) the position of interpolated samples with relation to original samples must vary also, and this requires real-time calculation of filter phase.
Errors in sampling frequency conversion should be designed so as to result in noise modulation below the noise floor of a 16-bit system and preferably lower. For example, one such convertor is quoted as introducing distortion and noise at −105 dB ref. 1 kHz at 0 dB FS (equivalent to 18-bit noise performance).
6.6 Considerations in Video Environments
Audio and video/film have traditionally required synchronization for the purposes of achieving lip sync. Film and video are both discrete in that they have frames, and when audio was analog, synchronizing to sufficient accuracy for lip sync caused no difficulty. Now that audio is also digital, it too is made up of discrete information and more accurate synchronizing with video becomes necessary. In environments where digital audio is used with video signals it is important for the audio sampling rate to be locked to the same master clock as the video reference. The same applies to timecode signals which may be used with digital audio and video equipment. A number of proposals exist for incorporating timecode within the standard two-channel interface, each of which has different merits, and these will be discussed below.
6.6.1 Relationships between Video Frame Rates and Audio Sampling Rates
People using the PAL or SECAM television systems are fortunate in that there is a simple integer relationship between the sampling frequency of 48 kHz used in digital audio systems for TV and the video frame rate of 25 Hz (there are 1920 samples per frame). There is also a simple relationship between the other standard sampling frequencies of 44.1 and 32 kHz and the PAL/SECAM frame rate, as was shown in Table 4.2. Users of NTSC TV systems (such as those in the USA and Japan) are less fortunate because the TV frame rate is 30/1.001 (roughly 29.97) frames per second, resulting in a non-integer relationship with standard audio sampling frequencies. The sampling frequency of 44.056 kHz was introduced in early digital audio recording systems that used NTSC VTRs, as this resulted in an integer relationship with the frame rate. For a variety of historical reasons it is still quite common to encounter so-called ‘pull-down’ sampling frequencies in video environments using the NTSC frame rate, these being 1/1.001 times the standard sampling frequencies, which mainly serves to complicate matters.
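These relationships are easy to verify with exact rational arithmetic. The sketch below uses only the frame rates and sampling frequencies quoted above; the variable and function names are illustrative.

```python
from fractions import Fraction

PAL_SECAM = Fraction(25)          # frames per second
NTSC = Fraction(30_000, 1001)     # 30/1.001 frames per second


def samples_per_frame(fs_hz, frame_rate):
    """Exact number of audio samples in one video frame."""
    return Fraction(fs_hz) / frame_rate


for fs in (48_000, 44_100, 32_000):
    print(fs, samples_per_frame(fs, PAL_SECAM), samples_per_frame(fs, NTSC))

# 'Pull-down' version of 44.1 kHz: 1/1.001 times the standard rate.
pulldown_44k1 = Fraction(44_100) * Fraction(1000, 1001)
print(float(pulldown_44k1), samples_per_frame(pulldown_44k1, NTSC))
```

At 25 Hz all three standard rates give whole numbers (1920, 1764 and 1280 samples per frame). At the NTSC rate 48 kHz gives 8008/5 samples per frame, so a whole number of samples only occurs every five frames, whereas the pull-down rate of roughly 44.056 kHz gives exactly 1470.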
The standard two-channel interface’s channel status block structure repeats at 192 sample intervals, and in 48 kHz systems there are exactly ten channel status blocks per PAL/SECAM video frame (1920 samples), simplifying the synchronization of information contained in channel status with junctures in the TV signal and making possible the carrying of EBU timecode signals in channel status as described below.
As described by Shelton [9] and others [10,11] it is desirable to source a master audio reference signal centrally within a studio operation, just as a video reference is centrally sourced. These two references are normally locked to the same highly stable rubidium reference, which in turn may be locked to a standard reference frequency broadcast by a national transmitter. As shown in Figure 6.11, the overall master clock contained in a central apparatus room will be used to lock the video SPG (distributed as ‘black and burst’ video), the audio reference signal generator (distributed as a standard AES11 reference signal) and the colour subcarrier synthesizer. The video reference is used in turn to lock timecode generators. Appropriate master clock frequencies and division ratios may be devised for the application in question.
The sync point between audio and video reference signals defined in AES11 is the half amplitude point of the leading edge of line sync of line 1 of the video frame in systems where there is an integer number of AES frames per video frame. This is synchronized to the start of the X or Z audio preamble. The situation is complicated with NTSC video since there is not an integer number of audio samples per frame. The desired alignment of audio preamble and video line sync only occurs once every five frames and an indicator is supposed to be included in the video reference to show which frame acts as the sync point.
An alternative approach to video sync is found on some audio equipment, especially digital tape recorders. Here the audio device accepts a video sync reference and derives its sample clock by appropriate multiplication and division of this reference internally. DIP switches may be provided on the audio device to select the appropriate frame rate so that the correct ratio results.
6.6.2 Referencing of VTRs with Digital Audio Tracks
Video tape recorders (VTRs), both analog and digital, are often equipped with digital audio tracks. Digital VTRs are really data recorders with audio and video interfaces. The great majority of the data is video samples and the head rotation is locked to the video field timing. Part of each track is reserved for audio data, which uses the same heads and much of the same circuitry as the video on a time-division basis. As a result the audio sampling rate has to be locked to video timing so that the correct number of audio and video samples can be assembled in order to record a track.
At the moment the only audio sampling rate supported by digital VTRs is 48 kHz. This causes no difficulty in 625/50 systems as there are exactly 960 sample periods in a field and a phase-locked loop can easily multiply the vertical rate to produce a video synchronous 48 kHz clock. Alternatively, line rate can be multiplied by 384/125. However, the 0.1% offset of 525 line systems makes the actual field rate 59.94 Hz. The fields are slightly longer than at 60 Hz and in 60 fields there will be exactly 48 048 sample periods. Unfortunately this number does not divide by 60 without a remainder. The smallest number of fields which contain a whole number of samples is five. This makes the generation of the audio sampling clock more difficult, but it can be obtained by multiplying line rate by 1144/375.
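The multiplication ratios quoted here can be confirmed exactly. The sketch below takes the line rates to be 625 lines at 25 frames per second and 525 lines at 30/1.001 frames per second; the names are illustrative only.

```python
from fractions import Fraction

LINE_RATE_625 = Fraction(625 * 25)                     # 15 625 Hz
LINE_RATE_525 = Fraction(525) * Fraction(30_000, 1001)  # about 15 734.27 Hz

# Audio sample clock derived by multiplying line rate.
fs_625 = LINE_RATE_625 * Fraction(384, 125)
fs_525 = LINE_RATE_525 * Fraction(1144, 375)

# Samples per field in each system (two fields per frame).
per_field_625 = Fraction(48_000) / 50
per_field_525 = Fraction(48_000) / Fraction(60_000, 1001)

# The denominator is the smallest number of fields that holds
# a whole number of samples.
fields_needed = per_field_525.denominator
samples_in_sequence = per_field_525 * fields_needed

print(fs_625, fs_525)
print(per_field_625, per_field_525, fields_needed, samples_in_sequence)
```

Both ratios yield exactly 48 kHz. In the 525-line case each field holds 800.8 sample periods, so the shortest repeating sequence is five fields containing 4004 samples.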
When a DVTR is recording using analog audio inputs, the heads rotate locked to input video, which also determines the audio sampling rate for the ADCs in the machine and there is no difficulty. On replay, the synchronism between video and audio is locked in the recording and the audio sampling rate on the output will be locked to station reference. Any receiving device will need to slave to the replay audio sampling rate. On recording with a digital audio input, it is necessary that the digital audio source is slaved to the input video of the DVTR. This can be achieved either by taking a video-derived audio sampling rate reference from the DVTR to the source, or by using a source which can derive its own sampling rate from a video input. The same video timing is then routed to both the source and the DVTR.
If these steps are not followed, there could be an audio sampling rate mismatch between the source and destination and inevitably periodic corruption will occur. With modern crystal-controlled devices it is surprising how long an unlocked system can run between corruptions. It is easy mistakenly to think a system is locked simply because it works for a short while when in fact it is not and will show signs of distress if monitored for longer.
In some VTRs with digital audio tracks it has been possible for there to arise a phase relationship between the digital audio outputs of a VTR and the video output which is different each time the VTR is turned on, causing difficulties in the digital transfer of audio and video data to another VTR. When the phase relationship is such that the incoming digital audio signal to a VTR lies right on the boundary between two sample slots of its own reference there is often the possibility of sample slips or repeats when timing jitter causes vacillation of sync between two sample periods. This highlights the importance of fixing the audio/video phase relationship.
Manufacturers of video recorders, though, claim that the solution is not as simple as it seems, since in editing systems the VTR may switch between different video sync references depending on the operational mode. Should the audio phase always follow the video reference selection of the VTR? In this case audio reframing would be required at regular points in the signal chain. A likely interim solution will be that at least the phase of the house audio reference signal with relation to the house video sync will be specified, providing a known reference sync point for any reframing which might be required in combined audio/video systems.
6.6.3 Timecode in the Standard Two-Channel Interface
In section 4.8 the possibility of including ‘sample address’ codes in the channel status data of the interface was introduced. Such a code is capable of representing a count of the number of samples elapsed since midnight in a binary form, and when there is a simple relationship between the audio sampling frequency and the video frame rate as there is with PAL TV signals, it is relatively straightforward to convert sample address codes into the equivalent value in hours, minutes, seconds and frames used in SMPTE/EBU timecode.
At a sampling frequency of 48 kHz the sample address is updated once every 4 ms, which is ten times per video frame. At NTSC frame rates the transcoding is less easy, especially since NTSC video requires ‘dropframe’ SMPTE timecode to accommodate the non-integer number of frames per second. A further potential difficulty is that the rate of update and method of transcoding for sample address codes is dependent upon the audio sampling frequency, but since this is normally fixed within one studio centre the issue is not particularly important.
Discussions ran for a number of years on an appropriate method for encoding SMPTE/EBU timecodes in a suitable form within the channel status or user bits of the audio interface. At the time of writing there is no published standard for this purpose, although a number of European broadcasters have adopted the non-standard procedure described below. There is also potential for incorporating timecode messages into the HDLC packet scheme determined for the user bit channel in AES18 (see section 4.7.1). The advantage of this option is that the data transfer rate in AES18 is independent of the audio sampling frequency, within limits, so the timecode could possibly be asynchronous with the audio, video, or both, if necessary. Such a scheme makes it difficult to derive a phase reference between the video frame edge and the timecode frame edge, but additional bits are proposed in this option to indicate the phase offset between the two at regular points in the video frame.
An important proposal was made and adopted by a number of European broadcasters [12], which replaced the local sample address code in bytes 14–17 of channel status with four bytes of conventional SMPTE/EBU timecode. These bytes contained the BCD (Binary Coded Decimal) time values for hours, minutes, seconds and frames, as replayed from the device in question, in the following manner:
• Byte 14 = frames (a four-bit BCD value for both tens and units of frames)
• Byte 15 = seconds (ditto for seconds)
• Byte 16 = minutes (ditto for minutes)
• Byte 17 = hours (ditto for hours)
The time-of-day sample address bytes (18–21) are replaced by time-of-day timecode in the same way. At 48 kHz with EBU timecode the timecode value is thus repeated ten times per frame.
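The byte layout listed above can be sketched in a few lines. The text does not say which half of each byte carries the tens digit or how the bits are ordered on the interface, so the sketch assumes, purely for illustration, that the tens digit occupies the upper four bits; the function names are hypothetical.

```python
def to_bcd(value):
    """Pack a value from 0 to 99 as two four-bit BCD digits.
    Assumption: tens in the upper nibble, units in the lower."""
    if not 0 <= value <= 99:
        raise ValueError("BCD byte holds 0 to 99 only")
    tens, units = divmod(value, 10)
    return (tens << 4) | units


def timecode_to_channel_status(hours, minutes, seconds, frames):
    """Return the four bytes destined for channel status bytes 14 to 17,
    in that order: frames, seconds, minutes, hours."""
    return bytes(to_bcd(v) for v in (frames, seconds, minutes, hours))


payload = timecode_to_channel_status(1, 2, 3, 24)
print(payload.hex())
```

With this packing the hexadecimal form of each byte reads directly as the decimal timecode digits, which is the usual attraction of BCD.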
6.7 Compatibility Issues in Audio Interfacing
Despite the many standards relating to audio interfaces, or perhaps because of them, it is possible that practical difficulties may arise when attempting to interconnect two or more devices. The problem of ‘getting devices to talk to each other’ is possibly less serious than it was when this book was first written, but there are still areas of difficulty and the number of interface types is large. Communication problems may usually be boiled down to one of a few common sources of incompatibility. Not only must the user know how to work around basic incompatibilities, but also it is necessary to be aware of the possibility for incorrect communication. Devices may be made to ‘talk’ but something may be lost or gained in the translation!
The majority of this section is devoted to the standard two-channel interface. Communications between devices using dissimilar manufacturer-specific interfaces will nearly always require the use of a format convertor such as those described in section 6.12.1. In addition to the extended discussion of practical interfacing contained in the main text, there is a reference ‘troubleshooting guide’ to audio interfacing at the end of this section. If problems are encountered with communication over computer networks one normally needs to deal with them as computer networking problems rather than audio interfacing problems. This may involve correct configuration of routers, IP addresses, network drivers and ports, among other things. This is a topic in its own right, about which much is written, and will not be covered further here.
6.7.1 Incompatibilities between Devices using the Standard Two-Channel Interface
There are only really two reasons why devices using nominally the same interface will not communicate, summed up as either electrical incompatibility or data incompatibility. There is also, of course, the possibility for sampling frequency incompatibility between devices, but this could be considered as a combination of electrical and data incompatibility. If the two devices are using the same electrical interface, such as AES3 for professional purposes or IEC 60958-3 for consumer purposes, there is only a very small chance of electrical mismatch, particularly now that the more recent versions of the standards have limited the options for electrical incompatibility. The more likely cause of any problems is that differences exist between the data transmitted and that expected by the receiver. If direct links are attempted between consumer and professional equipment then there is potential for both electrical and data incompatibility, as discussed below.
6.7.2 Electrical Mismatch in Professional Systems
Between identical interfaces (transmitter and receiver) the most likely electrical problems to arise are (a) loss over long lines; (b) noise and distortion over long lines; (c) impedance mismatch. These can in general be avoided by good system design and by adhering to the recommendations contained in the appropriate standard, but occasionally one encounters a poor electrical installation or needs to make use of existing wiring which may not be ideal for the job of carrying digital audio. When such electrical problems are encountered the most likely symptom is that the receiver will find it difficult or impossible to lock to the incoming data signal, resulting in intermittent operation or indication of ‘loss of lock’ at the digital input. In practice receivers vary widely in their ability to lock to poor quality signals, and thus it may be found that a signal which works satisfactorily with one receiver proves unsatisfactory with another. Using devices such as those discussed in section 6.9 it is possible to determine the ‘health’ of the received data signal, as well as examining problems with data.
A receiver conforming to AES3 should be able to decode a signal with a minimum eye height of 200 mV, and since the transmitter normally produces at least 5 volts the resistive cable attenuation has to be quite large before it will reduce the eye height below this value. More problematical than simple resistive loss is high-frequency roll-off over a long cable. This may need to be corrected by using suitable equalization at the receiver (see section 4.3.3), although equalization should be treated with care since it will only work if the line is relatively noise free. (If the line is noisy then equalization may actually make the problem worse and it has been suggested that if such equalization is to be used it perhaps belongs at the transmitter end rather than the receiver.)
High-frequency loss affects the narrow pulses of the data stream before the wide ones and these narrow pulses may either fail to provide a zero crossing or even disappear altogether in extreme cases. The greater the HF loss the more likelihood there is of intersymbol interference and data edge timing jitter, making it more difficult for the receiver to lock to the received data. Dunn [1] suggests that the cable used should not exhibit attenuation at 6 MHz which is more than 6 dB greater than that at 1 MHz over the distance used, otherwise equalization will be required (this is for basic sampling rates). Operating the interface at higher data rates than the original specification will put greater demands on this criterion, extending it to higher frequencies. Although conventional analog audio cable can be, and has been, used in many cases, it is becoming common for new installations to be wired for digital audio with cable having higher specifications and a characteristic impedance which is better controlled and closer to 110 ohms than conventional microphone cable.
Impedance mismatches were more likely under the original AES3 specification than they became under the 1992 revision, due to the 250 ohm termination impedance specified in AES3-1985. (This was to allow for between one and four receivers to be connected in parallel across one signal line.) Such mismatches may result in internal reflections such that transmitted pulses are reflected from the receiving end to interfere with pulses travelling in the opposite direction, and the cable may begin to function as an antenna – both picking up and radiating interference. Now that the termination impedance is specified at 110 ohms and point-to-point interconnects are required, it may be necessary to modify the input impedance of older receivers by fitting a parallel resistor of around 200 ohms so as to make the termination nearer to the correct value. If it is necessary to feed more than one receiver from a single driver it is recommended that suitable digital distribution amplifiers (DDAs) are used, or alternatively one could use passive resistive splitters.
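The effect of the suggested parallel resistor can be checked with the usual formulas for resistors in parallel and for the reflection coefficient at a resistive termination. This is a simplified sketch that treats both the cable and the receiver input as purely resistive; the function names are illustrative.

```python
def parallel(r1, r2):
    """Resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def shunt_for_target(r_existing, r_target):
    """Parallel resistor needed to bring r_existing down to r_target."""
    return r_existing * r_target / (r_existing - r_target)


def reflection_coefficient(z_load, z_line):
    """Fraction of the incident voltage reflected at the termination."""
    return (z_load - z_line) / (z_load + z_line)


modified = parallel(250, 200)
print(round(modified, 1))                                  # ohms
print(round(shunt_for_target(250, 110), 1))                # ohms
print(round(reflection_coefficient(250, 110), 3))          # unmodified input
print(round(reflection_coefficient(modified, 110), 3))     # with 200 ohm shunt
```

A 200 ohm resistor across a 250 ohm input gives about 111 ohms, and the exact value needed for 110 ohms is a little over 196 ohms, so the rounded figure is a sensible practical choice. On these assumptions the reflected fraction falls from roughly 39% to well under 1%.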
In order to avoid impedance mismatches in between transmitter and receiver it is important that the cable used is consistent along its length and that joins are not made between dissimilar cable types. Problems may also arise if digital audio signals are routed via analog patch bays in which short sections of cabling with mismatched impedances may exist. Any cable ‘stubs’ in such installations produce short-delayed reflections with fairly high amplitude that can interfere with the transmitted data signal.
Users wishing to carry signals over long distances may wish to consider the possibility of adopting the 75 ohm unbalanced interconnect specified in AES3-ID or SMPTE 276M as an alternative to balanced 110 ohm interfacing. Convertors are available which perform the job very easily. It has been suggested by a number of sources that HF interference such as RF sources is better rejected by the effectiveness of cable screening than by the balance of the electrical interface, and that 75 ohm coax cable has better controlled impedance than audio microphone cable.
A final point to bear in mind is that although transformers are not mandatory in most versions of the two-channel interface standard they may be used to ensure good electrical isolation and earth separation where appropriate. The transformer is standard in the EBU version of the interface, since it was regarded as important in broadcast studio centres where earth continuity is normally avoided between operational areas.
6.7.3 Data Mismatch in Professional Systems
Data mismatch between professional devices using the standard two-channel interface has typically been confined to problems with the implementation of channel status (see section 4.8), but there is also the possibility for differences in sampling rate and audio word length, as discussed below. In more recent systems there are numerous options for higher sampling frequencies, operating the interface in single-channel-double-sampling frequency mode or simply increasing the data rate, which can lead to difficulties in communication. There is also the increased likelihood that interfaces may be operated in the ‘non-audio’ or ‘other purposes’ mode for carrying data-reduced audio signals, as described in Chapter 4. User bit incompatibilities might arise if greater use was made of this channel, but few current professional devices take any notice of the state of the user bit. The validity bit is historically another root of trouble and its handling was discussed in section 4.6. As discussed in section 6.12, a number of manufacturers now produce devices specifically designed to analyse and/or correct for data incompatibilities between pieces of equipment and such ‘fix-it’ boxes can be very useful in encouraging communications between systems.
Incompatibilities in channel status implementation can give rise to a variety of symptoms ranging from complete failure of communication to seemingly correct but actually improper communication. The reason that such incompatibilities have arisen is largely that the original AES3 specification was less than specific on how devices should set channel status bits that were not used, as well as how they should respond when receiving data that they were incapable of handling. Consequently all sorts of channel status implementations exist in commercial products, although it should be said that most of the time devices communicate without problems. Because of these potential difficulties, AES3-1992 was more specific about channel status implementation, specifying three levels of implementation depending upon the application (see section 4.8.4). As a result there should be less difficulty between modern devices than between older ones.
The most common problem areas in channel status implementation were historically (a) in the CRCC byte (byte 23); (b) in the signalling of pre-emphasis; (c) in the setting of the consumer/professional flag; and (d) in the indication of sampling rate. Less common problems arose when the left and right channels had different channel status data (making it difficult to decide which was correct) or where the channel status block was the wrong length (one famous example exists of a 191-bit channel status block!).
The problem with the CRCC byte was that not all transmitting devices included it at the end of the channel status block and some early devices implemented it incorrectly, thereby confusing those that expected to see the correct CRCC data. Receivers that check the CRCC will indicate an almost continuous CRC error when decoding an input signal that does not contain CRCC or where it is incorrectly implemented. The reaction of such a receiver will vary from complete refusal to accept the signal to acceptance of the signal whilst flagging a CRC error. It is impossible to state what the ‘correct’ response should be in such a case, as there is no way of telling whether the channel status data is correct or not. Interface signal processors such as the ‘fix-it’ boxes mentioned above often perform the useful function of inserting correct CRCC data into the channel status blocks of signals that lack it. It might reasonably be assumed that a device seeing CRCC bytes repeatedly set to zero would assume that it was not in use and thereafter ignore it, but this is rarely the case at present.
The type of pre-emphasis used should be indicated in bits 2–4 of byte 0 of channel status and it is possible that emphasis may have been applied to a signal without indicating this in channel status. Such incorrectness will not prevent the interface from working but may give rise to a pre-emphasized signal being carried through further stages in the signal chain without being de-emphasized. The only way to tell if a signal is pre-emphasized (when it is not indicated) is really to listen to it, since pre-emphasized signals will have an exaggerated HF response. The correct pre-emphasis flags may be set using a suitable interface processor, or the signal may be de-emphasized in the digital domain and the flags set to the no emphasis state.
Apart from the consumer/professional flag (discussed below), the sampling frequency indication is the other main area of difficulty in channel status. Not all devices indicate the sampling frequency of the signal in bits 6 and 7 of byte 0, and this can cause a lack of communication when received by a device expecting to see such data. Japanese devices in particular often will not accept data if it does not have the sampling frequency flags set correctly. There is also now the additional indication of sampling frequency in byte 4 of channel status (AES3, Amendment 3 – 1999) to complicate matters, although this is not a requirement for correct functioning of the interface. It is difficult for a receiver to know what to do when presented with a sampling rate flag that contradicts the true audio sampling rate. In such cases it might be suggested that the device should rely on its detection of the true rate rather than the indicated rate, whilst perhaps flagging a mismatch on the front panel. Some confusion also exists over the interpretation of bit 5, byte 0, which indicates ‘source sampling frequency unlocked’. The question is ‘with reference to what is the sampling frequency unlocked?’ – the internal clock, an external reference? When this bit is set to ‘0’ (the default state) nothing can be concluded about the locked state of the sample clock and many systems do not set this bit anyway. When set to ‘1’ all one can say is that there is some problem with the lock of the sample clock and that its frequency may not be relied upon. In systems synchronized to a master reference signal its presence could be used to indicate that there was a free-running clock in a source device earlier in the signal chain.
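The fields of byte 0 mentioned in this section can be separated with simple bit masking, as in the sketch below. It assumes that the byte has been captured as an integer with channel status bit 0 in the least significant position, which is a convention of this example rather than something the interface dictates. The emphasis and sampling frequency fields are returned as raw bits because their codes are not listed in this section, and the function name is hypothetical.

```python
def parse_channel_status_byte0(byte0):
    """Split professional channel status byte 0 into the fields
    discussed in the text. Bit 0 is taken as the LSB (assumption)."""
    def bit(n):
        return (byte0 >> n) & 1

    return {
        "professional": bool(bit(0)),           # 0 indicates consumer data
        "emphasis_bits": (bit(2), bit(3), bit(4)),
        "source_fs_unlocked": bool(bit(5)),     # 1 flags a lock problem
        "fs_bits": (bit(6), bit(7)),
    }


# Professional flag set, lock flag set, everything else zero.
fields = parse_channel_status_byte0(0b0010_0001)
print(fields)
```

A receiver following the suggestion above would compare the indicated rate with the rate it actually measures and rely on the measurement when the two disagree.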
6.7.4 Electrical Mismatch Between Consumer and Professional Systems
As described in Chapter 4 there is so much similarity between the consumer and professional interfaces that it is tempting to think that consumer devices can be connected directly to professional systems or vice versa. Indeed there is a strong operational motive for this because consumer and professional digital equipment are often used together in studios and programme material is often copied between systems. The problem is that although it is possible in some cases to make the electrical interconnection work, there are other difficulties to contend with such as the almost total dissimilarity in channel status and user bits. There is also the possibility that the consumer device may have a much less stable sample clock and be unable to lock to an external reference, giving rise either to problems in decoding or the danger that sample clock jitter will be passed on to other devices in the professional system. Ideally, therefore, one should not attempt the direct interconnection of consumer and professional equipment, preferring rather to use one of the many interface convertors that exist on the market which will set the necessary channel status bits correctly and convert the signal to the new electrical format. It may also be necessary to resynchronize the signal from a consumer device using a buffer store or sample rate convertor (see section 6.5) by locking it to the reference signal of the professional system, thereby ensuring that its sampling rate is the same as that of the professional system and hopefully removing any excessive clock jitter.
For cases in which the only solution is to attempt direct interconnections some guidelines will be given. Clearly such set-ups should be viewed as temporary and allow for the possible incompatibilities in channel status. Consumer-to-professional electrical connection is often possible because the consumer interface peak output voltage is around 0.5 V and the minimum allowed input voltage to a professional system is 0.2 V. Provided that the wire is not too long the signal may be decoded. As shown in Figure 6.12 the centre core of the consumer coaxial lead may be connected to pin 2 of the professional XLR and the shield to pins 1 and 3. Clearly there will be an impedance mismatch and commercial impedance transformers are available to convert between either 75 and 250 ohms or 75 and 110 ohms. One Japanese manufacturer recommends the circuit shown in Figure 6.13 to balance a consumer output for carrying it over longer distances.
Professional-to-consumer connection may also work, again depending on channel status, since a consumer input is not normally damaged by the higher professional signal voltage. Two circuits are shown in Figure 6.14, depending on whether the professional output is transformer balanced (floating) or driven directly from TTL-level chips (balanced, but not floating). It is also possible to convert signals between consumer electrical and optical formats. Suggested circuits are shown in Figure 6.15.
Unfortunately the original IEC 958 did not exclude the possibility of using the Type 2 electrical interface with professional data, or indeed vice versa (the two subjects were entirely separate in the document). This occasionally led to some unexpected implementations. Although one normally expects to find professional data coming out of XLR connectors there are cases where manufacturers or dealers have simply taken a consumer device and provided it with a so-called ‘professional’ output by feeding the consumer output via an RS-422 driver to an XLR connector. This may be cheap but it is to be discouraged since it leads users to think that the equipment may be connected directly to professional systems, whereas in practice the professional system may refuse to accept it due to the consumer flag being set in the first bit of channel status. If the data is accepted it is possible that further difficulty may arise due to the misinterpretation of channel status data.
6.7.5 Data Mismatch Between Consumer and Professional Systems
Although the audio part of the subframe is to all intents and purposes identical between the two interfaces there are key differences in the channel status data that have already been described in theory in Chapter 4. The success or otherwise of directly interconnecting consumer and professional equipment depends to a large extent on how this data is interpreted.
Some professional systems are provided with both consumer and professional interfaces and clearly this offers the ideal solution, but some older equipment had only one electrical interface with a switch to select ‘consumer’ or ‘professional’ data characteristics. Often the ‘consumer’ implementation simply meant that it would accept data with the first bit of channel status set to zero, ignoring most or all of the following data. The similarity in channel status between consumer and professional runs out fairly quickly after the first couple of bits. One of the most common problems is in the signalling of pre-emphasis. When a professional signal with ‘emphasis not indicated’ is received on a consumer device without any intervention, it will assume that the copy protection flag is being asserted and prevent the signal being recorded (depending on SCMS, as described in section 4.8.7). Similarly, a copy-protected consumer recording being copied directly to a professional system without intervention would force the professional device to a ‘no emphasis’ state, although a non-copy-protected consumer recording would normally be OK and would set the emphasis state correctly.
Past the emphasis bits there is no similarity at all between the interfaces, and thus it is difficult to say exactly what the results of one system interpreting the other’s channel status data would be. Ideally a receiver should check the first bit of channel status to detect the consumer or professional nature of the data, and then switch its implementation automatically to interpret it accordingly.
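The automatic switching suggested here can be sketched in a few lines. This is only an illustration, not a complete decoder: the function name is hypothetical, and the bit layout assumed for the first few channel status bits (bit 0 consumer/professional, bit 1 audio/non-audio, professional emphasis code in bits 2 to 4, consumer copy flag in bit 2 and consumer emphasis flag in bit 3) is an assumption that should be checked against the current standards before being relied upon.

```python
def interpret_channel_status(bits):
    """Interpret the first channel status bits, given in transmission order."""
    professional = bits[0] == 1
    info = {"professional": professional, "non_audio": bits[1] == 1}
    if professional:
        # Assumed professional layout: bits 2-4 carry the emphasis code.
        emphasis_codes = {
            (0, 0, 0): "not indicated",
            (1, 0, 0): "none",
            (1, 1, 0): "50/15 us",
            (1, 1, 1): "CCITT J.17",
        }
        info["emphasis"] = emphasis_codes.get(tuple(bits[2:5]), "reserved")
    else:
        # Assumed consumer layout: bit 2 = 0 asserts copy protection,
        # bit 3 = 1 signals 50/15 us pre-emphasis.
        info["copy_protected"] = bits[2] == 0
        info["emphasis"] = "50/15 us" if bits[3] == 1 else "none"
    return info


# A professional stream flagged 'emphasis not indicated' ...
pro_bits = [1, 0, 0, 0, 0, 0, 0, 0]
correct = interpret_channel_status(pro_bits)
# ... and the same bits as seen by a receiver that wrongly treats them as consumer data.
misread = interpret_channel_status([0] + pro_bits[1:])
```

Under these assumptions `correct` reports the emphasis as not indicated, whereas `misread` reports the signal as copy protected, which is the mismatch described in the preceding section.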
6.8 Handling Differences in Audio Signal Rate and Resolution
Because of the variety of sampling rates in use in digital audio and the increasing use of audio sample resolutions beyond 16 bits it is important to ensure that the audio signal retains optimum sound quality when it is converted digitally from one rate or resolution to another. The question of sample rate conversion has already been covered to some extent earlier in this chapter, since it is closely related to the topic of synchronization. There is little more to be said here except to state that normally it is impossible to interconnect two devices digitally whose sampling rates differ by more than a tiny amount from one another, requiring that a sample rate convertor be used between the two. The question of differences in sample word length, though, will be covered in more detail.
The standard two-channel interface allows for up to 24 bits of audio per sample, linearly encoded in fixed point form. Until recently only 16 of these were normally used, with the remaining bits set to zero and the MSB of the 16-bit sample in the bit 27 position, but the question now arises as to how to cope with signals of, say, 18- or 20-bit resolution when they are digitally connected to devices of lower resolution. A number of techniques can be used to process, say, a 20-bit signal to reduce its resolution to 16 bits. These range from straightforward truncation, through bit-shifting and redithering at the new resolution, to developments which involve intelligent rounding of the truncation error by noise shaping. In future it may be that professional digital audio devices will incorporate internal intelligent procedures to handle signals of a higher resolution than their internal architecture allows, but at the moment it is normally necessary to employ external processing of some kind at such a juncture.
Truncation is the worst possible solution and involves simply losing the least significant bits of the word. Without redithering the result of truncation is very unpleasant low-level distortion. If a 20-bit source were connected digitally to a 16-bit destination without any intermediate processing the result would normally be the straightforward truncation of the four LSBs.
The addition of dither noise in the digital domain at the point where resolution is reduced is a suitable means of improving the distortion situation and this has been implemented on some digital interface processors and in professional digital mixers. Some editors also have various dithering algorithms for this purpose. The process randomizes the quantizing error that results from word-length reduction by adding a pseudorandom number sequence of controlled amplitude and spectrum to the incoming audio data. In addition, if the full dynamic range of the digital signal has not been used by the programme material (if headroom has been left, for example) it may be possible to bit-shift the 20-bit samples upwards before truncating and redithering. That way more of the MSBs are used and less of the information contained in the LSBs is lost. This is achieved by a simple increase of gain in the digital domain prior to 16-bit transfer.
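The three approaches to reducing a 20-bit sample to 16 bits can be compared in a short sketch. The function names are hypothetical, and the choice of triangular (TPDF) dither with a peak amplitude of one 16-bit LSB is an assumption made for illustration; real processors may use other dither amplitudes and spectra.

```python
import random


def truncate(sample20, drop_bits=4):
    """Discard the LSBs of a 20-bit sample with no dither."""
    return sample20 >> drop_bits


def redither(sample20, drop_bits=4, rng=random):
    """Add triangular dither at the new resolution, then round to 16 bits."""
    step = 1 << drop_bits  # one 16-bit LSB expressed in 20-bit units
    # Sum of two uniform values gives a triangular PDF spanning +/-1 LSB.
    dither = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
    value = round(sample20 / step + dither)
    return max(-32768, min(32767, value))  # clip to the 16-bit range


def shift_up(sample20, gain_bits):
    """Apply 6 dB of digital gain per bit; only valid if headroom was left."""
    shifted = sample20 << gain_bits
    if not -(1 << 19) <= shifted < (1 << 19):
        raise OverflowError("programme peaks exceed the available headroom")
    return shifted


# A very low-level 20-bit value, smaller than one 16-bit LSB (16 units).
rng = random.Random(0)
quiet = 5
truncated = truncate(quiet)
dithered_mean = sum(redither(quiet, rng=rng) for _ in range(20000)) / 20000
```

Truncation returns zero for this input, so the information is lost entirely, whereas the long-term average of the dithered output approaches 5/16 of a 16-bit LSB: the low-level information survives as a signal buried in noise rather than being converted into distortion.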
In the 1992 revision of the AES3 interface standard, provision is made for much more careful definition of the sample word length, such that receiving devices may optimize the transfer of data from a transmitter of different resolution. Standardization work has also gone on within the EBU to determine how analog signal levels should relate to digital signal levels, especially since 20-bit recording can be used on the audio tracks of digital video recorders. The conclusion reached was that one could not rely on correct implementation of byte 2 of channel status in devices using the AES3 interface in all cases, especially in older equipment. Irrespective of the number of bits, the only practical argument was for a fixed relationship to be used between analog and digital levels [13].
Originally EBU Recommendation R-64 specified that analog alignment level (corresponding to a meter reading of PPM4 or 0 dBu electrically) should be set to read 12 dB below full scale (i.e. −12 dB FS) on a digital system. This was based on the dynamic range available from typical 16-bit convertors, and assumed that the finished programme’s level would be well controlled. Since then 16-bit convertor technology has improved and because it was necessary to use the same alignment for 20-bit systems the new recommendation now specifies alignment level to be 18 dB below full scale. This allows for an additional 6 dB of operational headroom.
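The arithmetic behind the change of alignment level can be written out explicitly. The function names below are hypothetical; the figures are those quoted above (alignment level of 0 dBu placed at either 12 dB or 18 dB below full scale).

```python
def dbu_to_dbfs(level_dbu, alignment_dbfs=-18.0, alignment_dbu=0.0):
    """Map an analog level in dBu to dB FS for a fixed alignment."""
    return level_dbu - alignment_dbu + alignment_dbfs


def headroom_db(alignment_dbfs):
    """Headroom between alignment level and digital full scale."""
    return -alignment_dbfs


old = headroom_db(-12.0)  # original recommendation
new = headroom_db(-18.0)  # revised recommendation
extra = new - old         # additional operational headroom in dB
```

With the revised alignment, an analog peak of +18 dBu just reaches digital full scale, compared with +12 dBu under the original recommendation.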
6.9 Analysing the Digital Audio Interface
Because of the wide variety of implementations possible in the standard interfaces and because of the need to test digital interface signals to determine their ‘health’ as electrical signals, there have arisen a number of items of test equipment which may be used to analyse the characteristics of the signal. The following is just a short summary of testing techniques, more detailed coverage of which may be found in Cabot [14], Blair et al. [15], Mornington West [16] and Stone [17].
6.9.1 Eye Pattern and Pulse-Width Testing
The eye pattern of a two-channel interface signal is a rough guide to its electrical ‘health’ and gives a clue to the likelihood of it being decoded correctly. In a test set-up described by Cabot, pictured in Figure 6.16, it is possible to vary the attenuation and high-frequency roll-off over a digital link so as to ‘close the eye’ of a random digital audio signal to the limits of the AES3 specification. Having done this one may then verify whether a receiver correctly decodes the data and thus whether it is within the specification. The testing of eye patterns requires a high quality oscilloscope whose trigger input is derived from a stable clock locked to the source clock rate, preferably at a submultiple of it. Alternatively, provided that the clock recovery in the receiver is of high quality and rejects interconnect jitter it may be possible to use this, with the proviso that the eye pattern’s reliability will be affected by any instability in the trigger signal.
An alternative to eye pattern testing on an oscilloscope is the use of a stand-alone interface analyser such as that described by Blair et al., capable of displaying the amplitude and ‘pulse smear’ of the signal on an LCD display with relation to either an internal or external reference signal.
It has been suggested by Kondakor [18] of the BBC Designs and Equipment Department that eye patterns may not always indicate the ‘decodability’ of a signal, since he has found examples of seemingly good eye patterns which still give problems at the receiver. Consequently a device has been built that measures the variations in pulse width of received signals (a parameter which becomes greater on poor electrical links), displaying the result on a simple visual display which indicates variations in the three possible bit-cell timings (the ‘1’, ‘0’ and preamble wide pulses) in the form of bar lengths. It is claimed that this gives quick and easy verification of the reliability of received signals, and correlates well with the likelihood that a signal will be decoded.
6.9.2 Security Margin Estimation
Often it is necessary to get a rough idea of how close an interface is to failure and one way of doing this is to introduce a broadband attenuator into the link at the receiving end (with sufficient bandwidth to accommodate the digital audio signal). The attenuation is gradually increased until the receiver fails to decode the signal. The amount of attenuation that can be introduced without the link failing is then a rough guide to the margin of security. An alternative to this method is for test equipment to examine the narrow pulses that follow the extra wide pulses in the subframe preambles of the data. Due to intersymbol interference these are usually the first to be reduced in width and level, or even disappear in cases of poor links and limited link bandwidth. Some interface receiver ICs use this criterion as a measure of the security margin in hand.
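The attenuator method amounts to a simple search, sketched below. The function names, the step size and the receiver model (a hypothetical receiver that decodes correctly down to 0.25 V peak-to-peak from a 2 V source) are all assumptions chosen for illustration; in practice `decodes_ok` would be replaced by an observation of the real receiver.

```python
def security_margin_db(decodes_ok, step_db=0.5, max_db=40.0):
    """Return the largest inserted attenuation at which decoding still succeeds.

    decodes_ok takes the inserted attenuation in dB and returns True while
    the receiver continues to decode the signal correctly.
    """
    margin = 0.0
    attenuation = step_db
    while attenuation <= max_db and decodes_ok(attenuation):
        margin = attenuation
        attenuation += step_db
    return margin


def model_receiver(attenuation_db, source_vpp=2.0, min_vpp=0.25):
    """Hypothetical receiver that fails below a fixed input amplitude."""
    return source_vpp * 10 ** (-attenuation_db / 20) >= min_vpp


margin = security_margin_db(model_receiver)
```

For this model the search stops at 18 dB, since the next half-decibel step takes the signal below the assumed threshold.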
6.9.3 Error Checking
Digital interfaces are designed to be used in an error-free environment; in other words, errors are not anticipated over digital links and there is no means of correcting them. This is an achievable situation in well-designed systems: above a certain signal-to-noise ratio one may expect no errors, but below this ratio the error rate will rise quite quickly. The error rate of a digital interface can be checked by a number of means. Common to all the techniques is that a known bit pattern is generated by the test equipment and the received pattern is then compared against the generated pattern to check for data errors. It is then possible to inject controlled levels and bandwidths of artificially generated noise into the interface and observe the effect on the error rate. A novel means of checking for errors, also described by Cabot, is to use a sine-wave test signal and to monitor the digital THD+N (total harmonic distortion plus noise) reading at the receiver on suitable test equipment. Any error gives rise to a spike in the signal, resulting in a momentary increase in distortion which can be monitored at the output of the analyser’s notch filter, which removes the test tone. He points out that errors in LSBs will produce less of an effect on the THD+N reading than errors in MSBs. By correlating errors with the data pattern at the time of the error it is possible to determine whether they are data dependent.
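The pattern-comparison technique common to these tests can be sketched as follows. The function names are hypothetical, and the link is modelled very crudely as flipping each bit independently with a fixed probability, which is an assumption made purely so that the example is self-contained.

```python
import random


def make_pattern(n, seed=1):
    """Generate n pseudorandom test bits; the same seed reproduces the pattern."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def noisy_link(bits, error_prob, seed=2):
    """Hypothetical link model that flips each bit with probability error_prob."""
    rng = random.Random(seed)
    return [b ^ (rng.random() < error_prob) for b in bits]


def error_positions(sent, received):
    """Compare the received data with the generated pattern, bit by bit."""
    return [i for i, (s, r) in enumerate(zip(sent, received)) if s != r]


sent = make_pattern(100_000)
clean_errors = error_positions(sent, noisy_link(sent, 0.0))
noisy_errors = error_positions(sent, noisy_link(sent, 1e-3))
error_rate = len(noisy_errors) / len(sent)
```

Keeping the positions of the errors, rather than only counting them, is what allows them to be correlated afterwards with the data pattern at the time of each error.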
6.9.4 Other Tests
Further checks on the interface may be performed on commercial test equipment, such as the accuracy of the sample clock (compared against either an external or calibrated internal reference), measurements of data cell jitter and common mode signal level. Test equipment can also show the states of all the information in channel status and other auxiliary bits if required, many systems converting these into useful human-readable indication. Examples exist of systems that will analyse the incoming channel status and other data and modify it to suit the established requirements of the receiver in order to set up proper communications in problem situations.
6.10 Interface Transceiver Chips
A number of dedicated ICs are now available for receiving and transmitting standard two-channel interface data, many operating in either consumer or professional modes. Such chips are large-scale integrated circuits (LSIs) which take care of low jitter clock recovery, automatic CRCC checking and generation, automatic sample rate selection and channel status control, among other features. Examples of such devices are the Cirrus Logic (formerly Crystal Semiconductor) series including the CS8427 transceiver chip that will transmit and receive consumer or professional signals at sampling frequencies up to 96 kHz. In such chips the clock recovery is carefully controlled in order to extract a low jitter clock from the interface signal for use by the audio system. The interface between two-channel transceiver chips and the internal signal processing of the equipment is normally in a serial form that can be accepted directly by DSP devices such as the Motorola DSP 56000.
Multichannel interfacing using MADI is based on the FDDI (Fibre Distributed Data Interface) standard and requires the use of so-called ‘TAXI’ chips designed by AMD (Advanced Micro Devices). The Am7968 and Am7969 chips can be used to handle transmission and reception of the high bit rate data from either optical fibre or copper cables, transferring this data normally in parallel form to and from the internal signal processing of the device in question in order to achieve the high transfer rate necessary.
6.11 Routers and Switchers
There are essentially two ways by which standard two-channel interface signals may be routed and switched in systems such as broadcast studio centres where a number of sources and destinations exist. One technique is to use a TDM (Time-Division Multiplexed) switcher, and the other is to use a conventional logic-based crosspoint router. Which is appropriate depends largely on the importance of ‘silent’ switching, such as might be required for ‘hot’ or ‘on-air’ applications, since in a router used simply for assigning signals between sources and destinations (the equivalent to an analog jackfield) it may not matter that discontinuities arise when routing is altered.
In a TDM switcher all the inputs must be synchronous and must be tightly phased inside the router to allow switching at precise timing points within the AES/EBU frame. The inputs are decoded and time-division multiplexed onto a fast parallel bus (as shown in Figure 6.17), allowing output routing to be determined by extracting data in the appropriate channel time slot on the bus and assigning it to a particular output. As suggested by de Jaham [19], the TDM switcher is appropriate when a large number of sources and destinations are involved because the physical space occupied ‘per switching point’ is smaller compared with a simple crosspoint router, yet the TDM switcher is complex in design and not usually cost effective for a small to medium number of inputs.
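The principle of time-slot routing can be shown with a small model. The function names and the representation of each input as a list of decoded frames are assumptions for illustration; a real switcher performs the same operation in hardware on a fast parallel bus.

```python
def tdm_multiplex(inputs):
    """Interleave synchronous inputs onto a bus: one time slot per input per frame."""
    n_frames = len(inputs[0])
    if any(len(ch) != n_frames for ch in inputs):
        raise ValueError("TDM inputs must be synchronous (equal frame counts)")
    return [[ch[f] for ch in inputs] for f in range(n_frames)]


def tdm_route(bus, routing):
    """Build each output by reading its assigned time slot in every bus frame.

    routing[k] is the number of the input slot that feeds output k.
    """
    return [[frame[slot] for frame in bus] for slot in routing]


inputs = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
bus = tdm_multiplex(inputs)
outputs = tdm_route(bus, routing=[2, 0, 0])
```

Changing an entry in `routing` between frames alters the source of that output on a frame boundary, which is why switching in such a system can be made silent provided that the inputs are synchronous.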
The crosspoint router may be a simple matrix of inputs and outputs, allowing any input to be routed to any output by selecting the appropriate ‘crosspoint’ between the two, using switching logic (see Figure 6.18). Furthermore, the signals may remain in the serial modulated form. One of the great advantages of the crosspoint router is that the inputs do not need to be synchronous, and it could even handle signals of different sampling rates. The problem with such a router is that the precise timing of the crosspoint may be difficult to control, so a glitch may result in the output signal at the time of switching. Evans [12] suggests that the typical level of such glitches is around −50 dBu in the analog domain, but this of course depends on the relationship between analog and digital signal levels. Roe [20] indicates that clicks caused by such switching are usually masked by the digital filtering used in D/A convertors to a level up to around 50 dB below peak. A reframer may be used to detect and remove the corrupted sample from the data stream where noise-free switching is required. De Jaham also suggests that it is possible to mark the digital frames which have been corrupted at the switching point so that they may be suitably concealed by equipment further down the signal chain, but points out that the process of marking is covered by patents belonging to TDF (Télédiffusion de France).
6.12 Other Useful Products
6.12.1 Interface Format Convertors
When operating in a mixed interface format environment, such as when wishing to copy a recording between a consumer and a professional system or when interfacing equipment of different formats, it may be necessary to employ a suitable format convertor. There are a number of such devices on the market, accepting inputs in one of a selection of standard formats (e.g. ADAT, TDIF, SPDIF, AES/EBU) and converting them to one of a selection of output formats. Some more expensive systems also offer synchronization and sample rate conversion so that the system can transfer signals to systems of different sample rates, as well as different formats.
Often interface processors will take care of more subtle but extremely useful functions such as adding correct CRCC information to channel status and checking for correct implementation of the standard. They may also allow the user to alter some or all of the channel status bits manually. With the advent of recording CD machines (CD-R) there have also arisen a number of systems capable of extracting the track start ID information from the user bits of the consumer digital interface, using this information to increment the track ID information of the destination format correctly, such as when a DAT tape is copied to a CD or vice versa. There is also the possibility that the processor may remove SCMS copy protection (see section 4.8.7) to allow professional users the flexibility that was lost with the introduction of the copy management system. Clearly such a device should be used with due regard to copyright laws.
6.12.2 Digital Headphones
Once signals are interfaced digitally in a studio centre it becomes more difficult to monitor the signal audibly. In analog systems a pair of high impedance headphones can simply be plugged into a jackfield to determine the presence or lack of a signal, but clearly this would not work in a digital system. There exist a small number of digital headphone products that contain an AES/EBU decoder and a built-in D/A convertor, allowing the user to treat digital signals in a similar way to analog signals.
6.13 A Brief Troubleshooting Guide
If a digital interface between two devices appears not to be working it could be due to one or more of the following conditions. The reader should refer to the main text for more detailed explanation of the conditions described.
Asynchronous Sample Rates
The two devices must normally operate at the same sampling frequency, preferably locked to a common reference. Ensure that the receiver is in external sync mode and that a synchronizing signal (common to the transmitter) is present at the receiver’s sync input. If the incoming signal’s transmitter cannot be locked to the reference it must be resynchronized or sample rate converted. Alternatively, set the receiver to genlock to the clock contained in the digital audio input (standard two-channel interfaces only).
The ‘Sync’ or ‘Locked’ indicator flashing in or out on the receiver normally means that no sync reference exists or that it is different from that of the signal at the digital input. Check that sync reference and input are at the correct rate and locked to the same source. Decide on whether to use internal or external sync reference, depending on application.
If problems with ‘good lock’ or drifting offset arise when locking to other machines or when editing, check that any timecode is synchronous with the video and sampling rate. If not, the tape must be restriped with timecode locked to the same reference as the recorder, or a synchronizer used which will lock a digital audio input to the rate dictated by a timecode input.
The transmitter may be operating in the AES3 single-channel-double-sampling-frequency mode in which case successive subframes will carry adjacent samples of a single channel at twice the normal sampling frequency. This might sound like audio pitch-shifted downwards if decoded and converted by a standard receiver incapable of recognizing this mode. Alternatively the devices may be operating at entirely different sampling frequencies and therefore not communicating.
Digital Input
It may be that the receiver is not switched to accept a digital input.
Data Format
Received data is in the wrong format. Both transmitter and receiver must operate to the same format. Conflicts may exist in such areas as channel status, and there may be a consumer–professional conflict. Use a format convertor to set the necessary flags.
Non-Audio or ‘Other Uses’ Set
The data transmitted over the interface may be data-reduced audio, such as AC-3 or DTS format. It can only be decoded by receivers specially designed for the task. The data will sound like noise if it is decoded and converted by a standard linear PCM receiver, but in such receivers it will normally be muted because of the indication in channel status and/or the validity bit.
Cables and Connectors
Cables or connectors may be damaged or incorrectly wired. The cable may be too long, of the wrong impedance, or generally of poor quality. The digital signal may be of poor quality. Check eye height on the scope against specification and check for possible noise and interference sources. Alternatively make use of an interface analyser.
SCMS (Consumer Interface Only)
The copy protect or SCMS flag may be set by the transmitter. For professional purposes, use a format convertor to set the necessary flags or use the professional interface which is not subject to SCMS.
Receiver Mode
The receiver is not in record or input monitor mode. Some recorders must be at least in record–pause before they will give an audible and metered output derived from a digital input.
References
1. AES, AES11-1997. Synchronization of digital audio equipment in studio operations (1997)
2. AES, AES5-1984. AES recommended practice for professional digital audio applications employing pulse code modulation – preferred sampling frequencies. Journal of the Audio Engineering Society, vol. 32, pp. 781–785 (1984)
3. Shelton, W.T., Synchronization of digital audio. In Proceedings of the AES/EBU Interface Conference, 12–13 September, London, pp. 92–116, Audio Engineering Society British Section (1989)
4. Dunn, N.J., Jitter: specification and assessment in digital audio equipment. Presented at the 93rd AES Convention, San Francisco, 1–4 October, preprint no. 3361 (C-2) (1992)
5. Dunn, N.J., Considerations for interfacing digital audio equipment to the standards AES3, AES5 and AES11. In Proceedings of the AES 10th International Conference, 7–9 September, pp. 115–126 (1991)
6. Dunn, C. and Hawksford, M.O., Is the AES/EBU/SPDIF digital audio interface flawed? Presented at the 93rd AES Convention, San Francisco, 1–4 October, preprint no. 3360 (C-1) (1992)
7. Gilchrist, N.H.C., Sampling-rate synchronization of digital sound signals by variable delay. EBU Technical Review, no. 183, October (1980)
8. Parker, M., Sample frequency conversion, sample slippage, pitch changing and varispeed. In Proceedings of the AES 10th International Conference, 7–9 September, p. T-69 (1991)
9. Shelton, W.T., Timing inter-relations for audio with video. In Proceedings of the AES 9th International Conference, 1–2 February, pp. 31–44 (1991)
10. Bensberg, G., Time for digital audio within television. Presented at the 92nd AES Convention, Vienna, 24–27 March, preprint no. 3258 (1992)
11. Komly, A. and Viallevielle, A., Synchronization and time codes in the user channel. Proposal submitted to AES SC 2-5-1 working party on synchronization, Paris, March (1990)
12. Evans, P., Digital audio in the broadcast centre. EBU Technical Review, no. 241/242, June–August (1990)
13. Møller, L., Signal levels across the EBU/AES digital audio interface. In Proceedings of the 1st NAB Radio Montreux Symposium, Montreux, Switzerland, 10–13 June, pp. 16–28, National Association of Broadcasters (1992)
14. Cabot, R., Measuring AES/EBU digital audio interfaces. Journal of the Audio Engineering Society, vol. 38, no. 6, October, pp. 459–467 (1989)
15. Blair, I. et al., New techniques in analysing the digital audio interface. Presented at the 92nd AES Convention, Vienna, 24–27 March, preprint no. 3230 (1992)
16. Mornington West, A., Signal analysis. In Proceedings of the AES/EBU Interface Conference, 12–13 September, pp. 83–91, Audio Engineering Society British Section (1989)
17. Stone, D., Digital signal analysis. International Broadcast Engineer, November (1992)
18. Kondakor, K., A pulse width analyser for the rapid testing of the AES/EBU serial digital audio signals. Presented at the 93rd AES Convention, San Francisco, 1–4 October (1992)
19. De Jaham, S., Asynchronous routing. In Proceedings of the AES/EBU Interface Conference, 12–13 September, pp. 64–68, Audio Engineering Society British Section (1989)
20. Roe, G., Integrated routing in a hybrid environment. International Broadcast Engineer, July, pp. 44–46 (1991)