Chapter 10

Digital audio applications

This chapter concerns some practical digital audio applications, in particular reviewing concepts in editing and mastering as well as the basic principles of interconnection and networking.

Editing software

It is increasingly common for MIDI (see Chapter 14) and digital audio editing to be integrated within one software package, particularly for pop music recording and other multitrack productions where control of electronic sound sources is integrated with recorded natural sounds. Such applications used to be called sequencers but this is less common now that MIDI sequencing is only one of many tasks that are possible. Although most sequencers contain some form of audio editing these days, there are some software applications more specifically targeted at high quality audio editing and production. These have tended to come from a professional audio background rather than a MIDI sequencing background, although the two fields have now met in the middle and it is increasingly hard to distinguish a MIDI sequencer with added audio features from an audio editor with added MIDI features.

Audio applications such as those described here are used in contexts where MIDI is not particularly important and where fine control over editing crossfades, dithering, mixing, mastering and post-production functions is required. Here the editor needs tools for such things as: previewing and trimming edits, such as might be necessary in classical music post-production; PQ editing CD masters; preparing surround sound DVD material for encoding; MLP or AC-3 encoding of audio material; editing of DSD material for SuperAudio CD. The following example, based on the SADiE audio editing system, demonstrates some of the practical concepts.

SADiE workstations run on the PC platform and most utilise an external audio interface. Recent Series 5 systems, however, can be constructed as an integrated rack-mounted unit containing audio interfaces and a Pentium PC. Both PCM and DSD signal processing options are available and the system makes provision for lossless MLP encoding for DVD-Audio, as well as SACD mastering and encoding. A typical user interface for SADiE is shown in Figure 10.1. It is possible to see transport controls, the mixer interface and the playlist display. The main part of the screen is occupied by a horizontal display of recording tracks or ‘streams’, and these are analogous to the tracks of a multitrack tape recorder. A record icon associated with each stream is used to arm it ready for recording. As recording proceeds, the empty streams are filled from left to right across the screen in real time, led by a vertical moving cursor. These streams can be displayed either as solid continuous blocks or as waveforms, the latter being the usual mode when editing is undertaken. After recording, extra streams can be recorded if required simply by disarming the record icons of the streams already used and arming the record icons of empty streams below them, making it possible to build up a large number of ‘virtual’ tracks as required. The maximum number that can be replayed simultaneously depends upon the memory and DSP capacity of the system used. A basic two-input/four-output system might allow up to eight streams to be replayed (depending on the amount of DSP being used for other tasks), and a fully equipped system can allow at least 32 simultaneous streams of programme material to be recorded and replayed, i.e. it is a complete multitrack recording machine.


Figure 10.1   SADiE editor displays, showing mixer, playlist, transport controls and project elements

Replay involves either using the transport control display or clicking the mouse at a desired position on a time-bar towards the top of the screen, which positions the moving cursor (analogous to a tape head) where one wishes replay to begin. Editing is performed by means of a razor-blade icon, which will make the cut where the moving cursor is positioned. Alternatively, an edit icon can be loaded to the mouse’s cursor for positioning anywhere on any individual stream to make a cut.


Figure 10.2   SADiE trim window showing crossfade controls and waveform display

Audio can be arranged in the playlist by the normal processes of placing, dragging, copying and pasting, and there is a range of options for slipping material left or right in the list to accommodate new material (this ensures that all previous edits remain attached in the right way when the list is slipped backwards or forwards in time). Audio to be edited in detail can be viewed in the trim window (Figure 10.2) which shows a detailed waveform display, allowing edits to be previewed either to or from the edit point, or across the edit, using the play controls in the top right-hand corner (this is particularly useful for music editing). The crossfade region is clearly visible, with different colours and shadings used to indicate the ‘live’ audio streams before and after the edit. There are many stages of undo and redo so that nothing need be permanent at this stage. When a satisfactory edit is achieved, it can be written back to the main display where it will be incorporated. Scrub and jog actions for locating edit points are also possible. A useful ‘lock to time’ icon is provided which can be activated to prevent horizontal movement of the streams so that they cannot be accidentally moved out of sync with each other during editing.

The mixer section can be thought of in conventional terms, and indeed some systems offer physical plug-in interfaces with moving fader automation for those who prefer them. As well as mouse control of such things as fader, pan, solo and mute, processing such as eq, filters, aux send and compression can be selected from an effects ‘rack’, and each can be dragged across and dropped in above a fader where it will become incorporated into that channel. Third party ‘plug-in’ software is also available for many systems to enhance the signal processing features, including CEDAR audio restoration software, as described below. The latest software allows for the use of DirectX plug-ins for audio processing. Automation of faders and other processing is also possible. The recorded material itself resides on a (usually) removable hard disk drive, and the edit decision list (the information created during editing which tells the computer how to play the recorded material) resides on the computer’s internal disk drive once the project has been ‘saved’. When a project is complete, the latter can be loaded onto the removable disk so that the whole project is contained therein.

Plug-in architectures

What is a plug-in?

Plug-ins are now one of the fastest-moving areas of audio development, providing audio signal processing and effects that run either on a workstation’s CPU or on dedicated DSP. (The hardware aspects of this were described in Chapter 9.) Audio data can be routed from a sequencer or other audio application, via an API (application programming interface) to another software module called a ‘plug-in’ that does something to the audio and then returns it to the source application. In this sense it is rather like inserting an effect into an audio signal path, but done in software rather than using physical patch cords and rack-mounted effects units. Plug-ins can be written for the host processor in a language such as C++, using the software development toolkits (SDK) provided by the relevant parties. Plug-in processing introduces a delay that depends on the amount of processing and the type of plug-in architecture used. Clearly low latency architectures are highly desirable for most applications.
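The routing described above can be sketched in a few lines of code. The following is an illustration only: the `PlugIn` interface, the `Gain` and `Delay` classes and the `run_chain` and `chain_latency` functions are hypothetical names invented for this example and do not correspond to any real plug-in SDK. It assumes the host hands audio to each plug-in one block at a time and that each plug-in reports its own delay in samples.

```python
from typing import List, Protocol


class PlugIn(Protocol):
    """Hypothetical plug-in interface; not taken from any real SDK."""

    latency_samples: int

    def process(self, block: List[float]) -> List[float]:
        ...


class Gain:
    """Scales every sample and adds no delay."""

    def __init__(self, gain: float) -> None:
        self.gain = gain
        self.latency_samples = 0

    def process(self, block: List[float]) -> List[float]:
        return [sample * self.gain for sample in block]


class Delay:
    """Delays the signal by a fixed number of samples, keeping state between blocks."""

    def __init__(self, samples: int) -> None:
        self.latency_samples = samples
        self._history = [0.0] * samples

    def process(self, block: List[float]) -> List[float]:
        joined = self._history + block
        # The tail is held back for the next call; the head is returned now.
        self._history = joined[len(block):]
        return joined[:len(block)]


def run_chain(chain: List[PlugIn], block: List[float]) -> List[float]:
    """Host side: pass one block through each inserted plug-in in turn."""
    for plug_in in chain:
        block = plug_in.process(block)
    return block


def chain_latency(chain: List[PlugIn]) -> int:
    """Total delay, in samples, that the host would need to compensate for."""
    return sum(plug_in.latency_samples for plug_in in chain)
```

In this sketch a chain of `Gain(0.5)` followed by `Delay(2)` returns the first block two samples late, and `chain_latency` reports that figure so that a host could realign the processed stream with unprocessed ones.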

Many plug-ins are versions of previously external audio devices that have been modelled in DSP, in order to bring favourite EQs or reverbs into the workstation environment. The sound quality of these depends on the quality of the software modelling that has been done. Some host-based (native) plug-ins do not have as good a quality as dedicated DSP plug-ins as they may have been ‘cut to fit’ the processing power available, but as hosts become ever more powerful the quality of native plug-ins increases.

A number of proprietary architectures have been developed for plug-ins, including Microsoft’s DirectX, Steinberg’s VST, Digidesign’s TDM, Mark of the Unicorn’s MAS, TC Works’ PowerCore and EMagic’s host-based plug-in format. Apple’s OS X Audio Units are a feature built into the OS that manages plug-ins without the need for third-party middleware solutions. The popularity of this as a plug-in architecture has yet to be determined at the time of writing, but is likely to be used increasingly as OS X gains popularity. It is usually necessary to specify for which system any software plug-in is intended, as the architectures are not compatible. As OS-based plug-in architectures for audio become more widely used, the need for proprietary approaches may diminish.

Digidesign in fact has four different plug-in approaches that are used variously in its products, as shown in Table 10.1.

DirectX is a suite of multimedia extensions developed by Microsoft for the Windows platform. It includes an element called DirectShow that deals with real-time streaming of media data, together with the insertion of so-called ‘filters’ at different points. DirectX audio plug-ins work under DirectShow and are compatible with a wide range of Windows-based audio software. They operate at 32 bit resolution, using floating-point arithmetic and can run in real time or can render audio files in non-real time. They do not require dedicated signal processing hardware, running on the host CPU, and the number of concurrent plug-ins depends on CPU power and available memory. DirectX plug-ins are also scalable – in other words they can adapt to the processing resource available. They have the advantage of being compatible with the very wide range of DirectX-compatible software in the general computing marketplace but at the time of writing they can only handle two-channel audio.

Table 10.1 Digidesign plug-in alternatives

Plug-in architecture   Description

TDM   Uses dedicated DSP cards for signal processing. Does not affect the host CPU load and processing power can be expanded as required

HTDM   (Host TDM.) Uses the host processor for TDM plug-ins, instead of dedicated DSP

RTAS   (Real Time Audio Suite.) Uses host processor for plug-ins. Not as versatile as HTDM

AudioSuite   Non-real-time processing that uses the host CPU to perform operations such as time-stretching that require the audio file to be rewritten

DXi is a software synthesiser plug-in architecture developed by Cakewalk, running under DirectX.

One example of a proprietary approach used quite widely is VST, Steinberg’s Virtual Studio Technology plug-in architecture. It runs on multiple platforms and works in a similar way to DirectX plug-ins. On Windows machines it operates as a DLL (dynamic link library) resource, whereas on Macs it runs as a raw Code resource. It can also run on BeOS and SGI systems, as a Library function. VST incorporates both virtual effects and virtual instruments such as samplers and synthesisers. There is a cross-platform GUI development tool that enables the appearance of the user interface to be ported between platforms without the need to rewrite it each time.

An example of a plug-in user interface is shown in Fact File 10.1.

Fact file 10.1   Plug-in examples

The plug-in user interface shown below is that of a reverberation processor. The quality of such plug-ins is now getting to the point where it is on a par with the sound quality achievable on external devices, depending primarily on the amount of DSP available.


Advanced audio processing software and development tools

High-end audio signal processing workstations, such as the Lake Huron, are designed primarily for research and development purposes. There is also a range of signal processing software for audio research and development that can run on general purpose desktop computers. Although this is not the primary emphasis of this book, brief mention will be made.

Signal processing workstations such as the Huron use large amounts of dedicated DSP hardware to enable the development of advanced real-time algorithms and signal analysis processes. Systems such as this are used for tasks such as acoustical modelling and real-time rendering of complex virtual reality scenes that require many hundreds of millions of computations per second. Such operations are typically beyond the scope of the average desktop PC, requiring some hours of off-line ‘number crunching’. Using high-end workstations such processes may be run off-line in a fraction of the time or may be implemented in real time. A range of applications is available for the Huron workstation, ranging from head-tracked binaural simulation to virtual acoustic reality development tools. Interfaces are available between Huron and MATLAB, the latter being a popular research tool for the analysis, visualisation and manipulation of data.

MSP is a signal processing toolbox and development environment based on the Max MIDI programming environment described below. MSP runs on the Mac or SGI platforms and is designed to enable users to assemble signal processing ‘engines’ with a variety of components (either library or user-defined). They are linked graphically in a similar manner to the MIDI programming objects used in Max, allowing switches, gain, equalisation, delays and other signal processing devices to be inserted in the signal chain. For the user who is not conversant with programming DSPs directly, MSP provides an easy way in to audio signal processing, by pre-defining the building blocks and enabling them to be manipulated and linked graphically. Signal processing can be run on the host CPU, provided it is sufficiently fast. An example of an MSP patch that acts as a variable stereo delay processor is shown in Figure 10.3.


Figure 10.3   Example of a Max MSP patch that describes a variable stereo delay processor

Mastering and restoration


Some software applications are designed specifically for the mastering and restoration markets. These products are designed either to enable ‘fine tuning’ of master recordings prior to commercial release, involving subtle compression, equalisation and gain adjustment (mastering), or to enable the ‘cleaning up’ of old recordings that have hiss, crackle and clicks (restoration).

CEDAR applications or plug-ins are good examples of the restoration group. Sophisticated controls are provided for the adjustment of dehissing and decrackling parameters, which often require considerable skill to master. Recently the company has introduced advanced visualisation tools that enable restoration engineers to ‘touch up’ audio material using an interface not dissimilar to that used for photo editing on computers. Audio anomalies (unwanted content) can be seen in the time and frequency domains, highlighted and interpolated based on information either side of the anomaly. A typical display from its RETOUCH product for the SADiE platform is shown in Figure 10.4.


Figure 10.4(a) CEDAR Retouch display for SADiE, showing frequency (vertical) against time (horizontal) and amplitude (colour/density). Problem areas of the spectrographic display can be highlighted and a new signal synthesised using information from the surrounding region. (a) Harmonics of an interfering signal can be clearly seen. (b) A short-term spike crosses most of the frequency range


Figure 10.4(b)

CEDAR’s restoration algorithms are typically divided into ‘decrackle’, ‘declick’, ‘dethump’ and ‘denoise’, each depending on the nature of the anomaly to be corrected. Some typical user interfaces for controlling these processes are shown in Figure 10.5.


Figure 10.5(a) CEDAR restoration plug-ins for SADiE, showing (a) Declick and (b) Denoise processes


Figure 10.5(b)

Mastering software usually incorporates advanced dynamics control such as the TC Works Master X series, based on its Finalizer products, a user interface of which is pictured in Figure 10.6. Here compressor curves and frequency dependency of dynamics can be adjusted and metered. The display also allows the user to view the number of samples at peak level to watch for digital overloads that might be problematic.

Level control in mastering

Typical audio systems today have a very wide dynamic range that equals or exceeds that of the human hearing system. Distortion and noise inherent in the recording or processing of audio are at exceptionally low levels owing to the use of high resolution A/D convertors, up to 24 bit storage, and wide range floating-point signal processing. Level control, it might be argued, is therefore less crucial than it used to be in the days when a recording engineer struggled to optimise a recording’s dynamic range between the noise floor and the distortion ceiling (see Figure 10.7). However, there are still artistic and technical considerations.

The dynamic range of a typical digital audio system can now be well over 100 dB and there is room for the operator to allow a reasonable degree of ‘headroom’ between the peak audio signal level and the maximum allowable level. Meters are provided to enable the signal level to be observed, and they are usually calibrated in dB, with zero at the top and negative dBs below this. The full dynamic range is not always shown, and there may be a peak bar that can hold the maximum level permanently or temporarily. As explained in Chapter 8, 0 dBFS (full scale) is the point at which all of the bits available to represent the signal have been used. Above this level the signal clips and the effect of this is quite objectionable, except on very short transients where it may not be noticed. It follows that signals should never be allowed to clip.
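The meter scale described above follows directly from the ratio of the peak sample value to full scale. The sketch below is a minimal illustration assuming 16 bit integer samples; the names `peak_dbfs` and `PeakHold` are invented for this example and the peak-hold behaviour is a simplification of what a real meter might do.

```python
import math

FULL_SCALE_16 = 32768  # magnitude of the most negative 16 bit sample value


def peak_dbfs(samples, full_scale=FULL_SCALE_16):
    """Return the sample peak in dBFS: 0 at full scale, negative below it."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(peak / full_scale)


class PeakHold:
    """Keeps the highest level seen until reset, like a permanent peak bar."""

    def __init__(self):
        self.held = float("-inf")

    def update(self, samples, full_scale=FULL_SCALE_16):
        self.held = max(self.held, peak_dbfs(samples, full_scale))
        return self.held

    def reset(self):
        self.held = float("-inf")
```

A block whose largest sample is half of full scale reads about -6 dBFS, and a block containing the value -32768 reads exactly 0 dBFS.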


Figure 10.6   TC Works MasterX mastering dynamics plug-in interface


Figure 10.7   Comparison of analogue and digital dynamic range. (a) Analogue tape has increasing distortion as the recording level increases, with an effective maximum output level at 3% third harmonic distortion. (b) Modern high resolution digital systems have wider dynamic range with a noise floor fixed by dither noise and a maximum recording level at which clipping occurs. The linearity of digital systems does not normally become poorer as signal level increases, until 0 dBFS is reached. This makes level control a somewhat less important issue at the initial recording stage, provided sufficient headroom is allowed for peaks

There is a tendency in modern audio production to want to master everything so that it sounds as loud as possible, and to ensure that the signal peaks as close to 0 dBFS as possible. This level maximising or normalising process can be done automatically in most packages, the software searching the audio track for its highest level sample and then adjusting the overall gain so that this just reaches 0 dBFS. In this way the recording can be made to use all the bits available, which can be useful if it is to be released on a relatively low resolution consumer medium where noise might be more of a problem. (It is important to make sure that correct redithering is used when altering the level and requantising, as explained in Chapter 8.) This does not, of course, take into account any production decisions that might be involved in adjusting the overall levels of individual tracks on an album or other compilation, where relative levels should be adjusted according to the nature of the individual items, their loudness and the producer’s intent.
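The automatic normalising process described above amounts to a peak search followed by a single gain change. The following sketch assumes floating-point samples in which 1.0 represents 0 dBFS; the function name `normalise` and the `target_dbfs` parameter are illustrative rather than taken from any particular package, and redithering before any subsequent requantising is not shown.

```python
def normalise(samples, target_dbfs=0.0):
    """Scale samples so that the largest magnitude sits at target_dbfs.

    Samples are floats with 1.0 representing 0 dBFS.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence cannot be normalised
    target = 10.0 ** (target_dbfs / 20.0)
    gain = target / peak
    return [s * gain for s in samples]
```

Calling `normalise(track, target_dbfs=-3.0)` would leave the highest sample 3 dB below full scale rather than exactly at it.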

A little-known but important fact is that even if the signal is maximised in the automatic fashion, so that the highest sample value just does not clip, subsequent analogue electronics in the signal chain may still do so. Some equipment is designed in such a way that the maximum digital signal level is aligned to coincide with the clipping voltage of the analogue electronics in a D/A convertor. In fact, owing to the response of the reconstruction filter in the D/A convertor (which reconstructs an analogue waveform from the PAM pulse train) intersample signal peaks can be created that slightly exceed the analogue level corresponding to 0 dBFS, thereby clipping the analogue side of the convertor. For this reason it is recommended that digital-side signals are maximised so that they peak a few dB below 0 dBFS, in order to avoid the distortion that might otherwise result on the analogue side. Some mastering software provides detailed analysis of the signal showing exactly how many samples occur in sequence at peak level, which can be a useful warning of potential or previous clipping.
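The intersample peak effect can be demonstrated numerically. The sketch below is an illustration under stated assumptions, not a description of how any particular convertor or meter works: it models the reconstruction filter as a Hann-windowed sinc interpolator (a simplification chosen for this example), and the function names are invented here. The test signal is a tone at a quarter of the sampling frequency whose samples all fall 45 degrees away from the crests of the waveform, scaled so that every sample sits at full scale.

```python
import math


def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def interpolate(samples, t, half_width=16):
    """Estimate the band-limited waveform at fractional sample index t."""
    total = 0.0
    first = max(0, math.ceil(t - half_width))
    last = min(len(samples) - 1, math.floor(t + half_width))
    for n in range(first, last + 1):
        d = t - n
        window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))  # Hann taper
        total += samples[n] * sinc(d) * window
    return total


def true_peak(samples, oversample=4, half_width=16):
    """Largest magnitude on a grid 'oversample' times finer than the samples.

    The search skips half_width samples at each end, where the kernel is truncated.
    """
    peak = 0.0
    start = half_width * oversample
    stop = (len(samples) - half_width) * oversample
    for i in range(start, stop):
        peak = max(peak, abs(interpolate(samples, i / oversample, half_width)))
    return peak


# Every sample of this tone has a magnitude of 1.0 (full scale).
tone = [math.sqrt(2.0) * math.sin(math.pi * n / 2.0 + math.pi / 4.0) for n in range(64)]
sample_peak = max(abs(s) for s in tone)
intersample_peak = true_peak(tone)
overshoot_db = 20.0 * math.log10(intersample_peak / sample_peak)
```

For this worst-case tone the reconstructed waveform rises roughly 3 dB above the highest sample value, even though no individual sample exceeds full scale.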

Controlling and maintaining sound quality

The sound quality achievable with modern workstations is now exceptionally high. As mentioned earlier, there are now few technical reasons why distortion, noise, frequency response and other performance characteristics should not match the limits of human perception. Of course there will always be those for whom improvements can be made, but technical performance of digital audio systems is no longer really a major issue today.

If one accepts the foregoing argument, the maintenance of sound quality in computer-based production comes down more to understanding the operational areas in which quality can be compromised. These include things like ensuring as few A/D and D/A conversions as possible, maintaining audio resolution at 24 bits or more throughout the signal chain (assuming this is possible), redithering appropriately at points where requantising is done, and avoiding sampling frequency conversions. The rule of thumb should be to use the highest sampling frequency and resolution that one can afford, but no higher than strictly necessary for the purpose, otherwise storage space and signal processing power will be squandered. The scientific merits of exceptionally high sampling frequencies are dubious, for all but a few aficionados, although the marketing value may be considerable.

The main points at which quality can be affected in a digital audio system are A/D and D/A conversion. In fact the quality of an analogue signal is irretrievably fixed at the point of A/D conversion, so this should be done with the best equipment available. There is very little that can be done afterwards to improve the quality of a poorly converted signal. At conversion stages the stability of timing of the sampling clock is crucial, because if it is unstable the audio signal will contain modulation artefacts that give rise to increased distortions and noise of various kinds. This so-called clock jitter is one of the biggest factors affecting sound quality in convertors, and high quality external convertors usually have much lower jitter than the internal convertors used on PC sound cards.
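A rough feel for the importance of clock stability can be had from a widely used approximation, which is not derived in this chapter: for a full-scale sine wave of frequency f sampled with random timing jitter of RMS value t, the signal-to-noise ratio cannot exceed -20 log10(2πft) dB. The sketch below simply evaluates that expression; the function name is invented for this example, and real jitter is often not random, in which case the audible result differs.

```python
import math


def jitter_limited_snr_db(signal_hz, rms_jitter_s):
    """Upper limit on SNR for a full-scale sine sampled with random clock jitter."""
    return -20.0 * math.log10(2.0 * math.pi * signal_hz * rms_jitter_s)
```

Under these assumptions a 20 kHz tone sampled with 1 ns of RMS jitter is limited to about 78 dB, which is well short of the theoretical performance of a 16 bit system, while 100 ps of jitter raises the limit by 20 dB.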

The quality of a digital audio signal, provided it stays in the digital domain, is not altered unless the values of the samples are altered. It follows that if a signal is recorded, replayed, transferred or copied without altering sample values then the quality will not have been affected, despite what anyone may say. Sound quality, once in the digital domain, therefore depends entirely on the signal processing algorithms used to modify the programme. There is little a user can do about this except choose high quality plug-ins and other software, written by manufacturers that have a good reputation for DSP that takes care of rounding errors, truncation, phase errors and all the other nasties that can arise in signal processing. This is really no different from the problems of choosing good-sounding analogue equipment. Certainly not all digital equaliser plug-ins sound the same, for example, because this depends on the filter design. Storage of digital data, on the other hand, does not affect sound quality at all, provided that no errors arise and that the signal is stored at full resolution in its raw PCM form (in other words, not MPEG encoded or some other form of lossy coding).

The sound quality the user hears when listening to the output of a workstation is not necessarily what the consumer will hear when the resulting programme is issued on the release medium. One reason for this is that the sound quality depends on the quality of the D/A convertors used for monitoring. The consumer may hear better or worse, depending on the convertors used, assuming the bit stream is delivered without modification. One hopes that the convertors used in professional environments are better than those used by consumers, but this is not always the case. High resolution audio may be mastered at a lower resolution for consumer release (e.g. 96 kHz, 24 bit recordings reduced to 44.1 kHz, 16 bits for release on CD), and this can affect sound quality. It is very important that any down-conversion of master recordings be done using the best dithering and/or sampling frequency conversion possible, especially when sampling frequency conversion is of a non-integer ratio.
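The word length reduction mentioned above can be illustrated with a simple redithering sketch. This example assumes 24 bit integer samples reduced to 16 bits using triangular (TPDF) dither with a peak amplitude of one 16 bit LSB, which is one common choice rather than the only one; noise shaping is not shown, and the function name is invented for this example.

```python
import math
import random


def requantise_24_to_16(samples_24, rng=None):
    """Reduce 24 bit integer samples to 16 bits with TPDF dither."""
    if rng is None:
        rng = random.Random()
    step = 1 << 8  # one 16 bit LSB expressed in 24 bit units
    out = []
    for s in samples_24:
        # The difference of two uniform values has a triangular distribution
        # spanning plus or minus one 16 bit LSB.
        dither = (rng.random() - rng.random()) * step
        q = math.floor((s + dither) / step + 0.5)
        out.append(max(-32768, min(32767, q)))  # guard against overflow
    return out
```

With a constant input of half a 16 bit LSB, undithered rounding would give the same output value every time, whereas the dithered output alternates between neighbouring values so that its average remains close to the original level.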

Low bit rate coders (e.g. MPEG) can reduce quality in the consumer delivery chain, but it is the content provider’s responsibility to optimise quality depending on the intended release format. Where there are multiple release formats it may be necessary to master the programme differently in each case. For example, really low bit rate Internet streaming may require some enhancement (e.g. compression and equalisation) of the audio to make it sound reasonable under such unfavourable conditions.

When considering the authoring of interactive media such as games or virtual reality audio, there is a greater likelihood that the engineer, author, programmer and producer will have less control over the ultimate sound quality of what the consumer hears. This is because much of the sound material may be represented in the form of encoded ‘objects’ that will be rendered at the replay stage, as shown in Figure 10.8. Here the quality depends more on the quality of the consumer’s rendering engine, which may involve resynthesis of some elements, based on control data. This is a little like the situation when distributing a song as a MIDI sound file, using General MIDI voices. The audible results, unless one uses downloadable sounds (and even then there is some potential for variation), depend on the method of synthesis and the precise nature of the voices available at the consumer end of the chain.

Preparing for and understanding release media

Consumer release formats such as CD, DVD, SACD and MP3 usually require some form of mastering and pre-release preparation. This can range from subtle tweaks to the sound quality and relative levels on tracks to PQ encoding, DVD authoring, data encoding and the addition of graphics, video and text. Some of these have already been mentioned in other places in this book.


PQ encoding for CD mastering can often be done in some of the application packages designed for audio editing, such as SADiE and Pyramix. In this case it may involve little more than marking the starts and ends of the tracks in the playlist and allowing the software to work out the relevant frame advances and Red Book requirements for the assembly of the PQ code that will either be written to a CD-R or included in the DDP file for sending to the pressing plant. The CD only comes at one resolution and sampling frequency (16 bit, 44.1 kHz) making release preparation a relatively straightforward matter.
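Part of what the software works out from the playlist marks can be sketched as follows. CD timing is counted in frames of 1/75 second, each holding 588 stereo samples at 44.1 kHz. The example converts track start marks, given as sample positions, into minutes:seconds:frames form. It is an illustration only: the function names are invented here, marks are simply rounded down to a frame boundary, and the pause, offset and minimum-length rules that a real PQ editor applies are not modelled.

```python
SAMPLE_RATE = 44100
FRAMES_PER_SECOND = 75
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAMES_PER_SECOND  # 588


def snap_to_frame(sample_index):
    """Round a sample position down to the start of its CD frame."""
    return (sample_index // SAMPLES_PER_FRAME) * SAMPLES_PER_FRAME


def to_msf(sample_index):
    """Express a sample position as minutes:seconds:frames."""
    frames = sample_index // SAMPLES_PER_FRAME
    minutes, remainder = divmod(frames, 60 * FRAMES_PER_SECOND)
    seconds, frame = divmod(remainder, FRAMES_PER_SECOND)
    return "{:02d}:{:02d}:{:02d}".format(minutes, seconds, frame)


def track_table(start_samples, end_sample):
    """List (track number, start time, duration) for each marked track."""
    bounds = [snap_to_frame(s) for s in start_samples] + [snap_to_frame(end_sample)]
    return [
        (number, to_msf(begin), to_msf(finish - begin))
        for number, (begin, finish) in enumerate(zip(bounds, bounds[1:]), start=1)
    ]
```

Given marks at the start of the programme and just after ten seconds, with the programme ending at 25 seconds, `track_table` reports two tracks of ten and fifteen seconds respectively.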


DVD mastering is considerably more complicated than CD and requires advanced authoring software that can deal with all the different options possible on this multi-faceted release format. A number of different combinations of players and discs are possible, as explained in Fact File 10.2. DVD-Video allows for 48 or 96 kHz sampling frequency and 16, 20 or 24 bit PCM encoding. A two-channel downmix must be available on the disc in linear PCM form (for basic compatibility), but most discs also include Dolby Digital or possibly DTS surround audio. Dolby Digital encoding usually involves the preparation of a file or files containing the compressed data, and a range of settings have to be made during this process, such as the bit rate, dialogue normalisation level, rear channel phase shift and so on. A typical control screen is shown in Figure 10.9. Then of course there are the pictures, but they are not the topic of this book.


Figure 10.8   (a) In conventional audio production and delivery sources are combined and delivered at a fixed quality to the user, who simply has to replay it. The quality is limited by the resolution of the delivery link. (b) In some virtual and synthetic approaches the audio information is coded in the form of described objects that are rendered at the replay stage. Here the quality is strongly dependent on the capabilities of the rendering engine and the accuracy of description

Playing time depends on the way in which producers decide to use the space available on the disc, and this requires the juggling of the available bit budget. DVD-Audio can store at least 74 minutes of stereo audio even at the highest sample rate and resolution (192/24). Other modes are possible, with up to six channels of audio playing for at least 74 minutes, using combinations of sample frequency and resolution, together with MLP. Six-channel audio can only operate at the two lower sample rates of either class (44.1/88.2 or 48/96).
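The juggling of the bit budget starts from simple arithmetic on the raw PCM data rate. The sketch below is an illustration with invented function names. The `packing_ratio` parameter is a hypothetical stand-in for the saving obtained from lossless packing such as MLP, which in practice varies with the programme material, and the disc capacity is left as a parameter because it depends on the disc type.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Raw linear PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def storage_bytes(bit_rate, minutes, packing_ratio=1.0):
    """Bytes needed for a given duration; packing_ratio below 1.0 models lossless packing."""
    return bit_rate * minutes * 60 * packing_ratio / 8


def playing_minutes(capacity_bytes, bit_rate, packing_ratio=1.0):
    """Playing time available from a given capacity at a given data rate."""
    return capacity_bytes * 8 / (bit_rate * packing_ratio * 60)
```

For example, `pcm_bit_rate(192000, 24, 2)` gives 9 216 000 bits per second for two channels at the highest sample rate and resolution, and 74 minutes at that rate occupies about 5.1 gigabytes before any packing. Six channels at 96 kHz and 24 bits come to 13 824 000 bits per second unpacked, which indicates why lossless packing matters for multichannel material.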

Fact file 10.2   DVD discs and players

There are at least three DVD player types on the market (audio, universal and video), and there are two types of DVD-Audio disc, one containing only audio objects and the other (the DVD-AudioV) capable of holding video objects as well. The video objects on a DVD-AudioV are just the same as DVD-Video objects and therefore can contain video clips, Dolby AC-3 compressed audio and other information. In addition, there is the standard DVD-Video disc.

DVD-AudioV discs should play back in audio players and universal players. Any video objects on an AudioV disc should play back on video-only players. The requirement for video objects on DVD-AudioV discs to contain PCM audio was dropped at the last moment so that such objects could contain only AC-3 audio if desired. This means that an audio disc could contain a multichannel AC-3 audio stream in a video object, enabling it to be played in a video player. This is a good way of ensuring that a multichannel audio disc plays back in as many different types of player as possible, but requires that the content producer makes sure to include the AC-3 video object in addition to MLP or PCM audio objects. The video object can also contain a DTS audio bitstream if desired.


Courtesy of Bike Suzuki (DVD-Audio Forum)

A downmixing technique known as SMART (System Managed Audio Resource Technique) is mandatory in DVD-Audio players but optional for content producers. It enables a stereo downmix of the multichannel material to be made in the player but under content producer control, so this information has to be provided at authoring time. The gains, phases and panning of each audio channel can be controlled in the downmix. A separate two-channel mix (L0/R0) can be included within an MLP bitstream. If a separate stereo mix is provided on the disc then this is automatically used instead of the player downmix.
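The principle of a coefficient-controlled downmix can be sketched as a weighted sum of the source channels into left and right outputs. The example below is an illustration only and does not reproduce the SMART data format. The channel names and coefficient values are hypothetical; in practice they would be chosen by the content producer at authoring time, and a negative coefficient would represent a phase inversion.

```python
# Hypothetical coefficients: (contribution to left, contribution to right).
DOWNMIX = {
    "L": (1.0, 0.0),
    "R": (0.0, 1.0),
    "C": (0.707, 0.707),
    "Ls": (0.707, 0.0),
    "Rs": (0.0, 0.707),
    "LFE": (0.0, 0.0),
}


def downmix_frame(frame, table=DOWNMIX):
    """Mix one multichannel sample frame (a dict of channel values) to stereo."""
    left = sum(table[name][0] * value for name, value in frame.items())
    right = sum(table[name][1] * value for name, value in frame.items())
    return left, right
```

With these example values a signal present only in the centre channel appears equally in both outputs at about 3 dB below its original level.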

All modes other than mono or two-channel have the option to split the channels into two groups. Group 1 would normally contain the front channels (at least left and right) of the multichannel balance, while Group 2 could contain the remaining channels. This is known as scalable audio. The resolution of Group 2 channels can be lower than that of Group 1, enabling less important channels to be coded at appropriate resolutions to manage the overall bit budget. The exact point of the split between the channel groups depends on the mode, and there are in fact 21 possible ways of splitting the channels. It is also possible to ‘bit-shift’ channels that do not use the full dynamic range of the channel. For example, surround channels that might typically under-record compared with the front channels can be bit-shifted upwards so as to occupy only the 16 MSBs of the channel. On replay they are restored to their original gains.


Figure 10.9   Screen display of Dolby Digital encoding software options

It is not mandatory to use the centre channel on DVD-Audio. Some content producers may prefer to omit a centre speaker feed and rely on the more conventional stereo virtual centre. The merits or demerits of this continue to be debated.

The use of MLP on DVD-A discs is optional, but is an important tool in the management of bit budget. Using MLP one would be able to store separate two-channel and multichannel mixes on the same disc, avoiding the need to rely on the semi-automatic downmixing features of DVD players. Owing to the so-called Lossless Matrix technology employed, an artistically controlled L0/R0 downmix can be made at the MLP mastering stage, taking up very little extra space on the disc owing to redundancy between the multichannel and two-channel information. MLP is also the key to obtaining high resolution multichannel audio on all channels without scaling.

DVD masters are usually transferred to the pressing plant on DLT tapes, using the Disc Description Protocol (DDP), or on DVD-R(A) discs as a disc image with a special CMF (cutting master format) header in the disc lead-in area containing the DDP data.

Super Audio CD (SACD)

SACD Authoring software enables the text information to be added, as shown in Figure 10.10. SACD masters are normally submitted to the pressing plant on AIT format data tapes.

Sony and Philips have paid considerable attention to copy protection and anti-piracy measures on the disc itself. Comprehensive visible and invisible watermarking are standard features of the SACD. Using a process known as PSP (Pit Signal Processing) the width of the pits cut into the disc surface is modulated in such a fashion as to create a visible image on the surface of the CD layer, if desired by the originator. This provides a visible means of authentication. The invisible watermark is a mandatory feature of the SACD layer and is used to authenticate the disc before it will play on an SACD player. The watermark is needed to decode the data on the disc. Discs without this watermark will simply be rejected by the player. It is apparently not possible to copy this watermark by any known means. Encryption of digital music content is also optional, at the request of software providers.


Figure 10.10   Example of SACD text authoring screen from SADiE


MP3, as already explained elsewhere, is actually MPEG-1, Layer 3 encoded audio, stored in a data file, usually for distribution to consumers either on the Internet or on other release media. Consumer disc players are increasingly capable of replaying MP3 files from CDs, for example. MP3 mastering requires that the two-channel audio signal is MPEG-encoded, using one of the many MP3 encoders available, possibly with the addition of the ID3 tag described in Chapter 6. Some mastering software now includes MP3 encoding as an option.

Some of the choices to be made in this process concern the data rate and audio bandwidth to be encoded, as this affects the sound quality. The lowest bit rates (e.g.: below 64 kbit/s) will tend to sound noticeably poorer than the higher ones, particularly if full audio bandwidth is retained. For this reason some encoders limit the bandwidth or halve the sampling frequency for very low bit rate encoding, because this tends to minimise the unpleasant side-effects of MPEG encoding. It is also possible to select joint stereo coding mode, as this will improve the technical quality somewhat at low bit rates, possibly at the expense of stereo imaging accuracy. As mentioned above, at very low bit rates some audio processing may be required to make sound quality acceptable when squeezed down such a small pipe.

MPEG-4, web and interactive authoring

Commercial tools for interactive authoring and MPEG-4 encoding are only just beginning to appear at the time of writing. Such tools enable audio scenes to be described and data encoded in a scalable fashion so that they can be rendered at the consumer replay end of the chain, according to the processing power available.

Interactive authoring for games is usually carried out using low-level programming and tools for assembling the game assets, there being few universal formats or standards in this business at the present time. It requires detailed understanding of the features of the games console in question and these platforms differ considerably. Making the most of the resources available is a specialised task, and a number of books have been written on the subject (see Recommended further reading at the end of this chapter). Multimedia programmes involving multiple media elements are often assembled using authoring software such as Director, but that will not be covered further here. Preparing audio for web (Internet) delivery is also a highly specialised topic covered very well in other books (see Recommended further reading).

Interconnecting digital audio devices


In the case of analogue interconnection between devices, replayed digital audio is converted to the analogue domain by the replay machine’s D/A convertors, routed to the recording machine via a conventional audio cable and then reconverted to the digital domain by the recording machine’s A/D convertors. The audio is subject to any gain changes that might be introduced by level differences between output and input, or by the record gain control of the recorder and the replay gain control of the player. Analogue domain copying is necessary if any analogue processing of the signal is to happen in between one device and another, such as gain correction, equalisation, or the addition of effects such as reverberation. Most of these operations, though, are now possible in the digital domain.

An analogue domain copy cannot be said to be a perfect copy or a clone of the original master, because the data values will not be exactly the same (owing to slight differences in recording level, differences between convertors, the addition of noise, and so on). For a clone it is necessary to make a true digital copy. Digital interfaces may be used for the interconnection of recording systems and other audio devices such as mixers and effects units. It is now common only to use analogue interfaces at the very beginning and end of the signal chain, with all other interconnections being made digitally.

Professional digital audio systems, and some consumer systems, have digital interfaces conforming to one of the standard protocols and allow for a number of channels of digital audio data to be transferred between devices with no loss of sound quality. Any number of generations of digital copies may be made without affecting the sound quality of the latest generation, provided that errors have been fully corrected. (This assumes that the audio is in a linear PCM format and has not been subject to low bit rate decoding and re-encoding.) The digital outputs of a recording device are taken from a point in the signal chain after error correction, which results in the copy being error corrected. Thus the copy does not suffer from any errors that existed in the master, provided that those errors were correctable. This process takes place in real time, requiring the operator to put the receiving device into record mode such that it simply stores the incoming stream of audio data. Any accompanying metadata may or may not be recorded (often most of it is not).

Making a copy of a recording using any of the digital interface standards involves the connection of appropriate cables between player and recorder, and the switching of the recorder’s input to ‘digital’ as opposed to ‘analogue’, since this sets it to accept a signal from the digital input as opposed to the A/D convertor. It is necessary for both machines to be operating at the same sampling frequency (unless a sampling frequency convertor is used) and may require the recorder to be switched to ‘external sync’ mode, so that it can lock its sampling frequency to that of the player. Alternatively (and preferably) a common reference signal may be used to synchronise all devices that are to be interconnected digitally. A recorder should be capable of at least the same quantising resolution (number of bits per sample) as a player, otherwise audio resolution will be lost. If there is a difference in resolution between the systems it is advisable to use a processor in between the machines that optimally dithers the signal for the new resolution, or alternatively to use redithering options on the source machine to prepare the signal for its new resolution (see Chapter 8).
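The effect of redithering when the word length is reduced can be sketched as follows. This is a minimal illustration, assuming triangular (TPDF) dither of plus or minus one LSB at the new resolution, which is a common choice but not the only one. The function name, the fixed random seed and the test level are hypothetical.

```python
import random

def requantise(sample_24, rng, target_bits=16, source_bits=24):
    """Reduce one signed integer sample to target_bits using TPDF dither."""
    shift = source_bits - target_bits
    step = 1 << shift                      # one target LSB in source units
    # Triangular PDF dither: sum of two uniform values, spanning +/- 1 LSB.
    dither = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
    quantised = round(sample_24 / step + dither)
    # Clip to the legal range of the target word length.
    top = (1 << (target_bits - 1)) - 1
    bottom = -(1 << (target_bits - 1))
    return max(bottom, min(top, quantised))

rng = random.Random(1)                     # fixed seed for repeatability
# A constant level that falls between two 16 bit steps (100.25 LSB).
source = [100 * 256 + 64] * 20000
out = [requantise(s, rng) for s in source]
mean_level = sum(out) / len(out)
```

Simple truncation would return 100 for every sample and lose the fractional part of the level. With dither the individual output values vary between adjacent steps, but their average stays close to the original 100.25, so the low-level information is preserved as noise rather than as distortion.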

Increasingly, generic computer data interconnects are used for audio purposes, as explained in Fact File 10.3.

Dedicated audio interface formats

There are a number of types of digital interface, some of which are international standards and others of which are manufacturer specific. They all carry digital audio for one or more channels with at least 16 bit resolution and will operate at the standard sampling rates of 44.1 and 48 kHz, as well as at 32 kHz if necessary, some having a degree of latitude for varispeed. Some interface standards have been adapted to handle higher sampling frequencies such as 88.2 and 96 kHz. The interfaces vary as to how many physical interconnections are required. Some require one link per channel plus a synchronisation signal, whilst others carry all the audio information plus synchronisation information over one cable.

The most common interfaces are described below in outline. It is common for subtle incompatibilities to arise between devices, even when interconnected with a standard interface, owing to the different ways in which non-audio information is implemented. This can result in anything from minor operational problems to total non-communication and the causes and remedies are unfortunately far too detailed to go into here. The reader is referred to The Digital Interface Handbook by Rumsey and Watkinson, as well as to the standards themselves, if a greater understanding of the intricacies of digital audio interfaces is required.

The AES/EBU interface (AES-3)

The AES-3 interface, described almost identically in AES-3-1992, IEC 60958 and EBU Tech. 3250E among others, allows for two channels of digital audio (A and B) to be transferred serially over one balanced interface, using drivers and receivers similar to those used in the RS422 data transmission standard, with an output voltage of between 2 and 7 volts as shown in Figure 10.11. The interface allows two channels of audio to be transferred over distances up to 100 m, but longer distances may be covered using combinations of appropriate cabling, equalisation and termination. Standard XLR-3 connectors are used, often labelled DI (for digital in) and DO (for digital out).

Fact file 10.3   Computer networks vs digital audio interfaces

Dedicated ‘streaming’ interfaces, as employed in broadcasting, production and post-production environments, are the digital audio equivalent of analogue signal cables, down which signals for one or more channels are carried in real time from one point to another, possibly with some auxiliary information (metadata) attached. An example is the AES-3 interface, described below. Such an audio interface uses a data format dedicated to audio purposes, whereas a computer data network may carry numerous types of information.

Dedicated interfaces are normally unidirectional, point-to-point connections, and should be distinguished from computer data interconnects and networks that are often bidirectional and carry data in a packet format for numerous sources and destinations. With dedicated interfaces, sources may be connected to destinations using a routing matrix or by patching individual connections, very much as with analogue signals. Audio data are transmitted in an unbroken stream, there is no handshaking process involved in the data transfer, and erroneous data are not retransmitted because there is no mechanism for requesting their retransmission. The data rate of a dedicated audio interface is usually directly related to the audio sampling frequency, word length and number of channels of the audio data to be transmitted, ensuring that the interface is always capable of serving the specified number of channels. If a channel is unused for some reason its capacity is not normally available for assigning to other purposes (such as higher speed transfer of another channel, for example).

There is an increasing trend towards employing standard computer interconnects and networks to transfer audio information, as opposed to using dedicated audio interfaces. Such computer networks are typically used for a variety of purposes in general data communications and they may need to be adapted for audio applications that require sample-accurate real-time transfer. The increasing ubiquity of computer systems in audio environments makes it inevitable that generic data communication technology will gradually take the place of dedicated interfaces. It also makes sense economically to take advantage of the ‘mass market’ features of the computer industry.

Computer networks are typically general purpose data carriers that may have asynchronous features and may not always have the inherent quality-of-service (QoS) features that are required for ‘streaming’ applications. They also normally use an addressing structure that enables packets of data to be carried from one of a number of sources to one of a number of destinations and such packets will share the connection in a more or less controlled way. Data transport protocols such as TCP/IP are often used as a universal means of managing the transfer of data from place to place, adding overheads in terms of data rate, delay and error handling that may work against the efficient transfer of audio. Such networks may be designed primarily for file transfer applications where the time taken to transfer the file is not a crucial factor – ‘as fast as possible’ will do. This has required some special techniques to be developed for carrying real-time data such as audio information.

Desktop computers and consumer equipment are also increasingly equipped with general purpose serial data interfaces such as USB (universal serial bus) and Firewire (IEEE 1394). These are examples of personal area network (PAN) technology, allowing a number of devices to be interconnected within a limited range around the user. These have a high enough data rate to carry a number of channels of audio data over relatively short distances, either over copper or optical fibre. Audio protocols also exist for these.


Figure 10.11   Recommended electrical circuit for use with the standard two-channel interface

Each audio sample is contained within a ‘subframe’ (see Figure 10.12), and each subframe begins with one of three synchronising patterns to identify the sample as either the A or B channel, or to mark the start of a new channel status block (see Figure 10.13). These synchronising patterns violate the rules of bi-phase mark coding (see below) and are easily identified by a decoder. One frame (containing two audio samples) is normally transmitted in the time period of one audio sample, so the data rate varies with the sampling frequency. (Note, though, that the recently introduced ‘single-channel-double-sampling-frequency’ mode of the interface allows two samples for one channel to be transmitted within a single frame in order to allow the transport of audio at 88.2 or 96 kHz sampling frequency.)

Additional data is carried within the subframe in the form of 4 bits of auxiliary data (which may either be used for additional audio resolution or for other purposes such as low quality speech), a validity bit (V), a user bit (U), a channel status bit (C) and a parity bit (P), making 32 bits per subframe and 64 bits per frame. Channel status bits are aggregated at the receiver to form a 24 byte word every 192 frames, and each bit of this word has a specific function relating to interface operation, an overview of which is shown in Figure 10.14. Examples of bit usage in this word are the signalling of sampling frequency and pre-emphasis, as well as the carrying of a sample address ‘timecode’ and labelling of source and destination. Bit 1 of the first byte signifies whether the interface is operating according to the professional (set to 1) or consumer (set to 0) specification.
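The figures quoted above can be checked with a little arithmetic. The sketch below works out the interface bit rate (before channel coding) and the duration of a channel status block, and packs 192 channel status bits into 24 bytes. The function names are hypothetical, and the bit ordering used when packing bytes is an assumption that should be checked against the standard.

```python
BITS_PER_SUBFRAME = 32
SUBFRAMES_PER_FRAME = 2
FRAMES_PER_BLOCK = 192

def interface_bit_rate(fs_hz):
    """Bits per second when one frame is sent per audio sample period."""
    return fs_hz * SUBFRAMES_PER_FRAME * BITS_PER_SUBFRAME

def block_duration_ms(fs_hz):
    """Time taken to transmit one 192 frame channel status block."""
    return 1000.0 * FRAMES_PER_BLOCK / fs_hz

def assemble_channel_status(c_bits):
    """Pack 192 C bits (one per frame, for one channel) into 24 bytes.

    Assumption: the first bit received is stored as the least
    significant bit of byte 0; check the standard before relying on it.
    """
    if len(c_bits) != FRAMES_PER_BLOCK:
        raise ValueError("a channel status block needs 192 bits")
    block = bytearray(FRAMES_PER_BLOCK // 8)
    for i, bit in enumerate(c_bits):
        if bit:
            block[i // 8] |= 1 << (i % 8)
    return bytes(block)
```

At 48 kHz `interface_bit_rate(48000)` gives 3 072 000 bit/s and a complete channel status block arrives every 4 ms, which is why channel status information changes only slowly compared with the audio data.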


Figure 10.12   Format of the standard two-channel interface frame


Figure 10.13   Three different preambles (X, Y and Z) are used to synchronise a receiver at the starts of subframes


Figure 10.14   Overview of the professional channel status block


Figure 10.15   An example of the bi-phase mark channel code

Bi-phase mark coding, the same channel code as used for SMPTE/EBU timecode, is used in order to ensure that the data is self-clocking, of limited bandwidth, DC free, and polarity independent, as shown in Figure 10.15. The interface has to accommodate a wide range of cable types and a nominal 110 ohm characteristic impedance is recommended. Originally (AES-3-1985) up to four receivers with a nominal input impedance of 250 ohms could be connected across a single professional interface cable, but the later modification to the standard recommended the use of a single receiver per transmitter, having a nominal input impedance of 110 ohms.
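The properties of bi-phase mark coding are easy to demonstrate in a few lines. In this sketch each data bit occupies two half-cells: there is a transition at the start of every bit cell, and an additional transition in the middle of the cell when the bit is a one. The function names are hypothetical, and the decoder assumes that the bit cell boundaries are already known (in the real interface the preambles provide this alignment).

```python
def bmc_encode(bits, start_level=0):
    """Encode data bits as a list of half-cell levels (two per bit)."""
    level = start_level
    out = []
    for bit in bits:
        level ^= 1            # transition at the start of every bit cell
        out.append(level)
        if bit:
            level ^= 1        # extra mid-cell transition signals a one
        out.append(level)
    return out

def bmc_decode(levels):
    """Recover data bits by comparing the two halves of each bit cell."""
    return [int(levels[i] != levels[i + 1]) for i in range(0, len(levels), 2)]
```

Because the decoder looks only at whether the level changes within a cell, inverting every level in the encoded stream (as would happen if the two signal wires were swapped) yields exactly the same decoded data. The guaranteed transition at each cell boundary is what makes the signal self-clocking.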

Standard consumer interface (IEC 60958-3)

The most common consumer interface (historically related to SPDIF – the Sony/Philips digital interface) is very similar to the AES-3 interface, but uses unbalanced electrical interconnection over a coaxial cable having a characteristic impedance of 75 ohms, as shown in Figure 10.16. It can be found on many items of semi-professional or consumer digital audio equipment, such as CD players, DVD players and DAT machines, and is also widely used on computer sound cards because of the small physical size of the connectors. It usually terminates in an RCA phono connector, although some equipment makes use of optical fibre interconnects (TOS-link) carrying the same data. Format convertors are available for converting consumer format signals to the professional format, and vice versa, and for converting between electrical and optical formats.


Figure 10.16   The consumer electrical interface (transformer and capacitor are optional but may improve the electrical characteristics of the interface)

When the IEC standardised the two-channel digital audio interface, two requirements existed: one for ‘consumer use’ and one for ‘broadcasting or similar purposes’. A single IEC standard (IEC 958) resulted with only subtle differences between consumer and professional implementation. Occasionally this caused problems in the interconnection of machines, such as when consumer format data was transmitted over professional electrical interfaces. IEC 958 has now been rewritten as IEC 60958 and many of these uncertainties have been addressed. Both the professional and consumer interfaces are capable of carrying data-reduced audio signals such as MPEG and Dolby Digital as described in Fact File 10.4.

Fact file 10.4   Carrying data-reduced audio

The increased use of data-reduced multichannel audio has resulted in methods by which such data can be carried over standard two-channel interfaces, for either professional or consumer purposes. This makes use of the ‘non-audio’ or ‘other uses’ mode of the interface, indicated in the second bit of channel status, which tells conventional PCM audio decoders that the information is some other form of data that should not be converted directly to analogue audio. Because data-reduced audio has a much lower rate than the PCM audio from which it was derived, a number of audio channels can be carried in a data stream that occupies no more space than two channels of conventional PCM. These applications of the interface are described in SMPTE 337M (concerned with professional applications) and IEC 61937, although the two are not identical. SMPTE 338M and 339M specify data types to be used with this standard. The SMPTE standard packs the compressed audio data into 16, 20 or 24 bits of the audio part of the AES-3 subframe and can use the two subframes independently (e.g. one for PCM audio and the other for data-reduced audio), whereas the IEC standard only uses 16 bits and treats both subframes the same way.

Consumer use of this mode is evident on DVD players, for example, for connecting them to home cinema decoders. Here the Dolby Digital or DTS-encoded surround sound is not decoded in the player but in the attached receiver/decoder. IEC 61937 has parts, either pending or published, dealing with a range of different codecs including ATRAC, Dolby AC-3, DTS and MPEG (various flavours). An ordinary PCM convertor trying to decode such a signal would simply reproduce it as a loud, rather unpleasant noise, which is not advised and does not normally happen if the second bit of channel status is correctly observed. Professional applications of the mode vary, but are likely to be increasingly encountered in conjunction with Dolby E data reduction – a relatively recent development involving mild data reduction for professional multichannel applications in which users wish to continue making use of existing AES-3-compatible equipment (e.g. VTRs, switchers and routers). Dolby E enables 5.1-channel surround audio to be carried over conventional two-channel interfaces and through AES-3-transparent equipment at a typical rate of about 1.92 Mbit/s (depending on how many bits of the audio subframe are employed). It is designed so that it can be switched or edited at video frame boundaries without disturbing the audio.
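The dependence of the payload rate on the number of subframe bits employed can be shown with a short calculation. This sketch assumes a 48 kHz frame rate and that both subframes carry data; the function name is hypothetical.

```python
def payload_rate(bits_used, fs_hz=48000, subframes=2):
    """Payload bit rate when bits_used bits of each subframe carry data."""
    return bits_used * subframes * fs_hz

# Rates in bit/s for the three word lengths mentioned in this fact file.
rates = {bits: payload_rate(bits) for bits in (16, 20, 24)}
```

With 20 bits per subframe the result is 1 920 000 bit/s, which matches the typical figure of about 1.92 Mbit/s quoted above; 16 and 24 bits give 1.536 and 2.304 Mbit/s respectively.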


Figure 10.17   Overview of the consumer channel status block

The data format of subframes is the same as that used in the professional interface, but the channel status implementation is almost completely different, as shown in Figure 10.17. The second byte of channel status in the consumer interface has been set aside for the indication of ‘category codes’, these being set to define the type of consumer usage. Current examples of defined categories are (00000000) for the General category, (10000000) for Compact Disc and (11000000) for a DAT machine. Once the category has been defined, the receiver is expected to interpret certain bits of the channel status word in a particular way, depending on the category. For example, in CD usage, the four control bits from the CD’s ‘Q’ channel subcode are inserted into the first four control bits of the channel status block (bits 1–4). Copy protection can be implemented in consumer-interfaced equipment, according to the Serial Copy Management System (SCMS).

The user bits of the consumer interface are often used to carry information derived from the subcode of recordings, such as track identification and cue point data. This can be used when copying CDs and DAT tapes, for example, to ensure that track start ID markers are copied along with the audio data. This information is not normally carried over AES/EBU interfaces.


Figure 10.18   Format of TDIF data and LRsync signal

Tascam digital interface (TDIF)

Tascam’s interfaces have become popular owing to the widespread use of the company’s DA-88 multitrack recorder and more recent derivatives. The primary TDIF-1 interface uses a 25-pin D-sub connector to carry eight channels of audio information in two directions (in and out of the device), sampling frequency and pre-emphasis information (on separate wires, two for fs and one for emphasis) and a synchronising signal. The interface is unbalanced and uses CMOS voltage levels. Each data connection carries two channels of audio data, odd channel and MSB first, as shown in Figure 10.18. As can be seen, the audio data can be up to 24 bits long, followed by two bits to signal the word length, one bit to signal emphasis and one for parity. There are also four user bits per channel that are not usually used.

Alesis digital interface

The ADAT multichannel optical digital interface, commonly referred to as the ‘light pipe’ interface or simply ‘ADAT Optical’, is a serial, self-clocking, optical interface that carries eight channels of audio information. It is described in US Patent 5,297,181: ‘Method and apparatus for providing a digital audio interface protocol’. The interface is capable of carrying up to 24 bits of digital audio data for each channel and the eight channels of data are combined into one serial frame that is transmitted at the sampling frequency. The data is encoded in NRZI format for transmission, with forced ones inserted every five bits (except during the sync pattern) to provide clock content. This can be used to synchronise the sampling clock of a receiving device if required, although some devices require the use of a separate 9-pin ADAT sync cable for synchronisation. The sampling frequency is normally limited to 48 kHz with varispeed up to 50.4 kHz and TOSLINK optical connectors are typically employed (Toshiba TOCP172 or equivalent). In order to operate at 96 kHz sampling frequency some implementations use a ‘double-speed’ mode in which two channels are used to transmit one channel’s audio data (naturally halving the number of channels handled by one serial interface). Although 5 m lengths of optical fibre are the maximum recommended, longer distances may be covered if all the components of the interface are of good quality and clean. Experimentation is required.


Figure 10.19   Basic format of ADAT data

As shown in Figure 10.19 the frame consists of an 11 bit sync pattern consisting of 10 zeros followed by a forced one. This is followed by four user bits (not normally used and set to zero), the first forced one, then the first audio channel sample (with forced ones every five bits), the second audio channel sample, and so on.
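The frame layout just described can be assembled in a short sketch to check the bit count. The exact position of each forced one and the order in which the bits of a sample are sent are assumptions made for illustration, as are the function names; the NRZI coding stage is omitted.

```python
SYNC = [0] * 10 + [1]          # 10 zeros followed by a forced one

def nibbles(value, width=24):
    """Split a word into 4 bit groups, most significant group first (assumed)."""
    for shift in range(width - 4, -1, -4):
        yield [(value >> (shift + b)) & 1 for b in (3, 2, 1, 0)]

def build_frame(samples, user_bits=(0, 0, 0, 0)):
    """Assemble one frame from eight 24 bit sample words (unsigned here)."""
    if len(samples) != 8:
        raise ValueError("eight channels are expected")
    # Sync pattern, four user bits, then the first forced one.
    frame = list(SYNC) + list(user_bits) + [1]
    for word in samples:
        for group in nibbles(word):
            frame += group + [1]   # four data bits, then a forced one
    return frame

frame = build_frame([0] * 8)
bit_rate_48k = len(frame) * 48000  # one frame per sample period
```

Under these assumptions the frame is 256 bits long, giving 12.288 Mbit/s at 48 kHz. Even when every audio sample is zero, no run of zeros outside the sync pattern is longer than four bits, so the sync pattern remains unique and the receiver always has transitions from which to recover the clock.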

Sony digital interface for DSD (SDIF-3)

Sony has recently introduced a high resolution digital audio format known as ‘Direct Stream Digital’ or DSD (see Chapter 8). This encodes audio using 1 bit sigma–delta conversion at a very high sampling frequency of typically 2.8224 MHz (64 times 44.1 kHz). There are no internationally agreed interfaces for this format of data, but Sony has released some preliminary details of an interface that can be used for the purpose, known as SDIF-3. Some early DSD equipment used a data format known as ‘DSD-raw’ which was simply a stream of DSD samples in non-return-to-zero (NRZ) form, as shown in Figure 10.20(a).

In SDIF-3 data is carried over 75 ohm unbalanced coaxial cables, terminating in BNC connectors. The bit rate is twice the DSD sampling frequency (or 5.6448 Mbit/s at the sampling frequency given above) because phase modulation is used for data transmission as shown in Figure 10.20(b). A separate word clock at 44.1 kHz is used for synchronisation purposes. It is also possible to encounter a DSD clock signal connection at 64 times 44.1 kHz (2.8224 MHz).

Sony multichannel DSD interface (MAC-DSD)

Sony has also developed a multichannel interface for DSD signals, capable of carrying 24 channels over a single physical link. The transmission method is based on the same technology as used for the Ethernet 100BASE-TX (100 Mbit/s) twisted-pair physical layer (PHY), but it is used in this application to create a point-to-point audio interface. Category 5 cabling is used, as for Ethernet, consisting of eight conductors. Two pairs are used for bi-directional audio data and the other two pairs for clock signals, one in each direction.


Figure 10.20   Direct Stream Digital interface data is either transmitted ‘raw’, as shown at (a) or phase modulated as in the SDIF-3 format shown at (b)

Twenty-four channels of DSD audio require a total bit rate of 67.7 Mbit/s, leaving an appreciable spare capacity for additional data. In the MAC-DSD interface this is used for error correction (parity) data, frame header and auxiliary information. Data is formed into frames that can contain Ethernet MAC headers and optional network addresses for compatibility with network systems. Audio data within the frame is formed into 352 blocks of 32 bits, each comprising 24 bits of individual channel samples, six parity bits and two auxiliary bits.
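The DSD data rates quoted in this and the preceding section follow directly from the sampling frequency, as the following sketch shows. The variable names are hypothetical, and the 100 Mbit/s figure is the nominal data rate of the 100BASE-TX physical layer rather than a usable payload figure.

```python
DSD_FS = 64 * 44_100                     # 2.8224 MHz, 1 bit per sample

sdif3_rate = 2 * DSD_FS                  # phase modulation doubles the rate
mac_dsd_audio = 24 * DSD_FS              # 24 channels of 1 bit audio
block_bits = 24 + 6 + 2                  # channel samples + parity + auxiliary
frame_audio_bits = 352 * block_bits      # audio section of one frame
link_rate = 100_000_000                  # nominal 100BASE-TX data rate
spare = link_rate - mac_dsd_audio        # left for parity, headers, auxiliary
```

The audio alone needs 67 737 600 bit/s, leaving roughly 32 Mbit/s of the nominal link rate for the parity, header and auxiliary data described above.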

In a recent enhancement of this interface, Sony has introduced ‘SuperMAC’ which is capable of handling either DSD or PCM audio with very low latency (delay), typically less than 50 μs. The number of channels carried depends on the sampling frequency. Twenty-four DSD channels can be handled, or 48 PCM channels at 44.1/48 kHz, reducing proportionately as the sampling frequency increases. In conventional PCM mode the interface is transparent to AES-3 data including user and channel status information.

Data networks and computer interconnects

A network carries data either on wire or optical fibre, and is normally shared between a number of devices and users. The sharing is achieved by containing the data in packets of a limited number of bytes (usually between 64 and 1518), each with an address attached. The packets usually share a common physical link, normally a high-speed serial bus of some kind, being multiplexed in time either using a regular slot structure synchronised to a system clock (isochronous transfer) or in an asynchronous fashion whereby the time interval between packets may be varied or transmission may not be regular, as shown in Figure 10.21. The length of packets may not be constant, depending on the requirements of different protocols sharing the same network. Packets for a particular file transfer between two devices may not be contiguous and may be transferred erratically, depending on what other traffic is sharing the same physical link.

Figure 10.22 shows some common physical layouts for local area networks (LANs). LANs are networks that operate within a limited area, such as an office building or studio centre, within which it is common for every device to ‘see’ the same data, each picking off that which is addressed to it and ignoring the rest. Routers and bridges can be used to break up complex LANs into subnets. WANs (wide area networks) and MANs (metropolitan area networks) are larger entities that link LANs within communities or regions. PANs (personal area networks) are typically limited to a range of a few tens of metres around the user (e.g.: Firewire, USB, Bluetooth). Wireless versions of these network types are increasingly common. Different parts of a network can be interconnected or extended as explained in Fact File 10.5.


Figure 10.21   Packets for different destinations (A, B and C) multiplexed onto a common serial bus. (a) Time division multiplexed into a regular time slot structure. (b) Asynchronous transfer showing variable time gaps and packet lengths between transfers for different destinations


Figure 10.22   Two examples of computer network topologies. (a) Devices connected by spurs to a common hub, and (b) devices connected to a common ‘backbone’. The former is now by far the most common, typically using CAT 5 cabling

Network communication is divided into a number of ‘layers’, each relating to an aspect of the communication protocol and interfacing correctly with the layers either side. The ISO seven-layer model for open systems interconnection (OSI) shows the number of levels at which compatibility between systems needs to exist before seamless interchange of data can be achieved (Figure 10.23). It shows that communication begins at the application and is passed down through various stages to the layer most people understand – the physical layer, or the piece of wire over which the information is carried. Layers 3, 4 and 5 can be grouped under the broad heading of ‘protocol’, determining the way in which data packets are formatted and transferred. There is a strong similarity here with the exchange of data on physical media, as discussed earlier, where a range of compatibility layers from the physical to the application determine whether or not one device can read another’s disks.

Fact file 10.5   Extending a network

It is common to need to extend a network to a wider area or to more machines. As the number of devices increases so does the traffic, and there comes a point when it is necessary to divide a network into zones, separated by ‘repeaters’, ‘bridges’ or ‘routers’. Some of these devices allow network traffic to be contained within zones, only communicating between the zones when necessary. This is vital in large interconnected networks because otherwise data placed anywhere on the network would be present at every other point on the network, and overload could quickly occur.

A repeater is a device that links two separate segments of a network so that they can talk to each other, whereas a bridge isolates the two segments in normal use, only transferring data across the bridge when it has a destination address on the other side. A router is very selective in that it examines data packets and decides whether or not to pass them depending on a number of factors. A router can be programmed only to pass certain protocols and only certain source and destination addresses. It therefore acts as something of a network policeman and can be used as a first level of ensuring security of a network from unwanted external access. Routers can also operate between different standards of network, such as between FDDI and Ethernet, and ensure that packets of data are transferred over the most time-/cost-effective route.

One could also use some form of router to link a local network to another that was quite some distance away, forming a wide area network (WAN). Data can be routed either over dialled data links such as ISDN, in which the time is charged according to usage just like a telephone call, or over leased circuits. The choice would depend on the degree of usage and the relative costs. The Internet provides a means by which LANs are easily interconnected, although the data rate available will depend on the route, the service provider and the current traffic.
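The difference between the three devices described in this Fact File can be illustrated with a minimal sketch in Python. The function names, the address table and the packet fields are hypothetical, chosen only for illustration, and the decision to forward packets whose destination is unknown is an assumption rather than something stated above.

```python
# Illustrative sketch only: all names and data structures are hypothetical.

def repeater_forwards():
    """A repeater links two segments and passes everything across."""
    return True


def bridge_forwards(source_segment, destination_address, address_table):
    """A bridge passes data only when the destination is on the other side.

    address_table maps a device address to the segment it lives on.
    Assumption: a destination that is not in the table is forwarded,
    because the bridge cannot tell which side it is on.
    """
    destination_segment = address_table.get(destination_address)
    if destination_segment is None:
        return True
    return destination_segment != source_segment


def router_passes(packet, allowed_protocols, allowed_address_pairs):
    """A router passes only permitted protocols and address pairs.

    packet is a dict with the hypothetical keys 'protocol', 'src' and 'dst'.
    """
    pair = (packet["src"], packet["dst"])
    return packet["protocol"] in allowed_protocols and pair in allowed_address_pairs
```

With a table such as {"A": "studio1", "B": "studio1", "C": "studio2"}, traffic from studio1 to device B stays local under the bridge rule, whereas traffic to device C is passed across.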

Audio network requirements

The principal application of computer networks in audio systems is in the transfer of audio data files between workstations, or between workstations and a central ‘server’ which stores shared files. The device requesting the transfer is known as the ‘client’ and the device providing the data is known as the ‘server’. When a file is transferred in this way a byte-for-byte copy is reconstructed on the client machine, with the file name and any other header data intact. There are considerable advantages in being able to perform this operation at speeds in excess of real time for operations in which real-time feeds of audio are not the aim. For example, in a news editing environment a user might wish to upload a news story file from a remote disk drive in order to incorporate it into a report, this being needed as fast as the system is capable of transferring it. Alternatively, the editor might need access to remotely stored files, such as sound files on another person’s system, in order to work on them separately. In audio post-production for films or video there might be a central store of sound effects, accessible by everyone on the network, or it might be desired to pass on a completed portion of a project to the next stage in the post-production process.


Figure 10.23   The ISO model for Open Systems Interconnection is arranged in seven layers, as shown here

Wired Ethernet is fast enough to transfer audio data files faster than real time, depending on network loading and speed. For satisfactory operation it is advisable to use 100 Mbit/s or even 1 Gbit/s Ethernet as opposed to the basic 10 Mbit/s version. Switched Ethernet architectures allow the bandwidth to be more effectively utilised, by creating switched connections between specific source and destination devices. Approaches using FDDI or ATM are appropriate for handling large numbers of sound file transfers simultaneously at high speed. Unlike a real-time audio interface, the speed of transfer of a sound file over a packet-switched network (when using conventional file transfer protocols) depends on how much traffic is currently using it. If there is a lot of traffic then the file may be transferred more slowly than if the network is quiet (very much like motor traffic on roads). The file might be transferred erratically as traffic volume varies, with the file arriving at its destination in ‘spurts’. There therefore arises the need for network communication protocols designed specifically for the transfer of real-time data, which serve the function of reserving a proportion of the network bandwidth for a given period of time. This is known as engineering a certain ‘quality of service’.
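The claim that a file can be transferred faster than real time can be checked with simple arithmetic, sketched below in Python. The function names are hypothetical, and the usable fraction of 0.5 is an arbitrary placeholder for protocol overhead and network loading, not a measured figure; the real value varies with traffic, as described above.

```python
def audio_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of linear PCM audio in bit/s."""
    return sample_rate_hz * bits_per_sample * channels


def real_time_factor(network_bit_rate, usable_fraction,
                     sample_rate_hz, bits_per_sample, channels):
    """How many times faster than real time a sound file could be moved.

    usable_fraction is an assumed share of the nominal network rate that
    is actually available to this transfer (0.0 to 1.0).
    """
    usable_rate = network_bit_rate * usable_fraction
    return usable_rate / audio_bit_rate(sample_rate_hz, bits_per_sample, channels)


if __name__ == "__main__":
    # 16-bit stereo at 44.1 kHz over three nominal Ethernet speeds.
    for name, rate in (("10 Mbit/s", 10e6), ("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
        factor = real_time_factor(rate, 0.5, 44100, 16, 2)
        print(f"{name}: about {factor:.1f} times real time")
```

A factor below 1 would mean that the file could not be delivered as fast as it plays, which is why the slower network speeds leave little margin once several users share the network.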

Without real-time protocols the computer network may not be relied upon for transferring audio where an unbroken audio output is to be reconstructed at the destination from the data concerned. The faster the network the more likely it is that one would be able to transfer a file fast enough to feed an unbroken audio output, but this should not be taken for granted. Even the highest speed networks can be filled up with traffic! This may seem unnecessarily careful until one considers an application in which a disk drive elsewhere on the network is being used as the source for replay by a local workstation, as illustrated in Figure 10.24. Here it must be possible to ensure guaranteed access to the remote disk at a rate adequate for real-time transfer, otherwise gaps will be heard in the replayed audio.


Figure 10.24   In this example of a networked system a remote disk is accessed over the network to provide data for real-time audio playout from a workstation used for on-air broadcasting. Continuity of data flow to the on-air workstation is of paramount importance here

Protocols for the Internet

The Internet is now established as a universal means for worldwide communication. Although real-time protocols and quality of service do not sit easily with the idea of a free-for-all networking structure, there is growing evidence of applications that allow real-time audio and video information to be streamed with reasonable quality. The RealAudio format, for example, developed by Real Networks, is designed for coding audio in streaming media applications, currently at rates between 12 and 352 kbit/s for stereo audio, achieving respectable quality at the higher rates. People are also increasingly using the Internet for transferring multimedia projects between sites using FTP (file transfer protocol).

The Internet is a collection of interlinked networks with bridges and routers in various locations, which originally developed amongst the academic and research community. The bandwidth (data rate) available on the Internet varies from place to place, and depends on the route over which data is transferred. In this sense there is no easy way to guarantee a certain bandwidth, nor a certain ‘time slot’, and when there is a lot of traffic it simply takes a long time for data transfers to take place. Users access the Internet through a service provider (ISP), using either a telephone line and a modem, ISDN or an ADSL connection. The most intensive users will probably opt for high-speed leased lines giving direct access to the Internet.

The common protocol for communication on the Internet is called TCP/IP (Transmission Control Protocol/Internet Protocol). This provides a connection-oriented approach to data transfer, allowing for verification of packet integrity, packet order and retransmission in the case of packet loss. At a more detailed level, as part of the TCP/IP structure, there are high level protocols for transferring data in different ways. There is a file transfer protocol (FTP) used for downloading files from remote sites, a simple mail transfer protocol (SMTP) and a post office protocol (POP) for transferring email, and a hypertext transfer protocol (HTTP) used for interlinking sites on the world wide web (WWW). The WWW is a collection of file servers connected to the Internet, each with its own unique IP address (the method by which devices connected to the Internet are identified), upon which may be stored text, graphics, sounds and other data.

UDP (user datagram protocol) is a relatively low-level connectionless protocol that is useful for streaming over the Internet. Being connectionless, it does not require any handshaking between transmitter and receiver, so the overheads are very low and packets can simply be streamed from a transmitter without worrying about whether or not the receiver gets them. If packets are missed by the receiver, or received in the wrong order, there is little to be done about it except mute or replay distorted audio, but UDP can be efficient when bandwidth is low and quality of service is not the primary issue.
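The muting option mentioned here can be sketched as follows in Python. UDP itself carries no sequence numbers, so the sketch assumes that the application has numbered its own packets; the function name and the data layout are hypothetical.

```python
def conceal_by_muting(received, total_packets, samples_per_packet):
    """Rebuild an audio stream, replacing missing packets with silence.

    received maps an application-level sequence number (an assumption,
    since UDP does not provide one) to a list of audio samples.
    Packets may have arrived in any order; any that never arrived are
    replaced by a block of zeros, i.e. the output is muted for that period.
    """
    silence = [0] * samples_per_packet
    output = []
    for sequence in range(total_packets):
        output.extend(received.get(sequence, silence))
    return output
```

For example, if packet 1 of three two-sample packets is lost, the reconstructed stream contains the samples of packet 0, two zeros, then the samples of packet 2.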

Various real-time protocols have also been developed for use on the Internet, such as RTP (real-time transport protocol). Here packets are time-stamped and may be reassembled in the correct order and synchronised with a receiver clock. RTP does not guarantee quality of service or reserve bandwidth but this can be handled by a protocol known as RSVP (resource reservation protocol). RTSP is the real-time streaming protocol that manages more sophisticated functionality for streaming media servers and players, such as stream control (play, stop, fast-forward, etc.) and multicast (streaming to numerous receivers).
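The way time-stamped packets allow a receiver to restore the correct order can be sketched in Python as follows. The 12-byte fixed header layout follows the published RTP specification (RFC 3550), but the function names are hypothetical, the payload type of 96 used in the example is simply one of the dynamically assigned values, and the sketch ignores header extensions, contributing-source lists and timestamp wrap-around.

```python
import struct

# Fixed RTP header: flags byte, marker/payload-type byte,
# 16-bit sequence number, 32-bit timestamp, 32-bit source identifier.
RTP_HEADER = struct.Struct("!BBHII")


def pack_rtp(payload_type, sequence, timestamp, ssrc, payload):
    """Build a minimal RTP packet (version 2, no padding or extension)."""
    first_byte = 2 << 6                 # version 2 in the top two bits
    second_byte = payload_type & 0x7F   # marker bit left clear
    header = RTP_HEADER.pack(first_byte, second_byte,
                             sequence & 0xFFFF,
                             timestamp & 0xFFFFFFFF,
                             ssrc & 0xFFFFFFFF)
    return header + payload


def unpack_rtp(packet):
    """Split a minimal RTP packet into its header fields and payload."""
    first_byte, second_byte, sequence, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {
        "version": first_byte >> 6,
        "payload_type": second_byte & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[RTP_HEADER.size:],
    }


def reorder(packets):
    """Return parsed packets sorted into timestamp order."""
    return sorted((unpack_rtp(p) for p in packets), key=lambda h: h["timestamp"])
```

A practical receiver would hold packets in a buffer for a short time before replay, so that late arrivals can still be slotted into place; the sorting step above shows only the reordering itself.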

Wireless networks

Increasing use is made of wireless networks these days, the primary advantage being the lack of need for a physical connection between devices. There are various IEEE 802 standards for wireless networking, including 802.11 which covers wireless Ethernet or ‘Wi-Fi’. These typically operate on either the 2.4 GHz or 5 GHz radio frequency bands, at relatively low power, and use various interference reduction and avoidance mechanisms to enable networks to coexist with other services. It should, however, be recognised that wireless networks will never be as reliable as wired networks owing to the differing conditions under which they operate, and that any critical applications in which real-time streaming is required would do well to stick to wired networks where the chances of experiencing drop-outs owing to interference or RF fading are almost non-existent. They are, however, extremely convenient for mobile applications and when people move around with computing devices, enabling reasonably high data rates to be achieved with the latest technology.

Bluetooth is one example of a wireless personal area network (WPAN) designed to operate over limited range at data rates of up to 1 Mbit/s. Within this there is the capacity for a number of channels of voice quality audio at data rates of 64 kbit/s and asynchronous channels up to 723 kbit/s. Taking into account the overhead for communication and error protection, the actual data rate achievable for audio communication is usually only sufficient to transfer data-reduced audio for a few channels at a time.

Audio over Firewire (IEEE 1394)

Firewire is an international standard serial data interface specified in IEEE 1394-1995. One of its key applications has been as a replacement for SCSI (Small Computer Systems Interface) for connecting disk drives and other peripherals to computers. It is extremely fast, running at rates of 100, 200 and 400 Mbit/s in its original form, with higher rates appearing all the time up to 3.2 Gbit/s. It is intended for optical fibre or copper interconnection, the copper 100 Mbit/s (S100) version being limited to 4.5 m between hops (a hop is the distance between two adjacent devices). The S100 version has a maximum realistic data capacity of 65 Mbit/s, a maximum of 16 hops between nodes and no more than 63 nodes on up to 1024 separate buses. On the copper version there are three twisted pairs – data, strobe and power – and the interface operates in half duplex mode, which means that communications in two directions are possible, but only one direction at a time. The ‘direction’ is determined by the current transmitter which will have arbitrated for access to the bus. Connections are ‘hot pluggable’ with auto-reconfiguration – in other words one can connect and disconnect devices without turning off the power and the remaining system will reconfigure itself accordingly. It is also relatively cheap to implement.
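To give a feel for what the quoted S100 capacity means for audio, the following Python sketch estimates an upper bound on the number of linear PCM channels that 65 Mbit/s could carry. The function name is hypothetical and the calculation deliberately ignores packet headers and any other protocol overhead, so real systems would achieve fewer channels.

```python
def max_pcm_channels(capacity_bit_s, sample_rate_hz, bits_per_sample):
    """Upper bound on PCM channels, ignoring all protocol overhead."""
    bits_per_channel_per_second = sample_rate_hz * bits_per_sample
    return capacity_bit_s // bits_per_channel_per_second


if __name__ == "__main__":
    s100_capacity = 65_000_000  # realistic S100 data capacity quoted in the text
    print(max_pcm_channels(s100_capacity, 48_000, 24))
    print(max_pcm_channels(s100_capacity, 96_000, 24))
```

Doubling the sampling frequency halves the channel count, which is one reason the higher speed versions of the interface are attractive for high resolution multichannel work.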

Firewire combines features of network and point-to-point interfaces, offering both asynchronous and isochronous communication modes, so guaranteed latency and bandwidth are available if needed for time-critical applications. Communications are established between logical addresses, and the end point of an isochronous stream is called a ‘plug’. Logical connections between devices can be specified as either ‘broadcast’ or ‘point-to-point’. In the broadcast case either the transmitting or receiving plug is defined, but not both, and broadcast connections are unprotected in that any device can start and stop them. A primary advantage for audio applications is that point-to-point connections are protected – only the device that initiated a transfer can interfere with that connection, so once established the data rate is guaranteed for as long as the link remains intact. The interface can be used for real-time multichannel audio interconnections, file transfer, MIDI and machine control, carrying digital video, carrying any other computer data and connecting peripherals (e.g. disk drives).

Originating partly in Yamaha’s ‘m-LAN’ protocol, the 1394 Audio and Music Data Transmission Protocol is now also available as an IEC PAS component of the IEC 61883 standard (a PAS is a publicly available specification that is not strictly defined as a standard but is made available for information purposes by organisations operating under given procedures). It offers a versatile means of transporting digital audio and MIDI control data.

Audio over universal serial bus (USB)

The Universal Serial Bus is not the same as IEEE 1394, but it has some similar implications for desktop multimedia systems, including audio peripherals. USB has been jointly supported by a number of manufacturers including Microsoft, Digital, IBM, NEC, Intel and Compaq. Version 1.0 of the copper interface runs at a lower speed than 1394 (typically either 1.5 or 12 Mbit/s) and is designed to act as a low cost connection for multiple input devices to computers such as joysticks, keyboards, scanners and so on. USB 2.0 runs at a higher rate up to 480 Mbit/s and is supposed to be backwards-compatible with 1.0.

USB 1.0 supports up to 127 devices for both isochronous and asynchronous communication and can carry data over distances of up to 5 m per hop (similar to 1394). A hub structure is required for multiple connections to the host connector. Like 1394 it is hot pluggable and reconfigures the addressing structure automatically, so when new devices are connected to a USB setup the host device assigns a unique address. Limited power is available over the interface and some devices are capable of being powered solely using this source – known as ‘bus-powered’ devices – which can be useful for field operation of, say, a simple A/D convertor with a laptop computer.

The way in which audio is handled on USB is well defined and somewhat more clearly explained than the 1394 audio/music protocol. It defines three types of communication: audio control, audio streaming and MIDI streaming. We are concerned primarily with audio streaming applications. Audio data transmissions fall into one of three types. Type 1 transmissions consist of channel-ordered PCM samples in consecutive subframes, whilst Type 2 transmissions typically contain non-PCM audio data that does not preserve a particular channel order in the bitstream, such as certain types of multichannel data-reduced audio stream. Type 3 transmissions are a hybrid of the two such that non-PCM data is packed into pseudo-stereo data words in order that clock recovery can be made easier. This method is in fact very much the same as the way data-reduced audio is packed into audio subframes within the IEC 61937 format described earlier in this chapter, and follows much the same rules.
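The idea of channel-ordered PCM samples in consecutive subframes can be sketched in Python as follows. This is an illustration of the interleaving principle only: the function name is hypothetical, and the use of 16-bit little-endian subframes is an assumption made for the example rather than a statement of what any particular device uses.

```python
import struct


def pack_channel_ordered(channels):
    """Interleave per-channel sample lists into consecutive subframes.

    channels is a list of equal-length lists of signed 16-bit integers,
    one list per audio channel. For each sampling instant the samples are
    written in channel order (channel 1, channel 2, ...), each occupying
    one 2-byte little-endian subframe (an assumed subframe size).
    """
    out = bytearray()
    for frame in zip(*channels):        # one tuple per sampling instant
        for sample in frame:            # channels in their defined order
            out += struct.pack("<h", sample)
    return bytes(out)
```

For a stereo signal this produces the familiar left, right, left, right sequence, so a receiver that knows the number of channels and the subframe size can separate the channels again without any additional labelling.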

AES-47: Audio over ATM

AES-47 defines a method by which linear PCM data, either conforming to AES-3 format or not, can be transferred over ATM (Asynchronous Transfer Mode) networks. There are various arguments for doing this, not least being the increasing use of ATM-based networks for data communications within the broadcasting industry and the need to route audio signals over longer distances than is possible using standard digital interfaces. There is also a need for low latency, guaranteed bandwidth and switched circuits, all of which are features of ATM. Essentially an ATM connection is established in a similar way to making a telephone call. A SETUP message is sent at the start of a new ‘call’ that describes the nature of the data to be transmitted and defines its vital statistics. The AES-47 standard describes a specific professional audio implementation of this procedure that includes information about the audio signal and the structure of audio frames in the SETUP at the beginning of the call.

