Music is arguably the richest and most carefully-constructed of all acoustic signals – several highly-trained performers can work for hours to get the precise, desired effect in a particular recording. We might conclude that the information carried by the musical waveform is greater than in any other sound – although this immediately gets us into the puzzling territory of trying to define exactly what information it is that music carries, why it exists, and why so many people spend so much time creating and enjoying it.

Leaving aside those philosophical points that are beyond the scope of this chapter, we can easily name a great many objective aspects of a music recording that a listener can extract, with more or less difficulty, such as the beat, melody, lyrics etc. As with other perceptual feats, we can hope to build computer-based systems to mimic these abilities, and it is interesting to see how successful such approaches can be, and to consider the applications in which such automatic systems could be used.

As discussed in Chapter 36, music has been linked to computers since the earliest days of electronic computation, including Max Mathews’ 1961 synthesis of “Daisy Bell” on an IBM 7094. Computer music synthesis soon led to the idea of computer music analysis, with the first attempt at automatic transcription by Moorer [16]. However, it quickly emerged that, as with other attempts at machine perception, the seemingly effortless analyses performed by human senses were very difficult to duplicate on a machine. Despite more or less continuous research, it is only now that we are on the verge of having the algorithms, the computational power, and the datasets available to produce systems capable of general music transcription and various other musically-relevant judgments. Technological developments have, at the same time, presented urgent challenges in navigating large online and portable music collections, which cry out for a ‘listening machine’ able to listen, remember, and retrieve in listener-relevant terms.

Below, we look at a range of different problems in extracting information from music recordings, starting with the most detailed – such as the individual notes played by the performer – and moving to progressively larger temporal contexts. The unifying theme is that symbolic information is extracted from raw audio waveforms. Thus, we do not include the significant body of work on making high-level musical inferences directly from score representations (e.g., MIDI, a machine-readable note-event description), although this work has been a strong influence on the more recent audio-based work.


FIGURE 37.1 Beginning of the musical score of Handel's Hallelujah chorus.


Figure 37.1 shows a musical score, the conventional documentary representation of a musical piece in the western classical tradition. A musical score has a similar relationship to music audio as actors' scripts have to their speech: A musician (or group of musicians) can use the score to recreate an acoustic rendition of the piece, and the composer of the piece attempts to specify all the relevant information in the score, although different performances will differ to some extent in their “interpretation” in aspects of detailed timing and emphasis variations. Significantly, a trained music student is often able to listen to a piece of music and regenerate the score, or a close approximation (although they may need to listen repeatedly to short sections). This is called transcription, and is an important part of the “ear training” that musicians undergo.


FIGURE 37.2 Spectrogram of a recording of the music of Figure 37.1.

If a person can transcribe a recording of music into a score, we might hope to be able to do the same thing automatically with a computer. This is the analog of speech recognition in the musical domain. However, unlike the majority of speech tasks in which there is only one speaker speaking at a time, the most significant body of music audio consists of multiple, simultaneous musical instruments (ensembles), and a full transcription requires untangling them all. From this perspective, music transcription is even more challenging than speech recognition, and indeed it is not yet comparably successful. Other aspects, however, such as the common characteristics of different renditions of individual notes, and the constraints between sequential and simultaneous tokens, give music signals some favorable properties and make transcription a viable goal.

Music transcription is valuable for applications such as searching for a particular melody within a database of recordings (needed for “query-by-humming”); high-quality transcripts would make possible a range of exciting analysis-resynthesis applications, including analyzing, modifying, and cleaning up famous archival recordings.1 Another motivation for transcription arises from the observation that a score-type description typically consists of just a few hundred bits of information per second of audio – a factor of 1000 or more smaller than a comparable, high-quality encoding of the audio itself. If music audio recordings could be automatically and accurately abstracted to their score-level descriptions, it could open up a whole new family of compression algorithms [21].


Like pitch in speech (discussed in Chapters 16 and 31), musical pitch arises from local, regular repetition (periodicity) in the sound waveform, which in turn may be analyzed into regularly-spaced sinusoid harmonics at integer multiples of the fundamental frequency (f0) in a Fourier analysis. These are visible as horizontal stripes in the spectrogram shown in Figure 37.2 (not to be confused with the vertical stripes, which indicate the note onsets and syllables). Note transcription could be a relatively simple search for a set of fundamentals to explain the observed harmonics, except that (a) for various reasons including noise and the inherent trade-off between time and frequency resolution, identifying discrete harmonics in Fourier transforms can be noisy and ambiguous, and (b) simultaneous sinusoids of identical or close frequencies are hard to separate (since even if their frequencies match, their relative phase may result in reinforcement or cancellation). Unfortunately, musical harmony is based on simple ratios of fundamental frequencies of notes, presumably because such combinations lead to “interesting” effects when the coincident or close harmonics interfere with one another. Thus, multiple-voice music audio is full of such collisions. Nonetheless, many note transcription systems have been developed based on fitting harmonic models to the signals, and they have steadily increased the detail extracted (from one or two voices through to higher-order polyphony) and the range of acoustic conditions in which they can be applied (from small numbers of specific instruments, to instrument-independent systems).


FIGURE 37.3 Block diagram of the multiple-fundamental-frequency transcription algorithm of Klapuri [12].

Interesting examples include Goto & Hayamizu [11] and Klapuri [12]. At the heart of these approaches is an algorithm for accounting for the spectral peaks in a single frame of the Fourier transform with a small number of fundamental frequencies. Figure 37.3 illustrates the system of Klapuri [12]: First, spectral frames are normalized to reduce the impact of aperiodic noise and interference and enhance the true harmonic peaks. The spectral magnitude is smoothed across frequency, then this smoothed spectrum is subtracted, leaving only the rapid variations in the spectrum such as the sharp spectral peaks. This is essentially high-pass filtering the spectrum along the frequency axis. This enhanced spectrum is then passed to the “predominant f0 estimation” block, which identifies the single fundamental period most strongly present in the spectrum. This is done by correlating it against a template that consists of an ideal set of harmonics of a given fundamental – effectively summing up the enhanced energy at all the expected harmonic locations. An example template appears as the grid of vertical lines in the upper graph in the Predominant f0 Estimation box. By doing this for every value in a dense grid of candidate fundamental frequencies, the approach derives a ‘strength’ function as a function of f0, shown in the lower graph. The predominant fundamental is then chosen as the largest peak in this graph, and is reported as one of the fundamentals present in the frame.
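The harmonic-template search at the heart of this stage can be sketched in a few lines of code. This is a simplified illustration rather than Klapuri's exact algorithm – the spectral enhancement and smoothness steps are omitted, and the function and parameter names are our own:

```python
import numpy as np

def predominant_f0(mag_spec, sr, n_fft, f0_grid, n_harm=10):
    """Estimate the single strongest fundamental in a magnitude
    spectrum by summing the energy found at the expected harmonic
    bins of each candidate f0 (the 'strength' function)."""
    strengths = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        # bins of the first n_harm harmonics of this candidate f0
        harm_bins = np.round(np.arange(1, n_harm + 1) * f0 * n_fft / sr).astype(int)
        harm_bins = harm_bins[harm_bins < len(mag_spec)]
        strengths[i] = mag_spec[harm_bins].sum()
    return f0_grid[np.argmax(strengths)], strengths
```

Note that without normalization, a candidate at half the true fundamental accounts for all of the true harmonics plus its own, which is one reason the full system needs the enhancement and smoothness machinery around this core.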

Other notes are found by iteratively removing the harmonics of the found notes from the original spectrum, then repeating the search. However, it is not particularly unusual in music to have simultaneous notes where the fundamental of one is an integer multiple of the fundamental of the other. For instance, if one note with fundamental frequency F has harmonics at F, 2F, 3F, ..., then a simultaneous note one octave above it will have harmonics at 2F, 4F, 6F, .... If all the harmonics of the lower note were completely removed, we would never subsequently find the higher note. To avoid this, the system enforces some kind of continuity in the amount of energy that is subtracted. By constraining the energy removed at 4F to be no more than some fixed threshold above the average of the energy present at 3F and 5F (the adjacent harmonics of the lower fundamental, which do not interfere with the higher note), the system is able to leave some energy at the frequencies of harmonics that have experienced constructive interference making their energy much higher than that of their immediate neighbors. This is illustrated in the f0 spectral smoothing block, where the spectrum consists of two notes whose fundamentals are in a ratio of 3:1. The thick line shows the smooth spectrum constraint based on the lower fundamental (i.e., the more closely spaced harmonic peaks). The added prominence of every third harmonic clearly pushes it above this threshold, leaving adequate information in the modified spectrum to identify the higher note in a later iteration.

In contrast to ‘expert system’ approaches based on insight into the nature of musical notes, an ‘ignorance-based’ approach was explored by Ellis & Poliner [7] who trained general-purpose support-vector machine (SVM) classifiers to recognize spectral slices (from the spectrogram) that contain particular notes, based on labeled training data. They obtained training data from multitrack music recordings (where each instrument is in a separate track), extracting the pitch of the main vocal line, then using those pitch values as labels for training features extracted from the full mixdown. This approach compared well to more traditional techniques, coming third out of ten systems in a formal evaluation of systems for identifying the melody in popular music recordings, conducted as part of the “MIREX-05” formal evaluations of music information retrieval technologies [6]; on average, around 70% of melody notes were correctly transcribed, with a wide variation across different pieces. In many cases, transcribed melodies are clearly recognizable, implying the transcripts are useful, e.g., for retrieval. But a significant number of excerpts have accuracies below 50% and barely recognizable transcripts, although, as with any evaluation, it is hard to interpret the absolute performance without looking in detail at how it was calculated, and what kind of material was used in the test set.


Since note transcription is so challenging, various simplifications of the problem have also been considered, for instance by incorporating more knowledge or constraints into the solution. One notable case is when the score – i.e., the expected sequence of notes, and their canonical timing – is taken as an input to the transcription system. On the face of it, this is solving transcription by starting with the answer, but in practice the process of aligning a known score to a recorded rendition has a number of useful applications, including:

  • automatic accompaniment [4, 25, 5]
  • extracting expressive details of a performance [22]
  • aligning information, e.g., different performances, or audio playback to a related visual display [13, 17].

The majority of work on this topic (e.g., [18, 24, 5, 17]) has drawn on the ideas of Dynamic Time Warping (DTW). DTW was introduced in Chapter 24 as a simple approach to speech recognition via matching time-warped instances to reference templates. DTW also appeared in Chapter 26 in the guise of Viterbi alignment, which is used to find the correspondence between a known word sequence and the audio of the utterance. This is very much the same problem faced in audio-to-score alignment: we know the pitches (frequencies) we expect, and we have an approximation of their timing from interpreting the score with a single, fixed tempo, but the actual recording is likely to have a different, and perhaps varying tempo. Moreover, the precise realizations of each note (i.e., the spectral detail in the harmonic profiles) are unknown.

DTW uses the highly efficient dynamic programming algorithm to search over all allowable temporal alignments – sets of corresponding pairs of time points in the two sequences being aligned – to maximize the sum of a local, framewise similarity measure, and possibly a score associated with the quality of the alignment, such as preferring paths that correspond to approximately equal tempos between recording and reference. The constraint on “allowable” alignments generally requires that the aligning process advances in both dimensions (tref and tobs) at every step, leading to limits on the slope of the path created by plotting tref against tobs. In other variants, a step may move forward in only one dimension, leading to a path that may have horizontal or vertical stretches in which some nonzero stretch of time in one sequence is all matched to the same, single instant in the other. Figure 37.4 shows an example of an alignment path between the score of “Let It Be” by the Beatles, represented as a MIDI file with rigid tempo, and the recorded performance.
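A minimal version of this dynamic-programming search might look like the following sketch. It assumes a precomputed matrix of local dissimilarities (costs, so lower is better) and allows diagonal, horizontal, and vertical steps, i.e., the variant in which one frame may align to several:

```python
import numpy as np

def dtw_path(cost):
    """Find the minimum-cost alignment path through a local-cost
    matrix cost[i, j] between frames i of one sequence and j of
    the other, via dynamic programming with traceback."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost, 1-indexed
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j], D[i, j - 1])
    # trace back from the end to recover the best path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i = i - 1
        else:
            j = j - 1
    return path[::-1], D[n, m]
```

For two identical sequences the recovered path is the main diagonal with zero total cost; tempo deviations in a real performance show up as local departures from that diagonal, as in Figure 37.4.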

Dynamic programming simply finds the lowest-cost path through a matrix of point-to-point similarities, so the overall success of the alignment is dependent upon how those similarities are calculated – i.e., how to evaluate a particular short segment of audio as being consistent or inconsistent with a particular set of active notes from the musical score. One simple approach is to generate audio from the score, then make the comparison in the audio domain between the actual recording and the synthesized rendition. When the score is encoded as a MIDI file, a format principally intended as the input to music synthesizers, generating audio is simply a matter of employing one of many available MIDI synthesizer programs. The short-time spectra of the two audio versions can then be compared, e.g., by cosine similarity:

d(Xr, Xs) = Σk Xr[k]·Xs[k] / ( √(Σk Xr[k]²) · √(Σk Xs[k]²) )
where Xr[k] is the magnitude of the kth spectral bin of the frame of real audio, and Xs[k] is the same for the synthetic audio [24]. Cosine similarity is the cosine of the angle between the high-dimensional spectral vectors, and thus is 1 when they are ‘parallel’ (scaled versions of one another), zero when they are orthogonal, and -1 when they are antiparallel. As such, it is invariant to scaling of either spectrum, meaning the similarity will still be high even if the absolute level of the synthesized version is different from the original recording.
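In code, the measure is a one-liner (a sketch; vector inputs are the magnitude spectra of corresponding frames):

```python
import numpy as np

def cosine_sim(xr, xs):
    """Cosine similarity between two magnitude-spectrum vectors:
    1 when parallel, 0 when orthogonal, invariant to overall scale."""
    return float(np.dot(xr, xs) / (np.linalg.norm(xr) * np.linalg.norm(xs)))
```

The scale invariance is easy to confirm: multiplying either argument by a positive constant leaves the result unchanged.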


FIGURE 37.4 Example of audio-to-score alignment. Left pane: ‘piano roll’ visualization of a score-level description of “Let It Be” by the Beatles. Bottom pane: spectrogram of the audio recording. Main pane: Peak Structure Distance [18] similarity between score and audio, with best path overlaid. Wiggles in path reflect deviations from strict tempo in performance.

However, even though MIDI scores will often include some specification of the type of synthetic voice or instrument to be used for each part of the score, the timbres of the original and synthesized instruments will likely be very different, placing an upper limit on the similarity scores that may open the door to confusions and misalignments. Instead, it is possible to make the comparisons between cepstral vectors – the inverse Fourier transforms of the log-magnitudes of the original spectra, introduced in Chapter 20. By excluding the first few cepstral components, differences in level and in overall spectral shape are factored out, and the comparison is limited to “fine structure” of the spectrum, i.e., the detail of the individual harmonics of each constituent note. To preserve these harmonic details, the cepstrum should be based on the full, original spectrum, not the smoothed and Mel-warped spectrum most often used for speech cepstral features (described in Chapter 22), which, by design, suppresses pitch-related information. For these cepstral features, a Euclidean distance measure is a good match.
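Such features can be computed as below (a sketch; the counts of excluded and retained coefficients are illustrative choices, not prescribed values):

```python
import numpy as np

def fine_structure_cepstrum(mag_spec, n_exclude=3, n_keep=40):
    """Real cepstrum of the full (unsmoothed) log-magnitude spectrum.
    Dropping the first few coefficients discounts overall level and
    broad spectral shape, keeping the per-harmonic fine structure."""
    log_spec = np.log(np.maximum(mag_spec, 1e-10))  # floor avoids log(0)
    cep = np.fft.irfft(log_spec)                    # inverse FT of log magnitudes
    return cep[n_exclude:n_keep]
```

Because scaling the spectrum only adds a constant to the log magnitudes, and a constant spectrum transforms to an impulse at quefrency zero, the retained coefficients are unaffected by level – the property the text relies on.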

It is also possible to design a similarity measure that directly compares a set of score notes to a short-time spectrum, avoiding the need for synthesized audio altogether. Orio & Schwarz [18] defined a ‘peak structure distance’ that uses the notes from the score to identify a set of frequencies that ought to contain the energy of harmonics indicating those notes, for instance by taking the first eight harmonics of each note's pitch. They then create a ‘mask’ consisting of the spectral bins in which those frequencies should appear, then rate the consistency of the observed spectrum with that mask as the proportion of the total energy of the spectral frame that falls under the mask. If the actual audio contains only the pitches indicated in the score, and has negligible energy above the 8th harmonic, the ratio will be close to 1. If, however, the audio contains harmonics other than those predicted, the ratio will fall accordingly. The key property is that the actual energy of the individual harmonics, and their relative strengths, have little or no effect on the measure, thus sidestepping the timbre-matching problems that arise with audio synthesis.
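The measure can be sketched as follows (function and parameter names are ours; the mask half-width and harmonic count are illustrative):

```python
import numpy as np

def peak_structure_similarity(mag_spec, sr, n_fft, note_freqs, n_harm=8, width=1):
    """Fraction of total spectral energy falling in a mask around the
    first n_harm harmonics of each expected note, in the spirit of
    Orio & Schwarz's 'peak structure distance'."""
    mask = np.zeros(len(mag_spec), dtype=bool)
    for f in note_freqs:
        for h in range(1, n_harm + 1):
            k = int(round(h * f * n_fft / sr))
            if k < len(mag_spec):
                mask[max(0, k - width): k + width + 1] = True
    energy = mag_spec ** 2
    return energy[mask].sum() / energy.sum()
```

Energy at frequencies the score does not predict pulls the ratio below 1, while the relative strengths of the predicted harmonics themselves do not matter, which is the point of the measure.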


Direct transcription of the pitches of all notes being played is clearly a useful basis for analysis and processing of music audio, but, depending on the application, it may be more detail than is needed. Take, for instance, the task of recognizing a particular piece of music – perhaps with the goal of detecting alternative performances or versions of the same underlying composition. Knowing the notes would be useful, but certain performers might change the specific notes as part of their own interpretation. Less likely to change, at least for a large subset of interpretations, is the underlying harmonic structure, i.e., the sequence of chords that make up the piece – since even if a melody note is changed, the accompanying chord is often unaltered. For this task, it would be more appropriate to transcribe not the individual notes, but their combined effect: the musical chord (e.g., C major, Bb minor) they constitute. And since to a large extent there is only a single principal chord present at any moment in the audio, the problems of dealing with overlapping signals do not arise: the chord is a ‘global’ property of the audio at a particular time. Thus chord transcription may be easier (i.e., more accurate) as well as more useful.

Although the chord perceived by a listener is determined mainly by the set of concurrent notes being played, which themselves are carried by the harmonics in the spectrum, an interesting musical effect is that notes that are changed by exactly one octave (i.e., a doubling or halving of the fundamental frequency) do not change the underlying chord – at least for the canonical 3-note root chords at the core of western harmony. Thus, C major can be represented by a combination of C4 (261.6 Hz), E4 (329.6 Hz) and G4 (392.0 Hz), but also by E4, G4, and C5 (2 x 261.6 = 523.2 Hz), or by any other combination of C, E, and G in different octaves. Because of this, in one of the first papers to address recognizing chords in music audio, Fujishima [10] proposed to describe the audio spectrum with a reduced representation he called the “Pitch Class Profile”, but which is now generally known as chroma representation. In a chroma representation, the entire spectrum is ‘folded’ down onto a single octave, for instance into 12 bins representing each of the semitones (C, C♯, ..., B) of the octave. By summing the energy associated with a particular fundamental across all octave shifts of that note, the chroma representation gives a good description of the different notes present in the audio without distinguishing the octave in which they occur. One simple way to calculate a chroma representation is by taking the energy in each bin of a standard spectrum (e.g., the output of a Fast Fourier transform), and assigning it to the chroma bin closest in frequency. Algorithmically, this can be done by going through every bin of the spectrum and adding its energy to the appropriate bin in a chroma vector, e.g.,

for each fft_bin do

chroma_bin = mod(round(12 log2 (freq(fft_bin)/freq(C4))), 12)
chroma_energy[chroma_bin] += fft_energy[fft_bin]

end for

where mod(x, 12) returns the remainder (modulo) of integer x divided by 12, freq(fft_bin) returns the frequency in Hz associated with a particular bin of the spectrum, i.e., k · SR/N for the kth bin of an N-point Fourier transform on a signal sampled at SR Hz, and freq(C4) is the reference frequency for the first chroma bin, i.e., 261.6 Hz. (Several refinements to chroma calculation have been proposed that seek to improve frequency resolution in low-frequency bins, and exclude non-tonal energy [8].)
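The pseudocode above translates directly into, e.g., a few lines of Python (a sketch; here ‘energy’ is taken as squared magnitude, and the DC bin is skipped since it has no defined pitch):

```python
import numpy as np

def chroma_vector(mag_spec, sr, n_fft, f_ref=261.6):
    """Fold an FFT magnitude spectrum into a 12-bin chroma vector by
    assigning each bin's energy to the nearest semitone class
    (f_ref = frequency of C4, so chroma bin 0 is C)."""
    chroma = np.zeros(12)
    for k in range(1, len(mag_spec)):       # skip DC (bin 0)
        f = k * sr / n_fft
        bin_ = int(np.mod(np.round(12 * np.log2(f / f_ref)), 12))
        chroma[bin_] += mag_spec[k] ** 2    # accumulate energy
    return chroma
```

A pure tone at C5 lands in the same chroma bin as one at C4, and a tone at G maps seven semitone bins above C, matching the folding behavior described in the text.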

Figure 37.5 illustrates the chroma representation by comparing it to the spectrogram, for the two inversions of a C major chord described above. The ‘chromagram’ takes chroma representations of each short-time window, then plots them as columns to show the variation of the 12 chroma bins as a function of time (just as the spectrogram shows the variation of spectral content). If you look carefully around 260 Hz in the spectrogram, a comparison of the first chord (e.g., at t = 1 sec) and the second (at t = 2 sec) shows that the second chord is lacking the fundamental for C4 at 261.6 Hz, but in the chromagram both chords look essentially the same. Also note that although the 1st, 2nd, 4th, 8th, etc. harmonics of a given fundamental will all contribute to the same chroma bin, the other harmonics – which often contain significant energy – will contribute to other bins. For instance, the 3rd harmonic has a chroma value 7 semitones (or a musical fifth) above the fundamental. This is principally responsible for the weak chroma activation we see in the figure at D (a fifth above G) and B (a fifth above E). In practice, however, such ‘ghost’ chroma do not impair the usefulness of the representation.

Starting with a chroma representation, chord detection may then be accomplished in very much the same way that phonemes are recognized in automatic speech recognition, as described in Chapters 22-26: For each distinct chord class, a statistical model of the feature observations associated with this class is learned from training data. To further improve recognition, sequential constraints such as the likelihood that a chord label will stay the same in successive time frames, and the probability that any particular chord will be followed by any other, can be captured in a state transition matrix. Put together, these constitute a hidden Markov model for chord recognition, and this can even be trained using the same EM procedure used in speech recognition if, for instance, the label data is available only as a chord sequence for a given training audio example, but without detailed time alignments [23]. Much recent work has looked at refinements to this basic chord transcription framework, with corresponding improvements in performance [2, 14, 3].
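The decoding step of such a model is the standard Viterbi recursion. A minimal sketch, with hypothetical per-frame chord log-likelihoods as input (in a real system these would come from the chroma-based chord models):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence for an HMM.
    log_obs:   [T, N] per-frame log-likelihoods of each of N chords
    log_trans: [N, N] log transition matrix (from row to column)
    log_init:  [N]    log initial-state probabilities"""
    T, N = log_obs.shape
    delta = log_init + log_obs[0]            # best score ending in each state
    psi = np.zeros((T, N), dtype=int)        # best-predecessor pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: from i to j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(N)] + log_obs[t]
    states = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # trace pointers backwards
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```

With ‘sticky’ self-transitions, the decoder smooths over noisy frame-level evidence, which is exactly why the transition matrix improves chord recognition.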


FIGURE 37.5 The two C major chord inversions described in the text, synthesized with a piano voice. Left: Score notation; Middle: Spectrogram analysis; Right: Equivalent chroma representation (or ‘chromagram’).


Above the level of individual chords and notes, music has further levels of structure, such as phrases and sections. Such structures comprise coherent sequences of chords and/or notes that often repeat, with or without variations. A listener, particularly one with some musical training, can easily identify phrases and sections, and might segment a pop song into ‘intro’, ‘verse’, ‘chorus’, etc., or a piece of classical music into ‘theme 1’, ‘bridge’, ‘development’, etc. As we might expect, there has also been a substantial body of work that attempts a similar segmentation using automatic analysis.

The problem of detecting musical structure is unusual because it constitutes a combination of local factors (e.g., a change in chord sequence or instrumentation between verse and chorus) and global factors (e.g., one definition of a chorus is a segment that repeats periodically with little or no variation). Thus, to successfully recover structure, we should expect to involve both local and larger-scale comparisons, and need some way to combine these measurements. Another complicating aspect is that, while music listeners will generally agree that it makes sense to divide a piece of music into segments that relate to one another in various ways, it is harder to get listeners to agree on exactly where those boundaries should fall, or on the ‘natural’ segments in a given piece of music. Partly these problems are a question of hierarchy (e.g., does each line of a verse, which may repeat the same melody, constitute a separate section, or only the entire verse?), but in other cases the segmentation is simply ambiguous, such as a bridge section that always occurs between verse and chorus: some might view it as the end of the verse, others as the start of the chorus, and still others see it as distinct from both.

One of the earliest ideas in this area was the ‘similarity matrix’ introduced in Foote [9]. In this visualization, a piece of music is represented as a square image, with each cell reflecting the local similarity between frames of the file at the times corresponding to the x and y co-ordinates – i.e., similar to the cost matrix used as the basis of alignment in Figure 37.4, except that one audio recording is being compared to itself, rather than a second piece of audio or a score representation. As with alignment, the comparison can be based on different feature representations (cepstra, chroma, ...), and can use different similarity measures (cosine, Euclidean, ...), and different sizes of local time window, each leading to a different perspective on the internal structure of the music. Figure 37.6 shows one example, the self-similarity of the entire 231 seconds of “Let It Be” by the Beatles, using cosine similarity of Mel-frequency cepstral coefficients. By comparing the light and dark patches to the hand-marked segment labels, it is clear that the similarity matrix contains much relevant information.
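Computing such a matrix from a sequence of feature vectors is straightforward; a sketch using cosine similarity:

```python
import numpy as np

def similarity_matrix(features):
    """Foote-style self-similarity matrix: cosine similarity between
    every pair of feature frames (features: [n_frames, n_dims])."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard zero frames
    return unit @ unit.T                        # S[i, j] = cos(x_i, x_j)
```

The main diagonal is always 1 (every frame matches itself); repeated material shows up as high values away from the diagonal.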

The similarity matrix was first used as a basis for automatic identification of chorus segments in Bartsch & Wakefield [1], who used chroma features calculated on beat-length segments (from an earlier beat-tracking step) to build a similarity matrix. The matrix is then smoothed along diagonals to emphasize off-diagonal stripes, which correspond to regions of successive, synchronous similarity. Then the largest value in the smoothed similarity matrix, indicating the longest and most exactly-repeated section, is chosen as the chorus. Evaluating against a set of 93 popular music tracks where the chorus had been manually marked, their system achieved an average temporal overlap between the automatically-identified chorus and the hand-marked chorus of around 70%, compared to around 30% for a baseline system that chose random segments.
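The diagonal smoothing step can be sketched as below (a simple O(n²L) loop; in practice the window length L would be tied to the expected duration of a repeated section):

```python
import numpy as np

def diagonal_smooth(S, L):
    """Average a self-similarity matrix along its diagonals over a
    window of L frames, emphasizing the off-diagonal stripes that
    mark repeated sections (after Bartsch & Wakefield's chorus finder)."""
    n = S.shape[0]
    D = np.zeros_like(S)
    for i in range(n):
        for j in range(n):
            ks = range(min(L, n - i, n - j))  # stay inside the matrix
            D[i, j] = np.mean([S[i + k, j + k] for k in ks])
    return D
```

A large value D[i, j] then says that the L frames starting at time i closely match the L frames starting at time j, so the global maximum (off the main diagonal) points at the most exactly repeated stretch.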

Subsequent work has continued to improve music segmentation and corresponding labels through better measures of structural fitness applied to descriptions that account for the entire recording [19]. Integration of high-level structure with low-level features has been improved by using local clustering of frames to form consistent segments prior to the comparison between different segments to find repeating structure [15]. However, further progress on music segmentation is limited by the intrinsic ambiguity and limited agreement between manually-annotated ground truth [20].


This chapter has looked at some of the information that can be directly extracted from music audio at the lowest level. We have seen that many musically-interesting descriptions can be successfully recovered with a combination of signal processing and pattern recognition. In the interests of space, we have omitted discussion of important work in extracting percussive (unpitched) events, identifying instruments, and tracking tempo and key, as well as much else. All of this ‘literal’ information can be used as a basis for more abstract analyses of the music audio items, for instance as the basis of music classification or retrieval systems as discussed in Chapter 38.


FIGURE 37.6 Similarity matrix for Let It Be by the Beatles (entire piece). The ‘texture’ of the image clearly changes with the different manually-annotated segments of the piece (shown at the side). Off-diagonal stripes indicate repeating segments.


  1. 37.1 Consider a C-major chord made up of three notes with fundamentals of 261.6 Hz (C4), 329.6 Hz (E4), and 392.0 Hz (G4), and with each note consisting of its first 6 harmonics.
    1. (a) What will be the frequencies of all the harmonics present? Assuming a spectral resolution of 20 Hz (corresponding to a temporal resolution on the order of 50 ms), how many of these harmonics will be stationary, and how many will be so close as to interfere by phase cancellation (“beating”)?
    2. (b) If the amplitudes of the harmonics fall as 1/n (where n is the harmonic number), and each note has equal amplitude, and if harmonics are quantized into 12, equally-spaced chroma bins aligned to the chord's tuning, what would be the chroma representation of this chord? How does this differ from what might be expected?
  2. 37.2 The chord progression for the verse and chorus of “Let it Be” by the Beatles can be transcribed, with one observation every half-bar (two beats), as:

    {C, G, a, F, C, G, F, C, a, e, F, C, C, G, F, C}

    where capital letters indicate major chords and lower case indicate minor chords.

    A hidden Markov model is trained using this sequence. Given a set of 8 observations in which the first chord is unambiguously C but the remaining observations match all chord models with equal likelihood, what would be the result of Viterbi decoding of the hidden Markov model for this sequence?

  3. 37.3 Section 37.6 mentions one evaluation metric for structure detection systems: the proportion of manually-marked “chorus” segments correctly identified. However, a more general structure discovery task is to segment and label all the different time regions in a musical piece. Algorithms may exhibit several common flaws including over-segmentation (dividing the piece up into too many segments), under-segmentation, segment boundary skew (i.e., placing a boundary some distance away from its true location), as well as incorrect clustering of the segments defined by the boundaries.

    Propose a measure for evaluating structure detection algorithms, discussing how it reflects each of these problems. You may also wish to consider the problem of inconsistent ground truth from different manual labelers.


  1. Bartsch, M. A. and Wakefield, G. H., “To catch a chorus: Using chroma-based representations for audio thumbnailing,” in Proc. IEEE Worksh. on Apps. of Sig. Proc. to Acous. and Audio, 2001.
  2. Bello, J. P. and Pickens, J., “A robust mid-level representation for harmonic content in music signals,” in Proc. ISMIR, pages 304-311, London, 2005.
  3. Cho, T., Weiss, R. J., and Bello, J. P., “Exploring common variations in state of the art chord recognition systems,” in Proc. Sound and Music Computing Conference (SMC), pages 1-8, Barcelona, 2010.
  4. Dannenberg, R. B., “An on-line algorithm for real-time accompaniment,” in Proc. Int. Computer Music Conf., pages 193-198, Paris, 1984.
  5. Dannenberg, R. B. and Raphael, C., “Music score alignment and computer accompaniment,” Commun. ACM, 49: 38-43, 2006.
  6. Downie, J., West, K., Ehmann, A., and Vincent, E., “The 2005 Music Information Retrieval Evaluation eXchange (MIREX 2005): Preliminary overview,” in Proc. 6th International Symposium on Music Information Retrieval ISMIR, pages 320-323, London, 2005.
  7. Ellis, D. P. W. and Poliner, G., “Classification-based melody transcription,” Machine Learning Journal, 65: 439-456, 2006.
  8. Ellis, D. P. W. and Poliner, G., “Identifying cover songs with chroma features and dynamic programming beat tracking,” in Proc. ICASSP, pages IV-1429-1432, Hawaii, 2007.
  9. Foote, J., “Visualizing music and audio using self-similarity,” in Proc. ACM Multimedia, pages 77-80, 1999.
  10. Fujishima, T., “Realtime chord recognition of musical sound: A system using common lisp music,” in Proc. ICMC, pages 464-467, Beijing, 1999.
  11. Goto, M. and Hayamizu, S., “A real-time music scene description system: Detecting melody and bass lines in audio signals,” in Working Notes of the IJCAI-99 Workshop on Computational Auditory Scene Analysis, pages 31-40, Stockholm, August 1999.
  12. Klapuri, A., “Multiple fundamental frequency estimation by harmonicity and spectral smoothness,” IEEE Trans. Speech and Audio Processing, 11: 804-816, 2003.
  13. Kurth, F. and Müller, M., “Efficient index-based audio matching,” IEEE Trans. Audio, Speech and Language Proc., 16: 382-395, Feb 2008.
  14. Lee, K. and Slaney, M., “Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio,” IEEE Trans. Audio, Speech and Language Proc., 16: 291-301, Feb 2008.
  15. Levy, M., Sandler, M., and Casey, M., “Extraction of high-level musical structure from audio data and its application to thumbnail generation,” in Proc. ICASSP, pages V-13–V-16, Toulouse, 2006.
  16. Moorer, J., “On the transcription of musical sound by computer,” Computer Music Journal, 1: 32-38, 1977.
  17. Müller, M. and Ewert, S., “Joint structure analysis with applications to music annotation and synchronization,” in Proc. ISMIR, pages 389-394, Philadelphia, 2008.
  18. Orio, N. and Schwarz, D., “Alignment of monophonic and polyphonic music to a score,” in Proc. International Computer Music Conference, pages 155-158, Havana, September 2001.
  19. Paulus, J. and Klapuri, A., “Music structure analysis using a probabilistic fitness measure and an integrated musicological model,” in Proc. ISMIR, pages 369-374, Philadelphia, 2008.
  20. Peiszer, E., Automatic audio segmentation: Algorithm, experiments and evaluation, Masters thesis, Vienna University of Technology, Vienna, Austria, 2007.
  21. Scheirer, E., “Structured audio, Kolmogorov complexity, and generalized audio coding,” IEEE Transactions on Speech and Audio Processing, 9: 914-931, 2001.
  22. Scheirer, E. D., Extracting Expressive Performance Information from Recorded Music, Masters thesis, MIT Media Lab, 1995.
  23. Sheh, A. and Ellis, D. P. W., “Chord segmentation and recognition using EM-trained Hidden Markov Models,” in Proc. Int. Conf. on Music Info. Retrieval ISMIR-03, pages 185-191, 2003.
  24. Turetsky, R. J. and Ellis, D. P. W., “Ground-truth transcriptions of real music from force-aligned midi syntheses,” in Proc. Int. Conf. on Music Info. Retrieval ISMIR-03, pages 135-141, 2003.
  25. Vercoe, B. and Puckette, M., “Synthetic rehearsal: Training the synthetic performer,” in Proc. Int. Computer Music Conf., pages 275-278, Vancouver, Canada, 1985.
  26. Walker, J. Q., “Music technology at Zenph studios,” Classical Voice of North Carolina, 2005.

1 Analyzing and re-recording old piano recordings by famous performers such as Rachmaninoff is the goal of the startup Zenph Studios: precise performance details are extracted from the original recording, and the piece is then re-performed on a robotic piano in a cleaner acoustic environment [26].
