Online music sources, such as Apple's iTunes Music Store, provide practically instant access to many millions of music tracks. In the past, choosing music to listen to was a question of flipping through actual media (CDs or LPs), with physical constraints limiting the choice to dozens or at most hundreds. Today, many people carry so much music around with them on portable devices that even just browsing through the cover images could take hours. This growth of several orders of magnitude in the scale of music choices demands entirely new mechanisms for identifying and selecting music – what we are calling the music retrieval problem. This chapter looks at various formulations of this problem, and some current solutions.

Music retrieval shares many aspects with classic text document search (e.g., web searching), but also has some distinctive characteristics owing to the differences between the ultimate purposes of retrieval: listening to music is not so much like reading a document. Music retrieval may be aimed at finding a specific, known item, or alternatively something suited to the listener's mood or taste. It may be based on a specific query, either in words or by example, or it may be “implicit” (such as activating an internet radio station), in which case the query can be drawn from context such as personal music collections or recent listening history. The actual needs or goals to be satisfied may not be clear, even to the listener, and indeed music preference is still a mysterious phenomenon. All this contributes to the need for a range of different approaches.

In this chapter, we will look at a number of music retrieval scenarios. The first is query-by-example with the goal of finding an exact match, i.e., the audio fingerprinting task, now popularized by applications running on portable devices that can name background music being played in public spaces. A similar scenario, but involving very different solutions, is “query by humming”, in which the user vocalizes some imitation of the desired piece for the machine to identify. A third related variant is cover song matching, where a query example is given, but the task is to find recordings that are recognizably the same music, but performed by different musicians and/or with a different interpretation.

Moving beyond these searches for specific items, a common scenario is for listeners to want to hear music they like, or that matches a particular situation, without having to individually specify each track. This can be based on text queries describing particular bands, genres, or other “tags”, or perhaps derived from specific examples with the goal of finding “more music like this”. An example application of this piece-to-piece music similarity task is playlist generation, where the goal is to come up with a sequence of music recordings that can be played in order to satisfy or please a listener.

All of these approaches are based on the actual audio content of the music recordings, which is our interest here. It is worth recognizing, however, that there are many other sources of information about music and musicians, and these can have significant influence and utility in practical music retrieval tasks. The most familiar are the recommendations commonly encountered in online music stores based on comparisons between a particular user's music collection (or listening history) and those of other users over a wide pool: Roughly speaking, the idea is to find other users who resemble the target user, then recommend whatever those other users liked or bought. Such approaches never actually look at the audio content, but merely at the behavior of humans, who are presumably responding to the audio as well as other influences. In the case of music for which a large amount of user interaction data is available, these approaches can be highly effective, although such techniques place new or little-known pieces at a considerable disadvantage since they will rarely be recommended (the “cold start” problem). Several recent approaches have sought to combine audio-based music similarity into a framework otherwise based on user data, e.g., [22, 2].


Music fingerprinting, which aims to identify a particular recording within a large archive, is the least ambiguous task considered in this chapter, and also one of the most widely used. Fingerprinting is used to monitor radio broadcasts (e.g., for royalty collections), to identify music collections that may have been digitized in different ways and which may have unreliable labels, as well as for the now-classic application of identifying a piece of music being played in a public space that you recognize but cannot name. In all these cases, fingerprinting is very successful – which at first may appear surprising, since no human can instantly recognize a track among millions. But in fact the task is just the kind of thing that is relatively easy for computers: there is a closed set of existing recordings, and the query audio is precisely the same (exactly the same waveform), at least at its point of origin. Thus, almost any summary feature is sufficient to specify the piece, provided it is not too badly distorted by the transformations and channels the original audio goes through on its way to becoming the query. The main remaining requirement is that the feature retain enough detail to avoid false alarms, along with secondary considerations such as the size of the reference database, the ease with which it can be searched, computational cost, etc.

One very successful algorithm for identifying recordings even amid noisy backgrounds was developed by Avery Wang for Shazam [20]. The matching is based on the identification of time-frequency “landmarks” – prominent energy peaks at a particular time and frequency. The rationale behind using these landmarks is that the frequencies of spectral peaks, the relative times at which they occur, and the fact that they are prominent compared to their immediate neighbors, are unlikely to be changed by the addition of background noise, or distortion caused by channels of various kinds. In particular, it is the timing and spectral location of the landmark that are recorded, not the actual level of the peak, as this is much more vulnerable to modification by these effects. The landmark parameters are combined with a matching scheme that can successfully and unambiguously identify a match even if only a small minority of all the possible landmarks are identified in the query – thus only a relatively tiny part of the audio, sparsely distributed in time and frequency, is actually responsible for each successful match.


FIGURE 38.1 Example of music fingerprint matching using the Shazam landmark-based algorithm. Black circles indicate time-frequency “landmarks”, and lines connecting them indicate the pairs of landmarks stored as descriptors in the hash table. The top pane shows the analysis of a 5.6 s noisy query, and the bottom pane shows the matching excerpt found in the reference database. White lines indicate matching descriptors.

The reference index is based on pairs of landmarks, such that a particular track is represented by a relatively large number of local landmark pairs, described by the frequency of the first landmark peak, the frequency of the second peak, and the time difference between them. These three values are combined via a hashing function to yield a single key. The database then holds, for each key, the list of reference track IDs, and the time within each track at which the combination of landmarks occurs.

Querying the database then consists of forming local landmark pairs from the query audio, retrieving the lists of all the reference tracks that contain each of those hashed keys, then seeing if there is any track that occurs in an unusually large proportion of the retrieved lists. Generally speaking, an individual hash will occur in a random selection of all the reference tracks, so the chance of an unrelated track turning up in many of the lists – i.e., a false alarm – is very low. If there are multiple contenders for the most likely track, however, a second stage of confirmation is to look at the times at which each matching landmark hash occurs in the query compared to the time it occurs in the reference track, as retrieved from the database. Although the query has an unknown offset in its start time relative to the reference item (i.e., it may be plucked from the middle of a track), the time difference between landmarks located in a query and those in a matching reference should be constant, thus confirming the validity of a match.
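The index-and-vote scheme described above can be sketched as follows. This is an illustrative reconstruction, not Shazam's actual implementation: the 8+8+6 bit split in the hash and the (time, freq1, freq2, time-difference) tuple format are assumptions for the sake of the example.

```python
from collections import defaultdict

def hash_pair(f1, f2, dt):
    """Pack (frequency of first peak, frequency of second peak, time
    difference) into one integer key. The 8+8+6 bit split is an
    illustrative assumption, not Shazam's actual layout."""
    return ((f1 & 0xFF) << 14) | ((f2 & 0xFF) << 6) | (dt & 0x3F)

def build_index(tracks):
    """tracks: {track_id: [(t, f1, f2, dt), ...]} landmark pairs, with t the
    time of the first peak. Returns a map hash -> [(track_id, t), ...]."""
    index = defaultdict(list)
    for tid, pairs in tracks.items():
        for t, f1, f2, dt in pairs:
            index[hash_pair(f1, f2, dt)].append((tid, t))
    return index

def match_query(index, query_pairs):
    """Vote over (track, reference-time minus query-time) pairs: a true match
    accumulates many votes at one consistent time offset, while chance hash
    collisions scatter across tracks and offsets."""
    votes = defaultdict(int)
    for tq, f1, f2, dt in query_pairs:
        for tid, tr in index.get(hash_pair(f1, f2, dt), []):
            votes[(tid, tr - tq)] += 1
    if not votes:
        return None
    (tid, offset), count = max(votes.items(), key=lambda kv: kv[1])
    return tid, offset, count
```

Note that the second-stage offset check is folded into the voting itself: votes are keyed on the (track, offset) pair, so a winning track must agree on a single relative timing.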

Figure 38.1 illustrates the operation of this algorithm. The top pane shows an audio query, captured by a hand-held recorder in a noisy background during the playback of the target track over low-quality speakers: Note the almost total lack of detail above 3 kHz. The lower pane shows the corresponding fragment of the reference signal, as identified among a dataset of several hundred tracks. The black circles show, for both the query and reference audio, the detected landmark peaks; the black lines link pairs of landmarks that form the hash keys. Thin white lines highlight the matching landmarks that facilitate the matching of this 5.5 s query: Although both excerpts contain over 50 landmark pairs, only 12 of these are commonly identified and thus matched. However, the probability of these 12 landmark pairs co-occurring by chance is vanishingly small, thereby minimizing the risk of false alarms. And because the relatively dense identification of landmarks in the reference audio presents many chances for the query to match, the approach is very successful at matching even short, noisy queries – as shown in this example.


When exposed to the idea of using computers to help find music, people often imagine a system that can identify the intended piece of music represented in a hum, whistle, or short sung snippet. This seems the most natural way of formulating a music query, although it is relevant mainly in known-item searches, not for alternative scenarios that involve looking for an unspecified piece appropriate to a particular situation or listener. When compared to fingerprinting, query by humming (QBH) is far more challenging since sung or hummed queries can vary enormously from person to person, depending on factors such as singing skill, accuracy of memory, and environmental conditions. Even the same query produced by the same person but on different occasions may be in a different key, i.e., based on a different starting pitch, leaving aside other variations. And even the most skillful and accurate hummed query cannot come anywhere close, in waveform terms, to a multi-instrument, polyphonic target recording. Thus, QBH is a long-standing challenge, with a steady history of research progress.

QBH can be decomposed into two main tasks: (1) transcribing a hummed or sung query into a suitable, high-level representation, then (2) matching that representation against a large database of known reference items. The first problem is strongly related to the transcription techniques discussed in Chapter 37, whereas the second is more to do with efficient and tractable comparison metrics. There is also the more basic problem of how to obtain and encode the reference items, which has a strong influence on the representations and comparison techniques used. Much work in QBH has assumed the availability of symbolic descriptions of reference melodies, e.g., as fully-transcribed MIDI encodings of the vocal line. However, it is of much more practical use if the matching can be performed against, or at least automatically reconciled with, untranscribed databases of music audio.

To illustrate the difficulty of this task, Figure 38.2 shows an example of a typical sung query: an amateur rendition of the beginning of a song. Despite a moderate level of background noise, a standard pitch tracker is able to recover a useful description of the time-varying fundamental frequency. When we compare this against a canonical version of the intended melody, however, we see large differences. The grey boxes in the lower pane show the target melody after transposing in time and frequency and scaling the tempo to match the query pitch track as closely as possible. As can be seen, sung queries are often quite inaccurate in pitch and timing, and it can be very difficult to identify the breaks between notes in continuous, slurred contours.


FIGURE 38.2 Example of input to a query-by-singing system. Top pane is a spectrogram of a sung query. Lower pane shows fundamental frequency track (from Yin [3]) overlaid on best-aligned metronomic ground truth (gray patches, showing the range of frequencies that will quantize to the same pitch). Query is from Jang [7].

Early approaches to QBH attempted to transcribe the query into a string of discrete note symbols. To deal with inaccuracies in sung pitches and timing, McNab et al. [11] encoded both the query and the reference database of melodies as a simple sequence of pitch differences, quantized as Up, Down, or Repeat. This representation is invariant to variations in note duration, or to errors in pitch provided the pitch moves in the right direction. Matching was then based on string matching techniques, which can further tolerate small variations (inserted or deleted note events, or incorrect contour labels). Unfortunately, at this level of simplification many reference melodies become similar and thus confusable.
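The Up/Down/Repeat encoding is simple enough to sketch directly (pitches here are assumed to be discrete note numbers, e.g., MIDI semitones, after the difficult step of segmenting the query into notes):

```python
def udr_contour(pitches):
    """Encode a sequence of note pitches (e.g., MIDI note numbers) as a string
    of Up/Down/Repeat symbols: invariant to key, and to any pitch error that
    at least preserves the direction of each interval."""
    return ''.join('R' if b == a else ('U' if b > a else 'D')
                   for a, b in zip(pitches, pitches[1:]))
```

Matching such strings against a reference database can then use standard approximate string-matching (edit-distance) techniques, but, as noted above, many distinct melodies collapse onto the same short contour string.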

Since note segmentation is difficult, a more successful approach is to use the raw pitch contour without segmentation to match against the reference melodies [6]. The entire contour can be moved up or down in pitch to achieve maximum similarity (e.g., smallest squared difference) to the reference contour. This will mean that inaccuracies in sung pitch will degrade the quality of a match, but it provides the advantage that a more accurate query will lead to more accurate results – unlike the crudely-quantized melody representation that threw away much of the query information. To deal with differences in timing, Dynamic Time Warping (introduced in Chapter 24) can be used to find the best alignment between query and reference items, typically with limits on the alignment path slope such as constraining it to lie in the range 0.5 to 2. This approach led to great improvement in retrieval accuracy, but was far more costly to compute than simple string comparison. However, simplified alignment schemes that search for only a single, fixed alignment slope (i.e., a constant tempo warp) or a simple piecewise-linear slope actually perform better than a less constrained, full DTW alignment [21].
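A minimal sketch of such a constant-slope matching scheme follows. The slope set and the contour format (semitones, one value per frame) are illustrative assumptions; for each candidate tempo warp, removing the mean pitch difference is the least-squares-optimal transposition, so key invariance comes for free.

```python
import numpy as np

def contour_distance(query, ref, slopes=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Distance between a sung query pitch contour and a reference contour
    (both in semitones, one value per frame). For each candidate constant
    tempo slope, the reference is linearly resampled to the query length,
    the mean pitch difference (optimal transposition) is removed, and the
    smallest mean squared residual over all slopes is returned."""
    best = np.inf
    n = len(query)
    for s in slopes:
        m = int(round(n * s))         # reference span covered at this slope
        if m < 2 or m > len(ref):
            continue
        idx = np.linspace(0, m - 1, n)
        ref_w = np.interp(idx, np.arange(m), ref[:m])
        diff = query - ref_w
        diff -= diff.mean()           # least-squares-optimal transposition
        best = min(best, float(np.mean(diff ** 2)))
    return best
```

A piecewise-linear variant would simply apply the same search independently to a few segments of the query.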

Since QBH is most valuable for large reference databases, considerable attention has gone into mechanisms for pruning the search space, e.g., by applying a sequence of filtering stages based on simpler comparisons that can safely exclude large portions of the reference dataset prior to the final, expensive alignment comparison. Mainstream database research addresses problems of rapid search in large datasets. For instance, given a metric that obeys the triangle inequality (such as Euclidean distance), techniques such as vantage objects can accelerate searches for nearest neighbors [17]. These approaches work by indexing the reference database in terms of distances to a small number of “vantage point” objects such that items at a particular distance from each point can be quickly retrieved. The distance of the query from these vantage objects is similarly calculated. Since the triangle inequality dictates that the distance between two points cannot be less than the difference between their distances to a vantage point, any reference items that do not have distances to every vantage object that nearly match those found for the query can quickly be excluded, drastically reducing the number of exact comparisons that need to be made.
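The vantage-object pruning described here might be sketched as follows for Euclidean feature vectors (the data layout, with vantage objects drawn from the reference set itself, is an illustrative assumption):

```python
import numpy as np

def build_vantage_index(refs, vantage_ids):
    """Precompute, for every reference item, its distance to each of a small
    set of vantage objects (here simply designated members of refs)."""
    return {i: np.array([np.linalg.norm(refs[i] - refs[v]) for v in vantage_ids])
            for i in range(len(refs))}

def prune(query, refs, vantage_ids, index, radius):
    """Return only candidates that could lie within `radius` of the query.
    By the triangle inequality, |d(q, v) - d(x, v)| <= d(q, x) for any
    vantage object v, so any item whose precomputed vantage distances differ
    from the query's by more than `radius` cannot be a neighbour and is
    excluded without any full comparison."""
    dq = np.array([np.linalg.norm(query - refs[v]) for v in vantage_ids])
    return [i for i, dx in index.items() if np.all(np.abs(dq - dx) <= radius)]
```

The surviving candidates are then compared to the query with the full, expensive alignment metric; the guarantee is one-sided, so no true neighbour is ever pruned.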

A similar idea is behind the indexing scheme known as locality sensitive hashing (LSH) [1]. In this case, however, the distances are quantized and the reference items are stored in a hash table indexed by these quantized distances. By a careful combination of intersections and unions of hashes, the number of distances and the size of the quantized bins can be determined systematically to retrieve, with a predefined confidence, all items that lie within a certain radius of the query. Owing to the hash table, the query is very fast and query time is almost independent of reference database size. Ryynanen & Klapuri [14] use this in their QBH scheme, allowing them to include pitch contours based on every possible starting note in their reference database instead of having to assume that queries will start at the beginning of the reference melody. They further use automatic melody transcription (as discussed in Chapter 37) to build a reference database directly from existing music recordings, and report good accuracy in matching real user queries.
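A minimal random-projection variant of LSH can be sketched as follows. The table count, bits per table, and bucket width are illustrative choices, not those used in [1] or [14]; tuning them trades retrieval probability against candidate-list size.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def make_tables(dim, n_tables=8, n_bits=4, width=1.0):
    """Each table hashes a vector by quantizing a few random projections;
    nearby vectors agree on every bit of at least one table with high
    probability, while distant vectors almost never do."""
    return [(rng.standard_normal((n_bits, dim)), rng.uniform(0, width, n_bits))
            for _ in range(n_tables)]

def lsh_key(x, table, width=1.0):
    proj, offsets = table
    return tuple(np.floor((proj @ x + offsets) / width).astype(int))

def build(items, tables):
    """items: {name: vector}. One hash table (bucket dict) per LSH table."""
    dbs = [defaultdict(list) for _ in tables]
    for name, x in items.items():
        for db, t in zip(dbs, tables):
            db[lsh_key(x, t)].append(name)
    return dbs

def lookup(x, dbs, tables):
    """Union of all colliding buckets: the candidate near neighbours."""
    out = set()
    for db, t in zip(dbs, tables):
        out.update(db.get(lsh_key(x, t), []))
    return out
```

Because lookup only touches one bucket per table, query time is essentially independent of the database size, which is what makes indexing every possible starting note feasible.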

A different approach to QBH involves matching sung queries to other queries instead of reference descriptions [12]. Given a database of sung queries that have each been manually associated with the intended target track, this effectively bypasses problems of transcription and extracting the melody from music audio recordings. Matching must still rely on some kind of alignment, but given that the items being matched are of the same nature (amateur sung queries), it can be an easier problem. Also, the most popular excerpts to hum are likely to be the ones best represented in the database. However, it does require the work of manually labeling the initial queries, and can only match popular, previously-seen queries.


Musicians are fond of taking an existing piece of music and putting their own “interpretation” on it. Such alternate versions range from close imitations of the original (e.g., by tribute bands, who strive to reproduce the performances of their idols) to radical reinterpretations (along the lines of John Coltrane's 14-minute, modal version of the Rodgers and Hammerstein show tune “My Favorite Things”). In various scenarios it is useful to be able to match these so-called “cover” versions of original pieces, for instance to automatically identify pieces in a live performance with their studio-recorded equivalents. It is also an interesting problem in its own right, since it focuses attention on the underlying music rather than the particular style or instrumentation used in its performance.

Cover song matching differs from fingerprinting because the underlying audio is not exactly the same: The detailed landmarks described in Section 38.2 are able to distinguish even alternate takes of the same piece by the same musicians, circumstances in which a human listener might not notice any differences. This task is also quite different from QBH, inasmuch as the query is much richer than a single hummed or sung melody, although in some cases the lead melody line may be the only point of similarity shared with the original. Other points of similarity between cover and original may include the chord progression (i.e., the combined ‘color’ of simultaneous notes) and the lyrics. Factors that frequently vary are the key (i.e., the absolute pitch basis), the rhythm and tempo, the instrumentation, and other aspects conveying music style.

Cover song matching has been the subject of several formal evaluations within the MIREX music information retrieval evaluation activity [4]. Given that sequences of notes (melodies) and their combinations (harmonies and chords) are most often preserved in cover versions, successful approaches have been based on representations that effectively capture these aspects independent of the specific instruments used – namely, the chroma vectors introduced in Section 37.5, which were originally devised for chord recognition. Chroma vectors describe each short time window of audio by sorting the energy of spectral peaks (individual sinusoid harmonics) into 12 bins representing the 12 distinct semitones of the (western) musical scale. The simultaneous notes at approximately equal intensity that constitute a chord have a signature pattern in this representation, as does a single melody note whose intensity dominates the accompaniment. Thus, chroma representations constitute a convenient compromise between melodic and harmonic description that is very appropriate for cover song identification.
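A bare-bones chroma computation along these lines might look as follows, folding a set of detected spectral peaks into 12 pitch-class bins. The reference tuning and the peak-list input format are illustrative assumptions; practical systems distribute energy across neighbouring bins and normalize more carefully.

```python
import numpy as np

def chroma_from_peaks(mags, freqs, fref=261.63):
    """Fold detected spectral peaks into a 12-bin chroma vector.
    mags, freqs: magnitudes and frequencies (Hz) of the peaks; fref is the
    reference frequency mapped to pitch class C (middle C here, an
    illustrative tuning assumption). The modulo-12 wrap discards octave
    information, which is what makes the representation compact."""
    chroma = np.zeros(12)
    for m, f in zip(mags, freqs):
        if f <= 0:
            continue
        semis = 12.0 * np.log2(f / fref)   # semitones above the reference C
        chroma[int(round(semis)) % 12] += m
    return chroma
```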

Figure 38.3 illustrates chroma representations of the beginnings of two versions of “Let It Be” – the original by the Beatles, and a cover version by Nick Cave with a markedly different instrumentation and vocal line. While the chroma representations are clearly different, they share many features including the signature F-E-D-C falling cadence at the end of the introduction, highlighted by the thick outlines in the figure. Chroma are calculated on beat-length time windows (from a preceding beat tracking stage), which means that the patterns remain (mostly) registered, even though the performances differ in tempo by about 4%. The bottom pane shows the result of cross-correlating these two chroma representations over all relative timing skews up to ±80 beats, and over all 12 possible relative transpositions. There is a clear peak at zero transposition and around zero beats relative timing. The 16-beat phrase sequence also shows up as periodic peaks visible in this correlation. The peak values in this kind of whole-track beat-chroma cross-correlation were used for a successful and efficient cover song matching scheme by Ellis & Poliner [5].
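The whole-track comparison can be sketched as an exhaustive search over the 12 chroma rotations and a range of beat skews; this direct, unoptimized version is for illustration only (production systems would use correlation tricks for speed, and the equal-length assumption on the two matrices is a simplification):

```python
import numpy as np

def chroma_xcorr(A, B, max_skew=80):
    """Cross-correlate two (12 x n_beats) beat-chroma matrices over all 12
    row rotations (relative transpositions) and timing skews up to
    +/- max_skew beats. Returns (rotation, skew, score) of the peak, with
    the overlap-normalized inner product as the score."""
    best = (0, 0, -np.inf)
    for rot in range(12):
        Br = np.roll(B, rot, axis=0)       # transpose B by `rot` semitones
        for skew in range(-max_skew, max_skew + 1):
            if skew >= 0:
                a, b = A[:, skew:], Br[:, :A.shape[1] - skew]
            else:
                a, b = A[:, :skew], Br[:, -skew:]
            n = min(a.shape[1], b.shape[1])
            if n <= 0:
                continue
            v = float(np.sum(a[:, :n] * b[:, :n])) / n
            if v > best[2]:
                best = (rot, skew, v)
    return best
```

The peak score over all (rotation, skew) combinations then serves as the track-to-track match strength.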


FIGURE 38.3 Beat-chroma representations of “Let It Be” by the Beatles (top pane), and as covered by Nick Cave (middle pane). Bottom pane is the cross-correlation of the two matrices, at all 12 relative transpositions, for up to 80 beats timing skew.

Cross-correlation of beat-synchronous chroma matrices offers a quick way to compare tracks and can find matches even if only a relatively small portion of the tracks match, since chance correlations tend to be small. However, in order to capture the full range of possible alignments, some kind of DTW is required. Serra et al. [15] use so-called local alignment in their chroma-based cover song system, which has shown the best performance in several MIREX evaluations. In local alignment, a full grid of costs is calculated just as in conventional DTW, but the scores can be positive (for good matches) and negative (for weak local matches, or for steps indicating non-linear paths). The accumulated score, however, is set to zero everywhere it would otherwise have become negative, and any such points have no traceback, i.e., they become new starting points for local regions of matching. Such a scheme can effectively find multiple, disjoint regions of good match within larger stretches of signal. The Serra et al. system also includes several other optimizations shown to be beneficial in the cover matching task.
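The zero-clamped accumulation at the heart of local alignment can be sketched as a Smith-Waterman-style recurrence over a frame-to-frame score matrix (the gap penalty here is an illustrative choice; the system of Serra et al. [15] includes further refinements not shown):

```python
import numpy as np

def local_alignment_score(S, gap=0.5):
    """Smith-Waterman-style local alignment over a frame-similarity matrix S
    (positive entries for good local matches, negative for poor ones).
    Accumulated scores are clamped at zero, so each zero cell restarts a new
    local matching region rather than carrying earlier penalties forward;
    the best score anywhere in the grid is returned."""
    n, m = S.shape
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(0.0,
                          D[i - 1, j - 1] + S[i - 1, j - 1],  # diagonal match
                          D[i - 1, j] - gap,                  # skip a ref frame
                          D[i, j - 1] - gap)                  # skip a query frame
    return float(D.max())
```

Taking the grid maximum (rather than the final corner, as in conventional DTW) is what allows multiple disjoint regions of good match to be found within longer tracks.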


So far, we have only discussed retrieving music audio by using audio queries. But one of the most natural ways of formulating a query is to use words. Although, as mentioned above, searching for music may be qualitatively different from searching for documents on the web, there are still plenty of scenarios in which associating music with relevant keywords would be desirable. For instance, sorting music into genre categories, the traditional way of organizing large collections of music in record stores and elsewhere, is one version of labeling music with a particular set of terms. Also, many web sites allow users to ‘tag’ music with descriptive keywords, for instance to help them search for particular subsets. It is an attractive problem to see how accurately a machine can assign such tags.

By viewing each word or tag as a discrete category, music-to-term mapping is simply a conventional classification problem – either a binary task (is this term relevant to this music or not?), or an N-way task (which of these N genre tags best describes this music?). As such, it is a good fit to the statistical pattern classification techniques described in Chapter 9. The main questions are:

  • Granularity: What is the appropriate time scale over which to perform classification – entire tracks, some kind of sub-track division (such as 10 s fragments), or larger collections of tracks such as albums or the entire output of an artist?
  • Features and representation: As with any classification task, the choice of features, and the suitability of those features for the particular classifier being used, have a crucial influence on eventual performance. Music is very different from speech – in particular, speech features such as MFCCs are generally designed to suppress pitch information, but pitch is essential to music. At the same time, MFCCs have proven very successful in many music audio classification tasks including genre and artist classification [8]. More musically-relevant measures, such as chroma and rhythm-related features, have sometimes shown value in more specialized tasks. Classifying sequences of short frames requires some summarization of features calculated from the frames (e.g., the 25 ms frames typical of speech, or possibly beat-length segments): mean and covariance are commonly used, although more sophisticated descriptions based on autoregression of features have also been shown to be beneficial [13].
  • Classifiers: Most popular general-purpose classifiers have been applied to music audio classification. Much early work used Gaussian Mixture Models (GMMs) [18], although more recently Support Vector Machines, based for instance on Mahalanobis distance between feature covariance matrix elements, have been more successful [10]. Hidden Markov models, surprisingly, have rarely shown much benefit over simpler classifiers that ignore sequential structure in the audio. It is obvious that music contains very rich information in its temporal structure, so this must presumably be attributed to limitations in the way that this structure has been captured or modeled.
  • Categories and data: There are many different sets of categories (or tags) that could be addressed by these methods. Several have been investigated within MIREX, including genre (and sub-genres), artist identification, and musical mood. For any such task, however, the results are naturally strongly dependent on the type, quantity and quality of the training data used. In fact, genres, which are primarily a marketing construct of record companies, turn out to be poorly distinguished by humans: Meng et al. [13] found their subjects averaged below 60% accuracy assigning 30 s clips into eleven popular music categories. Other tasks, such as music mood, suffer similar problems of quality and repeatability of ground-truth training labels. In general, researcher-defined categories are often suspect, since they may embody prior biases about the nature of the problem or the interests of users. This has led to a shift towards tags directly contributed by users, as described below.
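For instance, the mean-and-covariance summarization mentioned under “Features and representation” reduces a variable-length sequence of frame features to a single fixed-length vector suitable for a standard classifier; a minimal sketch:

```python
import numpy as np

def summarize(frames):
    """Summarize an (n_frames x n_dims) sequence of frame features (e.g.,
    MFCCs) by concatenating the per-dimension mean with the upper triangle
    of the covariance matrix, giving one fixed-length vector per track."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)       # (n_dims x n_dims)
    iu = np.triu_indices(frames.shape[1])    # covariance is symmetric
    return np.concatenate([mu, cov[iu]])
```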


    FIGURE 38.4 User interface of the “MajorMiner” web-based game for collecting music tags. The user listens to a 10 s music clip, then enters tags as free text. Points are scored on the basis of whether other players submit the same tags.

One way to collect ground truth is via web-based games, inspired by the image labeling “ESP” game of von Ahn & Dabbish [19]. In this popular technique, a web site presents any casual web visitor with some complex content (an image in the ESP game, or a music clip for our purposes), then invites them to confirm or enter relevant keywords. Typically, some competitive or game elements are introduced as a kind of incentive; when the game rewards tags that agree with those given by other users, this also encourages accurate and reproducible tagging. Mandel & Ellis [9] describe one such game whose primary user interface is shown in Figure 38.4. Players listen to 10 s clips of music audio and tag them with as many free-form tags as they wish: points are scored only for tags that are confirmed by other players who hear or have heard the same clip. The scoring rewards the first person to enter a particular tag (if it is subsequently validated by other listeners), and gives no points for frequently-entered clips, providing a balance between novelty and consensus. Although users were given no particular instruction concerning the kinds of tags to use, the most popular tags obtained concerned more objective aspects such as instrumentation (e.g., “drums”, “male”), with genre categories (“pop”, “jazz”) occurring less frequently.


FIGURE 38.5 Classifiers trained from the MajorMiner tag data applied to all 10 s segments from an entire track (“Liberation” by Outkast). The top pane shows a Mel-scale spectrogram of the entire track, along with manual annotations showing the principal changes in instrumentation. The lower pane shows the output of a number of relevant classifiers. Several segments from this track were included in the segments heard by listeners, and the most popular manually-assigned tags are outlined with border width indicating their popularity. Classifier outputs agree with user tags, but also generalize to the remainder of the audio.

The outcome of this game is a large collection of spontaneous user-generated labels for relatively short musical snippets. By ensuring that each clip is played to several different users, good notions of reliability and agreement for each tag are obtained. Then, all the clips bearing a given tag can be used, along with a selection of counter-examples, to train a classifier able to automatically tag previously unseen music clips. Figure 38.5 shows the output of a number of these classifiers over an entire track broken into 10 s segments. A random subset of six of these segments were presented to listeners in the game, and their most popular responses are also indicated in the figure, showing good agreement with the automatic classifiers.


The most interesting and demanding version of the music retrieval task is to find music “similar” to a given example or collection – for instance, to recommend new music to listeners wishing to expand their collections, or to build a coherent playlist of songs for a personalized music stream. Many of the techniques from music audio classification are relevant here: For instance, if genre classification is achieved by learning boundaries between MFCC distribution descriptions, music similarity might be simply estimated by the Kullback-Leibler divergence between those distributions [10], although clearly this can only account for a small part of the seemingly rich subjective phenomenon of music similarity.
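When each track's MFCC distribution is modeled as a single Gaussian, this divergence has a closed form; the following sketch computes KL(N0 || N1), with the symmetrized sum in both directions being the usual (non-metric) similarity score (the single full-covariance Gaussian per track is an illustrative modeling assumption):

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N0 || N1) between two multivariate
    Gaussians (e.g., per-track MFCC models). KL is asymmetric and violates
    the triangle inequality, so the symmetrized sum
    KL(N0 || N1) + KL(N1 || N0) is commonly used as a similarity score."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))
```

Note that because this distance is not a proper metric, indexing tricks such as the vantage-object scheme of Section 38.3 cannot be applied to it directly.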

Adding to the difficulty of these tasks are the problems of ground truth and evaluation. If we wanted to train a classifier to decide which tracks were similar to or different from a particular example, where would we get labels for positive and negative examples? And once we build such systems, how do we decide which system makes the best selections from a given collection? Ultimately, the only meaningful source of judgments is human listeners, and so for the MIREX 2006 Audio Music Similarity evaluation, human judges were recruited to listen to up to 30 candidate matches (drawn from a corpus of 5000 tracks) for each of 60 queries. The candidate matches were the pooled, top-5 results of the six algorithms submitted for evaluation. Judges rated the similarity of each candidate to the original query, and the algorithms were ranked on their average similarity scores. Automatic systems still have some way to go, with fewer than one quarter of candidates being judged “very similar” in 2006, increasing to around half for the best systems in the 2009 evaluation.

While most similarity systems use some kind of direct comparison between audio-derived features, an alternative approach is to apply tag- or concept-related classifiers as described in the previous section, then use the correlation across the set of classifier outputs as a measure of similarity. This approach was used by Turnbull et al. [16] in the 2007 MIREX music similarity evaluation, with results that were statistically equivalent to the other top systems. The attraction of this approach is that it can use user-labeled training data, and it leverages the individual classifiers to distinguish between aspects of the audio data that are relevant to human judgments (for a variety of categories) and those that should be ignored.
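A minimal sketch of this tag-vector similarity follows; the tag names and classifier posteriors are invented. Each track is reduced to a vector of per-tag classifier outputs, and similarity is the correlation between those vectors:

```python
# Tag-based similarity: describe each track by a vector of classifier
# posteriors (one per tag) and compare tracks by Pearson correlation.
import numpy as np

tags = ["rock", "jazz", "vocal", "fast", "distorted-guitar"]
track_x = np.array([0.9, 0.1, 0.7, 0.8, 0.9])   # classifier posteriors
track_y = np.array([0.8, 0.2, 0.6, 0.9, 0.7])
track_z = np.array([0.1, 0.9, 0.2, 0.3, 0.0])

def tag_similarity(u, v):
    """Pearson correlation between two tag-posterior vectors."""
    return float(np.corrcoef(u, v)[0, 1])

print(tag_similarity(track_x, track_y))  # high: similar tag profiles
print(tag_similarity(track_x, track_z))  # negative: opposed tag profiles
```

Because the comparison happens in a space of human-meaningful tags rather than raw features, dimensions irrelevant to listeners are filtered out by construction.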


Music retrieval is a broad area whose significance has rapidly increased, and it is thus the topic of a great deal of current research. In this chapter we have looked at various incarnations of this problem, with different forms of queries (audio, textual, contextual) and different kinds of desired results (specific items or novel recommendations). While the existence of regular, formal evaluations has had a strong positive influence on the field, many researchers feel that current techniques do not really capture the musical “core” of these audio signals, and look forward to future developments able to reach further into the rich and deep structure of musical sounds.


  1. 38.1 In the example of Figure 38.1, there are about 50 landmark pairs shown for each 5-second sound excerpt. Twelve of them are shared, which is the basis for matching the two fragments.
    1. (a) If the landmark pairs are represented by a 20-bit hash, and a given reference track can be considered a uniform collection of these hashes, what is the probability that an unrelated track will match at this level? You can assume the reference track is 200 seconds long, and that hashes occur at an average rate of 10 per second.
    2. (b) Assuming this level of hash consistency (i.e., that only 12/50ths of reference hashes will appear in the query), what length of query is required to achieve a false-match rate below 1% for a database of 1 million tracks?
  2. 38.2 Consider three melodies:
    • The first line and a half of the US National Anthem:

      “Oh, say can you see by the dawn's early light what so proudly we hailed ...”

    • The first three lines of Yesterday by the Beatles:

      “Yesterday all my troubles seemed so far away, now I need a place to hide away ...”

    • The first two lines of A Little Help From My Friends by the Beatles:

      “What would you say if I sang out of tune, would you stand up and walk out on me...”

    1. (a) Transcribe each melody into a sequence of Up / Down / Repeat melodic contour symbols.
    2. (b) Based on the length of the shortest sequence, calculate the Hamming distance between each pair of melodies (i.e., the number of symbol positions in which the symbols differ). Which melodies have the smallest distance? Does this seem reasonable?
  3. 38.3 In the bottom panel of Figure 38.3, the dark peak at 0 beats time skew and 0 semitones transposition indicates the principal alignment between the two versions of “Let It Be”. But we also see other local maxima in the cross-correlation.
    1. (a) Along the 0 semitone transposition row, we see a series of local maxima, the first two being just inside ±20 beats. To what do these correspond?
    2. (b) We also see peaks in other rows, including ±5 semitones transposition. What do these peaks indicate? Comment on their timing relative to the main peak.
  4. 38.4 Automatic tagging of music audio typically employs statistics of the audio gathered over some window, from a few seconds up to entire tracks and beyond. Discuss the appropriate timescale for different kinds of labels (instruments, moods, genres). How can we choose the appropriate timescale for a particular classification task?


  1. Andoni, A. and Indyk, P., “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Communications of the ACM, 51: 117-122, 2008.
  2. Celma, O. and Serra, X., “Foafing the music: Bridging the semantic gap in music recommendation,” Web Semantics: Science, Services and Agents on the World Wide Web, 6: 250-256, 2008.
  3. de Cheveigné, A. and Kawahara, H., “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., 111: 1917-1930, 2002.
  4. Downie, J., “The music information retrieval evaluation exchange (2005-2007): A window into music information retrieval research,” Acoustical Science and Technology, 29: 247-255, 2008.
  5. Ellis, D. and Poliner, G., “Identifying cover songs with chroma features and dynamic programming beat tracking,” in Proc. ICASSP, pages IV-1429-1432, Hawai'i, 2007.
  6. Jang, J. and Lee, H., “Hierarchical filtering method for content-based music retrieval via acoustic input,” in Proceedings of the ninth ACM international conference on Multimedia, pages 401-410. ACM New York, NY, USA, 2001.
  7. Jang, J., “QBSH: A corpus for designing QBSH (query by singing/humming) systems,” 2006.
  8. Logan, B., “Mel frequency cepstral coefficients for music modeling,” in International Symposium on Music Information Retrieval, 2000.
  9. Mandel, M. and Ellis, D., “A web-based game for collecting music metadata,” Journal of New Music Research, 37: 151-165, 2008.
  10. Mandel, M. and Ellis, D., “Song-level features and support vector machines for music classification,” in Proc. International Conference on Music Information Retrieval ISMIR, pages 594-599, London, Sep 2005.
  11. McNab, R., Smith, L., Witten, I., Henderson, C., and Cunningham, S., “Towards the digital music library: Tune retrieval from acoustic input,” in Proceedings of the first ACM international conference on Digital libraries, page 18. ACM, 1996.
  12. Melodis Corp., “Midomi: Search for music using your voice,” 2007.
  13. Meng, A., Ahrendt, P., Larsen, J., and Hansen, L. K., “Temporal feature integration for music genre classification,” IEEE Trans. on Audio, Speech, and Lang. Proc., 15: 1654-1664, 2007.
  14. Ryynänen, M. and Klapuri, A., “Query by humming of MIDI and audio using locality sensitive hashing,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2249-2252, 2008.
  15. Serra, J., Gomez, E., Herrera, P., and Serra, X., “Chroma binary similarity and local alignment applied to cover song identification,” IEEE Trans. on Audio, Speech and Language Processing, 16: 1138-1152, 2008.
  16. Turnbull, D., Barrington, L., Torres, D., and Lanckriet, G., “Semantic annotation and retrieval of music and sound effects,” IEEE Trans. on Audio, Speech, and Lang. Proc., 16: 467-476, 2008.
  17. Typke, R., Giannopoulos, P., Veltkamp, R., Wiering, F., and Van Oostrum, R., “Using transportation distances for measuring melodic similarity,” in Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), pages 107-114. Citeseer, 2003.
  18. Tzanetakis, G. and Cook, P., “Musical genre classification of audio signals,” IEEE Trans. Speech and Audio Proc., 10: 293-302, 2002.
  19. Von Ahn, L. and Dabbish, L., “Labeling images with a computer game,” in Proceedings of the SIGCHI conference on Human factors in computing systems, pages 319-326. ACM New York, NY, USA, 2004.
  20. Wang, A., “An industrial strength audio search algorithm,” in Proc. Int. Conf on Music Info. Retrieval ISMIR-03, 2003.
  21. Wu, X., Li, M., Liu, J., Yang, J., and Yan, Y., “A top-down approach to melody match in pitch contour for query by humming,” in Proc of International Conference of Chinese Spoken Language Processing. Citeseer, 2006.
  22. Yoshii, K., Goto, M., Komatani, K., Ogata, T., and Okuno, H., “Hybrid collaborative and content-based music recommendation using probabilistic model with latent user preferences,” in Proceedings of the International Conference on Music Information Retrieval ISMIR, 2006.