MusicBrainz.org

Instead of continuing to work with the sample data that comes with Solr, we're going to use a large database of music metadata from the MusicBrainz project at http://musicbrainz.org. The data is free and is submitted by a large community of users. One way MusicBrainz offers this data is as a large SQL file for import into a PostgreSQL database. To make it easier for you to experiment with this data, the online code supplement to this book includes it in formats that can be imported into Solr directly. Alternatively, if you already have your own data, we recommend starting with that and using this book as a guide.

The MusicBrainz database is highly relational, which makes it an excellent instructional dataset for discussing Solr schema choices. The full MusicBrainz schema is quite complex, and it would be a distraction to go over even half of it. Instead, we are going to use a subset of it, expressed in a way that maps straightforwardly to the user interface seen on the MusicBrainz website. Each of the tables depicted in the following diagram can be easily constructed through SQL subqueries or views from the actual MusicBrainz tables:

[Diagram: MusicBrainz.org]

To describe the major tables shown in the diagram, we'll use some examples from the band The Smashing Pumpkins:

  • The Smashing Pumpkins is an artist with a type of group (a band). Some artists (groups in particular) have members who are themselves artists, of type person, so this is a self-referential relationship. The Smashing Pumpkins band has Billy Corgan, Jimmy Chamberlin, and others as members.
  • An artist is attributed as the creator of a release. The most common type of release is an album but there are also singles, EPs, compilations, and others. Furthermore, releases have a status property that is either official, promotional, or bootleg. A popular official album from The Smashing Pumpkins is titled Siamese Dream.
  • A release can be published at various times and places, each of which MusicBrainz calls an event (a release-event). Each event contains the date, country, music label, and format (such as CD or tape).
  • A release is composed of one or more tracks. Siamese Dream has 13 tracks starting with Cherub Rock and ending with Luna. Note that a track is a part of just one release and so it is not synonymous with a song. For example, the song Cherub Rock is not only a track on this release but also on the Greatest Hits release, as well as quite a few others in the database. A track has a PUID (Portable Unique Identifier), a quasi-identifier produced by audio fingerprinting technology and based on the actual sound of the track. It's not foolproof as there are collisions, but these are rare. Another interesting bit of data MusicBrainz stores is the PUID lookup count, which is how often the PUID has been requested from their servers, making it a good measure of popularity.
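
The relationships described in this list can be sketched as plain Python data classes. This is only an illustration: the class and field names (Artist, Release, ReleaseEvent, Track, members, puid_lookup_count, and so on) are assumptions chosen for readability, not the actual MusicBrainz table or column names, and only two of the 13 tracks are filled in.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Artist:
    name: str
    type: str  # "group" or "person"
    # Self-referential: a group's members are themselves artists.
    members: List["Artist"] = field(default_factory=list)


@dataclass
class Track:
    name: str
    position: int
    puid: Optional[str] = None   # audio-fingerprint quasi-identifier, if known
    puid_lookup_count: int = 0   # how often the PUID was looked up


@dataclass
class ReleaseEvent:
    date: str
    country: str
    label: str
    format: str  # for example "CD" or "tape"


@dataclass
class Release:
    title: str
    artist: Artist
    type: str    # "album", "single", "EP", "compilation", ...
    status: str  # "official", "promotional", or "bootleg"
    events: List[ReleaseEvent] = field(default_factory=list)
    tracks: List[Track] = field(default_factory=list)


# Example data taken from the text above.
billy = Artist("Billy Corgan", "person")
jimmy = Artist("Jimmy Chamberlin", "person")
pumpkins = Artist("The Smashing Pumpkins", "group", members=[billy, jimmy])

siamese_dream = Release(
    title="Siamese Dream",
    artist=pumpkins,
    type="album",
    status="official",
    # Only the first and last of the 13 tracks are listed here.
    tracks=[Track("Cherub Rock", 1), Track("Luna", 13)],
)

print([m.name for m in siamese_dream.artist.members])
```

Because a Track belongs to exactly one Release in this sketch, the song Cherub Rock on another release would be a separate Track object that happens to share the same name.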

Note that we'll occasionally use the word entity here in the data modeling sense: it's basically a type of thing represented by the data. Artist, release, event, and track are all entity types with respect to MusicBrainz. In a relational database, most tables correspond to an entity type and the others serve to relate them or to provide for multiple values. In Solr, each document will have a primary entity type and may contain other entities as part of it, too.
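
To make that last point concrete, here is a rough sketch of how a release, as the primary entity, might be flattened into a single Solr-style document that also carries data from its artist and tracks. The field names (id, type, r_name, r_status, a_name, t_name), the id scheme, and the key value 1 are hypothetical and chosen only for illustration; they are not a schema defined in this text, and no Solr client call is made.

```python
import json


def release_to_doc(release_key, release):
    """Flatten one release (the primary entity) into a flat document.

    Data from related entities (artist, tracks) is embedded in the same
    document; a list value stands in for a multi-valued field.
    """
    return {
        "id": f"Release:{release_key}",  # hypothetical id: entity type + key
        "type": "Release",               # the document's primary entity type
        "r_name": release["title"],
        "r_status": release["status"],
        "a_name": release["artist"],     # artist data carried on the release
        "t_name": list(release["tracks"]),  # one value per track
    }


# Example data from the text; only two of the 13 tracks are listed.
siamese_dream = {
    "title": "Siamese Dream",
    "status": "official",
    "artist": "The Smashing Pumpkins",
    "tracks": ["Cherub Rock", "Luna"],
}

doc = release_to_doc(1, siamese_dream)  # 1 is a placeholder key
print(json.dumps(doc, indent=2))
```

Prefixing the id with the entity type is one possible way to keep documents of different primary entity types distinct within a single index; other designs are equally plausible.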
