Sending CSV-formatted data to Solr

If you have data in CSV format, or if it is more convenient for you to produce CSV than XML or JSON, then you may prefer the CSV option. Solr's CSV support is fairly flexible. You won't be able to specify an index-time boost, but that's an uncommon need.

Tip

CSV is the only format that Solr supports for round-tripping data. As such, you can query for CSV-formatted data that is suitable to be added right back into Solr (for stored fields only, of course). The XML and JSON query output formats are structured differently than their input formats, so they don't count.
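
For instance, a round trip might look something like the following sketch, assuming the fields listed are stored; the query, row count, and field list are only illustrative:

# Pull stored fields out as CSV, then post the very same file straight back in.
curl 'http://localhost:8983/solr/select?q=*:*&rows=100&wt=csv&fl=id,type,t_name' > /tmp/roundtrip.csv
curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @/tmp/roundtrip.csv -H 'Content-type: text/plain; charset=utf-8'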

To get CSV data out of a local PostgreSQL database for the MusicBrainz tracks, we ran this command:

psql -U postgres -d musicbrainz_db -c "COPY (
select 'Track:' || t.id as id, 'Track' as type, t.name as t_name, t.length/1000 as t_duration, a.id as t_a_id, a.name as t_a_name, albumjoin.sequence as t_num, r.id as t_r_id, r.name as t_r_name, array_to_string(r.attributes,' ') as t_r_attributes, albummeta.tracks as t_r_tracks 
from (track t inner join albumjoin on t.id = albumjoin.track 
 inner join album r on albumjoin.album = r.id left join albummeta on albumjoin.album = albummeta.id) inner join artist a on t.artist = a.id 
) to '/tmp/mb_tracks.csv' CSV HEADER"

And it generated about 7 million lines of output that look like this (the first three lines are shown):

id,type,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks
Track:183326,Track,In the Arms of Sleep,254,11650,The Smashing Pumpkins,4,22471,Mellon Collie and the Infinite Sadness (disc 2: Twilight to Starlight),0 1 100,14
Track:183328,Track,Tales of a Scorched Earth,228,11650,The Smashing Pumpkins,6,22471,Mellon Collie and the Infinite Sadness (disc 2: Twilight to Starlight),0 1 100,14

This CSV file is provided with the code supplement to the book. To get Solr to import the CSV file, type this at the command line:

curl http://localhost:8983/solr/update/csv -F f.t_r_attributes.split=true -F f.t_r_attributes.separator=' ' -F overwrite=false -F commit=true -F stream.file=/tmp/mb_tracks.csv

The CSV options were specified via form values (-F) here; you can alternatively encode them into the query portion of the URL—it doesn't matter.
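
For example, the same import expressed entirely as URL query parameters might look like this; note that the space separator must be URL encoded (as %20):

curl 'http://localhost:8983/solr/update/csv?f.t_r_attributes.split=true&f.t_r_attributes.separator=%20&overwrite=false&commit=true&stream.file=/tmp/mb_tracks.csv'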

Tip

Consider the Unix mkfifo command

When we actually did this, we had PostgreSQL on one machine and Solr on the other. We used the Unix mkfifo command to create an in-memory data pipe at /tmp/mb_tracks.csv. This way, we didn't have to actually generate a huge CSV file; we could essentially stream it directly from PostgreSQL into Solr. Details on this approach and PostgreSQL are beyond the scope of this book.
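
A rough sketch of that approach, assuming everything runs on the Solr machine and psql can reach the database over the network (dbhost is a placeholder), might look like this:

mkfifo /tmp/mb_tracks.csv
# Run the same COPY as before, but write to standard output, redirected into the pipe.
psql -U postgres -h dbhost -d musicbrainz_db -c "COPY ( ... same SELECT as above ... ) TO STDOUT CSV HEADER" > /tmp/mb_tracks.csv &
# Solr reads from the pipe as if it were an ordinary file; the full CSV never touches disk.
curl http://localhost:8983/solr/update/csv -F f.t_r_attributes.split=true -F f.t_r_attributes.separator=' ' -F overwrite=false -F commit=true -F stream.file=/tmp/mb_tracks.csv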

Configuration options

The following list describes each configuration option. For the MusicBrainz track CSV file, the defaults were used, with the exception of specifying how to parse the multivalued t_r_attributes field and disabling unique key processing for performance.

  • separator: This is the character that separates each value on a line. It defaults to a comma.

    Tip

    If you're using curl and need to specify a tab character or some other character that isn't visible (other than a space), then the easiest way to do this is to specify the parameter on the URL as a query parameter instead of with -F. Remember to URL encode it, for example, …/update/csv?separator=%09 -F … and so on.

  • header: Set this to true if the first line lists the field names (the default).
  • fieldnames: If the first line doesn't have the field names, you'll have to use this instead to indicate what they are. They are comma-separated; if no name is specified for a column, then its data is skipped (see the combined sketch after this list).
  • skip: This specifies which fields in the CSV file not to import.
  • skipLines: This is the number of lines to skip in the input file. It defaults to 0.
  • trim: If this is true, it removes the leading and trailing white space as a final step, even if quoting is used to explicitly specify a space. It defaults to false. Solr already does an initial pass trim, but quoting may leave spaces.
  • encapsulator: This character is used to encapsulate (that is, surround or quote) values in order to preserve the field separator as part of a field value instead of mistakenly parsing it as the next field. The character itself is escaped by doubling it. It defaults to the double quote, unless escape is specified. Consider the following example:
    11604, foo, "The ""second"" word is quoted.", bar
  • escape: If this character is found in the input text, then the next character is taken literally in place of this escape character, and it isn't otherwise treated specially by the file's syntax. For example, if the escape character is a backslash, a comma can be kept inside a value like this:
    11604, foo, The second\, word is followed by a comma., bar
  • keepEmpty: This specifies whether blank (zero length) fields should be indexed as such or omitted. It defaults to false.
  • literal: This adds a fixed field name and value to all the documents. For example, literal.datasource=artists adds the datasource field with the value artists to every document.
  • rowid: This adds a field to every document; the name you pass in as this parameter's value becomes the field name, and the current line number becomes its value. This is very helpful when there is no unique ID for each row, and also for debugging purposes. It is optional and defaults to null (no field is added).
  • rowidOffset: This works with the rowid parameter; this integer value will be added to the actual rowid / current line number before adding it to the specified rowid field in the index.
  • overwrite: This indicates whether to enforce the unique key constraint of the schema by overwriting existing documents with the same ID. It defaults to true. Disable this to increase performance, if you are sure you are passing new documents.
  • split: This is a field-specific option used to split what would normally be one value into multiple values. Another set of CSV configuration options (separator, and so on) can be specified for this field to instruct Solr on how to do that. See the earlier MusicBrainz tracks example for how this is used.
  • map: This is another field-specific option used to replace input values with others; it can be used to remove values too. The value should include a colon, which separates the left side (the value to be replaced) from the right side (its replacement). If we were to use this feature on the tracks of the MusicBrainz data, then it could be used to map the numeric codes in t_r_attributes to more meaningful values. Here's an example of such an attempt:
    -F keepEmpty=false -F f.t_r_attributes.map=0:  -F f.t_r_attributes.map=1:Album -F f.t_r_attributes.map=2:Single

    This causes 0 to be removed, because it seems to be useless data, as nearly all tracks have it, and we map 1 to Album and 2 to Single.
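
To see several of these options used together, here is a sketch of indexing a hypothetical header-less export; the file name, field names, and the line_num field are illustrative and not part of the MusicBrainz run:

# No header line, so the columns are named explicitly; one column is skipped,
# every document is tagged with a fixed datasource value, and the line number
# is recorded in a line_num field.
curl http://localhost:8983/solr/update/csv -F header=false \
  -F fieldnames=id,type,t_name,t_duration -F skip=t_duration \
  -F literal.datasource=mb_tracks -F rowid=line_num \
  -F commit=true -F stream.file=/tmp/tracks_noheader.csv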
