Time for action – indexing CSV (for example, open data)

There are times when we need to index external CSV files. This is not unusual, because it's simple to work with Excel or similar spreadsheet tools and save the data in a text-delimited format (the most widely used are CSV and TSV). We have already seen that Solr supports indexing CSV via an update handler, so it's not surprising that we can also handle these files with a DataImportHandler processor, as follows:

<dataSource name="fds" type="FileDataSource" encoding="UTF-8" />
<document>
  <entity name="files" 
    dataSource="null" 
    rootEntity="false" 
    processor="FileListEntityProcessor" 
    baseDir="${solr.solr.home}/../../resources" 
    fileName=".*\.csv" 
    onError="abort" 
    newerThan="${dataimporter.last_index_time}"    
    transformer="script:console" 
    recursive="true">
  <entity name="csv_file" 
    processor="LineEntityProcessor" 
    url="${files.fileAbsolutePath}" 
    dataSource="fds" 
    transformer="RegexTransformer">
  <field column="rawLine" 
      regex="^(.*);(.*);(.*)$" 
      groupNames="AUTHOR,TITLE,URL" />
    <field column="URL" name="uri"  />
    </entity>
  </entity>
</document>

In this case, we have defined a subentity that does all the actual work, reading from a local file, though we could also have pointed directly to a URL containing the file. With this approach, we can index every new CSV file that appears in a specific folder, which can be useful for many batch-import tasks.

What just happened?

The external entity here uses a FileListEntityProcessor only to obtain the list of files to be processed (in our specific case, we have a single /SolrStarterBook/resources/wgarts/csv/catalog.csv file; but if you want to test the configuration, you can split the file into multiple ones, or simply add a new file). The base folder is relative to the active Solr instance we started, and the entity will emit the list of filenames ending with the .csv suffix. Note the newerThan attribute: comparing files against the last import time is crucial when you have to manage multiple imports, because it lets you pick up only files changed since the previous run.

In this case, all the real processing is handled by a LineEntityProcessor. This particular processor reads the textual content and emits one line at a time, so it's up to us to correctly extract the fields from each line. In order to maintain readability, I only projected three fields in the previous example; you will find the complete configuration in the /SolrStarterBook/solr-app/chp08/csv directory. A LineEntityProcessor always emits a rawLine column, which contains all the text read for that line. By using a RegexTransformer, we can extract specific parts of the line to produce new fields.
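As a sketch of an alternative approach (not part of the example above), a RegexTransformer can also populate one field at a time through its sourceColName attribute, each field capturing a single group from rawLine. The entity name and patterns here are only illustrative:

```xml
<!-- Hypothetical alternative: extract each field separately from rawLine.
     Each regex captures exactly one group, which becomes the column value. -->
<entity name="csv_file_alt"
    processor="LineEntityProcessor"
    url="${files.fileAbsolutePath}"
    dataSource="fds"
    transformer="RegexTransformer">
  <field column="AUTHOR" sourceColName="rawLine" regex="^([^;]*);.*$" />
  <field column="TITLE"  sourceColName="rawLine" regex="^[^;]*;([^;]*);.*$" />
  <field column="uri"    sourceColName="rawLine" regex="^[^;]*;[^;]*;(.*)$" />
</entity>
```

This style is more verbose than groupNames, but it makes each mapping explicit and lets you use a different pattern per field.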

In the example, we want to have only three fields for each row, separated by the ; character; we then assign the correct names to them via the groupNames attribute. There is some similarity with the common INSERT INTO syntax for databases: the names of the fields into which the captured values will be put must be declared in the exact order, separated by a , character. We can read data in the same way from remote sources. (Remember that we are using a URL, not a simple file path! So it can be either file:///some_path or http://some_path.) We can assemble different sources, and we can continue using custom JavaScript transformers to fix problems in the data.
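As a sketch of the remote variant, the same line-by-line processing can read directly from an HTTP source by swapping in a URLDataSource; the host and path below are assumptions, not a real endpoint:

```xml
<!-- Hypothetical: index the same ;-separated data from a remote URL.
     URLDataSource returns a character stream that LineEntityProcessor
     consumes one line at a time, exactly as with a local file. -->
<dataSource name="rds" type="URLDataSource" encoding="UTF-8" />
<entity name="remote_csv"
    processor="LineEntityProcessor"
    url="http://example.com/open-data/catalog.csv"
    dataSource="rds"
    transformer="RegexTransformer">
  <field column="rawLine"
      regex="^(.*);(.*);(.*)$"
      groupNames="AUTHOR,TITLE,URL" />
  <field column="URL" name="uri" />
</entity>
```

The only change from the local configuration is the data source type and the url value; the transformer logic stays identical.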

Tip

As a general recipe, I strongly suggest adopting the UTF-8 encoding explicitly whenever possible. In contexts such as rapid prototyping, automatic indexing of remote open data, or some kinds of search over shared documents, it can be useful to define a list of paths and links and save it in a text file. In this way, you can change the external entity to emit a list of resources to be indexed directly by a LineEntityProcessor. This can help you move faster at first, and it's simple to switch later to a more appropriate solution.
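The tip above could be sketched as follows, assuming a hypothetical resources.txt containing one URL per line: the outer entity emits each line, and the inner entity fetches and processes the resource it points to. File names and paths here are assumptions.

```xml
<!-- Sketch only: resources.txt, its path, and the field names are
     assumptions. A URLDataSource handles both file:// and http:// URLs. -->
<dataSource name="uds" type="URLDataSource" encoding="UTF-8" />
<entity name="resource_list"
    processor="LineEntityProcessor"
    url="file:///path/to/resources.txt"
    dataSource="uds"
    rootEntity="false">
  <!-- each rawLine of resources.txt is itself a URL to fetch -->
  <entity name="remote_resource"
      processor="LineEntityProcessor"
      url="${resource_list.rawLine}"
      dataSource="uds"
      transformer="RegexTransformer">
    <field column="rawLine" regex="^(.*);(.*);(.*)$"
        groupNames="AUTHOR,TITLE,URL" />
  </entity>
</entity>
```

Changing what gets indexed then only requires editing the text file, not the DataImportHandler configuration.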

Now that we have seen how to manage an external entity that iterates over a collection of files and emits their names (or URLs), it's simple to apply the same strategy to different file types, adopting a different processor where necessary.
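As a sketch of this idea (the directory, element names, and fields are assumptions), the same outer FileListEntityProcessor can feed an XPathEntityProcessor to index XML files instead of CSV:

```xml
<!-- Hypothetical: list *.xml files and parse each one with XPath.
     forEach selects the repeating element that becomes one document. -->
<entity name="xml_files"
    processor="FileListEntityProcessor"
    baseDir="${solr.solr.home}/../../resources"
    fileName=".*\.xml"
    rootEntity="false"
    dataSource="null">
  <entity name="xml_doc"
      processor="XPathEntityProcessor"
      url="${xml_files.fileAbsolutePath}"
      forEach="/record"
      dataSource="fds">
    <field column="title"  xpath="/record/title" />
    <field column="author" xpath="/record/author" />
  </entity>
</entity>
```

Only the inner entity changes; the file-listing logic, including the newerThan check, is reused as-is.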
