Time for action – expanding the original data with coordinates during the update process

Defining an external, dedicated index is not unusual in a real-world context, where you sometimes want to index and search for a specific kind of entity. You can then recognize those entities and annotate them in your own data, in a way similar to what we already did for highlighting, but in a more controlled way and with more semantics. A good experiment at this point, involving just a bit of scripting, is to expand each document with the information we have about its city at the time it is posted for indexing.

  1. Suppose we have already collected information about cities, and we want to use it to expand every new document that contains a known city. I have defined a new paintings_geo core that extends the usual painting core with these update capabilities, as in the following snippet of solrconfig.xml:
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      <autoCommit>
        <maxTime>100</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">script</str>
      </lst>
    </requestHandler>
    <updateRequestProcessorChain name="script">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">cities_from_osm.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
  2. We leave everything in the schema.xml file unchanged, except for adding the location type and dynamic field seen before:
    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_lat_lon" postingsFormat="SimpleText" />
    <dynamicField name="*_coordinates" type="location" indexed="true" stored="true" />
  3. Now let's start the core with the usual scripts and post some data to it, using a simple CSV update, with the following command:
    >> curl -X GET 'http://localhost:8983/solr/paintings_geo/update?stream.body=TEST_01,Mona+Lisa,Leonardo+Da+Vinci,Paris&stream.contentType=application/csv;charset=utf-8&fieldnames=uri,title,artist,city'
    
  4. Here we are posting, as an example, a small record about the Mona Lisa, located at the Louvre in Paris.
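To verify that the expansion took place, we can simply ask the new core for the document we just posted (the uri value TEST_01 comes from the previous command); the coordinates field should now appear among the stored fields:

    >> curl -X GET 'http://localhost:8983/solr/paintings_geo/select?q=uri:TEST_01&wt=json&indent=true'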

What just happened?

If you start the usual importing process on this new core with the provided script and look at the log running in the open console, you can easily see that some extra information is being added to the documents. During the update process we attached an update processor to the update chain, so that every document will be processed by the cities_from_osm.js script. Here we could use any scripting language supported by the JVM: JRuby, Groovy, JavaScript, Scala, Clojure, and others. We decided to use JavaScript for two reasons: it has a well-known syntax, and it's already used in many HTML front ends, so it's likely that somebody on the team knows it and can help if needed.

We can easily explain what happens here with some simplified pseudo code (please refer to the provided example for the complete script):

function processAdd(cmd) {

  // the document currently being indexed
  var doc = cmd.solrDoc;
  var city = doc.getFieldValue("city");

  // open a searcher on the already populated 'cities' core
  // (getCore() is a helper defined in the complete script)
  var other = getCore(cmd, 'cities');
  var searcher = other.newSearcher("/select");

  // search the cities core for an exact match on the city name
  var term = new org.apache.lucene.index.Term("city", city);
  var query = new org.apache.lucene.search.TermQuery(term);
  var docset = searcher.getDocSet(query);

  // annotate a recognized entity (museumValue is computed in the
  // complete script; it is only a placeholder here)
  doc.setField("museum_entity", museumValue);

  // save the query used and the number of matching documents
  doc.setField("[QUERY]", "/"+other.getName()+"?q="+query.toString());
  doc.setField("[DOCS_HITS]", docset.size());

  // copy the coordinates of every matching city into the document
  // (normalize() and ns are defined in the complete script)
  var doc_it = docset.iterator();
  while (doc_it.hasNext()) {
    var id = doc_it.nextDoc();
    var doc_found = searcher.doc(id);
    var lat = doc_found.getField("lat").stringValue();
    var lon = doc_found.getField("lon").stringValue();
    doc.setField("coordinates", normalize(ns, lat)+","+normalize(ns, lon));
  }

  searcher.close();
}

Note that even if we are using JavaScript as the language, the general API (names and parameters of the methods) corresponds to the original Java one, to make things easier. Once we retrieve the SolrInputDocument doc from the UpdateRequestProcessor command, we are able to add new fields to it at runtime (remember that we have defined dynamic fields; this simplifies things, as we are not bound to a specific field name and type here). In particular, we query the cities core, already populated with data from OpenStreetMap, to obtain the coordinates of a city matching the one in the document. This approach is not optimal, as we are actually opening a new searcher for every document; it can be optimized by saving a reference to a searcher. The simplest way to do that could be using a global variable, as sketched below, although we don't really need it now.
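A minimal sketch of that global variable idea, assuming the same getCore() helper used in the complete script and a hypothetical cachedSearcher variable, could look like this:

// keep the searcher in a global variable, so it can be reused
// across calls instead of being opened for every document
var cachedSearcher = null;

function getCitiesSearcher(cmd) {
  if (cachedSearcher == null) {
    // getCore() is the same helper used in the complete script
    cachedSearcher = getCore(cmd, 'cities').newSearcher("/select");
  }
  return cachedSearcher;
}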

Finally, we can construct the coordinates field with the right syntax (a comma-separated lat,lon pair); this time the coordinates are physically stored within the painting resource.

We also decided to put two extra fields ([QUERY] and [DOCS_HITS]) into the document, representing the original query and the number of matching documents. These are just further examples of creating fields at runtime, but they are also useful for seeing how to handle a Java object from within JavaScript. Note that here we are adopting the naming convention of the pseudo-fields used with transformers, simply so they can be recognized among the original fields; in this case, however, the fields are actually saved in the index, not created on the fly.
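As a reminder of the difference, transformer pseudo-fields are computed on the fly at query time and requested through the fl parameter; for example, the standard [docid] transformer returns the internal Lucene document identifier without it ever being stored in the index:

    >> curl -X GET 'http://localhost:8983/solr/paintings_geo/select?q=*:*&fl=uri,[docid]'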
