For simplicity, we'll use FileDataSource. With it, we can import data into Solr from XML files, using XPathEntityProcessor to retrieve the data.
Let's go ahead and create a new core named musicCatalog-DIH-XPath in Solr. We can create its configuration files in the same way as the ones we previously created for JDBCDataSource.
In solrconfig.xml, we'll use the following content:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">xpath-data-config.xml</str>
  </lst>
</requestHandler>
We'll create a new file called xpath-data-config.xml, which will contain FileDataSource and XPathEntityProcessor:
<dataConfig>
  <!-- File Data Source -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity processor="XPathEntityProcessor"
            name="musicCatalog"
            pk="songId"
            url="/path/to/SolrIndexingExamples/Chapter-5/sampleData.xml"
            forEach="/musicCatalog/albums/album/"
            transformer="RegexTransformer">
      <field column="songId" xpath="/musicCatalog/albums/album/songId"/>
      <field column="songName" xpath="/musicCatalog/albums/album/songName"/>
      <field column="artistName" xpath="/musicCatalog/albums/album/artistName"/>
      <field column="albumArtist" xpath="/musicCatalog/albums/album/albumArtist"/>
      <field column="albumName" xpath="/musicCatalog/albums/album/albumName"/>
      <field column="songDuration" xpath="/musicCatalog/albums/album/songDuration"/>
      <field column="composer" xpath="/musicCatalog/albums/album/composer"/>
      <field column="rating" xpath="/musicCatalog/albums/album/rating"/>
      <field column="year" xpath="/musicCatalog/albums/album/year"/>
      <field column="genre" xpath="/musicCatalog/albums/album/genre"/>
    </entity>
  </document>
</dataConfig>
The preceding <dataConfig> element indexes a single XML file, which is all we need for our example. To index a list of XML files instead, we can use the following configuration:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="document"
            processor="FileListEntityProcessor"
            baseDir="/path/to/xml-files"
            fileName=".*.xml$"
            recursive="false"
            rootEntity="false"
            dataSource="null">
      <entity processor="XPathEntityProcessor"
              name="musicCatalog"
              pk="songId"
              url="${document.fileAbsolutePath}"
              forEach="/musicCatalog/albums/album/"
              transformer="RegexTransformer">
        <!-- Definition of Fields as per the previous example -->
      </entity>
    </entity>
  </document>
</dataConfig>
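To make the roles of baseDir, fileName, and recursive more concrete, here is a rough Python sketch of the kind of file selection the outer entity describes. It is an illustration only, not how FileListEntityProcessor is implemented: the helper names list_matching_files and demo are hypothetical, and the sketch assumes the fileName pattern is applied to each file name with "find" semantics (a regular expression search). It also shows that the unescaped dot in ".*.xml$" accepts a name such as dataxml, while the stricter ".*\.xml$" does not.

```python
import re
import tempfile
from pathlib import Path


def list_matching_files(base_dir, file_name_regex, recursive=False):
    """Return sorted absolute paths of files under base_dir whose names match the regex.

    Hypothetical helper: mirrors the baseDir / fileName / recursive attributes,
    assuming the pattern is searched for within each file name.
    """
    pattern = re.compile(file_name_regex)
    base = Path(base_dir)
    candidates = base.rglob("*") if recursive else base.iterdir()
    return sorted(
        str(path.resolve())
        for path in candidates
        if path.is_file() and pattern.search(path.name)
    )


def demo():
    """Create a few throwaway files and compare a loose and a strict pattern."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.xml", "b.xml", "notes.txt", "dataxml"):
            (Path(tmp) / name).write_text("<musicCatalog/>")
        loose = [Path(p).name for p in list_matching_files(tmp, r".*.xml$")]
        strict = [Path(p).name for p in list_matching_files(tmp, r".*\.xml$")]
        return loose, strict


if __name__ == "__main__":
    loose_names, strict_names = demo()
    print("loose :", loose_names)
    print("strict:", strict_names)
```

Each path returned by such a listing plays the part of ${document.fileAbsolutePath}, which the inner entity receives as its url.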
A sample XML file that contains the sample album data has been provided in the code that is available with this book.
The contents of the sample XML file look like the following:
<musicCatalog>
  <albums>
    <album>
      <songId>100000010</songId>
      <songName>(Oh No) What You Got</songName>
      <artistName>Justin Timberlake</artistName>
      <albumArtist>Various</albumArtist>
      <albumName>Justified</albumName>
      <songDuration>4.31</songDuration>
      <composer/>
      <rating>3.5</rating>
      <year>2002</year>
      <genre>Pop, Electronic, Dance, Adult Contemporary, Teen Pop</genre>
    </album>
  </albums>
</musicCatalog>
As we can see from xpath-data-config.xml, we are using FileDataSource to read the contents of the file. Then, using XPathEntityProcessor, we fetch the value of each field. For example, we retrieve artistName using the following code:
<field column="artistName" xpath="/musicCatalog/albums/album/artistName"/>
The xpath attribute passes an XPath expression to the field element. XPathEntityProcessor uses this expression to retrieve the artistName value from the XML document, and the value is then fed into Solr for indexing.
Let's test our newly created core in Solr. To do this, we'll start our Solr instance and navigate to the Solr Admin UI (http://localhost:8983/solr/#/musicCatalog-DIH-XPath/).
Let's import the XML data. To do this, click on the Dataimport tab, select the full-import option, and click on Execute, as shown in this screenshot:
As we can see from the preceding screenshot, after we click on the Execute button, the data import handler indexes the data from the XML file into Solr. Solr gives the following output, which tells the user how many documents were added/updated or deleted:
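The same import can also be triggered without the Admin UI by sending an HTTP request to the /dataimport handler registered in solrconfig.xml. The following is a minimal sketch, assuming Solr runs on localhost:8983 and the core is named musicCatalog-DIH-XPath as in the URLs used in this section. The helper names dataimport_url and run are hypothetical; command, clean, and commit are request parameters of the data import handler.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "musicCatalog-DIH-XPath"           # core name taken from the Admin UI URL


def dataimport_url(command, **params):
    """Build a URL for the /dataimport handler (hypothetical helper)."""
    query = urlencode({"command": command, **params})
    return f"{SOLR_BASE}/{CORE}/dataimport?{query}"


def run(url):
    """Send the request and return the raw response body.

    Only works against a running Solr instance, so it is not called here.
    """
    with urlopen(url) as response:
        return response.read().decode("utf-8")


full_import = dataimport_url("full-import", clean="true", commit="true")
status = dataimport_url("status")

if __name__ == "__main__":
    print(full_import)
    print(status)
```

Requesting the status URL after starting a full import is one way to check how many documents have been processed so far.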
After running the import, we can query the Solr index to retrieve our indexed document. To do this, we can use the query browser in the Solr Admin UI, or we can directly go to this URL:
http://localhost:8983/solr/musicCatalog-DIH-XPath/select?q=*%3A*&wt=json&indent=true
The following result is expected if the data import is successful:
{
  "responseHeader":{
    "status":0,
    "QTime":0
  },
  "response":{
    "numFound":1,
    "start":0,
    "docs":[
      {
        "genre":"Pop, Electronic, Dance, Adult Contemporary, Teen Pop",
        "composer":"",
        "albumArtist":"Various",
        "tmpField":[
          "Various",
          "100000010",
          "Justin Timberlake"
        ],
        "albumName":"Justified",
        "songDuration":4.31,
        "year":2002,
        "songName":"(Oh No) What You Got",
        "rating":3.5,
        "songId":"100000010",
        "artistName":"Justin Timberlake"
      }
    ]
  }
}
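If we want to check the result from a script rather than by eye, the JSON response can be parsed and reduced to the values we care about. The sketch below is one possible way to do that. The function summarize is hypothetical, and RESPONSE_TEXT is a shortened copy of the response shown above rather than live output from Solr.

```python
import json

# Shortened copy of the response above; a real script would read it from the select URL.
RESPONSE_TEXT = """
{
  "responseHeader": {"status": 0, "QTime": 0},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "songId": "100000010",
        "songName": "(Oh No) What You Got",
        "artistName": "Justin Timberlake",
        "year": 2002
      }
    ]
  }
}
"""


def summarize(response_text):
    """Return the hit count and (songId, songName) pairs (hypothetical helper)."""
    body = json.loads(response_text)
    if body["responseHeader"]["status"] != 0:
        raise RuntimeError("Solr reported a non-zero status")
    response = body["response"]
    songs = [(doc["songId"], doc["songName"]) for doc in response["docs"]]
    return response["numFound"], songs


if __name__ == "__main__":
    found, songs = summarize(RESPONSE_TEXT)
    print(found, songs)
```

A numFound value of 1 matches the single album element in the sample file.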
The preceding result shows us how we can use the data import handler to index XML documents into Solr.