Collecting the paintings data from DBpedia

It's now time to collect metadata from DBpedia. This part is not strictly related to Solr itself, but it is useful for creating more realistic examples. So, in the repository I have prepared both the scripts to download the files and some Solr documents already created for you from the downloaded files. If you are not interested in retrieving the files by yourself, you can skip directly to the Indexing example data section.

In the /SolrStarterBook/test/chp03/ directory, you will also find the INSTRUCTIONS.txt file, which describes the full process step by step, from the beginning to a simple query.

Downloading data using the DBpedia SPARQL endpoint

DBpedia is a project aiming to construct a structured, semantic version of Wikipedia data, represented in RDF (Resource Description Framework). Most of the data we are interested in is described using the Yago ontology, a well-known knowledge base, and can be queried using a specific language called SPARQL.

Note

The Resource Description Framework is widely used for conceptual data modeling and conceptual description. It is used in many contexts on the Web and can be used to describe the "semantics" of data. If you want to know more, the best way is to start from the Wikipedia page, http://en.wikipedia.org/wiki/Resource_Description_Framework, which also contains links to the most recent RDF specifications by the W3C.

Just to give you an idea, to obtain a list of pages we could use SPARQL queries similar to the following (I omit the details here) against the DBpedia endpoint found at http://dbpedia.org/sparql:

SELECT DISTINCT ?uri
WHERE { 
  ?uri rdf:type ?type.
  {?type rdf:type <http://dbpedia.org/class/yago/Painting...>}
}

When the list is complete (we can ask for the results directly as a CSV list of URIs), we can then download every item on it in order to extract the metadata we are interested in.
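
If you want a feel for how this step could be done programmatically, here is a minimal Scala sketch (this is not the book's downloadFromDBPedia script: the object name is mine, the class URI is a placeholder still to be completed, and it assumes the endpoint accepts the usual Virtuoso format parameter for CSV output):

import java.net.URLEncoder
import scala.io.Source

object ListPaintingUris extends App {
  // Placeholder: substitute the full Yago class URI here
  val paintingClass = "http://dbpedia.org/class/yago/Painting..."

  val query =
    s"""SELECT DISTINCT ?uri
       |WHERE {
       |  ?uri rdf:type ?type .
       |  { ?type rdf:type <$paintingClass> }
       |}""".stripMargin

  // Asking the endpoint for text/csv returns one URI per line after a header row
  val url = "http://dbpedia.org/sparql" +
    "?query=" + URLEncoder.encode(query, "UTF-8") +
    "&format=" + URLEncoder.encode("text/csv", "UTF-8")

  // Print every resource URI, skipping the CSV header
  Source.fromURL(url, "UTF-8").getLines().drop(1).foreach(println)
}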

Note

The Scala language is a very good option for writing scripts: it combines the capabilities of the standard Java libraries with a powerful, concise, and (in most cases) easy-to-read syntax, so the scripts are written in this language. You can download and install Scala following the official instructions at http://www.scala-lang.org/download/, adding the Scala interpreter and compiler to the PATH environment variable as we have done with Java.
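
Once Scala is installed, a quick way to verify that the interpreter is reachable from the PATH is to print its version:

>> scala -version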

We can directly execute the Scala sources in /SolrStarterBook/test/chp03/paintings/, which is handy if you want to customize the scripts. For example, we can start the download process by calling the downloadFromDBPedia script (if you are on a Windows platform, simply use the .bat version instead):

>> ./downloadFromDBPedia.sh

If you don't want to install Scala, you can simply run the already compiled downloadFromDBPedia.jar file with Java (with the Scala library included on the classpath) by using the alternative script:

>> ./downloadFromDBPedia_java.sh

Note that the two methods are equivalent: when the first one runs, it creates the executable JAR if it does not already exist.

Tip

When playing with the Wikipedia API, it is simple to obtain a single page. If you look at the example http://en.wikipedia.org/w/api.php?action=parse&format=xml&page=Mona_Lisa, you will see what we are able to retrieve directly from the Wikipedia API: the resource page describing the well-known painting La Gioconda. If you are interested in using existing web crawler libraries to automate these processes without having to write ad hoc code every time, you should probably take a look at Apache Nutch at http://nutch.apache.org/ and how to integrate it with Solr.

Once the download terminates (it could take a while, depending on your system and network speed), we will finally have collected several RDF/XML files in /SolrStarterBook/resources/dbpedia_paintings/downloaded/. These files are our first source of information, but we need to create Solr XML documents from them in order to post them to our index.
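
Each downloaded file is an RDF/XML description of a single resource. Just to give an idea of the input we are dealing with, a heavily abridged file could look like the following (the real files contain many more properties):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Mona_Lisa">
    <rdfs:label xml:lang="en">Mona Lisa</rdfs:label>
    <!-- many more properties in the real files -->
  </rdf:Description>
</rdf:RDF>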

Creating Solr documents for example data

The XML format used for posting is the default Solr XML update format. Due to the number of resources to be added, I prefer to create a separate XML file to post for every resource. In a real system this process could be handled differently, but in our case it permits us to easily skip problematic documents.
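
For reference, each generated file follows the standard add/doc structure; the field names here are only illustrative of the ones defined by the chapter's schema:

<add>
  <doc>
    <field name="uri">http://dbpedia.org/resource/Mona_Lisa</field>
    <field name="title">Mona Lisa</field>
    <field name="artist">Leonardo da Vinci</field>
  </doc>
</add>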

To create the Solr XML documents, we have two options. Again, it's up to you to decide whether you want to use the Scala script directly or call the compiled JAR, using one of the following two commands:

>> ./createSolrDocs.sh
>> ./createSolrDocs_java.sh

This process internally uses an XSLT transformation (dbpediaToPost.xslt) to create, for every resource, a Solr document containing the fields we are interested in. You may notice some errors on the console, since some of the resources can have issues with data format, encoding, and so on. These will not be a problem for us; indeed, they serve as a realistic example of why we may need to manage character normalization or data manipulation in general.
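
To give a feel for what such a transformation does, here is a deliberately simplified fragment in the same spirit (the field and property selections are illustrative; the actual dbpediaToPost.xslt in the repository handles many more fields and edge cases):

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <xsl:template match="/rdf:RDF">
    <add>
      <doc>
        <!-- emit one Solr field for each RDF property we want to keep -->
        <field name="uri">
          <xsl:value-of select="rdf:Description/@rdf:about"/>
        </field>
        <field name="title">
          <xsl:value-of select="rdf:Description/rdfs:label[@xml:lang='en']"/>
        </field>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>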

Indexing example data

The first thing we need in order to index data is a running Solr instance with our current configuration. For your convenience, you can use the script provided in /SolrStarterBook/test:

>> ./start.sh chp03

In the /SolrStarterBook/resources/dbpedia_paintings/solr_docs/ directory, we have the Solr documents (in XML format) that can be posted to Solr in order to construct the index. To simplify this task, we will use a copy of the post.jar tool that you'll find in every standard Solr installation:

>> java -Dcommit=yes -Durl=http://localhost:8983/solr/paintings/update -jar post.jar ../../../resources/dbpedia_paintings/solr_docs/*.xml

Note how the path used is relative to the current /SolrStarterBook/test/chp03/paintings/ folder.
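
Once the post completes, a quick way to verify that everything is searchable is to ask for all documents while returning a single result (the paintings core name matches the update URL used above), for example with curl or directly in the browser:

>> curl "http://localhost:8983/solr/paintings/select?q=*:*&rows=1"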
