In this section, we will start analyzing our data, and define the basic fields for a logical entity, which will be used for writing the documents to be indexed.
Looking at the structure of the downloaded files (an XML serialization of RDF descriptions), we don't need to think too much about RDF itself for our purpose. If you open them with a text editor, you will find that they contain the same metadata for every resource, so you can easily find the corresponding DBpedia page. As seen before, most of these metadata elements are based on best practices and standard vocabularies, such as Dublin Core, which are designed for sharing representations of resources and can be indexed almost directly. Starting from that, we can decide how to describe our paintings for our searches, and which basic elements we need to select to construct our basic example core.
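To make the idea of "indexing almost directly" more concrete, here is a minimal sketch of reading a couple of metadata values out of an RDF/XML description. The sample fragment, the `example.org` URI, and the `extract_resources` helper are invented for illustration; the real downloaded files are larger and use more vocabularies than the two Dublin Core elements assumed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF/XML fragment, written only for this example.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/resource/Mona_Lisa">
    <dc:title>Mona Lisa</dc:title>
    <dc:creator>Leonardo Da Vinci</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""

# Namespace prefixes used in the lookups below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def extract_resources(rdf_xml):
    """Return one dict per rdf:Description found directly under the root."""
    root = ET.fromstring(rdf_xml.strip())
    about = "{%s}about" % NS["rdf"]  # qualified name of the rdf:about attribute
    resources = []
    for desc in root.findall("rdf:Description", NS):
        resources.append({
            "uri": desc.get(about),
            "title": desc.findtext("dc:title", namespaces=NS),
            # Assumption: dc:creator is mapped to an "artist" value.
            "artist": desc.findtext("dc:creator", namespaces=NS),
        })
    return resources


resources = extract_resources(SAMPLE)
for resource in resources:
    print(resource)
```

Each extracted dictionary is already close to the flat key/value shape that a search index expects, which is why we do not need to reason about RDF itself here.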
You can look at the sketch schema shown in the following diagram. It's simple to start thinking about a painting entity, which you can think of as a box for some fields:
The elements cited are inspired by some of the usual metadata we intuitively expect and are able to find in most cases in the downloaded files.
I strongly suggest that you make a schema like the previous image when you are about to start writing your own configuration. This makes things clearer than starting to code directly, and also helps us talk with each other, sharing and understanding an emerging design.
From this collection of ideas for important elements, we can start isolating some essential fields (let's see things directly from the Solr perspective). Once the new Solr core is running for the first time, we can add new specific fields and configurations.
To represent our Painting entity, define a simple Solr document with the following fields:
| Field | Example |
|---|---|
| uri | |
| title | Mona Lisa |
| artist | Leonardo Da Vinci |
| museum | Louvre |
| city | Paris |
| year | ~1500 |
| wikipedia_link | |
We have adopted only a few fields, although there could be many more; in this particular case we have selected those that seem the easiest for us to recognize and explore.
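As a rough sketch of what such a document could look like in practice, the following snippet builds the Painting entity as a plain dictionary and serializes it to JSON, a format that Solr's update handler can accept as a list of documents. The `make_painting_doc` helper is hypothetical, the `uri` and `wikipedia_link` values are omitted because no example values are given for them, and keeping `year` as the string `~1500` is an assumption rather than a schema decision.

```python
import json

# Field names taken from the list of fields for the Painting entity.
FIELDS = ("uri", "title", "artist", "museum", "city", "year", "wikipedia_link")


def make_painting_doc(**values):
    """Build a flat document, rejecting names that are not known fields."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError("unknown field(s): %s" % ", ".join(sorted(unknown)))
    # Keep the declared field order and skip fields without a value.
    return {name: values[name] for name in FIELDS if name in values}


doc = make_painting_doc(
    title="Mona Lisa",
    artist="Leonardo Da Vinci",
    museum="Louvre",
    city="Paris",
    year="~1500",  # assumption: kept as text, since the date is approximate
)

# A list of documents, serialized as JSON.
payload = json.dumps([doc], indent=2)
print(payload)
```

Rejecting unknown field names early mirrors what a strict schema would do, and it makes typos in field names visible before anything is sent to the index.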