ManifoldCF – a connector framework

Apache ManifoldCF (CF meaning Connector Framework) provides a framework for extracting content from multiple repositories, enriching it with document-level security information, and outputting the resulting documents into Solr, based on the security model found in Microsoft's Active Directory platform. Working with ManifoldCF requires an understanding of how four pieces interact: a Repository Connector extracts content from a repository, an Output Connector sends the documents and their security tokens to Solr, an Authority Connector lists a specific user's access tokens, and finally a search filters the document results based on that list of tokens. ManifoldCF ensures that, as content and its security classifications are updated in the underlying repositories, the changes are synchronized to Solr, either on a schedule or through continuous monitoring. Finally, it has a convenient web UI for managing the connector states.

Connectors

ManifoldCF provides connectors that index content into Solr from a number of enterprise content repositories, including SharePoint, Documentum, Meridio, LiveLink, and FileNet. Competing with DataImportHandler and Nutch, ManifoldCF also crawls web pages, RSS feeds, JDBC databases, remote Windows shares, and local filesystems, adding document-level security tokens where applicable. Also of note is its MediaWiki connector. The most compelling use case for ManifoldCF is leveraging Active Directory to provide access tokens for content indexed from Microsoft SharePoint repositories, followed by simply gaining access to content in the other enterprise content repositories.

Putting ManifoldCF to use

While the sweet spot for using ManifoldCF is with an authority like Active Directory, we're going to reuse our MusicBrainz.org data and come up with a simple scenario for playing with ManifoldCF and Solr. We will use our own MusicBrainzConnector class to read in data from a simple CSV file that contains a MusicBrainz ID for an artist, the artist's name, and a list of music genre tags for the artist:

4967c0a1-b9f3-465e-8440-4598fd9fc33c,Enya,folk,pop,irish

The data will be streamed through ManifoldCF and out to our /manifoldcf Solr core, with the list of genres used as the access tokens. To simulate an Authority service that translates a username to a list of access tokens, we will use our own GenreAuthority. It takes the first character of the supplied username and returns a list of genres that start with the same character. So a call to ManifoldCF for the username [email protected] would return the access tokens pop and punk. A search for "Chris" would match "Chris Isaak", since he is tagged with pop, but "Chris Cagle" would be filtered out, since he plays only American and country music.
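
The token lookup described above can be sketched in a few lines of Python. This is only an illustration of the first-letter rule, not the GenreAuthority source itself: the function name, the genre list, the example usernames, and the case-insensitive matching are all assumptions made for the sketch.

```python
def genre_tokens(username, known_genres):
    """Return the genres whose first letter matches the username's first letter."""
    if not username:
        return []  # no username, no tokens (assumed behavior)
    first = username[0].lower()
    # Case-insensitive comparison is an assumption of this sketch
    return sorted(g for g in known_genres if g.lower().startswith(first))


# Illustrative genre list and hypothetical usernames
GENRES = ["american", "country", "folk", "irish", "pop", "punk"]

print(genre_tokens("paul@example.com", GENRES))   # ['pop', 'punk']
print(genre_tokens("conan@example.com", GENRES))  # ['country']
```

A real Authority Connector would look tokens up in a directory service rather than derive them from the username, but the contract is the same: a username goes in, a list of access tokens comes out.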

Browse the source for both MusicBrainzConnector and GenreAuthority in ./examples/9/manifoldcf/connectors/ to get a better sense of how specific connectors work with the greater ManifoldCF framework.

To get started, we need to add some new dynamic fields to our schema in cores/manifoldcf/conf/schema.xml:

<dynamicField name="allow_token_*" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="deny_token_*" type="string" indexed="true" stored="true" multiValued="true"/>

These rules allow the Solr output connector to store access tokens in fields such as allow_token_document and deny_token_document.
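
To make the mapping concrete, here is a rough Python sketch of how one row of the CSV file might end up as a Solr document whose genres land in allow_token_document. The function name and the id and name field names are hypothetical; the actual MusicBrainzConnector and the Solr output connector may structure the document differently.

```python
import csv
import io


def row_to_solr_doc(line):
    """Turn one 'id,name,genre,genre,...' row into a Solr-style document dict."""
    # csv.reader copes with quoted artist names that contain commas
    fields = next(csv.reader(io.StringIO(line)))
    mbid, name, genres = fields[0], fields[1], fields[2:]
    return {
        "id": mbid,                      # hypothetical field name
        "name": name,                    # hypothetical field name
        "allow_token_document": genres,  # matched by the allow_token_* dynamic field
    }


doc = row_to_solr_doc("4967c0a1-b9f3-465e-8440-4598fd9fc33c,Enya,folk,pop,irish")
print(doc["allow_token_document"])  # ['folk', 'pop', 'irish']
```

Because the dynamic fields are multiValued, each genre is stored as a separate value, which is what lets a later filter match on any single token.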

Now we can start up ManifoldCF. The version distributed with this book is a stripped-down version, with just the specific connectors required for this demo! In a separate terminal window, from ./examples/9/manifoldcf/example, run the following command:

>>java -jar start.jar

ManifoldCF ships with Jetty as a servlet container, hence the very similar start command to the one Solr uses!

Browse to http://localhost:8345/mcf-crawler-ui/ to access the ManifoldCF user interface, which exposes the following main functions:

  • List Output Connections: This provides a list of all the recipients of the extracted content. It is configured to store content in the manifoldcf Solr core.
  • List Authority Connections: This translates user credentials to a list of security tokens. You can test that our GenreAuthority is functioning by calling the API at http://localhost:8345/mcf-authority-service/[email protected] and verifying you receive a list of genre access tokens starting with the letter p.
  • List Repository Connections: This lists the repositories that content is read from. The only repository we have is the CSV file of artist/genre information. Other repositories, such as RSS feeds or SharePoint sites, would also be listed here. When you create a repository connection, you associate it with a connector and with the Authority you are going to use, which in our case is GenreAuthority.
  • List All Jobs: This lists all the combinations of input repositories and Solr outputs.
  • Status and Job Management: This very useful screen allows you to stop, start, abort, and pause the jobs you have scheduled, and provides a basic summary of the number of documents that have been found in the repository as well as those processed into Solr.

Go ahead and choose the Status and Job Management screen and trigger the indexing job. Click on Refresh a couple of times, and you will see the artists' content being indexed into Solr. To see the various genres being used as access tokens, browse to:

http://localhost:8983/solr/manifoldcf/select?q=*:*&facet=true&facet.field=allow_token_document&rows=0

At the time of writing, neither ManifoldCF nor Solr has a component that hooks ManifoldCF-based permissions directly into Solr. However, based on the code from the ManifoldCF in Action manuscript, available at http://code.google.com/p/manifoldcfinaction/, you can easily add a Search Component to your request handler. Add the following code to solrconfig.xml:

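<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- last-components appends to Solr's default component chain; a plain
       "components" list would replace the defaults, including the query component -->
  <arr name="last-components">
    <str>manifoldcf</str>
  </arr>
</requestHandler>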

<searchComponent name="manifoldcf" class="org.apache.manifoldcf.examples.ManifoldCFSecurityFilter">
  <str name="AUTHENTICATED_USER_NAME">username</str>
</searchComponent>

You are now ready to perform your first query! Search for Chris, specifying your username as [email protected], and you should see only pop and punk artists returned:

http://localhost:8983/solr/manifoldcf/select?q=Chris&[email protected]

Change the username parameter to [email protected] and Chris Cagle, the country singer, should be returned instead! As documents are added to or removed from the CSV file, ManifoldCF will notice the changes and reindex the updated content.
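
The effect of the security filter on these two searches can be simulated outside Solr. The following Python sketch keeps only the documents that share at least one allow token with the user. It is an approximation under stated assumptions: the function name and the sample documents are illustrative, the token lists are typed in by hand rather than fetched from the authority service, and the real ManifoldCFSecurityFilter may also apply deny tokens, which are ignored here.

```python
def visible_documents(docs, user_tokens, field="allow_token_document"):
    """Keep documents that carry at least one of the user's access tokens."""
    allowed = set(user_tokens)
    # A document with no allow tokens is hidden in this sketch (an assumption)
    return [d for d in docs if allowed & set(d.get(field, []))]


# Illustrative results for the query "Chris", tagged as described in the text
results = [
    {"name": "Chris Isaak", "allow_token_document": ["pop"]},
    {"name": "Chris Cagle", "allow_token_document": ["american", "country"]},
]

# A user whose name starts with "p" holds the tokens pop and punk
print([d["name"] for d in visible_documents(results, ["pop", "punk"])])
# A user whose name starts with "c" holds the token country
print([d["name"] for d in visible_documents(results, ["country"])])
```

The first call prints only Chris Isaak and the second only Chris Cagle, mirroring what the two Solr queries above are expected to return.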
