Chapter 7. Apache Nutch

In the previous chapter, we saw how to index documents into Solr using Apache Tika. In this chapter, we'll see how to use Apache Nutch to crawl web content and index it in Solr. This chapter will cover the following topics:

  • Introducing Apache Nutch
  • Installing Apache Nutch
  • Configuring Solr with Nutch

Introducing Apache Nutch

Apache Nutch is an open source web crawler that can be used to retrieve pages from websites and extract data from them. It is an extensible and scalable crawler, and its plugin architecture gives us the freedom to adapt it to our needs. Apache Nutch is written in Java, just like Apache Solr, and together the two tools make a good combination for building a search engine of our own.
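To make the idea of a web crawler concrete, here is a minimal sketch in Python of the basic loop any crawler performs: fetch a page, extract its links, and queue the ones not yet seen. This is not Nutch's code or API. The names `LinkExtractor`, `crawl`, and `FAKE_WEB` are hypothetical, and the "web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed, fetch, max_pages=10):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid loops
    fetched = {}               # url -> HTML of every page retrieved
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        fetched[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


# Hypothetical in-memory "web" (an assumption made for illustration only).
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "<p>no links here</p>",
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

A real crawler such as Nutch layers much more on top of this loop, such as distributed execution and plugins, but the fetch, parse, and enqueue cycle is the core idea.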

Apache Nutch can be used on a single node or can be run in a distributed way with multiple nodes. Let's see how we can combine Apache Solr and Apache Nutch to crawl a web page and index it. To do this, let's start by installing Apache Nutch.
