Chapter 7. Apache Nutch

In the previous chapter, we saw how to index documents into Solr using Apache Tika. In this chapter, we'll see how to use Apache Nutch to crawl web content and index it in Solr. This chapter covers the following topics:

Introducing Apache Nutch
Installing Apache Nutch
Configuring Solr with Nutch

Introducing Apache Nutch

Apache Nutch is an open source web crawler that can be used to retrieve pages from websites and extract data from them. It is extensible and scalable, and its plugin architecture gives us the freedom to adapt it to our needs. Apache Nutch is written in Java, just like Apache Solr, and the two tools combine well for building a search engine of our own.

Apache Nutch can run on a single node or in a distributed setup across multiple nodes. Let's see how we can combine Apache Solr and Apache Nutch to crawl a web page and index it. To do this, let's start by installing Apache Nutch.
