Nutch for crawling web pages

A very common source of data to search is content in web pages, either from the Internet or inside the firewall. The long-time popular solution for crawling and indexing web pages, especially millions of them, is Nutch, a former Lucene subproject. If you need to scale up to millions of pages, then consider Nutch or Heritrix. For smaller scales, there are many simpler options, including ManifoldCF, which is discussed later.

Tip

What about Heritrix?

In previous editions of the book, we highlighted Heritrix, a crawler sponsored by the Internet Archive that was arguably more scalable than Nutch. The output files from the crawler are used in the SolrJ example, and there is an example in /examples/9/heritrix-2.0.2/. However, Nutch has shown more development activity than Heritrix in the past couple of years, and thus we are focusing only on Nutch in this edition.

Nutch is an Internet-scale web crawler, similar to Google's, with components such as the web crawler itself, a link graph database, and parsers for HTML and other common formats found on the Internet. Nutch is designed to scale horizontally over multiple machines during crawling, using the big data platform Hadoop to manage the work.

Note

The Nutch project is the progenitor of the BigData/Search world! Nutch was developed by Doug Cutting and Mike Cafarella in 2002, a few years after Doug developed Lucene (the underpinnings of Solr). To scale it, they built the Nutch Distributed File System (NDFS), which became HDFS. To parse data, they used MapReduce, which spun off to become Hadoop!

Nutch has gone through varying levels of activity and community involvement, and has two lines of development: 1.x, which is very stable and mature, and 2.x, which is less mature but has a more flexible architecture. Previously, Nutch used its own custom search interface based on Lucene, but now it leverages Solr for search. This allows Nutch to focus on web crawling, while Solr provides a powerful dedicated search platform with features such as query spellcheck and faceting that Nutch previously didn't have. Nutch natively understands web relevancy concepts, such as the value of links in calculating a PageRank-style score and how to weigh an HTML <title> tag when building the scoring model to return results. The 2.0 version leverages more standard open source components, such as HBase for the link database instead of Nutch's own internal technology. While arguably a better approach, it brings more dependencies, so the demo uses the 1.x codebase.

Nutch uses a seed list of URLs that tells it where to start finding web pages to crawl. The directory at ./examples/9/nutch/ contains a configured copy of Nutch for crawling through a list of Wikipedia pages for the 300 most popular artists according to MusicBrainz's count of track lookups. Look at the seed_urls.rb script to see the logic used for building the URL seed list, wikipedia_seed_urls.txt. To crawl the Internet for a handful of documents, starting from the seed list, and index them into Solr, run the following from the ./examples/9/nutch/ directory:
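To illustrate the seed list format, a miniature seed file can be built by hand: Nutch expects one URL per line. The URLs below are hypothetical stand-ins; the real wikipedia_seed_urls.txt is generated by seed_urls.rb:

```shell
# Build a tiny, hypothetical seed file (one URL per line, Nutch's expected format).
cat > sample_seed_urls.txt <<'EOF'
https://en.wikipedia.org/wiki/The_Beatles
https://en.wikipedia.org/wiki/Radiohead
EOF

# Count the seed URLs we just wrote.
wc -l < sample_seed_urls.txt
```

You could pass such a file to the crawl script in place of wikipedia_seed_urls.txt to crawl a handful of pages of your own choosing.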

>> ./apache-nutch-1.8/bin/crawl wikipedia_seed_urls.txt testCrawl/ http://127.0.0.1:8983/solr/nutch/ 1

Browse to http://localhost:8983/solr/nutch/select?q=*:*&fl=url,title and you will see some wiki pages crawled and indexed by Nutch into Solr.

The sizeFetchlist=10 parameter in the ./apache-nutch-1.8/bin/crawl bash script tells Nutch how many documents to crawl. We have hardcoded it to 10 to make sure the example crawl doesn't consume all your resources. Once you are satisfied that Nutch is working the way you want, uncomment the line sizeFetchlist=`expr $numSlaves \* 50000`, and trigger the crawl again to index each of the wiki pages listed in the wikipedia_seed_urls.txt file.
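As a sketch of what that change does (the variable names come from the crawl script as quoted above; numSlaves=1 is an assumption for a single-machine crawl, and the exact script contents may differ), the sizing logic is roughly:

```shell
# Assumed value for a single-machine crawl; the script derives this from
# the Hadoop cluster configuration.
numSlaves=1

# The demo hardcodes a small fetch list so a test crawl stays cheap:
sizeFetchlist=10

# Uncommenting the full-size line scales the fetch list with the cluster
# size (note that '*' must be escaped when passed to expr):
sizeFetchlist=`expr $numSlaves \* 50000`

echo "$sizeFetchlist"
```

With one machine, this yields a fetch list of 50,000 pages per crawl cycle, which is why the demo keeps the hardcoded value of 10 instead.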

The schema file (at ./cores/nutch/conf/schema.xml) that Nutch uses is very self-explanatory. The biggest change you might make is to set stored="false" on the content field to reduce the index size if you are doing very large crawls and need to save space.
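That change might look something like the following in schema.xml. The indexed and stored attributes are standard Solr field options; the field type name here is an assumption, and the exact definition in the shipped file may differ:

```xml
<!-- Keep content searchable (indexed="true") but not retrievable in
     results (stored="false") to shrink the index for large crawls. -->
<field name="content" type="text_general" indexed="true" stored="false"/>
```

The tradeoff is that with stored="false", the raw page content can no longer be returned in search results or used for highlighting.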

For more information about the plugins that extend Nutch, and how to configure Nutch for more sophisticated crawling patterns, look at the documentation at http://nutch.apache.org.
