List of Figures

Chapter 1. Understanding collective intelligence

Figure 1.1. A user may be influenced by other users either directly or through intelligence derived from the application by mining the data.

Figure 1.2. Three components to harnessing collective intelligence. 1: Allow users to interact. 2: Learn about your users in aggregate. 3: Personalize content using user interaction data and aggregate data.

Figure 1.3. Four pillars for user-centric applications

Figure 1.4. An example of a user-centric application: LinkedIn

Figure 1.5. Classifying user-generated information

Figure 1.6. This tag cloud shows popular tags at the site.

Figure 1.7. Screenshot from Digg showing news items with the number of diggs for each

Figure 1.8. Screenshot from Yahoo! Music recommending songs of interest

Chapter 2. Learning from user interactions

Figure 2.1. Synchronous and asynchronous learning services

Figure 2.2. Architecture for embedding and deriving intelligence in an event-driven system

Figure 2.3. Architecture for embedding intelligence in a non-event-driven system

Figure 2.4. A user interacts with items, which have associated metadata.

Figure 2.5. The three sources for generating metadata about an item

Figure 2.6. Attribute hierarchy of a user profile

Figure 2.7. Term vector representation of text

Figure 2.8. Typical steps involved in analyzing text

Figure 2.9. Two-dimensional vectors, v1 and v2

Figure 2.10. Screenshot from YouTube showing related videos for a video

Figure 2.11. Persistence of ratings in a table that stores each user’s ratings in a separate row

Figure 2.12. At Digg, users can vote on how they like an article: “digg it” is a positive vote, while “Bury” is a negative vote.

Figure 2.13. Screenshot from The Wall Street Journal that shows how a user can forward/email an article to another user

Figure 2.14. Saving an item to a list (NY

Figure 2.15. Composite pattern for organizing bookmarks together

Figure 2.16. A normal distribution with a mean of 0 and standard deviation of 1

Figure 2.17. Allowing users to place a positive or negative vote of confidence in a reviewer

Figure 2.18. The association between a reviewer, an item, and the review of an item

Figure 2.19. Schema design for persisting reviews

Figure 2.20. A user’s metadata vector is created using the metadata vector associated with all the items visited.

Chapter 3. Extracting intelligence from tags

Figure 3.1. Three ways to generate tags

Figure 3.2. Screenshot of how a user creates a tag

Figure 3.3. Amazon allows users to tag a product and see how others have tagged the same product.

Figure 3.4. An example tag cloud

Figure 3.5. Tag cloud of all-time most popular tags at Flickr

Figure 3.6. Combining term vectors from a number of documents to form a tag cloud

Figure 3.7. Using a tag, the context that it appears in, and user metadata to get relevant results from a search engine

Figure 3.8. The tags and tagging_source database tables

Figure 3.9. The MySQLicious schema with sample data

Figure 3.10. Scuttle representation with sample data

Figure 3.11. The normalized Toxi solution with sample data

Figure 3.12. The recommended persistence schema designed for scalability and performance

Figure 3.13. Nesting queries to get the set of tags used

Figure 3.14. Table to store the metadata associated with an item via tags

Figure 3.15. The addition of summary and days tables

Figure 3.16. Class design for implementing a tag cloud

Figure 3.17. The class diagram for FontSizeComputationStrategy

Figure 3.18. Using the Decorator pattern to generate HTML to represent the tag cloud

Figure 3.19. The tag cloud for our example

Chapter 4. Extracting intelligence from content

Figure 4.1. Architecture for integrating an internally hosted server running as a separate instance

Figure 4.2. Class model for representing a blog for a user

Figure 4.3. Persistence schema for blogs

Figure 4.4. Relationship between a page, a category, and a user in a wiki

Figure 4.5. Persistence model for a wiki

Figure 4.6. Modeling a message board or a group

Figure 4.7. The schema for the elements of a message board

Figure 4.8. Typical steps involved in analyzing text

Figure 4.9. The hierarchy of analyzers used to create metadata from text

Figure 4.10. The tag cloud for the title consists of four terms.

Figure 4.11. The tag cloud for the body of the text

Figure 4.12. The resulting tag cloud obtained by combining the title and the body

Figure 4.13. The tag cloud after removing the stop words

Figure 4.14. The tag cloud after normalizing the terms

Figure 4.15. Tag cloud for the title after using the bi-term analyzer

Figure 4.16. Tag cloud for the blog after using a bi-term analyzer

Figure 4.17. Adding item_type to the item table

Figure 4.18. Classifying content types

Chapter 5. Searching the blogosphere

Figure 5.1. Four steps in searching the blogosphere

Figure 5.2. The generic architecture for the blog searcher

Figure 5.3. The BlogQueryResult object

Figure 5.4. BlogSearchResponseHandler and XMLToken

Figure 5.5. Two implementations for BlogQueryResult

Figure 5.6. Base implementation for BlogSearcher

Figure 5.7. The base class for SAX parsing handlers

Figure 5.8. The interfaces and their implementing classes

Figure 5.9. The classes extending BlogQueryParameterImpl

Chapter 6. Intelligent web crawling

Figure 6.1. The basic process of web crawling

Figure 6.2. Submitting your site’s sitemap using Google Webmaster Tools

Figure 6.3. Number of relevant URLs retrieved as a function of number of URLs visited

Figure 6.4. The Cygwin window after the crawl command

Figure 6.5. The directory structure after the crawl

Figure 6.6. The stats associated with the crawldb

Figure 6.7. The search screen for the Nutch application

Figure 6.8. Searching for collective intelligence using the Nutch search application

Figure 6.9. MapReduce example for counting term frequencies

Chapter 7. Data mining: process, toolkits, and standards

Figure 7.1. A predictive model makes a prediction based on the values for the input attributes.

Figure 7.2. Two clusters in a two-dimensional attribute space found by analyzing the proximity of the data points

Figure 7.3. An example decision tree showing two attributes

Figure 7.4. A multi-layer perceptron where the input from one layer feeds into the next layer

Figure 7.5. The directory structure and some of the files for WEKA

Figure 7.6. WEKA documentation that’s available in the install

Figure 7.7. WEKA GUI with options to start one of four applications

Figure 7.8. The five attributes of the iris.arff dataset, with details about the sepallength attribute

Figure 7.9. Converting a continuous variable into a discrete variable using filters in WEKA

Figure 7.10. A dataset in WEKA is represented by instances.

Figure 7.11. Classifier uses instances to build the model and classifies an instance.

Figure 7.12. Classifier uses instances to build the model and classifies an instance.

Figure 7.13. Clusterer uses instances to build the model and associate an instance with the appropriate cluster.

Figure 7.14. Association-learning algorithms available in WEKA

Figure 7.15. Key JDM objects

Figure 7.16. Key JDM interfaces to describe the physical and logical aspects of the data

Figure 7.17. The model representation in JDM

Figure 7.18. The settings associated with the different kinds of algorithms

Figure 7.19. The interfaces associated with the various tasks supported by JDM

Figure 7.20. The interfaces associated with creating a Connection to the data-mining service

Chapter 8. Building a text analysis toolkit

Figure 8.1. Typical steps involved in analyzing text

Figure 8.2. Example of how the tools developed in this chapter can be leveraged in your application

Figure 8.3. Key classes in the Lucene analysis package

Figure 8.4. Some of the concrete implementations for Tokenizer and TokenFilter

Figure 8.5. The Analyzer class with some of its concrete implementations

Figure 8.6. The Analyzer class with some of its concrete implementations

Figure 8.7. The implementations for the PhrasesCache and SynonymsCache

Figure 8.8. The infrastructure for text analysis

Figure 8.9. Tag infrastructure–related classes

Figure 8.10. Term vector–related infrastructure

Figure 8.11. The TextAnalyzer and the InverseDocFreqEstimator

Figure 8.12. The tag cloud for the title, consisting of five tags

Figure 8.13. The tag cloud for the body, consisting of 15 tags

Figure 8.14. The tag cloud for the combined title and body, consisting of 15 tags

Figure 8.15. An example of automatically detecting relevant terms by analyzing text

Chapter 9. Discovering patterns with clustering

Figure 9.1. The various steps in our example of clustering blog entries

Figure 9.2. The interfaces associated with clustering text

Figure 9.3. The classes for implementing the hierarchical agglomerative clustering algorithm

Figure 9.4. The classes for implementing the hierarchical agglomerative clustering algorithm

Figure 9.5. A ClusteringModel consists of a set of clusters obtained by analyzing the data.

Figure 9.6. Some of the classes associated with clustering algorithm settings and clustering settings

Chapter 10. Making predictions

Figure 10.1. The first node in our decision tree

Figure 10.2. The second split in our decision tree

Figure 10.3. The final decision tree for our example

Figure 10.4. Splitting the dataset based on the value of the output attributes to compute the conditional probabilities

Figure 10.5. Belief network representation for our example

Figure 10.6. The simplified belief network when only A is known

Figure 10.7. The classes that we develop in this chapter

Figure 10.8. A multi-layer perceptron with one hidden layer. The weight Wxy is the weight from node x to node y. For example, W25 is the weight from node 2 to node 5.

Figure 10.9. A typical radial basis function

Figure 10.10. The model interfaces corresponding to supervised learning

Figure 10.11. Setting interfaces related to supervised learning

Figure 10.12. Algorithm-specific settings related to supervised learning algorithms

Chapter 11. Intelligent search

Figure 11.1. The entities involved with adding search to your application

Figure 11.2. The key Lucene classes for creating and searching an index

Figure 11.3. Non-compound and compound index files

Figure 11.4. A simple deployment architecture where each search instance has its own copy of a read-only index. An external service creates and updates the index, pushing the changes periodically to the servers.

Figure 11.5. Multiple search instances sharing the same index

Figure 11.6. The default implementation for the Similarity class

Figure 11.7. Query classes available in Lucene

Figure 11.8. Filters available in Lucene

Figure 11.9. HitCollector-related classes

Figure 11.10. Screenshot of Luke in the Documents tab

Figure 11.11. Screenshot of the Solr admin page

Figure 11.12. Screenshot of the home page for collective intelligence at Kosmix

Figure 11.13. Clustering search results using Carrot2 clustering

Figure 11.14. Screenshot of a personalized search engine on collective intelligence developed using Google Custom Search

Figure 11.15. Screenshot from NextBio showing the Gene TP53, along with inferences from analyzing the data

Chapter 12. Building a recommendation engine

Figure 12.1. An example of the output of a recommendation engine

Figure 12.2. The inputs and outputs of a recommendation engine

Figure 12.3. Item-based analysis: similar items are recommended

Figure 12.4. User-based analysis: items liked by similar users are recommended

Figure 12.5. WEKA classes related to instance-based learning and nearest-neighbor search

Figure 12.6. Illustration of the dimensionality reduction

Figure 12.7. Screenshot of recommendations to a user

Figure 12.8. To help the recommendation engine at Amazon, a user can rate an item and/or remove items from consideration.

Figure 12.9. Recommendations based on browsing history

Figure 12.10. Google News with recommended stories using the user’s web history

Figure 12.11. Personalized news stories for a logged-in user

Figure 12.12. Movies related to a movie being recommended at Netflix

Figure 12.13. Home page for a user at Netflix showing the user’s recommended movies

Figure 12.14. A screenshot of the Netflix leaderboard as of early 2008

