4.3. Designing the Database

Five database tables are used for the project. The first two store the search engine's configuration options.

+--------------+--------------+------+
| Field        | Type         | Null |
+--------------+--------------+------+
| DOCUMENT_URL | varchar(255) | NO   |
+--------------+--------------+------+

+--------------+--------------+------+
| Field        | Type         | Null |
+--------------+--------------+------+
| TERM_VALUE   | varchar(255) | NO   |
+--------------+--------------+------+

The WROX_SEARCH_CRAWL table collects the addresses for the indexer to retrieve and include in the index. Essentially, this is a list of all files a user may search through using the search engine. WROX_SEARCH_STOP_WORD stores a list of stop words that should not be included in the index. Typically these are articles, pronouns, prepositions, and other words that have little value as search terms.

It would be nice if MySQL exposed its internal list of stop words using a special SHOW or SELECT query because then an INSERT-SELECT statement could pre-populate the table with data. You could then use this list as a starting point to add or filter various words as needed. However, this isn't the case so you must develop your own list of stop words.

The next three tables construct the inverted index to be fed by the indexer and searched by the front end of the search engine. An inverted index is a common data structure used in search algorithms to optimize the speed of the query. The trade-off between this and other indexing schemes is that populating and maintaining the index is slower but searching is faster. The user doesn't have to wait forever for the results of a simple query.

+----------------+------------------+------+-----+---------+----------------+
| Field          | Type             | Null | Key | Default | Extra          |
+----------------+------------------+------+-----+---------+----------------+
| DOCUMENT_ID    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| DOCUMENT_URL   | varchar(255)     | NO   | UNI |         |                |
| DOCUMENT_TITLE | varchar(255)     | YES  |     | NULL    |                |
| DESCRIPTION    | varchar(255)     | YES  |     | NULL    |                |
+----------------+------------------+------+-----+---------+----------------+

+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| TERM_ID    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| TERM_VALUE | varchar(255)     | NO   | UNI |         |                |
+------------+------------------+------+-----+---------+----------------+

+-------------+------------------+------+-----+
| Field       | Type             | Null | Key |
+-------------+------------------+------+-----+
| TERM_ID     | int(10) unsigned | NO   | MUL |
| DOCUMENT_ID | int(10) unsigned | NO   | PRI |
| OFFSET      | int(10) unsigned | NO   | PRI |
+-------------+------------------+------+-----+

The WROX_SEARCH_DOCUMENT table stores the address of each document indexed in the search engine and associates numeric keys to them. Also stored are the document's title and description, which will be displayed in the search results returned to the user. The words in each document are collected into WROX_SEARCH_TERM and are also assigned numeric keys. The terms are linked to the documents by the WROX_SEARCH_INDEX table, which references the keys to track which term is found where in each document.

Here is the complete SQL code for the five tables:

CREATE TABLE WROX_SEARCH_CRAWL (
    DOCUMENT_URL  VARCHAR(255)  NOT NULL
)
ENGINE=InnoDB DEFAULT CHARACTER SET latin1
    COLLATE latin1_general_cs;

CREATE TABLE WROX_SEARCH_STOP_WORD (
    TERM_VALUE  VARCHAR(255)  NOT NULL
)
ENGINE=InnoDB DEFAULT CHARACTER SET latin1
    COLLATE latin1_general_cs;

CREATE TABLE WROX_SEARCH_DOCUMENT (

DOCUMENT_ID     INTEGER UNSIGNED  NOT NULL  AUTO_INCREMENT,
    DOCUMENT_URL    VARCHAR(255)      NOT NULL,
    DOCUMENT_TITLE  VARCHAR(255),
    DESCRIPTION     VARCHAR(255),

    PRIMARY KEY (DOCUMENT_ID),

    CONSTRAINT UNIQUE (DOCUMENT_URL)
)
ENGINE=InnoDB DEFAULT CHARACTER SET latin1
    COLLATE latin1_general_cs AUTO_INCREMENT=0;

CREATE TABLE WROX_SEARCH_TERM (
    TERM_ID    INTEGER UNSIGNED  NOT NULL  AUTO_INCREMENT,
    TERM_VALUE VARCHAR(255)      NOT NULL,

    PRIMARY KEY (TERM_ID),

    CONSTRAINT UNIQUE (TERM_VALUE)
)
ENGINE=InnoDB DEFAULT CHARACTER SET latin1
    COLLATE latin1_general_cs AUTO_INCREMENT=0;

CREATE TABLE WROX_SEARCH_INDEX (
    TERM_ID       INTEGER UNSIGNED  NOT NULL,
    DOCUMENT_ID   INTEGER UNSIGNED  NOT NULL,
    OFFSET        INTEGER UNSIGNED  NOT NULL,

    PRIMARY KEY (DOCUMENT_ID, OFFSET),

    FOREIGN KEY (TERM_ID)
        REFERENCES WROX_SEARCH_TERM(TERM_ID),

    FOREIGN KEY (DOCUMENT_ID)
        REFERENCES WROX_SEARCH_DOCUMENT(DOCUMENT_ID)
)
ENGINE=InnoDB DEFAULT CHARACTER SET latin1
    COLLATE latin1_general_cs;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.144.69