Text Extender

DB2 Text Extender was designed to perform advanced text search faster and more efficiently than regular SQL. Databases are more often storing large text documents. Traditional data types can be difficult to search when exact information is not known. SQL is unaware of linguistic properties in determining a search match. DB2 Text Extender can interpret text fields and formatted documents. DB2 Text Extender can identify language, sentence, and paragraph structure. It is a useful addition to DB2 for providing assistance in searching text data.

Traditional Text Searching

Traditional data types store a single property of an entity into a single database field. Tables are used to group these properties together. For example, a table may be created to store information about a product. The product code and name may be stored using the CHAR or VARCHAR data type. The price may be stored using the DECIMAL data type. When you know the product code or the exact name of the product, you can easily look up or change the price of the item using the equal operator, “=”. The equal operator is case sensitive and only finds exact matches. If the exact product name is not known, depending on how many products are sold, it can be difficult to find the price or other information about the product.

The product code or exact name might not always be known. For example, an application could be created to look up the product information based on a search term. This task is much more difficult than it appears. The user might misspell the product name, only type a portion of the full product name or use a different word to describe the product than what appears in the product name. Users probably won't find what they are looking for using traditional SQL data types.

Another method of searching text is to use a LIKE predicate in an SQL statement. The LIKE predicate uses pattern-matching techniques to eliminate unwanted rows from the search result.

There are several limitations to using the LIKE predicate. For example, the pattern must match exactly. If the search pattern is too general, it might return unwanted results. It does not handle letter case differences, spelling mistakes, or linguistic substitutions. The LIKE predicate is also relatively slow because the text pattern matching is started only after the query is executed.

The performance problem becomes more apparent when searching large documents. DB2 will store large documents in a single field using the LOB data type. Each database field is an indivisible unit. Applications wanting to search or manipulate this data would often have to retrieve the entire document to examine just a small portion of the data. The LIKE predicate performance degrades quickly as the amount of text data required to be searched during query execution increases. This is very inefficient.

Document formatting can interfere with using the LIKE predicate. Many popular document formats such as word processing documents, use binary formatting codes. These documents can only be stored in a database using the BLOB data type. The LIKE predicate cannot be used on BLOB data. Other text-based formatted documents, such as HTML or XML, may also create problems for the LIKE predicate. The LIKE predicate is unable to distinguish between the document data and the formatting tags. A LIKE pattern containing groups of words may not be found if a markup tag occurs between them. DB2 Extenders provide solutions to many of these problems.

Text Extender Searching

DB2 Text Extender addresses the search limitations and performance issues faced by traditional text searching methods. It enables advanced filtering techniques such as linguistic processing and fuzzy logic. It improves performance by preprocessing text documents and storing the important document information in indexes. This reduces the cost of the search operation during query execution. DB2 Text Extender is familiar with several popular document formats and can be taught how to recognize other formats. This allows it to separate the document data from the document formatting. It is not required to store the document inside the database. The text documents can either be stored in a database or on a local file system. Documents stored on file systems can be referenced using the full path name or DB2 Datalinks.

Text Extender indexes store significant terms from text data and documents. Information about the documents is collected before the execution of the query. Unlike DB2 indexes, these indexes are not stored inside the database. Instead, they are stored as files on a local file system. An inverted index technique is used to list documents where these significant terms are found. Indexing can be done asynchronously in the background while performing other tasks. When new data is added, the indexes can be updated incrementally on demand or by using automated scheduling and changed document count thresholds.

There are four principle steps in index creation. They include document analysis, lexical analysis, linguistic processing, and index creation. During document analysis, Text Extender performs text structure recognition, deformatting, and image suppression. Lexical analysis includes tokenization and stopword filtering. Linguistic processing is an optional stage where Text Extender uses techniques such as word stemming, morphology, normalization, and term extraction. The final stage is index creation. In this stage, term and document statistics are collected and used to generate index files. The data types that can be indexed by Text Extender include CHAR, VARCHAR, LONG VARCHAR, CLOB, BLOB, and DBCLOB.

A similar process is used when performing a search. First the statement syntax and parameters of the UDF are verified. If everything is correct, the same processing and filtering techniques used on the documents during index creation is performed on the search criteria. A list of significant terms are generated and expanded to include other words with similar meaning. Each term in the list is compared to the list of terms stored in the Text Extender Index. Any match in the index will reference a list of documents where that term was found. The rows containing these documents are returned in the SQL statement result set.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.116.69