CHAPTER 6

image

Full-Text Indexing

SQL Server supports mechanisms that allow you to store large amounts of unstructured text information. In older versions of SQL Server, you had the TEXT data type. With the more modern versions of SQL Server, you can use MAX values with the variable-length character data types VARCHAR and NVARCHAR. This means you can store up to 2GB worth of textual information within a single column. Since a standard index can hold only up to 900 bytes of information, there has to be another mechanism for searching within these large data types. That’s where full-text indexing comes into play.

Full-Text Indexing

Full-text search indexing is another indexing feature in SQL Server, outside the normal indexing methods and objects. This chapter will briefly describe the full-text search architecture, storage, and indexing for optimal performance.

Full-text search (FTS) allows you to store large amounts of text-based content. This content can include a number of document types including formats such as Word document (.doc) files. This storage is then in BLOB columns instead of plain-text data. The ability to search and store content of an unstructured nature provides a number of opportunities in a database management system.

Document retention is one such opportunity; it allows you to store documents for vast lengths of time at a much lower cost. The search abilities allow for querying this content for all types of needs. Imagine a shipping company that creates thousands of shipping documents from a template created in plain text. Those documents create a massive initiative for retention purposes to ensure shipments can be tracked for later needs. Storage warehouse rooms cost money to maintain. When the task of researching a specific shipment arises, the hours taken for that task are significant.

Now imagine this shipping company is using the FTS feature and an indexing structure. The documents are scanned with systems that read the text into a system that later inserts this data into a SQL Server database. This allows for a full-text search of specific account numbers, shipping invoices, and any distinct text in the documents needed for later review. An index just like a book index can be created, making it even quicker to find specific documents. Going further, FTS lets you search for specific content in the documents themselves. If a request comes in to find all shipping documents that were sent by a specific freight company on a specific trailer, the FTS capabilities allow the information to be retrieved in a fraction of the time as compared to a manual process.

Creating a Full-Text Example

Now that you understand the concept of FTS, let’s look at the indexing strategy. Full-text indexes are essentially the backbone of searching and querying the data. This data can be a number of data types including char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max). Since some of these data types are on the deprecation list, it’s probably best to avoid their use. These include text, ntext, and image. They’re all easily replaced by the other data types. In fact, utilizing varchar(max) on most 64-bit systems has outperformed other data types in the latest tests by the CAT team, as described in “Best Practices for Integrated Full Text Search (iFTS) in SQL 2008” (http://bit.ly/1L9fBn2). This is, in part, because of data types such as char using a direct parsing mechanism and the others using specialized processing. 64-bit systems and FTS tend to outperform 32-bit systems because of the memory-intensive use of the FTS process. When 32-bit systems are in use, the memory consumption should be configured and monitored carefully. This is partly why FTS on larger installations is utilized on designated hardware and databases. This allows the database administrator to maintain a system for FTS specifically.

For the remainder of this section on full-text search indexing, the contents of the white paper “Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator” by Joseph Sack have been inserted into the sample table using the script in Listing 6-1. I’ll use the document to demonstrate full-text indexing. You can find the document in Books Online or download it from http://bit.ly/1II5UfU. You need to make sure that you have the Full Text Catalog feature installed in order to successfully run all the commands in this chapter. Full-text is not a feature that is installed by default, so you may need to address that first.

Using AdventureWorks, a table can be prepared that will be used for full-text searching. Using the varbinary(max) data type allows the import of most document types and images. In Listing 6-1, the CREATE TABLE and INSERT statements prepare the objects needed to create a full-text search index.

Creating a Full-Text Catalog

When creating an FTS index, a full-text catalog must first be created. Prior to SQL Server 2008, this catalog was a physical object and was optimized by designating a specific file group for the catalog. In versions of SQL Server after 2008, the catalog is now contained in the database as a definition. The catalog itself is now a virtual object and greatly enhances the performance by eliminating I/O bottlenecks. A catalog contains all the properties that are searchable.

The catalog is the link to the full-text index. To create a new full-text catalog, use the CREATE FULLTEXT CATALOG syntax shown in Listing 6-2.

The first option that should be considered in the creation of a catalog is the AS DEFAULT setting. Commonly, full-text indexes are created without thought of the catalog to which they should be applied. If the catalog is omitted in the index creation, the catalog that has been set as the default will be used.

Authorization and accent sensitivity are specific in the CREATE command. When omitting the authorization option, ownership will fall under dbo. This is the same for most objects in SQL Server when ownership is not declared. It is recommended that you assign ownership for managing security and grouping objects under the proper areas. When specifying a user for ownership, you must specify a user name matching one of the following:

  • The name of the user running the statement
  • The name of a user that the user executing the command has impersonate permissions for
  • The database owner or system administrator

Accent sensitivity dictates whether the catalog will be accent sensitive or insensitive. Be sure to research whether accent sensitivity should be on or off prior to the creation of the catalog. If this option is changed, the full-text indexes on the catalog must be rebuilt.

Execute the statement in Listing 6-3 to create a catalog as the default to be used with the white paper inserted into the table.

Creating a Full-Text Index

With the catalog created and the decision made for how you want to handle catalogs, accent sensitivity, and ownership, now you need to make some decisions and apply some restrictions to the creation of the full-text index. The most critical of these decisions is the requirement of a key index.

Syntax

Create the full-text index using the syntax in Listing 6-4. Table 6-1 describes the different options available.

Table 6-1. Full-Text Index Options

Option Name

Description

TYPE COLUMN

Specifies the name of the column that holds the document type for documents loaded in BLOB types, such as .doc, .pdf, and .xls. This option is used only for varbinary, varbinary(max), and image data types. If this option is specified on any other data type, the CREATE FULLTEXT INDEX statement will throw an error.

LANGUAGE

Alters the default language that is used for the index with the following variations and options:

  • The language can be specified as string, integer, or hexadecimal.
  • If a language is specified, the language is used when a query is run using the index.
  • When a language is specified as a string value, the syslanguages system table must correspond to the language.
  • If a double-byte value is used, it is converted to hexadecimal at creation time.
  • Word breakers and stemmers for the specific language must be enabled or a SQL Server error will be generated.
  • Non-BLOB and non-XML columns containing multiple languages should follow the 0x0 neutral language setting.
  • For BLOB and XML types, language types in the documents themselves will be used. For example, a Word document with a language type of Russian or LCID 1049 will force the same setting in the index. Use sys.fulltext_languages to review all the language types and LCID codings available.

KEY INDEX

Every full-text index requires an adjoining unique, single-key, non-null column to be designated. Specify the column in the same table using this option.

FULLTEXT_CATALOG_NAME

If the full-text index is not to be created using the default catalog, specify the catalog name using this option.

CHANGE_TRACKING

Determines how and when an index is populated. Options are MANUAL, AUTO, and OFF [NO_POPULATION]. The MANUAL setting requires ALTER FULLTEXT INDEX ... START UPDATE POPULATION to be executed before the index is populated. The AUTO setting populates the index at creation time and automatically updates based on changes that are made ongoing. This is the default setting if CHANGE_TRACKING is omitted in the CREATE statement. The OFF [NO_POPULATION] setting is used to completely turn population off for the index, and SQL Server will not retain a list of changes. The index is populated upon creation one time unless the NO_POPULATION is specified.

STOPLIST

Specifies a StopList that will essentially stop certain words from being indexed. OFF, SYSTEM, and a custom stop list are available options. The OFF setting will not use a stop list and will have more overhead on performance of population of the index. The SYSTEM is the default stop list created already. A user-created StopList is a StopList that was created that can be used in association with an index.

In most other CREATE INDEX statements, the basic syntax and options are alike with slight modifications. With FTS index creation, you can see there is a completely different set of options and considerations. The initial CREATE FULLTEXT INDEX is the same as any CREATE INDEX, with the given table required and then column to index. After this, the other options are not typical to normal index creations.

Key Indexes

Choosing the key index can be a straightforward choice given the restrictions of the key index being a unique, single-key, and non-nullable column. A primary key will commonly work well for this, like the primary key shown in Listing 6-1 on the SQLServerDocuments table. However, thought should be given to the size of the key. Ideally, a 6-byte key is recommended and documented as optimal to reduce overhead on I/O and CPU resource consumption. Recall that one of the restrictions of the unique key is that it cannot exceed 900 bytes. If this maximum restriction is met, the population will fail. Resolving the problem could force a new index and alteration of the table itself to occur. This could create costly downtime for tables that are in high-use situations.

Population

Change tracking in full-text indexing should be weighed heavily when creating full-text indexes. The default setting of AUTO may have overhead that can affect the performance negatively if the contents of the column being indexed change frequently. For example, a system that is storing shipping invoices that never change and are inserted only once a month would not likely benefit from AUTO being set. A MANUAL population would most likely be better run at a given time by using the SQL Server Agent based on the loading of the contents in the table. Although not common, some systems are static and loaded only one time. This would be an ideal situation for using the OFF setting, with the initial population being performed only at that time.

The last option for population is incremental population. It is an alternative to manual population. Incremental population is the same concept as an incremental update to data. As you run through the data and changes are made, they are tracked. Think of merge replication as a comparison. Merge replication retains changes by the use of triggers and insert/update/delete tracking rows into merge system tables. At a given point in time, a DBA can set a synchronization schedule to process those changes and replicate them to the subscribers. This is the same way incremental population functions. By using a timestamp column in the table, the changes are tracked. Only those that are found needing a change are processed. This does mean the requirement for a timestamp column on the table must be met in order to perform incremental populations. For data that has an extreme amount of change, this may not be ideal. However, for data that changes randomly and seldomly, incremental population may be suited for the installation.

StopLists

StopLists are extremely useful in managing what not to populate. This can improve the population performance by bypassing what are known as noise words. As an example, consider the sentence “A dog chewed through the fiber going to the SAN causing the disaster and recovery plans to be used for the SQL Server instance.” In this sentence, you would most likely want fiber, SAN, disaster, recovery, and SQL, or Server indexed. The A, the, to and be words would not be ideal. These are considered noise words and are not part of the population process. As you can imagine, the use of StopList can be extremely helpful in the overall population performance and parsing of the content. Use of the StopList can be specific to languages as well. For example, la in French would be specified over the in English.

To create a custom StopList, use the CREATE FULLTEXT STOPLIST statement as shown in Listing 6-5. The system default StopList can be used to pregenerate all the noise words already identified as such. For the white paper example, the name of the StopList would be WhitePaperStopList.

To view the StopList, use the system views sys.fulltext_stoplists and sys.fulltext_stopwords. The sys.fulltext_stoplists view will hold metadata related to the stoplists that are created on the SQL Server instance. Determine the stoplist_id to join to sys.fulltext_stopwords to show a complete listing of the words. Alone, this StopList is no better than the system default StopList. To add words to the StopList, use the ALTER FULLTEXT STOPLIST statement in Listing 6-6’s example. That example removes Downtime as a word to be excluded.

To review the StopList words, run the query shown in Listing 6-7.

You can see the query results in Figure 6-1; the word Downtime has been successfully added.

9781484211199_Fig06-01.jpg

Figure 6-1. Query results of a StopList

With the catalog, StopList, and key index availability within the primary key SQLServerDocumentsID in the table created in Listing 6-1, you can create a full-text index on the DOC column in the same table from Listing 6-1. To create a full-text index, use the CREATE FULLTEXT INDEX statement (see Listing 6-8).

Once the index is created, population will begin since there was no option added for CHANGE_TRACKING. Later in the chapter I show how to monitor the catalog and see the status. It might take a while to load depending on the size of your document. The default AUTO setting takes effect. To query the content of the SQLServerDocuments table and Doc column, you can run a CONTAINS statement to return a specific word. Listing 6-9 shows an example of such a statement.

Figure 6-2 shows the execution plan from the query.

9781484211199_Fig06-02.jpg

Figure 6-2. Execution plan of CONTAINS and FTS Index usage

By searching by means of CONTAINS(Contents,'replication'), the execution plan in Figure 6-2 shows the operation on FulltextMatch. It also returns the white paper with a document type of .docx as a match for this word search.

Full-Text Search Index Catalog Views and Properties

SQL Server provides a wealth of information about indexes in general. Performance, configurations, usage, and storage are just a few. As with normal index objects, full-text indexes require the same attention and detail to maintenance and options set to ensure they consistently benefit the overall performance rather than hinder it.

Table 6-2 describes the catalog views available to full-text search.

Table 6-2. Full-Text Catalog Views

Catalog View Name

Description

sys.fulltext_catalogs

Lists all full-text catalogs and high-level properties.

sys.fulltext_document_types

Returns a complete list of document types that are available for indexing. Each of these document types will be registered on the instance of SQL Server.

sys.fulltext_index_columns

Lists all columns that are indexed.

sys.fulltext_index_fragments

Lists all details of the full-text index fragments (storage of the inverted index data).

sys.fulltext_indexes

Lists every full-text index and properties set on the indexes.

sys.fulltext_languages

Lists all the available languages on the instance to full-text indexing.

sys.fulltext_stoplists

Lists every StopList created.

sys.fulltext_stopwords

Lists all StopWords in the database.

sys.fulltext_system_stopwords

Lists the preloaded system StopWords.

For informational purposes, while reviewing catalogs, properties, and status results for population, invoke the FULLTEXTCATALOGPROPERTY function, as shown in Listing 6-10.

The returned information will provide a wealth of detail on the state of the catalog including population status. The catalog_name parameter will take any catalog created, and then a listing of properties can be utilized to return the specific information required. Table 6-3 lists the properties you can pass.

Table 6-3. Full-Text Catalog Properties

Property Name

Description

AccentSensitivity

Catalog’s current accent sensitivity setting. This returns 0, which means insensitive, and 1, which means sensitive.

IndexSize

Logical size in megabytes of the catalog.

ItemCount

The total items that have been indexed in the catalog.

LogSize

Backward capability property. This returns 0.

MergeStatus

Returns 0 if no master merge is in progress and 1 if a master merge is in progress.

PopulateCompletionAge

The elapsed time since index population, in seconds, measured since 01/01/1990 00:00:00. This will always return 0 if the population has not run yet.

PopulateStatus

PopulateStatus can return nine different values.

0 = Idle.

1 = Full population in progress.

2 = Paused.

3 = The population has been throttled.

4 = The population is recovering.

5 = The status is shut down.

6 = Incremental population is currently in progress.

7 = The status is currently building an index.

8 = The disk is full.

9 = Change tracking.

UniqueKeyCount

Number of individual full-text index keys in the catalog.

ImportStatus

Returns 0 when the full-text catalog is not being imported and 1 when it is being imported.

For example, to show the population status of the WhitePaperCatalog catalog used earlier, you can use the statement in Listing 6-11.

Catalogs and the referencing indexes can be reviewed by executing sys.fulltext_index_catalog_usages. This catalog view returns all the indexes that have been referenced from it, as shown in Listing 6-12.

For detailed information on all catalogs and settings currently applied to them, query sys.fulltext_catalogs. This catalog view is helpful in determining the default catalog and property status indicators, such as is_importing that shows whether the catalog is in the process of being imported.

For a detailed review of the full-text indexes in the database, you can use sys.fulltext_indexes along with joining catalog views to create a more meaningful resultset. Important information from this catalog view consists of the full-text catalog name and properties; change tracking property, crawl type, and state; and the StopList set to be used.

The query in Listing 6-13 returns an information resultset of all indexes including catalog and StopList information for the index.

Summary

This chapter outlined how to create and query from a full-text index. The need to be able to filter and query on large documents and free-form text is just as important as being able to use standard structured indexes. With full-text indexing, you can use examine not only the contents of a column but also the contents of a file within a column, allowing applications to much better identify documents and other artifacts that match contextually with the requests being submitted.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.93.222