A PostgreSQL full text search

PostgreSQL provides a full text search capability, which is used to overcome SQL pattern matching operators, including LIKE and ILIKE, boosting the performance of the text search. For example, even though an index on text using the text_pattern_op class is supported, this index cannot be used to match a nonanchored text search. To explain this limitation, let's create the following table:

CREATE TABLE document(
  document_id serial primary key,
  document_body text
);

CREATE INDEX on document (document_body text_pattern_ops);

INSERT INTO document VALUES (default, 'Can not use text_pattern_op class to search for non-anchored text');

To test the index with anchored and nonanchored text search, let's disable sequential scan and generate execution plans as shown in the following example:

car_portal=# EXPLAIN SELECT * FROM document WHERE document_body like 'Can%text_pattern';
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=4.21..13.68 rows=1 width=36)
   Filter: (document_body ~~ 'Can%text_pattern'::text)
   ->  Bitmap Index Scan on document_document_body_idx  (cost=0.00..4.21 rows=6 width=0)
         Index Cond: ((document_body ~>=~ 'Can'::text) AND (document_body ~<~ 'Cao'::text))
(4 rows)

car_portal=# EXPLAIN SELECT * FROM document WHERE document_body like '%text_pattern';
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on document  (cost=10000000000.00..10000000025.38 rows=1 width=36)
   Filter: (document_body ~~ '%text_pattern'::text)
(2 rows)

Note that for nonanchored text, as in %text_pattern, the index is not used.

Another issue with the traditional LIKE and ILIKE operators is the ranking based on similarity and natural language linguistic support. The LIKE and ILIKE operators always evaluate a Boolean value: either as TRUE or as FALSE.

In addition to ranking nonanchored text search support, PostgreSQL full text search provides many other features. Full text search supports dictionaries, so it can supports language, such as synonyms.

The tsquery and tsvector data types

Full text search is based on the tsvector and tsquery data types; here, tsvector represents a document in a normalized state.

The tsvector data type

The tsvector data type is a sorted list of distinct lexemes. Duplicate elimination and sorting is done during input, as follows:

SELECT 'A wise man always has something to say, whereas a fool always needs to say something'::tsvector;
                                          tsvector
--------------------------------------------------------------------------------------------
 'A' 'a' 'always' 'fool' 'has' 'man' 'needs' 'say' 'say,' 'something' 'to' 'whereas' 'wise'
(1 row)

Casting a text to tsvector does not normalize the document completely due to the lack of linguistic rules. To normalize the preceding example, one can use the to_tsvector() function to normalize the text properly, as follows:

SELECT to_tsvector('english', 'A wise man always has something to say, whereas a fool always needs to say something');
                                      to_tsvector
---------------------------------------------------------------------------------------
 'alway':4,12 'fool':11 'man':3 'need':13 'say':8,15 'someth':6,16 'wherea':9 'wise':2
(1 row)

As shown in the preceding example, the to_tsvector function stripped some letters, such as s from alway, and also generated the integer position of lexemes, which can be used for proximity ranking.

The tsquery data type

The tsquery data type is used to search for certain lexemes. Lexemes can be combined with the & (AND), | (OR), and ! (NOT) operators. Note that the NOT operator has the highest precedence, followed by AND and then OR. Also, parentheses can be used to group lexemes and operators to force a certain order.

The following example shows how we can search for certain lexemes using tsquery, tsvector, and the match operator (@@):

car_portal=# SELECT 'A wise man always has something to say, whereas a fool always needs to say something'::tsvector @@ 'wise'::tsquery;
 ?column?
----------
 t
(1 row)

The tsquery also has the to_tsquery function to convert text to lexemes, as shown here:

car_portal=# SELECT to_tsquery('english', 'wise & man');
   to_tsquery
----------------
 'wise' & 'man'
(1 row)

Pattern matching

There are several factors affecting the result of pattern matching, including:

  • Text normalization
  • Dictionary
  • Ranking

If the text is not normalized, text search might not return the expected result. The following examples show how pattern matching can fail with unnormalized text:

car_portal=# SELECT 'elephants'::tsvector @@ 'elephant';
 ?column?
----------
 f
(1 row)

In the preceding query, casting elephants to tsvector and the implicit casting of elephant to the query does not generate normalized lexemes due to missing information about the dictionary.

To add dictionary information, to_tsvector and to_tsquery can be used as follows:

car_portal=# SELECT to_tsvector('english', 'elephants') @@ to_tsquery('english', 'elephant');
 ?column?
----------
 t
(1 row)

car_portal=# SELECT to_tsvector('simple', 'elephants') @@ to_tsquery('simple', 'elephant');
 ?column?
----------
 f
(1 row)

Full text search supports pattern matching based on ranks. The tsvector lexemes can be marked with the labels, A, B, C, and D; where D is the default and A has the highest rank. The setweight function can be used to assign weight to tsvector explicitly, as follows:

SELECT setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'b');
        ?column?
-------------------------
 'dolphin':2B 'eleph':1A
(1 row)

For ranking, there are two functions: ts_rank and ts_rank_cd. The ts_rank function is used for standard ranking, while ts_rank_cd is used for the cover density ranking technique. The following example shows the result of ts_rank_cd when used to search eleph and dolphin, respectively:

car_portal=# SELECT ts_rank_cd (setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B'), 'eleph' );
 ts_rank_cd
------------
          1
(1 row)

car_portal=# SELECT ts_rank_cd (setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B'), 'dolphin' );
 ts_rank_cd
------------
        0.4
(1 row)

Ranking is often used to enhance, filter out, and order the result of pattern matching. In real-life scenarios, different document sections can have different weights; for example, when searching for a movie, the highest weight could be given to the movie title and main character, and less weight could be given to the summary of the movie plot.

Full text search indexing

GiST can be used to index tsvector and tsquery, while GIN can be used to index tsvector only.

The GiST index is lossy and can return false matches, so PostgreSQL automatically rechecks the returned result and filtered out false matches. False matches can reduce performance due to the records' random access cost. The GIN index stores only the lexemes of tsvector and not the weight labels; due to this, the GIN index could be considered also lossy if weights are involved in the query.

The performance of the GIN and GiST indexes depends on the number of unique words, so it is recommended to use dictionaries to reduce the total number of unique words. The GIN index is faster to search and slower to build and requires more space than GiST. The maintenance_work_mem function can improve the GIN index's build time, but this does not work for the GiST index.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.103.127