PostgreSQL provides a full text search capability that overcomes the limitations of the SQL pattern matching operators LIKE and ILIKE, boosting the performance of text search. For example, even though an index on text using the text_pattern_ops operator class is supported, such an index cannot be used for a nonanchored text search. To demonstrate this limitation, let's create the following table:
CREATE TABLE document(
    document_id serial primary key,
    document_body text
);
CREATE INDEX on document (document_body text_pattern_ops);
INSERT INTO document VALUES (default, 'Can not use text_pattern_op class to search for non-anchored text');
To test the index with anchored and nonanchored text search, let's disable sequential scan and generate execution plans as shown in the following example:
car_portal=# EXPLAIN SELECT * FROM document WHERE document_body like 'Can%text_pattern';
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=4.21..13.68 rows=1 width=36)
   Filter: (document_body ~~ 'Can%text_pattern'::text)
   ->  Bitmap Index Scan on document_document_body_idx  (cost=0.00..4.21 rows=6 width=0)
         Index Cond: ((document_body ~>=~ 'Can'::text) AND (document_body ~<~ 'Cao'::text))
(4 rows)

car_portal=# EXPLAIN SELECT * FROM document WHERE document_body like '%text_pattern';
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on document  (cost=10000000000.00..10000000025.38 rows=1 width=36)
   Filter: (document_body ~~ '%text_pattern'::text)
(2 rows)
Note that for the nonanchored pattern %text_pattern, the index is not used.
Another issue with the traditional LIKE and ILIKE operators is the lack of ranking based on similarity and of natural language linguistic support. The LIKE and ILIKE operators always evaluate to a Boolean value: either TRUE or FALSE.
In addition to ranking and nonanchored text search, PostgreSQL full text search provides many other features. Full text search supports dictionaries, so it can handle language-specific features such as synonyms, stemming, and stop words.
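As a minimal illustration of the effect of dictionaries, the built-in simple configuration only folds case, while the english configuration applies stemming and removes stop words:

SELECT to_tsvector('simple', 'The books');
-- 'books':2 'the':1

SELECT to_tsvector('english', 'The books');
-- 'book':2

Here the english configuration dropped the stop word the and stemmed books to book.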
Full text search is based on the tsvector and tsquery data types; here, tsvector represents a document in a normalized state.
The tsvector data type is a sorted list of distinct lexemes. Duplicate elimination and sorting are done during input, as follows:
SELECT 'A wise man always has something to say, whereas a fool always needs to say something'::tsvector;
                                          tsvector
--------------------------------------------------------------------------------------------
 'A' 'a' 'always' 'fool' 'has' 'man' 'needs' 'say' 'say,' 'something' 'to' 'whereas' 'wise'
(1 row)
Casting text to tsvector does not normalize the document completely, due to the lack of linguistic rules. To normalize the preceding example properly, one can use the to_tsvector() function, as follows:
SELECT to_tsvector('english', 'A wise man always has something to say, whereas a fool always needs to say something');
                                      to_tsvector
---------------------------------------------------------------------------------------
 'alway':4,12 'fool':11 'man':3 'need':13 'say':8,15 'someth':6,16 'wherea':9 'wise':2
(1 row)
As shown in the preceding example, the to_tsvector function stemmed the words, turning always into alway, removed stop words such as a and to, and also generated the integer positions of the lexemes, which can be used for proximity ranking.
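Because lexeme positions are stored, adjacent lexemes can also be matched with the <-> (FOLLOWED BY) tsquery operator, available in PostgreSQL 9.6 and later; for example:

SELECT to_tsvector('english', 'A wise man always has something to say, whereas a fool always needs to say something')
       @@ to_tsquery('english', 'say <-> something');
-- returns t: 'say' (position 15) is immediately followed by 'someth' (position 16)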
The tsquery data type is used to search for certain lexemes. Lexemes can be combined with the & (AND), | (OR), and ! (NOT) operators. Note that the NOT operator has the highest precedence, followed by AND and then OR. Also, parentheses can be used to group lexemes and operators to force a certain order.
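As a quick sketch, the precedence rules can be observed by letting to_tsquery parse expressions with and without parentheses (the words here are purely illustrative):

SELECT to_tsquery('english', '!fool & wise');   -- NOT applies only to fool
SELECT to_tsquery('english', '!(fool & wise)'); -- parentheses negate the whole conjunction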
The following example shows how we can search for certain lexemes using tsquery, tsvector, and the match operator (@@):
car_portal=# SELECT 'A wise man always has something to say, whereas a fool always needs to say something'::tsvector @@ 'wise'::tsquery;
 ?column?
----------
 t
(1 row)
Text can also be converted to tsquery lexemes using the to_tsquery function, as shown here:
car_portal=# SELECT to_tsquery('english', 'wise & man');
   to_tsquery
----------------
 'wise' & 'man'
(1 row)
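For raw user input that may contain punctuation or no operators at all, the plainto_tsquery function can be used instead; it normalizes the words and combines them with the & operator:

car_portal=# SELECT plainto_tsquery('english', 'wise man');
 plainto_tsquery
-----------------
 'wise' & 'man'
(1 row)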
Several factors affect the result of pattern matching. If the text is not normalized, text search might not return the expected result. The following example shows how pattern matching can fail with unnormalized text:
car_portal=# SELECT 'elephants'::tsvector @@ 'elephant';
 ?column?
----------
 f
(1 row)
In the preceding query, casting elephants to tsvector and implicitly casting elephant to tsquery do not generate normalized lexemes, because no dictionary information is available.
To add dictionary information, to_tsvector and to_tsquery can be used as follows:
car_portal=# SELECT to_tsvector('english', 'elephants') @@ to_tsquery('english', 'elephant');
 ?column?
----------
 t
(1 row)

car_portal=# SELECT to_tsvector('simple', 'elephants') @@ to_tsquery('simple', 'elephant');
 ?column?
----------
 f
(1 row)
Full text search supports pattern matching based on ranks. The tsvector lexemes can be marked with the labels A, B, C, and D, where D is the default and A has the highest rank. The setweight function can be used to assign a weight to tsvector lexemes explicitly, as follows:
SELECT setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B');
        ?column?
-------------------------
 'dolphin':2B 'eleph':1A
(1 row)
For ranking, there are two functions: ts_rank and ts_rank_cd. The ts_rank function is used for standard ranking, while ts_rank_cd is used for the cover density ranking technique. The following example shows the result of ts_rank_cd when used to search for eleph and dolphin, respectively:
car_portal=# SELECT ts_rank_cd (setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B'), 'eleph');
 ts_rank_cd
------------
          1
(1 row)

car_portal=# SELECT ts_rank_cd (setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B'), 'dolphin');
 ts_rank_cd
------------
        0.4
(1 row)
Ranking is often used to filter and order the results of pattern matching. In real-life scenarios, different document sections can have different weights; for example, when searching for a movie, the highest weight could be given to the movie title and main characters, and less weight could be given to the summary of the movie plot.
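The movie scenario can be sketched as follows; the titles and summaries are made-up sample data:

SELECT m.title,
       ts_rank(setweight(to_tsvector('english', m.title),   'A') ||
               setweight(to_tsvector('english', m.summary), 'D'),
               to_tsquery('english', 'elephant')) AS rank
FROM (VALUES ('Elephant tales', 'A film about friendship'),
             ('Jungle story',   'An elephant helps a lost child')) AS m(title, summary)
ORDER BY rank DESC;

The row matching in the title (weight A) ranks higher than the row matching only in the summary (weight D).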
GiST can be used to index tsvector and tsquery, while GIN can be used to index tsvector only.
The GiST index is lossy and can return false matches, so PostgreSQL automatically rechecks the returned results and filters out the false matches. False matches can reduce performance due to the random access cost of fetching the records. The GIN index stores only the lexemes of tsvector and not the weight labels; because of this, the GIN index can also be considered lossy if weights are involved in the query.
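Using the document table created at the beginning of this section, a GIN index can be built on a tsvector expression; note that a query must use the same expression in order to benefit from the index:

CREATE INDEX ON document USING GIN (to_tsvector('english', document_body));

SELECT document_id FROM document
WHERE to_tsvector('english', document_body) @@ to_tsquery('english', 'search');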
The performance of the GIN and GiST indexes depends on the number of unique words, so it is recommended to use dictionaries to reduce the total number of unique words. The GIN index is faster to search but slower to build and requires more space than GiST. Increasing the maintenance_work_mem setting can improve the GIN index's build time, but this does not work for the GiST index.
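For example, the setting can be raised for the current session before building a GIN index (the value shown is arbitrary and should be tuned to the available memory):

SET maintenance_work_mem = '1GB';
CREATE INDEX ON document USING GIN (to_tsvector('english', document_body));
RESET maintenance_work_mem;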