Handling LIKE queries

LIKE is a widely used component of the SQL language that allows the user to perform fuzzy searches. Sometimes, you have an idea about what you are looking for but you are not exactly sure what it is. In these situations, a wildcard search can come in handy and improve user experience dramatically. Just think of the following example: you are reading a fax that has just dropped in. On a poor-quality fax, can you really distinguish a B from an 8 all the time? I guess not. LIKE will fix that by allowing you to put placeholders in a query.

To demonstrate the power of LIKE and the problems arising along with it, I have compiled a small example. The goal is to look for names of cities and villages. Here is the required data:

test=# CREATE TABLE t_location (name text);
CREATE TABLE
test=# COPY t_location FROM PROGRAM 
    'curl www.cybertec.at/secret/orte.txt';
COPY 2354

The CREATE TABLE part is pretty straightforward and easy to understand. To feed data to the system, you can rely on COPY. In this example, sample data can be found on the author's website. The data is read efficiently using a pipe (COPY PROGRAM) fed by a command-line program called curl.

Note

The curl is a command-line tool used to transfer data from or to a server using one of the supported protocols (HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, DICT, TELNET, LDAP or FILE). The command is designed to work without user interaction.

The file on the server has only one column, so this works out pretty neatly. If COPY is successful, 2354 rows should be in your table.

Simple LIKE queries

Let's focus our attention on a simple query now. All rows starting with Wiener should be retrieved:

test=# SELECT * FROM t_location 
    WHERE name LIKE 'Wiener%';
      name       
-----------------
 Wiener Neustadt
 Wiener Neudorf
 Wienerwald
(3 rows)

The query works like a charm. However, the execution plan of the query uncovers a real performance problem:

test=# explain SELECT * FROM t_location 
    WHERE name LIKE 'Wiener%';
                  QUERY PLAN                         
---------------------------------------------------
 Seq Scan on t_location  (cost=0.00..43.42 
  rows=1 width=13)
   Filter: (name ~~ 'Wiener%'::text)
 Planning time: 0.078 ms
(3 rows)

The table will be read sequentially. To fix this problem, a special index is needed:

test=# CREATE INDEX idx_name 
    ON t_location (name text_pattern_ops);
CREATE INDEX

You cannot just deploy a normal index on the columns. To make LIKE work, a special operator class is needed in most cases. For text, this operator class is called text_pattern_ops (for varchar, it would be varchar_pattern_ops).

What is an operator class? It is basically a strategy used by the index. Just imagine a simple B-tree. Small values will go to the left edge, and large values to the right edge of the tree. In simple terms, an operator class will know what is considered to be small and what is considered to be a large value. Operator classes are a way to teach your index how to behave.

In this example, the special operator class for the index is necessary in order to enable an index scan in the first place. Here is how it works:

test=# explain SELECT * FROM t_location 
    WHERE name LIKE 'Wiener%';
                     QUERY PLAN                                    
-------------------------------------------------------
 Index Only Scan using idx_name on t_location  
  (cost=0.28..8.30 rows=1 width=13)
   Index Cond: ((name ~>=~ 'Wiener'::text) 
    AND (name ~<~ 'Wienes'::text))
   Filter: (name ~~ 'Wiener%'::text)
 Planning time: 0.197 ms
(4 rows)

Just relax for a second and take a look at the way PostgreSQL uses the index. Basically, it says that wiener% means larger than wiener and smaller than wienes. In other words, PostgreSQL uses a clever rewrite to solve the problem.

Note that the percent sign has to be at the end of the query string. Given the indexing strategy, this sign is not allowed somewhere in the middle or at the beginning.

Keep in mind that those small numbers can vary. In the case of small tables, there is quite some noise in your performance data. So if one of your queries takes a millisecond more or less, don't panic! This is normal.

More advanced LIKE queries

In many cases, it is really necessary to have a percent sign on both sides of the search string:

test=# explain SELECT * FROM t_location 
    WHERE name LIKE '%Wiener%';
                   QUERY PLAN                         
-------------------------------------------------------
 Seq Scan on t_location  (cost=0.00..43.42 rows=1 width=13)
   Filter: (name ~~ '%Wiener%'::text)
 Planning time: 0.120 ms
(3 rows)

As you can see, we end up with a sequential scan again. Clearly, a sequential scan is not an option on large tables.

To fix this problem, you can turn to PostgreSQL extensions. Extensions are a nice feature of PostgreSQL that allow users to easily integrate additional functionality with the server. To solve the problem of indexing LIKE, an extension called pg_trgm is needed. This special extension provides access to so-called Trigrams, a concept especially useful when it comes to fuzzy searching:

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION

CREATE EXTENSION enables the extension so that an index capable of handling trigrams can be created:

test=# CREATE INDEX idx_special ON t_location 
    USING gist(name gist_trgm_ops);
CREATE INDEX

This time, a GiST index has to be created. The gist_trgm_ops can serve as the operator class to handle our little problem.

Don't worry about the precise inner working of this index type. It will nicely solve the problem of fuzzy searching, as shown in this example:

test=# explain SELECT * FROM t_location 
    WHERE name LIKE '%Wiener%';
                  QUERY PLAN                                   
-------------------------------------------------------
 Index Scan using idx_special on t_location  
  (cost=0.14..8.16 rows=1 width=13)
   Index Cond: (name ~~ '%Wiener%'::text)
 Planning time: 0.232 ms
(3 rows)

The index will be able serve our request.

Note

Keep in mind that trigram-based indexes are a highly important feature, as many applications tend to suffer from indexing problems related to fuzzy searches. GiST and trigrams can really help to stop these problems from causing major performance bottlenecks.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.170.63