Cross column correlation can cause a wrong estimation of the number of rows as PostgreSQL assumes that each column is independent of other columns. In reality, there are a lot of examples where this is not true. For example, one could find patterns where the first and last names in certain cultures are correlated. Another example is the country
and language
preference of the users. To understand cross column correlation, let's create a table called users
, as follows:
CREATE TABLE users ( id serial primary key, name text, country text, language text ); INSERT INTO users(name, country, language) SELECT generate_random_text(8), 'Germany', 'German' FROM generate_series(1, 10); INSERT INTO users(name, country, language) SELECT generate_random_text(8), 'USA', 'English' FROM generate_series(1, 10); VACUUM ANALYZE users;
If one wants to get users whose language is German
and country is Germany
, he/she will end up with a wrong estimation of rows as both columns are correlated, as follows:
EXPLAIN SELECT * FROM users WHERE country = 'Germany' and language ='German'; Seq Scan on users (cost=0.00..1.30 rows=5 width=26) Filter: ((country = 'Germany'::text) AND (language = 'German'::text))
Note that the number of estimated rows is five, which is calculated as follows:
Estimated number of rows = Total_number_of_rows * selectivity of country * selectivity of language
Estimated number of rows = 20 *.5 * .5 = 5
A simple solution to correct the number of rows is to change the physical design of the table and combine both fields in one, as follows:
CREATE TABLE user_2 ( id serial, name text, user_info jsonb ); INSERT INTO user_2(name, user_info) SELECT generate_random_text(8), '{"country":"Germany", "language":"German"}' FROM generate_series(1, 10); INSERT INTO user_2(name, user_info) SELECT generate_random_text(8), '{"country":"USA", "language":"English"}' FROM generate_series(1, 10); VACUUM ANALYZE user_2;
In the preceding example, jsonb
is used to wrap the country and language; the explain plan gives the correct estimation of the number of rows, as follows:
EXPLAIN SELECT * FROM user_2 WHERE user_info = '{"country":"USA", "language":"English"}'; Seq Scan on user_2 (cost=0.00..1.25 rows=10 width=60) Filter: (user_info = '{"country": "USA", "language": "English"}'::jsonb)
3.137.215.202