Let’s take a closer look at the xBestIndex()
and xFilter()
functions. Both of our example modules were
fairly simple and didn’t use them, but proper use of these functions is
critical for internal modules that implement some types of high-performance
indexing system.
By default, the only way to get data out of a table—virtual or otherwise—is to do a full table scan. This can be quite expensive, especially if the table is large and the query is trying to extract a small number of rows.
Standard tables have ways of boosting retrieval speeds, such as using indexes. The query optimizer can use other hints found in a standard table definition, such as knowing which columns are unique or have other constraints on them.
Virtual tables lack these features. You cannot
create an index on a virtual table, and the query optimizer has no
knowledge of the structure or format of a virtual table, other than
the column names. The only known constraint on a virtual table is that
each virtual row must have a unique, integer ROWID
.
Without any additional information, it is very difficult to optimize a query that involves a virtual table. This is true for both the query planner and the virtual table itself. For the best performance, the query optimizer needs to understand what types of lookups the virtual table is best suited to doing. Conversely, the virtual table module needs to understand the nature of the user query, including any constraints, so that it can use any internal indexes or lookup optimizations to the best of its ability.
The purpose of the xBestIndex()
and xFilter()
functions is to bridge this gap. When an
SQL statement is prepared, the query optimizer may call xBestIndex()
several times, presenting
several different query possibilities. This allows the module to
formulate its own query plan and pass back an approximate cost metric
to the query optimizer. The query optimizer will use this information
to pick a specific query plan.
When the query statement is executed, the SQLite
library uses xFilter()
to
communicate back to the module which query plan was actually chosen.
The module can use this information to optimize its internal data
lookups, as well as skip over any rows that are not relevant to the
query at hand. This allows a virtual table to implement more targeted
data lookups and retrievals, not unlike an index on a traditional
table.
If you’ll recall, the xBestIndex()
function is a table-level function that looks like this:
int xBestIndex( sqlite3_vtab *vtab, sqlite3_index_info *idxinfo );
The whole key to this function is the sqlite3_index_info
structure. This
structure is divided into two sections. The first section provides a
series of inputs to your function, allowing SQLite to propose a query
plan to the module. The input section should be treated as
read-only.
The second section is the output section. A module uses this second section to communicate back to the query optimizer information about which constraints the virtual table is prepared to enforce, and how expensive the proposed query might be. The module is also given a chance to associate an internal query plan or other data to this particular proposal. The query optimizer will then use this data to select a specific query plan.
The input section consists of two size values and
two arrays. The nConstraint
integer
indicates how many elements are in
the aConstraint[]
array.
Similarly, the nOrder
By[]
integer indicates how many
elements are in the aOrderBy[]
array:
struct sqlite3_index_info { /**** Inputs ****/ int nConstraint; /* Number of entries in aConstraint */ struct sqlite3_index_constraint { int iColumn; /* Column on lefthand side of constraint */ unsigned char op; /* Constraint operator */ unsigned char usable; /* True if this constraint is usable */ int iTermOffset; /* Used internal - xBestIndex should ignore */ } *aConstraint; /* Table of WHERE clause constraints */ int nOrderBy; /* Number of terms in the ORDER BY clause */ struct sqlite3_index_orderby { int iColumn; /* Column number */ unsigned char desc; /* True for DESC. False for ASC. */ } *aOrderBy; /* The ORDER BY clause */
The aConstraint[]
array communicates a series of simple
constraints that a query may put on the virtual table. Each array
element defines one query constraint by passing values for a column
index (aConstraint[i].iColumn
) and
a constraint operator (aConstraint[i].op
). The column index refers to the
columns of the virtual table, with a zero signifying the first column.
An index of ‒1 indicates the constraint is being applied to the
virtual ROWID
column.
The specific constraint operator is indicated with one of these constants. The referenced column (the column index) is always assumed to be on the lefthand side. These are the only operators that can be optimized by a virtual table:
SQLITE_INDEX_CONSTRAINT_EQ /* COL = Expression */ SQLITE_INDEX_CONSTRAINT_GT /* COL > Expression */ SQLITE_INDEX_CONSTRAINT_LE /* COL <= Expression */ SQLITE_INDEX_CONSTRAINT_LT /* COL < Expression */ SQLITE_INDEX_CONSTRAINT_GE /* COL >= Expression */ SQLITE_INDEX_CONSTRAINT_MATCH /* COL MATCH Expression */
For example, if one of the aConstraint
elements had the
values:
aConstraint[i].iColumn = -1; aConstraint[i].op = SQLITE_INDEX_CONSTRAINT_LE;
That would roughly translate to a WHERE
clause of:
...WHERE ROWID <= ?
The parameter on the right side of the expression may change from query to query, but will remain constant for the any given table scan, just as if it were a statement parameter with a bound value.
Each aConstraint[]
element also contains a usable
element. Some constraints may
not be usable by the optimizer due to joins or other external
conditions put on the query. Your code should only pay attention to
those constraints where the usable field is nonzero.
The second array of the input section, aOrderBy[]
, communicates a set of
requested ORDER BY
sortings (it may
also be generated by columns in a GROUP
BY
clause). Each ordering element is defined by a
column index and a direction (ascending or descending). The column
indexes work the same way, with defined columns starting at 0 and ‒1
referring to the ROWID
. The
ordering elements should be treated as a series of ORDER BY
arguments, with the whole
data set being sorted by the first ordering, then subsets of equal
values being sorted by the second ordering, and so on.
The output section contains the data that is
passed back to the SQLite optimizer. It consists of a constraint array
and a set of values. The aConstraintUsage[]
array will always be the same size
as the aConstraint[]
array (that
is, will always have nConstraint
elements). SQLite will always zero out the memory used by the output
section. This is why it is safe to ignore
the structure in simplified implementations of xBest
Index()
—the
structure is basically preset to an answer of, “this module cannot
optimize anything.” In that case, every virtual table query will
require a full table scan:
/**** Outputs ****/ struct sqlite3_index_constraint_usage { int argvIndex; /* If >0,constraint is part of argv to xFilter */ unsigned char omit; /* Do not code a test for this constraint */ } *aConstraintUsage; int idxNum; /* Number used to identify the index */ char *idxStr; /* Application-defined string */ int needToFreeIdxStr; /* Free idxStr using sqlite3_free() if true */ int orderByConsumed; /* True if output is already ordered */ double estimatedCost; /* Estimated cost of using this index */ };
If a module is able to optimize some part of the
query, this is indicated to the query optimizer by modifying the
output section of the sqlite3_index_info
structure to indicate what query
optimizations the module is willing and capable of performing.
Each element of the aConstraintUsage[]
array corresponds to the same
ordered element of the aConstraint[]
array. For each constraint described in
an aConstraint[]
element, the
corresponding aConstraintUsage[]
element is used to describe how the module wants the constraint
applied.
The argvIndex
value is used to indicate to SQLite that you want the expression value
of this constraint (that is, the value on the righthand side of the
constraint) to be passed to the xFilter()
function as one of the argv
parameters. The argvIndex
values are used to determine
the sequence of each expression value. Any aConstraintUsage[]
element can be assigned any index
value, just as long as the set of assigned index values starts with
one and has no gaps. No aConstraintUsage[]
elements should share the same
nonzero argvIndex
value. If the
default argvIndex
value of zero is
returned, the expression value is not made available to the xFilter()
function. Exactly how this
is used will make more sense when we look more closely at xFilter()
.
The omit
field
of an aConstraintUsage[]
element is
used to indicate to the SQLite library that the virtual table module
will take the responsibility to enforce this constraint and that
SQLite can omit the verification process for this constraint. By
default, SQLite verifies the constraints of every row returned by a
virtual table (e.g., every row xNext()
stops at). Setting the omit
field will cause SQLite to skip
the verification process for this constraint.
Following the constraint array is a series of
fields. The first three fields are used to communicate to the xFilter()
function. The fields
idxNum
and idxStr
can be used by the module
however it wishes. The SQLite engine makes no use of these fields,
other than to pass them back to xFilter()
. The third field, needToFreeIdxStr
, is a flag that indicates to the
SQLite library that the memory pointed to by idxStr
has been dynamically allocated by sqlite3_malloc()
, and the SQLite
library should free that memory with sqlite3_free()
if the library decides it is no longer
required.
This flag is needed to prevent memory leaks.
Remember that xBestIndex()
may be
called several times as part of the prepare process for an SQL
statement. The module will usually pass back a unique idxStr
value for each proposed query
plan. Only one of these idxStr
values will be passed to xFilter()
,
however, and the rest must be discarded. That means that any string
(or other memory block) you provide to idxStr
needs to either be static memory, or the
memory needs to be allocated with sqlite3_malloc()
and the needToFreeIdxStr
flag needs to be set. This allows
the SQLite library to properly clean up any unused idxStr
allocations.
The orderByConsumed
field is used to indicate that the
module is able to return the data presorted in the order defined by
the aOrderBy
array. This is an
all-or-nothing flag. If three aOrderBy
elements are given, but the module can only
sort the output by the first column, it must return a false
value.
Finally, the estimatedCost
field is used to communicate a cost
value back to the SQLite library. If this is an external module, this
number should approximate the total number of disk accesses required
to return all rows that meet the specified constraints. If this is an
internal module, it can be an approximation of the number of sqlite3_step()
and sqlite3_column_xxx()
calls. In
situations where a full table scan is required, it can estimate the
number of rows in the virtual table. The exact measurement is not
extremely meaningful, other than the relative values between different
calls to xBestIndex()
.
The xFilter()
function provides a way for the SQLite library to notify the
module, within the context of a specific table cursor, exactly what
constraints and ordering should be applied to the next table scan.
Recall that the xFilter()
prototype
looks like this:
int xFilter( sqlite3_vtab_cursor *cursor, int idxNum, const char *idxStr, int argc, sqlite3_value **argv )
The first argument is the table cursor that
requires these constraints. The idxNum
and idxStr
values are the same values that were passed back by the module in a
prior call to xBestIndex()
. These
mean whatever the module wants, just as long as the code in xBestIndex()
and xFilter()
agrees on what they are and
what the values represent.
Finally, the last two arguments are derived from
the
aConstraintUsage[].argvIndex
values passed back by the
module. The argv
parameter is an
array of sqlite3_value
structures,
while the argc
parameter indicates
how many elements are in the argv
array.
Going back to our prior example, consider an
sqlite3_index_info
structure
with an aConstraint[i]
element,
where iColumn=-1
and op=SQLITE_INDEX_CONSTRAINT_LE
(indicating a constraint of ROWID >=
?
). If the module’s xBestIndex()
function set aConstraintUsage[i].argIndex
to a value of 1, the
argv[0]
value passed into
xFilter()
will have the value found on the righthand
side of the expression.
Notice that the argument indexes between xBestIndex()
and xFilter()
are off by one. Because
sqlite3_index_info
considers
an aConstraintUsage[].argvIndex
value of 0 to indicate an invalid index, the argvIndex
values start at 1. The actual argv
indexes will all be one less,
however, as they start at 0.
Using the idxNum
, idxStr
,
and argv
values, it is the
responsibility of xFilter()
to
configure this table
cursor to provide the correct constraints and ordering that were
promised by the
corresponding sqlite3_index_info
block.
The design of xBestIndex()
and xFilter()
functions is strongly focused on optimizing
internal style modules. These are modules that are going to use one or
more SQL statements that operate over a set of internal tables to
produce the virtual table data. This is similar to how the dblist
module works, but normally
involves more complex SQL commands.
A module is free to do whatever it wants with the
idxNum
and idxStr
values, but most internal
modules use them to pass off pre-built SQL command strings. Each time
xBestIndex()
is called, the
module tries to figure out how it would service the query constraints
and ordering constraints, by adding conditions, constraints, and
ORDER BY
parameters to the
internal SQL statements used to generate the virtual table data. The
xBestIndex()
function marks
the constraints it can use and builds the required SQL command
strings, complete with statement parameters. These SQL commands are
passed back with the idxStr
value.
The idxNum
can be used to pass back
a string length, or some other index value or bit flags or whatever
the module wants. The argvIndex
values of the aConstraintUsage
elements are set to the corresponding statement parameter index
values. In essence, the xBestFilter()
function will build the SQL command
strings that query the virtual table data in such a way that the
required constraints and ordering are already “baked in” to the
behind-the-scenes queries.
When xFilter()
is called, the idxStr
value will
have the relevant SQL command strings for that query configuration.
The SQL command strings can then be prepared, and the constraint
expressions pass in via the argv
array, and can be bound to any statement parameters. The xFilter()
function starts to step
through the prepared statements, generating the first row. Like the
dblist internal module, subsequent calls to xNext()
continue to step through any internal
statements, returning additional rows.
As long as xBestIndex()
can derive a reasonable set of SQL
command strings that are capable of expressing the required internal
query (or queries), this is all reasonably straightforward. If
necessary, multiple SQL command strings can be passed into xFilter()
by
defining them one after another in a large string, and using the tail
parameter of sqlite3_prepare_xxx()
to prepare multiple statements, one after another.
Things can be difficult when dealing with external
modules. Very often external modules can’t define complex query
conditions or sort ordering with a simple string. Although the idxStr
pointer can be used to pass in
some type of data structure, it can be difficult to encode all the
constraint information. This is one of the reasons why many modules,
and especially external modules, forego the use of xBestIndex()
and xFilter()
, and
just depend on full table scans for all operations. Full table scans
might be slower, but they still work.
That might sound bad, but remember that even on a standard table with a standard index, you typically don’t start to see really good returns on using the index unless a constraint and appropriate index are able to eliminate 80% or better of the rows. Spending a lot of time to build a constraint handler that only filters out a small percentage of rows is normally a losing proposition. While that can be the whole point of internal modules, the primary goal of most external modules is to simply provide data connectivity. If you’re working on an external module, get the basic data translation working first, and then worry about possibly implementing more efficient lookup patterns.
3.16.54.63