Now that we’ve seen a very simple example of a virtual table module, you should have some idea of how they work. Although our dblist module was a good introduction to how virtual tables operate, it isn’t a very representative example.
To provide a more advanced and realistic example, we’re going to look at a second example module. This module, known as weblog, is designed to parse Apache httpd server logs and present them to the database engine as a virtual table. It will parse the default Apache combined or common logfile formats, or any other logfile that matches those formats. Apache logfiles are cross-platform and reasonably common, and many people have access to logfiles with a decent amount of interesting data, allowing this example to be a bit more hands-on.
Be warned that some of the code explanations will be a bit brief. Although the functions are larger, much of the code involves rather basic housekeeping-type tasks, such as string scanning. Rather than focus on these parts, most of the descriptions will focus on how the code interacts with the SQLite library. Many of the housekeeping details will be, as they say, left as an exercise for the reader.
The weblog module is designed as an external read-only module. The module gets all of its data directly from a web server logfile, making it dependent on external resources to provide data. The module does not let you modify those data sources, however.
A weblog virtual table would be created with an SQL command like this:
CREATE VIRTUAL TABLE current USING weblog( /var/log/httpd/access.log );
Notice that the filename has no single quotes and is not a string literal. Table parameters include everything between the commas (of which we have none, since there is only one argument), so if you need to reference a file with spaces, you can do something like this:
CREATE VIRTUAL TABLE log USING weblog( /var/log/httpd/access log file.txt );
The first example will create a table instance current, and bind it to the data found in the /var/log/httpd/access.log file. The second example will bind the SQL table log to the file access log file.txt in the same directory.
Briefly, the Apache common log format contains seven fields. The first field is the IP address of the client. In rare situations this might be a hostname, but most servers are configured to simply record the IP address in dot-decimal format. The second field is a legacy ident field. Most web servers do not support this, and record only a single dash. The third field records the username, if given. If not given, this field is also recorded as a single dash. The fourth field is a timestamp, surrounded by square brackets ( [ ] ). The fifth is the first line of the HTTP request, in double quotes. This contains the HTTP operation (such as GET or POST) as well as the URL. In the sixth column is the HTTP result code (e.g., 200 for OK, 404 for missing resource), with the number of payload bytes returned in the seventh field.
The combined log format adds two more fields. The eighth field is the referrer header, which contains a URL, in double quotes. The ninth field is the user-agent header, also in double quotes.
| Count | Logfile field  | Meaning                       |
|-------|----------------|-------------------------------|
| 1     | Client Address | IP or hostname of HTTP client |
| 2     | Ident          | Legacy field, not used        |
| 3     | Username       | Client-provided username      |
| 4     | Timestamp      | Time of transaction           |
| 5     | HTTP Request   | HTTP operation and URL        |
| 6     | Result Code    | Result status of HTTP request |
| 7     | Bytes          | Payload bytes                 |
| 8     | Referrer       | URL of referrer page          |
| 9     | User Agent     | Client software identifier    |
The weblog module is designed to read the combined log format. However, if given a common format logfile that lacks the last two fields, these extra columns will simply be NULL.
Although the logfile has seven or nine columns, the weblog virtual table will have more than nine columns. The virtual table adds a number of additional columns that present the same data in different ways.
For example, the IP address will be returned in one column as a text value that holds the traditional dotted notation. Another column will provide a raw integer representation. The text column is easier for humans to understand, but the integer column allows for faster searches, especially over ranges. The underlying data is the same: the two columns just return the data in different formats. Similarly, the timestamp column can return the string value from the logfile, or it can return separate integer values for the year, month, day, etc.
If this were a fully supported SQLite extension, it would likely include more than just the weblog module. Ideally, it would also include a number of utility functions, such as a function that converted text values containing dot-decimal IP addresses to and from integer values. (Then again, if this were a fully supported module, it would include decent error messages and other polish that this example lacks. I’m trying to keep the line counts as small as possible.) Some of these functions would reduce the need for extra columns, since you could just convert the data using SQL, but there are still times when having the extra columns is extremely useful.
Since the weblog module is an external module, there isn’t any data to initialize. This means that, like the dblist module, we can use the same function for both xCreate() and xConnect().
Before we get into the function, let’s have a quick look at our augmented vtab structure. Since this module does not use the table name for anything, the only data we need to keep around is the database connection and the logfile filename:
typedef struct weblog_vtab_s {
    sqlite3_vtab   vtab;       /* this must be first */
    sqlite3        *db;        /* database connection */
    char           *filename;  /* logfile filename */
} weblog_vtab;
The weblog create/connect function is a bit longer than the dblist version, but still fairly easy to follow. First, it verifies that we have exactly four arguments. Remember that the first three arguments are always the module name, the database name, and the table name. The fourth argument is the first user-provided argument, which in this case is the log filename. The function tries to open that file for read-only access, just to verify the file is there and can be opened for reading. This test isn’t foolproof, but it is a nice check. The module then allocates the vtab structure, stashes a copy of the filename, and declares the table definition:
static int weblog_connect( sqlite3 *db, void *udp, int argc,
        const char *const *argv, sqlite3_vtab **vtab, char **errmsg )
{
    weblog_vtab  *v = NULL;
    const char   *filename = argv[3];
    FILE         *ftest;

    if ( argc != 4 ) return SQLITE_ERROR;
    *vtab = NULL;
    *errmsg = NULL;

    /* test to see if filename is valid */
    ftest = fopen( filename, "r" );
    if ( ftest == NULL ) return SQLITE_ERROR;
    fclose( ftest );

    /* allocate structure and set data */
    v = sqlite3_malloc( sizeof( weblog_vtab ) );
    if ( v == NULL ) return SQLITE_NOMEM;
    ((sqlite3_vtab*)v)->zErrMsg = NULL;   /* need to init this */
    v->filename = sqlite3_mprintf( "%s", filename );
    if ( v->filename == NULL ) {
        sqlite3_free( v );
        return SQLITE_NOMEM;
    }
    v->db = db;
    sqlite3_declare_vtab( db, weblog_sql );

    *vtab = (sqlite3_vtab*)v;
    return SQLITE_OK;
}
The table definition contains 20 columns total. The first 9 map directly to the fields within the logfile, while the extra 11 columns provide different representations of the same data. The last column represents the whole line of the logfile, without modifications:
const static char *weblog_sql = 
    " CREATE TABLE weblog (       "
    "    ip_str      TEXT,        "   /*  0 */
    "    login       TEXT HIDDEN, "   /*  1 */
    "    user        TEXT,        "   /*  2 */
    "    time_str    TEXT,        "   /*  3 */
    "    req         TEXT,        "   /*  4 */
    "    result      INTEGER,     "   /*  5 */
    "    bytes       INTEGER,     "   /*  6 */
    "    ref         TEXT,        "   /*  7 */
    "    agent       TEXT,        "   /*  8 */
#define TABLE_COLS_SCAN 9
    "    ip_int      INTEGER,     "   /*  9 */
    "    time_day    INTEGER,     "   /* 10 */
    "    time_mon_s  TEXT,        "   /* 11 */
    "    time_mon    INTEGER,     "   /* 12 */
    "    time_year   INTEGER,     "   /* 13 */
    "    time_hour   INTEGER,     "   /* 14 */
    "    time_min    INTEGER,     "   /* 15 */
    "    time_sec    INTEGER,     "   /* 16 */
    "    req_op      TEXT,        "   /* 17 */
    "    req_url     TEXT,        "   /* 18 */
    "    line        TEXT HIDDEN  "   /* 19 */
    " ); ";
#define TABLE_COLS 20
You may have noticed a few of the columns have the keyword HIDDEN. This keyword is only valid for virtual table definitions. Any column marked HIDDEN will not be returned by SELECT * FROM... style queries. You can explicitly request the column, but it is not returned by default. This is very similar in behavior to the ROWID column found in standard tables.
In our case, we’ve marked the login and line columns as HIDDEN. The login column almost never contains valid data, while the line column is redundant (and large). The columns are there if you need them, but in most cases people aren’t interested in seeing them. To keep the general output cleaner, I’ve chosen to hide them.
As with xConnect() and xCreate(), the weblog xDisconnect() and xDestroy() functions share the same implementation:
static int weblog_disconnect( sqlite3_vtab *vtab )
{
    sqlite3_free( ((weblog_vtab*)vtab)->filename );
    sqlite3_free( vtab );
    return SQLITE_OK;
}
Free up the memory used for the filename, free up the memory used by the vtab structure, and return. Simple and easy.
The last set of table-level functions includes xBestIndex(), xFindFunction(), xRename(), and xUpdate(), as well as the four transactional functions: xBegin(), xSync(), xCommit(), and xRollback(). The xFindFunction() function is optional, and the weblog module has no use for it, so there is no implementation of this function. Since this is a read-only module, the same is true of xUpdate(). Similarly, the transactional functions are also optional and not required for read-only modules. For table-level functions, that leaves only xRename() and xBestIndex().
The xRename() function is required, but since the module makes no use of the virtual table instance name, it is basically a no-op:
static int weblog_rename( sqlite3_vtab *vtab, const char *newname )
{
    return SQLITE_OK;
}
In the case of the weblog module, once you set the name of the external logfile when creating a virtual table, there is no way to alter it, other than dropping and re-creating the table.
The last function, xBestIndex(), is also required, but it doesn’t actually return any useful data:
static int weblog_bestindex( sqlite3_vtab *vtab, sqlite3_index_info *info )
{
    return SQLITE_OK;
}
Since the module has no indexing system, it can’t offer any optimized search patterns. The logfile is always scanned start to finish anyway, so every query is a full table scan.
We can now move on to the cursor functions. The first thing to look at is the weblog cursor structure. The weblog cursor is a bit more complex than the dblist example, as it needs to read and scan the data values from the logfile.
There are three basic sections to this structure. The first is the base sqlite3_vtab_cursor structure. As always, this must come first, and must be a full instance of the structure:
#define LINESIZE 4096

typedef struct weblog_cursor_s {
    sqlite3_vtab_cursor   cur;             /* this must be first */
    FILE                  *fptr;           /* used to scan file */
    sqlite_int64          row;             /* current row count (ROWID) */
    int                   eof;             /* EOF flag */
    /* per-line info */
    char                  line[LINESIZE];  /* line buffer */
    int                   line_len;        /* length of data in buffer */
    int                   line_ptrs_valid; /* flag for scan data */
    char    *(line_ptrs[TABLE_COLS]);      /* array of pointers */
    int     line_size[TABLE_COLS];         /* length of data for each pointer */
} weblog_cursor;
The second block deals with the data we need to scan the logfile. The weblog module uses the standard C library file functions (such as fopen()) to open and scan the logfile. Each weblog cursor needs a unique FILE pointer, just as each dblist cursor required a unique statement structure. The module uses the FILE structure to keep track of its location within the file, so each cursor needs its own. The cursor also needs to keep track of the number of lines it has read from the file, as this value is used as the ROWID. Finally, the cursor needs an EOF flag to indicate when it has reached the end of the file.
Having a unique FILE pointer for each cursor means the module needs to reopen the file for each table scan. In the case of the weblog module, this is actually an advantage, as each table scan will reassociate itself with the correct file. This can be important in a web server environment, where logfiles may roll frequently.
The third section of the weblog_cursor structure holds everything the cursor needs to know about the current line. The cursor has a buffer to hold the text and length of the current line. There are also a series of pointers and length counters that are used to scan the line. Since scanning the line is fairly expensive, and must be done all at once, the module delays scanning the line until it’s sure the data is needed. Once scanned, the module will keep the scan data around until it reads a new line. To keep track of when a line has been scanned, the cursor contains a “valid” flag.
As we go through the rest of the module functions, you’ll see how these fields are used.
You might be thinking that a 4 KB line buffer seems a bit large, but frequently it is not enough. CGI scripts that use extensive query strings can generate very long logfile lines. Another issue is that many referrer URLs, especially those from search engines, can be extremely large. While most lines are only a hundred characters or so, it is best if the module can try to deal with the longer ones as well. Even with a 4 KB buffer, you’ll need to properly deal with potential buffer overflows.
Now that we’ve seen what the cursor looks like, let’s have a look at how it is opened and created. When the module needs to create a new cursor, it will first attempt to open the correct logfile. Assuming that succeeds, it will allocate the cursor structure and initialize the basic data:
static int weblog_open( sqlite3_vtab *vtab, sqlite3_vtab_cursor **cur )
{
    weblog_vtab     *v = (weblog_vtab*)vtab;
    weblog_cursor   *c;
    FILE            *fptr;

    *cur = NULL;
    fptr = fopen( v->filename, "r" );
    if ( fptr == NULL ) return SQLITE_ERROR;

    c = sqlite3_malloc( sizeof( weblog_cursor ) );
    if ( c == NULL ) {
        fclose( fptr );
        return SQLITE_NOMEM;
    }
    c->fptr = fptr;

    *cur = (sqlite3_vtab_cursor*)c;
    return SQLITE_OK;
}
The open function doesn’t need to initialize the line data, as this will all be reset when we read the first line from the data file.
The xClose() function is relatively simple:
static int weblog_close( sqlite3_vtab_cursor *cur )
{
    if ( ((weblog_cursor*)cur)->fptr != NULL ) {
        fclose( ((weblog_cursor*)cur)->fptr );
    }
    sqlite3_free( cur );
    return SQLITE_OK;
}
Since the weblog module chooses to ignore the xBestIndex() function, it largely ignores xFilter() as well. The file is reset to the beginning, just to be sure, and the module reads the first line of data:
static int weblog_filter( sqlite3_vtab_cursor *cur,
        int idxnum, const char *idxstr, int argc, sqlite3_value **value )
{
    weblog_cursor *c = (weblog_cursor*)cur;

    fseek( c->fptr, 0, SEEK_SET );
    c->row = 0;
    c->eof = 0;
    return weblog_get_line( (weblog_cursor*)cur );
}
The weblog_get_line() function reads in a single line from the logfile and copies it into our line buffer. It also verifies that it got a full line. If it didn’t get a full line, the function keeps reading (but discards the input) to make sure the file location is left at the beginning of the next valid line. We can reduce how often this happens by making the line buffer bigger, but no matter how big we make the buffer, it is always a good idea to make sure a whole line is consumed, even if the tail is discarded:
static int weblog_get_line( weblog_cursor *c )
{
    char    *cptr;
    int     rc = SQLITE_OK;

    c->row++;                  /* advance row (line) counter */
    c->line_ptrs_valid = 0;    /* reset scan flag */
    cptr = fgets( c->line, LINESIZE, c->fptr );
    if ( cptr == NULL ) {      /* found the end of the file/error */
        if ( feof( c->fptr ) ) {
            c->eof = 1;
        } else {
            rc = -1;
        }
        return rc;
    }

    /* find end of buffer and make sure it is the end of a line... */
    cptr = c->line + strlen( c->line ) - 1;         /* find end of string */
    if ( ( *cptr != '\n' )&&( *cptr != '\r' ) ) {   /* overflow? */
        char    buf[1024], *bufptr;                 /* ... if so, keep reading */
        while ( 1 ) {
            bufptr = fgets( buf, sizeof( buf ), c->fptr );
            if ( bufptr == NULL ) {    /* found the end of the file/error */
                if ( feof( c->fptr ) ) {
                    c->eof = 1;
                } else {
                    rc = -1;
                }
                break;
            }
            bufptr = &buf[ strlen( buf ) - 1 ];
            if ( ( *bufptr == '\n' )||( *bufptr == '\r' ) ) {
                break;                 /* found the end of this line */
            }
        }
    }

    while ( ( *cptr == '\n' )||( *cptr == '\r' ) ) {
        *cptr-- = '\0';   /* trim newline characters off end of line */
    }
    c->line_len = ( cptr - c->line ) + 1;
    return rc;
}
Besides reading a full line, this function also resets the scan flag (to indicate the line buffer has not had the individual fields scanned) and adds one (1) to the line count. At the end, the function also trims off any trailing newline or carriage return characters.
We only have a few functions left. Specifically, the module needs to define the two row-handling functions, xNext() and xEof(), as well as the two column functions, xRowid() and xColumn(). Three of these four functions are quite simple. The xNext() function can call weblog_get_line(), just as the xFilter() function did. The xEof() and xRowid() functions return or pass back values that have already been calculated elsewhere:
static int weblog_next( sqlite3_vtab_cursor *cur )
{
    return weblog_get_line( (weblog_cursor*)cur );
}

static int weblog_eof( sqlite3_vtab_cursor *cur )
{
    return ((weblog_cursor*)cur)->eof;
}

static int weblog_rowid( sqlite3_vtab_cursor *cur, sqlite3_int64 *rowid )
{
    *rowid = ((weblog_cursor*)cur)->row;
    return SQLITE_OK;
}
The interesting function is the xColumn() function. If you’ll recall, in addition to the line buffer, the weblog_cursor structure also has an array of character pointers and length values. Each of these pointers and lengths corresponds to a column value in the defined table format. Before the module can extract those values, it needs to scan the input line and mark all the columns by setting the pointer and length values.
Using a length value means the module doesn’t need to insert termination characters into the original string buffer. That’s good, since several of the fields overlap. Using terminating characters would require making private copies of these data fields. In the end, a length value is quite useful anyway, as most of SQLite’s value-handling routines utilize length values.
The function that sets up all these pointers and length calculations is weblog_scanline(). We’ll work our way through this section by section. At the top are, of course, the variable definitions. The start and end pointers will be used to scan the line buffer, while the next value keeps track of the terminating character for the current field:
static int weblog_scanline( weblog_cursor *c )
{
    char    *start = c->line, *end = NULL, next = ' ';
    int     i;

    /* clear pointers */
    for ( i = 0; i < TABLE_COLS; i++ ) {
        c->line_ptrs[i] = NULL;
        c->line_size[i] = -1;
    }
With the variables declared, the first order of business is to reset all of the column pointers and sizes.
Next, the scan function loops over the native data fields in the line. This scans up to nine fields from the line buffer. These fields correspond to all the primary fields in a combined format logfile. If the logfile is a common format file (with only seven fields) or if the line buffer was clipped off, fewer fields are scanned. Any fields that are not properly scanned will eventually end up returning NULL SQL values:
    /* process actual fields */
    for ( i = 0; i < TABLE_COLS_SCAN; i++ ) {
        next = ' ';
        while ( *start == ' ' ) start++;   /* trim whitespace */
        if ( *start == '\0' ) break;       /* found the end */

        if ( *start == '"' ) {
            next = '"';    /* if we started with a quote, end with one */
            start++;
        }
        else if ( *start == '[' ) {
            next = ']';    /* if we started with a bracket, end with one */
            start++;
        }
        end = strchr( start, next );       /* find end of this field */
        if ( end == NULL ) {               /* found the end of the line */
            int len = strlen( start );
            end = start + len;             /* end now points to '\0' */
        }
        c->line_ptrs[i] = start;           /* record start */
        c->line_size[i] = end - start;     /* record length */
        while ( ( *end != ' ' )&&( *end != '\0' ) ) end++;  /* find end */
        start = end;
    }
This loop attempts to scan one field at a time. The first half of the loop figures out the ending character of the field. In most cases it is a space, but it can also be a double-quote or square bracket. Once it knows what it’s looking for, the string is scanned for the next end marker. If the marker isn’t found, the rest of the string is used.
When this loop exits, the code has attempted to set up the first nine column pointers. These make up the native fields of the logfile. The next step is to set up pointers and lengths for the additional 11 columns that represent subfields and alternate representations. The first additional value is the IP address, returned as an integer. This function doesn’t do data conversions, so a direct copy of pointer and length from the first column can be made:
    /* process special fields */

    /* ip_int - just copy */
    c->line_ptrs[9] = c->line_ptrs[0];
    c->line_size[9] = c->line_size[0];
Next, all of the date field pointers and lengths are set up. This section of code makes some blatant assumptions about the format of the timestamp, but there isn’t much choice. The code could scan the individual fields, but it would still be forced to make assumptions about the ordering of the fields. In the end, it is easiest to just assume the format is consistent and hardcode the field lengths. This example ignores the time zone information:
    /* assumes: "DD/MMM/YYYY:HH:MM:SS zone" */
    /* idx:      012345678901234567890...   */
    if (( c->line_ptrs[3] != NULL )&&( c->line_size[3] >= 20 )) {
        start = c->line_ptrs[3];
        c->line_ptrs[10] = &start[0];    c->line_size[10] = 2;  /* day */
        c->line_ptrs[11] = &start[3];    c->line_size[11] = 3;  /* month, text */
        c->line_ptrs[12] = &start[3];    c->line_size[12] = 3;  /* month */
        c->line_ptrs[13] = &start[7];    c->line_size[13] = 4;  /* year */
        c->line_ptrs[14] = &start[12];   c->line_size[14] = 2;  /* hour */
        c->line_ptrs[15] = &start[15];   c->line_size[15] = 2;  /* minute */
        c->line_ptrs[16] = &start[18];   c->line_size[16] = 2;  /* second */
    }
After the date fields, the next step is to extract the HTTP operation and URL. These are extracted as the first two subfields of the HTTP Request log field. The code plays some games to be sure it doesn’t accidentally pass a NULL pointer into strchr(), but otherwise it just finds the first two spaces and considers those to be the ending of the two fields it is trying to extract:
    /* req_op, req_url */
    start = c->line_ptrs[4];
    end = ( start == NULL ? NULL : strchr( start, ' ' ) );
    if ( end != NULL ) {
        c->line_ptrs[17] = start;
        c->line_size[17] = end - start;
        start = end + 1;
    }
    end = ( start == NULL ? NULL : strchr( start, ' ' ) );
    if ( end != NULL ) {
        c->line_ptrs[18] = start;
        c->line_size[18] = end - start;
    }
The final column represents the full contents of the line buffer. We also need to set the valid flag to indicate the field pointers are valid and ready for use:
    /* line */
    c->line_ptrs[19] = c->line;
    c->line_size[19] = c->line_len;

    c->line_ptrs_valid = 1;
    return SQLITE_OK;
}
Once this function has been called, all the fields that could be scanned will have a valid pointer and length value. With the data scanned, this and subsequent calls to xColumn() can use the relevant values to pass back their database values. Let’s return to looking at xColumn().
The first thing the xColumn() code does is make sure the line has already been scanned. If not, the code calls weblog_scanline() to set up all the field pointers:
static int weblog_column( sqlite3_vtab_cursor *cur, sqlite3_context *ctx, int cidx )
{
    weblog_cursor   *c = (weblog_cursor*)cur;

    if ( c->line_ptrs_valid == 0 ) {
        weblog_scanline( c );           /* scan line, if required */
    }
    if ( c->line_size[cidx] < 0 ) {     /* field not scanned and set */
        sqlite3_result_null( ctx );
        return SQLITE_OK;
    }
Next, if the requested column doesn’t have a valid set of values, the module passes back an SQL NULL for the column.
The code then processes columns with specific conversion needs. Any column that needs special processing or conversion will be caught by this switch statement. The first specialized column is the integer version of the IP address. This block of code converts each octet of the IP address into an integer value. The only issue is that all integer values within SQLite are signed, so the code needs to be careful about constructing the value into a 64-bit integer. For maximum compatibility, it avoids using shift operations:
    switch( cidx ) {
    case 9: {    /* convert IP address string to signed 64 bit integer */
        int             i;
        sqlite_int64    v = 0;
        char            *start = c->line_ptrs[cidx], *end, *oct[4];

        for ( i = 0; i < 4; i++ ) {
            oct[i] = start;
            end = ( start == NULL ? NULL : strchr( start, '.' ) );
            if ( end != NULL ) {
                start = end + 1;
            }
        }
        v += ( oct[3] == NULL ? 0 : atoi( oct[3] ) );   v *= 256;
        v += ( oct[2] == NULL ? 0 : atoi( oct[2] ) );   v *= 256;
        v += ( oct[1] == NULL ? 0 : atoi( oct[1] ) );   v *= 256;
        v += ( oct[0] == NULL ? 0 : atoi( oct[0] ) );
        sqlite3_result_int64( ctx, v );
        return SQLITE_OK;
    }
The next specialized column is one of the two month fields. In the logfile, the month value is given as a three-character abbreviation. One column returns this original text value, while another returns a numeric value. To convert from the abbreviation to the numeric value, the code simply looks for constants in the month string. If it can’t find a match, the code breaks out. As we’ll see, if the code breaks out it will eventually end up returning the text value:
    case 12: {   /* convert three-letter month name to a number */
        int m = 0;
        if      ( strncmp( c->line_ptrs[cidx], "Jan", 3 ) == 0 ) m = 1;
        else if ( strncmp( c->line_ptrs[cidx], "Feb", 3 ) == 0 ) m = 2;
        else if ( strncmp( c->line_ptrs[cidx], "Mar", 3 ) == 0 ) m = 3;
        else if ( strncmp( c->line_ptrs[cidx], "Apr", 3 ) == 0 ) m = 4;
        else if ( strncmp( c->line_ptrs[cidx], "May", 3 ) == 0 ) m = 5;
        else if ( strncmp( c->line_ptrs[cidx], "Jun", 3 ) == 0 ) m = 6;
        else if ( strncmp( c->line_ptrs[cidx], "Jul", 3 ) == 0 ) m = 7;
        else if ( strncmp( c->line_ptrs[cidx], "Aug", 3 ) == 0 ) m = 8;
        else if ( strncmp( c->line_ptrs[cidx], "Sep", 3 ) == 0 ) m = 9;
        else if ( strncmp( c->line_ptrs[cidx], "Oct", 3 ) == 0 ) m = 10;
        else if ( strncmp( c->line_ptrs[cidx], "Nov", 3 ) == 0 ) m = 11;
        else if ( strncmp( c->line_ptrs[cidx], "Dec", 3 ) == 0 ) m = 12;
        else break;    /* give up, return text */
        sqlite3_result_int( ctx, m );
        return SQLITE_OK;
    }
There are a number of additional columns (including some of the “native” ones) that are returned as integers. None of these columns require special processing, other than the string-to-integer conversion. The standard atoi() function is used for this conversion. Although the string pointers are not null-terminated, the atoi() function will automatically stop once it encounters a non-numeric character. Since all of these fields are bounded by spaces or other characters, this works out exactly the way we want:
    case 5:      /* result code */
    case 6:      /* bytes transferred */
    case 10:     /* day-of-month */
    case 13:     /* year */
    case 14:     /* hour */
    case 15:     /* minute */
    case 16:     /* second */
        sqlite3_result_int( ctx, atoi( c->line_ptrs[cidx] ) );
        return SQLITE_OK;
    default:
        break;
    }

    sqlite3_result_text( ctx, c->line_ptrs[cidx], c->line_size[cidx], SQLITE_STATIC );
    return SQLITE_OK;
}
Finally, any field that did not require special processing is returned as a text value. Although the line buffer will be overwritten when the next line is read, the data pointer passed into sqlite3_result_text() only needs to stay valid until the next call to xNext(). This allows the module to use the SQLITE_STATIC flag.
With that, we’ve defined all the required functions for our weblog module.
Now that we’ve seen how all the module functions are implemented, the last thing to do is register the weblog module as part of the extension initialization function:
static sqlite3_module weblog_mod = {
    1,                   /* iVersion        */
    weblog_connect,      /* xCreate()       */
    weblog_connect,      /* xConnect()      */
    weblog_bestindex,    /* xBestIndex()    */
    weblog_disconnect,   /* xDisconnect()   */
    weblog_disconnect,   /* xDestroy()      */
    weblog_open,         /* xOpen()         */
    weblog_close,        /* xClose()        */
    weblog_filter,       /* xFilter()       */
    weblog_next,         /* xNext()         */
    weblog_eof,          /* xEof()          */
    weblog_column,       /* xColumn()       */
    weblog_rowid,        /* xRowid()        */
    NULL,                /* xUpdate()       */
    NULL,                /* xBegin()        */
    NULL,                /* xSync()         */
    NULL,                /* xCommit()       */
    NULL,                /* xRollback()     */
    NULL,                /* xFindFunction() */
    weblog_rename        /* xRename()       */
};

int weblog_init( sqlite3 *db, char **error, const sqlite3_api_routines *api )
{
    SQLITE_EXTENSION_INIT2(api);
    return sqlite3_create_module( db, "weblog", &weblog_mod, NULL );
}
Since there is no attempt to create an instance of a weblog table, this initialization function is a bit simpler than the previous dblist example.
Now that we’ve worked through the whole example, let’s see what the code can do. Here are a few different examples that show off the power of the weblog module.
While doing these types of queries is not a big deal for people that are comfortable with SQL, realize that we can run all of these queries without having to first import the logfile data. Not only does that make the whole end-to-end process much faster, it means we can run these types of queries against active, “up to the second” logfiles.
To show off how this module works, the server administrators of http://oreilly.com/ were nice enough to provide me with some of their logfiles. The file referred to as oreilly.com_access.log is an Apache combined logfile with 100,000 lines of data. Once compiled and built into a loadable module, we can import the weblog module and create a virtual table that is bound to this file using these commands:
sqlite>.load weblog.sqlite3ext weblog_init
sqlite>CREATE VIRTUAL TABLE log USING weblog( oreilly.com_access.log );
We then issue queries to look at different aspects of the file. For example, if we want to know what the most common URL is, we run a query like this:
sqlite>SELECT count(*) AS Count, req_url AS URL FROM log
...>GROUP BY 2 ORDER BY 1 DESC LIMIT 8;
Count  URL
-----  ----------------------------------------
2490   /images/oreilly/button_cart.gif
2480   /images/oreilly/button_acct.gif
2442   /styles/all.css
2348   /images/oreilly/888-line.gif
2233   /styles/chrome.css
2206   /favicon.ico
1975   /styles/home2.css
1941   /images/oreilly/satisfaction-icons.gif
It is fairly common to see favicon.ico very near the top, along with any site-wide CSS and image files. In the case of smaller sites that have a lot less traffic, it isn’t uncommon for the most requested URL to be /robots.txt, which is used by search engines.
We can also see what the most expensive items on the website are, in terms of bytes moved:
sqlite>SELECT sum(bytes) AS Bytes, count(*) AS Count, req_url AS URL
...>FROM log WHERE result = 200 GROUP BY 3 ORDER BY 1 DESC LIMIT 8;
Bytes     Count  URL
--------  -----  ---------------------------------------------
46502163  1137   /images/oreilly/mac_os_x_snow_leopard-148.jpg
40780252  695    /
37171328  2384   /styles/all.css
35403200  2180   /styles/chrome.css
31728906  781    /catalog/assets/pwr/engine/js/full.js
31180460  494    /catalog/9780596510046/index.html
21573756  88     /windows/archive/PearPC.html
21560154  3      /catalog/dphotohdbk/chapter/ch03.pdf
We see that some of these items are not that large, but are requested frequently. Other items have only a small number of requests, but are big enough to make a noticeable contribution to the total number of served bytes.
Here is one final example. This shows what IP addresses are downloading the most number of unique items. Since this is from live data, I’ve altered the IP addresses:
sqlite>SELECT count(*) AS Uniq, sum(sub_count) AS Ttl,
...>sum(sub_bytes) AS TtlBytes, sub_ip AS IP
...>FROM (SELECT count(*) AS sub_count, sum(bytes) AS sub_bytes,
...>ip_str AS sub_ip FROM log GROUP BY 3, req_url)
...>GROUP BY 4 ORDER BY 1 DESC LIMIT 8;
Uniq  Ttl   TtlBytes  IP
----  ----  --------  ------------------
1295  1295  31790418  10.5.69.83
282   334   13571771  10.170.13.97
234   302   4234382   10.155.7.28
213   215   3089112   10.155.7.77
163   176   2550477   10.155.7.29
159   161   4279779   10.195.137.175
153   154   2292407   10.23.146.198
135   171   2272949   10.155.7.71
For each IP address, the first column is the number of unique URLs requested, while the second column is the total number of requests. The second column should always be greater than or equal to the first column. The third column is the total number of bytes, followed by the (altered) IP address in question. Exactly how this query works is left as an exercise for the reader.
There are countless other queries we could run. For anyone who has ever imported log data into an SQL database and played around with it, none of this is particularly inspiring. But consider this for a moment: the query time for the first two of these examples is a bit less than five seconds on an economy desktop system that is several years old. The third query was a bit closer to eight seconds.
Five seconds to scan a 100,000-row table might not be blazingly fast, but remember that those five seconds are the grand total for everything, including data “import.” Using the virtual table module allows us to go from a raw logfile with 100,000 lines to a query answer in just that amount of time: no data staging, no format conversions, no data imports. That’s important, since importing involves a lot of I/O and can be a slow process. For example, importing the same file into a standard SQLite table by more traditional means takes nearly a minute, and that doesn’t even include any queries!
Now consider that we enable all this functionality with less than 400 lines of C code. Accessing the original data, rather than importing it into standard tables, allows the end-to-end data analysis process to be much faster, and allows you to query the data, as it is recorded by the web server, in real time. As an added bonus, the virtual table can also be used as an importer, by using the CREATE TABLE... AS or INSERT... SELECT SQL commands.
If you find yourself faced with the task of writing a script to analyze, search, or summarize some structured source of data, you might consider writing an SQLite module instead. A basic, read-only module is a fairly minor project, and once you’ve got that in place you have the complete power of the SQLite database engine at your disposal (plus an added data importer!). That makes it easy to write, test, and tune whatever queries you need in just a few lines of SQL.