The body of this book contains many code samples and snippets. I took quite a number of them (and slightly adapted them for confidentiality reasons) from code I have audited and modified; because these samples cannot run without a full array of tables and indexes filled with classified data, providing them makes little sense.
This book also includes many other programs and examples that I wrote specifically for it (which I sometimes merely allude to), as well as utility scripts that query the data dictionary and are of general use. These programs and scripts are available for download. You will find them on the O’Reilly website for this book, http://www.oreilly.com/catalog/9780596514976.
This appendix describes what you will find for download and provides a number of additional comments, which I hope you’ll find interesting.
Because most people are interested in one particular DBMS, I have grouped the code samples by DBMS, with one archive file per DBMS. Within each archive file, I organized the code samples by chapter (one directory per chapter).
Note that I wrote some of the programs specifically for one DBMS and did not port them to the others (this is the case, for example, with all the Chapter 7 examples, which were written exclusively for MySQL). In several cases, you can adapt them to other products with some minor changes. Therefore, don’t hesitate to look at the code for the other products if you are interested in a program that isn’t provided for your favorite DBMS.
In the descriptions that follow, for programs that aren’t available for all three of MySQL, Oracle, and SQL Server, I have indicated in parentheses the DBMS product(s) for which they are provided. In other words, if nothing is specified, a version of the program or script is available for all three products. Note that in the case of scripts that query the data dictionary, a similar name doesn’t mean all versions of the script return exactly the same information, but rather the same type of information.
To begin, I used four scripts to create the tables for Chapter 1 (these tables appear again in some examples in later chapters). If you want to run the programs on your own test database, you should start by generating the sample data.
These two scripts create the tables and the primary key indexes:
refactoring_big_tables.sql
refactoring_small_tables.sql (includes data)
These two scripts create additional indexes; the second script creates an index that replaces the index created by the first one (please refer to Chapter 1 for details):
additional_index.sql
additional_index_alternate.sql
The big transactions table is populated by GenerateData.java, which requires database.properties, which you must update to specify how to connect to your database.
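A connection properties file of this kind generally looks like the following; the key names and values here are purely illustrative, and the database.properties file in the download defines its own:

```
# Illustrative sketch only -- the actual database.properties file in the
# download defines its own keys and values.
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/refactoring
user=scott
password=secret
```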
Once the tables are created and populated, you can run the following programs:
FirstExample.java
SecondExample.java
ThirdExample.java (Java 1.5 or later required)
FourthExample.java
FifthExample.java
SixthExample.java
All of them use the same database.properties file as the data generation program.
The following scripts are product-specific:
dba_analysis.sql (SQL Server) is the query the SQL Server DBA ran to diagnose problems (like all DBA scripts, it requires some system privileges).
profiler_analysis.sql (SQL Server) is a query you can run if you save what the SQL Server profiler collects to a table. It includes a T-SQL function called Normalize to normalize statement text.
snapmon.sql (Oracle, SQL Server) is my core monitoring tool. It collects statistical information about all the queries currently held in the cache, as well as global values, which are returned as though they were associated with a dummy query. In the course of my assignments, I have developed monitoring tools in various languages (shell script included), all based on this query or slight variants of it. Each query is identified by an identifier, and a particular value is assigned to the global statistics, which allows you to check whether you have collected meaningful information or whether you have missed a lot of statements. In the latter case, I suggest that you run a modified version of this query that identifies queries by the identifier of their execution plan.
The values returned by snapmon.sql are cumulative. If you want values by time slice, each time you poll you must compute the difference from the previous results.
You can do this in several ways: dump the output of the query to files and merge and sort two successive files, or make creative use of awk or perl to get what you want. I advise you to use SQLite (http://www.sqlite.org), which has APIs for most languages and scripting languages, and to proceed as follows:
Create two temporary tables, named snaptmp1 and snaptmp2, with a structure that matches the output of snapmon.sql, and insert into them alternately. Note that you don’t need to query the database very often: within a few minutes, you can usually get a very detailed picture of what the database is doing.
After the second pass, begin to insert data into a permanent table with the same structure by executing a query on this pattern:
insert into ... select <identifier>, abs(sum(<col1>)), ..., abs(sum(<coln>))
from (select <identifier>, <col1>, ..., <coln>
      from snaptmp1
      union all
      select <identifier>, -1 * <col1>, ..., -1 * <coln>
      from snaptmp2) x
group by <identifier>
having sum(...) <> 0
You can refine this further by selecting only the top queries (to which the dummy “global query” will necessarily belong). You should also have a table that holds the text of statements. Whenever new identifiers are encountered, you should get the associated statement text from the database. I don’t want to stymie your creativity by suggesting too many things, but you can collect other information as well (e.g., waits) and generate graphics that will show you the load.
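The two-snapshot computation can be sketched quickly with SQLite’s Python API; the table and column names below are illustrative, not those of the book’s scripts:

```python
import sqlite3

# Minimal sketch of the two-snapshot technique with SQLite; table and
# column names are invented for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snaptmp1 (identifier TEXT, executions INTEGER);
    CREATE TABLE snaptmp2 (identifier TEXT, executions INTEGER);
    CREATE TABLE deltas   (identifier TEXT, executions INTEGER);
""")
# First poll: cumulative counters as a snapmon-like query returns them.
con.executemany("INSERT INTO snaptmp1 VALUES (?, ?)",
                [("query_A", 100), ("query_B", 40)])
# Second poll, a little later: counters have grown, a new query appeared.
con.executemany("INSERT INTO snaptmp2 VALUES (?, ?)",
                [("query_A", 130), ("query_B", 40), ("query_C", 5)])
# Difference between the two snapshots, following the insert/select
# pattern above; abs() makes the order of the snapshots irrelevant.
con.execute("""
    INSERT INTO deltas
    SELECT identifier, ABS(SUM(executions))
    FROM (SELECT identifier, executions FROM snaptmp2
          UNION ALL
          SELECT identifier, -1 * executions FROM snaptmp1) x
    GROUP BY identifier
    HAVING SUM(executions) <> 0
""")
rows = dict(con.execute("SELECT identifier, executions FROM deltas"))
print(rows)   # query_B did not run between the polls, so it is filtered out
```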
At the time of this writing, something similar wasn’t available for MySQL, for which other monitoring means may be more adequate. Nevertheless, depending on the storage engine in use, some useful data can be collected from information_schema.global_status (the COM_xxx and INNODB_xxx parameters) as well as from the falcon_xxx tables, which will at least give you a global picture. This can be complemented by select statements against information_schema.processlist, which must be sampled much more frequently than the Oracle and SQL Server tables because it shows only current information.
The following scripts and programs are used in Chapter 2:
IndexSelectivity.java tests the impact of selectivity on index access performance.
stats.sql (MySQL, Oracle) displays some statistical values regarding tables.
indexing.sql lists all tables in the current schema and tells how many indexes they have, among other information.
checkparse.sql is a script that tells you whether many statements are parsed on your database, which is always indicative of hardcoded statements (although the reverse isn’t quite true, because the DBMS sometimes replaces constants with parameters, as you saw in the chapter).
A few other .java programs come with the mandatory database.properties file. The following three programs test the performance loss of parsing:
HardCoded.java
FirmCoded.java
SoftCoded.java
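The contrast these programs measure can be sketched in a few lines; here SQLite stands in for the database server, and the schema is invented for the purpose:

```python
import sqlite3

# Sketch of the difference between hard-coded and soft-coded statements
# (illustrative schema, not the book's transactions table).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [(i, 100.0 * i) for i in range(1, 101)])

# Hard-coded: every call produces a distinct statement text, so the DBMS
# parses each one anew and the statement cache is useless.
def balance_hardcoded(acct_id):
    return con.execute(
        "SELECT balance FROM accounts WHERE id = %d" % acct_id).fetchone()[0]

# Soft-coded: one statement text with a bound parameter; the server can
# parse it once and reuse the execution plan on every call.
def balance_softcoded(acct_id):
    return con.execute(
        "SELECT balance FROM accounts WHERE id = ?", (acct_id,)).fetchone()[0]

assert balance_hardcoded(7) == balance_softcoded(7) == 700.0
```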
These scripts show step by step how a list of values can be passed as a single parameter:
list0.sql
list1.sql
list2.sql
list3.sql
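The idea behind these scripts can be sketched as follows: the whole list travels as one bound parameter and is split inside the query. This version uses SQLite’s recursive CTE syntax; a real implementation would use each product’s own string functions:

```python
import sqlite3

# Passing a list of values as a single parameter: the comma-separated
# string is split inside the query with a recursive CTE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE txn (txid INTEGER, curr TEXT)")
con.executemany("INSERT INTO txn VALUES (?, ?)",
                [(1, "USD"), (2, "EUR"), (3, "GBP"), (4, "JPY")])
wanted = "EUR,GBP"          # the whole list is one bound parameter
rows = con.execute("""
    WITH RECURSIVE split(item, rest) AS (
        SELECT NULL, ? || ','
        UNION ALL
        SELECT substr(rest, 1, instr(rest, ',') - 1),
               substr(rest, instr(rest, ',') + 1)
        FROM split WHERE rest <> ''
    )
    SELECT txid, curr FROM txn
    WHERE curr IN (SELECT item FROM split WHERE item IS NOT NULL)
    ORDER BY txid
""", (wanted,)).fetchall()
print(rows)   # [(2, 'EUR'), (3, 'GBP')]
```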
These three programs fetch rows from the transactions table; the first one uses the default fetch size, the second one takes the fetch size as a command-line argument, and the third one activates the MySQL streaming mode:
DumpTx_default.java
DumpTx.java
DumpTx_Stream.java (MySQL only)
And finally, this program tests the impact of the commit rate on throughput:
UpdateTx.java
For Oracle, there are three different ways to write a function that counts patterns, plus a script that creates a test table and successively applies the three functions to this table:
function1.sql (Oracle)
function2.sql (Oracle)
function3.sql (Oracle)
test_search.sql (Oracle)
Also for Oracle, there are three functions that check whether a date falls on a weekend day: a nondeterministic function, then a deterministic function, and finally a function that wraps a deterministic function after removing the time part from the date:
weekend_day1.sql (Oracle)
weekend_day2.sql (Oracle)
weekend_day3.sql (Oracle)
A naive function is available for computing the next business day (NextBusinessDay.sql). This function relies on a table created by public_holidays.sql.
Also, with Oracle (in particular) and with MySQL, there are various attempts at refactoring this function. For Oracle, NextBusinessDay2.sql and NextBusinessDay3.sql both require Oracle 11g or later, which isn’t the case for NextBusinessDay4.sql:
public_holidays.sql
NextBusinessDay.sql
NextBusinessDay2.sql (MySQL, Oracle)
NextBusinessDay3.sql (Oracle)
NextBusinessDay4.sql (Oracle)
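One plausible shape for such a naive function is to step forward one day at a time, skipping weekends and public holidays; in this sketch a Python set stands in for the table created by public_holidays.sql, and the dates are arbitrary:

```python
import datetime

# A set standing in for the public holidays table; illustrative dates.
PUBLIC_HOLIDAYS = {datetime.date(2008, 12, 25), datetime.date(2008, 12, 26)}

def next_business_day(d):
    """Naively walk forward one day at a time until a business day."""
    d += datetime.timedelta(days=1)
    while d.weekday() >= 5 or d in PUBLIC_HOLIDAYS:   # 5 = Sat, 6 = Sun
        d += datetime.timedelta(days=1)
    return d

# December 24, 2008 was a Wednesday; the 25th and 26th are holidays and
# the 27th/28th a weekend, so the next business day is Monday the 29th.
print(next_business_day(datetime.date(2008, 12, 24)))   # 2008-12-29
```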
Similarly, there is a naive currency conversion function called fx_convert.sql and, for Oracle and MySQL, alternative versions (the Oracle fx_convert2.sql file requires Oracle 11g or later):
fx_convert.sql
fx_convert2.sql (MySQL, Oracle)
fx_convert3.sql (MySQL, Oracle)
Finally, one script replaces the function with a simple join:
fxjoin.sql
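The gist of replacing a per-row function call with a join can be sketched as follows; the schema, rates, and amounts are invented for the example:

```python
import sqlite3

# Instead of calling a conversion function once per row, join once to a
# rates table (illustrative schema, not the book's).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (txid INTEGER, curr TEXT, amount REAL);
    CREATE TABLE rates        (curr TEXT, rate REAL);
""")
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(1, "EUR", 100.0), (2, "GBP", 50.0)])
con.executemany("INSERT INTO rates VALUES (?, ?)",
                [("EUR", 1.5), ("GBP", 2.0)])
# The conversion is now a plain join that the optimizer can work with.
rows = con.execute("""
    SELECT t.txid, t.amount * r.rate AS converted
    FROM transactions t JOIN rates r ON r.curr = t.curr
    ORDER BY t.txid
""").fetchall()
print(rows)   # [(1, 150.0), (2, 100.0)]
```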
You’ll find various views built on the tables created by the scripts and programs in Chapter 1:
v_amount_by_currency.sql
v_amount_main_currencies.sql
v_amount_other_currencies.sql
v_last_rate.sql
You’ll also find a query that tries to identify complex views that might be a reason for poor query performance:
complex_views.sql
For Chapter 4, there is the definition of a view that returns 10 rows, numbered 0 through 9:
ten_rows.sql
For SQL Server, there is a script that creates the view required for further generation of random data, plus the function that uses this view:
random_view.sql (SQL Server)
randuniform.sql (SQL Server)
These scripts contain two functions that return random numbers that are not uniformly distributed over an interval:
randexp.sql
randgauss.sql
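The classic way to derive such numbers from a uniform generator is inverse-transform sampling for the exponential distribution and the Box-Muller transform for the Gaussian one; here is a sketch of both, independent of what the actual scripts do:

```python
import math
import random

def randexp(mean):
    """Exponentially distributed value via inverse-transform sampling."""
    return -mean * math.log(1.0 - random.random())

def randgauss(mu, sigma):
    """Normally distributed value via the Box-Muller transform."""
    u1, u2 = random.random(), random.random()
    return mu + sigma * (math.sqrt(-2.0 * math.log(1.0 - u1))
                         * math.cos(2.0 * math.pi * u2))

sample = [randexp(10.0) for _ in range(10000)]
print(sum(sample) / len(sample))   # close to the requested mean of 10
```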
And if you want to see what the distribution looks like, a small script that draws a character-based histogram is available:
histogram.sql
To generate a string of random characters (which can be useful for password generation), you have this script:
randstring.sql
job_ref.sql shows how to generate job names that follow a predefined distribution.
To generate American names that are more realistic than those of all the people who email you to tout various drugs, anatomical enhancement devices, and fake watches, I suggest you go to http://www.census.gov/genealogy/www/ and click the link that will bring you to a list of frequently occurring surnames. You should find there a spreadsheet that contains the top 1,000 names in the United States. Download it and save it as a .csv file; my scripts assume that a semicolon is used as the separator.
Once you have the data, create a table to load it:
name_ref.sql
Both the MySQL script and the SQL Server script (which uses an ancillary .xml file) create the table, load and normalize the data, and create an index.
For Oracle, name_ref.sql just creates the table. You must load the data using name_ref.ctl, which is a control file you can use with SQL*Loader as follows:
sqlldr <username>/<password> control=name_ref.ctl
Then, still with Oracle, run name_ref2.sql to normalize the data and index the table.
get_ten_names.sql generates 10 random American surnames.
For Oracle, there are three additional scripts: dept_ref.sql allows you to randomly pick a department number for the emp table; gen_emp.sql is a pure SQL generator of rows for the emp table that fails to get truly random data; and gen_emp_pl.sql is a PL/SQL version that works fine.
For the generation of random text, please see Appendix B.
check (MySQL and Oracle) is a bash script that runs the output of an arbitrary query through the md5sum program.
qrysum.sql is a function (Oracle) or procedure (MySQL and SQL Server) that checksums the output of an arbitrary query. The script qrysum_complicated.sql does the same for Oracle, but applies successive checksums to buffers. It is mostly interesting as a wild PL/SQL example of dynamic SQL.
qrysum.sql
qrysum_complicated.sql (Oracle)
There is no script sample for this chapter. Examples in this chapter come from real cases.
Most of the examples in this chapter also come from real cases, but I wrote some of them for this book.
The SQL Server directory contains the files for the example of unlocking accounts after payment. Two functions were written in Visual Basic—one that strictly follows specifications and another that performs the same task with a more clever use of SQL:
create_tables.sql (SQL Server)
procs.vb (SQL Server)
create_vb_procs.sql (SQL Server)
There is also a rather complete PHP example written for MySQL, and a configuration file that must be updated with connection data. The first two examples show an incorrect and a correct use of found_rows() (very specific to MySQL):
config.php (MySQL)
sample_bad.php (MySQL)
sample_ok.php (MySQL)
Another series of PHP programs shows how you can avoid counting before fetching data when there is an upper limit on the number of rows that will be displayed:
with_count.php (MySQL)
with_count2.php (MySQL)
without_count.php (MySQL)
The most interesting query in the last program is a query that returns the total number of rows in the result set in the first row. It is provided (as a query on the transactions table that returns more than 500 rows) for all DBMS products (including MySQL):
top500GBP.sql
For Oracle users, a very simple example that uses emp and dept shows why I dislike cursor loops:
cursor.sql (Oracle)
For Chapter 7, there are code samples for MySQL only (however, SQL Server and Oracle users mustn’t forget the Roughbench tool, which is described in Appendix B).
The table that is used in the examples is created by fifo.sql.
The various programs that simulate the processing of messages need the following sources and the accompanying makefile:
common.c
common.h
consumer_inno.c
consumer_lock.c
consumer_naive.c
consumer_not_so_naive.c
makefile
For the test of concurrent inserts, there are two subdirectories, one for each storage engine used. Scripts postfixed with _ai use an auto-increment key, and scripts postfixed with _p use a partitioned table. There are scripts to create the tables, and scripts to insert data, which you must run with Roughbench.
InnoDB
• create_concurrent_insert.sql
• create_concurrent_insert_ai.sql
• create_concurrent_insert_p.sql
• insert_concurrent_insert.sql
• insert_concurrent_insert_ai.sql
• insert_concurrent_insert_p.sql
MyISAM
• create_concurrent_insert.sql
• create_concurrent_insert_ai.sql
• create_concurrent_insert_p.sql
• insert_concurrent_insert.sql
• insert_concurrent_insert_ai.sql
• insert_concurrent_insert_p.sql