Chapter 6. MANAGING LARGE AMOUNTS OF DATA

You will soon find that your webbots are capable of collecting massive amounts of data. The amount of data a simple automated webbot or spider can collect, even if it runs only once a day for several months, is colossal. Since none of us have unlimited storage, managing the quality and volume of the data our programs collect and store becomes very important. In this chapter, I will describe methods to organize the data that your webbots collect and then investigate ways to reduce the size of what you save.

Organizing Data

Organizing the resources that your webbots download requires planning. Whether you employ a well-defined file structure or a relational database, the result should meet the needs of the particular problem your application attempts to solve. For example, if the data is primarily text, is accessed by many people, or is in need of sort or search capability, then you may prefer to store information in a relational database, which addresses these needs. If, on the other hand, you are storing many images, PDFs, or Word documents, you may favor storing files in a structured filesystem. You may even create a hybrid system where a database references media files stored in structured directories.

Naming Conventions

While there is no "correct" way to organize data, there are many bad ways to store the data webbots generate. Most mistakes arise from assigning non-descriptive or confusing names to the data your webbots collect. For this reason, your designs must incorporate naming conventions that uniquely identify files, directories, and database properties. Define names for things early, during your planning stages, as opposed to naming things as you go along. Always name in a way that allows your data structure to grow. For example, a real estate webbot that refers to properties as houses may be difficult to maintain if your application later expands to include raw land, offices, or businesses. Updating names for your data can become tedious, since your code and documentation will reference those names many times.

Your naming convention can enforce any rules you like, but you should consider the following guidelines:

  • You need to enforce any naming standards with an iron fist, or they will cease to be standards.

  • It's often better to assign names based on the type of thing an object is, rather than what is actually is. For example, in the previous real estate example, it may have been better to name the database table that describes houses properties, so when the scope of the project expands,[20] it can handle a variety of real estate. With this method, if your project grows, you could add another column to the table to describe the type of property. It is always easier to expand data tables than to rename columns.

  • Consider who (or what) will be using your data organization. For example, a directory called Saturday_January_23 might be easy for a person to read, but a directory called 0123 might be a better choice if a computer accesses its contents. Sequential numbers are easier for computer programs to interpret.

  • Define the format of your names. People will often use compound words and separate the word with underscores for readability, as in name_first. Other times, people separate compound words with case, as in nameFirst; this is commonly referred to as CamelCase. These format definitions should include things like case, language, and parts of speech. For example, if you decide to separate terms with underscores, you shouldn't use CamelCase to name other terms later. It's very common for developers to use different standards to help identify differences between functions, data variables, and objects.

  • If you give members of a certain group labels that are all the same part of speech, don't occasionally throw in a label with another grammatical form. For example, if you have a group of directories named with nouns, don't name another directory in the same group with a verb—and if you do, chances are it probably doesn't belong in that group of things in the first place.

  • If you are naming files in a directory, you may want to give the files names that will later facilitate easy grouping or sorting. For example, if you are using a filename that defines a date, filenames with the format year_month_day will make more sense when sorted than filenames with the format month_day_year. This is because year, month, and day is a sequential progression from largest to smallest and will accurately reflect order when sorted.

Storing Data in Structured Files

To successfully store files in a structured series of directories, you need to find out what the files have in common. In most cases, the problem you're trying to solve and the means for retrieving the data will dictate the common factors among your files. Figuratively, you need to look for the lowest common denominator for all your files. Figure 6-1 shows a file structure for storing data retrieved by a webbot that runs once a day. Its common theme is time.

Example of a structured filesystem primarily based on dates

Figure 6-1. Example of a structured filesystem primarily based on dates

With the structure defined in Figure 6-1, you could easily locate thumbnail images created by the webbot on February 3, 2006 because the folders comply with the following specification:

drive:projectyearmonthdaycategorysubcategoryfiles

Therefore, the path would look like this:

c:Spider_files200623GraphicsThumbnails

People may easily decipher this structure, and so will programs, which need to determine the correct file path programmatically. Figure 6-2 shows another file structure, primarily based on geography.

A geographically themed example of a structured filesystem

Figure 6-2. A geographically themed example of a structured filesystem

Ensure that all files have a unique path and that either a person or a computer can easily make sense of these paths.

File structures, like the ones shown in the previous figures, are commonly created by webbots. You'll see how to write webbots that create file structures in Chapter 8.

Storing Text in a Database

While many applications call for file structures similar to the ones shown in Figure 6-1 or Figure 6-2, the majority of projects you're likely to encounter will require that data is stored in a database. A database has many advantages over a file structure. The primary advantage is the ability to query or make requests from the database with a query language called Structured Query Language or SQL (pronounced SEE-quill). SQL allows programs to sort, extract, update, combine, insert, and manipulate data in nearly any imaginable way.

It is not within the scope of this book to teach SQL, but this book does include the LIB_mysql library, which simplifies using SQL with the open source database called MySQL[21] (pronounced my-esk-kew-el).

LIB_mysql

LIB_mysql consists of a few server configurations and three functions, which should handle most of your database needs. These functions act as abstractions or simplifications of the actual interface to the program. Abstractions are important because they allow access to a wide variety of database functions with a common interface and error-reporting method. They also allow you to use a database other than MySQL by creating a similar library for a new database. For example, if you choose to use another database someday, you could write abstractions with the same function names used in LIB_mysql. In this way, you can make the code in this book work with Oracle, SQL Server, or any other database without modifying any scripts.

The source code for LIB_mysql is available from this book's website. There are other fine database abstractions available from projects like PEAR and PECL; however, the examples in this book use LIB_mysql.

Listing 6-1 shows the configuration area of LIB_mysql. You should configure this area for your specific MySQL installation before you use it.

MySQL Constants (scope = global)

define("MYSQL_ADDRESS", "localhost"); // Define IP address of your MySQL Server
define("MYSQL_USERNAME", "");         // Define your MySQL username
define("MYSQL_PASSWORD", "");         // Define your MySQL password
define("DATABASE", "");               // Define your default database

Listing 6-1: LIB_mysql server configurations

As shown in Listing 6-1, the configuration section provides an opportunity to define where your MySQL server resides and the credentials needed to access it. The configuration section also defines a constant, "DATABASE", which you may use to define the default database for your project.

There are three functions in LIB_mysql that facilitate the following:

  • Inserting data into the database

  • Updating data already in the database

  • Executing a raw SQL query

Each function uses a similar interface, and each provides error reporting if you request an erroneous query.

The insert() Function

The insert() function in LIB_mysql simplifies the process of inserting a new entry into a database by passing the new data in a keyed array. For example, if you have a table like the one in Figure 6-3, you can insert another row of data with the script in Listing 6-2, making it look like the table in Figure 6-4.

Example table people before the insert()

Figure 6-3. Example table people before the insert()

$data_array['NAME']  = "Jill Monroe";
$data_array['CITY']  = "Irvine";
$data_array['STATE'] = "CA";
$data_array['ZIP']   = "55410";
insert(DATABASE, $table="people", $data_array);

Listing 6-2: Example of using insert()

Example table people after executing the insert() in Listing 6-2

Figure 6-4. Example table people after executing the insert() in Listing 6-2

The update() Function

Alternately, you can use update() to update the record you just inserted with the script in Listing 6-3, which changes the ZIP code for the record.

$data_array['NAME']  = "Jill Monroe";
$data_array['CITY']  = "Irvine";
$data_array['STATE'] = "CA";
$data_array['ZIP']   = "92604";
update(DATABASE, $table="people", $data_array, $key_column="ID", $id="3");

Listing 6-3: Example script for updating data in a table

Running the script in Listing 6-3 changes values in the table, as shown in Figure 6-5.

Example table people after updating ZIP codes with the script in Listing 6-3

Figure 6-5. Example table people after updating ZIP codes with the script in Listing 6-3

The exe_sql() Function

For database functions other than inserting or updating records, LIB_mysql provides the exe_sql() function, which executes a SQL query against the database. This function is particularly useful for extracting data with complex queries or for deleting records, altering tables, or anything else you can do with SQL. Table 6-1 shows various uses for this function.

Table 6-1. Example Usage Scenarios for the LIB_mysql_exe_sql() Function

Instruction

Result

$array = exe_sql(DATABASE, "select *

$array[1]['ID']="1";

from people");

$array[1]['NAME']="Kelly Garrett";

 

$array[1]['CITY']="Culver City";

 

$array[1]['STATE']="CA";

 

$array[1]['ZIP']="90232";

 

$array[2]['ID']="2";

 

$array[2]['NAME']="Sabrina Duncan";

 

$array[2]['CITY']="Anaheim";

 

$array[2]['STATE']="CA";

 

$array[2]['ZIP']="92812";

 

$array[3]['ID']="3";

 

$array[3]['NAME']="Jill Monroe";

 

$array[3]['CITY']="Irvine";

 

$array[3]['STATE']="CA";

 

$array[3]['ZIP']="92604";

$array = exe_sql(DATABASE, "select * from people where ID='2'");

$array['ID']="2";

 

$array['NAME']="Sabrina Duncan";

 

$array['CITY']="Anaheim";

 

$array['STATE']="CA";

 

$array['ZIP']="92604";

List($name)= exe_sql(DATABASE, "select NAME from people where ID='2'");

$name = "Sabrina Duncan";

exe_sql(DATABASE, "delete from people where ID='2'");

Deletes row 3 from table

Please note that if exe_sql() is fetching data from the database, it will always return an array of data. If the query returns multiple rows of data, you'll get a multidimensional array. Otherwise, a single-dimensional array is returned.

Storing Images in a Database

It is usually better to store images in a file structure and then refer to the paths of the images in the database, but occasionally you may find the need to store images as blobs, or large unstructured pieces of data, directly in a database. These occasions may arise when you don't have the requisite system permissions to create a file. For example, many web administrators do not allow their webservers to create files, as a security measure. To store an image in a database, set the typecasting or variable type for the image to blob or large blob and insert the data, as shown in Listing 6-4.

$data_array['IMAGE_ID'] = 6;
$data_array['IMAGE'] = base64_encode(file_get_contents($file_path));
insert(DATABASE, $table, $data_array);

Listing 6-4: Storing an image directly in a database record

When you store a binary file, like an image, in a database, you should base64-encode the data first. Since the database assumes text or numeric data, this precaution ensures that no bit combinations will cause internal errors in the database. If you don't do this, you take the risk that some odd bit combination in the image will be interpreted as an unintended database command or special character.

Since images are—or should be—base64 encoded, you need to decode the images before you can reuse them. The script in Listing 6-5 shows how to display an image stored in a database record.

<!— Display an image stored in a database where the image ID is 6 —>
<img src="show_image.php?img_id=6">

Listing 6-5: HTML that displays an image stored in a database

Listing 6-6 shows the code to extract, decode, and present the image.

<?
# Get needed database library
include("LIB_mysql.php");

# Convert the variable on the URL to a new variable
$image_id=$_GET['img_id'];

# Get the base64-encoded image from the database
$sql = "select IMAGE from table where IMAGE_ID='".$image_id."'";
list($img) = exe_sql (DATABASE, $sql);

# Decode the image and send it as a file to the requester
header("Content-type: image/jpeg");
echo base64_decode($img);
exit;
?>

Listing 6-6: Script to query, decode, and create an image from an image record in a database

When an image tag is used in this fashion, the image src attribute is actually a function that pulls the image from the database before is sends it to the waiting web agent. This function knows which image to send because it is referenced in the query of the src attribute. In this case, that record is img_id, which corresponds with the table column IMAGE_ID. The program show_image.php actually creates a new image file each time it is executed.

Database or File?

Your decision to store information in a database or as files in a directory structure is largely dependent on your application, but because of the advantages that SQL brings to data storage, I often use databases. The one common exception to this rule is images files, which (as previously mentioned) are usually more efficiently stored as files in a directory. Nevertheless, when files are stored in local directories, it is often convenient to identify the physical address of the file you saved in a database.



[20] Projects always expand in scope.

[21] More information about MySQL is available at http://www.mysql.com and http://www.php.net.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.86.60