External tables: This is a feature that permits access to operating system files as if they were database tables and even allows for the creation of operating system files as extracts of tables.
SQL*Loader (pronounced “sequel loader” and hereafter referred to as SQLLDR): This is Oracle’s historical data loader that is still a common method for loading data.
Data Pump unload: Data Pump is a binary format proprietary to Oracle and accessible via the Data Pump tool and external tables.
Flat file unload: The flat file unloads will use SQL*Plus and PL/SQL. These will provide you with an output file (e.g., CSV file) that is portable to other types of systems such as a spreadsheet.
External Tables
External tables allow us to treat an operating system file as if it were a read-only database table. They are not intended to be a replacement for a “real” table; rather, they are intended to be used as a tool to ease the loading and unloading of data.
When the external table feature was first unveiled, I often referred to it as “the replacement for SQLLDR.” This idea still holds true—most of the time. Having said this, you might wonder why there is material in this chapter that covers SQLLDR. The reason is that SQLLDR has been around for a long time, and there are many, many legacy control files lying around. SQLLDR is still a commonly used tool; it is what many people know and have used. We are still in a period of transition from the use of SQLLDR to external tables; thus, SQLLDR is still very relevant.
What many DBAs don’t realize is that their knowledge of SQLLDR control files is readily transferable to the use of external tables. You’ll discover, as we work through the examples in this part of the chapter, that external tables incorporate much of the SQLLDR syntax and many of the techniques.
That said, there are times when SQLLDR is the better choice—specifically, when:
You have to load data over a network—in other words, when the input file is not on the database server itself. One of the restrictions of external tables is that the input file must be accessible on the database server.
Multiple users must concurrently work with the same external table processing different input files.
In just about every other case, external tables are the way to go. They offer capabilities SQLLDR lacks:
The ability to use complex WHERE conditions to selectively load data. SQLLDR has a WHEN clause to select rows to load, but you are limited to using only AND expressions and expressions using equality—no ranges (greater than, less than), no OR expressions, no IS NULL, and so on.
The ability to MERGE data. You can take an operating system file full of data and update existing database records from it (see the sketch following this list).
The ability to perform efficient code lookups. You can join an external table to other database tables as part of your load process.
The ability to load data sorted, by including an ORDER BY clause in the SELECT portion of the CREATE TABLE AS SELECT or INSERT statement used to load.
Easier multitable inserts using INSERT. An INSERT statement can insert into one or more tables using complex WHEN conditions. While SQLLDR can load into multiple tables, it can be quite complex to formulate the syntax.
The ability to specify one or more operating system commands to be executed as the first step (preprocess) when selecting data from an external table.
A shallower learning curve for new developers. SQLLDR is yet another tool to learn, in addition to the programming language, the development tools, the SQL language, and so on. As long as a developer knows SQL, they can immediately apply that knowledge to bulk data loading, without having to learn a new tool (SQLLDR).
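As an example of the MERGE capability mentioned above, here is a minimal sketch; the EXTERNAL_DEPT external table and the DEPT column names are assumptions on my part:

merge into dept d
using external_dept ed
on ( d.deptno = ed.deptno )
when matched then
    update set d.dname = ed.dname, d.loc = ed.loc
when not matched then
    insert ( deptno, dname, loc )
    values ( ed.deptno, ed.dname, ed.loc );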
So, with that in mind, let’s look at how to use external tables.
Introduction to External Tables
1. Create a directory object.
2. Place a formatted operating system file (e.g., a CSV file) in the location defined in the directory object from step 1.
3. Run a CREATE TABLE statement with the ORGANIZATION EXTERNAL clause.
In this manner, you can easily query data in a text file as if it were a table object in the database.
Similarly, you can format and transform character data—converting a string into the DATE datatype, for example—as part of the external table definition.
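Here is a minimal sketch of the three steps, assuming a comma-separated file dept.csv in /tmp and hypothetical table and column names; the DATE_FORMAT clause demonstrates the character-to-DATE conversion:

create or replace directory data_dir as '/tmp/';

create table ext_dept
( deptno   number,
  dname    varchar2(14),
  created  date
)
organization external
( type oracle_loader
  default directory data_dir
  access parameters
  ( records delimited by newline
    fields terminated by ','
    ( deptno,
      dname,
      created char(20) date_format date mask "dd-mon-yyyy"
    )
  )
  location ( 'dept.csv' )
)
/

select * from ext_dept;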
Creating an External Table Using a Legacy SQLLDR Control File
SQLLDR’s EXTERNAL_TABLE command-line parameter accepts one of three values:
NOT_USED: This is self-evident in meaning and is the default value.
EXECUTE: This value means that SQLLDR will not generate a SQL INSERT statement and execute it. Rather, it will create an external table and use a single bulk SQL statement to load it.
GENERATE_ONLY: This value causes SQLLDR to not actually load any data, but only to generate the SQL DDL and DML statements it would have executed into the log file it creates.
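For example, an invocation along these lines (the credentials are placeholders) generates the statements without executing them:

$ sqlldr scott/tiger control=demo1.ctl external_table=generate_only

The log file then contains DDL of roughly this shape—an abbreviated sketch; the names Oracle generates will differ:

CREATE TABLE "SYS_SQLLDR_X_EXT_DEPT"
( "DEPTNO" NUMBER(2),
  "DNAME"  VARCHAR2(14),
  "LOC"    VARCHAR2(13)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY SYS_SQLLDR_XT_TMPDIR_00000
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE
    BADFILE 'SYS_SQLLDR_XT_TMPDIR_00000':'demo.bad'
    LOGFILE 'demo.log_xt'
    READSIZE 1048576
    SKIP 6
    FIELDS TERMINATED BY "," LDRTRIM
    REJECT ROWS WITH ALL NULL FIELDS
    ( "DEPTNO" CHAR(255) TERMINATED BY ",",
      "DNAME"  CHAR(255) TERMINATED BY ",",
      "LOC"    CHAR(255) TERMINATED BY ","
    )
  )
  LOCATION ( 'demo1.ctl' )
)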
The ORGANIZATION EXTERNAL clause tells Oracle this is not a “normal” table. We saw this clause before in Chapter 10 when we looked at IOTs. Currently, there are three organization types: HEAP for a normal table, INDEX for an IOT, and EXTERNAL for an external table. The rest of the text starts to tell Oracle more about the external table. The ORACLE_LOADER type is one of two supported types. The other type is ORACLE_DATAPUMP, the proprietary Data Pump format. We will take a look at that type in a subsequent section on data unloading—it is a format that can be used to both load and unload data. An external table may be used both to create a Data Pump format file and to subsequently read it.
RECORDS: Records are terminated by newlines by default, as they are for SQLLDR.
BADFILE: There is a bad file (a file where records that fail processing are recorded to) set up in the directory we just created.
LOGFILE: There is a log file that is equivalent to SQLLDR’s log file set up in the current working directory.
READSIZE: This is the default buffer used by Oracle to read the input datafile. It is 1MB in this case. This memory comes from the PGA in dedicated server mode and the SGA in shared server mode, and it is used to buffer the information in the input datafile for a session (refer to Chapter 4, where we discussed PGA and SGA memory). Keep that shared server fact in mind if you’re using shared servers: the memory is allocated from the SGA.
SKIP 6: This determines how many records in the input file should be skipped. You might be asking, “Why ‘skip 6’?” Well, we used INFILE * in this example; SKIP 6 is used to skip over the control file itself to get to the embedded data. If we did not use INFILE *, there would be no SKIP clause at all.
FIELDS TERMINATED BY: This is just as we used in the control file itself. However, the external table did add LDRTRIM, which stands for LoaDeR TRIM. This is a trim mode that emulates the way in which SQLLDR trims data by default. Other options include LRTRIM, LTRIM, and RTRIM (for left/right trimming of whitespace) and NOTRIM to preserve all leading/trailing whitespace.
REJECT ROWS WITH ALL NULL FIELDS: This causes the external table to log to the bad file any entirely blank lines and to not load that row.
The column definitions themselves: This is the metadata about the expected input data values. They are all character strings in the datafile to be loaded, they can be up to 255 characters in length (SQLLDR’s default size), they are terminated by a comma, and they may optionally be enclosed in quotes.
For a comprehensive list of all options available to you when using external tables, review the Oracle Utilities Guide. This reference contains a section dedicated to external tables. The Oracle SQL Language Reference Guide provides the basic syntax, but not the details of the ACCESS PARAMETERS section.
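The GENERATE_ONLY log also contains the INSERT statement SQLLDR would have executed—along these lines (a sketch; the generated external table name will differ):

INSERT /*+ append */ INTO DEPT ( DEPTNO, DNAME, LOC )
SELECT DEPTNO, DNAME, LOC
  FROM SYS_SQLLDR_X_EXT_DEPT;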
This would perform the logical equivalent of a direct path load if possible (assuming the APPEND hint may be obeyed; the existence of triggers or foreign key constraints may prevent the direct path operation from taking place).
Bear in mind that external tables are processed by the database server, so the operating system account under which Oracle runs (the Oracle software owner) must be able to do the following:
Read the file we are pointing to. In UNIX/Linux, that means the Oracle software owner must have read and execute permissions on all directory paths leading to the file. In Windows, the Oracle software owner must be able to read that file.
Write to the directories where the log file will be written to (or bypass the generation of the log file altogether, but this is not generally recommended). In fact, if the log file already exists, the Oracle software owner must be able to write to the existing file.
Write to any of the bad files we have specified, just like the log file.
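In the demonstration that follows, the directory permissions were opened wide with something like this (a hypothetical command and path):

$ chmod a+rwx /tmp/load_dir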
This command actually gives everyone the ability to write into our directory! This is just a demonstration; normally, we would use a special directory—perhaps owned by the Oracle software owner itself—to do this.
You can see that this time I accessed the file, I successfully loaded four rows, and the log file was created and in fact is owned by “oracle,” not by my operating system account.
Dealing with Errors
In a perfect world, there would be no errors. The data in the input file would be perfect, and it would all load correctly. That almost never happens. So, how can we track errors with this process?
The most common method is to use the BADFILE option. Here, Oracle will record all records that failed processing. For example, if our control file contained a record with DEPTNO 'ABC', that record would fail and end up in the bad file because 'ABC' cannot be converted into a number. We’ll demonstrate that in the following example.
This is just a table that can read any file without failing on a datatype error, as long as the lines in the file consist of fewer than 12,000 characters. If they are longer than 12,000 characters, then we can simply add more text columns to accommodate them.
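A sketch of such a “catch-all” table, with hypothetical names—three VARCHAR2(4000) columns give the 12,000-character ceiling just mentioned:

create table et
( text1 varchar2(4000),
  text2 varchar2(4000),
  text3 varchar2(4000)
)
organization external
( type oracle_loader
  default directory data_dir
  access parameters
  ( records delimited by newline
    fields
    missing field values are null
    ( text1 position(1:4000),
      text2 position(4001:8000),
      text3 position(8001:12000)
    )
  )
  location ( 'anything.txt' )
)
/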
A COUNT(*) could tell us how many records were rejected. Another external table created on the log file associated with this external table could tell us why the record was rejected. We would need to go one step further to make this a repeatable process, however. The reason is that the bad file is not blanked out if there were no errors in our use of the external table. So, if there were some preexisting bad file with data in it and our external table generated no errors, we would be misled into thinking there were errors.
Use UTL_FILE and reset the bad file—truncate it, in effect, by simply opening it for write and closing it (see the sketch following this list).
Use UTL_FILE to rename any preexisting bad files, preserving their contents, but allowing us to create a new one.
Incorporate the PID into the bad (and log) file names. We’ll demonstrate this later in the “Multiuser Issues” section.
Manually use operating system commands to resolve issues (like renaming the file, removing it, and so on).
In that fashion, we’ll be able to tell if the bad records in the bad file were generated by us just recently or if they were left over from some older version of the file itself and are not meaningful.
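A minimal sketch of the UTL_FILE reset approach, assuming a directory object DATA_DIR and a bad file named demo.bad:

declare
    l_bad utl_file.file_type;
begin
    -- opening in write mode truncates any preexisting contents
    l_bad := utl_file.fopen( 'DATA_DIR', 'demo.bad', 'w' );
    utl_file.fclose( l_bad );
end;
/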
The COUNT(*) earlier in this section made me think about a feature that allows you to optimize external table access by only accessing the fields in the external file that are referenced in the query. That is, if the external table is defined to have 100 number fields, but you select out only one of them, you can direct Oracle to bypass converting the other 99 strings into numbers. It sounds great, but it can cause a different number of rows to be returned from each query. Suppose the external table has 100 lines of data in it. All of the data for column C1 is “valid” and converts to a number. None of the data for column C2 is “valid,” and it does not convert into a number. If you select C1 from that external table, you’ll get 100 rows back. If you select C2 from that external table, you’ll get 0 rows back.
The DNAME field was valid for every single record in the input file, but the DEPTNO column was not. If we do not retrieve the DEPTNO column, it does not fail the record—the resultset is materially changed.
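This behavior is controlled by an ALTER TABLE clause (the table name ET is hypothetical):

-- the default: every field is processed on every query, so rows that fail
-- any conversion are rejected consistently
alter table et project column all;

-- process only the fields referenced by the query; row counts may then
-- differ from query to query, as described above
alter table et project column referenced;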
Using an External Table to Load Different Files
A common need is to use an external table to load data from differently named files over a period of time. That is, this week we must load file1.dat, and next week it will be file2.dat, and so on. So far, we’ve been loading from a fixed file name, demo1.ctl. What if we subsequently need to load from a second file, demo2.ctl?
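Pointing the external table at a new file is a single DDL statement (the table name here is the generated one from earlier, an assumption):

alter table SYS_SQLLDR_X_EXT_DEPT location( 'demo2.ctl' );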
And that would pretty much be it—the very next query of that external table would have it accessing the file demo2.ctl.
Multiuser Issues
In the introduction to this section, I described situations where external tables might not be as useful as SQLLDR. One of them was a specific multiuser issue. We just saw how to change the location of an external table—how to make it read from file 2 instead of file 1 and so on. The problem arises when multiple users each try to concurrently use that external table and have it point to different files for each session.
This cannot be done. The external table will point to a single file (or set of files) at any given time. If I log in and alter the table to point to file 1 and you do the same at about the same time, and then we both query that table, we’ll both be processing the same file.
If parallel processing is required, then the database already has the built-in ability to do this, as demonstrated in the last chapter.
So the only multiuser issue would be if two sessions both tried to alter the location at about the same time (assuming they had the privilege to ALTER the table)—and this is just a possibility to be aware of, not something I believe you’ll actually run into very often.
To help avoid such collisions, the bad file and log file names may be parameterized with substitution variables:
%p: PID.
%a: Parallel execution servers agent ID. The parallel execution servers have numbers 001, 002, 003, and so on assigned to them.
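For example, in the ACCESS PARAMETERS section (the file names are placeholders):

BADFILE 'demo_%p.bad'
LOGFILE 'demo_%p.log'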
However, you still might have issues over lengthy periods of time. The PIDs will be reused on most operating systems. So the techniques outlined in dealing with errors may well still be relevant—you’ll need to reset your bad file or rename it if it exists if you determine this to be an issue.
Preprocessing
Preprocessing is an external table feature that allows you to execute one or more operating system commands as the first step when selecting from an external table. You invoke preprocessing by specifying the PREPROCESSOR clause. The argument to this clause can be an OS command or a shell script.
Dynamically display the output of an OS command (or a combination of commands) by selecting from an external table
Search for files and filter columns and/or rows before displaying the data
Process and modify the contents of a file before returning the data
Examples of these scenarios follow.
Monitoring the File System Through SQL
The requirement is that all datafiles must be capable of growing by at least 20 percent of their current size; so if, for example, ts_A_file01.dbf currently is 100GB and ts_B_file01.dbf is 200GB, we must ensure that at least 20GB + 40GB = 60GB is free in the /u01/oradata file system.
The question is: How can we monitor this in a single query inside the database? Right now, we have a complex script gathering free space from the df command in a text file, opening a cursor, calculating the current allocated space from DBA_DATA_FILES, and reading the df data via an external table.
This can be done in a single SQL query. To accomplish this, I’ll need to be able to query disk free (df) interactively—without a complex set of maneuvers such as running a script or redirecting output. I’ll start by making it so that the df output can be queried as if it were in a table. I’ll use the preprocessor directive to achieve this.
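A sketch of that setup follows; the directory object name (EXEC_DIR), the script name, and the column layout are assumptions on my part. The script simply emits df output for the external table to parse:

#!/bin/bash
# run_df.bsh -- emit local file system usage in POSIX format
/bin/df -Pl

And the external table treats the script’s standard output as its “datafile”:

create table df
( fsname   varchar2(100),
  blocks   number,
  used     number,
  avail    number,
  capacity varchar2(10),
  mount    varchar2(100)
)
organization external
( type oracle_loader
  default directory exec_dir
  access parameters
  ( records delimited by newline
    preprocessor exec_dir:'run_df.bsh'
    skip 1
    fields terminated by whitespace ldrtrim
  )
  location ( 'run_df.bsh' )
)
/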
Note that in the run_df.bsh script , I used explicit pathnames to run df; I did not rely on the environment and on the path environment variable in particular. This is very important: when coding scripts for external tables—when coding scripts, in general—you always want to use explicit paths to run the program you actually intend to run. You don’t have any real control over the environment this script will run in, so relying on the environment being set a certain way is a recipe for disaster.
Hopefully, you see how this approach could work easily for ps, ls, du, and so on—all the UNIX/Linux utilities could easily be considered “tables” now!
Cause: The query that is executing requires that a datasource for an external table be read multiple times. However, the datasource is a sequential device that cannot be reread. Examples of this type of datasource are a tape or a pipe.
Action: There are a few ways around this problem. One is to rewrite the query so that the external table is referenced only once. Another option is to move the datasource to a rereadable device such as a disk file. A third option is to load the data for the external table into a temporary table and change the query so that it references the temporary table.
On lines 27–30, I join DBA_DATA_FILES to df data with a WHERE clause, using LIKE. This will join every file in DBA_DATA_FILES to every possible mount point in the df output. I know that the goal, however, is to find the “longest” matching mount point, so to accomplish that I assign—on lines 24–26—a ROW_NUMBER to each row. This ROW_NUMBER will be sequentially assigned to each duplicated row in DBA_DATA_FILES, so if the FILE_NAME matches more than one MOUNT, each FILE_NAME occurrence will be assigned a unique, increasing ROW_NUMBER value. This ROW_NUMBER will be assigned after the data is sorted by the length of the MOUNT, from big to small.
Once I have that data, I apply a WHERE clause to save only the first entry for each FILE_NAME value—that predicate is WHERE rn = 1, on line 32. At the same time, I’ve added another column—TOT_MBYTES—on lines 18–19. This will enable me to verify the 20 percent threshold.
The last step is to format and output the data. I print the columns I’m interested in and add a CASE statement on lines 11–15 to verify that 20 percent of the total bytes of storage allocated on a given mount point does not exceed the remaining available bytes of free storage.
So, now you see how to use external tables to query operating system output such as df, ps, find, and ls. Additionally, you can use them to query anything that writes to standard out, including gunzip, sed, and so on.
Reading and Filtering Compressed Files
When working with large data loads, it’s common to use compressed files. Normally, when working with compressed files as part of the data load processing, you would have an initial step that uncompresses the file and then another step to load it into staging tables. With external tables, you can streamline this by instructing the external table to uncompress the data as it is read from the compressed file.
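A hypothetical preprocessor script along these lines does the work; it assumes the directory to search arrives as the script’s first argument (i.e., the LOCATION entry names that directory), and it uses explicit paths per the earlier advice:

#!/bin/bash
# uncompress.bsh -- find .gz files under the given directory, stream their
# uncompressed contents, and keep only the first comma-separated column
/usr/bin/find "$1" -name "*.gz" -exec /bin/zcat {} \; | /bin/cut -d, -f1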
The prior script will search starting in a directory that gets passed into the script; it will search the base directory and all subdirectories underneath it for files with the extension .gz. And then for each file that is found, zcat is used to view uncompressed data. Finally, cut displays just the first column of data.
This is not a typical use of an external table, but it illustrates what is possible with the preprocessor option.
Trimming Characters Out of a File
1. Save the spreadsheet on a Windows laptop as a CSV file.
2. Copy the CSV file to a secure server configured specifically for file transfers to the production server, and then copy the CSV file to the production server.
3. Use an OS utility to remove hidden DOS characters that are embedded in the CSV file.
4. As the Oracle user, create an external table based on the CSV file.
5. Use SQL to insert into a production table by selecting from the external table.
Step 3 of the preceding list is what I’ll focus on here because this is where the preprocessor eliminates the need for that step (as for the other steps, the DBA will have to figure something else out long term).
Also consider using the dos2unix utility to remove unwanted characters when transferring from Windows/DOS to UNIX/Linux.
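A sketch of the preprocessor script follows; the name run_sed.bsh comes from the discussion below, and the ^M in it is a single literal carriage-return character, not a caret followed by an M:

#!/bin/bash
# run_sed.bsh -- strip carriage returns from the incoming CSV before parsing
/bin/sed -e 's/^M//g' "$1"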
The ^M character in the run_sed.bsh script is placed in there by pressing Ctrl+V and then Ctrl+M (or instead of Ctrl+M, you can press the Return key here); you can’t simply type in ^ and then an M. It needs to be the special character ^M.
If the hidden ^M characters had not been removed, the length of COL2 would have been at least 1 byte longer than the character data within the column (not to mention we could have had some surprises with searches and joins).
Preprocessing Wrap-Up
The prior examples show the power of the external table PREPROCESSOR directive. Sure, you could achieve the same results without preprocessing, but you would have more steps, more code to maintain, and more places where things can break.
Specific examples were chosen to illustrate the power and flexibility of preprocessing. In the first example, the file operated on was a shell script; there was no datafile. The output of the OS command was the data returned when selecting from the external table.
In the second example, a compressed file was operated on by the preprocessor script. This provided a method for on-the-fly uncompressing of the contents and then additional filtering of columns and/or rows with commands like find, cut, and zgrep.
The third example showed a script operating on a CSV file to remove unwanted characters by using sed before displaying the data. This is a common need when transferring files to and from DOS and UNIX/Linux platforms.
These three examples demonstrate distinctive ways to use the PREPROCESSOR directive. This provides a base knowledge of the possible uses and allows you to build on this to simplify your loading requirements.
External Table Summary
In this section, we explored external tables. They are a feature available with Oracle 9i and later that may for the most part replace SQLLDR. We investigated the quickest way to get going with external tables: the technique of using SQLLDR to convert the control files we have from past experiences. We demonstrated some techniques for detecting and handling errors via the bad files, and we also explored some multiuser issues regarding external tables. Lastly, we demonstrated preprocessing techniques to execute operating system commands as a first step when selecting from an external table.
We are now ready to get into the next section in this chapter, which deals with unloading data from the database.
Data Pump Unload
Oracle has the ability to extract data (from a table) in a proprietary binary format known as Data Pump format, which is the same format used by the EXPDP and IMPDP tools provided by Oracle to move data from database to database. Oracle uses external tables and the CREATE TABLE statement to perform this data unloading function.
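For example (the directory object TMP and the file name are assumptions), a single CREATE TABLE AS SELECT performs the unload:

create table all_objects_unload
organization external
( type oracle_datapump
  default directory tmp
  location ( 'allobjects.dat' )
)
as
select * from all_objects
/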
I purposely chose the ALL_OBJECTS view because it is a quite complex view with lots of joins and predicates. This shows you can use this Data Pump unload technique to extract arbitrary data from your database. We could add predicates or whatever else we wanted to extract a slice of data.
From a security perspective, this capability does make it rather easy for someone with access to the information to take the information elsewhere. You need to control access to the set of people who have the ability to create DIRECTORY objects and write to them and who have the necessary access to the physical server to get the unloaded data.
You can create compressed and encrypted dump files. These features require the Enterprise Edition of Oracle, as well as the Advanced Compression option (for compression) and the Advanced Security option (for encryption).
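Assuming those options are licensed, the access parameters would look something like this (a sketch):

create table all_objects_unload2
organization external
( type oracle_datapump
  default directory tmp
  access parameters ( compression enabled encryption enabled )
  location ( 'allobjects2.dat' )
)
as
select * from all_objects
/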
SQLLDR
SQLLDR loads data in one of two modes:
Conventional path: SQLLDR will employ SQL inserts on our behalf to load data.
Direct path: SQLLDR does not use SQL in this mode; it formats database blocks directly.
The direct path load allows you to read data from a flat file and write it directly to formatted database blocks, bypassing the entire SQL engine, undo generation and, optionally, redo generation at the same time. Parallel direct path load is among the fastest ways to go from having no data to a fully loaded database.
It should be noted that the Oracle Call Interface (OCI) allows you to write your own direct path loader using C. This is useful when the operation you want to perform is not feasible in SQLLDR or when seamless integration with your application is desired. SQLLDR is a command-line tool (i.e., it’s a separate program). It is not an API or anything that can be “called from PL/SQL,” for example.
To use SQLLDR, you will need a control file. A SQLLDR control file simply contains information describing the input data—its layout, datatypes, and so on—as well as information about the target table(s).
Don’t confuse a SQLLDR control file with a database control file. Recall from Chapter 3 that a database control file is a small binary file that stores a directory of the files Oracle requires along with other information such as checkpoint data, the name of the database, and so on.
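The numbered callouts below refer to a control file of this shape—a reconstruction, with illustrative DEPT data:

LOAD DATA
INFILE *
INTO TABLE DEPT
FIELDS TERMINATED BY ','
(DEPTNO, DNAME, LOC)
BEGINDATA
10,Sales,Virginia
20,Accounting,Virginia
30,Consulting,Virginia
40,Finance,Virginia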
LOAD DATA (1): This tells SQLLDR what to do (in this case, load data). The other thing SQLLDR can do is CONTINUE_LOAD to resume a load. You would use this latter option only when continuing a multitable direct path load.
INFILE * (2): This tells SQLLDR the data to be loaded is actually contained within the control file itself, as shown on lines 7 through 10. Alternatively, you could specify the name of another file that contains the data. You can override this INFILE statement using a command-line parameter if you wish. Be aware that command-line options override control file settings.
INTO TABLE DEPT (3): This tells SQLLDR to which table you are loading data (in this case, the DEPT table).
FIELDS TERMINATED BY ',' (4): This tells SQLLDR that the data will be in the form of comma-separated values. There are dozens of ways to describe the input data to SQLLDR; this is just one of the more common methods.
(DEPTNO, DNAME, LOC) (5): This tells SQLLDR what columns you are loading, their order in the input data, and their datatypes. The datatypes are for the data in the input stream, not the datatypes in the database. In this case, they are defaulting to CHAR(255), which is sufficient.
BEGINDATA (6): This tells SQLLDR you have finished describing the input data and that the very next lines, lines 7–10, are the actual data to be loaded into the DEPT table.
This is because we allowed almost everything in the control file to default, and the default load option is INSERT (as opposed to APPEND, TRUNCATE, or REPLACE ). To INSERT, SQLLDR assumes the table is empty. If we wanted to add records to the DEPT table, we could have specified APPEND; to replace the data in the DEPT table, we could have used REPLACE or TRUNCATE. REPLACE uses a conventional DELETE FROM TABLE statement to remove records; hence, if the table to be loaded into already contains many records, it could be quite slow to perform. TRUNCATE uses the TRUNCATE SQL command and is typically faster, as it does not have to physically remove each row.
The log file tells us about many of the aspects of our load. We can see the options we used (defaulted or otherwise). We can see how many records were read, how many loaded, and so on. The log files specify the locations of all BAD and DISCARD files. They even tell us how long it took. These log files are crucial for verifying that the load was successful, as well as for diagnosing errors. If the loaded data resulted in SQL errors (i.e., the input data was “bad” and created records in the BAD file), these errors would be recorded here. The information in the log files is largely self-explanatory, so we will not spend any more time on it.
Running SQLLDR in Express Mode
SQLLDR express mode allows you to quickly load data from a CSV file into a regular database table. If the schema you’re using has CREATE ANY DIRECTORY privilege, then express mode will attempt to use an external table to load the data. Otherwise, express mode will use SQLLDR in direct path mode to load the data.
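Express mode is triggered by supplying only a TABLE parameter—no control file (the credentials are placeholders); SQLLDR then looks for a datafile named after the table, dept.dat in this case:

$ sqlldr scott/tiger table=dept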
Since express mode was able to create a directory object, an external table using dept.dat as its datasource is accessed by an INSERT statement to load the data into the DEPT table. After the load is finished, the external table is dropped.
The idea is that when you run SQLLDR in express mode with EXTERNAL_TABLE=GENERATE_ONLY, you can use the contents of the log file to run the SQL statements manually from SQL (if so desired).
For full details of all options available with SQLLDR express mode, refer to the Oracle Database Utilities manual.
Loading Data with SQLLDR FAQs
We will now cover what I have found to be the most frequently asked questions with regard to loading data in an Oracle database using SQLLDR.
Why Do I Receive “exceeds maximum length” in My Log File?
This is due to the fact that the default datatype in SQLLDR for processing an input record is a char(255). If you have any string datatypes in your table that exceed that, you’ll have to explicitly tell SQLLDR that the input record can contain more than 255 characters.
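The fix is a one-word change in the control file—hypothetical table and column names:

LOAD DATA
INFILE *
INTO TABLE t
REPLACE
FIELDS TERMINATED BY ','
( txt char(1000) )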
That’s it! Just tell SQLLDR the maximum width of the field in the input record—in this case, 1000.
How Do I Load Delimited Data?
Delimited data, or data that is separated by some special character and perhaps enclosed in quotes, is the most popular data format for flat files today. On a mainframe, a fixed-length, fixed-format file would probably be the most recognized file format, but on UNIX/Linux and Windows, delimited files are the norm. In this section, we will investigate the popular options used to load delimited data.
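Consider a control file like the following (a sketch using the DEPT table); the two quoted rows in the data produce the results dissected next:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( DEPTNO, DNAME, LOC )
BEGINDATA
10,Sales,"Virginia,USA"
20,Accounting,"Va, ""USA"""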
Virginia,USA in department 10: This results from input data that was "Virginia,USA". This input data field had to be enclosed in quotes to retain the comma as part of the data. Otherwise, the comma would have been treated as the end-of-field marker, and Virginia would have been loaded without the USA text.
Va, "USA": This resulted from input data that was "Va, ""USA""". SQLLDR counted the double occurrence of " as a single occurrence within the enclosed string. To load a string that contains the optional enclosure character, you must ensure the enclosure character is doubled up.
Another popular format is tab-delimited data, and there are two common ways to describe it:
TERMINATED BY X'09' (the tab character using hexadecimal format; in ASCII, 9 is a tab character)
TERMINATED BY WHITESPACE
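The two behave differently. To see the difference, assume the input line below, where \t stands for a literal tab character:

10\t\tSales\tVirginia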
TERMINATED BY WHITESPACE parses the string by looking for the first occurrence of whitespace (tab, blank, or newline), and then it continues until it finds the next non-whitespace character. Hence, when it parsed the data, DEPTNO had 10 assigned to it, the two subsequent tabs were considered as whitespace, Sales was assigned to DNAME, and so on.
Here, once SQLLDR encountered a tab, it output a value. Hence, 10 is assigned to DEPTNO, and DNAME gets NULL since there is no data between the first tab and the next occurrence of a tab. Sales gets assigned to LOC.
This is the intended behavior of TERMINATED BY WHITESPACE and TERMINATED BY <character>. The one that is more appropriate to use will be dictated by the input data and how you need it to be interpreted.
How Do I Load Fixed Format Data?
Often, you have a flat file generated from some external system, and this file is a fixed-length file with positional data. For example, the NAME field is in bytes 1 to 10, the ADDRESS field is in bytes 11 to 35, and so on. We will look at how SQLLDR can import this kind of data for us.
This fixed-width, positional data is the optimal data format for SQLLDR to load. It will be the fastest way to process, as the input data stream is somewhat trivial to parse. SQLLDR will have stored fixed-byte offsets and lengths into data records, and extracting a given field is very simple. If you have an extremely large volume of data to load, converting it to a fixed position format is generally the best approach. The downside to a fixed-width file is, of course, that it can be much larger than a simple, delimited file format.
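A control file for positional data might look like this—a sketch, assuming a DEPT-like table with an extra ENTIRE_LINE column:

LOAD DATA
INFILE *
INTO TABLE DEPT
REPLACE
( DEPTNO      position(1:2),
  DNAME       position(*:16),
  LOC         position(*:29),
  ENTIRE_LINE position(1:29)
)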
The * instructs the control file to pick up where the last field left off. Therefore (*:16) is just the same as (3:16) in this case. Notice that you can mix relative and absolute positions in the control file. Additionally, when using the * notation, you can add to the offset. For example, if DNAME started 2 bytes after the end of DEPTNO, we could have used (*+2:16). In this example, the effect would be identical to using (5:16).
Here, we had to tell SQLLDR only where the first field begins and its length. Each subsequent field starts where the last one left off and continues for a specified length. It is not until the last field that we have to specify a position again, since this field goes back to the beginning of the record.
How Do I Load Dates?
Loading dates using SQLLDR is fairly straightforward, but it seems to be a common point of confusion. You simply need to use the DATE datatype in the control file and specify the date mask to be used. This date mask is the same mask you use with TO_CHAR and TO_DATE in the database. SQLLDR will apply this date mask to your data and load it for you.
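For example, to load a DD/MM/YYYY date (hypothetical table and column names):

LOAD DATA
INFILE *
INTO TABLE t
REPLACE
FIELDS TERMINATED BY ','
( dt date 'dd/mm/yyyy', txt )
BEGINDATA
11/01/2024,hello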
It is that easy. Just supply the format in the control file and SQLLDR will convert the date for us. In some cases, it might be appropriate to use a more powerful SQL function. For example, your input file might contain dates in many different formats: sometimes with the time component, sometimes without; sometimes in DD-MON-YYYY format; sometimes in DD/MM/YYYY format; and so on. You’ll learn in the next section how to use functions in SQLLDR to overcome these challenges.
How Do I Load Data Using Functions?
In this section, you’ll see how to refer to functions while loading data. Bear in mind, however, that the use of such functions (including database sequences) requires the SQL engine and hence won’t work in a direct path load.
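Referencing a function is as simple as quoting a SQL expression that uses the :COLUMN bind variable—for example, to uppercase incoming data (DEPT columns assumed):

( DEPTNO,
  DNAME "upper(:dname)",
  LOC   "upper(:loc)"
)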
Suppose the input file contains dates in one of three varying formats:
HH24:MI:SS: Just a time; the date should default to the first day of the current month.
DD/MM/YYYY: Just a date; the time should default to midnight.
HH24:MI:SS DD/MM/YYYY: The date and time are both explicitly supplied.
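A control file fragment along these lines copes with all three—a sketch; note the field is described as a CHAR so the CASE expression receives a string:

( deptno,
  dname "upper(:dname)",
  last_updated char(22)
  "case
     when length(:last_updated) > 10
       then to_date(:last_updated, 'hh24:mi:ss dd/mm/yyyy')
     when instr(:last_updated, ':') > 0
       then to_date(:last_updated, 'hh24:mi:ss')
     else to_date(:last_updated, 'dd/mm/yyyy')
   end"
)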
Now, one of three date formats will be applied to the input character string (notice that you are not loading a DATE anymore; you are just loading a string). The CASE function will look at the length and the contents of the string to determine which of the masks it should use.
It is interesting to note that you can write your own functions to be called from SQLLDR. This is a straightforward application of the fact that PL/SQL can be called from SQL.
How Do I Load Data with Embedded Newlines?
There are four ways to handle data with embedded newlines:
Load the data with some other character in the data that represents a newline (e.g., put the string \n in the text where a newline should appear) and use a SQL function to replace that text with a CHR(10) during load time.
Use the FIX attribute on the INFILE directive, and load a fixed-length flat file. In this case, there is no record terminator; rather, the fact that each record is exactly as long as every other record is used to determine where records begin and end.
Use the VAR attribute on the INFILE directive, and load a variable-width file that uses a format such that the first few bytes of each line specify the length in bytes of the line to follow.
Use the STR attribute on the INFILE directive to load a variable-width file with some sequence of characters that represents the end of line, as opposed to just the newline character representing this.
The following sections demonstrate each in turn.
Use a Character Other Than a Newline
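This is the first technique from the list above: a placeholder character stands in for the newline, and REPLACE undoes it at load time. A sketch, assuming a pipe (|) placeholder and a COMMENTS column—both assumptions on my part:

( deptno,
  dname "upper(:dname)",
  comments "replace(:comments, '|', chr(10))",
  loc
)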
Use the FIX Attribute
The FIX attribute is another method available to us. If we use this, the input data must appear in fixed-length records. Each record will be exactly the same number of bytes as any other record in the input data set. When using positional data, the use of the FIX attribute is especially valid. These files are typically fixed-length input files to begin with. When using free-form delimited data, it is less likely that we will have a fixed-length file, as these files are generally of varying length (this is the entire point of delimited files: to make each line only as big as it needs to be).
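The FIX usage appears on the INFILE line; here the input file is assumed to consist of fixed 80-byte records, and the table and column names are placeholders:

LOAD DATA
INFILE demo.dat "fix 80"
INTO TABLE t
REPLACE
FIELDS TERMINATED BY ','
( deptno, dname, comments, loc )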
Notice that in this input file, the newlines (\n) are not there to indicate where the end of the record for SQLLDR is; rather, they are just data to be loaded in this case. SQLLDR is using the FIX width of 80 bytes to figure out how much data to read. In fact, if we look at the input data, the records for SQLLDR are not even terminated by \n in this input file. The character right before department 20’s record is a space, not a newline.
You might need to trim this data, since the trailing whitespace is preserved. You can do that in the control file, using the TRIM built-in SQL function.
A word of caution to those of you lucky enough to work on both Windows and UNIX/Linux: the end-of-line marker is different on these platforms. On UNIX/Linux, it is simply \n (CHR(10) in SQL). On Windows/DOS, it is \r\n (CHR(13)||CHR(10) in SQL). In general, if you use the FIX approach, make sure to create and load the file on a homogeneous platform (UNIX/Linux and UNIX/Linux, or Windows and Windows).
Use the VAR Attribute
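With VAR, each record begins with a length field; here, "var 3" says the first 3 bytes of every record hold the length of the record that follows (table and column names are placeholders):

LOAD DATA
INFILE demo.dat "var 3"
INTO TABLE t
REPLACE
FIELDS TERMINATED BY ','
( deptno, dname, comments, loc )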
In our input datafile, we have four rows of data. The first row starts with 055, meaning that the next 55 bytes represent the first input record. This 55 bytes includes the terminating newline after the word Virginia. The next row starts with 065. It has 65 bytes of text, and so on. Using this format datafile, we can easily load our data with embedded newlines.
Again, if you are using UNIX/Linux and Windows (the preceding example was with UNIX/Linux, where a newline is one character long), you would have to adjust the length field for each record. On Windows, the preceding example’s .dat file would have to contain 56, 66, 66, and 60 for the values in the length fields.
Use the STR Attribute
This is perhaps the most flexible method of loading data with embedded newlines. Using the STR attribute , we can specify a new end-of-line character (or sequence of characters). This allows us to create an input datafile that has some special character at the end of each line—the newline is no longer “special.”
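A quick query converts the chosen terminator—assume a pipe followed by a newline—into the hexadecimal form the STR attribute wants:

select utl_raw.cast_to_raw( '|' || chr(10) ) str from dual;

STR
----
7C0A

which is then used on the INFILE line:

INFILE demo.dat "str X'7C0A'"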
It shows us that the STR we need to use on UNIX/Linux is X'7C0A'.
On Windows, you would use UTL_RAW.CAST_TO_RAW( '|'||chr(13)||chr(10) ).
Embedded Newlines Wrap-Up
We explored at least four ways to load data with embedded newlines in this section. In the upcoming section titled “Flat File Unload,” we will use one of these techniques, the STR attribute, in a generic unload utility to avoid issues with regard to newlines in text.
Additionally, one thing to be very aware of—and I’ve mentioned it previously a couple of times—is that in Windows (all flavors), text files may end in \r\n (ASCII 13 + ASCII 10, carriage return/linefeed). Your control file will have to accommodate this: the \r is part of the record. The byte counts in the FIX and VAR, and the string used with STR, must accommodate this. For example, if you took any of the previous .dat files that currently contain just \n in them and FTP-ed them to Windows using an ASCII transfer (the default), every \n would turn into \r\n. The same control file that just worked in UNIX/Linux would not be able to load the data anymore. This is something you must be aware of and take into consideration when setting up the control file.
How Do I Load LOBs?
We will now consider some methods for loading into LOBs. This is not a LONG or LONG RAW field, but rather the preferred datatypes of BLOB and CLOB. These datatypes support a much richer interface/set of functionality than the legacy LONG and LONG RAW types, as discussed in Chapter 12.
We will investigate two methods for loading these fields: SQLLDR and PL/SQL. Others exist, such as Java streams, Pro*C, and OCI. We will begin working with the PL/SQL method of loading LOBs, and then we’ll look at using SQLLDR to load them as well.
Loading a LOB via PL/SQL
The DBMS_LOB package has entry points called LoadFromFile, LoadBLOBFromFile, and LoadCLOBFromFile. These procedures allow us to use a BFILE (which can be used to read operating system files) to populate a BLOB or CLOB in the database. There is not a significant difference between the LoadFromFile and LoadBLOBFromFile routines, other than the latter returns OUT parameters that indicate how far into the BLOB column we have loaded data. The LoadCLOBFromFile routine, however, provides a significant feature: character set conversion. If you recall, in Chapter 12 we discussed some of the National Language Support (NLS) features of the Oracle database and the importance of character sets. LoadCLOBFromFile allows us to tell the database that the file it is about to load is in a character set different from the one the database is using, and that it should perform the required character set conversion. For example, you may have a UTF8-compatible database, but the files received to be loaded are encoded in the WE8ISO8859P1 character set, or vice versa. This function allows you to successfully load these files.
For complete details on the procedures available in the DBMS_LOB package and their full set of inputs and outputs, please refer to the Oracle PL/SQL Packages and Types Reference.
To use these procedures, we will need to create a DIRECTORY object in the database. This object will allow us to create BFILES (and open them) that point to a file existing on the file system that the database server has access to. This last phrase, “that the database server has access to,” is a key point when using PL/SQL to load LOBs. The DBMS_LOB package executes entirely in the server. It can see only the file systems the server can see. It cannot, in particular, see your local file system if you are accessing Oracle over the network.
Oracle DIRECTORY objects are logical directories, meaning they are pointers to existing, physical directories in your operating system. The CREATE DIRECTORY command does not actually create a directory in the file system—you must perform that operation separately.
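For example (the /tmp/ path is an assumption; note the quoted identifier in the second statement):

create or replace directory dir1 as '/tmp/';
create or replace directory "dir2" as '/tmp/';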
The user who performs this operation needs to have the CREATE ANY DIRECTORY privilege. The reason we create two directories is to demonstrate a common case-related (“case” as in uppercase vs. lowercase characters) issue with regard to DIRECTORY objects. When Oracle created the first directory, DIR1, it stored the object name in uppercase, as is the default. In the second example, with the quoted identifier "dir2", it created the DIRECTORY object preserving the case we used in the name. The importance of this will be demonstrated shortly when we use the BFILE object.
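The walkthrough that follows refers to line numbers in this PL/SQL block—a sketch assuming a table DEMO (ID INT PRIMARY KEY, THECLOB CLOB) and the file /tmp/test.txt:

declare
    l_clob  clob;
    l_bfile bfile;
begin
    insert into demo values ( 1, empty_clob() )
    returning theclob into l_clob;

    l_bfile := bfilename( 'DIR1', 'test.txt' );
    dbms_lob.fileopen( l_bfile );

    dbms_lob.loadfromfile( l_clob, l_bfile,
                           dbms_lob.getlength( l_bfile ) );

    dbms_lob.fileclose( l_bfile );
end;
/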
On lines 5 and 6, we create a row in our table, set the CLOB to an EMPTY_CLOB(), and retrieve its value in one call. With the exception of temporary LOBs, LOBs live in the database—we cannot write to a LOB variable without having a pointer to either a temporary LOB or a LOB that is already in the database. An EMPTY_CLOB() is not a NULL CLOB; it is a valid non-NULL pointer to an empty structure. The other thing this did for us was to get a LOB locator, which points to data in a row that is locked. If we were to have selected this value out without locking the underlying row, our attempts to write to it would fail because LOBs must be locked prior to writing (unlike other structured data). By inserting the row, we have, of course, locked the row. If we were modifying an existing row instead of inserting, we would have used SELECT FOR UPDATE to retrieve and lock the row.
On line 8, we create a BFILE object. Note how we use DIR1 in uppercase—this is key, as we will see in a moment. This is because we are passing to BFILENAME() the name of an object, not the object itself. Therefore, we must ensure the name matches the case Oracle has stored for this object.
On line 9, we open the BFILE. This will allow us to read it.
On lines 11 and 12, we load the entire contents of the operating system file /tmp/test.txt into the LOB locator we just inserted. We use DBMS_LOB.GETLENGTH() to tell the LOADFROMFILE() routine how many bytes of the BFILE to load (all of them).
Lastly, on line 14, we close the BFILE we opened, and the CLOB is loaded.
If the load instead raises an error or the resulting CLOB contains garbage characters, then you most likely are running into a character set mismatch between your database and the encoding used for the file. This next example uses LoadCLOBFromFile to account for a file encoded in WE8ISO8859P1:
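A sketch, again assuming the DEMO table and DIR1 from above; note the IN OUT offset and language-context parameters the routine requires:

declare
    l_clob         clob;
    l_bfile        bfile;
    l_dest_offset  number := 1;
    l_src_offset   number := 1;
    l_lang_context number := dbms_lob.default_lang_ctx;
    l_warning      number;
begin
    insert into demo values ( 2, empty_clob() )
    returning theclob into l_clob;

    l_bfile := bfilename( 'DIR1', 'test.txt' );
    dbms_lob.fileopen( l_bfile );

    dbms_lob.loadclobfromfile(
        dest_lob     => l_clob,
        src_bfile    => l_bfile,
        amount       => dbms_lob.getlength( l_bfile ),
        dest_offset  => l_dest_offset,
        src_offset   => l_src_offset,
        bfile_csid   => nls_charset_id( 'WE8ISO8859P1' ),
        lang_context => l_lang_context,
        warning      => l_warning );

    dbms_lob.fileclose( l_bfile );
end;
/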
There are methods other than the load from file routines by which you can populate a LOB using PL/SQL. Using DBMS_LOB and its supplied routines is by far the easiest if you are going to load the entire file. If you need to process the contents of the file while loading it, you may also use DBMS_LOB.READ on the BFILE to read the data. The use of UTL_RAW.CAST_TO_VARCHAR2 is handy here if the data you are reading is text, not RAW. You may then use DBMS_LOB.WRITE or WRITEAPPEND to place the data into a CLOB or BLOB.
Loading LOB Data via SQLLDR
SQLLDR handles two basic cases:
When the data is inline with the rest of the data.
When the data is stored out of line, and the input data contains a file name to be loaded with the row. These are also known as secondary datafiles (SDFs) in SQLLDR terminology.
We will start with data that is inline.
Loading LOB Data That Is Inline
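A control file for inline LOB data might look like this—a sketch; the LOB_DEMO table and its columns are assumptions. Each record ends with a pipe and a newline, hence the STR:

LOAD DATA
INFILE demo.dat "str X'7C0A'"
INTO TABLE lob_demo
REPLACE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( owner      char(30),
  time_stamp date 'dd-mon-yyyy',
  comments   char(1000000)
)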
This example is from UNIX/Linux, where the end-of-line marker is 1 byte, hence the STR setting in the preceding control file. On Windows, it would have to be '7C0D0A'.
The one thing to observe here is that the doubled-up quotes are no longer doubled up. SQLLDR removed the extra quotes we placed there.
Loading LOB Data That Is Out of Line
A common scenario is to have a datafile that contains the names of files to load into the LOBs, instead of having the LOB data mixed in with the structured data. This offers a greater degree of flexibility, as the datafile given to SQLLDR does not have to use one of the four methods to get around having embedded newlines in the input data, as would frequently happen with large amounts of text or binary data. SQLLDR calls this type of additional datafile a LOBFILE.
SQLLDR can also support the loading of a structured datafile that points to another, single datafile. We can tell SQLLDR how to parse LOB data from this other file, so that each row in the structured data gets loaded with a piece of it. I find this mode to be of limited use (in fact, I’ve never found a use for it to date), and I will not discuss it here. SQLLDR refers to these externally referenced files as complex secondary datafiles .
The LOBFILE data can be in one of the following formats:
Fixed-length fields (e.g., load bytes 100 through 1000 from the LOBFILE)
Delimited fields (terminated by something or enclosed by something)
Length/value pairs, a variable-length field
The most common of these types is the delimited fields—ones that are terminated by an end of file (EOF), in fact. Typically, you have a directory full of files you would like to load into LOB columns, and each file in its entirety will go into a BLOB. The LOBFILE statement with TERMINATED BY EOF is what you will use.
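For example (the table and file names are placeholders), each named file is loaded in its entirety into the LOB column:

LOAD DATA
INFILE *
INTO TABLE lob_demo
REPLACE
FIELDS TERMINATED BY ','
( owner,
  file_name char(100),
  data      lobfile(file_name) terminated by EOF
)
BEGINDATA
scott,demo1.ctl
scott,demo.log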
You might ask, “Why is the size of demo.log apparently 0?” During the running of SQLLDR, it would open the demo.log file for writing, which would zero out the length of that file and reset that file. So while loading the demo.log file, it was empty.
This works with CLOBs as well as BLOBs. Loading a directory of text files using SQLLDR in this fashion is easy.
Loading LOB Data into Object Columns
Now that we know how to load into a simple table we have created ourselves, we might also find the need to load into a table that has a complex object type with a LOB in it. This happens most frequently when using the image capabilities. The image capabilities are implemented using a complex object type, ORDSYS.ORDIMAGE. We need to be able to tell SQLLDR how to load into this.
You could issue SET DESC DEPTH ALL or SET DESC DEPTH <n> in SQL*Plus to have the entire hierarchy displayed at once. Given that the output from describing the ORDSYS.ORDIMAGE type would have been many pages long, I chose to do it piece by piece.
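The control file for such a load might look like this—a sketch based on a hypothetical IMAGE_LOAD table with an ORDSYS.ORDIMAGE column named IMAGE:

LOAD DATA
INFILE *
INTO TABLE image_load
REPLACE
FIELDS TERMINATED BY ','
( ID,
  FILE_NAME,
  IMAGE column object
  (
    SOURCE column object
    (
      LOCALDATA lobfile (file_name) terminated by eof
                nullif file_name = 'NONE'
    )
  )
)
BEGINDATA
1,icons.gif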
COLUMN OBJECT: This tells SQLLDR that this is not a column name; rather, it is part of a column name. It is not mapped to a field in the input file, but is used to build the correct object column reference to be used during the load. In the preceding file, we have two column object tags, one nested inside the other. Therefore, the column name that will be used is IMAGE.SOURCE.LOCALDATA, as we need it to be. Note that we are not loading any of the other attributes of these two object types (e.g., IMAGE.HEIGHT, IMAGE.CONTENTLENGTH, and IMAGE.SOURCE.SRCTYPE). We’ll shortly see how to get those populated.
NULLIF FILE_NAME = 'NONE': This tells SQLLDR to load a NULL into the object column in the event that the field FILE_NAME contains the word NONE in it.
SETPROPERTIES is an object method provided by the ORDSYS.ORDIMAGE type, which processes the image itself and updates the remaining attributes of the object with appropriate values.
How Do I Call SQLLDR from a Stored Procedure?
The short answer is that you cannot do this. SQLLDR is not an API; it is not something that is callable. SQLLDR is a command-line program. You can definitely write an external procedure in Java or C that runs SQLLDR, but that won’t be the same as “calling” SQLLDR. The load will happen in another session, and it will not be subject to your transaction control. Additionally, you will have to parse the resulting log file to determine if the load was successful or not and how successful (i.e., how many rows got loaded before an error terminated the load) it may have been. Invoking SQLLDR from a stored procedure is not something I recommend doing.
So what can we do to load data from within a stored procedure? The alternatives are as follows:
Write a mini-SQLLDR in PL/SQL. It can use either BFILES to read binary data or UTL_FILE to read text data to parse and load.
Write a mini-SQLLDR in Java. This can be a little more sophisticated than a PL/SQL-based loader and can make use of the many available Java routines.
Write a SQLLDR in C, and call it as an external procedure.
I’d like to finish up the topic of SQLLDR by discussing a few topics that are not immediately intuitive.
SQLLDR Caveats
In this section, we will discuss some things you have to watch out for when using SQLLDR.
TRUNCATE Appears to Work Differently
When you use SQLLDR’s TRUNCATE option, it behaves differently from a TRUNCATE you issue yourself: SQLLDR issues TRUNCATE TABLE ... REUSE STORAGE. The REUSE STORAGE option does not release allocated extents—it just marks them as free space. If this were not the desired outcome, you would truncate the table yourself prior to executing SQLLDR.
SQLLDR Defaults to CHAR(255)
The “Field in data file exceeds maximum length” error does not mean the data will not fit into the database column; rather, it indicates that SQLLDR was expecting 255 bytes or less of input data, and it received somewhat more than that. The solution is to simply use CHAR(N) in the control file, where N is big enough to accommodate the largest field length in the input file. Refer to the very first item in the earlier section “Loading Data with SQLLDR FAQs” for an example.
Command Line Overrides Control File
Many of the SQLLDR options may be either placed in the control file or used on the command line. For example, I can use INFILE FILENAME as well as SQLLDR ... DATA=FILENAME. The command line overrides any options in the control file. You cannot count on the options in a control file actually being used, as the person executing SQLLDR can override them.
SQLLDR Summary
In this section, we explored many areas of loading data. We covered the typical, everyday issues we will encounter: loading delimited files, loading fixed-length files, loading a directory full of image files, using functions on input data to transform the input, and so on. We did not cover massive data loads using the direct path loader in any detail; rather, we touched lightly on that subject. Our goal was to answer the questions that arise frequently with the use of SQLLDR and that affect the broadest audience.
Flat File Unload
There are many utilities you can use to unload data to a flat file. For example, the following tools can generate a text file using a database table as its source: the APEX SQL Workshop, SQL Developer, TOAD, and SQL*Plus. Alternatively, if you need better performance, you can write a PL/SQL, Pro*C, C++, or Java program to unload data.
From the prior list of utilities, I’ll show how to use SQL*Plus and PL/SQL to unload data to a flat file. I focus on these two tools mainly because most developers and DBAs are familiar with SQL*Plus and PL/SQL, and these utilities are fairly easy to use.
Using SQL*Plus
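One simple approach—an assumption on my part, since there are several—uses the SET MARKUP CSV feature available in SQL*Plus 12.2 and later to spool a table to a CSV file:

SQL> set markup csv on quote off
SQL> set feedback off
SQL> spool dept.csv
SQL> select * from dept;
SQL> spool off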
The prior technique gives you a simple and easy method to quickly unload large amounts of data from a table into a flat file. Additionally, the text file is generated from wherever your SQL*Plus client exists. For example, if you connect to the database from a client on your laptop, the output file is created on your laptop. This gives end users and clients more control over generating flat files.
Using PL/SQL
In this section, we will develop a small PL/SQL utility that may be used to unload data on a server in a SQLLDR-friendly format. The PL/SQL utility will work fine in most small cases, but better performance will be had using Pro*C. Note that Pro*C and SQL*Plus are also useful if you need the files to be generated on the client and not on the server, which is where PL/SQL will create them.
Note the use of AUTHID CURRENT_USER. This permits this package to be installed once in a database and used by anyone to unload data. All the person needs is SELECT privileges on the table(s) they want to unload and EXECUTE privileges on this package. If we did not use AUTHID CURRENT_USER in this case, then the owner of this package would need direct SELECT privileges on all tables to be unloaded. Additionally, since this routine would be completely subject to “SQL injection” attacks (it takes an arbitrary SQL statement as input), the CREATE statement must specify AUTHID CURRENT_USER. If it were a default definer’s rights routine, anyone with EXECUTE on it would be able to execute any SQL statement using the owner’s permissions. If you know a routine is SQL injectable, it had better be an invoker’s rights routine!
The SQL will execute with the privileges of the invoker of this routine. However, all PL/SQL calls will run with the privileges of the definer of the called routine; therefore, the ability to use UTL_FILE to write to a directory is implicitly given to anyone with execute permission on this package.
These are some global variables used in this package body. The global cursor is opened once, the first time we reference this package, and it will stay open until we log out. This avoids the overhead of getting a new cursor every time we call this package. The G_DESCTBL is a PL/SQL table that will hold the output of a DBMS_SQL.DESCRIBE call. G_NL is a newline character. We use this in strings that need to have newlines embedded in them. We do not need to adjust this for Windows—UTL_FILE will see the CHR(10) in the string of characters and automatically turn that into a carriage return/linefeed for us.
The IS_WINDOWS function does rely on you using the backslash (\) in your CONTROL_FILES parameter. Be aware that you may use /, but it would be highly unusual.
Line (2): We use the STR feature of SQLLDR. We can specify what character or string is used to terminate a record. This allows us to load data with embedded newlines easily. The string x'7E0A' is simply a tilde followed by a newline.
Line (5): We use our separator character and enclosure character. We do not use OPTIONALLY ENCLOSED BY, since we will be enclosing every single field after doubling any occurrence of the enclosure character in the raw data.
Line (11): We use a large numeric date format. This does two things: it avoids any NLS issues with regard to the data, and it preserves the time component of the date field.
The flat file we generate has the following characteristics:
Each field is enclosed in our enclosure character.
The DATES are unloaded as large numbers.
Each line of data in this file ends with a ~ as requested.
We can now reload this data easily using SQLLDR. You may add options to the SQLLDR command line as you see fit.
As stated previously, the logic of the unload package may be implemented in a variety of languages and tools. PL/SQL is a good all-around implementation (no need to compile and install on client workstations), but it always writes to the server file system. SQL*Plus is a good middle ground, offering fair performance and the ability to write to the client file system.
Summary
In this chapter, we covered many of the ins and outs of data loading and unloading. First, we discussed the advantages that external tables have over SQLLDR. Then we looked at easy techniques to get started with external tables. We also showed examples of using the PREPROCESSOR directive to execute OS commands prior to loading the data.
Then we looked at Data Pump unload—using external tables to create extracts of a table’s data in a dump file that can easily be moved from database to database.
We discussed that in most scenarios you should be using external tables and not SQLLDR. However, there are some situations that may require the use of SQLLDR, like loading data over the network. We then examined many of the basic techniques for loading delimited data, fixed-width data, LOBs, and the like.
Finally, we looked at the reverse process, data unloading, and how to get data out of the database in a format that other tools—such as spreadsheets or the like—may use. In the course of that discussion, we looked at using SQL*Plus. After that, we developed a PL/SQL utility to demonstrate the process—one that unloads data in a SQLLDR (or external table) friendly format, but could easily be modified to meet your needs.