Particularly when you get data from somewhere else, it is not uncommon to receive it in the form of a string containing a list of values separated by some delimiter – typically a comma, semicolon, or tab. As you most often don’t know the number of elements in the list, you can’t simply use substr to split it into a fixed number of columns. Instead, it is normally most useful to turn the list into rows, so you can treat it as a table in your SQL.
Such splitting involves generating rows, which can be done in many ways. I’ll show several different methods: using PL/SQL to loop over the elements of the list and generate one row at a time, generating all rows at once by selecting from dual and retrieving the element for each row from the list, and pretending the list is JSON and parsing it with native JSON functionality.
Customer favorites and reviews
Both tables have a proper foreign key to the customers table, but of course cannot have one to the products table, as the product ids are just part of the strings in the columns favorite_list and review_list – I show sample data in the upcoming sections. The task at hand is basically to extract those product ids to be able to join to the products table.
Delimited single values
Comma-delimited content of customer_favorites table
I now need to treat this list as if it were a child table with a row for each of the comma-separated entries. That will enable me to join to the products table (and any other table with a product id column, for that matter). In the rest of this section, I show four different ways to do this.
Pipelined table function
Collection type and pipelined table function
Collection types can be of object types or scalar types – in this case a scalar type: integer.
I’ve chosen to make the table function pipelined by using the keyword pipelined in line 4.
Inside the function, I create a loop beginning in line 11, where I search for the position of the next comma (the first if it’s the first iteration of the loop). Lines 13–22 then pipe a row to the output containing the substr from the previous comma to the found comma (or the end of the string if no comma was found).
If I reach the end of the string (no comma was found), line 23 breaks out of the loop. If there’s still something left in the string, line 24 sets the next v_from_pos to be used in the next iteration of the loop.
The loop strategy works if there’s at least one element in the comma-separated list. If it’s a completely empty list, I make sure in line 9 that I don’t start the loop at all – in such a case, no rows will be piped to the output.
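In outline, the approach could be sketched as follows – a hedged sketch with my own placeholder names for the type and function, so its line numbers do not match those referenced for the listing:

```sql
-- Sketch: a scalar collection type and a pipelined table function
-- splitting a comma-separated string of integers.
create or replace type integer_table as table of integer;
/

create or replace function split_integers (
   p_list in varchar2
)
   return integer_table pipelined
is
   v_from_pos  pls_integer := 1;  -- start position of the current element
   v_comma_pos pls_integer;
begin
   if p_list is not null then
      loop
         v_comma_pos := instr(p_list, ',', v_from_pos);
         if v_comma_pos = 0 then
            -- no more commas: pipe the rest of the string and stop
            pipe row (to_number(substr(p_list, v_from_pos)));
            exit;
         end if;
         pipe row (to_number(
            substr(p_list, v_from_pos, v_comma_pos - v_from_pos)
         ));
         v_from_pos := v_comma_pos + 1;  -- continue after the comma
      end loop;
   end if;
   return;
end split_integers;
/
```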
I could have used a regular table function instead of pipelined – then I would have had to build the entire output collection before returning it. But if a table function is meant to be used strictly from SQL and never from PL/SQL, it is almost always a good idea to make it pipelined. This has the advantage of less PGA memory usage as well as the ability to quit processing if the client SQL stops fetching rows from the function. The downside is that you cannot use it in PL/SQL.
Using pipelined table function to split string
But you’ll undoubtedly notice that the customer with a blank favorite_list is missing in the output. That’s how Listing 9-3 works; I’m joining the customer_favorites table to the row source that is pipelined from my function, and it outputs (correctly) no rows for a blank favorite_list. This is exactly as if I was inner joining to a child table where no rows existed for this customer.
Join the results of the splitting to products
The normal join syntax requires an on clause, which I do not have and do not need. In principle, what I need is something like a cross join lateral to an inline view, but for table functions the special syntax cross apply is used instead, which I place just before the table keyword in line 7.
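The shape of such a query might be sketched like this (assuming some pipelined splitting function – here a hypothetical split_integers – and that the products table has columns id and name):

```sql
select cf.customer_id
     , p.id   as p_id
     , p.name as product_name
from   customer_favorites cf
cross  apply table(
          split_integers(cf.favorite_list)  -- hypothetical splitting function
       ) fl
join   products p
   on  p.id = fl.column_value
order  by cf.customer_id, fl.column_value;
```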
This first method used a custom-built table function created for this purpose only. You could also write a generic function, but in fact you don’t need to: the built-in APEX schema that you probably have in your database has already done this for you, as I’ll show next.
Built-in APEX table function
There is an APEX API function, apex_util.string_to_table(favorite_list, ',') – but it returns a PL/SQL collection type defined in a package, not a nested table type defined in SQL. It is a deprecated function anyway, so I mention it only so you won’t use it, even if you happen to Google it.
As of version 12.2, APEX is not installed in the database by default; rather it is just shipped with the software for easy installation. Even if your company does not use APEX applications as such, I think it is a good idea to install APEX in the database anyway to take advantage of the API packages when you code SQL and PL/SQL. If you wish, you can do it without configuring a web listener (ORDS, embedded PL/SQL gateway, or Oracle HTTP Server).
Splitting with apex_string.split
The difference is just the function call in line 8 and then a small detail in line 14, where I utilize the fact that I can use column aliases in the order by clause to order by the more meaningful p_id instead of fl.column_value.
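A sketch of the apex_string.split variant (apex_string.split returns a collection of strings, so the values are converted with to_number; assumed table and column names are those used in this chapter):

```sql
select cf.customer_id
     , to_number(fl.column_value) as p_id
     , p.name as product_name
from   customer_favorites cf
cross  apply table(
          apex_string.split(cf.favorite_list, ',')
       ) fl
join   products p
   on  p.id = to_number(fl.column_value)
order  by cf.customer_id, p_id;  -- column alias usable in order by
```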
The output of Listing 9-5 is identical to that of Listing 9-4. Both methods call PL/SQL functions to do the actual splitting of the strings, which of course means context switching between the SQL and PL/SQL engines. Next up is a method in straight SQL without the context switching.
Straight SQL with row generators
Generating as many rows as delimiter count
Using cross join lateral in line 13 causes the inline view fl in lines 14–16 to be executed for each row in customer_favorites, since I correlate the lateral inline view using cf.favorite_list in line 16. By counting the number of commas and adding one, the inline view generates exactly as many rows as there are elements in the comma-separated list.
As I’ve numbered the fl rows consecutively 1, 2, 3... in column sub#, I can use sub# in regexp_substr in line 10 to extract the first, second, third... occurrence of a “list of at least one character not containing a comma.” This is then my product_id which I use to join the products table.
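Put together, the row-generator method might be sketched as a single query like this (a sketch, not the listing itself; the products table is assumed to have columns id and name):

```sql
select cf.customer_id
     , to_number(
          regexp_substr(cf.favorite_list, '[^,]+', 1, fl.sub#)
       ) as p_id
     , p.name as product_name
from   customer_favorites cf
cross  join lateral (
          -- one row per element: number of commas plus one
          select level as sub#
          from   dual
          connect by level <= regexp_count(cf.favorite_list, ',') + 1
       ) fl
join   products p
   on  p.id = to_number(
          regexp_substr(cf.favorite_list, '[^,]+', 1, fl.sub#)
       )
order  by cf.customer_id, p_id;
```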
The output of Listing 9-6 is identical to that of Listings 9-4 and 9-5.
The second group in the expression is like before, just with + changed to *, but I need to state that it must follow either the beginning of the string or a comma. As I don’t want that preceding comma to be part of the output, I ask regexp_substr to return just the second group (line 16).
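For example, a list containing an empty element could be handled like this – the sixth parameter of regexp_substr picks out subexpression 2 only (this parameter requires version 11g or newer):

```sql
-- '123,,456' has an empty second element; '[^,]+' would skip it,
-- but '(^|,)([^,]*)' matches it as an empty group 2.
select level as sub#
     , regexp_substr('123,,456', '(^|,)([^,]*)', 1, level, null, 2)
          as element
from   dual
connect by level <= regexp_count('123,,456', ',') + 1;
```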
Treating the string as a JSON array
Treating the string as a JSON array
Instead of a PL/SQL table function, I use the SQL function json_table in line 7.
The first parameter to json_table must be valid JSON, which in this case I can very simply accomplish by surrounding the comma-separated list with square brackets in line 8.
I can keep line 8 very simple only because my values are all numeric. If there had been text values involved, I would have needed to surround the text values with double quotes by replacing commas with quote-comma-quotes and take into consideration escaping any existing quotes. Then I would do as Stew Ashton shows here: https://stewashton.wordpress.com/2018/06/05/splitting-strings-a-new-champion/.
In line 9, I state that there should be one row output from json_table for every element in the JSON array. As those elements are simple scalars, the path in line 11 becomes a simple $.
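A sketch of this approach (assuming, as before, that products has columns id and name):

```sql
select cf.customer_id, fl.p_id, p.name as product_name
from   customer_favorites cf
cross  apply json_table(
          -- square brackets turn the list into a JSON array
          '[' || cf.favorite_list || ']', '$[*]'
          columns (
             p_id number path '$'  -- each element is a scalar
          )
       ) fl
join   products p
   on  p.id = fl.p_id
order  by cf.customer_id, fl.p_id;
```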
I’ve shown four methods to split simple delimited strings into rows of scalar values. In most cases, I’d choose between using straight SQL, JSON arrays, and apex_string.split. If you have very long strings with many elements, the SQL method of asking for the 1st, 2nd, 3rd… occurrence in regexp_substr might become slow by the 50th occurrence – such a case might be better served by a function that pipes a row as it traverses the string. On the other hand, if you have many relatively short strings, each with few elements, the overhead of retrieving elements by occurrence might be smaller than the overhead of the comparatively greater number of context switches to PL/SQL.
As always, test your own use case to see whether SQL or a pipelined function is best. If a pipelined function is the answer for you, the built-in apex_string.split is often a good choice – creating your own pipelined function is useful if your database does not have the APEX API packages installed or if you need some special datatype handling.
Now it’s time to increase the complexity and look at delimited strings with some more structure in them.
Delimited multiple values
From time to time, I see applications where a string contains data with two delimiters – a row delimiter and a column delimiter. These days that would typically be a JSON string instead, but as data lives on a long time, you might still have to deal with such strings.
Comma- and colon-delimited content of customer_reviews table
To split up those strings into rows and columns, I’ll show you four different methods.
Custom ODCI table function
The first method I’ll show involves a pipelined table function again, but not a straightforward one like Listing 9-2.
Instead I am implementing it with the Oracle Data Cartridge Interface (ODCI), which allows me to hook into specific points in the processing of a SQL statement. This means that when the SQL engine hard parses a statement using this function, it calls my code to find out which columns and datatypes will be returned – instead of finding this information in the data dictionary. Likewise, when a statement is prepared, when a row is fetched, and when the cursor is closed, the engine calls my code instead of the standard handling.
This is just one type of ODCI function implementing a custom pipelined table function. ODCI can also be used to implement a custom aggregate function, which I’ll show you in the next chapter.
Here I’ll focus on using this ODCI function – all the details of the PL/SQL are outside the scope of this book. In Listing 9-9, I just show the skeleton of the object type used for implementation of the function.
The skeleton of the object type that implements the ODCI function
The object type must contain and implement the five odci* functions – they are called by the SQL engine, not by anyone using the type.
The parser function is the one that should be called when you wish to use it. As it references the implementing object type using the syntax using delimited_col_row (line 20), it need not be inside the object type; if you prefer, it could be implemented as a stand-alone function or in a package.
Using the ODCI table function to parse the delimited data
First, the string that contains my delimited data: cr.review_list
Then, the specification of the “columns” of each “row” of delimited data – their names and datatypes (this should be a literal, not a variable, as it is used at hard parse time, not at soft parse time)
Last, what is the column delimiter and the row delimiter in the data (these same delimiters I use in the column specification in line 6)
When I execute this statement the first time (hard parse), the SQL engine calls my odcitabledescribe function, which parses the second parameter and lets the SQL engine know the table function will return a row set with two columns, product_id and score, of the specified datatypes.
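A rough sketch of such a call might look like the following – the exact signature of parser and the format of the column specification literal are defined by the implementation in Listing 9-9, so treat those details here as assumptions:

```sql
select cr.customer_id, rl.product_id, rl.score
from   customer_reviews cr
cross  apply table(
          delimited_col_row.parser(
             cr.review_list                          -- the delimited data
           , 'product_id number:score varchar2(1)'   -- column spec (a literal)
           , ':'                                     -- column delimiter
           , ','                                     -- row delimiter
          )
       ) rl
order  by cr.customer_id, rl.product_id;
```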
Joining with real column names instead of generic column_value
Using an ODCI implementation like this allows fine control of all the small details of the implementation. This is all well and good, but there are other solutions that don’t require installing a custom ODCI function.
Combining apex_string.split and substr
For the simple delimited list, I showed apex_string.split as an alternative to building your own pipelined table function. There is no such standard alternative to the ODCI function delimited_col_row.parser that handles both rows and columns.
Getting rows with apex_string.split and columns with substr
I start by splitting the review list into rows in line 12 by using apex_string.split with the row delimiter comma. That means that rl will have rows with column_value, which will contain values with the two columns delimited by a colon – for example, 4040:A.
Then it is a simple matter of using substr to pick out the product id in lines 17–22 and pick out the score in lines 6–9. The output is identical to Listing 9-11.
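The shape of that query, in sketch form (substr and instr slice each column_value at the colon; to_number converts the first part):

```sql
select cr.customer_id
     , to_number(
          substr(rl.column_value, 1, instr(rl.column_value, ':') - 1)
       ) as product_id
     , substr(rl.column_value, instr(rl.column_value, ':') + 1) as score
from   customer_reviews cr
cross  apply table(
          apex_string.split(cr.review_list, ',')  -- rows by comma
       ) rl
order  by cr.customer_id, product_id;
```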
I’ve eliminated the custom function, but I’m still incurring a lot of context switches to PL/SQL, so next I’ll try to use pure SQL again.
Row generators and regexp_substr
Generating as many rows as delimiter count
Line 13 looks for either the beginning of the string or a comma (group 1), followed by zero or more characters that are neither colon nor comma (group 2). Line 17 states the function should return the second group (this needs minimum version 11.2).
Line 22 looks for zero or more characters that are neither colon nor comma (group 1), followed by either a comma or the end of the string (group 2). Line 26 states the function should return the first group.
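The two regular expressions described might be combined in a sketch like this (the sixth parameter of regexp_substr selects the subexpression, as explained above):

```sql
select cr.customer_id, rl.product_id, rl.score
from   customer_reviews cr
cross  join lateral (
          select to_number(regexp_substr(
                    cr.review_list, '(^|,)([^:,]*)', 1, level, null, 2
                 )) as product_id   -- text between start/comma and colon
               , regexp_substr(
                    cr.review_list, '([^:,]*)(,|$)', 1, level, null, 1
                 ) as score         -- text between colon and comma/end
          from   dual
          connect by level <= regexp_count(cr.review_list, ',') + 1
       ) rl
order  by cr.customer_id, rl.product_id;
```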
Listing 9-13 produces output identical to Listing 9-11 and Listing 9-12, but does so without any PL/SQL calls at all. The cost is heavier use of regular expression functions, which can be relatively CPU expensive – so to find which performs best, you should test the approaches against your specific use case.
All three solutions so far handle the string as it is, but I also mentioned at the start of the chapter that in many modern applications, such data would be stored as JSON rather than delimited. The database is capable of efficiently handling JSON as well as XML, so here’s a fourth method that utilizes this.
Transformation to JSON
Turning delimited text into JSON
You can see in the output that the code in lines 3–13 transformed the text of review_list into nested JSON arrays: an outer array whose elements correspond to rows, each row itself being an inner array whose elements correspond to columns.
End of inner element
End of inner array
Comma as delimiter between elements of the outer array
Start of new inner array
Start of new inner element
End of inner element
Comma as delimiter between elements in the inner array
Start of new inner element
In line 3, the JSON begins with the three characters [[" for start of outer array, start of inner array, and start of inner element.
Finally in line 13, the JSON ends with the three characters "]] for end of inner element, end of inner array, and end of outer array.
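With a sample value, the transformation itself can be sketched in isolation. Note that the row delimiter must be replaced in the inner replace call: if the colon were replaced first, the commas it introduces would wrongly be replaced as row delimiters too.

```sql
select '[["'
       || replace(
             replace('4040:A,5310:B', ',', '"],["')  -- row delimiter first
           , ':', '","'                              -- then column delimiter
          )
       || '"]]' as json_reviews
from   dual;
-- gives: [["4040","A"],["5310","B"]]
```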
Parsing JSON with json_table
The first parameter to the json_table function is the JSON itself, so lines 9–19 are the expression I developed in the previous listing.
The second parameter in line 20 specifies that json_table should take as rows all the inner arrays (*) in the outer JSON array that is in the root of the JSON string ($).
And last in the column specification lines 22–23, I state that the first element ($[0]) of the inner array is a number and should be a column called product_id, while the second element ($[1]) of the inner array is a varchar2 and should be a column called score.
As shown before, if I wanted to show the customer with a blank review_list, I would change cross apply in line 8 to outer apply.
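Put together, the whole query might be sketched as follows (assuming the products table has columns id and name):

```sql
select cr.customer_id, rl.product_id, rl.score, p.name as product_name
from   customer_reviews cr
cross  apply json_table(
          -- transform the delimited list into nested JSON arrays
          '[["'
          || replace(
                replace(cr.review_list, ',', '"],["')
              , ':', '","'
             )
          || '"]]'
        , '$[*]'  -- one row per inner array
          columns (
             product_id number      path '$[0]'
           , score      varchar2(1) path '$[1]'
          )
       ) rl
join   products p
   on  p.id = rl.product_id
order  by cr.customer_id, rl.product_id;
```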
Listing 9-15 can be adapted to use linefeed for row delimiter and comma for column delimiter if you have plain CSV in a CLOB, for example. Alternatively you could look into the apex_data_parser package as shown here: https://blogs.oracle.com/apex/super-easy-csv-xlsx-json-or-xml-parsing-about-the-apex_data_parser-package.
Using json_table requires version 12.1.0.2 or newer. If you have a need for older versions, you’ll find in the companion script an example of doing the same thing by transforming to XML and using xmltable instead.
Lessons learned
Split delimited text with SQL only or built-in PL/SQL functionality.
Create custom PL/SQL table functions – both regular and the ODCI variant – for special needs.
Transform the text to JSON and use native JSON parsing.
If you create your own data model, you should use child tables, collections, XML, or JSON rather than relying on storing data as delimited text. But it is common to receive delimited text from places outside your control, in which case any of the shown methods can be useful. Normally, using native and built-in functionality is the easiest and best performing, but for more special use cases, you can test whether the other methods are better suited for you.