This chapter focuses on string manipulation in SQL. Keep in mind that SQL is not designed to perform complex string manipulation and you can (and will) find working with strings in SQL to be very cumbersome and frustrating at times. Despite SQL’s limitations, there are some very useful built-in functions provided by the different DBMSs, and we’ve tried to use them in creative ways. This chapter in particular is very representative of the message we tried to convey in the introduction; SQL is the good, the bad, and the ugly. Hopefully you take away from this chapter a better appreciation for what can and can’t be done in SQL when working with strings. In many cases you’ll be surprised by how easy parsing and transformation of strings can be, while at other times you’ll be aghast by the kind of SQL that is necessary to accomplish a particular task.
Many of the recipes that follow use the Translate and Replace functions that are now available in all the DBMSs covered in this book, with the exception of MySQL, which only has Replace. In this last case, it is worth noting early on that you can replicate the effect of Translate by using nested Replace functions.
The first recipe in this chapter is critically important, as it is leveraged by several of the subsequent solutions. In many cases, you’d like to have the ability to traverse a string by moving through it a character at a time. Unfortunately, SQL does not make this easy. Because there is limited loop functionality in SQL, you need to mimic a loop to traverse a string. We call this operation “walking a string” or “walking through a string” and the very first recipe explains the technique. This is a fundamental operation in string parsing when using SQL, and is referenced and used by almost all recipes in this chapter. We strongly suggest becoming comfortable with how the technique works.
You want to traverse a string to return each character as a row, but SQL lacks a loop operation. For example, you want to display the ENAME “KING” from table EMP as four rows, where each row contains just characters from “KING”.
Use a Cartesian product to generate the number of rows needed to return each character of a string on its own line. Then use your DBMS’s built-in string parsing function to extract the characters you are interested in (SQL Server users will use SUBSTRING instead of SUBSTR and DATALENGTH instead of LENGTH):
1 select substr(e.ename,iter.pos,1) as C
2 from (select ename from emp where ename = 'KING') e,
3 (select id as pos from t10) iter
4 where iter.pos <= length(e.ename)
C - K I N G
The key to iterating through a string’s characters is to join against a table that has enough rows to produce the required number of iterations. This example uses table T10, which contains 10 rows (it has one column, ID, holding the values 1 through 10). The maximum number of rows that can be returned from this query is 10.
The following example shows the Cartesian product between E and ITER (i.e., between the specific name and the 10 rows from T10) without parsing ENAME:
select ename, iter.pos
from (select ename from emp where ename = 'KING') e,
(select id as pos from t10) iter
ENAME POS ---------- ---------- KING 1 KING 2 KING 3 KING 4 KING 5 KING 6 KING 7 KING 8 KING 9 KING 10
The cardinality of inline view E is 1, and the cardinality of inline view ITER is 10. The Cartesian product is then 10 rows. Generating such a product is the first step in mimicking a loop in SQL.
It is common practice to refer to table T10 as a “pivot” table.
The solution uses a WHERE clause to break out of the loop after four rows have been returned. To restrict the result set to the same number of rows as there are characters in the name, that WHERE clause specifies ITER.POS <= LENGTH(E. ENAME) as the condition:
select ename, iter.pos
from (select ename from emp where ename = 'KING') e,
(select id as pos from t10) iter
where iter.pos <= length(e.ename)
ENAME POS ---------- ---------- KING 1 KING 2 KING 3 KING 4
Now that you have one row for each character in E.ENAME, you can use ITER.POS as a parameter to SUBSTR, allowing you to navigate through the characters in the string. ITER.POS increments with each row, and thus each row can be made to return a successive character from E.ENAME. This is how the solution example works.
Depending on what you are trying to accomplish you may or may not need to generate a row for every single character in a string. The following query is an example of walking E.ENAME and exposing different portions (more than a single character) of the string:
select substr(e.ename,iter.pos) a, substr(e.ename,length(e.ename)-iter.pos+1) b from (select ename from emp where ename = 'KING') e, (select id pos from t10) iter where iter.pos <= length(e.ename) A B ---------- ------ KING G ING NG NG ING G KING
The most common scenarios for the recipes in this chapter involve walking the whole string to generate a row for each character in the string, or walking the string such that the number of rows generated reflects the number of particular characters or delimiters that are present in the string.
You want to embed quote marks within string literals. You would like to produce results such as the following with SQL:
QMARKS -------------- g'day mate beavers' teeth '
The following three SELECTs highlight different ways you can create quotes: in the middle of a string and by themselves:
1 select 'g''day mate' qmarks from t1 union all 2 select 'beavers'' teeth' from t1 union all 3 select '''' from t1
When working with quotes, it’s often useful to think of them like parentheses. When you have an opening parenthesis, you must always have a closing parenthesis. The same goes for quotes. Keep in mind that you should always have an even number of quotes across any given string. To embed a single quote within a string you need to use two quotes:
select 'apples core', 'apple''s core',
case when '' is null then 0 else 1 end
from t1
'APPLESCORE 'APPLE''SCOR CASEWHEN''ISNULLTHEN0ELSE1END ----------- ------------ ----------------------------- apples core apple's core 0
Following is the solution stripped down to its bare elements. You have two outer quotes defining a string literal, and, within that string literal you have two quotes that together represent just one quote in the string that you actually get:
select '''' as quote from t1
Q
-
'
When working with quotes, be sure to remember that a string literal comprising two quotes alone, with no intervening characters, is NULL.
You want to count the number of times a character or substring occurs within a given string. Consider the following string:
10,CLARK,MANAGER
You want to determine how many commas are in the string.
Subtract the length of the string without the commas from the original length of the string to determine the number of commas in the string. Each DBMS provides functions for obtaining the length of a string and removing characters from a string. In most cases, these functions are LENGTH and REPLACE, respectively (SQL Server users will use the built-in function LEN rather than LENGTH):
1 select (length('10,CLARK,MANAGER')- 2 length(replace('10,CLARK,MANAGER',',','')))/length(',') 3 as cnt 4 from t1
You arrive at the solution by using simple subtraction. The call to LENGTH on line 1 returns the original size of the string, and the first call to LENGTH on line 2 returns the size of the string without the commas, which are removed by REPLACE.
By subtracting the two lengths you obtain the difference in terms of characters, which is the number of commas in the string. The last operation divides the difference by the length of your search string. This division is necessary if the string you are looking for has a length greater than 1. In the following example, counting the occurrence of “LL” in the string “HELLO HELLO” without dividing will return an incorrect result:
select
(length('HELLO HELLO')-
length(replace('HELLO HELLO','LL','')))/length('LL')
as correct_cnt,
(length('HELLO HELLO')-
length(replace('HELLO HELLO','LL',''))) as incorrect_cnt
from t1
CORRECT_CNT INCORRECT_CNT ----------- ------------- 2 4
You want to remove specific characters from your data. A scenario where this may occur is in dealing with badly formatted numeric data, especially currency data, where commas have been used to separate zeros and currency markers are mixed in the column with the quantity. Another scenario is that you want to export data from your database as a csv, but there is a text field containing commas, which will be read as separators when the csv is accessed. Consider this result set:
ENAME SAL ---------- ---------- SMITH 800 ALLEN 1600 WARD 1250 JONES 2975 MARTIN 1250 BLAKE 2850 CLARK 2450 SCOTT 3000 KING 5000 TURNER 1500 ADAMS 1100 JAMES 950 FORD 3000 MILLER 1300
You want to remove all zeros and vowels as shown by the following values in columns STRIPPED1 and STRIPPED2:
ENAME STRIPPED1 SAL STRIPPED2 ---------- ---------- ---------- --------- SMITH SMTH 800 8 ALLEN LLN 1600 16 WARD WRD 1250 125 JONES JNS 2975 2975 MARTIN MRTN 1250 125 BLAKE BLK 2850 285 CLARK CLRK 2450 245 SCOTT SCTT 3000 3 KING KNG 5000 5 TURNER TRNR 1500 15 ADAMS DMS 1100 11 JAMES JMS 950 95 FORD FRD 3000 3 MILLER MLLR 1300 13
Each DBMS provides functions for removing unwanted characters from a string. The functions REPLACE and TRANSLATE are most useful for this problem.
Use the built-in functions TRANSLATE and REPLACE to remove unwanted characters and strings:
1 select ename, 2 replace(translate(ename,'aaaaa','AEIOU'),'a','') as stripped1, 3 sal, 4 replace(cast(sal as char(4)),'0','') as stripped2 5 from emp
Note that for DB2, the AS keyword is optional for assigning a column alias and can be left out.
MySQL does not offer a TRANSLATE function, so several calls to REPLACE are needed:
1 select ename, 2 replace( 3 replace( 4 replace( 5 replace( 6 replace(ename,'A',''),'E',''),'I',''),'O',''),'U','') 7 as stripped1, 8 sal, 9 replace(sal,0,'') stripped2 10 from emp
The built-in function REPLACE removes all occurrences of zeros. To remove the vowels, use TRANSLATE to convert all vowels into one specific character (I used “a”; you can use any character), then use REPLACE to remove all occurrences of that character.
You have numeric data stored with character data together in one column. This could easily happen if you inherit data where units of measurement or currency have been stored with their quantity (e.g. a column with 100 km, AUD$200 or 40 pounds, rather than either the column making the units clear or a separate column showing the units where necessary). You want to separate the character data from the numeric data. Consider the following result set:
DATA --------------- SMITH800 ALLEN1600 WARD1250 JONES2975 MARTIN1250 BLAKE2850 CLARK2450 SCOTT3000 KING5000 TURNER1500 ADAMS1100 JAMES950 FORD3000 MILLER1300
You would like the result to be:
ENAME SAL ---------- ---------- SMITH 800 ALLEN 1600 WARD 1250 JONES 2975 MARTIN 1250 BLAKE 2850 CLARK 2450 SCOTT 3000 KING 5000 TURNER 1500 ADAMS 1100 JAMES 950 FORD 3000 MILLER 1300
Use the built-in functions TRANSLATE and REPLACE to isolate the character from the numeric data. Like other recipes in this chapter, the trick is to use TRANSLATE to transform multiple characters into a single character you can reference. This way you are no longer searching for multiple numbers or characters, rather one character to represent all numbers or one character to represent all characters.
Use the functions TRANSLATE and REPLACE to isolate and separate the numeric from the character data:
1 select replace( 2 translate(data,'0000000000','0123456789'),'0','') ename, 3 cast( 4 replace( 5 translate(lower(data),repeat('z',26), 6 'abcdefghijklmnopqrstuvwxyz'),'z','') as integer) sal 7 from ( 8 select ename||cast(sal as char(4)) data 9 from emp 10 ) x
Use the functions TRANSLATE and REPLACE to isolate and separate the numeric from the character data:
1 select replace( 2 translate(data,'0123456789','0000000000'),'0') ename, 3 to_number( 5 replace( 6 translate(lower(data), 7 'abcdefghijklmnopqrstuvwxyz', 8 rpad('z',26,'z')),'z')) sal 9 from ( 10 select ename||sal data 11 from emp 12 )
Use the functions TRANSLATE and REPLACE to isolate and separate the numeric from the character data:
1 select replace( 2 translate(data,'0123456789','0000000000'),'0','') as ename, 3 cast( 4 replace( 5 translate(lower(data), 6 'abcdefghijklmnopqrstuvwxyz', 7 rpad('z',26,'z')),'z','') as integer) as sal 8 from ( 9 select ename||sal as data 10 from emp 11 ) x
Use the functions TRANSLATE and REPLACE to isolate and separate the numeric from the character data:
1 select replace( 2 translate(data,'0123456789','0000000000'),'0','') as ename, 3 cast( 4 replace( 5 translate(lower(data), 6 'abcdefghijklmnopqrstuvwxyz', 7 replicate('z',26),'z','') as integer) as sal 8 from ( 9 select concat(ename,sal) as data 10 from emp 11 ) x
The syntax is a bit different for each DBMS, but the technique is the same. We will use the solution for Oracle in the discussion section. The key to solving this problem is to isolate the numeric and character data. You can use TRANSLATE and REPLACE to do this. To extract the numeric data, first isolate all character data using TRANSLATE:
select data,
translate(lower(data),
'abcdefghijklmnopqrstuvwxyz',
rpad('z',26,'z')) sal
from (select ename||sal data from emp)
DATA SAL -------------------- ------------------- SMITH800 zzzzz800 ALLEN1600 zzzzz1600 WARD1250 zzzz1250 JONES2975 zzzzz2975 MARTIN1250 zzzzzz1250 BLAKE2850 zzzzz2850 CLARK2450 zzzzz2450 SCOTT3000 zzzzz3000 KING5000 zzzz5000 TURNER1500 zzzzzz1500 ADAMS1100 zzzzz1100 JAMES950 zzzzz950 FORD3000 zzzz3000 MILLER1300 zzzzzz1300
By using TRANSLATE you convert every non-numeric character into a lowercase Z. The next step is to remove all instances of lowercase Z from each record using REPLACE, leaving only numerical characters that can then be cast to a number:
select data,
to_number(
replace(
translate(lower(data),
'abcdefghijklmnopqrstuvwxyz',
rpad('z',26,'z')),'z')) sal
from (select ename||sal data from emp)
DATA SAL -------------------- ---------- SMITH800 800 ALLEN1600 1600 WARD1250 1250 JONES2975 2975 MARTIN1250 1250 BLAKE2850 2850 CLARK2450 2450 SCOTT3000 3000 KING5000 5000 TURNER1500 1500 ADAMS1100 1100 JAMES950 950 FORD3000 3000 MILLER1300 1300
To extract the non-numeric characters, isolate the numeric characters using TRANSLATE:
select data,
translate(data,'0123456789','0000000000') ename
from (select ename||sal data from emp)
DATA ENAME -------------------- ---------- SMITH800 SMITH000 ALLEN1600 ALLEN0000 WARD1250 WARD0000 JONES2975 JONES0000 MARTIN1250 MARTIN0000 BLAKE2850 BLAKE0000 CLARK2450 CLARK0000 SCOTT3000 SCOTT0000 KING5000 KING0000 TURNER1500 TURNER0000 ADAMS1100 ADAMS0000 JAMES950 JAMES000 FORD3000 FORD0000 MILLER1300 MILLER0000
By using TRANSLATE you convert every numeric character into a zero. The next step is to remove all instances of zero from each record using REPLACE, leaving only non-numeric characters:
select data,
replace(translate(data,'0123456789','0000000000'),'0') ename
from (select ename||sal data from emp)
DATA ENAME -------------------- ------- SMITH800 SMITH ALLEN1600 ALLEN WARD1250 WARD JONES2975 JONES MARTIN1250 MARTIN BLAKE2850 BLAKE CLARK2450 CLARK SCOTT3000 SCOTT KING5000 KING TURNER1500 TURNER ADAMS1100 ADAMS JAMES950 JAMES FORD3000 FORD MILLER1300 MILLER
Put the two techniques together and you have your solution.
You want to return rows from a table only when a column of interest contains no characters other than numbers and letters. Consider the following view V (SQL Server users will use the operator “+” for concatenation instead of “||”):
create view V as select ename as data from emp where deptno=10 union all select ename||', $'|| cast(sal as char(4)) ||'.00' as data from emp where deptno=20 union all select ename|| cast(deptno as char(4)) as data from emp where deptno=30
The view V represents your table, and it returns the following:
DATA -------------------- CLARK KING MILLER SMITH, $800.00 JONES, $2975.00 SCOTT, $3000.00 ADAMS, $1100.00 FORD, $3000.00 ALLEN30 WARD30 MARTIN30 BLAKE30 TURNER30 JAMES30
However, from the view’s data you want to return only the following records:
DATA ------------- CLARK KING MILLER ALLEN30 WARD30 MARTIN30 BLAKE30 TURNER30 JAMES30
In short, you wish to omit those rows containing data other than letters and digits.
It may seem intuitive at first to solve the problem by searching for all the possible non-alphanumeric characters that can be found in a string, but, on the contrary, you will find it easier to do the exact opposite: find all the alphanumeric characters. By doing so, you can treat all the alphanumeric characters as one by converting them to one single character. The reason you want to do this is so the alphanumeric characters can be manipulated together, as a whole. Once you’ve generated a copy of the string in which all alphanumeric characters are represented by a single character of your choosing, it is easy to isolate the alphanumeric characters from any other characters.
Use the function TRANSLATE to convert all alphanumeric characters to a single character, then identify any rows that have characters other than the converted alphanumeric character. For DB2 users, the CAST function calls in view V are necessary; otherwise, the view cannot be created due to type conversion errors. Take extra care when working with casts to CHAR as they are fixed length (padded):
1 select data 2 from V 3 where translate(lower(data), 4 repeat('a',36), 5 '0123456789abcdefghijklmnopqrstuvwxyz') = 6 repeat('a',length(data))
The syntax for view V is slightly different in MySQL:
create view V as select ename as data from emp where deptno=10 union all select concat(ename,', $',sal,'.00') as data from emp where deptno=20 union all select concat(ename,deptno) as data from emp where deptno=30
Use a regular expression to easily find rows that contain non-alphanumeric data:
1 select data 2 from V 3 where data regexp '[^0-9a-zA-Z]' = 0
Use the function TRANSLATE to convert all alphanumeric characters to a single character, then identify any rows that have characters other than the converted alphanumeric character. The CAST function calls in view V are not needed for Oracle and PostgreSQL. Take extra care when working with casts to CHAR as they are fixed length (padded). If you decide to cast, cast to VARCHAR or VARCHAR2:
1 select data 2 from V 3 where translate(lower(data), 4 '0123456789abcdefghijklmnopqrstuvwxyz', 5 rpad('a',36,'a')) = rpad('a',length(data),'a')
The technique is the same, with the exception of there being no rpad in SQL Server
1 select data 2 from V 3 where translate(lower(data), 4 '0123456789abcdefghijklmnopqrstuvwxyz', 5 replicate('a',36)) = replicate('a',len(data))
The key to these solutions is being able to reference multiple characters concurrently. By using the function TRANSLATE you can easily manipulate all numbers or all characters without having to “iterate” and inspect each character one by one.
Only 9 of the 14 rows from view V are alphanumeric. To find the rows that are alphanumeric only, simply use the function TRANSLATE. In this example, TRANSLATE converts characters 0–9 and a–z to “a”. Once the conversion is done, the converted row is then compared with a string of all “a” with the same length (as the row). If the length is the same, then you know all the characters are alphanumeric and nothing else.
By using the TRANSLATE function (using the Oracle syntax):
where translate(lower(data), '0123456789abcdefghijklmnopqrstuvwxyz', rpad('a',36,'a'))
you convert all numbers and letters into a distinct character (I chose “a”). Once the data is converted, all strings that are indeed alphanumeric can be identified as a string comprising only a single character (in this case, “a”). This can be seen by running TRANSLATE by itself:
select data, translate(lower(data),
'0123456789abcdefghijklmnopqrstuvwxyz',
rpad('a',36,'a'))
from V
DATA TRANSLATE(LOWER(DATA) -------------------- --------------------- CLARK aaaaa … SMITH, $800.00 aaaaa, $aaa.aa … ALLEN30 aaaaaaa …
The alphanumeric values are converted, but the string lengths have not been modified. Because the lengths are the same, the rows to keep are the ones for which the call to TRANSLATE returns all a’s. You keep those rows, rejecting the others, by comparing each original string’s length with the length of its corresponding string of a’s:
select data, translate(lower(data),
'0123456789abcdefghijklmnopqrstuvwxyz',
rpad('a',36,'a')) translated,
rpad('a',length(data),'a') fixed
from V
DATA TRANSLATED FIXED -------------------- -------------------- ---------------- CLARK aaaaa aaaaa … SMITH, $800.00 aaaaa, $aaa.aa aaaaaaaaaaaaaa … ALLEN30 aaaaaaa aaaaaaa …
The last step is to keep only the strings where TRANSLATED equals FIXED.
The expression in the WHERE clause:
where data regexp '[^0-9a-zA-Z]' = 0
causes rows that have only numbers or characters to be returned. The value ranges in the brackets, “0-9a-zA-Z”, represent all possible numbers and letters. The character “^” is for negation, so the expression can be stated as “not numbers or letters.” A return value of 1 is true and 0 is false, so the whole expression can be stated as “return rows where anything other than numbers and letters is false.”
You want convert a full name into initials. Consider the following name:
Stewie Griffin
You would like to return:
S.G.
It’s important to keep in mind that SQL does not provide the flexibility of languages such as C or Python; therefore, creating a generic solution to deal with any name format is not something particularly easy to do in SQL. The solutions presented here expect the names to be either first and last name, or first, middle name/middle initial, and last name.
Use the built-in functions REPLACE, TRANSLATE, and REPEAT to extract the initials:
1 select replace( 2 replace( 3 translate(replace('Stewie Griffin', '.', ''), 4 repeat('#',26), 5 'abcdefghijklmnopqrstuvwxyz'), 6 '#','' ), ' ','.' ) 7 ||'.' 8 from t1
Use the built-in functions CONCAT, CONCAT_WS, SUBSTRING, and SUBSTRING_ INDEX to extract the initials:
1 select case 2 when cnt = 2 then 3 trim(trailing '.' from 4 concat_ws('.', 5 substr(substring_index(name,' ',1),1,1), 6 substr(name, 7 length(substring_index(name,' ',1))+2,1), 8 substr(substring_index(name,' ',-1),1,1), 9 '.')) 10 else 11 trim(trailing '.' from 12 concat_ws('.', 13 substr(substring_index(name,' ',1),1,1), 14 substr(substring_index(name,' ',-1),1,1) 15 )) 16 end as initials 17 from ( 18 select name,length(name)-length(replace(name,' ','')) as cnt 19 from ( 20 select replace('Stewie Griffin','.','') as name from t1 21 )y 22 )x
Use the built-in functions REPLACE, TRANSLATE, and RPAD to extract the initials:
1 select replace( 2 replace( 3 translate(replace('Stewie Griffin', '.', ''), 4 'abcdefghijklmnopqrstuvwxyz', 5 rpad('#',26,'#') ), '#','' ),' ','.' ) ||'.' 6 from t1
1 select replace( 2 replace( 3 translate(replace('Stewie Griffin', '.', ''), 4 'abcdefghijklmnopqrstuvwxyz', 5 replicate('#',26) ), '#','' ),' ','.' ) + '.' 6 from t1
By isolating the capital letters you can extract the initials from a name. The following sections describe each vendor-specific solution in detail.
The REPLACE function will remove any periods in the name (to handle middle initials), and the TRANSLATE function will convert all non-uppercase letters to #.
select translate(replace('Stewie Griffin', '.', ''),
repeat('#',26),
'abcdefghijklmnopqrstuvwxyz')
from t1
TRANSLATE('STE -------------- S##### G######
At this point, the initials are the characters that are not #. The function REPLACE is then used to remove all the # characters:
select replace(
translate(replace('Stewie Griffin', '.', ''),
repeat('#',26),
'abcdefghijklmnopqrstuvwxyz'),'#','')
from t1
REP --- S G
The next step is to replace the white space with a period by using REPLACE again:
select replace(
replace(
translate(replace('Stewie Griffin', '.', ''),
repeat('#',26),
'abcdefghijklmnopqrstuvwxyz'),'#',''),' ','.') || '.'
from t1
REPLA ----- S.G
The final step is to append a decimal to the end of the initials.
The REPLACE function will remove any periods in the name (to handle middle initials), and the TRANSLATE function will convert all non-uppercase letters to #.
select translate(replace('Stewie Griffin','.',''),
'abcdefghijklmnopqrstuvwxyz',
rpad('#',26,'#'))
from t1
TRANSLATE('STE -------------- S##### G######
At this point, the initials are the characters that are not “#”. The function REPLACE is then used to remove all the # characters:
select replace(
translate(replace('Stewie Griffin','.',''),
'abcdefghijklmnopqrstuvwxyz',
rpad('#',26,'#')),'#','')
from t1
REP --- S G
The next step is to replace the white space with a period by using REPLACE again:
select replace(
replace(
translate(replace('Stewie Griffin','.',''),
'abcdefghijklmnopqrstuvwxyz',
rpad('#',26,'#') ),'#',''),' ','.') || '.'
from t1
REPLA ----- S.G
The final step is to append a decimal to the end of the initials.
The inline view Y is used to remove any period from the name. The inline view X finds the number of white spaces in the name so the SUBSTR function can be called the correct number of times to extract the initials. The three calls to SUBSTRING_ INDEX parse the string into individual names based on the location of the white space. Because there is only a first and last name, the code in the ELSE portion of the case statement is executed:
select substr(substring_index(name, ' ',1),1,1) as a,
substr(substring_index(name,' ',-1),1,1) as b
from (select 'Stewie Griffin' as name from t1) x
A B - - S G
If the name in question has a middle name or initial, the initial would be returned by executing
substr(name,length(substring_index(name, ' ',1))+2,1)
which finds the end of the first name then moves two spaces to the beginning of the middle name or initial; that is, the start position for SUBSTR. Because only onecharacter is kept, the middle name or initial is successfully returned. The initials are then passed to CONCAT_WS, which separates the initials by a period:
select concat_ws('.',
substr(substring_index(name, ' ',1),1,1),
substr(substring_index(name,' ',-1),1,1),
'.' ) a
from (select 'Stewie Griffin' as name from t1) x
A ----- S.G..
The last step is to trim the extraneous period from the initials.
You want to order your result set based on a substring. Consider the following records:
ENAME ---------- SMITH ALLEN WARD JONES MARTIN BLAKE CLARK SCOTT KING TURNER ADAMS JAMES FORD MILLER
You want the records to be ordered based on the last two characters of each name:
ENAME --------- ALLEN TURNER MILLER JONES JAMES MARTIN BLAKE ADAMS KING WARD FORD CLARK SMITH SCOTT
The key to this solution is to find and use your DBMS’s built-in function to extract the substring on which you wish to sort. This is typically done with the SUBSTR function.
Use a combination of the built-in functions LENGTH and SUBSTR to order by a specific part of a string:
1 select ename 2 from emp 3 order by substr(ename,length(ename)-1,)
Use functions SUBSTRING and LEN to order by a specific part of a string:
1 select ename 2 from emp 3 order by substring(ename,len(ename)-1,2)
By using a SUBSTR expression in your ORDER BY clause, you can pick any part of a string to use in ordering a result set. You’re not limited to SUBSTR either. You can order rows by the result of almost any expression.
You want order your result set based on a number within a string. Consider the following view:
create view V as select e.ename ||' '|| cast(e.empno as char(4))||' '|| d.dname as data from emp e, dept d where e.deptno=d.deptno
This view returns the following data:
DATA ---------------------------- CLARK 7782 ACCOUNTING KING 7839 ACCOUNTING MILLER 7934 ACCOUNTING SMITH 7369 RESEARCH JONES 7566 RESEARCH SCOTT 7788 RESEARCH ADAMS 7876 RESEARCH FORD 7902 RESEARCH ALLEN 7499 SALES WARD 7521 SALES MARTIN 7654 SALES BLAKE 7698 SALES TURNER 7844 SALES JAMES 7900 SALES
You want to order the results based on the employee number, which falls between the employee name and respective department:
DATA --------------------------- SMITH 7369 RESEARCH ALLEN 7499 SALES WARD 7521 SALES JONES 7566 RESEARCH MARTIN 7654 SALES BLAKE 7698 SALES CLARK 7782 ACCOUNTING SCOTT 7788 RESEARCH KING 7839 ACCOUNTING TURNER 7844 SALES ADAMS 7876 RESEARCH JAMES 7900 SALES FORD 7902 RESEARCH MILLER 7934 ACCOUNTING
Each solution uses functions and syntax specific to its DBMS, but the method (making use of the built-in functions REPLACE and TRANSLATE) is the same for each. The idea is to use REPLACE and TRANSLATE to remove non-digits from the strings, leaving only the numeric values upon which to sort.
Use the built-in functions REPLACE and TRANSLATE to order by numeric characters in a string:
1 select data 2 from V 3 order by 4 cast( 5 replace( 6 translate(data,repeat('#',length(data)), 7 replace( 8 translate(data,'##########','0123456789'), 9 '#','')),'#','') as integer)
Use the built-in functions REPLACE and TRANSLATE to order by numeric characters in a string:
1 select data 2 from V 3 order by 4 to_number( 5 replace( 6 translate(data, 7 replace( 8 translate(data,'0123456789','##########'), 9 '#'),rpad('#',20,'#')),'#'))
Use the built-in functions REPLACE and TRANSLATE to order by numeric characters in a string:
1 select data 2 from V 3 order by 4 cast( 5 replace( 6 translate(data, 7 replace( 8 translate(data,'0123456789','##########'), 9 '#',''),rpad('#',20,'#')),'#','') as integer)
As of the time of this writing, MySQL does not provide the TRANSLATE function.
The purpose of view V is only to supply rows on which to demonstrate this recipe’s solution. The view simply concatenates several columns from the EMP table. The solution shows how to take such concatenated text as input and sort it by the employee number embedded within.
The ORDER BY clause in each solution may look intimidating but it performs quite well and is straightforward once you examine it piece by piece. To order by the numbers in the string, it’s easiest to remove any characters that are not numbers. Once the non-numeric characters are removed all that is left to do is cast the string of numerals into a number, then sort as you see fit. Before examining each function call it is important to understand the order in which each function is called. Starting with the innermost call, TRANSLATE (line 8 from each of the original solutions), you see that:
TRANSLATE (line 8) is called and the results are returned to
REPLACE (line 7) and those results are returned to
TRANSLATE (line 6) and those results are returned to
REPLACE (line 5) and those results are returned and finally
cast into a number
The first step is to convert the numbers into characters that do not exist in the rest of the string. For this example, I chose " " and used TRANSLATE to convert all non-numeric characters into occurrences of “”. For example, the following query shows the original data on the left and the results from the first translation:
select data,
translate(data,'0123456789','##########') as tmp
from V
DATA TMP ------------------------------ ----------------------- CLARK 7782 ACCOUNTING CLARK #### ACCOUNTING KING 7839 ACCOUNTING KING #### ACCOUNTING MILLER 7934 ACCOUNTING MILLER #### ACCOUNTING SMITH 7369 RESEARCH SMITH #### RESEARCH JONES 7566 RESEARCH JONES #### RESEARCH SCOTT 7788 RESEARCH SCOTT #### RESEARCH ADAMS 7876 RESEARCH ADAMS #### RESEARCH FORD 7902 RESEARCH FORD #### RESEARCH ALLEN 7499 SALES ALLEN #### SALES WARD 7521 SALES WARD #### SALES MARTIN 7654 SALES MARTIN #### SALES BLAKE 7698 SALES BLAKE #### SALES TURNER 7844 SALES TURNER #### SALES JAMES 7900 SALES JAMES #### SALES
TRANSLATE finds the numerals in each string and converts each one to to the " " character. The modified strings are then returned to REPLACE (line 11), which removes all occurrences of “”:
select data,
replace(
translate(data,'0123456789','##########'),'#') as tmp
from V
DATA TMP ------------------------------ ------------------- CLARK 7782 ACCOUNTING CLARK ACCOUNTING KING 7839 ACCOUNTING KING ACCOUNTING MILLER 7934 ACCOUNTING MILLER ACCOUNTING SMITH 7369 RESEARCH SMITH RESEARCH JONES 7566 RESEARCH JONES RESEARCH SCOTT 7788 RESEARCH SCOTT RESEARCH ADAMS 7876 RESEARCH ADAMS RESEARCH FORD 7902 RESEARCH FORD RESEARCH ALLEN 7499 SALES ALLEN SALES WARD 7521 SALES WARD SALES MARTIN 7654 SALES MARTIN SALES BLAKE 7698 SALES BLAKE SALES TURNER 7844 SALES TURNER SALES JAMES 7900 SALES JAMES SALES
The strings are then returned to TRANSLATE once again, but this time it’s the second (outermost) TRANSLATE in the solution. TRANSLATE searches the original string for any characters that match the characters in TMP. If any are found, they too are converted to “#"s. This conversion allows all non-numeric characters to be treated as a single character (because they are all transformed to the same character):
select data, translate(data,
replace(
translate(data,'0123456789','##########'),
'#'),
rpad('#',length(data),'#')) as tmp
from V
DATA TMP ------------------------------ --------------------------- CLARK 7782 ACCOUNTING ########7782########### KING 7839 ACCOUNTING ########7839########### MILLER 7934 ACCOUNTING ########7934########### SMITH 7369 RESEARCH ########7369######### JONES 7566 RESEARCH ########7566######### SCOTT 7788 RESEARCH ########7788######### ADAMS 7876 RESEARCH ########7876######### FORD 7902 RESEARCH ########7902######### ALLEN 7499 SALES ########7499###### WARD 7521 SALES ########7521###### MARTIN 7654 SALES ########7654###### BLAKE 7698 SALES ########7698###### TURNER 7844 SALES ########7844###### JAMES 7900 SALES ########7900######
The next step is to remove all “#” characters through a call to REPLACE (line 8), leaving you with only numbers:
select data, replace(
translate(data,
replace(
translate(data,'0123456789','##########'),
'#'),
rpad('#',length(data),'#')),'#') as tmp
from V
DATA TMP ------------------------------ ----------- CLARK 7782 ACCOUNTING 7782 KING 7839 ACCOUNTING 7839 MILLER 7934 ACCOUNTING 7934 SMITH 7369 RESEARCH 7369 JONES 7566 RESEARCH 7566 SCOTT 7788 RESEARCH 7788 ADAMS 7876 RESEARCH 7876 FORD 7902 RESEARCH 7902 ALLEN 7499 SALES 7499 WARD 7521 SALES 7521 MARTIN 7654 SALES 7654 BLAKE 7698 SALES 7698 TURNER 7844 SALES 7844 JAMES 7900 SALES 7900
Finally, cast TMP to a number (line 4) using the appropriate DBMS function (often CAST) to accomplish this:
select data, to_number(
replace(
translate(data,
replace(
translate(data,'0123456789','##########'),
'#'),
rpad('#',length(data),'#')),'#')) as tmp
from V
DATA TMP ------------------------------ ---------- CLARK 7782 ACCOUNTING 7782 KING 7839 ACCOUNTING 7839 MILLER 7934 ACCOUNTING 7934 SMITH 7369 RESEARCH 7369 JONES 7566 RESEARCH 7566 SCOTT 7788 RESEARCH 7788 ADAMS 7876 RESEARCH 7876 FORD 7902 RESEARCH 7902 ALLEN 7499 SALES 7499 WARD 7521 SALES 7521 MARTIN 7654 SALES 7654 BLAKE 7698 SALES 7698 TURNER 7844 SALES 7844 JAMES 7900 SALES 7900
When developing queries like this, it’s helpful to work with your expressions in the SELECT list. That way, you can easily view the intermediate results as you work toward a final solution. However, because the point of this recipe is to order the results, ultimately you should place all the function calls into the ORDER BY clause:
select data
from V
order by
to_number(
replace(
translate( data,
replace(
translate( data,'0123456789','##########'),
'#'),rpad('#',length(data),'#')),'#'))
DATA --------------------------- SMITH 7369 RESEARCH ALLEN 7499 SALES WARD 7521 SALES JONES 7566 RESEARCH MARTIN 7654 SALES BLAKE 7698 SALES CLARK 7782 ACCOUNTING SCOTT 7788 RESEARCH KING 7839 ACCOUNTING TURNER 7844 SALES ADAMS 7876 RESEARCH JAMES 7900 SALES FORD 7902 RESEARCH MILLER 7934 ACCOUNTING
As a final note, the data in the view is comprised of three fields, only one being numeric. Keep in mind that if there had been multiple numeric fields, they would have all been concatenated into one number before the rows were sorted.
You want to return table rows as values in a delimited list, perhaps delimited by commas, rather than in vertical columns as they normally appear. You want to convert a result set from this:
DEPTNO EMPS ------ ---------- 10 CLARK 10 KING 10 MILLER 20 SMITH 20 ADAMS 20 FORD 20 SCOTT 20 JONES 30 ALLEN 30 BLAKE 30 MARTIN 30 JAMES 30 TURNER 30 WARD
to this:
DEPTNO EMPS ------- ------------------------------------ 10 CLARK,KING,MILLER 20 SMITH,JONES,SCOTT,ADAMS,FORD 30 ALLEN,WARD,MARTIN,BLAKE,TURNER,JAMES
Each DBMS requires a different approach to this problem. The key is to take advantage of the built-in functions provided by your DBMS. Understanding what is available to you will allow you to exploit your DBMS’s functionality and come up with creative solutions for a problem that is typically not solved in SQL.
Most DBMS’s have now adopted a function specifically designed to concatenate strings, such as MySQL’s GROUP_CONCAT function (one of the earliest) or STRING_ADD (added to SQL Server as recently as SQL Server 2017). These functions have very similar syntax, and make this task straight forward.
Use list_agg to build the delimited list:
1 select deptno, 2 list_agg(ename ',') within GROUP(Order by 0) as emps 3 from emp 4 group by deptno
Use the built-in function GROUP_CONCAT to build the delimited list:
1 select deptno, 2 group_concat(ename order by empno separator, ',') as emps 3 from emp 4 group by deptno
Use the built-in function SYS_CONNECT_BY_PATH to build the delimited list:
1 select deptno, 2 ltrim(sys_connect_by_path(ename,','),',') emps 3 from ( 4 select deptno, 5 ename, 6 row_number() over 7 (partition by deptno order by empno) rn, 8 count(*) over 9 (partition by deptno) cnt 10 from emp 11 ) 12 where level = cnt 13 start with rn = 1 14 connect by prior deptno = deptno and prior rn = rn-1
1 select deptno, 2 string_agg(ename order by empno separator, ',') as emps 3 from emp 4 group by deptno
Being able to create delimited lists in SQL is useful because it is a common requirement. The SQL:2016 standard added LIST_AGG to perform this task, but only DB2 has implemented this function so far. Thankfully, other DBMS have similar fucntions, often with simpler syntax
The function GROUP_CONCAT in MySQL. It concatenates the values found in the column passed to it, in this case ENAME. It’s an aggregate function, thus the need for GROUP BY in the query.
The STRING_AGG function syntax is similar enough to GROUP_CONCAT that the same query can be used with the GROUP_CONCAT simply changed to STRING_AGG
The first step to understanding the Oracle query is to break it down. Running the inline view by itself (lines 4–10), you generate a result set that includes the following for each employee: her department, her name, a rank within her respective department that is derived by an ascending sort on EMPNO, and a count of all employees in her department. For example:
select deptno,
ename,
row_number() over
(partition by deptno order by empno) rn,
count(*) over (partition by deptno) cnt
from emp
DEPTNO ENAME RN CNT ------ ---------- -- --- 10 CLARK 1 3 10 KING 2 3 10 MILLER 3 3 20 SMITH 1 5 20 JONES 2 5 20 SCOTT 3 5 20 ADAMS 4 5 20 FORD 5 5 30 ALLEN 1 6 30 WARD 2 6 30 MARTIN 3 6 30 BLAKE 4 6 30 TURNER 5 6 30 JAMES 6 6
The purpose of the rank (aliased RN in the query) is to allow you to walk the tree. Since the function ROW_NUMBER generates an enumeration starting from one with no duplicates or gaps, just subtract one (from the current value) to reference a prior (or parent) row. For example, the number prior to 3 is 3 minus 1, which equals 2. In this context, 2 is the parent of 3; you can observe this on line 12. Additionally, the lines
start with rn = 1 connect by prior deptno = deptno
identify the root for each DEPTNO as having RN equal to 1 and create a new list whenever a new department is encountered (whenever a new occurrence of 1 is found for RN).
At this point, it’s important to stop and look at the ORDER BY portion of the ROW_NUMBER function. Keep in mind the names are ranked by EMPNO and the list will be created in that order. The number of employees per department is calculated (aliased CNT) and is used to ensure that the query returns only the list that has all the employee names for a department. This is done because SYS_CONNECT_ BY_PATH builds the list iteratively, and you do not want to end up with partial lists.
For hierarchical queries, the pseudocolumn LEVEL starts with 1 (for queries not using CONNECT BY, LEVEL is 0, unless you are on 10g and later when LEVEL is only available when using CONNECT BY) and increments by one after each employee in a department has been evaluated (for each level of depth in the hierarchy). Because of this, you know that once LEVEL reaches CNT, you have reached the last EMPNO and will have a complete list.
The SYS_CONNECT_BY_PATH function prefixes the list with your chosen delimiter (in this case, a comma). You may or may not want that behavior. In this recipe’s solution, the call to the function LTRIM removes the leading comma from the list.
You have delimited data that you want to pass to the IN-list iterator of a WHERE clause. Consider the following string:
7654,7698,7782,7788
You would like to use the string in a WHERE clause but the following SQL fails because EMPNO is a numeric column:
select ename,sal,deptno from emp where empno in ( '7654,7698,7782,7788' )
This SQL fails because, while EMPNO is a numeric column, the IN list is composed of a single string value. You want that string to be treated as a comma-delimited list of numeric values.
On the surface it may seem that SQL should do the work of treating a delimited string as a list of delimited values for you, but that is not the case. When a comma embedded within quotes is encountered, SQL can’t possibly know that signals a multi-valued list. SQL must treat everything between the quotes as a single entity, as one string value. You must break the string up into individual EMPNOs. The key to this solution is to walk the string, but not into individual characters. You want to walk the string into valid EMPNO values.
By walking the string passed to the IN-list, you can easily convert it to rows. The functions ROW_NUMBER, LOCATE, and SUBSTR are particularly useful here:
1 select empno,ename,sal,deptno 2 from emp 3 where empno in ( 4 select cast(substr(c,2,locate(',',c,2)-2) as integer) empno 5 from ( 6 select substr(csv.emps,cast(iter.pos as integer)) as c 7 from (select ','||'7654,7698,7782,7788'||',' emps 8 from t1) csv, 9 (select id as pos 10 from t100 ) iter 11 where iter.pos <= length(csv.emps) 12 ) x 13 where length(c) > 1 14 and substr(c,1,1) = ',' 15 )
By walking the string passed to the IN-list, you can easily convert it to rows:
1 select empno, ename, sal, deptno 2 from emp 3 where empno in 4 ( 5 select substring_index( 6 substring_index(list.vals,',',iter.pos),',',-1) empno 7 from (select id pos from t10) as iter, 8 (select '7654,7698,7782,7788' as vals 9 from t1) list 10 where iter.pos <= 11 (length(list.vals)-length(replace(list.vals,',','')))+1 12 )
By walking the string passed to the IN-list, you can easily convert it to rows. The functions ROWNUM, SUBSTR, and INSTR are particularly useful here:
1 select empno,ename,sal,deptno 2 from emp 3 where empno in ( 4 select to_number( 5 rtrim( 6 substr(emps, 7 instr(emps,',',1,iter.pos)+1, 8 instr(emps,',',1,iter.pos+1) 9 instr(emps,',',1,iter.pos)),',')) emps 10 from (select ','||'7654,7698,7782,7788'||',' emps from t1) csv, 11 (select rownum pos from emp) iter 12 where iter.pos <= ((length(csv.emps)- 13 length(replace(csv.emps,',')))/length(','))-1 14 )
By walking the string passed to the IN-list, you can easily convert it to rows. The function SPLIT_PART makes it easy to parse the string into individual numbers:
1 select ename,sal,deptno 2 from emp 3 where empno in ( 4 select cast(empno as integer) as empno 5 from ( 6 select split_part(list.vals,',',iter.pos) as empno 7 from (select id as pos from t10) iter, 8 (select ','||'7654,7698,7782,7788'||',' as vals 9 from t1) list 10 where iter.pos <= 11 length(list.vals)-length(replace(list.vals,',','')) 12 ) z 13 where length(empno) > 0 14 )
By walking the string passed to the IN-list, you can easily convert it to rows. The functions ROW_NUMBER, CHARINDEX, and SUBSTRING are particularly useful here:
1 select empno,ename,sal,deptno 2 from emp 3 where empno in (select substring(c,2,charindex(',',c,2)-2) as empno 4 from ( 5 select substring(csv.emps,iter.pos,len(csv.emps)) as c 6 from (select ','+'7654,7698,7782,7788'+',' as emps 7 from t1) csv, 8 (select id as pos 9 from t100) iter 10 where iter.pos <= len(csv.emps) 11 ) x 12 where len(c) > 1 13 and substring(c,1,1) = ',' 14 )
The first and most important step in this solution is to walk the string. Once you’ve accomplished that, all that’s left is to parse the string into individual, numeric values using your DBMS’s provided functions.
The inline view X (lines 6–11) walks the string. The idea in this solution is to “walk through” the string, so that each row has one less character than the one before it:
,7654,7698,7782,7788, 7654,7698,7782,7788, 654,7698,7782,7788, 54,7698,7782,7788, 4,7698,7782,7788, ,7698,7782,7788, 7698,7782,7788, 698,7782,7788, 98,7782,7788, 8,7782,7788, ,7782,7788, 7782,7788, 782,7788, 82,7788, 2,7788, ,7788, 7788, 788, 88, 8, ,
Notice that by enclosing the string in commas (the delimiter), there’s no need to make special checks as to where the beginning or end of the string is.
The next step is to keep only the values you want to use in the IN-list. The values to keep are the ones with leading commas, with the exception of the last row with its lone comma. Use SUBSTR or SUBSTRING to identify which rows have a leading comma, then keep all characters found before the next comma in that row. Once that’s done, cast the string to a number so it can be properly evaluated against the numeric column EMPNO (lines 4–14):
EMPNO ------ 7654 7698 7782 7788
The final step is to use the results in a subquery to return the desired rows.
The inline view (lines 5–9) walks the string. The expression on line 10 determines how many values are in the string by finding the number of commas (the delimiter) and adding one. The function SUBSTRING_INDEX (line 6) returns all characters in the string before (to the left of ) the nth occurrence of a comma (the delimiter):
+---------------------+ | empno | +---------------------+ | 7654 | | 7654,7698 | | 7654,7698,7782 | | 7654,7698,7782,7788 | +---------------------+
Those rows are then passed to another call to SUBSTRING_INDEX (line 5); this time the nth occurrence of the delimited is –1, which causes all values to the right of the nth occurrence of the delimiter to be kept:
+-------+ | empno | +-------+ | 7654 | | 7698 | | 7782 | | 7788 | +-------+
The final step is to plug the results into a subquery.
The first step is to walk the string:
select emps,pos
from (select ','||'7654,7698,7782,7788'||',' emps
from t1) csv,
(select rownum pos from emp) iter
where iter.pos <=
((length(csv.emps)-length(replace(csv.emps,',')))/length(','))-1
EMPS POS --------------------- ---------- ,7654,7698,7782,7788, 1 ,7654,7698,7782,7788, 2 ,7654,7698,7782,7788, 3 ,7654,7698,7782,7788, 4
The number of rows returned represents the number of values in your list. The values for POS are crucial to the query as they are needed to parse the string into individual values. The strings are parsed using SUBSTR and INSTR. POS is used to locate the nth occurrence of the delimiter in each string. By enclosing the strings in commas, no special checks are necessary to determine the beginning or end of a string. The values passed to SUBSTR, INSTR (lines 7–9) locate the nth and nth+1 occurrence of the delimiter. By subtracting the value returned for the current comma (the location in the string where the current comma is) from the value returned bythe next comma (the location in the string where the next comma is) you can extract each value from the string:
select substr(emps,
instr(emps,',',1,iter.pos)+1,
instr(emps,',',1,iter.pos+1)
instr(emps,',',1,iter.pos)) emps
from (select ','||'7654,7698,7782,7788'||',' emps
from t1) csv,
(select rownum pos from emp) iter
where iter.pos <=
((length(csv.emps)-length(replace(csv.emps,',')))/length(','))-1
EMPS ----------- 7654, 7698, 7782, 7788,
The final step is to remove the trailing comma from each value, cast it to a number, and plug it into a subquery.
The inline view Z (lines 6–9) walks the string. The number of rows returned is determined by how many values are in the string. To find the number of values in the string, subtract the size of the string without the delimiter from the size of the string with the delimiter (line 9). The function SPLIT_PART does the work of parsing the string. It looks for the value that comes before the nth occurrence of the delimiter:
select list.vals,
split_part(list.vals,',',iter.pos) as empno,
iter.pos
from (select id as pos from t10) iter,
(select ','||'7654,7698,7782,7788'||',' as vals
from t1) list
where iter.pos <=
length(list.vals)-length(replace(list.vals,',',''))
vals | empno | pos ----------------------+-------+----- ,7654,7698,7782,7788, | | 1 ,7654,7698,7782,7788, | 7654 | 2 ,7654,7698,7782,7788, | 7698 | 3 ,7654,7698,7782,7788, | 7782 | 4 ,7654,7698,7782,7788, | 7788 | 5
The final step is to cast the values (EMPNO) to a number and plug it into a subquery.
You want alphabetize the individual characters within strings in your tables. Consider the following result set:
ENAME ---------- ADAMS ALLEN BLAKE CLARK FORD JAMES JONES KING MARTIN MILLER SCOTT SMITH TURNER WARD
You would like the result to be:
OLD_NAME NEW_NAME ---------- -------- ADAMS AADMS ALLEN AELLN BLAKE ABEKL CLARK ACKLR FORD DFOR JAMES AEJMS JONES EJNOS KING GIKN MARTIN AIMNRT MILLER EILLMR SCOTT COSTT SMITH HIMST TURNER ENRRTU WARD ADRW
This problem is a good example of the way increased standardisation allows for more similar, and therefore portable solutions.
To alphabetize rows of strings it is necessary to walk each string then order its characters:
1 select ename, 2 listagg(c,'') WITHIN GROUP( ORDER BY c) 3 from ( 4 select a.ename, 5 substr(a.ename,iter.pos,1 6 ) as c 7 from emp a, 8 (select id as pos from t10) iter 9 where iter.pos <= length(a.ename) 10 order by 1,2 11 ) x 12 Group By c
The key here is the GROUP_CONCAT function, which allows you to not only concatenate the characters that make up each name but also order them:
1 select ename, group_concat(c order by c separator '') 2 from ( 3 select ename, substr(a.ename,iter.pos,1) c 4 from emp a, 5 ( select id pos from t10 ) iter 6 where iter.pos <= length(a.ename) 7 ) x 8 group by ename
The function SYS_CONNECT_BY_PATH allows you to iteratively build a list:
1 select old_name, new_name 2 from ( 3 select old_name, replace(sys_connect_by_path(c,' '),' ') new_name 4 from ( 5 select e.ename old_name, 6 row_number() over(partition by e.ename 7 order by substr(e.ename,iter.pos,1)) rn, 8 substr(e.ename,iter.pos,1) c 9 from emp e, 10 ( select rownum pos from emp ) iter 11 where iter.pos <= length(e.ename) 12 order by 1 13 ) x 14 start with rn = 1 15 connect by prior rn = rn-1 and prior old_name = old_name 16 ) 17 where length(old_name) = length(new_name)
PostgreSQL has now added string_agg to order characters within a string.
select ename, string_agg(c , '' ORDER BY c) from ( select a.ename, substr(a.ename,iter.pos,1) as c from emp a, (select id as pos from t10) iter where iter.pos <= length(a.ename) order by 1,2 ) x Group By c
If you are using SQL Server 2017 or beyond, the PostgreSQL solution with String_Agg will work. Otherwise, to alphabetize rows of strings it is necessary to walk each string, and then order their characters:
1 select ename, 2 max(case when pos=1 then c else '' end)+ 3 max(case when pos=2 then c else '' end)+ 4 max(case when pos=3 then c else '' end)+ 5 max(case when pos=4 then c else '' end)+ 6 max(case when pos=5 then c else '' end)+ 7 max(case when pos=6 then c else '' end) 8 from ( 9 select e.ename, 10 substring(e.ename,iter.pos,1) as c, 11 row_number() over ( 12 partition by e.ename 13 order by substring(e.ename,iter.pos,1)) as pos 14 from emp e, 15 (select row_number()over(order by ename) as pos 16 from emp) iter 17 where iter.pos <= len(e.ename) 18 ) x 19 group by ename
The inline view X returns each character in each name as a row. The function SUBSTR or SUBSTRING extracts each character from each name, and the function ROW_NUMBER ranks each character alphabetically:
ENAME C POS ----- - --- ADAMS A 1 ADAMS A 2 ADAMS D 3 ADAMS M 4 ADAMS S 5 …
To return each letter of a string as a row, you must walk the string. This is accomplished with inline view ITER.
Now that the letters in each name have been alphabetized, the last step is to put those letters back together, into a string, in the order they are ranked. Each letter’s position is evaluated by the CASE statements (lines 2–7). If a character is found at a particular position it is then concatenated to the result of the next evaluation (the following CASE statement). Because the aggregate function MAX is used as well, only one character per position POS is returned, so that only one row per name is returned. The CASE evaluation goes up to the number 6, which is the maximum number of characters in any name in table EMP.
The inline view X (lines 3–6) returns each character in each name as a row. The function SUBSTR extracts each character from each name:
ENAME C ----- - ADAMS A ADAMS A ADAMS D ADAMS M ADAMS S …
Inline view ITER is used to walk the string. From there, the rest of the work is done by the GROUP_CONCAT function. By specifying an order, the function not only concatenates each letter, it does so alphabetically.
The real work is done by inline view X (lines 5–11), where the characters in each name are extracted and put into alphabetical order. This is accomplished by walking the string, then imposing order on those characters. The rest of the query merely glues the names back together.
The tearing apart of names can be seen by executing only inline view X:
OLD_NAME RN C ---------- --------- - ADAMS 1 A ADAMS 2 A ADAMS 3 D ADAMS 4 M ADAMS 5 S …
The next step is to take the alphabetized characters and rebuild each name. This is done with the function SYS_CONNECT_BY_PATH by appending each character to the ones before it:
OLD_NAME NEW_NAME ---------- --------- ADAMS A ADAMS AA ADAMS AAD ADAMS AADM ADAMS AADMS …
The final step is to keep only the strings that have the same length as the names they were built from.
For readability, view V is used in this solution to walk the string. The function SUBSTR, in the view definition, extracts each character from each name so that the view returns:
ENAME C ----- - ADAMS A ADAMS A ADAMS D ADAMS M ADAMS S …
The view also orders the results by ENAME and by each letter in each name. The inline view X (lines 15–18) returns the names and characters from view V, the number of times each character occurs in each name, and its position (alphabetically):
ename | c | cnt | pos ------+---+-----+----- ADAMS | A | 2 | 1 ADAMS | A | 2 | 1 ADAMS | D | 1 | 3 ADAMS | M | 1 | 4 ADAMS | S | 1 | 5
The extra columns CNT and POS, returned by the inline view X, are crucial to the solution. POS is used to rank each character and CNT is used to determine the number of times the character exists in each name. The final step is to evaluate the position of each character and rebuild the name. You’ll notice that each case statement is actually two case statements. This is to determine whether or not a character occursmore than once in a name; if it does, then rather than return that character, what is returned is that character appended to itself CNT times. The aggregate function, MAX, is used to ensure there is only one row per name.
You have a column that is defined to hold character data. Unfortunately, the rows contain mixed numeric and character data. Consider view V:
create view V as select replace(mixed,' ','') as mixed from ( select substr(ename,1,2)|| cast(deptno as char(4))|| substr(ename,3,2) as mixed from emp where deptno = 10 union all select cast(empno as char(4)) as mixed from emp where deptno = 20 union all select ename as mixed from emp where deptno = 30 ) x select * from v MIXED -------------- CL10AR KI10NG MI10LL 7369 7566 7788 7876 7902 ALLEN WARD MARTIN BLAKE TURNER JAMES
You want to return rows that are numbers only, or that contain at least one number. If the numbers are mixed with character data, you want to remove the characters and return only the numbers. For the sample data above you want the following result set:
MIXED -------- 10 10 10 7369 7566 7788 7876 7902
The functions REPLACE and TRANSLATE are extremely useful for manipulating strings and individual characters. The key is to convert all numbers to a single character, which then makes it easy to isolate and identify any number by referring to a single character.
Use functions TRANSLATE, REPLACE, and POSSTR to isolate the numeric characters in each row. The calls to CAST are necessary in view V; otherwise, the view will fail to be created due to type conversion errors. You’ll need the function REPLACE to remove extraneous white space due to casting to the fixed length CHAR:
1 select mixed old, 2 cast( 3 case 4 when 5 replace( 6 translate(mixed,'9999999999','0123456789'),'9','') = '' 7 then 8 mixed 9 else replace( 10 translate(mixed, 11 repeat('#',length(mixed)), 12 replace( 13 translate(mixed,'9999999999','0123456789'),'9','')), 14 '#','') 15 end as integer ) mixed 16 from V 17 where posstr(translate(mixed,'9999999999','0123456789'),'9') > 0
The syntax for MySQL is slightly different and will define view V as:
create view V as select concat( substr(ename,1,2), replace(cast(deptno as char(4)),' ',''), substr(ename,3,2) ) as mixed from emp where deptno = 10 union all select replace(cast(empno as char(4)), ' ', '') from emp where deptno = 20 union all select ename from emp where deptno = 30
Because MySQL does not support the TRANSLATE function, you must walk each row and evaluate it on a character-by-character basis.
1 select cast(group_concat(c order by pos separator '') as unsigned) 2 as MIXED1 3 from ( 4 select v.mixed, iter.pos, substr(v.mixed,iter.pos,1) as c 5 from V, 6 ( select id pos from t10 ) iter 7 where iter.pos <= length(v.mixed) 8 and ascii(substr(v.mixed,iter.pos,1)) between 48 and 57 9 ) y 10 group by mixed 11 order by 1
Use functions TRANSLATE, REPLACE, and INSTR to isolate the numeric characters in each row. The calls to CAST are not necessary in view V. Use the function REPLACE to remove extraneous white space due to casting to the fixed length CHAR. If you decide you would like to keep the explicit type conversion calls in the view definition, it is suggested you cast to VARCHAR2:
1 select to_number ( 2 case 3 when 4 replace(translate(mixed,'0123456789','9999999999'),'9') 5 is not null 6 then 7 replace( 8 translate(mixed, 9 replace( 10 translate(mixed,'0123456789','9999999999'),'9'), 11 rpad('#',length(mixed),'#')),'#') 12 else 13 mixed 14 end 15 ) mixed 16 from V 17 where instr(translate(mixed,'0123456789','9999999999'),'9') > 0
Use functions TRANSLATE, REPLACE, and STRPOS to isolate the numeric characters in each row. The calls to CAST are not necessary in view V. Use the function REPLACE ito remove extraneous white space due to casting to the fixed length CHAR. If you decide you would like to keep the explicit type conversion calls in the view definition, it is suggested you cast to VARCHAR:
1 select cast( 2 case 3 when 4 replace(translate(mixed,'0123456789','9999999999'),'9','') 5 is not null 6 then 7 replace( 8 translate(mixed, 9 replace( 10 translate(mixed,'0123456789','9999999999'),'9',''), 11 rpad('#',length(mixed),'#')),'#','') 12 else 13 mixed 14 end as integer ) as mixed 15 from V 16 where strpos(translate(mixed,'0123456789','9999999999'),'9') > 0
The built-in function ISNUMERIC along with a wildcard search allows you to easily identify strings that contains numbers, but getting numeric characters out of a string is not particularly efficient because the TRANSLATE function is not supported.
The TRANSLATE function is very useful here as it allows you to easily isolate and identify numbers and characters. The trick is to convert all numbers to a single character; this way, rather than searching for different numbers you only search for one character.
The syntax differs slightly among these DBMSs, but the technique is the same. I’ll use the solution for PostgreSQL for the discussion.
The real work is done by functions TRANSLATE and REPLACE. To get the final result set requires several function calls, each listed below in one query:
select mixed as orig,
translate(mixed,'0123456789','9999999999') as mixed1,
replace(translate(mixed,'0123456789','9999999999'),'9','') as mixed2,
translate(mixed,
replace(
translate(mixed,'0123456789','9999999999'),'9',''),
rpad('#',length(mixed),'#')) as mixed3,
replace(
translate(mixed,
replace(
translate(mixed,'0123456789','9999999999'),'9',''),
rpad('#',length(mixed),'#')),'#','') as mixed4
from V
where strpos(translate(mixed,'0123456789','9999999999'),'9') > 0
ORIG | MIXED1 | MIXED2 | MIXED3 | MIXED4 | MIXED5 --------+--------+--------+--------+--------+-------- CL10AR | CL99AR | CLAR | ##10## | 10 | 10 KI10NG | KI99NG | KING | ##10## | 10 | 10 MI10LL | MI99LL | MILL | ##10## | 10 | 10 7369 | 9999 | | 7369 | 7369 | 7369 7566 | 9999 | | 7566 | 7566 | 7566 7788 | 9999 | | 7788 | 7788 | 7788 7876 | 9999 | | 7876 | 7876 | 7876 7902 | 9999 | | 7902 | 7902 | 7902
First, notice that any rows without at least one number are removed. How this is accomplished will become clear as you examine each of the columns in the above result set. The rows that are kept are the values in the ORIG column and are the rows that will eventually make up the result set. The first step to extracting the numbers is to use the function TRANSLATE to convert any number to a 9 (you can use any digit; 9 is arbitrary), this is represented by the values in MIXED1. Now that all numbers are 9’s, they can be treating as a single unit. The next step is to remove all of the numbers by using the function REPLACE. Because all digits are now 9, REPLACE simply looks for any 9’s and removes them. This is represented by the values in MIXED2. The next step, MIXED3, uses values that are returned by MIXED2. These values are then compared to the values in ORIG. If any characters from MIXED2 are found in ORIG, they are converted to the # character by TRANSLATE. The result set from MIXED3 shows that the letters, not the numbers, have now been singled out and converted to a single character. Now that all non-numeric characters are represented by #’s, they can be treated as a single unit. The next step, MIXED4, uses REPLACE to find and remove any # characters in each row; what’s left are numbers only. The final step is to cast the numeric characters as numbers. Now that you’ve gone through the steps, you can see how the WHERE clause works. The results from MIXED1 are passed to STRPOS, and if a 9 is found (the position in the string where the first 9 is located) the result must be greater than 0. For rows that return a value greater than zero, it means there’s at least one number in that row and it should be kept.
The first step is to walk each string and evaluate each character and determine whether or not it’s a number:
select v.mixed, iter.pos, substr(v.mixed,iter.pos,1) as c
from V,
( select id pos from t10 ) iter
where iter.pos <= length(v.mixed)
order by 1,2
+--------+------+------+ | mixed | pos | c | +--------+------+------+ | 7369 | 1 | 7 | | 7369 | 2 | 3 | | 7369 | 3 | 6 | | 7369 | 4 | 9 | … | ALLEN | 1 | A | | ALLEN | 2 | L | | ALLEN | 3 | L | | ALLEN | 4 | E | | ALLEN | 5 | N | … | CL10AR | 1 | C | | CL10AR | 2 | L | | CL10AR | 3 | 1 | | CL10AR | 4 | 0 | | CL10AR | 5 | A | | CL10AR | 6 | R | +--------+------+------+
Now that each character in each string can be evaluated individually, the next step is to keep only the rows that have a number in the C column:
select v.mixed, iter.pos, substr(v.mixed,iter.pos,1) as c
from V,
( select id pos from t10 ) iter
where iter.pos <= length(v.mixed)
and ascii(substr(v.mixed,iter.pos,1)) between 48 and 57
order by 1,2
+--------+------+------+ | mixed | pos | c | +--------+------+------+ | 7369 | 1 | 7 | | 7369 | 2 | 3 | | 7369 | 3 | 6 | | 7369 | 4 | 9 | … | CL10AR | 3 | 1 | | CL10AR | 4 | 0 | … +--------+------+------+
At this point, all the rows in column C are numbers. The next step is to use GROUP_CONCAT to concatenate the numbers to form their respective whole number in MIXED. The final result is then cast as a number:
select cast(group_concat(c order by pos separator '') as unsigned)
as MIXED1
from (
select v.mixed, iter.pos, substr(v.mixed,iter.pos,1) as c
from V,
( select id pos from t10 ) iter
where iter.pos <= length(v.mixed)
and ascii(substr(x.mixed,iter.pos,1)) between 48 and 57
) y
group by mixed
order by 1
+--------+ | MIXED1 | +--------+ | 10 | | 10 | | 10 | | 7369 | | 7566 | | 7788 | | 7876 | | 7902 | +--------+
As a final note, keep in mind that any digits in each string will be concatenated to form one numeric value. For example, an input value of, say, 99Gennick87 will result in the value 9987 being returned. This is something to keep in mind, particularly when working with serialized data.
You want to extract a specified, delimited substring from a string. Consider the following view V, which generates source data for this problem:
create view V as select 'mo,larry,curly' as name from t1 union all select 'tina,gina,jaunita,regina,leena' as name from t1
Output from the view is as follows:
select * from v
NAME
-------------------
mo,larry,curly
tina,gina,jaunita,regina,leena
You would like to extract the second name in each row, so the final result set would be:
SUB ----- larry gina
The key to solving this problem is to return each name as an individual row while preserving the order in which the name exists in the list. Exactly how you do these things depends on which DBMS you are using.
After walking the NAMEs returned by view V, use the function ROW_NUMBER to keep only the second name from each string:
1 select substr(c,2,locate(',',c,2)-2) 2 from ( 3 select pos, name, substr(name, pos) c, 4 row_number() over( partition by name 5 order by length(substr(name,pos)) desc) rn 6 from ( 7 select ',' ||csv.name|| ',' as name, 8 cast(iter.pos as integer) as pos 9 from V csv, 10 (select row_number() over() pos from t100 ) iter 11 where iter.pos <= length(csv.name)+2 12 ) x 13 where length(substr(name,pos)) > 1 14 and substr(substr(name,pos),1,1) = ',' 15 ) y 16 where rn = 2
After walking the NAMEs returned by view V, use the position of the commas to return only the second name in each string:
1 select name 2 from ( 3 select iter.pos, 4 substring_index( 5 substring_index(src.name,',',iter.pos),',',-1) name 6 from V src, 7 (select id pos from t10) iter, 8 where iter.pos <= 9 length(src.name)-length(replace(src.name,',','')) 10 ) x 11 where pos = 2
After walking the NAMEs returned by view V, retrieve the second name in each list by using SUBSTR and INSTR:
1 select sub 2 from ( 3 select iter.pos, 4 src.name, 5 substr( src.name, 6 instr( src.name,',',1,iter.pos )+1, 7 instr( src.name,',',1,iter.pos+1 ) - 8 instr( src.name,',',1,iter.pos )-1) sub 9 from (select ','||name||',' as name from V) src, 10 (select rownum pos from emp) iter 11 where iter.pos < length(src.name)-length(replace(src.name,',')) 12 ) 13 where pos = 2
Use the function SPLIT_PART to help return each individual name as a row:
1 select name 2 from ( 3 select iter.pos, split_part(src.name,',',iter.pos) as name 4 from (select id as pos from t10) iter, 5 (select cast(name as text) as name from v) src 7 where iter.pos <= 8 length(src.name)-length(replace(src.name,',',''))+1 9 ) x 10 where pos = 2
The SQL Server STRING_SPLIT function will do the whole job, but can only take a single cell. Hence, we use a STRING_AGG within a common table expression to present the data the way STRING_SPLIT requires.
1 with agg_tab(name) 2 as 3 (select STRING_AGG(name,',') from V) 4 select value from 5 STRING_SPLIT( 6 (select name from agg_tab),',')
The syntax is slightly different between these two DBMSs, but the technique is the same. I will use the solution for DB2 for the discussion. The strings are walked and the results are represented by inline view X:
select ','||csv.name|| ',' as name,
iter.pos
from v csv,
(select row_number() over() pos from t100 ) iter
where iter.pos <= length(csv.name)+2
EMPS POS ------------------------------- ---- ,tina,gina,jaunita,regina,leena, 1 ,tina,gina,jaunita,regina,leena, 2 ,tina,gina,jaunita,regina,leena, 3 …
The next step is to then step through each character in each string:
select pos, name, substr(name, pos) c,
row_number() over(partition by name
order by length(substr(name, pos)) desc) rn
from (
select ','||csv.name||',' as name,
cast(iter.pos as integer) as pos
from v csv,
(select row_number() over() pos from t100 ) iter
where iter.pos <= length(csv.name)+2
) x
where length(substr(name,pos)) > 1
POS EMPS C RN --- --------------- ---------------- -- 1 ,mo,larry,curly, ,mo,larry,curly, 1 2 ,mo,larry,curly, mo,larry,curly, 2 3 ,mo,larry,curly, o,larry,curly, 3 4 ,mo,larry,curly, ,larry,curly, 4 …
Now that different portions of the string are available to you, simply identify which rows to keep. The rows you are interested in are the ones that begin with a comma; the rest can be discarded:
select pos, name, substr(name,pos) c,
row_number() over(partition by name
order by length(substr(name, pos)) desc) rn
from (
select ','||csv.name||',' as name,
cast(iter.pos as integer) as pos
from v csv,
(select row_number() over() pos from t100 ) iter
where iter.pos <= length(csv.name)+2
) x
where length(substr(name,pos)) > 1
and substr(substr(name,pos),1,1) = ','
POS EMPS C RN --- -------------- ---------------- -- 1 ,mo,larry,curly, ,mo,larry,curly, 1 4 ,mo,larry,curly, ,larry,curly, 2 10 ,mo,larry,curly, ,curly, 3 1 ,tina,gina,jaunita,regina,leena, ,tina,gina,jaunita,regina,leena, 1 6 ,tina,gina,jaunita,regina,leena, ,gina,jaunita,regina,leena, 2 11 ,tina,gina,jaunita,regina,leena, ,jaunita,regina,leena, 3 19 ,tina,gina,jaunita,regina,leena, ,regina,leena, 4 26 ,tina,gina,jaunita,regina,leena, ,leena, 5
This is an important step as it sets up how you will get the nth substring. Notice that many rows have been eliminated from this query because of the following condition in the WHERE clause:
substr(substr(name,pos),1,1) = ','
You’ll notice that ,larry,curly
, was ranked 4, but now is ranked 2. Remember, the WHERE clause is evaluated before the SELECT, so the rows with leading commas are kept, then ROW_NUMBER performs its ranking. At this point it’s easy to see that, to get the nth substring you want rows where RN equals n. The last step is to keep only the rows you are interested in (in this case where RN equals 2) and use SUBSTR to extract the name from that row. The name to keep is the first name in the row: larry
from ,larry,curly
, and gina
from ,gina,jaunita,regina,leena
,.
The inline view X walks each string. You can determine how many values are in each string by counting the delimiters in the string:
select iter.pos, src.name
from (select id pos from t10) iter,
V src
where iter.pos <=
length(src.name)-length(replace(src.name,',',''))
+------+--------------------------------+ | pos | name | +------+--------------------------------+ | 1 | mo,larry,curly | | 2 | mo,larry,curly | | 1 | tina,gina,jaunita,regina,leena | | 2 | tina,gina,jaunita,regina,leena | | 3 | tina,gina,jaunita,regina,leena | | 4 | tina,gina,jaunita,regina,leena | +------+--------------------------------+
In this case, there is one fewer row than values in each string because that’s all that is needed. The function SUBSTRING_INDEX takes care of parsing the needed values:
select iter.pos,src.name name1,
substring_index(src.name,',',iter.pos) name2,
substring_index(
substring_index(src.name,',',iter.pos),',',-1) name3
from (select id pos from t10) iter,
V src
where iter.pos <=
length(src.name)-length(replace(src.name,',',''))
+------+--------------------------------+--------------------------+---------+ | pos | name1 | name2 | name3 | +------+--------------------------------+--------------------------+---------+ | 1 | mo,larry,curly | mo | mo | | 2 | mo,larry,curly | mo,larry | larry | | 1 | tina,gina,jaunita,regina,leena | tina | tina | | 2 | tina,gina,jaunita,regina,leena | tina,gina | gina | | 3 | tina,gina,jaunita,regina,leena | tina,gina,jaunita | jaunita | | 4 | tina,gina,jaunita,regina,leena | tina,gina,jaunita,regina | regina | +------+--------------------------------+--------------------------+---------+
I’ve shown three name fields, so you can see how the nested SUBSTRING_INDEX calls work. The inner call returns all characters to the left of the nth occurrence of a comma. The outer call returns everything to the right of the first comma it finds (starting from the end of the string). The final step is to keep the value for NAME3 where POS equals n, in this case 2.
STRING_SPLIT is the workhorse here, but needs its data the right way. The common table expression is merely to turn the two rows of the V.names column into a single value, as required by STRING_SPLIT being a table-valued function.
The inline view walks each string. The number of times each string is returned is determined by how many values are in each string. The solution finds the number of values in each string by counting the number of delimiters in it. Because each string is enclosed in commas, the number of values in a string is the number of commas minus one. The strings are then UNIONed and joined to a table with a cardinality that is at least the number of values in the largest string. The functions SUBSTR and INSTR use the value of POS to parse each string:
select iter.pos, src.name,
substr( src.name,
instr( src.name,',',1,iter.pos )+1,
instr( src.name,',',1,iter.pos+1 )
instr( src.name,',',1,iter.pos )-1) sub
from (select ','||name||',' as name from v) src,
(select rownum pos from emp) iter
where iter.pos < length(src.name)-length(replace(src.name,','))
POS NAME SUB --- --------------------------------- ------------- 1 ,mo,larry,curly, mo 1 , tina,gina,jaunita,regina,leena, tina 2 ,mo,larry,curly, larry 2 , tina,gina,jaunita,regina,leena, gina 3 ,mo,larry,curly, curly 3 , tina,gina,jaunita,regina,leena, jaunita 4 , tina,gina,jaunita,regina,leena, regina 5 , tina,gina,jaunita,regina,leena, leena
The first call to INSTR within SUBSTR determines the start position of the substring to extract. The next call to INSTR within SUBSTR finds the position of the nth comma (same as the start position) as well the position of the nth + 1 comma. Subtracting the two values returns the length of the substring to extract. Because every value is parsed into its own row, simply specify WHERE POS = n to keep the nth substring (in this case, where POS = 2, so, the second substring in the list).
The inline view X walks each string. The number of rows returned is determined by how many values are in each string. To find the number of values in each string, find the number of delimiters in each string and add one. The function SPLIT_PART uses the values in POS to find the nth occurrence of the delimiter and parse the string into values:
select iter.pos, src.name as name1,
split_part(src.name,',',iter.pos) as name2
from (select id as pos from t10) iter,
(select cast(name as text) as name from v) src
where iter.pos <=
length(src.name)-length(replace(src.name,',',''))+1
pos | name1 | name2 -----+--------------------------------+--------- 1 | mo,larry,curly | mo 2 | mo,larry,curly | larry 3 | mo,larry,curly | curly 1 | tina,gina,jaunita,regina,leena | tina 2 | tina,gina,jaunita,regina,leena | gina 3 | tina,gina,jaunita,regina,leena | jaunita 4 | tina,gina,jaunita,regina,leena | regina 5 | tina,gina,jaunita,regina,leena | leena
I’ve shown NAME twice so you can see how SPLIT_PART parses each string using POS. Once each string is parsed, the final step is the keep the rows where POS equals the nth substring you are interested in, in this case, 2.
You want to parse an IP address’s fields into columns. Consider the following IP address:
111.22.3.4
You would like the result of your query to be:
A B C D ----- ----- ----- --- 111 22 3 4
The solution depends on the built-in functions provided by your DBMS. Regardless of your DBMS, being able to locate periods and the numbers immediately surrounding them are the keys to the solution.
Use the recursive WITH clause to simulate an iteration through the IP address while using SUBSTR to easily parse it. A leading period is added to the IP address so that every set of numbers has a period in front of it and can be treated the same way.
1 with x (pos,ip) as ( 2 values (1,'.92.111.0.222') 3 union all 4 select pos+1,ip from x where pos+1 <= 20 5 ) 6 select max(case when rn=1 then e end) a, 7 max(case when rn=2 then e end) b, 8 max(case when rn=3 then e end) c, 9 max(case when rn=4 then e end) d 10 from ( 11 select pos,c,d, 12 case when posstr(d,'.') > 0 then substr(d,1,posstr(d,'.')-1) 13 else d 14 end as e, 15 row_number() over( order by pos desc) rn 16 from ( 17 select pos, ip,right(ip,pos) as c, substr(right(ip,pos),2) as d 18 from x 19 where pos <= length(ip) 20 and substr(right(ip,pos),1,1) = '.' 21 ) x 22 ) y
The function SUBSTR_INDEX makes parsing an IP address an easy operation:
1 select substring_index(substring_index(y.ip,'.',1),'.',-1) a, 2 substring_index(substring_index(y.ip,'.',2),'.',-1) b, 3 substring_index(substring_index(y.ip,'.',3),'.',-1) c, 4 substring_index(substring_index(y.ip,'.',4),'.',-1) d 5 from (select '92.111.0.2' as ip from t1) y
Use the built-in function SUBSTR and INSTR to parse and navigate through the IP address:
1 select ip, 2 substr(ip, 1, instr(ip,'.')-1 ) a, 3 substr(ip, instr(ip,'.')+1, 4 instr(ip,'.',1,2)-instr(ip,'.')-1 ) b, 5 substr(ip, instr(ip,'.',1,2)+1, 6 instr(ip,'.',1,3)-instr(ip,'.',1,2)-1 ) c, 7 substr(ip, instr(ip,'.',1,3)+1 ) d 8 from (select '92.111.0.2' as ip from t1)
Use the built-in function SPLIT_PART to parse an IP address:
1 select split_part(y.ip,'.',1) as a, 2 split_part(y.ip,'.',2) as b, 3 split_part(y.ip,'.',3) as c, 4 split_part(y.ip,'.',4) as d 5 from (select cast('92.111.0.2' as text) as ip from t1) as y
Use the recursive WITH clause to simulate an iteration through the IP address while using SUBSTR to easily parse it. A leading period is added to the IP address so that every set of numbers has a period in front of it and can be treated the same way:
1 with x (pos,ip) as ( 2 select 1 as pos,'.92.111.0.222' as ip from t1 3 union all 4 select pos+1,ip from x where pos+1 <= 20 5 ) 6 select max(case when rn=1 then e end) a, 7 max(case when rn=2 then e end) b, 8 max(case when rn=3 then e end) c, 9 max(case when rn=4 then e end) d 10 from ( 11 select pos,c,d, 12 case when charindex('.',d) > 0 13 then substring(d,1,charindex('.',d)-1) 14 else d 15 end as e, 16 row_number() over(order by pos desc) rn 17 from ( 18 select pos, ip,right(ip,pos) as c, 19 substring(right(ip,pos),2,len(ip)) as d 20 from x 21 where pos <= len(ip) 22 and substring(right(ip,pos),1,1) = '.' 23 ) x 24 ) y
By using the built-in functions for your database, you can easily walk through parts of a string. The key is being able to locate each of the periods in the address. Then you can parse the numbers between each.
In a recipe at the end of this chapter we will see how regular expressions can be used with most RDBMS’s - parsing an IP address is also a good area to apply this idea.
Between spelling mistakes and legitimate ways to spell words differently such as British versus American spelling there are many times that two words that you want to match are represented by different strings of characters. Fortunately, SQL provides a way to represent the way words sound allowing to find strings that sound the same even though the underlying characters aren’t identical.
For example, you have a list of authors’ names, including some from an earlier era when spelling wasn’t as fixed as it is now, combined with some extra mis-spelling and typos. For example the following column of names:
a_name ---- 2 Johnson 3 Jonson 4 Jonsen 5 Jensen 6 Johnsen 7 Shakespeare 8 Shakspear 9 Shaekspir 9 Shakespar
Although this is likely part of a longer list, you’d like to identify which of these names are plausible phonetic match for other names on the list. While this is an exercise where there are more than one possible solution, your solution will be something like this (the meaning of the last column will become clearer by the end of the recipe:
a_name1 a_name2 soundex_name ---- ---- ---- Jensen Johnson J525 Jensen Jonson J525 Jensen Jonsen J525 Jensen Johnsen J525 Johnsen Johnson J525 Johnsen Jonson J525 Johnsen Jonsen J525 Johnsen Jensen J525 ... Jonson Jensen J525 Jonson Johnsen J525 Shaekspir Shakspear S216 Shakespar Shakespeare S221 Shakespeare Shakespar S221 Shakspear Shaekspir S216
Use the SOUNDEX function to convert strings of characters into the way they sound when spoken in English. A simple self-join allows you to compare values from the same column.
1 select an1.a_name as name1, an2.a_name as name2, 2 SOUNDEX(an1.a_name) as Soundex_Name 2 from author_names an1 3 join author_names an2 4 on (SOUNDEX(an1.a_name)=SOUNDEX(an2.a_name) 5 and an1.a_name not like an2.a_name)
The thinking behind SOUNDEX pre-dates both databases and computing, as it originated with the US Census, as an attempt to resolve different spellings of proper names for both people and places. There are many algorithms which attempt the same task as SOUNDEX, and, of course, there are also alternative versions for languages other than English. However, we cover SOUNDEX, as it comes with most RDBMS’s.
Soundex keeps the first letter of the name, and then replaces the remaining values with numbers that have the same value if they are phonetically similar. For example, m and n are both replaced with the number 5.
In the example above, the actual Soundex output is shown in the Soundex_Name column. This is just to show what is happening, and not necessary for the solution; some RDMS’s even have a function which hides the Soundex result, such as SQL Server’s Difference function, which compares two strings using Soundex, and returns a similarity scale from 0 to 4 (e.g. 4 is a perfect match between the Soundex outputs, representing 4/4 characters in the Soundex version of the two strings match).
Sometimes Soundex will be sufficient for your needs; other times it won’t be. However, a small amount of research, possibly using texts such as Data Matching (Christen, 2012) will help you find other algorithms which are frequently (but not always) simple to implement as a user-defined function, or in another programming language to suit your taste and needs.
You have a text field that contains some structured text values (e.g., phone numbers), and you wish to find occurrences where those values are structured incorrectly. For example, you have data like the following:
select emp_id, text
from employee_comment
EMP_ID TEXT ---------- ------------------------------------------------------------ 7369 126 Varnum, Edmore MI 48829, 989 313-5351 7499 1105 McConnell Court Cedar Lake MI 48812 Home: 989-387-4321 Cell: (237) 438-3333
and you wish to list rows having invalidly formatted phone numbers. For example, you wish to list the following row because its phone number uses two different separator characters:
7369 126 Varnum, Edmore MI 48829, 989 313-5351
You wish to consider valid only those phone numbers that use the same character for both delimiters.
This problem has a multi-part solution:
Find a way to describe the universe of apparent phone numbers that you wish to consider.
Remove any validly formatted phone numbers from consideration.
See whether you still have any apparent phone numbers left. If you do, you know those are invalidly formatted.
select emp_id, text
from employee_comment
where regexp_like(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
and regexp_like(
regexp_replace(text,
'[0-9]{3}([-. ])[0-9]{3}1[0-9]{4}','***'),
'[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
EMP_ID TEXT ---------- ------------------------------------------------------------ 7369 126 Varnum, Edmore MI 48829, 989 313-5351 7844 989-387.5359 9999 906-387-1698, 313-535.8886
Each of these rows contains at least one apparent phone number that is not correctly formatted.
The key to this solution lies in the detection of an “apparent phone number.” Given that the phone numbers are stored in a comment field, any text at all in the field could be construed to be an invalid phone number. You need a way to narrow the field to a more reasonable set of values to consider. You don’t, for example, want to see the following row in your output:
EMP_ID TEXT ---------- ---------------------------------------------------------- 7900 Cares for 100-year-old aunt during the day. Schedule only for evening and night shifts.
Clearly there’s no phone number at all in this row, much less one that is invalid. You and I can see that. The question is, how do you get the RDBMS to “see” it. I think you’ll enjoy the answer. Please read on.
This recipe comes (with permission) from an article by Jonathan Gennick called “Regular Expression Anti-Patterns,” which you can read at: http://gennick.com/antiregex.htm.
The solution uses Pattern A to define the set of “apparent” phone numbers to consider:
Pattern A: [0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}
Pattern A checks for two groups of three digits followed by one group of four digits. Any one of a dash (-), a period (.), or a space are accepted as delimiters between groups. You could come up with a more complex pattern. For example, you could decide that you also wish to consider seven-digit phone numbers. But don’t get side-tracked. The point now is that somehow you do need to define the universe of possible phone number strings to consider, and for this problem that universe is defined by Pattern A. You can define a different Pattern A, and the general solution still applies.
The solution uses Pattern A in the WHERE clause to ensure that only rows having potential phone numbers (as defined by the pattern!) are considered:
select emp_id, text from employee_comment where regexp_like(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
Next, you need to define what a “good” phone number looks like. The solution does this using Pattern B:
Pattern B: [0-9]{3}([-. ])[0-9]{3}1[0-9]{4}
This time, the pattern uses 1 to reference the first subexpression. Whichever character is matched by ([-. ]) must also be matched by 1. Pattern B describes good phone numbers, which must be eliminated from consideration (as they are not bad). The solution eliminates the well-formatted phone numbers through a call to REGEXP_ REPLACE:
regexp_replace(text, '[0-9]{3}([-. ])[0-9]{3}1[0-9]{4}','***'),
This call to REGEXP_REPLACE occurs in the WHERE clause. Any well-formatted phone numbers are replaced by a string of three asterisks. Again, Pattern B can be any pattern that you desire. The point is that Pattern B describes the acceptable pattern that you are after.
Having replaced well-formatted phone numbers with strings of three asterisks (*), any “apparent” phone numbers that remain must, by definition, be poorly formatted. The solution applies REGEXP_LIKE to the output from REGEXP_LIKE to see whether any poorly formatted phone numbers remain:
and regexp_like( regexp_replace(text, '[0-9]{3}([-. ])[0-9]{3}1[0-9]{4}','***'), '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
Regular Expressions are a big topic in their own right, requiring practice to master. Once you do master them, you will find they match a great variety of string patterns with ease. We recommend studying a book such as Mastering Regular Expressions by Jeffrey Friedl to get your regular expression skills to the required level.
Matching on strings can be a painful task. SQL has added a range of tools to reduce the pain, and mastering them will keep you out of trouble. Although a lot can be done with the native SQL string functions, using the regular expression functions that are increasingly available takes it to another level altogether.
3.145.186.6