Chapter 11

Character Data Types in SQL

Abstract

SQL is not a string processing language. Character data in SQL is supported by a simple function library.

Keywords

CHARACTER(n)

CHAR(n)

NATIONAL CHARACTER(n)

NATIONAL CHARACTER VARYING(n)

NCHAR(n)

NVARCHAR(n)

UNICODE

String Equality

Soundex

Metaphone

New York State Identification and Intelligence System

Cutter tables

SQL-89 defined a CHARACTER(n) or CHAR(n) data type, which represents a fixed-length string of (n) printable characters, where (n) is always greater than zero. Some implementations allow the string to contain control characters, but this is not the usual case. The allowable characters are usually drawn from ASCII or EBCDIC character sets and most often use those collation sequences for sorting.

SQL-92 added the CHARACTER VARYING(n) or VARCHAR(n) data type, which was already present in many implementations. A VARCHAR(n) represents a string that varies in length from 1 to (n) printable characters. This is important; SQL does not allow a string column of zero length, but you may find vendors who do, so that you can store an empty string.

SQL-92 also added NATIONAL CHARACTER(n) and NATIONAL CHARACTER VARYING(n) data types (or NCHAR(n) and NVARCHAR(n), respectively), which are made up of printable characters drawn from ISO-defined Unicode character sets. The literal values use the syntax N'<string>' in these data types.

SQL-92 also allows the database administrator to define collation sequences and do other things with the character sets. The Unicode Consortium (http://www.unicode.org/) maintains the Unicode standards and makes them available in book form (UNICODE STANDARD, VERSION 5.0; ISBN-13: 978-0321480910) or on its Website.

When the Standards got to SQL:2006, we had added a lot of things to handle Unicode and XML data but kept the basic string manipulations pretty simple compared to what vendors have. I am not going to deal with the Unicode and XML data in any detail because most working SQL programmers are using ASCII or a national character set exclusively in their databases.

11.1 Problems with SQL Strings

Different programming languages handle strings differently. You simply have to do some unlearning when you learn SQL. Here are the major problem areas for programmers.

In SQL, character strings are printable characters enclosed in single quotation marks. Many older SQL implementations and several programming languages use double quotation marks or make it an option so that the single quotation mark can be used as an apostrophe. SQL uses two apostrophes together to represent a single apostrophe in a string literal. SQL Server uses square brackets where Standard SQL uses double quotation marks.

Double quotation marks are reserved for column names that have embedded spaces or that are also SQL reserved words.

Character sets fall into three categories: those defined by national or international standards, those provided by implementations, and those defined by applications. All character sets, however defined, always contain the <space> character. Character sets defined by applications can be defined to “reside” in any schema chosen by the application. Character sets defined by standards or by implementations reside in the Information Schema (named INFORMATION_SCHEMA) in each catalog, as do collations defined by standards and collations and form-of-use conversions defined by implementations. There is a default collating sequence for each character repertoire, but additional collating sequences can be defined for any character repertoire. This can be important in languages that have more than one collating sequence in use. For example, German dictionaries sort “ö” with “o,” so “of” comes before “öf,” but German telephone books sort “ö” as “oe,” which puts “öf” before “of.” It is a good idea to look at http://userguide.icu-project.org/collation for a guide to the current Unicode rules.

11.1.1 Problems of String Equality

No two languages agree on how to compare character strings as equal unless they are identical in length and match position for position, exactly character for character.

The first problem is whether uppercase and lowercase versions of a letter compare as equal to each other. Only Latin, Greek, Cyrillic, and Arabic have cases; the first three have upper and lower cases, whereas Arabic is a connected script that has initial, middle, terminal, and stand-alone forms of its letters. Most programming languages, including SQL, ignore case in the program text, but not always in the data. Some SQL implementations allow the DBA to set uppercase and lowercase matching as a system configuration parameter.

Standard SQL has two folding functions (yes, that is the name) that change the case of a string:

LOWER(<string expression>) shifts all letters in the parameter string to corresponding lowercase letters;

UPPER(<string expression>) shifts all letters in the string to uppercase. Most implementations have had these functions (perhaps with different names) as vendor library functions.

Equality between strings of unequal length is calculated by first padding out the shorter string with blanks on the right-hand side until the strings are of the same length. Then they are matched, position for position, for identical values. If one position fails to match, then the equality fails.

In contrast, the Xbase languages (FoxPro, dBase, and so on) truncate the longer string to the length of the shorter string and then match them position for position. Other programming languages ignore upper- and lowercase differences.
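The two comparison rules can be modeled outside of SQL. The following is a minimal Python sketch, not any product's implementation; the function names sql_equal and xbase_equal are made up for this illustration.

```python
def sql_equal(a: str, b: str) -> bool:
    """Model of the SQL rule: pad the shorter string with blanks, then compare."""
    width = max(len(a), len(b))
    return a.ljust(width) == b.ljust(width)


def xbase_equal(a: str, b: str) -> bool:
    """Model of the Xbase rule: truncate the longer string, then compare."""
    width = min(len(a), len(b))
    return a[:width] == b[:width]


print(sql_equal("Smith", "Smith   "))    # trailing blanks do not matter
print(sql_equal("Smith", "Smithers"))    # padding cannot make these match
print(xbase_equal("Smith", "Smithers"))  # truncation makes these match
print(sql_equal("smith", "Smith"))       # case still matters in this model
```

Under the padding rule, only trailing blanks are forgiven; under the truncation rule, any string matches every longer string that begins with it.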

11.1.2 Problems of String Ordering

SQL-89 was silent on the collating sequence to be used. In practice, almost all SQL implementations used either ASCII or EBCDIC, which are both Latin I character sets in ISO terminology. A few implementations have a Dictionary or Library order option (uppercase and lowercase letters mixed together in alphabetic order: {A, a, B, b, C, c, etc.}), and many vendors offer a national-language option that is based on the appropriate ISO Standard.

National language options can be very complicated. The Nordic languages all share a common ISO character set, but they do not sort the same letters in the same position. German was sorted differently in Germany and Austria. Spain decided to quit sorting ‘ch’ and ‘ll’ as if they were single characters. You really need to look at the ISO Unicode implementation for your particular product.

Standard SQL allows the DBA to define a collating sequence that is used for comparisons. The feature is becoming more common as we become more globalized, but you have to see what the vendor of your SQL product actually supports.

11.1.3 Problems of String Grouping

Because the SQL equality test has to pad out the shorter of the two strings with spaces, you may find that doing a GROUP BY on a VARCHAR(n) column has unpredictable results:

 CREATE TABLE Foobar (x VARCHAR(5) NOT NULL);
 INSERT INTO Foobar VALUES ('a'), ('a '), ('a  '), ('a   ');

Now, execute the query:

SELECT x, CHAR_LENGTH(x)
FROM Foobar
 GROUP BY x;

The value for CHAR_LENGTH(x) will vary for different products. The most common answers are 1, 4, and 5 in this example. A length of 1 is returned because it is the length of the shortest string or because it is the length of the first string physically in the table. A length of 4 is returned because it is the length of the longest string in the table. A length of 5 is returned because it is the greatest possible length of a string in the table.

SQL has two equivalence class operators; if you do not know what that means, go back to your Set Theory course. They partition an entire set into disjoint subsets. The first one is simple, vanilla scalar equality (=). The second is grouping, as in GROUP BY. This second operator treats all NULLs as part of the same class and follows the padding rule for strings. You will see this later in SQL.

You might want to add a constraint that makes sure to trim the trailing blanks to avoid problems.
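To see why the reported length is ambiguous, the grouping can be modeled in Python. This is only a sketch of the padding rule, not of any SQL engine; stripping trailing blanks is used as the grouping key because two strings are equal under padding exactly when they agree after their trailing blanks are removed. The helper name has_no_trailing_blanks is hypothetical and stands in for the kind of condition a trimming constraint would enforce.

```python
from collections import defaultdict

# The four values from the Foobar example: 'a' followed by 0 to 3 blanks.
rows = ["a", "a ", "a  ", "a   "]

# Group by the string with trailing blanks removed (the padded-equality class).
groups = defaultdict(list)
for x in rows:
    groups[x.rstrip(" ")].append(x)

# One group, but four candidate lengths for the single result row.
for key, members in groups.items():
    print(repr(key), [len(m) for m in members])


def has_no_trailing_blanks(x: str) -> bool:
    """The condition a trimming constraint would require of every stored value."""
    return x == x.rstrip(" ")


print([has_no_trailing_blanks(x) for x in rows])
```

All four rows land in one group, so a product is free to report the length of any member, or of the declared column width; only the first row would pass the trimming check.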

11.2 Standard String Functions

SQL-92 defines a set of string functions that appear in most products, but with vendor-specific syntax. You will probably find that products will continue to support their own syntax but will also add the Standard SQL syntax in new releases. Let’s look at the basic operations.

String concatenation is shown with the || operator, taken from PL/I. However, you can also find the plus sign being overloaded in the Sybase/SQL Server family and some products using a function call like CONCAT(s1, s2) instead.

The SUBSTRING(<string> FROM <start> FOR <length>) function uses three arguments: the source string, the starting position of the substring, and the length of the substring to be extracted. Truncation occurs when the implied starting and ending positions are not both within the given string.
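The truncation rule is easier to see in a small model. The Python sketch below follows one reading of the Standard's rule, in which positions are 1-based and any part of the requested range that falls outside the string is silently dropped; the function name sql_substring is made up for this illustration, and individual products may behave differently.

```python
def sql_substring(source: str, start: int, length: int) -> str:
    """Model of SUBSTRING(source FROM start FOR length) with 1-based positions."""
    if length < 0:
        raise ValueError("substring length must not be negative")
    end = start + length                  # one past the last requested position
    first = max(start, 1)                 # clip the range to the string
    last = min(end, len(source) + 1)
    if first >= last:
        return ""                         # the range misses the string entirely
    return source[first - 1:last - 1]     # convert to Python's 0-based slice


print(sql_substring("abcdef", 2, 3))    # range fully inside the string
print(sql_substring("abcdef", 5, 10))   # runs off the right-hand end
print(sql_substring("abcdef", -1, 4))   # starts before position 1
print(sql_substring("abcdef", 10, 2))   # starts past the end
```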

DB2 and other products have a LEFT and a RIGHT function. The LEFT function returns a string consisting of the specified number of left-most characters of the string expression, and RIGHT, well, that is kind of obvious.

The fold functions are a pair of functions for converting all the lowercase characters in a given string to uppercase, UPPER(<string>), or all the uppercase ones to lowercase, LOWER(<string>). We already mentioned them.

The TRIM([[<trim specification>] [<trim character>] FROM] <trim source>) function produces a result string that is the source string with the leading and/or trailing occurrences of an unwanted character removed. The <trim source> is the original character value expression. The <trim specification> is LEADING, TRAILING, or BOTH, and the <trim character> is the single character that is to be removed. If you don’t give a <trim character>, then a space is assumed.

The SQL-92 version is a very general function, but you will find that most SQL implementations have a version that works only with spaces. Many early SQLs had two functions: LTRIM for left-most (leading) blanks and RTRIM for right-most (trailing) blanks.
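The general form of TRIM can be modeled with Python's built-in strip methods. This is a sketch for illustration only; the function name sql_trim and its argument order are assumptions, not part of any SQL product.

```python
def sql_trim(source: str, spec: str = "BOTH", trim_char: str = " ") -> str:
    """Model of TRIM(spec trim_char FROM source); trim_char is one character."""
    if len(trim_char) != 1:
        raise ValueError("trim character must be exactly one character")
    spec = spec.upper()
    if spec == "LEADING":
        return source.lstrip(trim_char)
    if spec == "TRAILING":
        return source.rstrip(trim_char)
    if spec == "BOTH":
        return source.strip(trim_char)
    raise ValueError("trim specification must be LEADING, TRAILING, or BOTH")


print(repr(sql_trim("  abc  ")))               # default: both ends, spaces
print(sql_trim("0042", "LEADING", "0"))        # strip leading zeros
print(sql_trim("xxabcxx", "TRAILING", "x"))    # strip trailing x's only
```

The LEADING and TRAILING cases with a space correspond to the older LTRIM and RTRIM functions.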

A character translation is a function for changing each character of a given string according to some many-to-one or one-to-one mapping between two not necessarily distinct character sets.

The syntax TRANSLATE(<string expression> USING <translation>) assumes that a special schema object, called a translation, has already been created to hold the rules for doing all of this.

CHAR_LENGTH(<string>), also written CHARACTER_LENGTH(<string>), determines the length of a given character string, as an integer, in characters. In most current products, this function is usually expressed as LENGTH() and the next two functions do not exist at all; they assume that the database will only hold ASCII or EBCDIC characters.

BIT_LENGTH(<string>) determines the length of a given character string, as an integer, in bits.

OCTET_LENGTH(<string>) determines the length of a given character string, as an integer, in octets. Octets are units of 8 bits that are used by the one-octet and two-octet (Unicode) character sets. This is the same as TRUNCATE(BIT_LENGTH(<string>)/8).
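The difference between the three lengths only shows up once a character needs more than one octet. The Python sketch below models them under the assumption that the stored encoding is known; the function names and the choice of UTF-8 and UTF-16 are illustrative, since the encoding a real database uses depends on the product and its configuration.

```python
def char_length(s: str) -> int:
    """Length in characters."""
    return len(s)


def octet_length(s: str, encoding: str = "utf-8") -> int:
    """Length in octets under the assumed encoding."""
    return len(s.encode(encoding))


def bit_length(s: str, encoding: str = "utf-8") -> int:
    """Length in bits: eight per octet."""
    return 8 * octet_length(s, encoding)


text = "Zo\u00eb"                        # 'Zoë' with a precomposed e-diaeresis
print(char_length(text))                 # three characters
print(octet_length(text))                # UTF-8 needs two octets for the last one
print(octet_length(text, "utf-16-le"))   # two octets per character here
print(bit_length(text))
```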

The POSITION(<search string> IN <source string>) function determines the first position, if any, at which the <search string> occurs within the <source string>. If the <search string> is of length zero, then it occurs at position 1 for any value of the <source string>. If the <search string> does not occur in the <source string>, zero is returned. You will also see LOCATE() in DB2 and CHARINDEX() in SQL Server.
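All three cases (found, empty search string, not found) fall out of a one-line model in Python, because str.find returns a 0-based index, 0 for an empty search string, and -1 when nothing is found. The function name sql_position is made up for this sketch.

```python
def sql_position(search: str, source: str) -> int:
    """Model of POSITION(search IN source): 1-based, 0 when not found."""
    return source.find(search) + 1


print(sql_position("lk", "Celko"))   # found in the middle
print(sql_position("", "Celko"))     # empty search string
print(sql_position("x", "Celko"))    # not found
```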

11.3 Common Vendor Extensions

The original SQL-89 standard did not define any functions for CHAR(n) data types. The Standard SQL added the basic functions that have been common to implementations for years. However, there are other common or useful functions, and it is worth knowing how to implement them outside of SQL.

Many vendors also have functions that will format dates for display by converting the internal format to a text string. A vendor whose SQL is tied to a 4GL is much more likely to have these extensions simply because the 4GL can use them.

These functions generally use either a COBOL-style picture parameter or a globally set default format. Some of this conversion work is done with the CAST() function in Standard SQL, but since SQL does not have any output statements, such things will be vendor extensions for some time to come.

Vendor extensions are varied, but there are some that are worth mentioning. The names will be different in different products, but the functionality will be the same.

SPACE(n) produces a string of (n) spaces for (n > 0).

REPLICATE(<string expression>, n) produces a string of (n) repetitions of the <string expression>. DB2 calls this one REPEAT() and you will see other local names for it.

REPLACE(<target string>, <old string>, <new string>) replaces the occurrences of the <old string> with the <new string> in the <target string>.

As an aside, here is a nice trick to reduce several contiguous spaces in a string to a single space to format text. It assumes that the characters '<' and '>' do not appear in the text:

UPDATE Foobar
   SET sentence
       = REPLACE(
           REPLACE(
             REPLACE(sentence, SPACE(1), '<>'),
             '><', ''),
           '<>', SPACE(1));
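The three replacements are easier to follow one step at a time. Here is the same idea as a Python sketch (the function name collapse_spaces is hypothetical); like the SQL, it is only safe when the text does not already contain '<' or '>'.

```python
def collapse_spaces(sentence: str) -> str:
    """Reduce each run of spaces to one space with three replacements."""
    marked = sentence.replace(" ", "<>")    # every space becomes a '<>' marker
    squeezed = marked.replace("><", "")     # adjacent markers cancel down to one
    return squeezed.replace("<>", " ")      # the surviving marker becomes a space


print(collapse_spaces("a   b  c"))   # 'a<><><>b<><>c' -> 'a<>b<>c' -> 'a b c'
```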

REVERSE(<string expression>) reverses the order of the characters in a string to make it easier to search.

11.3.1 Phonetic Matching

People’s names are a problem for designers of databases. Names are variable-length, can have strange spellings, and are not unique. American names have a diversity of ethnic origins, which give us names pronounced the same way but spelled differently and vice versa.

Ignoring this diversity of names, errors in reading or hearing a name lead to mutations. Anyone who gets junk mail is aware of this; I get mail addressed to “Selco,” “Selko,” “Celco,” as well as “Celko,” which are phonetic errors, and also some that result from typing errors, such as “Cellro,” “Chelco,” and “Chelko” in my mail stack. Such errors result in the mailing of multiple copies of the same item to the same address. To solve this problem, we need phonetic algorithms that can find similar sounding names.

11.3.1.1 Soundex Functions

The Soundex family of algorithms is named after the original algorithm. A Soundex algorithm takes a person’s name as input and produces a character string that identifies a set of names that are (roughly) phonetically alike.

SQL products often have a Soundex algorithm in their library functions. It is also possible to compute a Soundex in SQL, using string functions and the CASE expression in the Standard SQL. Names that sound alike do not always have the same Soundex code. For example, “Lee” and “Leigh” are pronounced alike but have different Soundex codes because the silent ‘g’ in “Leigh” is given a code.

Names that sound alike but start with a different first letter will always have different Soundex codes; for example, “Carr” and “Karr” get separate codes.

Finally, Soundex is based on English pronunciation, so European and Asian names may not encode correctly. Just looking at French surnames like “Beaux” with a silent ‘x’ and “Beau” without it, we will create two different Soundex codes.

Sometimes names that don’t sound alike have the same Soundex code. Consider the relatively common names “Powers,” “Pierce,” “Price,” “Perez,” and “Park” which all have the same Soundex code. Yet “Power,” a common way to spell Powers 100 years ago, has a different Soundex code.

11.3.1.2 The Original Soundex

Margaret O’Dell and Robert C. Russell patented the original Soundex algorithm in 1918. The method is based on the phonetic classification of sounds by how they are made.

In case you wanted to know, the six groups are bilabial, labiodental, dental, alveolar, velar, and glottal. The algorithm is fairly straightforward to code and requires no backtracking or multiple passes over the input word. This should not be too surprising, since it was in use before computers and had to be done by hand by clerks. Here is the algorithm:

1.0 Capitalize all letters in the word. Pad the word with right-most blanks as needed during each procedure step.

2.0 Retain the first letter of the word.

3.0 Drop all occurrences of the following letters after the first position: A, E, H, I, O, U, W, Y.

4.0 Change letters from the following sets into the corresponding digits given:

1 = B, F, P, V

2 = C, G, J, K, Q, S, X, Z

3 = D, T

4 = L

5 = M, N

6 = R

5.0 Retain only one occurrence of consecutive duplicate digits from the string that resulted after step 4.0.

6.0 Pad the string that resulted from step 5.0 with trailing zeros and return only the first four positions, which will be of the form <uppercase letter><digit><digit><digit>.
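The six steps above can be sketched in a few lines of Python. This is a literal reading of the steps as listed, not a reference implementation; the names CODES and soundex are made up here, and skipping anything outside A to Z (such as the apostrophe in “O’Dell”) is an added assumption.

```python
# Step 4 table: letter -> digit. Letters not listed here have no code.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    # Step 1: capitalize; non-letters are skipped (an assumption of this sketch).
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    # Step 2: retain the first letter.
    head, tail = letters[0], letters[1:]
    # Steps 3 and 4: A, E, H, I, O, U, W, Y have no code and drop out;
    # every other letter becomes its digit.
    digits = [CODES[ch] for ch in tail if ch in CODES]
    # Step 5: keep one occurrence of each run of identical digits.
    kept = []
    for d in digits:
        if not kept or kept[-1] != d:
            kept.append(d)
    # Step 6: pad with zeros and keep the first four positions.
    return (head + "".join(kept) + "000")[:4]


for name in ["Lee", "Leigh", "Powers", "Pierce", "Price", "Perez", "Park", "Power"]:
    print(name, soundex(name))
```

Running it reproduces the behavior described earlier: “Lee” and “Leigh” get different codes, while “Powers,” “Pierce,” “Price,” “Perez,” and “Park” all share one code that “Power” does not.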

An alternative version of the algorithm, due to Russell, changes the letters in step 3.0 (A, E, H, I, O, U, W, Y) to ‘9’s, retaining them without dropping them. Then step 5.0 is replaced by two steps:

5.1: Remove redundant duplicates: ‘22992345’ → ‘292345’

5.2: Remove all ‘9’s and close the spaces: ‘292345’ → ‘22345’

This allows pairs of duplicate digits to appear in the result string. This version has more granularity and will work better for a larger sample of names.
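The alternative version can be sketched the same way. In this Python illustration (the names CODES and soundex_with_nines are hypothetical), every letter without a digit code becomes a ‘9’ placeholder, duplicates are collapsed while the placeholders are still in place, and only then are the ‘9’s removed.

```python
# Letter -> digit table from step 4; anything else becomes the placeholder '9'.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex_with_nines(name: str) -> str:
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    head, tail = letters[0], letters[1:]
    # Steps 3 and 4 (variant): A, E, H, I, O, U, W, Y are kept as '9'.
    digits = [CODES.get(ch, "9") for ch in tail]
    # Step 5.1: collapse runs of identical digits, placeholders included.
    kept = []
    for d in digits:
        if not kept or kept[-1] != d:
            kept.append(d)
    # Step 5.2: remove the placeholders and close the gaps.
    kept = [d for d in kept if d != "9"]
    # Step 6: pad with zeros and keep the first four positions.
    return (head + "".join(kept) + "000")[:4]


print(soundex_with_nines("Ashcraft"))   # the 'H' keeps 'S' and 'C' apart
print(soundex_with_nines("Powers"))
```

Because the placeholder sits between the two ‘2’ codes in “Ashcraft,” they are not merged, which is how a pair of duplicate digits can survive into the result.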

The problem with the Soundex is that it was a manual operation used by the Census Bureau long before computers. The algorithm used was not always applied uniformly from place to place. Surname prefixes, such as “La,” “De,” “von,” or “van,” are generally dropped from the last name for Soundex, but not always.

If you are searching for surnames such as “DiCaprio” or “LaBianca,” you should try the Soundex for both with and without the prefix. Likewise leading syllables like “Mc,” “Mac,” and “O’” were also dropped.

Then there was a question about dropping ‘H’ and ‘W’ along with the vowels. The U.S. Census Soundex did it both ways, so a name like “Ashcraft” could be converted to “Ascrft” in the first pass, and finally Soundexed to “A261,” as it is in the 1920 New York Census. The Soundex code for the 1880, 1900, and 1910 censuses followed both rules. In this case, Ashcraft would be “A226” in some places. The reliability of Soundex is 95.99% with selectivity factor of 0.213% for a name inquiry.

This version is easy to translate into various dialects. The WHILE loop would be better done with a REPEAT loop, but not all products have that construct. The TRANSLATEs could be one statement, but this is easier to read. Likewise, the REPLACE functions could be nested.

CREATE FUNCTION Soundex(IN in_name VARCHAR(50))
RETURNS CHAR(4)
DETERMINISTIC
LANGUAGE SQL
BEGIN ATOMIC
DECLARE header_char CHAR(1);
DECLARE prior_name_size INTEGER;
-- split the name into a head and a tail
SET header_char = UPPER (SUBSTRING (in_name FROM 1 FOR 1));
SET in_name = UPPER (SUBSTRING (in_name FROM 2 FOR CHAR_LENGTH(in_name)));
-- clean out vowels; TRANSLATE here takes (source, to-characters, from-characters)
-- as in DB2, and a short to-string is padded with blanks
SET in_name = TRANSLATE (in_name, ' ', 'AEHIOUWY');
-- clean out spaces and add zeros
SET in_name = REPLACE (in_name, ' ', '') || '0000';
-- consonant changes
SET in_name = TRANSLATE (in_name, '1111', 'BFPV');
SET in_name = TRANSLATE (in_name, '22222222', 'CGJKQSXZ');
SET in_name = TRANSLATE (in_name, '33', 'DT');
SET in_name = TRANSLATE (in_name, '4', 'L');
SET in_name = TRANSLATE (in_name, '55', 'MN');
SET in_name = TRANSLATE (in_name, '6', 'R');
-- loop to clean out duplicate digits
WHILE 1 = 1
DO
  SET prior_name_size = CHAR_LENGTH (in_name);
  SET in_name = REPLACE (in_name, '11', '1');
  SET in_name = REPLACE (in_name, '22', '2');
  SET in_name = REPLACE (in_name, '33', '3');
  SET in_name = REPLACE (in_name, '44', '4');
  SET in_name = REPLACE (in_name, '55', '5');
  SET in_name = REPLACE (in_name, '66', '6');
  -- no size change means no more duplicate digits, time to output the answer
  IF prior_name_size = CHAR_LENGTH (in_name)
  THEN RETURN header_char || SUBSTRING (in_name FROM 1 FOR 3);
  END IF;
END WHILE;
END;

11.3.1.3 Metaphone

Metaphone is another improved Soundex that first appeared in Computer Language magazine (Philips 1990). A Pascal version written by Terry Smithwick (Smithwick 1991), based on the original C version by Lawrence Philips, is reproduced with permission here:

CONST
  VowelSet = ['A', 'E', 'I', 'O', 'U'];
  FrontVSet = ['E', 'I', 'Y'];
  VarSonSet = ['C', 'S', 'T', 'G'];
  { variable sound - modified by following 'H' }

FUNCTION SubStr (A : STRING;
  Start, Len : INTEGER) : STRING;
BEGIN
  SubStr := Copy (A, Start, Len);
END;

FUNCTION Metaphone (p : STRING) : STRING;
VAR
  i, l, n : BYTE;
  silent, new : BOOLEAN;
  last, this, next, nnext : CHAR;
  m, d : STRING;
BEGIN { Metaphone }
  IF (p = '')
  THEN BEGIN
    Metaphone := '';
    EXIT;
    END;
  { Convert to uppercase; assume all alphas }
  FOR i := 1 TO Length (p)
  DO p[i] := UpCase (p[i]);
  { initial preparation of string }
  d := SubStr (p, 1, 2);
  IF (d = 'KN') OR (d = 'GN') OR (d = 'PN') OR (d = 'AE') OR (d = 'WR')
  THEN p := SubStr (p, 2, Length (p) - 1); { first letter is silent }
  IF (p[1] = 'X')
  THEN p := 'S' + SubStr (p, 2, Length (p) - 1);
  IF (d = 'WH')
  THEN p := 'W' + SubStr (p, 3, Length (p) - 2); { drop the 'H' }
  { Set up for Case statement }
  l := Length (p);
  m := ''; { Initialize the main variable }
  n := 1;  { Position counter }
  WHILE ((Length (m) < 6) AND (n <= l))
  DO BEGIN { Set up the 'pointers' for this loop-around }
    IF (n > 1)
    THEN last := p[n-1]
    ELSE last := #0;
    { use a nul terminated string }
    this := p[n];
    IF (n < l)
    THEN next := p[n+1]
    ELSE next := #0;
    IF ((n+1) < l)
    THEN nnext := p[n+2]
    ELSE nnext := #0;
    { skip a doubled letter, except for 'CC' inside a word }
    new := (n = 1) OR (this <> last) OR (this = 'C');
    IF (new)
    THEN BEGIN
      IF ((this IN VowelSet) AND (n = 1))
      THEN m := this;
      CASE this OF
      'B' : IF NOT ((n = l) AND (last = 'M'))
            THEN m := m + 'B';
            { -mb is silent }
      'C' : BEGIN { -sce, i, y = silent }
              IF NOT ((last = 'S') AND (next IN FrontVSet))
              THEN BEGIN
                IF (next = 'I') AND (nnext = 'A')
                THEN m := m + 'X' { -cia- }
                ELSE IF (next IN FrontVSet)
                  THEN m := m + 'S' { -ce, i, y = 'S' }
                  ELSE IF (next = 'H') AND (last = 'S')
                    THEN m := m + 'K' { -sch- = 'K' }
                    ELSE IF (next = 'H')
                      THEN BEGIN
                        IF (n = 1) AND ((n+2) <= l)
                          AND NOT (nnext IN VowelSet)
                        THEN m := m + 'K'
                        ELSE m := m + 'X'
                        END
                      ELSE m := m + 'K'; { any other 'C' = 'K' }
                END { Else silent }
              END;
            { Case C }
      'D' : IF (next = 'G') AND (nnext IN FrontVSet)
            THEN m := m + 'J'
            ELSE m := m + 'T';
      'G' : BEGIN
              silent := (next = 'H') AND (nnext IN VowelSet);
              IF (n > 1) AND (((n+1) = l) OR ((next = 'N') AND
                (nnext = 'E') AND (p[n+3] = 'D') AND ((n+3) = l))
                { Terminal -gned }
                AND (last = 'I') AND (next = 'N'))
              THEN silent := TRUE;
              { if not start and near -end or -gned.) }
              IF (n > 1) AND (last = 'D') AND (next IN FrontVSet)
              THEN { -dge, i, y }
                silent := TRUE;
              IF NOT silent
              THEN IF (next IN FrontVSet)
                THEN m := m + 'J'
                ELSE m := m + 'K';
              END;
      'H' : IF NOT ((n = l) OR (last IN VarSonSet))
              AND (next IN VowelSet)
            THEN m := m + 'H';
            { else silent (vowel follows) }
      'F', 'J', 'L', 'M', 'N', 'R' : m := m + this;
      'K' : IF (last <> 'C')
            THEN m := m + 'K';
      'P' : IF (next = 'H')
            THEN BEGIN
              m := m + 'F';
              INC (n);
              END { Skip the 'H' }
            ELSE m := m + 'P';
      'Q' : m := m + 'K';
      'S' : IF (next = 'H')
              OR ((n > 1) AND (next = 'I') AND (nnext IN ['O', 'A']))
            THEN m := m + 'X'
            ELSE m := m + 'S';
      'T' : IF (n = 1) AND (next = 'H') AND (nnext = 'O')
            THEN m := m + 'T' { Initial Tho- }
            ELSE IF (n > 1) AND (next = 'I') AND (nnext IN ['O', 'A'])
              THEN m := m + 'X'
              ELSE IF (next = 'H')
                THEN m := m + '0'
                ELSE IF NOT ((next = 'C') AND (nnext = 'H'))
                  THEN m := m + 'T';
            { -tch = silent }
      'V' : m := m + 'F';
      'W', 'Y' : IF (next IN VowelSet)
            THEN m := m + this;
            { else silent }
      'X' : m := m + 'KS';
      'Z' : m := m + 'S';
      END; { Case }
      END; { IF new }
    INC (n);
    END; { While }
  Metaphone := m;
END; { Metaphone }

11.3.1.4 NYSIIS Algorithm

The New York State Identification and Intelligence System (NYSIIS) algorithm is more reliable and selective than Soundex, especially for grouped phonetic sounds. It does not perform well with ‘Y’ groups because ‘Y’ is not translated. NYSIIS yields an alphabetic string key that is filled or rounded to 10 characters.

(1) Translate first characters of name:

MAC => MCC

KN => NN

K => C

PH => FF

PF => FF

SCH => SSS

(2) Translate last characters of name:

EE => Y

IE => Y

DT, RT, RD, NT, ND => D

(3) The first character of key = first character of name.

(4) Translate remaining characters by the following rules, scanning one character at a time:

a. EV => AF, else A, E, I, O, U => A

b. Q => G, Z => S, M => N

c. KN => N, else K => C

d. SCH => SSS, PH => FF

e. H => If previous or next character is a consonant, use the previous character.

f. W => If previous character is a vowel, use the previous character.

Add the current character to the result if the current character is not equal to the last key character.

(5) If last character is S, remove it

(6) If last characters are AY, replace them with Y

(7) If last character is A, remove it
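The rules above can be sketched in Python as follows. This is one interpretation, not an official implementation: the order in which the rules of step (4) are tried, treating the end of the name as a non-vowel in rule (e), skipping anything outside A to Z, and truncating (but never padding) to 10 characters are all assumptions of this sketch, as are the names PREFIXES, SUFFIXES, translate_at, and nysiis.

```python
VOWELS = set("AEIOU")

# Steps 1 and 2; longer patterns are listed before shorter ones.
PREFIXES = [("MAC", "MCC"), ("SCH", "SSS"), ("KN", "NN"),
            ("PH", "FF"), ("PF", "FF"), ("K", "C")]
SUFFIXES = [("EE", "Y"), ("IE", "Y"), ("DT", "D"), ("RT", "D"),
            ("RD", "D"), ("NT", "D"), ("ND", "D")]


def translate_at(chars, i):
    """Return the step 4 replacement for the text starting at position i."""
    rest = "".join(chars[i:])
    cur, prev = chars[i], chars[i - 1]
    nxt = chars[i + 1] if i + 1 < len(chars) else ""
    if rest.startswith("EV"):
        return "AF"
    if cur in VOWELS:
        return "A"
    if cur in "QZM":
        return {"Q": "G", "Z": "S", "M": "N"}[cur]
    if rest.startswith("KN"):
        return "N"              # the K becomes N; the doubled N collapses later
    if cur == "K":
        return "C"
    if rest.startswith("SCH"):
        return "SSS"
    if rest.startswith("PH"):
        return "FF"
    if cur == "H" and (prev not in VOWELS or nxt not in VOWELS):
        return prev
    if cur == "W" and prev in VOWELS:
        return prev
    return cur


def nysiis(name: str, max_len: int = 10) -> str:
    word = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    if not word:
        return ""
    for old, new in PREFIXES:                  # step 1
        if word.startswith(old):
            word = new + word[len(old):]
            break
    for old, new in SUFFIXES:                  # step 2
        if word.endswith(old):
            word = word[:-len(old)] + new
            break
    chars = list(word)
    key = chars[0]                             # step 3
    for i in range(1, len(chars)):             # step 4
        repl = translate_at(chars, i)
        chars[i:i + len(repl)] = list(repl)    # same-length, in-place rewrite
        if chars[i] != key[-1]:
            key += chars[i]
    if len(key) > 1 and key.endswith("S"):     # step 5
        key = key[:-1]
    if key.endswith("AY"):                     # step 6
        key = key[:-2] + "Y"
    if len(key) > 1 and key.endswith("A"):     # step 7
        key = key[:-1]
    return key[:max_len]


for name in ["Knight", "Macintosh", "Carr", "Karr", "Selko", "Selco"]:
    print(name, nysiis(name))
```

Unlike Soundex, this scheme gives “Carr” and “Karr” the same key, because a leading K is translated before the first character of the key is fixed.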

The stated reliability of NYSIIS is 98.72% with a selectivity factor of 0.164% for a name inquiry. This was taken from Robert L. Taft, “Name Search Techniques”, New York State Identification and Intelligence System.

11.4 Cutter Tables

Another encoding scheme for names has been used for libraries for over 100 years. The catalog number of a book often needs to reduce an author’s name to a simple fixed-length code. While the results of a Cutter table look much like those of a Soundex, their goal is different. They attempt to preserve the original alphabetical order of the names in the encodings.

But the librarian cannot just attach the author’s name to the classification code. Names are not the same length, nor are they unique within their first letters. For example, “Smith, John A.” and “Smith, John B.” are not unique until the last letter.

What librarians have done about this problem is to use Cutter tables. These tables map authors’ full names into letter-and-digit codes. There are several versions of the Cutter tables. The older tables tended to use a mix of letters (both upper- and lowercase) followed by digits. The three-figure table uses a single letter followed by three digits. For example, using that table:

“Adams, J” becomes “A214”

“Adams, M” becomes “A215”

“Arnold” becomes “A752”

“Dana” becomes “D168”

“Sherman” becomes “S553”

“Scanlon” becomes “S283”

The distribution of these numbers is based on the actual distribution of names of authors in English-speaking countries. You simply scan down the table until you find the place where your name would fall and use that code.

Cutter tables have two important properties. They preserve the alphabetical ordering of the original name list, which means that you can do a rough sort on them. The second property is that each grouping tends to be of approximately the same size as the set of names gets larger. These properties can be handy for building indexes in a database.
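The “scan down the table” lookup is a sorted-range search, which is why the codes keep the alphabetical order of the names. The Python sketch below uses a tiny invented table; the entries in TABLE are hypothetical placeholders, NOT real Cutter table values, and cutter_code is a made-up name.

```python
import bisect

# Hypothetical excerpt: each row is (first name prefix covered, code).
# These values are invented for illustration and are not from a real Cutter table.
TABLE = [("A", "A100"), ("Ad", "A200"), ("Al", "A300"),
         ("An", "A500"), ("Ar", "A700"), ("B", "B100")]
KEYS = [prefix.upper() for prefix, _ in TABLE]


def cutter_code(name: str) -> str:
    """Find the last table row whose prefix sorts at or before the name."""
    pos = bisect.bisect_right(KEYS, name.upper()) - 1
    if pos < 0:
        raise ValueError("name sorts before the first table entry")
    return TABLE[pos][1]


names = sorted(["Arnold", "Adams", "Baker", "Allen", "Anderson"])
codes = [cutter_code(n) for n in names]
print(list(zip(names, codes)))
print(codes == sorted(codes))   # the codes come out in the same order as the names
```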

If you would like copies of the Cutter tables, you can find some of them on the Internet. Princeton University Library has posted their rules for names, locations, regions, and other things (http://infoshare1.princeton.edu/katmandu/class/cutter.html).

You can also get hardcopies from this publisher.

Hargrave House

7312 Firethorn

Littleton, CO 80125

Website = http://www.cuttertables.com
