Chapter 11. Canonical Representation Issues

If I had the luxury of writing just one sentence for this chapter, it would simply be, "Do not make any security decision based on the name of a resource, especially a filename." Why? If you don’t know, I suggest you reread the previous chapter. As Gertrude Stein once said, "A rose is a rose is a rose." Or is it? What if the word rose was determined by an untrusted user? Is a ROSE the same as a roze or a ro$e or a r0se or even a r%6fse? Are they all the same thing? The answer is both yes and no. Yes, they are all references to a rose, but syntactically they are different, which can lead to security issues in your applications. By the way, %6f is the hexadecimal equivalent of the ASCII value for the letter o.

Why can these different "roses" cause security problems? In short, if your application makes security decisions based on the name of a resource, such as a file provided by an untrusted source, chances are good that the application will make a poor decision because often more than one valid way to represent the object name exists. All canonicalization bugs lead to spoofing threats, and in some instances the spoofing threats lead to information disclosure and elevation of privilege threats.

In this chapter, I’ll discuss the meaning of canonical, and in the interest of learning from the industry’s past collective mistakes, I’ll discuss some filename canonicalization bugs and Web-specific issues. Finally, I’ll show examples of mitigating canonicalization bugs.

What Does Canonical Mean, and Why Is It a Problem?

I had no idea what canonical meant the first time I heard the term. The only canon I had heard was Johann Pachelbel’s (1653–1706) glorious Canon in D Major. The entry for canonical in Random House Webster’s College Dictionary (Random House, 2000) reads, "Canonical: in its simplest or standard form." Hence, the canonical representation of something is the standard, most direct, and least ambiguous way to represent it. Canonicalization is the process by which various equivalent forms of a name are resolved to a single, standard name, the canonical name. For example, on a given machine, the names c:\dir\test.dat, test.dat, and ..\..\test.dat might all refer to the same file. And canonicalization might lead to the canonical representation of these names being c:\dir\test.dat. Security bugs related to canonicalization occur when an application makes wrong decisions based on a noncanonical representation of a name.

Canonical Filename Issues

I know you know this, but let me make sure we’re on the same page. Many applications make security decisions based on filenames, but the problem is that a file can have multiple names. Let’s look at some past mistakes to see what I mean.

Bypassing Napster Name Filtering

Bypassing the Napster filters is my favorite canonicalization bug because it’s so nontechnical. Unless you were living under a rock in early 2001, you’ll know that Napster was a music-swapping service that was taken to court by the Recording Industry Association of America (RIAA), which viewed the service as piracy. A U.S. federal judge ordered Napster to block access to certain songs, which Napster did. However, this song-blocking was based on the name of the song, and it wasn’t long before people realized how to bypass the Napster filters: simply give the song a name that resembles the song title but that is not picked up by the filters. Using the music of Siouxsie and the Banshees as an example, I might rename "Candyman" as "AndymanCay" (the pig latin version), "92 degrees" as "92 degree$," and "Deepest Chill" as "Deepest Chi11." This is a disclosure vulnerability because it gives access to files to users who should not have access. In this case, Napster’s lack of a secure canonicalization method for filenames made it difficult to enforce a court-mandated security policy.

You can read more about this issue at http://news.cnet.com/news/0-1005-200-5042145.html.

Vulnerability in Apple Mac OS X and Apache

The version of the Apache Web server that shipped with the first release of Apple’s Mac OS X operating system contains a security flaw when Apple’s Hierarchical File System Plus (HFS+) is used. HFS+ is a case-insensitive file system, and this foils Apache’s directory-protection mechanisms, which use text-based configuration files to determine which data to protect and how to protect it.

For example, the administrator might decide to protect a directory named scripts with the following configuration file to prevent scripts from being accessed by anyone:

<Location /scripts>
    order deny,allow
    deny from all
</Location>

A normal user attempting to access http://www.northwindtraders.com/scripts/index.html will be disallowed access. However, an attacker can enter http://www.northwindtraders.com/SCRIPTS/index.html, and access to Index.html will be allowed.

The vulnerability exists because HFS+ is case-insensitive, but the version of Apache shipped with Mac OS X is case-sensitive. So, to Apache, SCRIPTS is not the same as scripts, and the configuration script has no effect. But to HFS+, SCRIPTS is the same as scripts, so the "protected" index.html file is fetched and sent to the attacker.

You can read more about this security flaw at http://www.securityfocus.com/archive/1/190036.

DOS Device Names Vulnerability

As you might know, some filenames in MS-DOS spilled over into Windows for backward-compatibility reasons. These items are not really files; rather, they are devices. Examples include the default serial port (aux) and printer (lpt1 and prn). In this vulnerability, the attacker forces Windows 95 and Windows 98 to access the device. When Windows attempts to interpret the device name as a file resource, it performs an illegal resource access that usually results in a crash.

You can learn more about this vulnerability at http://www.microsoft.com/technet/security/bulletin/MS00-017.asp.

Sun Microsystems StarOffice /tmp Directory Symbolic-Link Vulnerability

I added this vulnerability because symbolic-link vulnerabilities are extremely common in UNIX and Linux. A symbolic link (symlink) is a file that only points to another file; therefore, it can be considered another name for a file. UNIX also has the hard-link file type, which is a file that is semantically equivalent to the one it points to. Hard links share access rights with the file they point to, whereas symlinks do not share those rights.

Note

You can create hard links in Windows 2000 by using the CreateHardLink function.

For example, /tmp/frodo, a symlink in the temporary directory, might point to the UNIX password file /etc/passwd or to some other sensitive file.

On startup, Sun’s StarOffice creates an object named /tmp/soffice.tmp. This object can be used by anyone for nearly any purpose. In UNIX parlance, the access mask is 0777, which is just as bad as Everyone (Full Control). An attacker can create a symlink from /tmp/soffice.tmp to a user’s file. When that user then runs StarOffice, StarOffice blindly changes the permission settings on that file (because setting permissions on a symlink sets the permissions of the target, if the process has permission to make that change). Once this is done, the attacker can read the file.

If the attacker linked /tmp/soffice.tmp to /etc/passwd and someone ran StarOffice as the UNIX administrator, the permissions on /etc/passwd would get changed. Learn more about this bug at http://www.securityfocus.com/bid/1922.

Almost all of the canonicalization bugs I’ve discussed occur when user input is passed between multiple components in a system. If the first component to receive user input does not fully canonicalize the input before passing the data to the second component, the system is at risk.

Important

All canonicalization issues exist because an application, having determined that a request for a resource did not match a known pattern, defaulted to an insecure mode.

Important

If you make security decisions based on the name of a file, you will get it wrong!

Common Windows Canonical Filename Mistakes

Windows can represent filenames in many ways, due in part to extensibility capabilities and backward compatibility. If you accept a filename and use it for any security decision, it is crucial that you read this section.

8.3 Representation of Long Filenames

As you are no doubt aware, the legacy FAT file system, which first appeared in MS-DOS, limits filenames to at most eight characters plus an extension of at most three characters. File systems such as FAT32 and NTFS allow for long filenames; for example, an NTFS filename can be up to 255 Unicode characters in length. For backward-compatibility purposes, NTFS and FAT32 by default autogenerate an 8.3 format filename that allows an application based on MS-DOS or 16-bit Windows to access the same file.

Note

The format of the auto-generated 8.3 filename is the first six characters of the long filename, followed by a tilde (~) and an incrementing digit, followed by the first three characters of the extension. For example, My Secret File.2001.Aug.doc might become MYSECR~1.DOC. All illegal characters and spaces are removed from the filename first.
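The naming rule in the note can be sketched in a few lines of portable C. This is a simplified illustration, not the file system's actual algorithm: the helper name make_short_name and the seq parameter are hypothetical, only spaces and dots are treated as characters to remove, and collisions between aliases are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: builds an 8.3-style alias as described in the note.
   Simplified on purpose; the real file system also resolves collisions and
   removes a longer list of illegal characters. */
static void make_short_name(const char *longname, int seq, char out[13])
{
    assert(longname != NULL && out != NULL && seq >= 1 && seq <= 9);

    const char *last_dot = strrchr(longname, '.');
    size_t base_len = last_dot ? (size_t)(last_dot - longname)
                               : strlen(longname);
    size_t n = 0;

    /* First six usable characters of the base name, uppercased.
       Spaces and dots are dropped before counting. */
    for (size_t i = 0; i < base_len && n < 6; i++) {
        unsigned char c = (unsigned char)longname[i];
        if (c == ' ' || c == '.')
            continue;
        out[n++] = (char)toupper(c);
    }

    /* Tilde plus the sequence digit. */
    out[n++] = '~';
    out[n++] = (char)('0' + seq);

    /* First three characters of the extension, if there is one. */
    if (last_dot && last_dot[1] != '\0') {
        out[n++] = '.';
        for (size_t i = 1; i <= 3 && last_dot[i] != '\0'; i++)
            out[n++] = (char)toupper((unsigned char)last_dot[i]);
    }
    out[n] = '\0';
}
```

Calling make_short_name("My Secret File.2001.Aug.doc", 1, out) leaves MYSECR~1.DOC in out, which matches the example in the note. The point is not to predict short names in production code but to see how mechanically a second name for the same file comes into existence.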

An attacker might slip through your code if your code makes checks against the long filename and the attacker uses the short filename instead. For example, your application might deny access to Fiscal02Budget.xls to users on the 172.30.x.x subnet, but a user on the subnet using the file’s short filename would circumvent your checks because the file system accesses the same file, just through its 8.3 filename. Hence, Fiscal02Budget.xls might be the same file as Fiscal~1.xls.

The following pseudocode highlights the vulnerability:

String SensitiveFiles[] = {"Fiscal02Budget.xls", "ProductPlans.Doc"};
IPAddress RestrictedIP[] = {172.30.0.0, 192.168.200.0};

BOOL AllowAccessToFile(FileName, IPAddress) {
    If (FileName In SensitiveFiles[] && IPAddress In RestrictedIP[])
        Return FALSE;
    Else
        Return TRUE;
}

BOOL fAllow = FALSE;
//This will deny access.
fAllow = AllowAccessToFile("Fiscal02Budget.xls", "172.30.43.12");

//This will allow access. Ouch!
fAllow = AllowAccessToFile("FISCAL~1.XLS", "172.30.43.12");

Note

Conventional wisdom would dictate that secure systems do not include MS-DOS or 16-bit Windows applications, and hence 8.3 filename support should be disabled. More on this later.

An interesting potential side effect of the 8.3 filename generation is that some processes can be attacked if and only if the requested file has no spaces in its name. Guess what? 8.3 filenames do not have spaces! I’ll leave the last part of the attack equation to you!

NTFS Alternate Data Streams

I will discuss this canonicalization mistake in detail later in this chapter, but for the moment all you need to know is this: be wary if your code makes decisions based on the filename extension. For example, IIS looked for an .asp extension and routed the request for the file to Asp.dll. When the attacker requested a file with the .asp::$DATA extension, IIS failed to see that the request was a request for the default NTFS data stream and the ASP source code was returned to the user.

Note

You can detect streams in your files by using tools such as Streams.exe from Sysinternals (http://www.sysinternals.com), Crucial ADS from Crucial Security (http://www.crucialsecurity.com), or Security Expressions from Pedestal Software (http://www.pedestalsoftware.com).

In addition, if your application uses alternate data streams, you need to make sure that the code correctly parses the filename to read or write to the correct stream. More on this later. As an aside, streams do not have a separate access control list (ACL); they use the same ACL as the file in question.

Trailing Characters

I’ve seen a couple of vulnerabilities in which a trailing dot (.) or backslash (\) appended to a filename caused the application parsing the filename to get the name wrong. Adding a dot is very much a Win32 issue because the file system determines that the trailing dot should not be there and strips it from the filename before accessing the file. The trailing backslash is usually a Web issue, which I’ll discuss in Chapter 17. Take a look at the following code to see what I mean by the trailing dot:

#include <windows.h>
#include <strsafe.h>
char b[20];
StringcbCopy(b, sizeof(b), "Hello!");
HANDLE h = CreateFile("c:\\somefile.txt",
                      GENERIC_WRITE,
                      0, NULL,
                      CREATE_ALWAYS,
                      FILE_ATTRIBUTE_NORMAL,
                      NULL);
if (h != INVALID_HANDLE_VALUE) {
    DWORD dwNum = 0;
    WriteFile(h, b, lstrlen(b), &dwNum, NULL);
    CloseHandle(h);
}

h = CreateFile("c:\\somefile.txt.", //Trailing dot
               GENERIC_READ,
               0, NULL,
               OPEN_EXISTING,
               FILE_ATTRIBUTE_NORMAL,
               NULL);
if (h != INVALID_HANDLE_VALUE) {
    char b[20];
    DWORD dwNum =0;
    ReadFile(h, b, sizeof b, &dwNum, NULL);
    CloseHandle(h);
}

You can also find this example code in the companion content in the folder Secureco2\Chapter11\TrailingDot. See the difference in the filenames? The second call to access somefile.txt has a trailing dot, yet somefile.txt is opened and read correctly when you run this code. This is because the file system removes the invalid character for you! As you can see, somefile.txt. is the same as somefile.txt, regardless of the trailing dot.
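If your code compares names at all, it needs to apply the same stripping the file system applies before it compares. The following is a minimal sketch in portable C; the helper name strip_trailing_dots is hypothetical, and the handling of trailing spaces is an addition beyond the dot case shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes trailing dots (and trailing spaces, which
   Win32 treats the same way) in place, so that "somefile.txt." and
   "somefile.txt" compare equal afterward. */
static void strip_trailing_dots(char *name)
{
    assert(name != NULL);
    size_t len = strlen(name);
    while (len > 0 && (name[len - 1] == '.' || name[len - 1] == ' '))
        name[--len] = '\0';
}
```

Note that this reduces the special names . and .. to an empty string, so a caller would need to reject or handle those separately before relying on the result.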

\\?\ Format

Normally, a filename is limited to MAX_PATH (260) ANSI characters. The Unicode versions of numerous file-manipulation functions allow you to extend this to 32,000 Unicode characters by prepending \\?\ to the filename. The \\?\ tells the function to turn off path parsing. However, each component in the path cannot be more than MAX_PATH characters long. So, in summary, \\?\c:\temp\myfile.txt is the same as c:\temp\myfile.txt.

Note

No known exploit for the \\?\ filename format exists; I’ve included the format for completeness.

Directory Traversal and Using Parent Paths (..)

The vulnerabilities in this section are extremely common in Web and FTP servers, but they’re potential problems in any system. The first vulnerability lies in allowing attackers to walk out of your tightly controlled directory structure and wander around the entire hard disk. The second issue relates to two or more names for a file.

Walking out of the current directory

Let’s say your application contains data files in c:\datafiles. In theory, users should not be able to access any other files from anywhere else in the system. The fun starts when attackers attempt to access ..\boot.ini to access the boot configuration file in the root of the boot drive or, better yet, ..\winnt\repair\sam to get a copy of the local Security Account Manager (SAM) database file, which contains the usernames and password hashes for all the local user accounts. (In Windows 2000 and later, domain accounts are stored in Active Directory, not in the SAM.) Now the attacker can run a password-cracking tool such as L0phtCrack (available at http://www.atstake.com) to determine the passwords by brute-force means. This is why strong passwords are crucial!

Note

In Windows 2000 and later, the SAM file is encrypted using SysKey by default, which makes this attack somewhat more complex to achieve. Read Knowledge Base article Q143475, "Windows NT System Key Permits Strong Encryption of the SAM" at http://support.microsoft.com/support/kb/articles/Q143/4/75.asp for more information regarding SysKey.

Will the real filename please stand up?

If we assume a directory structure of c:\dir\foo\files\secret, the file c:\dir\foo\myfile.txt is the same as c:\dir\foo\files\secret\..\..\myfile.txt, as is c:\dir\foo\files\..\myfile.txt, as is c:\dir\..\dir\foo\files\..\myfile.txt! Oh my!
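To see why all of these names are the same file, it helps to collapse the parent-path segments by hand. The sketch below does so lexically in portable C. The helper name collapse_path is hypothetical, it assumes an absolute path with backslash separators and an optional drive letter, and it never consults the file system, so it knows nothing about short names, links, or forward slashes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: lexically collapses "." and ".." segments in a
   backslash-separated absolute path such as c:\dir\..\dir\file.txt.
   Returns 0 on success, or -1 if the path climbs above the root or the
   result does not fit in out. */
static int collapse_path(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;       /* bytes written to out so far */
    size_t root;        /* length of the prefix that ".." may not remove */

    /* Copy an optional drive prefix such as "c:". */
    if (in[0] != '\0' && in[1] == ':') {
        if (outsz < 3) return -1;
        out[n++] = in[0];
        out[n++] = ':';
        in += 2;
    }
    root = n;

    while (*in != '\0') {
        while (*in == '\\') in++;           /* skip separators */
        size_t len = strcspn(in, "\\");     /* length of the next segment */
        if (len == 0) break;

        if (len == 1 && in[0] == '.') {
            /* "." is the current directory: nothing to emit. */
        } else if (len == 2 && in[0] == '.' && in[1] == '.') {
            if (n == root) return -1;       /* would escape the root */
            while (n > root && out[n - 1] != '\\') n--;  /* drop segment */
            n--;                            /* drop its separator */
        } else {
            if (n + 1 + len + 1 > outsz) return -1;
            out[n++] = '\\';
            memcpy(out + n, in, len);
            n += len;
        }
        in += len;
    }
    if (n + 1 > outsz) return -1;
    out[n] = '\0';
    return 0;
}
```

Each of the three long names above collapses to c:\dir\foo\myfile.txt. A collapsed path can still point outside the directory you intended to serve, so a caller would also need to confirm that the result begins with the allowed base directory.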

Absolute vs. Relative Filenames

If the user gives you a filename to open with no directory name, where do you look for the file? In the current directory? In a folder specified in the PATH environment variable? Your application might not know and might load the wrong file. For example, if a user requests that your application open File.exe, does your application load File.exe from the current directory or from a folder specified in PATH?

Case-Insensitive Filenames

There have been no vulnerabilities that I know of in Windows concerning the case of a filename. The NTFS file system is case-preserving but case-insensitive. Opening MyFile.txt is the same as opening myfile.txt. The only time this is not the case is when your application is running in the Portable Operating System Interface for UNIX (POSIX) subsystem. However, if your application does perform case-sensitive filename comparisons, you might be vulnerable in the same way as the Apple Mac OS X and Apache Web server, as described earlier in this chapter.

UNC Shares

Files can be accessed through Universal Naming Convention (UNC) shares. A UNC share is used to access file and printer resources in Windows and is treated as a file system by the operating system. Using UNC, you can map a new disk drive letter that points to a local server or a remote server. For example, let’s assume you have a computer named BlakeLaptop, which has a share named Files that shares documents held in the c:\My Documents\Files directory. You can map z: onto this share by using net use z: \\BlakeLaptop\Files, and then z:\myfile.txt and c:\My Documents\Files\myfile.txt will point to the same file.

You can access a file directly by using its UNC name rather than by mapping to a drive first. For example, \\BlakeLaptop\Files\myfile.txt is the same as z:\myfile.txt. Also, you can combine UNC with a variation of the \\?\ format; for example, \\?\UNC\BlakeLaptop\Files is the same as \\BlakeLaptop\Files.

Be aware that Windows XP includes a Web-based Distributed Authoring and Versioning (WebDAV) redirector, which allows the user to map a Web-based virtual directory to a local drive by using the Add Network Place Wizard. This means that redirected network drives can reside on a Web server, not just on a file server.

When Is a File Not a File? Mailslots and Named Pipes

APIs such as CreateFile can open not just files but named pipes and mailslots too. A named pipe is a named, one- or two-way communication channel for communication between the pipe server and one or more pipe clients. A mailslot is a "fire-and-forget" one-way interprocess communication protocol. Once a client connects to a pipe or mailslot server, assuming the access checks succeed, the handle returned by the operating system is treated like a file handle. The syntax for a pipe is \\servername\pipe\pipename, and a mailslot name is of the form \\servername\mailslot\mailslotname, where servername could be a dot representing the local machine.

When Is a File Not a File? Device Names and Reserved Names

Many operating systems, including Windows, have support for naming devices and access to the devices from the console. For example, COM1 is the first serial port, AUX is the default serial port, LPT2 is the second printer port, and so on. The following reserved words cannot be used as the name of a file: CON, PRN, AUX, CLOCK$, NUL, COM1–COM9, and LPT1–LPT9. Also, reserved words followed by an extension (for example, NUL.txt) are valid device names. But wait, there’s more: each of these devices "exists" in every directory. For example, c:\Program Files\COM1 is the first serial port, as is d:\NorthWindTraders\COM1.
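The matching rule just described (any directory, any case, any extension) is easy to underestimate, so here is a rough sketch of it in portable C. The helper names equals_nocase and is_reserved_device_name are hypothetical, and the list is limited to the reserved words given above. Treat it as an illustration of how many spellings reach the same device, not as a substitute for asking the operating system what kind of object a handle refers to.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a against b,
   requiring b to be exactly n characters long. */
static int equals_nocase(const char *a, size_t n, const char *b)
{
    if (strlen(b) != n) return 0;
    for (size_t i = 0; i < n; i++)
        if (toupper((unsigned char)a[i]) != toupper((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Hypothetical helper: returns 1 if the last path component, ignoring any
   extension, is one of the reserved names listed in the text. */
static int is_reserved_device_name(const char *path)
{
    static const char *const fixed[] = { "CON", "PRN", "AUX", "CLOCK$", "NUL" };
    assert(path != NULL);

    /* Find the last component after any backslash, slash, or colon. */
    const char *name = path;
    for (const char *p = path; *p != '\0'; p++)
        if (*p == '\\' || *p == '/' || *p == ':')
            name = p + 1;

    /* Ignore the extension, so NUL.txt is examined as NUL. */
    size_t len = strcspn(name, ".");

    for (size_t i = 0; i < sizeof fixed / sizeof fixed[0]; i++)
        if (equals_nocase(name, len, fixed[i]))
            return 1;

    /* COM1 through COM9 and LPT1 through LPT9. */
    if (len == 4 && name[3] >= '1' && name[3] <= '9')
        return equals_nocase(name, 3, "COM") || equals_nocase(name, 3, "LPT");

    return 0;
}
```

With this sketch, c:\Program Files\COM1, NUL.txt, and documents\com1 are all reported as device names, while c:\temp\myfile.txt is not.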

If a user passes a filename to you and you blindly open the file, you will have problems if the file is a device and not a real file. For example, imagine you have one worker thread that accepts a user request containing a filename. Now an attacker requests documents\com1, and your application opens the "file" for read access. The thread is blocked until the serial port times out! Luckily, there’s a way to determine what the file type is, and I’ll cover that shortly.

Device Name Issues on Other Operating Systems

Canonicalization issues are not, of course, unique to Windows. For example, on Linux it is possible to lock certain applications by attempting to open devices rather than files. Examples include /dev/mouse, /dev/console, /dev/tty0, /dev/zero, and many others.

A test using Mandrake Linux 7.1 and Netscape 4.73 showed that attempting to open file:///dev/mouse locked the mouse and necessitated a reboot of the computer to get control of the mouse. Opening file:///dev/zero froze the browser. These vulnerabilities are quite serious because an attacker can create a Web site that has image tags such as <IMG SRC=file:///dev/mouse>, which would lock the user’s mouse.

You should become familiar with device names if you plan to build applications on many operating systems.

As you can see, there are many ways to name files, and if your code makes security decisions based on the name of a file, the chances are slim you’ll get it right. Now let’s move on to the other main realm of naming things—the Web.

Canonical Web-Based Issues

Unfortunately, many applications make security decisions based on the name of a URL, or a component of a URL. Just as with file-based security decisions, making URL-based security decisions raises several concerns. Let’s look at a few.

Bypassing AOL Parental Controls

America Online (AOL) 5.0 added controls so that parents could prevent their children from accessing certain Web sites. When a user typed a URL in the browser, the software checked the Web site name against a list of restricted sites, and if it found the site on the list, access to that site was blocked. Here’s the flaw: if the user added a period to the end of the host name, the software allowed the user to access the site. My guess is that the vulnerability existed because the software did not take into consideration the trailing dot when performing a string compare against the list of disallowed Web sites, and the software stripped out invalid characters from the URL after the check had been made.

The bug is now rectified. More information on this vulnerability can be found at http://www.slashdot.org/features/00/07/15/0327239.shtml.

Bypassing eEye’s Security Checks

The irony of this example is that the vulnerabilities were found in a security product, SecureIIS, designed to protect Microsoft Internet Information Services (IIS) from attack. Marketing material from eEye (http://www.eeye.com) describes SecureIIS like so:

SecureIIS protects Microsoft Internet Information Services Web servers from known and unknown attacks. SecureIIS wraps around IIS and works within it, verifying and analyzing incoming and outgoing Web server data for any possible security breaches.

Two canonicalization bugs were found in the product. The first related to how SecureIIS handled specific keywords. For example, say you decided that a user (or attacker) should not have access to a specific area of the Web site if he entered a URL query string containing action=delete. An attacker could escape any character in the query string to bypass the SecureIIS settings. Rather than entering action=delete, the attacker could enter action=%64elete and obtain the desired access. %64 is the hexadecimal representation of the letter d.

The other bug related to how SecureIIS checked for characters that were used to traverse out of a Web directory to other directories. For example, as a Web site developer or administrator, you wouldn’t want users accessing a URL like http://www.northwindtraders.com/scripts/process.asp?file=../../../winnt/repair/sam, which returns the backup SAM database to the user. The traversal characters are the two dots (..) and the slash (/), which SecureIIS looks for. However, an attacker can bypass the check by typing http://www.northwindtraders.com/scripts/process.asp?file=%2e%2e/%2e%2e/%2e%2e/winnt/repair/sam. As you’ve probably worked out, %2e is the escaped representation of the dot in hexadecimal!

You can read more about this vulnerability at http://www.securityfocus.com/bid/2742.

Zones and the Internet Explorer 4 "Dotless-IP Address" Bug

Security zones, introduced in Internet Explorer 4 (exported by UrlMon.dll), are an easy way to administer security because they allow you to gather security settings into easy-to-manage groups. These settings are enforced as the user browses Web sites. Each Web page is handled according to specific security restrictions depending on the page’s host Web site, thereby tying security restrictions to Web page origin.

Internet Explorer 4 uses a simple heuristic to determine whether a Web site is located in the more trusted Local Intranet Zone or in the less trusted Internet Zone. If a Web site name contains one or more dots, such as http://www.microsoft.com, the site must be in the Internet Zone unless the user has explicitly placed the Web site in some other zone. If the site has no dots in its name, such as http://northwindtraders, it must be in the Local Intranet Zone because only a NetBIOS name, which has no dots, can be accessed from within the local intranet. Makes sense, right? Not quite!

This mechanism has a wrinkle: if the user enters the IP address of a remote computer, Internet Explorer will apply the security settings of the more restrictive Internet Zone, even if the site is on the local intranet. This is good because the browser will use more stringent security checks. However, an IP address can be represented as a dotless-IP address, which can be calculated by taking a dotted-IP address—that is, an address in the form a.b.c.d—and applying the following formula:

Dotless-IP = (a × 16777216) + (b × 65536) + (c × 256) + d

For example, 192.168.197.100 is the same as 3232286052. If you enter http://192.168.197.100 in Internet Explorer 4, the browser will invoke security policies for the Internet Zone, which is correct. And if you enter http://3232286052 in the unpatched Internet Explorer 4, the browser will notice no dots in the name, place the site in the Local Intranet Zone, and apply the less restrictive security policy. This might lead to a malicious Internet-based Web site executing code in the less secure environment.
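The arithmetic is simple enough to check in a few lines of portable C. The function name dotless_ip is hypothetical; the body is just the formula given earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Applies the dotless-IP formula to a dotted address a.b.c.d.
   Each part must be in the range 0 through 255. */
static uint32_t dotless_ip(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a <= 255 && b <= 255 && c <= 255 && d <= 255);
    return (uint32_t)a * 16777216u + (uint32_t)b * 65536u
         + (uint32_t)c * 256u + (uint32_t)d;
}
```

dotless_ip(192, 168, 197, 100) returns 3232286052, the value used in the example. The takeaway for a zone check is that the absence of dots in a host name says nothing about whether the name is an IP address.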

More information is available at http://www.microsoft.com/technet/security/bulletin/MS98-016.asp.

Internet Information Server 4.0 ::$DATA Vulnerability

I remember the IIS ::$DATA vulnerability well because I was on the IIS team at the time the bug was found. Allow me to go over a little background material. The NTFS file system built into Microsoft Windows NT and later is designed to be a superset of many other file systems, including the Apple Macintosh HFS file system, which supports two sets of data, or forks, in a disk-based file. These forks are called the data fork and the resource fork. (You can read more about this at http://support.microsoft.com/default.aspx?scid=kb;en-us;Q147438.) To help support these files, NTFS provides multiple named data streams. For example, you could create a new stream named test in a file named Bar.txt (that is, bar.txt:test) by using the following code:

// The backslashes are doubled; otherwise "\t" and "\b" would be read
// as tab and backspace escapes.
const char *szFilename = "c:\\temp\\bar.txt:test";
HANDLE h = CreateFile(szFilename,
                      GENERIC_WRITE,
                      0, NULL,
                      CREATE_ALWAYS,
                      FILE_ATTRIBUTE_NORMAL,
                      NULL);
if (h == INVALID_HANDLE_VALUE) {
    printf("Error CreateFile() %d", GetLastError());
    return;
}

// Write a short message into the named stream.
const char *bBuff = "Hello, stream world!";
DWORD dwWritten = 0;
if (WriteFile(h, bBuff, lstrlen(bBuff), &dwWritten, NULL)) {
    printf("Cool!");
} else {
    printf("Error WriteFile() %d", GetLastError());
}
CloseHandle(h);

This example code is available in the companion content in the folder Secureco2\Chapter11\NTFSStream. You can view the contents of the file from the command line by using the following syntax:

more < bar.txt:test

You can also use the echo command to insert a stream into a file and then view the contents of the file:

echo Hello, Stream World! > bar.txt:test
more < bar.txt:test

Doing so displays the contents of the stream on the console. The "normal" data in a file is held in a stream that has no name, and it has an internal NTFS data type of $DATA. With this in mind, you can also access the default data stream in an NTFS file by using the following command-line syntax:

more < boot.ini::$DATA

Figure 11-1 outlines what this file syntax means.

Figure 11-1. The NTFS file system stream syntax.

An NTFS stream name follows the same naming rules as an NTFS filename, including all alphanumeric characters and a limited set of punctuation characters. For example, two files, john3 and readme, with streams named 16 and now, respectively, would become john3:16 and readme:now. Any combination of valid filename characters is allowed.

Back to the vulnerability. When IIS receives a request from a user, the server looks at the file extension and determines what it should do with the request. For example, if the file ends in .asp, the request must be for an Active Server Pages (ASP) file, so the server routes the request to Asp.dll for processing. If IIS does not recognize the extension, the request is sent directly to Windows for processing so that the contents of the file can be shown to the user. This functionality is handled by the static-file handler. Think of this as a big default switch in a switch statement. So if the user requests Data.txt and no special extension handler, called a script map, associated with the .txt file extension is found, the source code of the text file is sent to the user.

The vulnerability lies in the attacker requesting a file such as Default.asp::$DATA. When IIS evaluates the extension, it does not recognize .asp::$DATA as a file extension and passes the file to the operating system for processing. NTFS determines that the user requested the default data stream in the file and returns the contents of Default.asp, not the processed result, to the attacker.
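One way to picture the missing check is a test for stream syntax in the requested name before any extension lookup happens. The following portable C sketch is an assumption about what such a test could look like, not a description of how IIS was fixed; the helper name has_stream_syntax is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical check: returns 1 if the final path component contains a
   colon, which is how NTFS stream syntax (name:stream:$TYPE) appears in
   a filename. A drive letter such as c: is not flagged when a path
   separator follows it, because only the last component is examined. */
static int has_stream_syntax(const char *path)
{
    assert(path != NULL);
    const char *name = path;
    for (const char *p = path; *p != '\0'; p++)
        if (*p == '\\' || *p == '/')
            name = p + 1;
    return strchr(name, ':') != NULL;
}
```

A request for Default.asp::$DATA is flagged, while /scripts/Default.asp is not. Rejecting flagged names outright is the conservative choice, because the extension that follows a stream specifier tells you nothing reliable about the file.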

You can find out more about this bug at http://www.microsoft.com/technet/security/bulletin/MS98-003.asp.

When Is a Line Really Two Lines?

A more recent class of vulnerability involves processing lines that include carriage return or carriage return/line feed characters. Imagine your application logs client requests, and as an example, a client requests file.txt. Your server application logs the IP address of the client, his name, the date and time, and the requested resource in the following format:

172.23.11.19 Mike 2002-09-03 13:02:43 file.txt

Imagine that an attacker decides to access a file named file.txt\r\n127.0.0.1\tCheryl\t2002-09-03\t13:03:00\tsecretfile.txt, which results in this log entry:

172.23.11.19   Mike          2002-09-03    13:02:43    file.txt
127.0.0.1      Cheryl        2002-09-03    13:03:00    secretfile.txt

Does this mean that Cheryl accessed a sensitive file by logging on the server (127.0.0.1)? No, it does not. The attacker forced a new entry in the log file by using a carriage return and line feed character in the requested resource! You can read more about this vulnerability at http://online.securityfocus.com/archive/82/271498/2002-05-09/2002-05-15/2.
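A common way to blunt this attack is to escape control characters in every untrusted field before it is written to the log. The sketch below, in portable C, is one possible approach under that assumption; the helper name escape_log_field and the choice of escape sequences are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies src into dst, replacing carriage return,
   line feed, and tab with the visible two-character sequences \r, \n,
   and \t, and doubling backslashes so the escaping stays unambiguous.
   Returns 0 on success, or -1 if dst is too small. */
static int escape_log_field(const char *src, char *dst, size_t dstsz)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    for (; *src != '\0'; src++) {
        char esc = 0;
        switch (*src) {
        case '\r': esc = 'r';  break;
        case '\n': esc = 'n';  break;
        case '\t': esc = 't';  break;
        case '\\': esc = '\\'; break;
        default:   break;
        }
        if (esc != 0) {
            if (n + 2 >= dstsz) return -1;   /* keep room for the NUL */
            dst[n++] = '\\';
            dst[n++] = esc;
        } else {
            if (n + 1 >= dstsz) return -1;
            dst[n++] = *src;
        }
    }
    if (n >= dstsz) return -1;
    dst[n] = '\0';
    return 0;
}
```

With the requested resource passed through this function, the attacker's embedded line break is written as visible text on a single line, and the forged entry for Cheryl never becomes a line of its own.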

Yet Another Web Issue—Escaping

What makes Web-based canonicalization issues so prevalent and hard to defend against is the number of ways you can represent any character. For example, any character can be represented in a URL or a Web page by using one or more of the following mechanisms:

The "normal" 7-bit or 8-bit character representation, also called US-ASCII
Hexadecimal escape codes
UTF-8 variable-width encoding
UCS-2 Unicode encoding
Double encoding
HTML escape codes (Web pages, not URLs)

7-Bit and 8-Bit ASCII

I trust you understand the 7-bit and 8-bit ASCII representations, which have been used in computer systems for many years, so I won’t cover them here.

Hexadecimal Escape Codes

Hex escapes are a way to represent a possibly nonprintable character by using its hexadecimal equivalent. For example, the space character is %20, and the pounds sterling character (£) is %A3. You can use this mapping in a URL such as http://www.northwindtraders.com/my%20document.doc, which will open my document.doc on the Northwind Traders Web site; http://www.northwindtraders.com/my%20document%2Edoc will do likewise.

I have already mentioned a canonicalization bug in eEye’s SecureIIS tool. The tool looked for certain words in the client request and rejected the request if any of the words were found. However, an attacker could hex escape any of the characters in the request and the tool would fail to reject the requests, essentially bypassing the security mechanisms.
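The lesson is that any check must run on the decoded form of the request, not on the raw bytes. Here is a minimal sketch of a single-pass hex-escape decoder; the function names hexValue and percentDecode are illustrative assumptions and are not taken from any product mentioned here.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Returns the numeric value of a hex digit, or -1 if ch is not one.
int hexValue(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Decodes each %XX escape exactly once. A '%' that is not followed by
// two hex digits makes the whole input invalid (std::nullopt).
std::optional<std::string> percentDecode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '%') {
            out += in[i];
            continue;
        }
        if (i + 2 >= in.size()) return std::nullopt;
        const int hi = hexValue(in[i + 1]);
        const int lo = hexValue(in[i + 2]);
        if (hi < 0 || lo < 0) return std::nullopt;
        out += static_cast<char>(hi * 16 + lo);
        i += 2;  // skip the two hex digits just consumed
    }
    return out;
}
```

Both my%20document.doc and my%20document%2Edoc decode to the same string, so a filter that inspects only the undecoded text sees two different names for one resource.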

UTF-8 Variable-Width Encoding

Eight-bit Unicode Transformation Format, UTF-8, as defined in RFC 2279 (http://www.ietf.org/rfc/rfc2279.txt), is a way to encode characters by using one or more bytes. The variable-byte sizes allow UTF-8 to encode many different byte-size character sets, such as 2-byte Unicode (UCS-2), 4-byte Unicode (UCS-4), and ASCII, to name but a few. However, the fact that one character can potentially map to multiple-byte representations is problematic.

How UTF-8 Encodes Data

UTF-8 can encode n-byte characters into different byte sequences, depending on the value of the original characters. For example, a character in the 7-bit ASCII range 0x00–0x7F encodes to 0 7654321, where 0 is the leading bit, set to 0, and 7654321 represents the 7 bits that make up the 7-bit ASCII character. For instance, the letter H, which is 0x48 in hex or 1001000 in binary, becomes the UTF-8 character 0 1001000, or 0x48. As you can see, 7-bit ASCII characters are unchanged by UTF-8.

Things become a little more complex as you start mapping characters beyond the 7-bit ASCII range, all the way up to the top of the Unicode range, 0x7FFFFFFF. For example, any character in the range 0x80–0x7FF encodes to 110 xxxxx 10 xxxxxx, where 110 and 10 are predefined bits and each x represents one bit from the character. For example, pounds sterling is 0xA3, which is 10100011 in binary. The UTF-8 representation is 110 00010 10 100011, or 0xC2 0xA3. However, it doesn’t stop there. UTF-8 can encode larger byte-size characters. Table 11-1 outlines the mappings.

Table 11-1. UTF-8 Character Mappings

Character Range	Encoded Bytes
0x00000000–0x0000007F	0 xxxxxxx
0x00000080–0x000007FF	110 xxxxx 10 xxxxxx
0x00000800–0x0000FFFF	1110 xxxx 10 xxxxxx 10 xxxxxx
0x00010000–0x001FFFFF	11110 xxx 10 xxxxxx 10 xxxxxx 10 xxxxxx
0x00200000–0x03FFFFFF	111110 xx 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx
0x04000000–0x7FFFFFFF	1111110 x 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx
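The mappings in Table 11-1 can be sketched as a small encoder. This is an illustration only: encodeUtf8 is a name I made up, and the code follows the six-row table above even though the current UTF-8 definition (RFC 3629) stops at four bytes and U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one character value using the shortest form from Table 11-1.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x7FFFFFFFu);  // top of the range covered by the table
    if (cp < 0x80) return std::string(1, static_cast<char>(cp));

    int extra;           // number of 10xxxxxx continuation bytes
    unsigned char lead;  // fixed high bits of the first byte
    if      (cp < 0x800)     { extra = 1; lead = 0xC0; }
    else if (cp < 0x10000)   { extra = 2; lead = 0xE0; }
    else if (cp < 0x200000)  { extra = 3; lead = 0xF0; }
    else if (cp < 0x4000000) { extra = 4; lead = 0xF8; }
    else                     { extra = 5; lead = 0xFC; }

    std::string out;
    // First byte: fixed prefix plus the most significant payload bits.
    out += static_cast<char>(lead | (cp >> (6 * extra)));
    // Remaining bytes: 10 followed by six payload bits each.
    for (int i = extra - 1; i >= 0; --i)
        out += static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
    return out;
}
```

Calling encodeUtf8(0xA3) yields the two bytes 0xC2 0xA3, matching the pounds sterling example worked through earlier.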

And this is where the fun starts; it is possible to represent a character by using any of these mappings, even though the UTF-8 specification warns against doing so. All UTF-8 characters should be represented in the shortest possible format. For example, the only valid UTF-8 representation of the ? character is 0x3F, or 00111111 in binary. On the other hand, an attacker might try using illegal nonshortest formats, such as these:

0xC0 0xBF
0xE0 0x80 0xBF
0xF0 0x80 0x80 0xBF
0xF8 0x80 0x80 0x80 0xBF
0xFC 0x80 0x80 0x80 0x80 0xBF

A bad UTF-8 parser might determine that all of these formats are the same, when, in fact, only 0x3F is valid.
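A stricter parser compares the decoded value with the smallest value that legitimately needs that many bytes and rejects anything below it. The sketch below is a hypothetical helper, decodeUtf8Strict, written for this discussion; it assumes the modern four-byte limit, so the five-byte and six-byte forms are rejected outright.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Decodes one UTF-8 sequence that must occupy exactly len bytes.
// Returns std::nullopt for malformed, overlong, or out-of-range input.
std::optional<std::uint32_t> decodeUtf8Strict(const unsigned char* p,
                                              std::size_t len) {
    if (len == 0) return std::nullopt;
    // Smallest value that requires 1, 2, 3, or 4 bytes (index = length).
    static const std::uint32_t kMinForLength[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t need;
    std::uint32_t cp;
    const unsigned char b0 = p[0];
    if (b0 < 0x80)                { need = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; cp = b0 & 0x07; }
    else return std::nullopt;  // stray continuation byte or 5/6-byte lead

    if (len != need) return std::nullopt;
    for (std::size_t i = 1; i < need; ++i) {
        if ((p[i] & 0xC0) != 0x80) return std::nullopt;  // not 10xxxxxx
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < kMinForLength[need]) return std::nullopt;   // overlong form
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return std::nullopt;
    return cp;
}
```

Under these assumptions 0x3F decodes to the ? character, while 0xC0 0xBF and the longer variants listed above are all refused.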

Perhaps the most famous UTF-8 attack was against unpatched Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker made a request such as http://servername/scripts/..%c0%af../winnt/system32/cmd.exe, the server didn’t correctly handle %c0%af in the URL. What do you think %c0%af means? It’s 11000000 10101111 in binary; and if it’s broken up using the UTF-8 mapping rules in Table 11-1, we get this: 110 00000 10 101111. Therefore, the character is 00000101111, or 0x2F, the slash (/) character! The %c0%af is an invalid UTF-8 representation of the / character. Such an invalid UTF-8 escape is often referred to as an overlong sequence.

So when the attacker requested the tainted URL, he accessed http://servername/scripts/../../winnt/system32/cmd.exe. In other words, he walked out of the script’s virtual directory, which is marked to allow program execution, up to the root and down into the system32 directory, where he could pass commands to the command shell, Cmd.exe.

More Information

You can read more about the "File Permission Canonicalization" vulnerability at http://www.microsoft.com/technet/security/bulletin/MS00-057.asp.

UCS-2 Unicode Encoding

UCS-2 issues are a variation of hex encoding and, to some extent, UTF-8 encoding. Two-byte Universal Character Set, UCS-2, can be hex-encoded in a similar manner as ASCII characters but with the %uNNNN format, where NNNN is the hexadecimal value of the Unicode character. For example, %5C is the ASCII and UTF-8 hex escape for the backslash (\) character, and %u005C is the same character in 2-byte Unicode.

To really confuse things, %u005C can also be represented by a wide Unicode equivalent called a fullwidth version. The fullwidth encodings are provided by Unicode to support conversions between some legacy Asian double-byte encoding systems. The characters in the range %uFF00 to %uFFEF are reserved as the fullwidth equivalents of %20 to %7E. For example, the \ character is %u005C and %uFF3C.
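To make the fullwidth relationship concrete, here is a small sketch that folds a fullwidth form back to its ASCII counterpart. The function name foldFullwidth is hypothetical; the sketch relies on the fact that the fullwidth forms U+FF01 through U+FF5E sit at a fixed offset of 0xFEE0 from the printable ASCII characters 0x21 through 0x7E.

```cpp
#include <cassert>

// Maps a fullwidth form (U+FF01..U+FF5E) to the matching ASCII character
// (U+0021..U+007E). Every other value is returned unchanged.
char32_t foldFullwidth(char32_t cp) {
    if (cp >= 0xFF01 && cp <= 0xFF5E) return cp - 0xFEE0;
    return cp;
}
```

For instance, foldFullwidth(0xFF3C) gives 0x5C, the backslash. A filter that looks only for 0x5C never sees the fullwidth variant unless it folds first or, better, rejects the fullwidth range altogether.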

Double Encoding

Just when you thought you understood the various encoding schemes (and we’ve looked at only the most common), along comes double encoding, which involves re-encoding the encoded data. For example, the UTF-8 escape for the backslash character is %5c, which is made up of three characters, %, 5, and c, all of which can be re-encoded using their UTF-8 escapes, %25, %35, and %63. Table 11-2 outlines some double-encoding variations of the \ character.

Table 11-2. Sample Double Escaping Representations of \

Escape	Comments
%5c	Normal UTF-8 escape of the backslash character
%255c	%25, the escape for % followed by 5c
%%35%63	The % character followed by %35, the escape for 5, and %63, the escape for c
%25%35%63	The individual escapes for %, 5, and c

The vulnerability lies in the mistaken belief that a simple unescape operation will yield clean, raw data. The application then makes a security decision based on the data, but the data might not be fully unescaped.
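One conservative policy is to decode once and then reject the request if anything that still looks like an escape remains. The sketch below shows that idea; the names decodeOnce and isMultiplyEncoded are my own, and the policy will also reject the rare legitimate request whose data really does contain a literal escape-like sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the numeric value of a hex digit, or -1 if ch is not one.
int hexValue(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// True if a well-formed %XX escape starts at position i of s.
bool isEscapeAt(const std::string& s, std::size_t i) {
    return i + 2 < s.size() && s[i] == '%' &&
           hexValue(s[i + 1]) >= 0 && hexValue(s[i + 2]) >= 0;
}

// Removes one layer of %XX escapes; everything else is copied through.
std::string decodeOnce(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (isEscapeAt(in, i)) {
            out += static_cast<char>(hexValue(in[i + 1]) * 16 +
                                     hexValue(in[i + 2]));
            i += 2;
        } else {
            out += in[i];
        }
    }
    return out;
}

// True if an escape survives one round of decoding, which suggests the
// data was encoded more than once and should be rejected.
bool isMultiplyEncoded(const std::string& in) {
    const std::string once = decodeOnce(in);
    for (std::size_t i = 0; i < once.size(); ++i)
        if (isEscapeAt(once, i)) return true;
    return false;
}
```

Of the four rows in Table 11-2, only %5c passes this test; the other three all decode to %5c after one pass and are flagged.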

HTML Escape Codes

HTML pages can also escape characters by using special character sequences. For example, angle brackets (< and >) can be represented as &lt; and &gt;, and the pound sterling symbol can be represented as &pound;. But wait, there’s more! These escape sequences can also be represented using the decimal or hexadecimal character values, not just easy-to-remember mnemonics. For example, &lt; is the same as &#x3C; (hexadecimal value of the < character) and is also the same as &#60; (decimal value of the < character). A complete list of these entities is available at http://www.w3.org/TR/REC-html40/sgml/entities.html.
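To show that the decimal and hexadecimal forms really name the same character, here is a minimal sketch that parses a single numeric character reference. The function name parseNumericCharRef is an assumption of mine, and the sketch deliberately ignores the named entities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Parses one reference of the form "&#NNN;" (decimal) or "&#xHH;" (hex).
// Returns the character value, or std::nullopt if the text is malformed.
std::optional<std::uint32_t> parseNumericCharRef(const std::string& ref) {
    if (ref.size() < 4 || ref.compare(0, 2, "&#") != 0 || ref.back() != ';')
        return std::nullopt;

    std::size_t pos = 2;
    std::uint32_t base = 10;
    if (ref[pos] == 'x' || ref[pos] == 'X') { base = 16; ++pos; }
    if (pos == ref.size() - 1) return std::nullopt;  // no digits at all

    std::uint32_t value = 0;
    for (; pos < ref.size() - 1; ++pos) {
        const char ch = ref[pos];
        std::uint32_t digit;
        if (ch >= '0' && ch <= '9') digit = ch - '0';
        else if (base == 16 && ch >= 'a' && ch <= 'f') digit = ch - 'a' + 10;
        else if (base == 16 && ch >= 'A' && ch <= 'F') digit = ch - 'A' + 10;
        else return std::nullopt;
        value = value * base + digit;
        if (value > 0x10FFFF) return std::nullopt;   // beyond Unicode
    }
    return value;
}
```

Both the decimal reference for 60 and the hexadecimal reference for 3C come back as the < character, so a filter that searches only for a literal < misses them.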

As you can see, there are many ways to encode data on the Web, which means that making decisions based on the name of a resource is a dangerous programming practice. Let’s now focus on remedies for these issues.

Visual Equivalence Attacks and the Homograph Attack

In early 2002, two researchers, Evgeniy Gabrilovich and Alex Gontmakher, released an interesting paper entitled "The Homograph Attack," available at http://www.cs.technion.ac.il/~gabr/pubs.html. The crux of their paper is that some characters look the same as others, but they are in fact different. Take a look at Figure 11-2.

Figure 11-2. Looks like localhost, doesn’t it? However, it’s not. The word localhost has a special Cyrillic character "o" that looks like an ASCII "o".

The problem is that the last letter "o" in localhost is not a Latin letter "o," it’s a Cyrillic character "o" (U+043E), and while the two are visually equivalent they are semantically different. Even though the user thinks she is accessing her machine, she is not; she is accessing a remote server on the network. Other Cyrillic examples include a, c, e, p, y, x, H, T, and M—they all look like Latin characters, but in fact, they are not.

Another example is the fraction slash, "⁄" (U+2044), and the slash character, "/" (U+002F). Once again, they look the same. There are many others in the Unicode repertoire; I’ve outlined some in Chapter 14.

The oldest mixup is the number zero and the uppercase letter "O".

The problem with visual equivalence is that users may see a URL that looks like it will perform a given action, when in fact it would perform another action. Who would have thought a link to localhost would have accessed a remote computer named localhost?
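A simple, admittedly blunt, defense is to accept only host names made of plain ASCII letters, digits, hyphens, and dots, and to reject everything else. The sketch below assumes the name arrives as a UTF-8 byte string; isPlainAsciiHostname is a hypothetical helper, and an application that must support internationalized names would need something more careful.

```cpp
#include <cassert>
#include <string>

// True only if host consists solely of ASCII letters, digits, '-' and '.'.
// Any byte outside that set, including every byte of a multibyte UTF-8
// character, causes the name to be rejected.
bool isPlainAsciiHostname(const std::string& host) {
    if (host.empty() || host.size() > 253) return false;
    for (unsigned char ch : host) {
        const bool ok = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
                        (ch >= '0' && ch <= '9') || ch == '-' || ch == '.';
        if (!ok) return false;
    }
    return true;
}
```

The Cyrillic "o" (U+043E) is the two bytes 0xD0 0xBE in UTF-8, so the look-alike localhost fails this check even though it is indistinguishable on screen.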

Preventing Canonicalization Mistakes

Now that I’ve paraded numerous issues and you’ve read the bad news, let’s look at solutions for canonicalization mistakes. The solutions include avoiding making decisions based on names, restricting what is allowed in a name, and attempting to canonicalize the name. Let’s look at each in detail.

Don’t Make Decisions Based on Names

The simplest, and by far the most effective way of avoiding canonicalization bugs is to avoid making decisions based on the filename. Let the file system and operating system do the work for you, and use ACLs or other operating system–based authorization technologies. Of course, it’s not quite as simple as that! Some security semantics cannot currently be represented in the file system. For example, IIS supports scripting. In other words, a script file, such as an ASP page containing Visual Basic Scripting Edition (VBScript) or Microsoft JScript, is read and processed by a script engine, and the results of the script are sent to the user. This is not the same as read access or execute access; it’s somewhere in the middle. IIS, not the operating system, has to determine how to process the file. All it takes is a mistake in IIS’s canonicalization, such as that in the ::$DATA exploit, and IIS sends the script file source code to the user rather than processing the file correctly.

As mentioned, you can limit access to resources based on the user’s IP address. However, this security semantics currently cannot be represented as an ACL, and applications supporting restrictions based on IP address, Domain Name System (DNS) name, or subnet must use their own access code.

Important

Refrain from making security decisions based on the name of a file. The wrong choice might have dire security consequences.

Use a Regular Expression to Restrict What’s Allowed in a Name

I covered this in detail in Chapter 10, but it’s worth repeating. If you must make name-based security decisions, restrict what you consider a valid name and deny all other formats. For example, you might require that all filenames be absolute paths containing a restricted pool of characters. Or you might decide that the following must be true for a file to be determined as valid:

The file must reside on drive c: or d:.
The path is a series of backslashes and alphanumeric characters.
The filename follows the path; the filename is also alphanumeric, is not longer than 32 characters, is followed by a dot, and ends with the txt, jpg, or gif extension.

The easiest way to do this is to use regular expressions. Learning to define and use good regular expressions is critical to the security of your application. A regular expression is a series of characters that define a pattern which is then compared with target data, such as a string, to see whether the target includes any matches of the pattern. For example, the following regular expression will represent the example absolute path just described:

^[cd]:(?:\\\w+)+\\\w{1,32}\.(txt|jpg|gif)$

Refer to Chapter 10 for details about what this expression means.

This expression is strict—the following are valid:

c:\mydir\myotherdir\myfile.txt
d:\mydir\myotherdir\someotherdir\picture.jpg

The following are invalid:

e:\mydir\myotherdir\myfile.txt (invalid drive letter)
c:\fred.txt (must have a directory before the filename)
c:\mydir\myotherdir\..\mydir\myfile.txt (can’t have anything but A-Za-z0-9 and an underscore in a directory name)
c:\mydir\myotherdir\fdisk.exe (invalid file extension)
c:\mydir\myothe~1\myfile.txt (the tilde [~] is invalid)
c:\mydir\myfile.txt::$DATA (the colon [:] is invalid other than after the drive letter; $ is also invalid)
c:\mydir\myfile.txt. (the trailing dot is invalid)
\\myserver\myshare\myfile.txt (no drive letter)
\\?\c:\mydir\myfile.txt (no drive letter)

As you can see, using this simple expression can drastically reduce the possibility of using a noncanonical name. However, it does not detect whether a filename represents a device; we’ll look at that shortly.
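The rules described above can be sketched with the C++ standard library’s regular expression support. This is only an illustration under my own assumptions: the function name isAllowedPath is made up, and I have chosen a case-insensitive match because Windows filenames are not case-sensitive.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True only if path has the form described above: drive c: or d:, one or
// more \directory components made of word characters, then a filename of
// 1 to 32 word characters with a txt, jpg, or gif extension.
bool isAllowedPath(const std::string& path) {
    static const std::regex kAllowed(
        R"(^[cd]:(?:\\\w+)+\\\w{1,32}\.(txt|jpg|gif)$)",
        std::regex::ECMAScript | std::regex::icase);
    // regex_match requires the whole string to match, not just a part.
    return std::regex_match(path, kAllowed);
}
```

A call such as isAllowedPath(R"(c:\mydir\myotherdir\myfile.txt)") succeeds, whereas the short-name, stream, trailing-dot, and UNC forms are all refused because nothing in the pattern admits them.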

Important

Regular expressions teach an important lesson. A regular expression determines what is valid, and everything else is therefore invalid. Determining whether or not an expression is valid is the correct way to parse any kind of input. You should never look for and block invalid data and then allow everything else through; you will likely miss a rare edge case. This is incredibly important. I repeat: look for that which is provably valid, and disallow everything else.

Stopping 8.3 Filename Generation

You should also consider preventing the file system from generating short filenames. This is not a programmatic option; it’s an administrative setting. You can stop Windows from creating 8.3 filenames by adding the following setting to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem registry key:

NtfsDisable8dot3NameCreation : REG_DWORD : 1

This option does not remove previously generated 8.3 filenames.

Don’t Trust the PATH—Use Full Path Names

Never depend on the PATH environment variable to find files. You should be explicit about where your files reside. For all you know, an attacker might have changed the PATH to read c:\myhacktools;%systemroot% and so on! When was the last time you checked the PATH on your systems? The lesson here is to use full path names to your data and executable files, rather than relying on an untrusted variable to determine which files to access.

More Information

A new registry setting in Windows XP allows you to search some of the folders specified in the PATH environment variable before searching the current directory. Normally, the current directory is searched first, which can make it easy for attackers to place Trojan horses on the computer. The registry key is HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SafeDllSearchMode. You need to add this registry key. The value is a DWORD type and is 0 by default. If the value is set to 1, the current directory is searched after system32.

Restricting what is valid in a filename and rejecting all else is reasonably safe, as long as you use a good regular expression. However, if you want more flexibility, you might need to attempt to canonicalize the filename for yourself, and that’s the next topic.

Attempt to Canonicalize the Name

Canonicalizing a filename is not as hard as it seems; you just need to be aware of some Win32 functions to help you. The goal of canonicalization is to get as close as possible to the file system’s representation of the file in your code and then to make decisions based on the result. In my opinion, you should get as close as possible to the canonical representation and reject the name if it still does not look valid. For example, the CleanCanon application I’ve written performs robust canonicalization functions as described in the following steps:

1. It takes an untrusted filename request from a user, for example, mysecretfile.txt.
2. It determines whether the filename is well formed. For example, mysecretfile.txt is valid; mysecr~1.txt, mysecretfile.txt::$DATA, and mysecretfile.txt. (trailing dot) are all invalid.
3. It determines whether the combined length of the filename and the directory is greater than MAX_PATH. If so, the request is rejected. This helps mitigate denial of service attacks and buffer overruns.
4. It prepends an application-configurable directory to the filename, for example, c:\myfiles, to yield c:\myfiles\mysecretfile.txt. It also adds \\?\ to the start of the filename; this instructs the operating system to handle the filename literally and not perform any extra canonicalization steps.
5. It resolves the directory structure, including any parent-path references (..), by calling GetFullPathName.
6. It evaluates the long filename of the file in case the user uses the short filename version. For example, mysecr~1.txt becomes mysecretfile.txt, achieved by calling GetLongPathName. This is technically moot because of the filename validation in step 2. However, it’s a defense-in-depth measure!
7. It determines whether the filename represents a file or a device. This is something a regular expression cannot achieve. If the GetFileType function determines the file to be of type FILE_TYPE_DISK, it’s a real file and not a device of some kind.
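The heart of the steps above, resolve the name fully and only then decide, can be sketched in portable C++17 with std::filesystem. This is not the CleanCanon code: resolveInside is a hypothetical helper, it performs only the path resolution and containment check, and it leaves the well-formedness, length, and file-versus-device checks to separate code.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Joins an untrusted name onto baseDir, resolves "..", "." and symbolic
// links where the path exists, and returns the result only if it still
// lies strictly inside baseDir. Otherwise returns std::nullopt.
std::optional<fs::path> resolveInside(const fs::path& baseDir,
                                      const std::string& untrustedName) {
    std::error_code ec;
    const fs::path base = fs::weakly_canonical(baseDir, ec);
    if (ec) return std::nullopt;

    const fs::path full = fs::weakly_canonical(base / untrustedName, ec);
    if (ec) return std::nullopt;

    // Express the resolved path relative to the base directory. If that
    // relative path is empty, is the base itself, or begins with "..",
    // the request walked out of the directory.
    const fs::path rel = full.lexically_relative(base);
    if (rel.empty() || rel == fs::path(".") ||
        *rel.begin() == fs::path("..")) {
        return std::nullopt;
    }
    return full;
}
```

The decision is made on the resolved path, never on the text the user supplied, which is the same design choice the numbered steps make with GetFullPathName and GetLongPathName.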

Note

Earlier I mentioned that device name issues exist in Linux and UNIX also. C or C++ programs running on these operating systems can determine whether a file is a regular file or a device by calling the stat function and checking the st_mode field of the returned stat structure. If st_mode masked with S_IFMT equals S_IFREG (octal 0100000), which is what the S_ISREG macro tests, the file is a regular file and not a device. Because stat follows symbolic links, call lstat instead if links must be detected as well.
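On a POSIX system, that check might look like the following minimal sketch. The function name isRegularFile is my own; I use lstat rather than stat so that a symbolic link is reported as a link and therefore refused.

```cpp
#include <cassert>
#include <sys/stat.h>

// Returns true only if path names an existing regular file.
// Devices, directories, pipes, sockets, and symbolic links all fail.
bool isRegularFile(const char* path) {
    struct stat st;
    if (lstat(path, &st) != 0) return false;  // missing or inaccessible
    return S_ISREG(st.st_mode);               // (st_mode & S_IFMT) == S_IFREG
}
```

For example, isRegularFile("/dev/null") returns false because /dev/null is a character device.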

Let’s look at this Win32 C++ code, written using Visual C++ .NET, that performs these steps:

/*
    CleanCanon.cpp
*/
#include "stdafx.h"
#include "atlrx.h"
#include "strsafe.h"
#include <new>

enum errCanon {
    ERR_CANON_NO_ERROR = 0,
    ERR_CANON_INVALID_FILENAME,
    ERR_CANON_INVALID_PATH,
    ERR_CANON_NOT_A_FILE,
    ERR_CANON_NO_FILE,
    ERR_CANON_NO_PATH,
    ERR_CANON_TOO_BIG,
    ERR_CANON_NO_MEM};

errCanon GetCanonicalFileName(LPCTSTR szFilename, 
                              LPCTSTR szDir,
                              LPTSTR  *pszNewFilename) {

    //STEP 1
    //Must provide a path and must be smaller than MAX_PATH
    if (szDir == NULL)
      return ERR_CANON_NO_PATH;

   size_t cchDirLen = 0;
   if (StringCchLength(szDir,MAX_PATH,&cchDirLen) != S_OK ||
            cchDirLen > MAX_PATH)
      return ERR_CANON_TOO_BIG;

   *pszNewFilename = NULL;
   LPTSTR szTempFullDir = NULL;
   HANDLE hFile = NULL;

   errCanon err = ERR_CANON_NO_ERROR;

   try {
      //STEP 2 
      //Check filename is valid (alphanum '.' 1-4 alphanums)
      //Check path is valid (alphanum and '\' only)
      //Case insensitive
      CAtlRegExp<> reFilename, reDirname;
      CAtlREMatchContext<> mc;
      reFilename.Parse(_T("^\\a+\\.\\a\\a?\\a?\\a?$"),FALSE);
      if (!reFilename.Match(szFilename,&mc))
         throw ERR_CANON_INVALID_FILENAME;

      reDirname.Parse(_T("^\\c:\\\\[a-z0-9\\\\]+$"),FALSE);
      if (!reDirname.Match(szDir,&mc))
         throw ERR_CANON_INVALID_PATH;

      size_t cFilename = lstrlen(szFilename);
      size_t cDir = lstrlen(szDir);

      //Temp new buffer size, allow for added '\'
      size_t cNewFilename = cFilename + cDir + 1;

      //STEP 3
      //Make sure filesize is small enough
      if (cNewFilename > MAX_PATH)
         throw ERR_CANON_TOO_BIG;

      //Allocate memory for the new filename
      //Accommodate for prefix \\?\ and for trailing '\0'
      LPCTSTR szPrefix = _T("\\\\?\\");
      size_t cchPrefix = lstrlen(szPrefix);
      size_t cchTempFullDir = cNewFilename + 1 + cchPrefix;
      szTempFullDir = new TCHAR[cchTempFullDir];
      if (szTempFullDir == NULL)
         throw ERR_CANON_NO_MEM;

      //STEP 4 
      //Join the dir and filename together. 
      //Prepending \\?\ forces the OS to treat many characters 
      //literally by not performing extra interpretation/canon steps
      if (StringCchPrintf(szTempFullDir,
                          cchTempFullDir,
                         _T("%s%s\\%s"),
                         szPrefix,
                         szDir,
                         szFilename) != S_OK)
         throw ERR_CANON_INVALID_FILENAME;

      // STEP 5 
      // Get the full path, 
      // Accommodates for .. and trailing ’.’ and spaces
      TCHAR szFullPathName [MAX_PATH + 1];
      LPTSTR szFilenamePortion = NULL;
      DWORD dwFullPathLen = 
         GetFullPathName(szTempFullDir,
                         MAX_PATH,
                         szFullPathName,
                         &szFilenamePortion);
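      //A return value of 0 means the call itself failed
      if (dwFullPathLen == 0)
         throw ERR_CANON_INVALID_PATH;
      if (dwFullPathLen > MAX_PATH)
         throw ERR_CANON_NO_MEM;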

      // STEP 6 
      // Get the long filename
      if (GetLongPathName(szFullPathName,
                          szFullPathName,
                          MAX_PATH) == 0) {
         errCanon errName = ERR_CANON_TOO_BIG;
         switch (GetLastError()) {
            case ERROR_FILE_NOT_FOUND : 
                     errName = ERR_CANON_NO_FILE;
                     break;

            case ERROR_NOT_READY :
            case ERROR_PATH_NOT_FOUND :
                     errName = ERR_CANON_NO_PATH;
                     break;

            default : break;
         }

         throw errName; 
      }

      // STEP 7
      // Is this a file or a device?
      hFile = CreateFile(szFullPathName,
                         0,0,NULL,
                         OPEN_EXISTING,
                     SECURITY_SQOS_PRESENT | SECURITY_IDENTIFICATION,
                         NULL);
      if (hFile == INVALID_HANDLE_VALUE)
         throw ERR_CANON_NO_FILE;

      if (GetFileType(hFile) != FILE_TYPE_DISK)
         throw ERR_CANON_NOT_A_FILE;

      //Looks good!
      //Caller must delete [] *pszNewFilename
      const size_t cchNewFilename = lstrlen(szFullPathName)+1;
      *pszNewFilename =  new TCHAR[cchNewFilename];
      if (*pszNewFilename != NULL)
         StringCchCopy(*pszNewFilename,cchNewFilename,szFullPathName);
      else
         err = ERR_CANON_NO_MEM;

   } catch(errCanon e) {
      err = e;
   } catch (const std::bad_alloc&) {
      err = ERR_CANON_NO_MEM;
   }

   delete [] szTempFullDir;
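   //CreateFile returns INVALID_HANDLE_VALUE, not NULL, on failure
   if (hFile != NULL && hFile != INVALID_HANDLE_VALUE)
      CloseHandle(hFile);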

   return err;
}

The complete code listing is available in the companion content, in the folder Secureco2\Chapter11\CleanCanon. CreateFile has a useful side effect when it’s determining whether the file is a disk-based file: the function fails if the file does not exist, saving your application from performing that check.

Calling CreateFile Safely

You may have noticed that dwFlagsAndAttributes flags is nonzero in the CreateFile call in the previous code. There’s a good reason for this. This code does nothing more than verify that a filename is valid, and is not a device or an interprocess communication mechanism, such as a mailslot or a named pipe. That’s it. If it were a named pipe, the process owning the pipe could impersonate the process identity of the code making the request. However, in the interests of security, I don’t want any code I don’t trust impersonating me. So setting this flag prevents the code at the "other end" impersonating you.

Note that there is a small issue with setting this flag, although it doesn’t affect this code, because the code is not attempting to manipulate the file. The problem is that SECURITY_SQOS_PRESENT has the same value as FILE_FLAG_OPEN_NO_RECALL, so the combination SECURITY_SQOS_PRESENT | SECURITY_IDENTIFICATION also sets that bit. FILE_FLAG_OPEN_NO_RECALL indicates that the file is not to be pulled from remote storage if the file exists. This flag is intended for use by the Hierarchical Storage Management system or remote storage systems.

Now let’s move our focus to fixing Web-based canonical representation issues.

Web-Based Canonicalization Remedies

Like all potential canonicalization vulnerabilities, the first defense is to not make decisions based on the name of a resource if it’s possible to represent the resource name in more than one way.

Restrict What Is Valid Input

The next best remedy is to restrict what is considered a valid user request. You created the resources being protected, and so you can define the valid ways to access that data and reject all other requests. Once again, validity is tested using regular expressions. I’ll say it just one more time: always determine what is valid input and reject all other input. It’s safer to have a client complain that something doesn’t work because of an overzealous regular expression than have the service not work because it’s been hacked!

Be Careful When Dealing with UTF-8

If you must manipulate UTF-8 characters, you need to reduce the data to its canonical form by using the MultiByteToWideChar function in Windows. The following sample code shows how you can call this function with various valid and invalid UTF-8 characters. You can find the complete code listing in the companion content in the folder Secureco2\Chapter11\UTF8. Also note that if you want to create UTF-8 characters, you can use WideCharToMultiByte by setting the code page to CP_UTF8.

void FromUTF8(LPBYTE pUTF8, DWORD cbUTF8) {
    WCHAR wszResult[MAX_CHAR+1];
    DWORD dwResult = MAX_CHAR;

    //MB_ERR_INVALID_CHARS makes invalid sequences fail. 
    //On Windows Vista and later, passing 0 instead 
    //substitutes U+FFFD for each invalid sequence.
    int iRes = MultiByteToWideChar(CP_UTF8,
                  MB_ERR_INVALID_CHARS,
                  (LPCSTR)pUTF8,
                  cbUTF8,
                  wszResult,
                  dwResult);

    if (iRes == 0) {
        DWORD dwErr = GetLastError();
        printf("MultiByteToWideChar() failed - > %lu\n", dwErr);
    } else {
        //The input has no terminator, so add one before printing.
        wszResult[iRes] = L'\0';
        printf("MultiByteToWideChar() returned "
               "%S (%d) wide characters\n",
               wszResult,
               iRes);
    }
}


int main() {
    //Get Unicode for 0x5c; should be '\'.
    BYTE pUTF8_1[] = {0x5C};
    DWORD cbUTF8_1 = sizeof pUTF8_1;
    FromUTF8(pUTF8_1, cbUTF8_1);

    //Get Unicode for 0xC0 0xAF. 
    //Should fail because this is 
    //an overlong '/'.
    BYTE pUTF8_2[] = {0xC0, 0xAF};
    DWORD cbUTF8_2 = sizeof pUTF8_2;
    FromUTF8(pUTF8_2, cbUTF8_2);

    //Get Unicode for 0xC2 0xA9; should be 
    //a '©' symbol.
    BYTE pUTF8_3[] = {0xC2, 0xA9};
    DWORD cbUTF8_3 = sizeof pUTF8_3;
    FromUTF8(pUTF8_3, cbUTF8_3);
    return 0;
}

ISAPIs—Between a Rock and a Hard Place

ISAPI applications and ISAPI filters are probably the most vulnerable technologies, because they are often written in relatively low-level C or C++, they handle Web requests and responses, and they manipulate files. If you are writing ISAPI applications for IIS 6, you should use the SCRIPT_TRANSLATED server variable, as it will return a correctly canonicalized filename based on the URL to your code, rather than you performing the work and probably getting it wrong.

A Final Thought: Non-File-Based Canonicalization Issues

The core of this chapter relates to canonical file representation, and certainly the vast majority of canonicalization security vulnerabilities relate to files. However, some vulnerabilities exist in the cases in which a resource can be represented by more than one name. The two that spring to mind relate to server names and usernames.

Server Names

Servers, be they Web servers, file and print servers, or e-mail servers, can be named in a number of ways. The most common way to name a computer is to use a DNS name, for example, northwindtraders.com. Another common way is to use an IP address, such as 192.168.197.100. Either name will access the same server from the client code. Also, a local computer can be known as localhost and can have an IP address in the 127.n.n.n subnet. And if the server is on an internal Windows network, the computer can also be accessed by its NetBIOS name, such as \\northwindtraders.

So, what if your code makes a security decision based on the name of the server? It’s up to you to determine what an appropriate canonical representation is and to compare names against that, failing all names that do not match. The following code can be used to gather various names of a local computer:

/*
    CanonServer.cpp
*/
//Fragment: needs <windows.h>, <tchar.h>, and <stdio.h>
for (int i = ComputerNameNetBIOS; 
    i <= ComputerNamePhysicalDnsFullyQualified; 
    i++) {

    TCHAR szName[256];
    DWORD dwLen = sizeof(szName) / sizeof(szName[0]);

    const TCHAR *cnf;
    switch(i) {
        case ComputerNameNetBIOS : 
            cnf = _T("ComputerNameNetBIOS"); break;
        case ComputerNameDnsHostname : 
            cnf = _T("ComputerNameDnsHostname"); break;
        case ComputerNameDnsDomain : 
            cnf = _T("ComputerNameDnsDomain"); break;
        case ComputerNameDnsFullyQualified : 
            cnf = _T("ComputerNameDnsFullyQualified"); break;
        case ComputerNamePhysicalNetBIOS : 
            cnf = _T("ComputerNamePhysicalNetBIOS"); break;
        case ComputerNamePhysicalDnsHostname : 
            cnf = _T("ComputerNamePhysicalDnsHostname"); break;
        case ComputerNamePhysicalDnsDomain : 
            cnf = _T("ComputerNamePhysicalDnsDomain"); break;
        case ComputerNamePhysicalDnsFullyQualified : 
            cnf = _T("ComputerNamePhysicalDnsFullyQualified"); break;
        default : cnf = _T("Unknown"); break;
    }

    BOOL fRet =
        GetComputerNameEx((COMPUTER_NAME_FORMAT)i,
                          szName,
                          &dwLen);

    if (fRet) {
        _tprintf(_T("%s in '%s' format.\n"), szName, cnf);
    } else {
        _tprintf(_T("Failed %lu\n"), GetLastError());
    }
}

The complete code listing is available in the companion content in the folder Secureco2\Chapter11\CanonServer. You can get the IP address or addresses of the computer by calling the Windows Sockets (Winsock) getaddrinfo function or by using Perl. You can use the following code:

my ($name, $aliases, $addrtype, $length, @addrs) 
    = gethostbyname "mymachinename";
foreach (@addrs) {
    # Each entry is a packed 4-byte IPv4 address.
    my @addr = unpack('C4', $_);
    print "IP: ", join('.', @addr), "\n";
}
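Once you have settled on a single canonical name for the server, the comparison itself should be exact and should fail every other form. Here is a minimal sketch; the function names are hypothetical, and the only normalization it performs is the one DNS itself defines, namely ignoring letter case and a single trailing dot.

```cpp
#include <cassert>
#include <string>

// Lowercases ASCII letters and removes one trailing dot, because DNS
// names compare case-insensitively and a trailing dot denotes the root.
std::string normalizeDnsName(std::string name) {
    if (!name.empty() && name.back() == '.') name.pop_back();
    for (char& ch : name)
        if (ch >= 'A' && ch <= 'Z') ch = static_cast<char>(ch - 'A' + 'a');
    return name;
}

// True only if the presented name is the canonical DNS name. IP
// addresses, NetBIOS names, and "localhost" are deliberately refused,
// even when they would reach the same machine.
bool isExpectedServer(const std::string& presented,
                      const std::string& canonical) {
    return !presented.empty() &&
           normalizeDnsName(presented) == normalizeDnsName(canonical);
}
```

If your application must accept an IP address as well, resolve it to the canonical name first and compare the result, rather than adding the address to a list of acceptable spellings.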

Usernames

Finally, we come to usernames. Historically, Windows supported one form of username: DOMAIN\UserName, where DOMAIN is the name of the user’s domain and UserName is, obviously, the user’s name. This is also referred to as the SAM (Security Account Manager) name. For example, if Blake is in the DEVELOPMENT domain, his account would be DEVELOPMENT\Blake. However, with the advent of Windows 2000, the user principal name (UPN) was introduced, which follows the now-classic and well-understood e-mail address format of user@domain (for example, [email protected]).

Take a look at the following code:

bool AllowAccess(const char *szUsername) {
    const char *szRestrictedDomains[]={"MARKETING", "SALES"};
    for (size_t i = 0; 
         i < sizeof szRestrictedDomains /
             sizeof szRestrictedDomains[0]; 
         i++)
        if (_strnicmp(szRestrictedDomains[i],
                      szUsername,
                      strlen(szRestrictedDomains[i])) == 0)
            return false;
    return true;
}

This code will return false for anyone in the MARKETING or SALES domain. For example, MARKETING\Brian will return false because Brian is in the MARKETING domain. However, if Brian had the valid UPN name [email protected], this function would return true because the name format is different, which causes the case-insensitive string comparison function to always return a nonzero (nonmatch) value.
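A more defensive version of such a check fails closed: it accepts only names that are already in the one expected form and denies everything else, including UPNs. The following is a sketch under my own assumptions. The helper names samDomain and allowAccess are hypothetical, the sample domains in the usage note are invented, and real code would first ask the operating system for the canonical name rather than parse the string itself.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Extracts the upper-cased DOMAIN part of "DOMAIN\UserName".
// Any other shape (no backslash, two backslashes, an '@', an empty
// domain or user part) yields std::nullopt.
std::optional<std::string> samDomain(const std::string& name) {
    const std::size_t slash = name.find('\\');
    if (slash == std::string::npos || slash == 0 || slash + 1 == name.size())
        return std::nullopt;
    if (name.find('\\', slash + 1) != std::string::npos ||
        name.find('@') != std::string::npos)
        return std::nullopt;

    std::string domain = name.substr(0, slash);
    for (char& ch : domain)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    return domain;
}

// Denies restricted domains and, crucially, denies any name that is not
// in SAM form at all instead of letting it fall through as "allowed".
bool allowAccess(const std::string& samName) {
    static const char* const kRestricted[] = {"MARKETING", "SALES"};
    const std::optional<std::string> domain = samDomain(samName);
    if (!domain) return false;
    for (const char* restricted : kRestricted)
        if (*domain == restricted) return false;
    return true;
}
```

With this version a UPN such as brian@marketing.example.com is refused rather than waved through. Comparing the whole domain, instead of a prefix, also avoids blocking an unrelated domain whose name merely begins with MARKETING.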

Windows 2000 and later have a canonical name—it’s the SAM name. All user accounts must have a unique SAM name to be valid on a domain, regardless of whether the domain is Windows NT 4, Windows 2000, Windows 2000 running Active Directory, or Windows XP.

You can use the GetUserNameEx function to determine the canonical user name, like so:

/*
    CanonUser.cpp
*/
#define SECURITY_WIN32
#include <windows.h>
#include <security.h>
#include <tchar.h>
#include <stdio.h>

for (int i = NameUnknown ; 
     i <= NameServicePrincipal; 
     i++) {

    TCHAR szName[256];
    DWORD dwLen = sizeof(szName) / sizeof(szName[0]);

    //Use the enum constants as labels; the values of
    //EXTENDED_NAME_FORMAT are not contiguous.
    const TCHAR *enf = NULL;
    switch(i) {
        case NameUnknown : enf = _T("NameUnknown"); break;
        case NameFullyQualifiedDN : 
            enf = _T("NameFullyQualifiedDN"); break;
        case NameSamCompatible : 
            enf = _T("NameSamCompatible"); break;
        case NameDisplay : enf = _T("NameDisplay"); break;
        case NameUniqueId : enf = _T("NameUniqueId"); break;
        case NameCanonical : enf = _T("NameCanonical"); break;
        case NameUserPrincipal : 
            enf = _T("NameUserPrincipal"); break;
        case NameCanonicalEx : enf = _T("NameCanonicalEx"); break;
        case NameServicePrincipal : 
            enf = _T("NameServicePrincipal"); break;
        default : enf = _T("Unknown"); break;
    }

    BOOL fRet =
        GetUserNameEx((EXTENDED_NAME_FORMAT)i,
                      szName,
                      &dwLen);

    if (fRet) {
        _tprintf(_T("%s in '%s' format.\n"), szName, enf);
    } else {
        _tprintf(_T("%s failed %lu\n"), enf, GetLastError());
    }
}

You can also find this example code in the companion content in the folder Secureco2\Chapter11\CanonUser. Don’t be surprised if you see some errors; some of the extended name formats don’t apply directly to users.

Finally, you should refrain from making access control decisions based on the username. If possible, use ACLs.

Summary

I can summarize this chapter in one sentence—do not make a security decision based on the name of something. If you decide to make such decisions, you will make mistakes and create security vulnerabilities. If you must make a decision based on a name, be conservative—determine what is a valid request, look for requests that match that pattern, and reject everything else.

You can never determine all invalid requests, so don’t go looking for them!
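The conservative approach can be sketched as an allow-list check. This is an illustration only: the function name IsValidRequestName and the pattern itself (1 to 32 letters or digits, one dot, then a txt, jpg, or gif extension) are assumptions chosen for the example, not rules taken from this chapter. Define the pattern that suits your own application.

```cpp
#include <regex>
#include <string>

// Hypothetical allow-list check: accept only names that match one
// strict, known-good pattern, and reject everything else.
bool IsValidRequestName(const std::string &name) {
    // Assumed pattern for illustration: 1 to 32 ASCII letters or
    // digits, a single dot, then one of three permitted extensions.
    static const std::regex valid("^[A-Za-z0-9]{1,32}\\.(txt|jpg|gif)$");
    // regex_match succeeds only if the entire string fits the pattern.
    return std::regex_match(name, valid);
}
```

Because the check describes only what is valid, it rejects parent paths, alternate data stream suffixes, trailing dots, and hexadecimal escapes without having to list any of them.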

You have been warned!
