Chapter 14. Internationalization Issues

No doubt you know that the world is a very small place, and software that recognizes languages other than United States English is increasingly important. Here’s the problem: if you think you know what a character is in a language other than English, you are probably mistaken. Most character set encodings, including Unicode, are evolving. This inherent fuzziness can threaten software security. The rest of this short chapter, based on information learned during Microsoft’s Windows Security Push, describes some of the threats related to internationalization, suggests ways to avoid them, and touches on some other general security best practices.

Note

You’ll often see the term "I18N" when working with foreign language software. I18N means "internationalization" (in which the letter I is followed by 18 characters and then the letter N).

This chapter does not cover general globalization best practices except as they affect security. It’s also assumed that you have read Chapter 10 and Chapter 11. Once you’ve read this chapter, I hope you’ll quickly realize that someone in your group should own the security implications of I18N issues in your applications. Now I’ll explain why.

The Golden I18N Security Rules

You should follow two security rules when building applications designed for international audiences:

  • Use Unicode.

  • Don’t convert between Unicode and other code pages/character sets.

If you follow these two rules, you’ll run into few I18N-related security issues; in fact, you can jump to the next chapter if these two rules hold true for your application! For the rest of you, you need to know a few things.

Use Unicode in Your Application

A character set encoding maps some set of characters (A, ß, Æ, and so on) to a set of binary values (usually from one to four bytes) called code values or code points. Hundreds of such encodings are in use today, and Microsoft Windows supports several dozen. Every character set encoding, including Unicode, has security issues, mainly due to character conversion. However, Unicode is the only worldwide standard and security experts have given it the most thorough examination. The bulk of Windows and Microsoft Office data is stored in Unicode, and your code will have fewer conversion issues—and potentially fewer security issues—if you also use Unicode. The Microsoft .NET common language runtime and the .NET Framework use only Unicode.

Note

There are three primary binary representations of the Unicode encoding: UTF-8, UTF-16, and UTF-32. Although all three forms represent exactly the same character repertoire, UTF-16 is the primary form supported by Windows and .NET, and you will avoid one class of security issue if you use it. UTF-8 is popular for Internet protocols and on other platforms. Windows National Language Support (NLS) provides a pair of functions for converting between UTF-8 and UTF-16: MultiByteToWideChar and WideCharToMultiByte. There is little reason to use UTF-32.

Prevent I18N Buffer Overruns

To avoid buffer overruns, always allocate sufficient buffer space for conversion and always check the function result. The following code shows how to do this correctly.

//Determine the size of the buffer required for the converted string.
//The length includes the terminating null character.
int nLen = MultiByteToWideChar(CP_OEMCP, 
    MB_ERR_INVALID_CHARS, 
    lpszOld, -1, NULL, 0);
//If the function failed, don’t convert!
if (nLen == 0) { 
    //oops!
}

//Allocate the buffer for the converted string.
LPWSTR lpszNew = (LPWSTR) GlobalAlloc(0, sizeof(WCHAR) * nLen);

//If the allocation failed, don’t convert!
if (lpszNew == NULL) {
    //oops!
}

//Convert the string.
nLen = MultiByteToWideChar(CP_OEMCP, 
    MB_ERR_INVALID_CHARS, 
    lpszOld, -1, lpszNew, nLen);
//If the conversion failed, the result is unreliable.
if (nLen == 0) {
    //oops!
}

In general, do not rely on a precalculated maximum buffer size. For example, the new Chinese standard GB18030 (which can be up to 4 bytes for a single character) has invalidated many such calculations.

LCMapString is especially tricky: the output buffer length is measured in words unless the function is called with the LCMAP_SORTKEY option, in which case it is measured in bytes.

More Information

If you think Unicode buffer overruns are hard to exploit, you should read "Creating Arbitrary Shellcode in Unicode Expanded Strings" at http://www.nextgenss.com/papers/unicodebo.pdf.

Words and Bytes

Despite their names and descriptions, most Win32 functions do not process characters. Most Win32 A functions, such as CreateProcessA, process bytes, so a two-byte character, such as a Unicode character, would count as two bytes instead of one. Most Win32 W functions, such as CreateProcessW, process 16-bit words, so a pair of surrogates will count as two words instead of one character. More about surrogates in a moment. Confusion here can easily lead to buffer overruns or over allocation.

Many people don’t realize there are A and W functions in Windows. The following code snippet from winbase.h should help you understand their relationship.

#ifdef UNICODE
#define CreateProcess  CreateProcessW
#else
#define CreateProcess  CreateProcessA
#endif // !UNICODE

Validate I18N

Strings, including Unicode, can be invalid in several ways. For example, a string might contain binary values that do not map to any character or the string might contain characters with semantics outside the domain of the application, such as control characters within a URL. Such invalid strings can pose security threats if your code does not handle them properly.

Starting with Microsoft Windows .NET Server 2003, a new function, IsNLSDefinedString, helps verify that a string contains only valid Unicode characters. If IsNLSDefinedString returns true, you know that the string contains no code points that CompareString will ignore (such as undefined characters or ill-matched surrogate pairs). Your code will still need to check for application-specific exceptions.

Visual Validation

Even with normalization, many characters in Unicode will appear identical to the user. For example, the two Unicode ligature characters "ﬃ" (U+FB03) and "ﬂ" (U+FB02) together display exactly like the five ASCII-range characters "ffifl". There is no way the user can reliably determine this from the visual display. Therefore, do not rely on the user to recognize that a string contains invalid characters. Either eliminate visual normalization or assist the user (for example, by allowing the user to view the binary values).

Do Not Validate Strings with LCMapString

You can use LCMapString to generate the sorting weights for a string. An application can store these weights (a series of integers) to improve performance when comparing the string with other strings. However, using the LCMapString-generated weights is not a reliable way to validate a string. Even though LCMapString returns identical weights for two strings, either string might contain invalid characters. In particular, LCMapString completely ignores undefined characters. Either use the new function, IsNLSDefinedString, or perform your own conservative validation.

Use CreateFile to Validate Filenames

Just because CompareString says two strings are equal (or unequal) does not mean that every part of the system will agree. In particular, CompareString might determine that two strings NTFS considers distinct are equal and vice versa. Always validate the string with the relevant component. For example, to verify that a string matches an existing filename, use CreateFile and check the error status.

Character Set Conversion Issues

In general, every character set encoding assigns slightly different semantics to its code points. Thus, even well-defined mappings between encodings can lose information. For example, a control character meaningful in ISO 8859-8-E (Bidirectional Hebrew) will lose all meaning in UTF-16, and a private use character in codepage 950 (Traditional Chinese Big5) might be a completely different character in UTF-16.

Your code must recognize that these losses can occur. In particular, if your code converts between encodings, do not assume that if the converted string is safe, the original string was also safe.

Use MultiByteToWideChar and WideCharToMultiByte for UTF-8 conversions on Windows XP and later. Conversion between UTF-8 and UTF-16 can be lossless and secure but only if you are careful. If you must convert between the two forms, be sure to use a converter that is up-to-date with the latest security advisories. Several products and Windows components have cloned the early, insecure version—do not use these. Microsoft has tuned the MultiByteToWideChar and WideCharToMultiByte tables over the years for security and application compatibility. Do not roll your own converter, even if this appears to yield a better mapping.

Use MultiByteToWideChar with MB_PRECOMPOSED and MB_ERR_INVALID_CHARS

When calling MultiByteToWideChar, always use the MB_PRECOMPOSED flag. This reduces, but does not eliminate, the occurrence of combining characters and speeds normalization. This is the default. Except for code pages greater than 50000, use MB_ERR_INVALID_CHARS with MultiByteToWideChar. This will catch undefined characters in the source string. The function converts code pages greater than 50000 by using algorithms rather than tables. Depending on the algorithm, invalid characters might be handled by the algorithm and the MB_ERR_INVALID_CHARS option might not be accepted. Check the MSDN documentation for code pages greater than 50000.

Note

Starting with Windows XP, MB_ERR_INVALID_CHARS is supported for UTF-8 conversion as well (code page 65001, or CP_UTF8).

Use WideCharToMultiByte with WC_NO_BEST_FIT_CHARS

For strings that require validation—such as filenames, resource names, and usernames—always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, "∞" (infinity) maps to "8" (eight) in some code pages!

WC_NO_BEST_FIT_CHARS is available only on Microsoft Windows 2000, Microsoft Windows XP, and Microsoft Windows .NET Server 2003. If your code must run on earlier platforms, you can achieve the same effect by converting the resulting string back to the source encoding—that is, by calling WideCharToMultiByte to get the multibyte string and then MultiByteToWideChar with the multibyte string to recover the original Unicode string. Any code point that differs between the original and the recovered string is said to not round-trip, and any code point that does not round-trip is a best-fit character. The following sample outlines how to perform a round-trip check:

/*
 RoundTrip.cpp : Defines the entry point for the console application.
*/

#include "stdafx.h"

/*
  CheckRoundTrip
  Returns TRUE if the given string round trips between Unicode 
  and the given code page.  Otherwise, it returns FALSE.
*/

BOOL CheckRoundTrip(
                    DWORD uiCodePage,
                    LPWSTR wszString) 
{

    BOOL fStatus = TRUE;
    BYTE *pbTemp = NULL;
    WCHAR *pwcTemp = NULL;

    try {
        //Determine if string length is < MAX_STRING_LEN
        //Handles null strings gracefully
        const size_t MAX_STRING_LEN = 200;
        size_t cchCount = 0;
        if (!SUCCEEDED(StringCchLength(wszString, 
                       MAX_STRING_LEN, &cchCount))) 
            throw FALSE;

        pbTemp = new BYTE[MAX_STRING_LEN];
        pwcTemp = new WCHAR[MAX_STRING_LEN];
        if (!pbTemp || !pwcTemp) {
            printf("ERROR: No Memory!\n");
            throw FALSE;
        }

        ZeroMemory(pbTemp, MAX_STRING_LEN * sizeof(BYTE));
        ZeroMemory(pwcTemp, MAX_STRING_LEN * sizeof(WCHAR));

        //Convert from Unicode to the given code page.
        int rc = WideCharToMultiByte( uiCodePage,
            0,
            wszString,
            -1,
            (LPSTR)pbTemp,
            MAX_STRING_LEN,
            NULL,
            NULL );
        if (!rc) {
            printf("ERROR: WC2MB Error = %d, CodePage = %d, String = %ws\n",
                GetLastError(), uiCodePage, wszString);
            throw FALSE;
        }

        //Convert from the given code page back to Unicode.
        //Note: the output buffer size is measured in WCHARs, not bytes.
        rc = MultiByteToWideChar(uiCodePage,
                    0,
                    (LPSTR)pbTemp,
                    -1,
                    pwcTemp,
                    MAX_STRING_LEN );
        if (!rc) {
            printf("ERROR: MB2WC Error = %d, CodePage = %d, String = %ws\n",
                GetLastError(), uiCodePage, wszString);
            throw FALSE;
        }

        //Get length of original Unicode string, 
        //check it’s equal to the conversion length.
        size_t Length = 0;
        StringCchLength(wszString, MAX_STRING_LEN, &Length);
        if (Length + 1 != rc) {
            printf("Length %d != rc %d\n", Length, rc);
            throw FALSE;
        }

        //Compare the original Unicode string to the converted string 
        //and make sure they are identical.
        for (size_t ctr = 0; ctr < Length; ctr++) {
            if (pwcTemp[ctr] != wszString[ctr])
                throw FALSE;
        }
    } catch (BOOL iErr) {
        fStatus = iErr;
    }

    if (pbTemp)  delete [] pbTemp;
    if (pwcTemp) delete [] pwcTemp;

    return (fStatus);
}

int _cdecl main(
                int argc,
                char* argv[])
{
    LPWSTR s1 = L"\x00a9MicrosoftCorp";          // Copyright
    LPWSTR s2 = L"To\x221e&Beyond";              // Infinity

    printf("1252 Copyright = %d\n", CheckRoundTrip(1252, s1));
    printf("437  Copyright = %d\n", CheckRoundTrip(437, s1));
    printf("1252 Infinity  = %d\n", CheckRoundTrip(1252, s2));
    printf("437  Infinity  = %d\n", CheckRoundTrip(437, s2));

    return (1);
}

The sample demonstrates that some characters cannot round-trip in some code pages. Consider the copyright symbol and the infinity sign in code pages 1252 (Windows Latin I, used for Western European languages) and 437 (the original MS-DOS code page): the copyright symbol exists in 1252 but not in 437, and the infinity symbol exists in 437 but not in 1252.

Comparison and Sorting

If the result of the compare is not visible to the user—for example, if you’re generating an internal hash table from the string—consider using binary order. It’s safe, fast, and stable. If the result of the compare is not visible to the user but binary order is unacceptable (the most common reason being case folding, which is outlined at http://www.unicode.org/unicode/reports/tr21), use the Invariant locale, LOCALE_INVARIANT, on Windows XP or the invariant culture in a managed code application.

int nResult = CompareString( 
    LOCALE_INVARIANT,
    NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH,
    lpStr1, -1, lpStr2, -1 );

If your code must run on platforms older than Windows XP, use the U.S. English locale. On Windows XP, CompareString results will then be identical to those with LOCALE_INVARIANT, although Microsoft does not guarantee this to be true for future operating system releases.

int nResult = CompareString( 
    MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT), SORT_DEFAULT),
    NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH,
    lpStr1, -1, lpStr2, -1 );

You should also assume a locale-sensitive compare is random. A frequent cause of errors, some of which pose security threats, is code that makes invalid assumptions about comparisons. In particular, for existing Windows locales:

  • "A" to "Z" might not always sort as in English.

  • When ignoring case, "I" might not always compare equal with "i."

  • "A" might not always come after "a."

  • Latin characters might not always precede other scripts.

Windows will support locales in the future that will include even more differences (or exceptions). If your code uses the user’s locale to compare, assume the result will be random. If this is unacceptable, seriously consider using the Invariant locale.

Unicode Character Properties

Because Unicode contains so many characters, it can be dangerous to assume that a limited range holds a particular property. For example, do not assume that the only digits are U+0030 ("0") through U+0039 ("9"). Unicode 3.1 has many digit ranges. Depending on subsequent processing of the string, characters with undetected properties can cause security problems. The best way to handle this problem is to check the Unicode category. The .NET Framework method GetUnicodeCategory provides this information for managed code. Unfortunately, no interface to this data is included in NLS yet. The latest approved version of the Unicode character properties is always available at http://www.unicode.org/unicode/reports/tr23.

Use GetStringTypeEx for the same purpose, with caution. The GetStringTypeEx properties predate Unicode by several years, and some of the properties assigned to characters are surprising. Nevertheless, many components of Windows use these properties, and it’s reasonable to use GetStringTypeEx if you will be interacting with such components.

Table 14-1 shows the GetStringTypeEx property and the corresponding Unicode properties for code points greater than U+0080. Code point properties less than U+0080 do not correspond with Unicode.

Table 14-1. Unicode Properties

GetStringTypeEx    Unicode Property
C1_ALPHA           Alphabetic or ideographic
C1_UPPER           Upper or title case
C1_LOWER           Lower or title case
C1_DIGIT           Decimal digit
C1_SPACE           White space
C1_PUNCT           Punctuation
C1_CNTRL           ISO control, bidirectional control, join control, format control, or ignorable control
C1_XDIGIT          Hex digit
C3_NONSPACING      Nonspacing
C3_SYMBOL          Symbol
C3_KATAKANA        The character name contains the word KATAKANA
C3_HIRAGANA        The character name contains the word HIRAGANA
C3_HALFWIDTH       Half width or narrow
C3_IDEOGRAPH       Ideographic

Normalization

Many character set encodings, but especially Unicode, have multiple binary representations for the "same" string. For example, there are dozens of distinct strings that might render as "Å". This multiplicity complicates operations such as indexing and validation. The complexity increases the risk of coding errors that will compromise security. To reduce complexity in your code, normalize strings to a single form.

Many normalization forms exist already:

  • The Unicode Consortium has defined four standard normalization forms. Normalization Form C is especially popular. Consider adopting Normalization Form C for new designs. It is the most frequently adopted and the easiest to optimize. Most of the Internet normalization forms are modifications of Normalization Form C. You can find more information at http://www.unicode.org/unicode/reports/tr15/.

  • Normalization of URIs is a hot topic within the Internet Engineering Task Force (IETF) and W3C. Details are available at http://www.i-d-n.net/draft/draft-duerst-i18n-norm-04.txt and at http://www.w3.org/TR/charmod.

  • Each file system has a unique form. NTFS, FAT32, NFS, High Sierra, and MacOS are all quite distinct.

  • Several normalization standards are specific to Internet protocols; consult the RFC for your application domain.

The Win32 FoldString function provides several useful options for normalizing strings. Unfortunately, it doesn’t cover the full range of Unicode characters, and the mappings do not always match any of the Unicode normalization forms. If you do use FoldString, be sure to test your code with the full Unicode repertoire. For example, if you use FoldString with the MAP_FOLDDIGITS option, it will normalize many but not all of the characters with the numeric Unicode property.

Summary

To many people, I18N is a mystery, mainly because so many of us build software for the English-speaking world. We don’t take into consideration non-English writing systems and the fact that it often takes more than one byte to represent a character. This can lead to processing errors that can in turn create security errors such as canonicalization mistakes and buffer overruns. Someone in your group should own the security implications of I18N issues in your applications.

Although I18N security issues can be complex, making globalized software trustworthy does not require that you speak 12 languages and memorize the Unicode code chart. A few principles, some of which were described in this chapter, and a little consultation with specialists are often sufficient.

To remove some of the mystery, look at the http://www.microsoft.com/globaldev Web site, which has plenty of information about I18N, as does the Unicode site, http://www.unicode.org. Also, Unicode has an active mailing list you can join; read http://www.unicode.org/unicode/consortium/distlist.html. Finally, news://comp.std.internat is a newsgroup devoted to international standards issues.
