9. Machine Translation

WHEN IBM PC DOS WAS FIRST TRANSLATED into French, the DOS error message “Out of environment space” was translated as the French equivalent of “There is no room in the garden.” I used to imagine French people looking forlornly out of the window thinking to themselves, “That may be true, but what’s wrong with my computer?” The translation industry and the tools that it spawns have made great advances in the last 20 years, but the pitfalls are still the same. In this chapter, we look at how we can use machine translation (MT)—translations performed by computers. We build a translation engine that we later use in Chapter 10, “Resource Administration,” to automatically translate resources and to add whole new languages to an application at the click of a button, and in Chapter 12, “Custom Resource Managers,” to perform automatic translation on the fly for applications with highly dynamic content.

The journey starts with a critique of the kind of success you can expect from machine translation. We create a foundation for the translation engine upon which all translators will be based. We create a range of translators that I have arranged into groups: pseudo translators, Web service translators, HTML translators, and an Office 2003 Research Service translator. Finally, we conclude with a simple utility, Translator Evaluator, that enables you to test and compare the success, performance, and accuracy of automatic translators.

How Good Is It?

The short answer to this question is that it’s not perfect, and you wouldn’t want to ship your application with a wholly machine-based translation. You should look at machine translation as a means of providing a first-cut translation. This first attempt would be handed over to a human translator (along with the original language upon which the translation was performed), who would fix the translation mistakes.

Machine translation offers a number of benefits. It significantly reduces the amount of work that the translator has to do and, therefore, reduces the cost of translation. One statistic states that the maximum number of words a translator can translate per day is 2,000. A crude estimate of the time involved in translation can be obtained by adding up the number of words in your resources and dividing by 2,000; the result is the number of translator days. The resulting time will certainly be longer than it would take a machine to translate the same text.
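The arithmetic is simple enough to script. The following is a minimal sketch, assuming that words are separated by white space and that the 2,000-words-per-day figure holds; the TranslationEstimate class and its method names are hypothetical and are not part of the book's code.

```csharp
using System;

public static class TranslationEstimate
{
    // Assumed throughput: the 2,000 words per day quoted in the text.
    public const int WordsPerDay = 2000;

    // Counts whitespace-separated words across all resource strings.
    public static int CountWords(string[] resourceStrings)
    {
        int count = 0;
        foreach (string value in resourceStrings)
        {
            // A null separator array means "split on any white space".
            count += value.Split((char[]) null,
                StringSplitOptions.RemoveEmptyEntries).Length;
        }
        return count;
    }

    // Rounds up: a part-day of translation still occupies a day.
    public static int EstimateDays(int wordCount)
    {
        return (wordCount + WordsPerDay - 1) / WordsPerDay;
    }
}
```

For example, resources totaling 5,000 words would come out at three translator days under this rule.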

In addition, machine translation produces a first-cut translation that can be used for testing and demonstration purposes. The development team commonly doesn’t speak the target language of the application, so, to them, a first-cut translation to the target language is indistinguishable from the final translation. The benefit, though, is that the development team and the user acceptance team can clearly see that a translation has been performed and can detect problems in the internationalization process early in the development process.

The longer answer to the question “How good is it?” requires a recognition of the problems of machine translation. Most languages are ambiguous, and English is one of the most ambiguous. If you look up a word in a dictionary, you will probably find several meanings for the same word. For example, the English word close can mean “to shut,” “to be near,” “to finish,” or “to be fond of,” and each meaning often has a different word. The correct meaning can be inferred only from either the surrounding words or the context in which the phrase is used. To help with the problem of context, some machine translators support a “Subject” setting that enables you to specify a genre or industry in which the words are used (e.g., Science, Sport, Accounting, Aviation, Medical). However, this doesn’t always help. Often when a machine translator cannot determine the context of a word, it makes its best guess. The English phrase “Log off” seems obvious enough to an English speaker, but translated into German, it can get translated as “Plank of wood away” (Log is “plank of wood” and off is “away”).

In addition, there is no one “true” translation for most words or phrases. If you translate the simple English word Enter into German, various translators will return “Kommen Sie herein,” “Kommen Sie,” “Tragen Sie ein,” and “Hereingehen.” If you are looking for a gauge of how accurate your translation is and you don’t speak the target language yourself, try translating from your own language to the target language and back again. If you get the original text back again, the translation is more likely to be accurate (but this isn’t a guarantee because it could be translated incorrectly in both directions). The reverse, however, isn’t necessarily true, so if you don’t get your original text back, it doesn’t necessarily follow that the translation is wrong. (If you are looking for some translation-related fun, you could write a kind of translation Chinese whispers quote of the day program in which you translate a phrase through several different languages and finally back to the original language, and see what you get.) This is one of the reasons why “universal translators” that translate all languages into an intermediate language such as Esperanto and then from the intermediate language to the target language don’t fare so well in terms of accuracy.
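The there-and-back test can be automated. The sketch below assumes a translator exposing the same Translate signature used throughout this chapter; the RoundTripCheck and CannedTranslator classes are hypothetical stand-ins for illustration, and the comparison (ignoring case and surrounding white space) is just one reasonable choice.

```csharp
using System;
using System.Collections.Generic;

// Trimmed copy of the chapter's ITranslator so the sketch compiles alone.
public interface ITranslator
{
    string Translate(
        string inputLanguage, string outputLanguage, string text);
}

public static class RoundTripCheck
{
    // True when translating there and back yields the original text.
    // A match is encouraging, not proof; a mismatch does not prove
    // that the forward translation is wrong.
    public static bool SurvivesRoundTrip(ITranslator translator,
        string inputLanguage, string outputLanguage, string text)
    {
        string forward = translator.Translate(
            inputLanguage, outputLanguage, text);
        string back = translator.Translate(
            outputLanguage, inputLanguage, forward);
        return String.Equals(text.Trim(), back.Trim(),
            StringComparison.OrdinalIgnoreCase);
    }
}

// Stand-in translator backed by canned answers, for demonstration only.
public class CannedTranslator : ITranslator
{
    private Dictionary<string, string> answers =
        new Dictionary<string, string>();

    public void Add(string inputLanguage, string outputLanguage,
        string text, string translation)
    {
        answers[inputLanguage + "|" + outputLanguage + "|" + text] =
            translation;
    }
    public string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        return answers[inputLanguage + "|" + outputLanguage + "|" + text];
    }
}
```

In practice, you would pass a real translator instead of CannedTranslator and treat a false result as a prompt for human review rather than as a verdict.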

Finally, it should be pointed out that correct spelling and grammar are a prerequisite for a successful machine translation. If words are spelled incorrectly, a machine translator cannot possibly translate the word correctly (and this can affect the success of the translation of the surrounding words). See the “Resource strings should be spelled correctly” rule in Chapter 13, “Testing Internationalization Using FxCop,” for details on checking the spelling of your resources.

Translation Engine

The translation engine is a library of machine translators that support one or more translation language pairs (e.g., English to French). In all but one of the translators in this chapter, the language pairs are culture neutral. The translators shown in this chapter represent examples of free translation sources, but you can add your own translator classes. Figure 9.1 shows the translators included with the code for this book.

Figure 9.1. Translator Class Hierarchy

image

Translators implement the ITranslator interface (shown later), which has a method called Translate. In this example, we are creating a new WebServiceXTranslator object (which we cover in the section on Web service translators) and are calling the Translate method to translate “Eddies in the space time continuum” from English to German:


ITranslator translator = new WebServiceXTranslator();
string text = translator.Translate(
    "en", "de", "Eddies in the space time continuum");

Translators support a method called IsSupported so that they can be interrogated to see if they support a given language pair:


ITranslator translator = new WebServiceXTranslator();
if (translator.IsSupported("en", "de"))
{
    string text = translator.Translate(
        "en", "de", "Eddies in the space time continuum");
}

Typically, you won’t use just one translator; you’ll use a collection of translators. Not all translators support all possible language pairs, so a collection of translators enables you to support the sum of language pairs for many translators. In addition, not all translators are always online, so having a language pair redundancy enables you to fall back to other translators when one becomes unavailable. For this purpose, there is the TranslatorCollection class:


TranslatorCollection translators = new TranslatorCollection();
translators.Add(new PseudoTranslator());
translators.Add(new WebServiceXTranslator());
translators.Add(new CloserFarTranslator());
translators.Add(new AltaVistaTranslator());
translators.Add(new FreeTranslationTranslator());
translators.Add(new Office2003ResearchServicesTranslator());

In this example, we create a new TranslatorCollection and add a number of translators that support overlapping and unique language pairs. You can then get a translator from the collection for a given language pair:


ITranslator translator = translators.GetTranslator("en", "de");
string text = translator.Translate(
    "en", "de", "Eddies in the space time continuum");

The algorithm used to get the translator is a simple sequential search, but you could change the algorithm to be more sophisticated, to favor certain translators for certain language pairs.
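One way to favor certain translators is to consult a table of preferences before falling back to the sequential search. This is a minimal sketch, not the book's TranslatorCollection: the PreferringTranslatorCollection and FixedPairTranslator classes and the Prefer method are hypothetical, and the interface is trimmed to the two members the search needs.

```csharp
using System;
using System.Collections.Generic;

// Trimmed copy of the chapter's ITranslator so the sketch compiles alone.
public interface ITranslator
{
    string Name { get; }
    bool IsSupported(string inputLanguage, string outputLanguage);
}

public class PreferringTranslatorCollection : List<ITranslator>
{
    // Maps "input|output" (lower case) to the preferred translator's Name.
    private Dictionary<string, string> preferences =
        new Dictionary<string, string>();

    private static string Key(
        string inputLanguage, string outputLanguage)
    {
        return (inputLanguage + "|" + outputLanguage).ToLowerInvariant();
    }
    public void Prefer(string inputLanguage, string outputLanguage,
        string translatorName)
    {
        preferences[Key(inputLanguage, outputLanguage)] = translatorName;
    }
    public ITranslator GetTranslator(
        string inputLanguage, string outputLanguage)
    {
        string preferredName;
        if (preferences.TryGetValue(
            Key(inputLanguage, outputLanguage), out preferredName))
        {
            foreach (ITranslator translator in this)
            {
                if (translator.Name == preferredName &&
                    translator.IsSupported(inputLanguage, outputLanguage))
                    return translator;
            }
        }
        // No usable preference: fall back to the sequential search.
        foreach (ITranslator translator in this)
        {
            if (translator.IsSupported(inputLanguage, outputLanguage))
                return translator;
        }
        return null;
    }
}

// Stand-in translator that supports exactly one pair, for demonstration.
public class FixedPairTranslator : ITranslator
{
    private string name;
    private string input;
    private string output;

    public FixedPairTranslator(string name, string input, string output)
    {
        this.name = name;
        this.input = input;
        this.output = output;
    }
    public string Name
    {
        get {return name;}
    }
    public bool IsSupported(string inputLanguage, string outputLanguage)
    {
        return inputLanguage == input && outputLanguage == output;
    }
}
```

Because the fallback is still the sequential search, a preference that names a missing or unsupporting translator does no harm.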

The ITranslator Interface

Translators are defined by the ITranslator interface:


public interface ITranslator
{
    bool IsSupported(string inputLanguage, string outputLanguage);
    string Translate(string inputLanguage, string outputLanguage,
        string text);
    string Name {get;}
    bool Enabled {get;set;}
    string[,] LanguagePairs {get;}
}

You have already seen the IsSupported and Translate methods. The Name property is the name of the translator. The Enabled property specifies whether the translator is enabled. This can be used to turn off a translator. For example, the Resource Administrator (in Chapter 10) catches exceptions thrown by the Translate method and sets Enabled to false. This isn’t the same as removing a translator from the collection because, in a long-running application, you might want to “resurrect” translators to give them a second chance.

The LanguagePairs property is an array of language pairs supported by the translator. The array is intended for informational purposes only. To determine whether a language pair is supported, use the IsSupported method instead of scanning the LanguagePairs array. The IsSupported method is more accurate than walking through the LanguagePairs array, mainly because the LanguagePairs array can contain wildcards (“*”) to denote “any language,” but also because some translators could support language pairs that can be tested only dynamically.

The Translator Class

The Translator class implements the ITranslator interface and acts as a base class for all the translators in this chapter. Of course, if you write your own translator classes, you do not need to inherit from Translator—you only need to support the ITranslator interface. The Translator class implements the basic functionality for the properties and the IsSupported method. The Translate method is left for subclasses to implement.


public abstract class Translator: ITranslator
{
    private bool enabled = true;
    private string name;
    private string[,] languagePairs;

    public Translator(string name)
    {
        this.name = name;
    }
    public Translator(string name, string[,] languagePairs)
    {
        this.name = name;
        this.languagePairs = languagePairs;
    }
    public string Name
    {
        get {return name;}
    }
    public bool Enabled
    {
        get {return enabled;}
        set {enabled = value;}
    }
    public virtual string[,] LanguagePairs
    {
        get {return languagePairs;}
    }
    public virtual bool IsSupported(
        string inputLanguage, string outputLanguage)
    {
        if (languagePairs == null)
            return false;

        if (inputLanguage.Length < 2 || outputLanguage.Length < 2)
            return false;

        for(int pairNumber = 0;
            pairNumber < languagePairs.GetLength(0); pairNumber++)
        {
            if ((String.Compare(inputLanguage,
                languagePairs[pairNumber, 0], true,
                CultureInfo.InvariantCulture) == 0
                || languagePairs[pairNumber, 0] == "*")
                && String.Compare(outputLanguage,
                languagePairs[pairNumber, 1], true,
                CultureInfo.InvariantCulture) == 0)
                return true;
        }
        return false;
    }
    public abstract string Translate(
        string inputLanguage, string outputLanguage, string text);
}

In addition, the Translator class supports three conversion helper methods that can be used as needed.

The TranslatorCollection Class

The TranslatorCollection class is a collection of objects that implement the ITranslator interface:


public class TranslatorCollection : List<ITranslator>

(In .NET Framework 1.1, the base class is CollectionBase instead of List<ITranslator>.)

TranslatorCollection implements the following methods:


public int IndexOf(string name)
{
    for(int index = 0; index < Count; index++)
    {
        if (this[index].Name == name)
            return index;
    }
    return -1;
}
public ITranslator GetTranslator(
    string inputLanguage, string outputLanguage)
{
    int index = IndexOf(inputLanguage, outputLanguage);
    if (index == -1)
        return null;
    return this[index];
}
public ITranslator GetEnabledTranslator(
    string inputLanguage, string outputLanguage)
{
    int index = EnabledIndexOf(inputLanguage, outputLanguage);
    if (index == -1)
        return null;
    return this[index];
}
public int IndexOf(string inputLanguage, string outputLanguage)
{
    for(int index = 0; index < Count; index++)
    {
        if (this[index].IsSupported(inputLanguage, outputLanguage))
            return index;
    }
    return -1;
}
public int EnabledIndexOf(
    string inputLanguage, string outputLanguage)
{
    for(int index = 0; index < Count; index++)
    {
        ITranslator translator = this[index];
        if (translator.Enabled &&
            translator.IsSupported(inputLanguage, outputLanguage))
            return index;
    }
    return -1;
}

You have already seen GetTranslator, which gets a translator from the list when given a language pair. GetTranslator is based upon an overloaded IndexOf method, which accepts the same parameters. GetEnabledTranslator (based upon EnabledIndexOf) performs the same search as GetTranslator, except that it looks for only enabled translators. This method is useful if your application catches translator exceptions and “turns off” translators, or, alternatively, if you offer the user the capability to turn translators on and off.
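The "turn off and move on" idea might look like the following. This is a sketch under assumptions rather than the Resource Administrator's actual code: the FallbackTranslation and StubTranslator classes are hypothetical, the interface is trimmed to the members used, and catching every Exception is a simplification that a real application would probably narrow.

```csharp
using System;
using System.Collections.Generic;

// Trimmed copy of the chapter's ITranslator so the sketch compiles alone.
public interface ITranslator
{
    bool Enabled { get; set; }
    bool IsSupported(string inputLanguage, string outputLanguage);
    string Translate(
        string inputLanguage, string outputLanguage, string text);
}

public static class FallbackTranslation
{
    // Tries each enabled translator in turn. A translator that throws
    // is disabled so that later calls skip it. Returns null if no
    // translator succeeds.
    public static string TranslateWithFallback(
        IList<ITranslator> translators,
        string inputLanguage, string outputLanguage, string text)
    {
        foreach (ITranslator translator in translators)
        {
            if (! translator.Enabled ||
                ! translator.IsSupported(inputLanguage, outputLanguage))
                continue;
            try
            {
                return translator.Translate(
                    inputLanguage, outputLanguage, text);
            }
            catch (Exception)
            {
                // Simplification: any failure turns the translator off.
                translator.Enabled = false;
            }
        }
        return null;
    }
}

// Stand-in translator: returns a fixed result, or throws if it is null.
public class StubTranslator : ITranslator
{
    private bool enabled = true;
    private string result;

    public StubTranslator(string result)
    {
        this.result = result;
    }
    public bool Enabled
    {
        get {return enabled;}
        set {enabled = value;}
    }
    public bool IsSupported(string inputLanguage, string outputLanguage)
    {
        return true;
    }
    public string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        if (result == null)
            throw new InvalidOperationException("Service unavailable");
        return result;
    }
}
```

A long-running application could periodically set Enabled back to true to give disabled translators their second chance.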

Pseudo Translation

As part of the internationalization process, you will certainly want to test how localizable your application is. No matter how sophisticated your toolkit is, some part of the testing process must involve looking at the end result of the translation. The form in Figure 9.2 has been translated into Greek.

Figure 9.2. A Windows Form Localized into Greek

image

Although this scores 10 out of 10 for showing that the application is localizable, the user interface is very difficult to navigate if you can’t read Greek. Which of the buttons in the screenshot enables the user to add payments? The user acceptance team will be unable to use the Greek version of the application to test that the application has been globalized, but if they use the English version of the application, they cannot accurately test globalization, either. You need a solution that enables the user acceptance team to read the application so that they can operate it properly, but the language and culture must be different from the development team’s language and culture. Enter pseudo translation and pseudo culture.

Pseudo translation is translation that modifies the original text so that the original text can still be inferred from the translated result. You can pseudo translate your application in many different ways:

• You could capitalize the text (e.g., “Exit” becomes “EXIT”). The downside is that there is no difference when the original text is already in capitals (e.g., “OK”).

• You could prefix the text with a letter (e.g., “T” for “translated,” so that “Exit” becomes “T Exit”).

• You could prefix the text with language and culture (e.g., “Exit” becomes “fr-FR Exit”).

• You could prefix the text with “xxx” and suffix with “xxx” (e.g., “Exit” becomes “xxx Exit xxx”).

• You could convert to Pig Latin (e.g., “Exit” becomes “Exit-hay”).

• You could convert each character to an accented version of the same character (e.g., the “E” in “Exit” might become “É”).

• You could translate to a language for which there is no country. Typically, this will be because the language simply isn’t used. Such languages include Latin, Esperanto, Middle Earth (from Lord of the Rings), and Klingon. The Middle Earth and Klingon languages are especially difficult to translate to, partly because there are no machine translators for these languages, but also because Klingon grammar is so very different from Roman grammar.

The option that I have chosen is to convert to accented characters. In this approach, each character is converted to an equivalent character that has an accent. You can find equivalent accented characters using the Character Map (Start, All Programs, Accessories, System Tools, Character Map).

Figure 9.3. Using the Character Map to Find Accented Characters

image

In addition, the result needs to be padded. English is not the longest language in the world, and often only after the translation has occurred will you discover that your carefully designed forms don’t allow enough room for the translated language. German and Welsh are typically significantly longer than English. If you are translating your application for the European market, target German first, in order to identify form design problems in which the translated string needs to be padded. A general rule is that, for text less than 10 characters, the text needs to be padded by an additional 400 percent; for text greater than or equal to 10 characters, the text needs to be padded by an additional 30 percent. You also need to decide whether to pad on the left, on the right, or both. The benefit of padding on both sides is that you can check for screen design problems when your forms have been mirrored (for right-to-left languages).
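The rule of thumb translates directly into arithmetic. The following sketch assumes that the percentages apply to the character count of the original text and that the extra characters are split evenly when padding both sides; the PaddingRule class and its method names are hypothetical.

```csharp
using System;

public static class PaddingRule
{
    // Extra characters suggested by the rule of thumb: 400 percent for
    // text shorter than 10 characters, otherwise 30 percent (rounded up).
    public static int ExtraCharacters(string text)
    {
        if (text.Length < 10)
            return text.Length * 4;
        return (text.Length * 30 + 99) / 100;
    }

    // Characters to add on each padded side; sides is 1 or 2.
    public static int ExtraCharactersPerSide(string text, int sides)
    {
        return (ExtraCharacters(text) + sides - 1) / sides;
    }
}
```

Under these assumptions, the four-character “Exit” gains 16 characters (8 per side when padding both sides), whereas the eleven-character “Add payment” gains only 4.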

The form in Figure 9.4 has been translated using the PseudoTranslator class.

Figure 9.4. An Example of a Pseudo Translated Form

image

You should be able to read all the text on the form and, therefore, know which button to press to add payments.

Choosing a Culture for Pseudo Translation

To use pseudo translation, you must identify a culture that represents the pseudo language. In Chapter 11, “Custom Cultures,” I show how you can create your own cultures for various reasons, including supporting a pseudo translation. Creating a custom culture is a good approach in the following circumstances:

  1. You are using Visual Studio 2005 or
  2. You are writing an ASP.NET application or
  3. You are not using resx files for resources.

However, if you are using Visual Studio 2003 and are writing a Windows Forms application based on resx files, any custom culture that you create will not be recognized by Visual Studio 2003. See Chapter 11 for more information.

If you are unable to create a custom culture for pseudo translation, the alternative is to hijack an existing culture. In this approach, you identify a language/country that you do not expect to ever use for its real purpose and hijack it for your own purpose. This requires a certain amount of thought. First, hijacking someone else’s culture could potentially offend the people of that culture (assuming that they find out). Second, you want to choose a culture that is as different from your own as possible. It should have a different date/time format, number format, currency, time zone, and as many other globalization differences as possible. So it is with great apologies to the people of Albania that I have chosen Albanian as the culture to hijack. From the Regional and Language Options page (see Figure 9.5), you can see that the Albanian culture is significantly different from the English culture.

Figure 9.5. Regional and Language Options Dialog Showing Albanian Culture

image

The number format is different, the currency symbol is suffixed instead of prefixed, the time suffix is PD instead of AM, and the date separator is “-” instead of “/”.

Regardless of whether you create a custom culture or hijack an existing culture, you need a means by which you can communicate your chosen culture to code that needs to know the pseudo culture. The following PseudoTranslation class has a static property that represents a global placeholder for your chosen pseudo translation culture (“sq” is the language code for Albanian):


public class PseudoTranslation
{
    private static CultureInfo cultureInfo = new CultureInfo("sq");
    public static CultureInfo CultureInfo
    {
        get {return cultureInfo;}
        set {cultureInfo = value;}
    }
}

Now a translator class can indicate that it supports translation to the pseudo culture without hard-coding what the pseudo culture is.

The PseudoTranslator Class

The PseudoTranslator class is the least technically sophisticated translator:


public enum PseudoTranslationPadding {None, Left, Right, Both};

public class PseudoTranslator: Translator
{
    private static PseudoTranslationPadding padding =
        PseudoTranslationPadding.Both;

    public PseudoTranslator(): base("Pseudo Translator",
        new string[,] {{"*", Internationalization.
        Common.PseudoTranslation.CultureInfo.Name}})
    {
    }
    public static PseudoTranslationPadding Padding
    {
        get {return padding;}
        set {padding = value;}
    }
    public override string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        if (! IsSupported(inputLanguage, outputLanguage))
            throw new LanguageCombinationNotSupportedException(
              "Language combination is not supported",
              inputLanguage, outputLanguage);

        return PseudoTranslate(text);
    }
}

The constructor simply calls the Translator base class constructor and supplies its name and the array of language pairs that it supports. It supports just one language pair, which translates from “*” (meaning any language) to the pseudo translation language. The PseudoTranslator class has a static Padding property, which indicates whether padding is added on the left, the right, both sides, or not at all. The Translate method checks to see that the language translation is supported; if it isn’t, it throws an exception. Otherwise, it performs the translation by calling the PseudoTranslate method:


protected virtual string PseudoTranslate(string text)
{
    StringBuilder stringBuilder = new StringBuilder("[");
    if (padding == PseudoTranslationPadding.Left ||
        padding == PseudoTranslationPadding.Both)
    {
        // add padding on the left
        for(int padCount = 0;
            padCount < GetPaddingOnOneSideCount(text); padCount++)
        {
            stringBuilder.Append("!!! ");
        }
    }

    bool previousCharacterIsSlash = false;
    foreach(char chr in text)
    {
        if (previousCharacterIsSlash)
            stringBuilder.Append(chr);
        else
            stringBuilder.Append(ConvertCharacter(chr));

        previousCharacterIsSlash = (chr == '\\');
    }

    if (padding == PseudoTranslationPadding.Right ||
        padding == PseudoTranslationPadding.Both)
    {
        // add padding on the right
        for(int padCount = 0;
            padCount < GetPaddingOnOneSideCount(text); padCount++)
        {
            stringBuilder.Append(" !!!");
        }
    }
    stringBuilder.Append("]");
    return stringBuilder.ToString();
}

It builds the translated text by adding the padding prefix (if any) and converting each character using the ConvertCharacter method and then adding the padding suffix (if any). Notice the references to the previousCharacterIsSlash variable, which is an attempt to preserve control characters embedded in the string. Refer to the section entitled “Embedded Control Characters” in Chapter 8, “Best Practices,” for a discussion on the pros and cons of embedding control characters in strings. The PseudoTranslator class can cope with embedded control characters because it performs a translation character by character and can put embedded control characters back into the same relative position that they occupied in the original string. Other translator classes do not have the same luxury and cope with embedded control characters less successfully. (Chapter 13 includes an FxCop rule to catch resource strings with embedded control characters.) The ConvertCharacter method is a straight lookup, replacing character for character:

image
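A lookup of this kind can be pictured along the following lines. This is a minimal sketch with an assumed and deliberately partial mapping table: the AccentLookup class, the ConvertText helper, and the choice of accented letters are mine for illustration, and characters without an entry pass through unchanged.

```csharp
using System;
using System.Text;

public static class AccentLookup
{
    // Assumed, partial mapping: the character at position i of Plain
    // maps to the character at position i of Accented.
    private const string Plain    = "AEIOUaeioucnyCNY";
    private const string Accented = "ÀÉÎÕÜàéîõüçñýÇÑÝ";

    // Straight lookup, replacing character for character.
    public static char ConvertCharacter(char chr)
    {
        int index = Plain.IndexOf(chr);
        return index == -1 ? chr : Accented[index];
    }

    // Convenience wrapper that converts a whole string.
    public static string ConvertText(string text)
    {
        StringBuilder builder = new StringBuilder(text.Length);
        foreach (char chr in text)
        {
            builder.Append(ConvertCharacter(chr));
        }
        return builder.ToString();
    }
}
```

A fuller table would cover every letter so that untranslated text stands out more clearly on the form.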

Our first translator is complete.

Static Lookup Translator

An obvious approach to translation is to keep all the phrases used in an application in a large lookup table of some kind. This is the approach adopted by the Microsoft Application Translator (see Appendix B, “Information Resources”), which includes various lookup tables. Of course, this solution doesn’t help you with the actual translation process, but having found a lookup source (or, more likely, having created one yourself or with the help of a translator), you can save some effort on subsequent translations by keeping such words or phrases in a static lookup. Such glossaries exist. Microsoft’s glossaries are at ftp://ftp.microsoft.com/developr/msdn/newup/Glossary/; you can read these manually or use the inexpensive WinLexic (http://www.winlexic.com) to download, read, and search them through a friendlier interface. However, you should be aware of the following:

• The license agreement might prevent you from using the glossary in this way. (Microsoft’s glossary license explicitly states this.)

• The lookups are good for only known phrases. The majority of your application text is unlikely to match existing glossary text.

• Words used in one context do not necessarily have the same translations in another context. (Welsh, for example, has many forms of “yes” and “no,” which are dependent upon the question being asked.)

For these reasons, I have left this kind of translator as an exercise for the reader.
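As a starting point for that exercise, the core of such a translator is little more than a dictionary keyed on the language pair and the source text. The sketch below is one possible shape, not code from the book: the StaticLookup class and its methods are hypothetical, the match on the source text is exact (including case), and loading the glossary from a file is left out.

```csharp
using System;
using System.Collections.Generic;

public class StaticLookup
{
    private Dictionary<string, string> glossary =
        new Dictionary<string, string>();

    // A newline separates the key parts, on the assumption that it
    // never appears in a language code.
    private static string Key(
        string inputLanguage, string outputLanguage, string text)
    {
        return inputLanguage.ToLowerInvariant() + "\n" +
            outputLanguage.ToLowerInvariant() + "\n" + text;
    }
    public void Add(string inputLanguage, string outputLanguage,
        string text, string translation)
    {
        glossary[Key(inputLanguage, outputLanguage, text)] = translation;
    }
    // Returns false (and a null translation) for unknown phrases.
    public bool TryTranslate(string inputLanguage,
        string outputLanguage, string text, out string translation)
    {
        return glossary.TryGetValue(
            Key(inputLanguage, outputLanguage, text), out translation);
    }
}
```

Wrapping this in a class that implements ITranslator would let it sit first in a TranslatorCollection, although IsSupported would have to decide what to report for phrases that the glossary doesn't contain.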

Web Service Translators

Many free and commercial machine translators are implemented as Web services. You can find Web service translators from Web service directories such as these:

• XMethods (http://www.xmethods.com)

• SalCentral (http://www.salcentral.com)

In this section, I show an example of a Web service translator using the WebServiceX translator at http://www.webservicex.net/TranslateService.asmx. This Web service provides machine-translation services for European, Chinese, Japanese, and Korean languages. The code for this book also includes a translator for the Closer-Far service, at http://www.closerfar.com/engtoarabic.asmx, which translates English to Arabic.

Start by adding a reference to the Web service (in Solution Explorer, right-click the project, select Add Web Reference..., enter the address in the URL text box, click Go, and click the Add Reference button). This is the WebServiceXTranslator class:


public class WebServiceXTranslator: Translator
{
    private net.webservicex.www.TranslateService translateService;

    public WebServiceXTranslator():
        base("WebServiceX Translator", new string[,]
        {
            {"en", "zh-CHS"},
            {"en", "fr"},
            {"en", "de"},
            {"en", "it"},
            {"en", "ja"},
            {"en", "ko"},
            {"en", "pt"},
            {"en", "es"},
            {"fr", "en"},
            {"fr", "de"},
            {"de", "en"},
            {"de", "fr"},
            {"it", "en"},
            {"es", "en"}
        })
    {
    }
    public override string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        if (! IsSupported(inputLanguage, outputLanguage))
            throw new LanguageCombinationNotSupportedException(
                "Language combination is not supported",
                inputLanguage, outputLanguage);

        if (translateService == null)
            translateService =
                new net.webservicex.www.TranslateService();

        string translatedText = translateService.Translate(
            TranslationServiceLanguage(
            inputLanguage, outputLanguage), text);

        return ConvertFromEscapedNumerics(translatedText);
    }
}

The constructor calls the base class, passing in the name and the list of supported language pairs. As mentioned in previous chapters, the Chinese (Simplified) language is not a simple two-letter code like most other languages. Although it is uncommon, this is not the only language that is not identified by two letters.

The Translate method calls the Web service’s Translate method, passing in an enumeration that identifies the language pair and the text to translate. The TranslationServiceLanguage method is simply a conversion from a language pair to the required enum.
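A conversion of this kind is typically a plain switch. The sketch below shows the general shape only: the ServiceLanguagePair enum is a hypothetical stand-in (the real enum comes from the generated Web service proxy, and its member names are not reproduced here), and the mapper covers just four pairs.

```csharp
using System;

// Hypothetical enum standing in for the one in the generated proxy.
public enum ServiceLanguagePair
{
    EnglishToFrench,
    EnglishToGerman,
    FrenchToEnglish,
    GermanToEnglish
}

public static class ServiceLanguagePairMapper
{
    // Converts a pair of language codes into the enum value that the
    // service expects; throws for pairs that have no enum value.
    public static ServiceLanguagePair FromCodes(
        string inputLanguage, string outputLanguage)
    {
        string key =
            (inputLanguage + "|" + outputLanguage).ToLowerInvariant();
        switch (key)
        {
            case "en|fr": return ServiceLanguagePair.EnglishToFrench;
            case "en|de": return ServiceLanguagePair.EnglishToGerman;
            case "fr|en": return ServiceLanguagePair.FrenchToEnglish;
            case "de|en": return ServiceLanguagePair.GermanToEnglish;
            default:
                throw new ArgumentException(
                    "Unsupported language pair: " + key);
        }
    }
}
```

Keeping the switch in step with the language pairs passed to the base constructor matters; a pair listed there but missing here would pass IsSupported and then fail at conversion time.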

HTML Translators

Another source of translators is translation Web sites. AltaVista is one such example. Go to AltaVista and click the “Translate” link (see Figure 9.6).

Figure 9.6. AltaVista’s Translation Facility

image

From here you can type a string into the “Translate a block of text:” box, select a language pair using the combo box, and click the Translate button; the translated text is shown on the next page. The job of an HTML translator is to automate this process and to collect the result. The result is returned somewhere in the returned HTML page and must be extracted. This process is often referred to as screen scraping and is not an ideal solution. The biggest problem is that when the HTML changes, the algorithm for extracting the string is usually broken, so the process fails. As such, it is a fragile solution; the more explicit method employed by Web services is preferable.

In the code for this book, you will find HTML translators for the following:

• AltaVista

• Free Translation

• Online Translator

• Socrates

See Appendix B for a complete list of online translators.

All the HTML translators inherit from the HtmlTranslator class, which uses the .NET Framework 2.0 WebBrowser control (in the All Windows Forms section of the toolbox). If you are using Visual Studio 2003, see the section “Visual Studio 2003 WebBrowser Control” for an equivalent control. Both the Visual Studio 2005 and Visual Studio 2003 projects are included in the book’s source code.

The WebBrowser control is part of a form called WebBrowserForm, which is never shown. The WebBrowser control simply represents a way to post information to a page and to get the resulting HTML. The form has one method to wait for the completion of the page before getting the result:


public void WaitForBrowser()
{
    while(WebBrowser.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }
}

The HtmlTranslator class makes life easy for its subclasses. With the majority of the work performed in the HtmlTranslator class, the subclasses need to specify only the following:

• The translator’s name

• The URL for the Web site

• The language pairs supported

• A method to format the data posted to the URL

• A method to decode the result from the Web page

This is the HtmlTranslator class:


public abstract class HtmlTranslator: Translator
{
    private string url;

    private WebBrowserForm webBrowserForm;
    private WebBrowser webBrowser;

    public HtmlTranslator(string name, string url): base(name)
    {
        this.url = url;
    }
    public HtmlTranslator(string name, string url,
        string[,] languagePairs): base(name, languagePairs)
    {
        this.url = url;
    }
    public string Url
    {
        get {return url;}
        set {url = value;}
    }
    protected virtual void InitializeWebBrowser()
    {
        if (webBrowser == null)
        {
            webBrowserForm = new WebBrowserForm();
            webBrowser = webBrowserForm.WebBrowser;
        }
    }
    public abstract string GetPostData(
        string inputLanguage, string outputLanguage, string text);

    protected string Encode(string text)
    {
        return HttpUtility.UrlEncode(text);
    }
    protected virtual string GetTranslation(string inputLanguage,
        string outputLanguage, string innerText)
    {
        return innerText;
    }
}

As you would expect, the action happens in the Translate method:


public override string Translate(
    string inputLanguage, string outputLanguage, string text)
{
    if (! IsSupported(inputLanguage, outputLanguage))
        throw new LanguageCombinationNotSupportedException(
            "Language combination is not supported",
            inputLanguage, outputLanguage);

    InitializeWebBrowser();

    string innerText = GetInnerText(GetPostData(
        inputLanguage, outputLanguage, text));
    return GetTranslation(inputLanguage, outputLanguage, innerText);
}

Translate initializes the Web browser and calls the subclass’s GetPostData to get the data to post to the URL. GetInnerText navigates to the URL, posts the data, waits for the browser to complete the display of the page, extracts the HTML from the page, and then extracts just the text part of the HTML:


protected virtual string GetInnerText(string postData)
{
    // HTTP header lines end with CR LF (13 then 10), in that order
    string headers =
        "Content-Type: application/x-www-form-urlencoded" +
        (char) 13 + (char) 10;

    byte[] bytePostData =
        System.Text.Encoding.ASCII.GetBytes(postData);

    webBrowser.Navigate(
        new Uri(url), String.Empty, bytePostData, headers);

    webBrowserForm.WaitForBrowser();

    return webBrowser.Document.Body.InnerText;
}

The subclass’s GetTranslation method is passed the language pair and the Web page’s text, and is responsible for extracting the translated text from the page.

Visual Studio 2003 WebBrowser Control

The .NET Framework 1.1 does not have a WebBrowser control, but an equivalent ActiveX control can be used instead. This control is not installed in the Visual Studio toolbox by default. To install it, right-click the toolbox, select Add/Remove Items..., select the COM Components tab, click the Browse... button, enter shdocvw.dll from your system32 folder, and click Open. Figure 9.7 shows the result. Click OK.

Figure 9.7. Adding the Microsoft Web Browser Control to the Visual Studio 2003 Toolbox

The ActiveX Web Browser wrapper control is similar to the .NET Framework 2.0 WebBrowser control, but you should be aware of the differences listed in Table 9.1.

Table 9.1. Relevant Differences Between the .NET Framework 2.0 WebBrowser Control and the ActiveX Web Browser Control Used in .NET Framework 1.1

image

The AltaVistaTranslator Class

The AltaVistaTranslator class uses AltaVista’s translation Web page to perform translations:


public class AltaVistaTranslator: HtmlTranslator
{
    public AltaVistaTranslator(): base("AltaVista Translator",
        @"http://babelfish.altavista.com/tr", new string[,]
        {
            {"en", "zh-CHS"},
            {"en", "zh-CHT"},
            {"en", "nl"},
            {"en", "fr"},
            {"en", "de"},
            {"en", "el"},
            {"en", "it"},
            {"en", "ja"},
            {"en", "ko"},
            {"en", "pt"},
            {"en", "ru"},
            {"en", "es"},
            {"zh-CHS", "en"},
            {"zh-CHT", "en"},
            {"nl", "en"},
            {"nl", "fr"},
            {"fr", "nl"},
            {"fr", "en"},
            {"fr", "de"},
            {"fr", "el"},
            {"fr", "it"},
            {"fr", "pt"},
            {"fr", "es"},
            {"de", "en"},
            {"de", "fr"},
            {"el", "en"},
            {"el", "fr"},
            {"it", "en"},
            {"it", "fr"},
            {"ja", "en"},
            {"ko", "en"},
            {"pt", "en"},
            {"pt", "fr"},
            {"ru", "en"},
            {"es", "en"},
            {"es", "fr"}
        })
    {
    }
    protected virtual string DotNetLanguageCodeToLanguageCode(
        string language)
    {
        // check for a couple of adjustments to the language codes
        if (language == "zh-CHS")
            // Chinese (Simplified)
            return "zh";
        else if (language == "zh-CHT")
            // Chinese (Traditional)
            return "zt";
        else
            return language;
    }
    public override string GetPostData(string inputLanguage,
        string outputLanguage, string text)
    {
        string languagePair =
            DotNetLanguageCodeToLanguageCode(inputLanguage) + "_" +
            DotNetLanguageCodeToLanguageCode(outputLanguage);

        return "doit=done&intl=1&tt=urltext&trtext=" +
            Encode(text) + "&lp=" + languagePair;
    }
    protected override string GetTranslation(string inputLanguage,
        string outputLanguage, string innerText)
    {
        int index = innerText.IndexOf("Babel Fish Translation");
        if (index == -1)
            return String.Empty;

        innerText = innerText.Substring(index + 2);

        index = innerText.IndexOf(":");
        if (index == -1)
            return String.Empty;

        innerText = innerText.Substring(index + 3);

        index = innerText.IndexOf("Translate again");
        if (index == -1)
            return String.Empty;

        return innerText.Substring(0, index - 2).TrimEnd(
            new char[] {' ', (char) 10, (char) 13});
    }
}

The GetPostData method builds a post data string containing the language pair and the text to translate. The GetTranslation method looks for textual markers that are known to be immediately before the translated text and immediately after the translated text, and gets the text in between. This represents the most fragile part of this process. If the textual content of the resulting Web page changes, this code will need to be rewritten.
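
One way to contain this fragility is to keep the marker handling in a single helper so that only the marker strings need to change when the page changes. The following is a minimal sketch and is not part of the book's source code; the MarkerExtractor class and its ExtractBetween method are hypothetical names introduced here for illustration.

```csharp
using System;

public static class MarkerExtractor
{
    // Returns the text found between startMarker and endMarker,
    // trimmed of surrounding whitespace, or String.Empty if either
    // marker is missing.
    public static string ExtractBetween(
        string text, string startMarker, string endMarker)
    {
        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start == -1)
            return String.Empty;

        // Skip past the start marker itself
        start += startMarker.Length;

        // Search for the end marker only after the start marker
        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end == -1)
            return String.Empty;

        return text.Substring(start, end - start).Trim();
    }
}
```

A GetTranslation override could then reduce to a single call with the markers it relies on, for example ExtractBetween(innerText, ":", "Translate again"), although the markers themselves would still need to be checked against the live page.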

Office 2003 Research Services

Microsoft Office 2003 offers yet another translation opportunity. Office 2003 includes a feature called Research Services that enables you to perform various kinds of research from within an Office application. This enables you to perform the research without having to leave your Office application, and it also allows the result of the research to be pasted into your application in context. You can write your own Research Services in .NET and have Office use them just like one of its own.

But this isn’t what is of most interest to us with regard to machine translation. One of the built-in Research Services is the Translation service. To see it in action, start an Office 2003 application, such as Word, and select Tools, Research.... A new pane called “Research” is added on the right side of the window. Drop down the combo box (which initially says “All Reference Books”) and select Translation. In the “Search for:” text box, enter a phrase to translate. In the two combo boxes below, enter the “From” language and the “To” language. Click the green arrow next to the “Search for:” text box; the text then is translated (see Figure 9.8).

Figure 9.8. Using the Microsoft Office 2003 Research Pane to Perform Translations (Translation Services Provided by WorldLingo.)

It is this translation facility that you can harness for your own automatic translation.

If you search on MSDN (http://msdn.microsoft.com) or the Microsoft Office Web site (http://office.microsoft.com), you will find a fair amount of information on creating your own research services and integrating them into Office. However, you won’t find any information on how to consume research services. The reason behind this is that Microsoft expects that the only consumer of Office 2003 Research Services is Office 2003. However, all that you need to know is in this section. You might find it useful to download the Microsoft Office 2003 Research Services SDK, which contains some background information on the subject.

Most Office 2003 Research Services are simply Web services. As such, given the Web service’s URL, its WSDL, and the format of its messages, we can use the Web service just like any other Web service. Figure 9.9 shows the list of Research Services that are contained within the Registry at HKEY_CURRENT_USER\Software\Microsoft\Office\11.0\Common\Research\Sources.

Figure 9.9. Microsoft Office 2003 Research Services Registry Entries

Research Services is an “install on demand” option, so you will need to use the translation facility once before the Registry is populated. A “source” is a provider of research information. Microsoft Office Online Services (shown in Figure 9.9) is an example of one such provider. From this entry, you can see the URL of the Web service (http://office.microsoft.com/Research/query.asmx). Providers provide services. If you expand the source’s keys, you can see the list of services (see Figure 9.10).

Figure 9.10. Microsoft Office 2003 Services Registry Entries

This service is a translation service that translates from “English (U.S.)” to “French (France)”. The kind of service is specified in the CategoryID, which is 0x36120000 (907149312) for translation services (this is the REFERENCE_TRANSLATION constant in the Office 2003 Research Service SDK). Of particular interest here is the SourceData entry, which is in the following format:


<FromLCID>/<ToLCID>/<ResultType>

In the entry in Figure 9.10, the FromLCID is 1033 (which is the locale ID for “English (U.S.)”), the ToLCID is 1036 (which is the locale ID for “French (France)”), and the ResultType is 4. The result type is “1” for keyword translators and “2” for whole-document translators; “4” is not documented but appears to be for keyword/sentence translators. For our purposes, we are interested in “1” and “4”.

From this information, you could read through the list of providers collecting a list of services that have a CategoryID of 0x36120000 and a SourceData that has a result type of either 1 or 4.
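
The filtering step can be sketched as follows. This is an illustration only, not the book's code: the TranslationServiceInfo and ResearchServiceFilter names are hypothetical, and the sketch assumes that the CategoryID and SourceData values have already been read from the Registry (for example, with the Microsoft.Win32.Registry classes), which is not shown here.

```csharp
using System;
using System.Globalization;

// Holds the three parts of a SourceData entry
public class TranslationServiceInfo
{
    public int FromLcid;
    public int ToLcid;
    public int ResultType;
}

public static class ResearchServiceFilter
{
    // CategoryID for translation services (REFERENCE_TRANSLATION)
    public const int TranslationCategoryId = 0x36120000;

    // Parses "<FromLCID>/<ToLCID>/<ResultType>", e.g. "1033/1036/4".
    // Returns null if the string is not in that format.
    public static TranslationServiceInfo ParseSourceData(string sourceData)
    {
        if (sourceData == null)
            return null;

        string[] parts = sourceData.Split('/');
        if (parts.Length != 3)
            return null;

        int from, to, resultType;
        if (!TryParseInt(parts[0], out from) ||
            !TryParseInt(parts[1], out to) ||
            !TryParseInt(parts[2], out resultType))
            return null;

        TranslationServiceInfo info = new TranslationServiceInfo();
        info.FromLcid = from;
        info.ToLcid = to;
        info.ResultType = resultType;
        return info;
    }

    // True for translation services whose result type is 1 or 4
    public static bool IsUsableTranslator(int categoryId, string sourceData)
    {
        if (categoryId != TranslationCategoryId)
            return false;

        TranslationServiceInfo info = ParseSourceData(sourceData);
        return info != null &&
            (info.ResultType == 1 || info.ResultType == 4);
    }

    private static bool TryParseInt(string text, out int value)
    {
        return int.TryParse(text, NumberStyles.Integer,
            CultureInfo.InvariantCulture, out value);
    }
}
```

The LCIDs returned by ParseSourceData could then be mapped to culture names with new CultureInfo(lcid).Name if the rest of the engine works with names such as "en-US" and "fr-FR".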

By default, three providers of translation services are included with Office 2003:

• internal:LocalTranslation

• Microsoft Office Online Services

• WorldLingo

The “internal:LocalTranslation” provider is a set of Win32 DLLs and is not a Web service. You can find the DLLs in “%CommonProgramFiles%\Microsoft Shared\TRANSLAT”. They are installed on demand, so they won’t be present until you have translated English to/from French and/or English to/from Spanish. Because this provider is not a Web service and the functions are undocumented, I have chosen to ignore this provider.

At first sight, the Microsoft Office Online Services looks like a good source of machine translation. The URL in the Registry can be used as is in Visual Studio’s ASP.NET Web Service Wizard to generate a Web service reference because the Web service returns the WSDL that describes the Web service. Unfortunately, the Web service itself suffers from two problems. First, the Web service is more of a translation dictionary than a keyword translator. For example, if you translate Stop into German, the result (after all the HTML formatting has been removed) is this:

1. (-pp-) intransitives Verb (an)halten, stehen bleiben (auch Uhr und so weiter), stoppen; aufhören; besonders Brt. bleiben; stop dead plötzlich oder abrupt stehen bleiben; stop at nothing vor nichts zurückschrecken; stop short of doing, stop short at something zurückschrecken vor (Dativ); transitives Verb anhalten, stoppen; aufhören mit; ein Ende machen oder setzen (Dativ); Blutung stillen; Arbeiten, Verkehr und so weiter zum Erliegen bringen; etwas verhindern; jemanden abhalten (from von), hindern (from an Dativ); Rohr und so weiter verstopfen (auch stop up); Zahn füllen, plombieren; Scheck sperren (lassen); stop by vorbeischauen; stop in vorbeischauen (at bei); stop off umgangssprachlich: kurz Halt machen; stop over kurz Halt machen; Zwischenstation machen;

Clearly, this is the kind of definition that you would expect to find in a dictionary, but it is virtually useless for machine translation.

Second, it translates just single words; it cannot translate a sentence or a phrase. It is almost completely meaningless to translate words one by one and string them together, so these services are of no use to us.

WorldLingo Translation Services

The third provider, WorldLingo, is the only viable option that is installed by default. The complete source code to use with this provider is included with this book. Because it is long, I focus only on the most important parts.

The first problem in using the WorldLingo services is that the WorldLingo server doesn’t expose the WSDL for the Web service. You can’t simply put http://www.worldlingo.com/wl/msoffice11 into Visual Studio’s ASP.NET Web Service wizard; the process needs to be a little lower level. Instead, you can use an HttpWebRequest object to send an HTTP request to the server and read the WebResponse object that is returned. SendRequest sends a SOAP request to a URL:


protected string SendRequest(string url, string soapPacket)
{
    HttpWebRequest httpWebRequest =
        (HttpWebRequest) WebRequest.Create(url);
    httpWebRequest.ContentType = "text/xml; charset=utf-8";
    httpWebRequest.Headers.Add(
        "SOAPAction: urn:Microsoft.Search/Query");
    httpWebRequest.Method = "POST";
    httpWebRequest.ProtocolVersion = HttpVersion.Version10;

    // Write the SOAP packet to the request body
    using (Stream stream = httpWebRequest.GetRequestStream())
    using (StreamWriter streamWriter = new StreamWriter(stream))
    {
        streamWriter.Write(soapPacket);
    }

    // Read the whole response and release the connection when done
    using (WebResponse webResponse = httpWebRequest.GetResponse())
    using (Stream responseStream = webResponse.GetResponseStream())
    using (StreamReader responseStreamReader =
        new StreamReader(responseStream))
    {
        return responseStreamReader.ReadToEnd();
    }
}

This would be used something like this:


string responsePacket = SendRequest(
    "http://www.worldlingo.com/wl/msoffice11", queryPacket);

The Web service has a method called Query that accepts a single parameter that is a string of XML. The XML contains the translation request, including the “from” language, the “to” language, and the text to be translated. The aforementioned Microsoft Office 2003 Research Services SDK has the structure of this XML packet. At first sight, the QueryRequest and QueryResponse classes in the Research Services Class Library (RSCL, also available from http://msdn.microsoft.com) look as if they might help. These classes are wrappers to build and read the XML used with the Query method. Unfortunately, they are designed for use by developers, not consumers, of Research Services; consequently, they enable you to read the query XML and to create the response XML. This doesn’t help because we want to create the query XML and read the response XML.

To create the query XML, I wrote a GetQueryXml method, which can be called something like this:


GetQueryXml("The monkey is in the tree", service.Id, "(11.0.6360)")

We pass the string to translate, the GUID of the service that performs the translation, and a build number. The GUID of the service identifies the from/to language pair. GetQueryXml then builds the necessary XML using XmlTextWriter according to the schema defined in the SDK.

The return result of the SendRequest method is the response from the Web service. Again, this is an XML string using the QueryResponse schema defined in the SDK. The Response element of this XML contains the translated text. Unfortunately, this translated text is formatted for display in an Office application, so it contains HTML formatting that must be removed first. With this done, we have our translated text.
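
Removing the HTML formatting can be approximated with a couple of regular expressions. The sketch below is an assumption about how this step might look, not the code that ships with the book: the HtmlStripper class is a hypothetical name, and the sketch uses System.Net.WebUtility.HtmlDecode (available from .NET Framework 4.0 onward) where code of the book's era would more likely use HttpUtility.HtmlDecode. It is a simple tag stripper, not a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Removes tags, decodes entities such as &amp;, and collapses
    // runs of whitespace into single spaces.
    public static string StripHtml(string html)
    {
        if (String.IsNullOrEmpty(html))
            return String.Empty;

        // Replace each tag with a space so that text on either side
        // of a block-level tag does not run together
        string withoutTags = Regex.Replace(html, "<[^>]+>", " ");

        // Turn entities back into the characters they represent
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse the extra whitespace left behind by the tags
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Replacing tags with a space is a trade-off: it keeps adjacent paragraphs apart, but it would also split a word that has inline formatting applied to only part of it.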

Translator Evaluator

Now that we have completed all our translators, we can take stock and review our list. Included with the code for this book is a simple utility called Translator Evaluator. Its purpose is to enable you to experiment with different translators and compare their support, performance, and accuracy. Start TranslatorEvaluator.exe, enter a “from” language and a “to” language, and click the “Show Support For Languages” button (see Figure 9.11).

Figure 9.11. Translator Evaluator—“Show Support For Languages”

This tells you which translators support this language pair. Enter some text in the “Translation text” text box and click the Translate button. Figure 9.12 shows the result.

Figure 9.12. Translator Evaluator Comparing Translation Performance and Accuracy

The Duration shown is the duration for a translation from the “from” language to the “to” language only; it does not include the translation back to the original “from” language. From the duration, you can infer the relative performance of the different translators. You should run your tests several times and take a mean duration because many factors affect performance. Based on your results, you should be able to determine which translators to list first in your TranslatorCollection and which to list later, to be used only if the first translators fail.
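
If you want to take the mean outside the Translator Evaluator, the measurement can be sketched with a Stopwatch. This is a minimal illustration under stated assumptions: the TranslatorTimer class is a hypothetical name, the translation call is passed in as a delegate so that the sketch does not depend on the engine's classes, and Func<string> requires .NET Framework 3.5 or later.

```csharp
using System;
using System.Diagnostics;

public static class TranslatorTimer
{
    // Calls translate the given number of times and returns the
    // mean elapsed time per call in milliseconds.
    public static double MeanMilliseconds(Func<string> translate, int runs)
    {
        if (translate == null)
            throw new ArgumentNullException("translate");
        if (runs <= 0)
            throw new ArgumentOutOfRangeException("runs");

        double total = 0;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch stopwatch = Stopwatch.StartNew();
            translate();
            stopwatch.Stop();
            total += stopwatch.Elapsed.TotalMilliseconds;
        }
        return total / runs;
    }
}
```

A call might look like TranslatorTimer.MeanMilliseconds(() => translator.Translate("en", "fr", "Hello"), 5), where translator is any Translator instance. The first call to an HTML translator also pays for creating the browser form, so you may prefer to discard it before averaging.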

The “Translation” column shows the translated text. Even from this example, you can see that there are differences between how the different translators translate text.

The “Translation Back” column shows the translation of the translated text back to its original language. You can use this to gauge the accuracy of the translation process. Remember, it is only an indication—it doesn’t necessarily follow that if there are differences, the translation is wrong.

Where Are We?

In this chapter, we looked at machine translation and how we can automate it. The result of a machine translation serves as a first-cut translation (to save the translators’ time and, therefore, your money), but it is also a means of helping with user acceptance testing for your application. It can also be used as a temporary fix for maintenance releases when a timely turnaround from the translator is not possible. We created a translation engine with many translators based on Web services, Web sites, and Office 2003 Research Services. We also looked at performing pseudo translation to help with the user acceptance process. Finally, we looked at the Translator Evaluator, which compares translator support, performance, and accuracy. With the translation engine in place, we can use it in Chapter 10 to automatically translate resources, and in Chapter 12 to perform on-the-fly translation for applications with highly dynamic content.
