Chapter 9. LaTeX in a Multilingual Environment

This chapter starts with a short introduction to the technical problems that must be solved if you want to use (La)TeX with a non-English language. Most of the rest of the chapter discusses the babel system, which provides a convenient way of generating documents in different languages. We look in particular at how we can typeset documents in French, German, Russian, Greek, and Hebrew, as the typesetting of those languages illustrates various aspects of the things one has to deal with in a non-English environment. Section 9.5 explains the structure of babel’s language definition files for the various language options. Finally, we say a few words about how to handle other languages, such as Arabic and Chinese, that are not supported by babel.

9.1. TeX and non-English languages

Due to its popularity in the academic world, TeX spread rapidly throughout the world and is now used not only with the languages based on the Latin alphabet, but also with languages using non-Latin alphabetic scripts, such as Russian, Greek, Arabic, Persian, Hebrew, Thai, Vietnamese, and several Indian languages. Implementations also exist for Chinese, Japanese, and Korean, which use Kanji-based ideographic scripts.

With the introduction of 8-bit TeX and METAFONT, which were officially released by Donald Knuth in March 1990, problems of multilingual support could be more easily addressed for the first time. Nevertheless, by themselves, these versions do not solve all the problems associated with providing a convenient environment for using LaTeX with multiple and/or non-English languages.

To achieve this goal, TeX and its companion programs should be made truly international, and the following points should be addressed:

1. Adjust all programs to the particular language(s):

• Support typesetting in different directions; this ability is offered by several programs (e.g., eTeX, Omega) [27, 97],

• Create proper fonts containing national symbols [137],

• Define standard character set encodings, and

• Generate patterns for the hyphenation algorithm.

2. Provide a translation for the language-dependent strings, create national layouts for the standard documents, and provide TeX code to treat the language-dependent typesetting rules automatically [120].

3. Support processing of multilingual documents (more than one language in the same document) and work in international environments (one language per document, but a choice between several possibilities). For instance, the sorting of indexes and bibliographic references should be performed in accordance with a given language’s alphabet and collating sequence; see the discussion on xindy in Section 11.3.

At the same time, you should be able to conveniently edit, view, and print your documents using any given character set, and LaTeX should be able to successfully process files created in this way. There exist, however, almost as many different character encoding schemes as there are languages (for example, IBM PC personal computers have dozens of code pages). In addition, several national and international standards exist, such as the series ISO 8859-x [67]. Therefore, some thought should be given to the question of compatibility and portability. If a document is to be reproducible in multiple environments, issues of standardization become important. In particular, sending 8-bit encoded documents via electronic mail used to cause problems, because some mail gateways dropped the higher-order bit, rendering the document unprocessable. The e-mail problem is more or less solved now that almost all mailers adhere to the Multipurpose Internet Mail Extensions (MIME) standard, in which the use of a particular encoding standard (e.g., ISO 8859-x) is explicitly declared in the e-mail’s header. The fact remains, however, that it is necessary to know the encoding in which a document was produced. For this purpose LaTeX offers the inputenc package, described in Section 7.11.3 on page 443.

Document encoding problems will ultimately be solved when new standards that can encode not only the alphabetic languages, but also ideographic scripts like Chinese, Japanese, and Korean are introduced. Clearly, 8 bits are not sufficient to represent even a fraction of the “characters” in those scripts. Multi-byte electronic coding standards have been developed to serve this need—in particular, “16-bit” Unicode [165], which is a subset of the multi-byte ISO 10646 [69, 70]. Unicode will likely become the base encoding of most operating systems in the near future. Moreover, Unicode lies at the very heart of the XML [26] meta-language, on which all recently developed markup languages of the Internet are based. Thus, the integrity of electronic documents and data—structural as well as content-wise—can be fully guaranteed. LaTeX supports a restricted version of Unicode’s UTF-8 representation through the inputenc option utf8 discussed in Section 7.5.2.
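As a minimal sketch of how the input encoding of a source file is declared with inputenc, a preamble might look as follows. The choice of latin1 is only an illustration; the declared encoding has to match the encoding in which the file is actually saved.

```latex
\documentclass{article}
% Declare the encoding in which this source file is stored.
\usepackage[latin1]{inputenc}   % for a file saved as ISO 8859-1
% \usepackage[utf8]{inputenc}   % alternative: a file saved as UTF-8
\begin{document}
Text containing accented letters typed directly in the editor.
\end{document}
```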

At its Portland, Oregon, meeting in July 1992, TUG’s Technical Council set up the Technical Working Group on Multiple Language Coordination (TWGMLC), chaired by Yannis Haralambous. This group was charged with promoting and coordinating the standardization and development of TeX-related software adapted to different languages. Its aim was to produce for each language or group of languages a package that would facilitate typesetting. Such a package should contain details about fonts, input conventions, hyphenation patterns, a LaTeX option file compatible with the babel concept (see Section 9.1.3), possibly a preprocessor, and, of course, documentation in English and the target language.

9.1.1. Language-related aspects of typesetting

When thinking about supporting typesetting documents in languages other than English, a number of aspects that need to be dealt with come to mind.

First and foremost is the fact that other languages have different rules for hyphenation, something that TeX accommodates through its support for multiple hyphenation patterns. In some languages, however, certain letter combinations change when they appear at a hyphenation point. TeX does not support this capability “out of the box”.

Some languages need different sets of characters to be properly typeset. This issue can vary from the need for additional “accented letters” (as is the case with many European languages) to the need for a completely different alphabet (as is the case with languages using the Cyrillic or Greek alphabet). When non-European languages need to be supported, the typesetting direction might be different as well (such as right to left for Arabic and Hebrew texts) or so many characters might be needed (as is the case with the Kanji script, for instance) that TeX’s standard mechanisms cannot deal with them.

A more “subtle” problem turns up when we look at the standard document classes that each LaTeX distribution supplies. They were designed for the Anglo-American situation. A specific example where this preference interferes with supporting other languages is the start of a chapter. For some languages it is not enough to just translate the word “Chapter”; the order of the word and the denomination of the chapter needs to be changed as well, solely on the basis of grammatical rules. Where the English reader expects to see “Chapter 1”, the French reader expects to see “1er Chapitre”.

9.1.2. Culture-related aspects of typesetting

An even more thorny problem when faced with the need to support typesetting of many languages is the fact that typesetting rules differ, even between countries that use the same language. For instance, hyphenation rules differ between British English and American English. Translations of English words might vary between countries, just as they do for the German spoken in Germany and the German spoken (and written) in Austria.

Typographic rules may differ between countries, too. No worldwide standard tells us how nested lists should be typeset; on the contrary, their appearance may differ for different languages, or countries, or even printing houses. With these aspects we enter the somewhat fuzzy area comprising the boundary between language aspects of typesetting and cultural aspects of typesetting. It is not clear where that boundary lies. When implementing support for typesetting documents written in a specific language, this difference needs to be taken into account. The language-related aspects can be supported on a general level, but the cultural aspects are more often than not better (or more easily) handled by creating specific document classes.

9.1.3. Babel—LaTeX speaks multiple languages

The LaTeX distribution contains a few standard document classes that are used by most users. These classes (article, report, book, and letter) have a certain American look and feel, which not everyone likes. Moreover, the language-dependent strings, such as “Chapter” and “Table of Contents” (see Table 9.2 on page 547 for a list of commands holding language-dependent strings), come out in English by default.

The babel package developed by Johannes Braams [25] provides a set of options that allow the user to choose the language(s) in which the document will be typeset. It has the following characteristics:

• Multiple languages can be used simultaneously.

• The hyphenation patterns, which are loaded when INITeX is run to produce the LaTeX format, can be defined dynamically via an external file.

• Translations for the language-dependent strings and commands for facilitating text input are provided for more than 20 languages (see Table 9.1 on the facing page).

Image

Table 9.1. Language options supported by the babel system

In the next section we describe the user interface of the babel system. We then discuss the additional commands for various languages and describe the support for typesetting languages using non-Latin alphabets. Finally, we discuss ways to tailor babel to your needs and go into some detail about the structure of the language definition files (.ldf) that implement the language-specific commands in babel. Throughout the sections, examples illustrate the use of various languages supported by babel.

9.2. The babel user interface

Any language that you use in your document should be declared as an option when loading the babel package. Alternatively, because the language(s) in which a document is written constitute a global characteristic of the document, the languages can be indicated as global options on the \documentclass command. This strategy makes them available to any package that changes behavior depending on the language settings of the document. Currently supported options are enumerated in Table 9.1. For example, the following declaration prepares for typesetting in the languages German (option ngerman for new hyphenation rules) and Italian (option italian):

Image

The last language appearing on the \usepackage command line will be the default language used at the beginning of the document. In the above example, the language-dependent strings, the hyphenation patterns (if they were loaded for the given language when the LaTeX format was generated with INITeX; see the discussion on page 580), and possibly the interpretation of certain language-dependent commands (such as the date) will be for Italian from the beginning of the document up to the point where you choose a different language.
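The declaration described above can be sketched as a complete, minimal document. The body text is invented for illustration; only the option names ngerman and italian come from the text.

```latex
\documentclass{article}
% italian is listed last, so it is the default language at \begin{document}
\usepackage[ngerman,italian]{babel}
\begin{document}
\tableofcontents   % the heading uses the Italian string
\section{Introduzione}
Data: \today       % the date is formatted for Italian
\end{document}
```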

If one decides to make ngerman and italian global options, then other packages can also detect their presence. For example, the following code lets the package varioref (described in Section 2.4.2 on page 68) detect and use the options specified on the \documentclass command:

Image
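A preamble of the kind just described might look like this sketch, in which both babel and varioref pick up the global options:

```latex
\documentclass[ngerman,italian]{article}
\usepackage{babel}     % reads ngerman and italian from the class options
\usepackage{varioref}  % sees the same global options
\begin{document}
Testo in italiano.
\end{document}
```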

If you use more than one language in your document and you want to define your own language-dependent strings for the varioref commands, you should use the methods described in Section 9.5 on page 579 and not those discussed in Section 2.4.2.

9.2.1. Setting or getting the current language

Within a document it is possible to change the current language in several ways. For example, you can change all language-related settings including translations for strings like “Chapter”, the typesetting conventions, and the set-up for shorthand commands. Alternatively, you can keep the translations unchanged but modify everything else (e.g., when typesetting short texts in a foreign language within the main text). Finally, you can change only the hyphenation rules.

Image

A change to all language-related settings is implemented via the command \selectlanguage. For instance, if you want to switch to German, you would use the command \selectlanguage{german}. The process is similar for switching to other languages. Each language must have been declared previously as a language option in the preamble as explained earlier. The \selectlanguage command calls the macros defined in the language definition file (see Section 9.5) and activates the special definitions for the language in question. It also updates the setting of TeX’s \language primitive used for hyphenation.

The environment otherlanguage provides the same functionality as the \selectlanguage declaration, except that the language change is local to the environment. For mixing left-to-right typesetting with right-to-left typesetting, the use of this environment is a prerequisite. The argument language is the language one wants to switch to.

Image

The command \foreignlanguage typesets phrase according to the rules of language. It switches only the extra definitions and the hyphenation rules for the language, not the names and dates. Its environment equivalent is otherlanguage*.

9-2-1
Image
Image
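The three ways of switching described so far (\selectlanguage, the otherlanguage environment, and \foreignlanguage) can be compared in one small sketch. The sample sentences are invented for illustration.

```latex
\documentclass{article}
\usepackage[ngerman,english]{babel}  % english is the default language
\begin{document}
English text, dated \today.

\selectlanguage{ngerman}             % switches everything from here on
Deutscher Text vom \today.

\begin{otherlanguage}{english}       % full switch, local to the environment
  Back in English: \today.
\end{otherlanguage}

% Only hyphenation and extras change; names and dates stay German.
Ein Satz mit \foreignlanguage{english}{a short English phrase} darin.
\end{document}
```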

Within the environment hyphenrules, only the hyphenation rules are switched to those of language; \languagename and all other settings remain unchanged. When no hyphenation rules for language are loaded into the format, the environment has no effect.

As a special application, this environment can be used to prevent hyphenation altogether, provided that in language.dat the “language” nohyphenation is defined (by loading zerohyph.tex, as explained in Section 9.5.1 on page 580).

9-2-2
Image

Note that this approach works even if the “language” nohyphenation is not specified as an option to the babel package.
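A minimal sketch of this use of hyphenrules follows. It assumes that the format was built with a language.dat that defines nohyphenation; if it was not, the environment has no effect, as noted above.

```latex
\documentclass{article}
\usepackage[english]{babel}  % nohyphenation is not given as an option
\begin{document}
\begin{hyphenrules}{nohyphenation}
  This paragraph is typeset without any automatic hyphenation, whatever
  the length of the words that happen to fall at the end of a line.

\end{hyphenrules}
\end{document}
```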

If more than one language is used, it might be necessary to know which language is active at a specific point in the document. This can be checked by a call to \iflanguage:

Image

The first argument in this syntax, language, is the name of a language, which is first checked to see whether it corresponds to a language declared to babel. If the language is known, the command compares it with the current language. If they are the same, the commands specified in the true-clause are executed; otherwise, the commands specified in the third argument, false-clause, are executed.

This step is actually carried out by comparing the \l@language commands that point to the hyphenation patterns used for the two languages (see Section 9.5.1 on page 580). Thus, two “languages” are considered identical if they share the same patterns, as is the case for “dialects”¹ of a language (e.g., austrian) and, in particular, for languages for which no patterns are loaded.

¹ Only in the implementation in babel! Some languages are implemented as “dialects” of the others for TeXnical reasons; no discrimination is intended.

9-2-3
Image
Image

The control sequence \languagename contains the name of the current language.

9-2-4
Image
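The following sketch combines \iflanguage and \languagename. The helper \showlang is a hypothetical name introduced only for this illustration.

```latex
\documentclass{article}
\usepackage[ngerman,english]{babel}
% \showlang is a hypothetical helper, not part of babel.
\newcommand\showlang{%
  Current language: \languagename\ (%
  \iflanguage{ngerman}{German patterns in use}{other patterns in use})}
\begin{document}
\showlang

\selectlanguage{ngerman}
\showlang
\end{document}
```

Because the test compares hyphenation patterns rather than names, two languages that share patterns would both take the true branch.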

Most document classes available in a LaTeX installation define a number of commands that are used to store the various language-dependent strings. Table 9.2 on the facing page presents an overview of these commands, together with their default text strings.

Image

Table 9.2. Language-dependent strings in babel (English defaults)

9.2.2. Handling shorthands

For authors who write in languages other than English, it is sometimes awkward to type the input needed to produce the letters of their languages in the final document. More often than not, they need letters with accents above or below—sometimes even more than one accent. When you need to produce such glyphs and do not have the ability to use 8-bit input, but rather have to rely on 7-bit input encodings, an easier way to type those instructions would be welcome. For this reason (among others, as will be discussed later), babel supports the concept of “shorthands”. A “shorthand” is a one- or two-character sequence, the first character of which introduces the shorthand and is called the “shorthand character”. For a two-character shorthand, the second character specifies the behavior of the shorthand.

Babel knows about three kinds of shorthands—those defined by “the system”, “the language”, and “the user”. A system-defined shorthand sequence can be overridden by a shorthand sequence defined as part of the support for a specific language; a language-defined shorthand sequence can be overridden by a user-defined one.

Document-level commands for shorthands

This section describes the shorthand commands that can be used in the document and various aspects of the shorthand concept. Language-level or system-level shorthands are declared in language definition files; see Section 9.5 on page 579.

Image

The command \useshorthands initiates the definition of user-defined shorthand sequences. The argument char is the character that starts these shorthands.

Image

The command \defineshorthand defines a shorthand. Its first argument, charseq, is a one- or two-character sequence; the second argument, expansion, is the code to which the shorthand should expand.

Image

The command \aliasshorthand lets you use another character, char2, to perform the same functions as the default shorthand character, char1. For instance, if you prefer to use the character | instead of ", you can enter \aliasshorthand{"}{|}.

9-2-5
Image
Image
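The three commands can be tried together in a sketch like the one below. The shorthand "x and its expansion are invented for illustration, and the exact behavior of user shorthands may differ between babel releases.

```latex
\documentclass{article}
\usepackage[english]{babel}
\useshorthands{"}                      % " now introduces user shorthands
\defineshorthand{"x}{\textbf{marked}}  % "x expands to bold text
\aliasshorthand{"}{|}                  % | should now behave like "
\begin{document}
First "x and then |x.
\end{document}
```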

The command \languageshorthands is used to switch between shorthands for the language specified as an argument. The language must have been declared to babel for the current document. When switching languages, the language definition files usually issue this command for the language in question. For example, the file frenchb.ldf contains the following command:

Image

Sometimes it is necessary to temporarily switch off the shorthand action of a given character because it needs to be used in a different way.

Image

The command \shorthandoff sets the catcode for each of the characters in its argument chars to “other” (12). Conversely, the command \shorthandon sets the catcode to “active” (13) for its argument chars. Both commands only act on “known” shorthand characters. If a character is not known to be a shorthand character, its category code will be left unchanged.

For instance, the language definition file german.ldf defines two commands, \mdqoff and \mdqon, that turn the shorthand action of the character " off and on, respectively. They are defined as follows:

Image
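User-level equivalents of such commands can be sketched with \shorthandoff and \shorthandon. The names \dqoff and \dqon are hypothetical and chosen so that they do not clash with the commands defined by german.ldf; the actual definitions in that file may differ in detail.

```latex
\documentclass{article}
\usepackage[german]{babel}
% Hypothetical helpers, not part of babel.
\newcommand\dqoff{\shorthandoff{"}}  % " becomes an ordinary character
\newcommand\dqon{\shorthandon{"}}    % " is a shorthand character again
\begin{document}
Shorthand active: "a \dqoff
Shorthand switched off: "a \dqon
Shorthand active again: "a
\end{document}
```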

The language definition file for French (frenchb.ldf) makes the “double” punctuation characters “?”, “!”, “:”, and “;” active. One can eliminate this behavior by specifying each as an argument to a \shorthandoff command. This step is necessary with certain packages, where the same characters have a special meaning. Below is an example with the xy package, where the use of “;” and “?” as shorthand characters is turned off inside xy’s xy environment [57, Chapter 5], because these characters have a functional meaning there.

9-2-6
Image
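A sketch of this technique follows. The tiny diagram is invented for illustration and uses only the semicolon; the catcode change is kept local by placing it inside the center environment.

```latex
\documentclass{article}
\usepackage[french]{babel}
\usepackage{xy}
\begin{document}
Voici un diagramme:
\begin{center}
  \shorthandoff{;?}% local to this environment
  \begin{xy}
    (0,0)*{A}; (20,0)*{B} **\dir{-}
  \end{xy}
\end{center}
Les signes de ponctuation sont de nouveau actifs; voyez!
\end{document}
```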

9.2.3. Language attributes

Sometimes the support for language-dependent typesetting needs to be tailored for different situations. In such a case it is possible to define attributes for the particular language. Two examples of the use of attributes can be found in the support for typesetting of Latin texts. When the attribute medieval is selected, certain document element names are spelled differently; also, the letters “u” and “V” are defined to be a lowercase and uppercase pair. The attribute withprosodicmarks can be used when typesetting grammars, dictionaries, teaching texts, and the like, where prosodic marks are important for providing complete information on the words or the verses. This attribute makes special shorthands available for breve and macron accents that may interfere with other packages.

Image

The command \languageattribute declares which attributes are to be used for a given language. It must be used in the preamble of the document following the command \usepackage[...]{babel} that loads the babel package. The command takes two arguments: language is the name of a language, and langattrs is a comma-separated list of attributes to be used for that language. The command checks whether the language is known in the current document and whether the attribute(s) are known for this language.

For instance, babel has two variants for the Greek language: monotoniko (one-accent), the default, and polutoniko (multi-accent). To select the polutoniko variant, one must specify it in the document preamble, using the command \languageattribute. The following two examples illustrate the difference.

9-2-7
Image

With the polutoniko attribute we get a different result:

9-2-8
Image
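Selecting the attribute takes a single line in the preamble, as in this sketch. It assumes that the Greek fonts and hyphenation patterns are installed; removing the \languageattribute line gives the monotoniko default.

```latex
\documentclass{article}
\usepackage[greek]{babel}
\languageattribute{greek}{polutoniko}  % must follow the loading of babel
\begin{document}
\today
\end{document}
```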

9.3. User commands provided by language options

This section gives a general overview of the features typically offered by the various language options. It includes translations of language-dependent strings and a survey of typical shorthands intended to ease language-specific document content or to solve language-specific typesetting requirements. Some language options define additional commands to produce special date formats or numbers in a certain style. Also discussed are layout modifications as undertaken for French and Hebrew as well as the interfaces for dealing with different scripts (e.g., Latin and Cyrillic) in the same document.

9.3.1. Translations

As discussed earlier, babel provides translations for document element names that LaTeX uses in its document classes. The English versions of these strings are shown in Table 9.2 on page 547. Table 9.3 on page 551 shows the translations for a number of languages, some of them not using the normal Latin script.

9-3-1
Image

Table 9.3. Language-dependent strings in babel (French, Greek, Polish, and Russian)

Apart from the translated strings in Table 9.3, the language definition files supply alternative versions of the command \today, as shown in the following example.

9-3-2
Image
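A sketch of how \today follows the current language is given below. The selection of languages is arbitrary.

```latex
\documentclass{article}
\usepackage[ngerman,french,english]{babel}
\begin{document}
\today\par                          % English format (default language)
\selectlanguage{ngerman}\today\par  % German format
\selectlanguage{french}\today\par   % French format
\end{document}
```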

9.3.2. Available shorthands

Many of the language definition files provide shorthands. Some are meant to ease typing, whereas others provide quite extensive trickery to achieve special effects. You might not be aware of it, but LaTeX itself defines a shorthand (although it is not called by that name) that you probably use quite often: the character tilde (~), which is used to enter a “nonbreakable” space.

A number of shorthand definitions deal with “accented characters”. They were invented in the days when TeX did not yet support 8-bit input or 8-bit hyphenation patterns. When proper 8-bit hyphenation patterns are available, it is normally better to apply those and to use the inputenc package to select the proper input encoding (see Section 7.1.2 on page 329). However, if special processing needs to take place when an accented character appears next to a hyphenation point (as is the case for the Dutch hyphenation rules), the use of shorthands cannot be circumvented.¹

¹ This statement is true only if the underlying formatter is TeX. Omega, for example, provides additional functionality so that such cases can be handled automatically.

The double quote

The most popular character to be used as a shorthand character is the double quote character ("). This character is used in this way for Basque, Bulgarian, Catalan, Danish, Dutch, Estonian, Finnish, Galician, German, Icelandic, Italian, Latin, Norwegian, Polish, Portuguese, Russian, Serbian, Slovenian, Spanish, Swedish, Ukrainian, and Upper Sorbian. To describe all uses of the double quote character as a shorthand character would go too far. Instead, it is recommended that you check the documentation that comes with the babel package for each language if you want to know the details. What can be said here is that its uses fall into a number of categories, each of which deserves a description and a few examples.

Insert accented letters For a number of languages shorthands have been created to facilitate typing accented characters. With the availability of 8-bit input and output encodings this usage might seem to have become obsolete, but this is not true for all cases. For the Dutch language, for instance, an accent needs to be removed when the hyphenation point is next to the accented letter.

9-3-3
Image
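For Dutch, the idea can be sketched as follows. The sample words are chosen for illustration; whether a break actually occurs next to the accented letter depends on the line width.

```latex
\documentclass{article}
\usepackage[dutch]{babel}
\begin{document}
% "e produces an e with diaeresis; the accent is dropped if the word
% is broken at that point.
ge"evacueerd, Brazili"e
\end{document}
```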

Insert special characters In the Catalan language a special glyph, the “geminated l”, is needed for proper typesetting [167].

9-3-4
Image

This character can also be typeset by using the commands \lgem and \Lgem or through the combinations “\l.” and “\L.” once catalan is selected.

Insert special quoting characters By default, LaTeX supports single and double quotes: ‘quoted text’ and “quoted text”. This support is not sufficient for many European languages, which have their own conventions and more often than not require different characters for this purpose. For example, in Dutch traditional typesetting the opening quote should be placed on the baseline, in German typesetting the closing quote is reversed, and French typesetting requires guillemets. For Icelandic typesetting the guillemets are used as well, but the other way around, that is, pointing “inward” instead of “outward” (a convention also sometimes used in German typography).

9-3-5
Image

The T1 font encoding provides the guillemets (see Table 7.32 on page 449), but the support for French typesetting relies on the commands \og and \fg. These commands not only produce the guillemets, but also provide proper spacing between them and the text they surround.
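A minimal sketch of \og and \fg in use follows. The sentence is invented for illustration.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[french]{babel}
\begin{document}
% \og and \fg produce the guillemets together with their spacing;
% the empty group after \fg preserves the following word space.
Il a dit \og bonjour \fg{} en entrant.
\end{document}
```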

Insert special hyphenation rules A number of languages have specific rules about what happens to characters at a line break. For instance, in older German spelling ..ck.. is hyphenated as ..k-k.. and a triple f in a compound word is normally typeset as ff—except when hyphenated, in which case the third f reappears as shown in the example.

9-3-6
Image
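The two cases mentioned above can be sketched as follows, assuming the option german for the traditional orthography. The narrow \parbox is there only to provoke line breaks.

```latex
\documentclass{article}
\usepackage[german]{babel}
\begin{document}
\parbox{1pt}{%
  \hspace{0pt}Dru"cker    % "ck becomes k-k at a break, ck otherwise
  \hspace{0pt}Schi"ffahrt % "ff becomes ff-f at a break, ff otherwise
}
\end{document}
```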

Insert special hyphenation indications A number of shorthands are used to inform LaTeX about special situations with regard to hyphenation. For instance, in a number of languages it is sometimes necessary to prevent LaTeX from typesetting a ligature—for example, in a compound word. This goal can be achieved by inserting a small kern between the two letters that would normally form a ligature. The shorthand "| is available for this purpose in many language definitions.

9-3-7
Image
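A sketch of the "| shorthand, using a German compound chosen for illustration:

```latex
\documentclass{article}
\usepackage[german]{babel}
\begin{document}
Auflage    % f and l may be joined into a ligature

Auf"|lage  % the shorthand keeps the two letters apart
\end{document}
```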

Another popular shorthand is "-, which indicates a hyphenation point (like \-), but without suppressing hyphenation in the remainder of the word:

9-3-8
Image

There is also "" (similar to "-, but does not print the -), "= (inserts an explicit hyphen with a breakpoint, allowing hyphenation in the combined words separately), and "~ (inserts an explicit hyphen without a breakpoint). The following example shows the effects of these shorthands, using the same word.

9-3-9
Image
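The four hyphenation shorthands can be compared side by side in a sketch like this one, which assumes the option german and uses an invented compound. The comments restate the descriptions given above.

```latex
\documentclass{article}
\usepackage[german]{babel}
\begin{document}
Sprach"-variante % extra break point; the rest can still be hyphenated

Sprach""variante % break point that prints no hyphen

Sprach"=variante % explicit hyphen; both parts can still be hyphenated

Sprach"~variante % explicit hyphen at which no break is allowed
\end{document}
```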

The tilde

For the languages Basque, Estonian, Galician, Greek, and Spanish, the tilde character is used for a different purpose than inserting an unbreakable space.

• For Estonian typography, the tilde-accent needs to be set somewhat lower than LaTeX’s normal positioning.

• For Greek multi-accented typesetting, LaTeX needs to see the tilde as if it were a normal letter. This behavior is needed to make the ligatures in the Greek fonts work correctly.

• For Basque, Galician, and Spanish, the tilde is used in the shorthands ~n (ñ), ~N (Ñ), and ~- (special dash). The construction ~- (as well as ~-- and ~---) produces a dash that disallows a linebreak after it. When the tilde is followed by any other character, it retains its original function as an “unbreakable space” (producing the overfull first line in the example). If such a space is needed before an “n”, this can be achieved by inserting an empty group (the second line in the example).

9-3-10
Image
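The Spanish tilde shorthands can be sketched as follows. This is written against the behavior described above; recent releases of the Spanish support may require these shorthands to be enabled explicitly, so check the documentation of your installation.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[spanish]{babel}
\begin{document}
Espa~na, ma~nana  % ~n is used for the letter n with tilde

p.~{}nota         % empty group: unbreakable space followed by n

1990~-1995        % ~- gives a dash that allows no break after it
\end{document}
```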

The colon, semicolon, exclamation mark, and question mark

For the languages Breton, French, Russian, and Ukrainian, these four characters are used as shorthands to facilitate the use of correct typographic conventions. For Turkish typography, this ability is needed only for the colon and exclamation mark. The convention is that a little white space should precede these characters.

9-3-11
Image

This white space is added automatically by default, but this setting can be changed in a configuration file. The use of the colon as a shorthand character can lead to problems with other packages or when including PostScript files in a document. In such cases it may be necessary to disable this shorthand (temporarily) by using \shorthandoff, as explained in Example 9-2-6 on page 549.
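A sketch of the French punctuation shorthands, including a passage in which they are switched off, follows. The sentences are invented for illustration.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[french]{babel}
\begin{document}
Est-ce vrai? Oui: regardez; c'est ici!

{\shorthandoff{:;!?}% the four characters are ordinary inside this group
Est-ce vrai? Oui: regardez; c'est ici!}
\end{document}
```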

The grave accent

The support for the languages Catalan and Hungarian makes it possible to use the grave accent (`) as a shorthand character.

• For Catalan this use of the grave accent character is not supported by default; one has to specify the option activegrave when loading babel. The purpose of this shorthand is to facilitate the entering of accented characters while retaining hyphenation. The shorthand can be used together with the letters a, e, o and A, E, O.

9-3-12
Image

• For Hungarian this shorthand can be used with both uppercase and lowercase versions of the characters c, d, g, l, n, s, t, and z. Its purpose is to insert discretionaries to invoke the correct behavior at hyphenation points.

9-3-13
Image
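For the Catalan case, a sketch with the activegrave option might look as follows. The sample words are chosen for illustration.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[catalan,activegrave]{babel}
\begin{document}
% The grave shorthand works with a, e, o and A, E, O.
p`agina, caf`e, aix`o
\end{document}
```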

The acute accent

The support for the languages Catalan, Galician, and Spanish makes it possible to use the acute accent (') as a shorthand character.

• For the support of Catalan typesetting, this shorthand can be used together with the vowels (a, e, i, o, u), both uppercase and lowercase. Its effect is to add the accent and to retain hyphenation.

• For the support of Galician typesetting, this shorthand offers the same functionality as for Catalan with the addition that entering 'n will produce ñ.

9-3-14
Image

• For the support of Spanish typesetting, this shorthand offers functionality similar to that for Catalan and Galician.

The described functionality is made available when the activeacute option is used. This support is made optional because the acute accent has other uses in LaTeX, which will fail when this character is turned into a shorthand.
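A sketch of the acute shorthand with the activeacute option, using Catalan and invented sample words:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[catalan,activeacute]{babel}
\begin{document}
% 'u, 'o, 'e add the acute accent while retaining hyphenation.
m'usica, cami'o, m'es
\end{document}
```

Because the apostrophe is active in such a document, other uses of the character may need \shorthandoff{'}.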

The caret

The support for the languages Esperanto and Latin makes it possible to use the caret accent (^) as a shorthand character.

• For typesetting the Esperanto language, two accents are needed: the caret and the breve accent. The caret appears on the letters c, g, h, j, and s; the breve appears on the character u. Both accents can appear on lowercase and uppercase letters. The caret is defined as a shorthand that retains hyphenation and sets the caret accent somewhat lower on the character “h” (Image). Used together with the letter u, this shorthand typesets the breve accent (^u results in ŭ); used together with the vertical bar, it inserts an explicit hyphen sign, allowing hyphenation in the rest of the word.

9-3-15
Image

• When a Latin text is being typeset and the attribute withprosodicmarks has been selected, the caret is defined to be a shorthand for adding a breve accent to the lowercase vowels (except the medieval ligatures æ and œ). This is done while retaining hyphenation points.

9-3-16
Image
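The Esperanto use of the caret can be sketched with a well-known phrase that contains all six accented letters:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[esperanto]{babel}
\begin{document}
% ^c ^g ^h ^j ^s take the circumflex; ^u takes the breve.
e^ho^san^go ^ciu^ja^ude
\end{document}
```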

The equals sign

The support for the languages Latin (with the attribute withprosodicmarks selected) and Turkish makes it possible to use the equals sign (=) as a shorthand character.

• When a Latin text is being typeset and the attribute withprosodicmarks has been selected, the equals sign is defined to be a shorthand for adding a macron accent to the lowercase vowels (except the medieval ligatures æ and œ). This is done while retaining hyphenation points.

9-3-17
Image

• When Turkish typesetting rules are to be followed, the equals sign needs to be preceded by a little white space. This is achieved automatically by turning the equals sign into a shorthand that replaces a preceding space character with a tiny amount of white space.

9-3-18
Image

The disadvantage of turning the equals sign into a shorthand character is that it may cause many other packages to fail, including the usage of PostScript files for graphics inclusions. Make sure that the shorthand is turned off with \shorthandoff wherever such code is used.
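One way to protect code that relies on an ordinary equals sign is sketched below. The graphics file name is a placeholder, and the use of \shorthandon to reactivate the shorthand afterwards is an addition not mentioned in the text above.

```latex
\documentclass{article}
\usepackage{graphicx}
\usepackage[turkish]{babel}
\begin{document}
\shorthandoff{=}  % make = an ordinary character again
\includegraphics[width=3cm]{example-image}% placeholder file name
\shorthandon{=}   % reactivate the shorthand for the running text
\end{document}
```

Keeping the deactivated region as small as possible means that the Turkish spacing rule still applies to the rest of the text.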

The greater than and less than signs

The support for the Spanish language makes it possible to use the less than and greater than signs (< and >) as shorthand characters for inserting a special quoting environment. This environment inserts different quoting characters when it is nested within itself. It supports a maximum of three levels of nested quotations. It also automatically inserts the closing quote signs when a new paragraph is started within a quote.

9-3-19
Image

Note that when characters are turned into shorthands, the ligature mechanism in the fonts no longer works for them. In the T1 font encoding, for instance, a ligature is defined for two consecutive “less than” signs that normally results in typesetting guillemets. In the example above, the nested quote shows clearly that this does not happen.

The period

The support for the Spanish language also allows the use of the period (.) as a shorthand character in math mode. Its purpose is to control whether decimal numbers are written with the comma (\decimalcomma) or the period (\decimalpoint) as the decimal character.

9-3-20
Image
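A minimal sketch of selecting the decimal character follows. Placing the declaration in the preamble is an assumption about typical usage, and the formula is purely illustrative.

```latex
\documentclass{article}
\usepackage[spanish]{babel}
\decimalpoint  % keep the period; \decimalcomma would select the comma
\begin{document}
$\pi \approx 3.1416$
\end{document}
```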

9.3.3. Language-specific commands

Apart from the translations and shorthands discussed above, some language definition files provide extra commands. Some of these are meant to facilitate the production of documents that conform to the appropriate typesetting rules. Others provide extra functionality not available by default in LaTeX. A number of these commands are described in this section.

Formatting dates

For some languages more than one format is used for representing dates. In these cases extra commands are provided to produce a date in different formats. In the Bulgarian tradition months are indicated using uppercase Roman numerals; for such dates the command \todayRoman is available.

9-3-21
Image

When writing in the Esperanto language two slightly different ways of representing the date are provided by the commands \hodiau and \hodiaun.

9-3-22
Image

When producing a document in the Greek language the date can also be represented with Greek numerals instead of Arabic numerals. For this purpose the command \Grtoday is made available.

9-3-23
Image

The support for typesetting Hebrew texts offers the command \hebdate to translate any Gregorian date, given as “day, month, year”, into a Gregorian date in Hebrew. The command \hebday replaces LaTeX’s normal \today. When you want to produce “normal” Hebrew dates, you need to use the package hebcal, which provides the command \Hebrewtoday. When it is used outside the Hebrew environment it produces the Hebrew date in English.

9-3-24
Image

The support for the Hungarian language provides the command \ontoday to produce a date format used in expressions such as “on February 10th”.

For the Upper and Lower Sorbian languages two different sets of month names are employed. By default, the support for these languages produces “new-style” dates, but “old-style” dates can be produced as well. The “old-style” date format for the Lower Sorbian language can be selected with the command \olddatelsorbian; \newdatelsorbian switches (back) to the modern form. For Upper Sorbian similar commands are available, as shown in the example.

9-3-25
Image

In Swedish documents it is customary to represent dates with just numbers. Such dates can occur in two forms: YYYY-MM-DD and DD/MM YYYY. The command \datesymd changes the definition of the command \today to produce dates in the first numerical form; the command \datesdmy changes the definition of the command \today to produce dates in the second numerical format.
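The two Swedish date declarations can be compared with a short sketch like this one, which is a minimal document written for illustration.

```latex
\documentclass{article}
\usepackage[swedish]{babel}
\begin{document}
\datesymd \today  % numeric form YYYY-MM-DD

\datesdmy \today  % numeric form DD/MM YYYY
\end{document}
```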

9-3-26
Image

Numbering

The support for certain languages provides additional commands for representing numbers by letters. LaTeX provides the commands \alph and \Alph for this purpose. For the Esperanto language the commands \esper and \Esper are provided. The support for the Greek language changes the definition of \alph and \Alph to produce Greek letters while the support for the Bulgarian language changes them to produce Cyrillic letters. The support for the Russian and Ukrainian languages provides the commands \asbuk and \Asbuk as alternatives to the LaTeX commands.

For Hebrew typesetting the \alph command is changed to produce Hebrew letter sequences using the “Gimatria” scheme. As there are no uppercase letters \Alph produces the same letter sequences but adds apostrophes. In addition, an extra command, \Alphfinal, generates Hebrew letters with apostrophes and final letter forms, a variant needed for Hebrew year designators. Table 9.4 compares the various numbering schemes.

9-3-27
Image

Table 9.4. Different methods for representing numbers by letters
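Such commands are used like \alph, that is, with a counter name as the argument. The following sketch assumes that Cyrillic fonts for the T2A encoding are installed; the redefinition of \theenumi is an illustration and not part of the language support itself.

```latex
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[russian]{babel}
% number first-level enumerate items with lowercase Cyrillic letters
\renewcommand{\theenumi}{\asbuk{enumi}}
\begin{document}
\begin{enumerate}
  \item first item
  \item second item
\end{enumerate}
\end{document}
```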

In French typesetting, numbers should be typeset following different rules than those employed in English typesetting. Namely, instead of separating thousands with a comma, a space should be used. The command \nombre is provided for this purpose. It can also be used outside the French language environment, where it will typeset numbers according to the English rules. The command \nombre takes an optional argument, which can be used to replace the default decimal separator (stored in \decimalsep). This feature can be useful in combination with the package dcolumn (see Section 5.7.2), in which you have to use the optional argument to achieve correct alignment.

9-3-28
Image

In Greece an alternative way of writing numbers exists. It is based on using letters to denote number ranges. This system was used in official publications at the end of the 19th century and the beginning of the 20th century. At present most Greeks use it for small numbers. The knowledge of how to write numbers larger than 20 or 30 is not very widespread, being primarily used by the Eastern Orthodox Church and scholars. They employ this approach to denote numbers up to 999999. This system works as follows:

• Only numbers greater than 0 can be expressed.

• For the units 1 through 9 (inclusive), the letters alpha, beta, gamma, delta, epsilon, stigma, zeta, eta, and theta are used, followed by a mark similar to the mathematical symbol “prime”, called the “numeric mark”. Because the letter stigma is not always part of the available font, it is often replaced by the first two letters of its name as an alternative. In the babel implementation the letter stigma is produced, rather than the digraph sigma tau.

• For the tens 10 through 90 (inclusive), the letters iota, kappa, lambda, mu, nu, xi, omikron, pi, and qoppa are used, again followed by the numeric mark. The qoppa that appears in Greek numerals has a distinct zig-zag form that is quite different from the normal qoppa, which resembles the Latin “q”.

• For the hundreds 100 through 900 (inclusive), the letters rho, sigma, tau, upsilon, phi, chi, psi, omega, and sampi are used, also followed by the numeric mark.

• Using these rules any number between 1 and 999 can be expressed by a group of letters denoting the hundreds, tens, and units, followed by one numeric mark.

• For the number range 1000 through 999000 (inclusive), the digits denoting multiples of a thousand are expressed by the same letters as above, this time with a numeric mark in front of this letter group. This mark is rotated 180 degrees and placed under the baseline. As can be seen in the example below, when two letter-groups are combined, both numeric marks are used.

9-3-29
Image
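To make the rules above concrete, the number 1975 decomposes into 1000 + 900 + 70 + 5. The sketch below assumes that the greek option provides a command \greeknumeral taking a number as its argument (the command name is not stated in the text above) and that the Greek fonts are installed.

```latex
\documentclass{article}
\usepackage[greek,english]{babel}
\begin{document}
% 1975 = 1000 + 900 + 70 + 5:
% alpha (thousands, with the lower mark in front), then sampi,
% omikron, and epsilon, followed by the numeric mark
\foreignlanguage{greek}{\greeknumeral{1975}}
\end{document}
```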

In ancient Greece yet another numbering system was used, which closely resembles the Roman one in that it employs letters to denote important numbers. Multiple occurrences of a letter denote a multiple of the “important” number; for example, the letter I denotes 1, so III denotes 3. Here are the basic digits used in the Athenian numbering system:

• I denotes the number one (1).

• Π denotes the number five (5).

• Δ denotes the number ten (10).

• H denotes the number one hundred (100).

• X denotes the number one thousand (1000).

• M denotes the number ten thousand (10000).

Moreover, the letters Δ, H, X, and M, when placed under the letter Π, denote five times their original value; for example, the symbol Image denotes the number 5000, and the symbol Image denotes the number 50. Note that the numbering system does not provide negative numerals or a symbol for zero.

The Athenian numbering system, among others, is described in an article in Encyclopedia Image, Volume 2, seventh edition, page 280, Athens, October 2, 1975. This numbering system is supported by the package athnum, which comes with the babel system. It implements the command \athnum.

9-3-30
Image
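As a worked case, 2003 is 1000 + 1000 + 1 + 1 + 1 and is therefore written as two X followed by three I. A sketch of the corresponding input follows; it assumes that \athnum takes the number as its argument and that it is used within Greek text.

```latex
\documentclass{article}
\usepackage[greek,english]{babel}
\usepackage{athnum}
\begin{document}
% 2003 = 1000 + 1000 + 1 + 1 + 1, i.e., two X followed by three I
\foreignlanguage{greek}{\athnum{2003}}
\end{document}
```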

In Icelandic documents, numbers need to be typeset according to Icelandic rules. For this purpose the command \tala is provided. Like \nombre it takes an optional argument, which can be used to replace the decimal separator used, such as for use with the dcolumn package.

9-3-31
Image

Miscellaneous extras

... for French

In French typesetting it is customary to print family names in small capitals, without hyphenating a name. For this purpose the command \bsc (boxed small caps) is provided. Abbreviations of the French word “numéro” should be typeset according to specific rules; these have been implemented in the commands \no and \No. Finally, for certain enumerated lists the commands \primo, \secundo, \tertio, and \quarto are available when typesetting in French.

9-3-32
Image

... for Catalan, French, and Italian

In some languages, e.g., Italian, it is customary to write together the article and the following noun—for example, “nell’altezza”. To carry out the hyphenation of such constructs the character ' is made to behave as a normal letter.

... for Hungarian

In the Hungarian language the definite article can be either “a” or “az”, depending on the context. Especially with references and citations, it is not always known beforehand which form should be used. The support for the Hungarian language contains commands that know the rules dictating when a “z” should be added to the article. These commands all take an argument that determines which form of the definite article should be typeset together with that argument.

Image

These commands produce the article and the argument. The argument can be a star (as in \az*), in which case just the article will be typeset. The form \Az is intended for the start of a sentence.

Image

The first two commands should be used instead of a(z)~\ref{label}. When an equation is being referenced, the argument may be enclosed in parentheses instead of braces. For page references use \apageref (or \Apageref) to allow LaTeX to automatically produce the correct definite article.

Image

For citations the command \acite should be used. Its argument may be a list of citations, in which case the first element of the list determines which form of the article should be typeset.
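The Hungarian article commands might be used as in the following sketch. The label name is hypothetical, and the command name \aref (the article-aware counterpart of \ref) is an assumption, as it is not spelled out in the text above.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[magyar]{babel}
\begin{document}
\section{Teszt}\label{sec:teszt}% hypothetical label
% \Az starts a sentence; \az is used within one
\Az{5} and \az{6} receive the article form suited to the number.

% article chosen from the number the reference resolves to
See \aref{sec:teszt} on \apageref{sec:teszt}.
\end{document}
```

As with any cross-reference, a second LaTeX run is needed before the referenced numbers, and hence the articles, are final.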

... specials for math

In Eastern Europe a number of mathematical operators have a different appearance in equations than they do in “the Western world”. Table 9.5 shows the relevant commands for different languages. The Russian commands are also valid for Bulgarian and Ukrainian language support. The package grmath, which comes as part of the babel distribution, changes the definitions of these operators to produce abbreviations of their Greek names. The package can only be used in conjunction with the greek option of babel.

9-3-33
Image

Table 9.5. Alternative mathematical operators for Eastern European languages

9.3.4. Layout considerations

Some of the language support files in the babel package provide commands for automatically changing the layout of the document. Some simply change the way LaTeX handles spaces after punctuation characters or ensure that the first paragraph that follows a section heading is indented. Others go much further.

Spaces after punctuation characters

In The TeXbook [82, pp. 72–74], the concept of extra white space after punctuation characters is discussed. Good typesetting practice mandates that inter-sentence spaces behave a little differently than interword spaces with respect to shrinkage and expansion (during justification). However, this practice is not considered helpful in all cases, so for a number of languages (Breton, Bulgarian, Czech, Danish, Estonian, Finnish, French, German, Norwegian, Russian, Spanish, Turkish, and Ukrainian) this feature is switched off by calling the command \frenchspacing.
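The effect can also be obtained without any language support, since the relevant declarations are part of standard LaTeX. The following sketch is written for illustration and uses the standard counterpart \nonfrenchspacing to switch back.

```latex
\documentclass{article}
\begin{document}
\frenchspacing     % inter-sentence spaces behave like interword spaces
First sentence. Second sentence.

\nonfrenchspacing  % restore the extra stretch after sentence-ending punctuation
First sentence. Second sentence.
\end{document}
```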

Paragraph indention after heading

Another layout concept that is built into most LaTeX classes is the suppression of the paragraph indentation for the first paragraph that follows a section heading. Again, for some languages this behavior is wrong; the support for French, Serbo-Croatian, and Spanish changes it to have all paragraphs indented. In fact, you can request this behavior for any document by loading the package indentfirst.
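For a document that does not use one of these languages, the same layout can be requested as in this minimal sketch.

```latex
\documentclass{article}
\usepackage{indentfirst}  % indent the first paragraph after a heading as well
\begin{document}
\section{A heading}
This first paragraph is indented like all the others.
\end{document}
```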

Layout of lists

The support for French (and Breton, for which support is derived from the support for the French language) takes this somewhat further to accommodate the typesetting rules used in France. It changes the general way lists are typeset by LaTeX by reducing the amount of vertical white space in them. For the itemize environment, it removes all vertical white space between the items and changes the appearance of the items by replacing “•” with “–”.

9-3-34
Image
Image

For documents that are typeset in more than one language, the support for French provides a way to ensure that lists have a uniform layout throughout the document, either the “French layout” or the “LaTeX layout”. This result can be achieved by using the command \FrenchLayout or \StandardLayout in the preamble of the document. Unfortunately, when your document is being typeset with something other than one of the document classes provided by standard LaTeX, or when you use extension packages such as paralist, such layout changes may have surprising and unwanted effects. In such cases it might be safest to use \StandardLayout.

Image

Layout of footnotes

In the French typesetting tradition, footnotes are handled differently than they are in the Anglo-American tradition. In the running text, a little white space should be added before the number or symbol that calls the footnote. This behavior is optional and can be selected by using the \AddThinSpaceBeforeFootnotes command in the preamble of your document. The text of the footnote can also be typeset according to French typesetting rules; this result is achieved by using the command \FrenchFootnotes.

9-3-35
Image

Layout of captions

The final layout change performed by the babel support for the French language is that the colon in captions for tables and figures is replaced with an en dash when one of the document classes of standard LaTeX is used.

Internal commands redefined for magyar

The support for typesetting Hungarian documents goes even further: it redefines a number of internal LaTeX commands to produce correct captions for figures and tables. Using the same means, it changes the layout of section headings. The definition of the theorem environment is changed as well. As explained above, such changes may lead to unexpected and even unwanted behavior, so be careful.

Right to left typesetting

To support typesetting Hebrew documents, even more drastic changes are needed because the Hebrew language has to be typeset from right to left. This requires the usage of a TeX extension (i.e., eTeX with a LaTeX format) to correctly typeset a Hebrew document.

9.3.5. Languages and font encoding

As shown in some of the earlier examples, some languages cannot be supported by, for instance, simply translating some texts and providing extra support for special hyphenation needs. Many languages require characters that are not present in LaTeX’s T1 encoding. For some, just a few characters are missing and can be constructed from the available glyphs; other languages are not normally written using the Latin script. Some of these are supported by the babel system.

Extensions to the OT1 and T1 encodings

For some languages just a few characters are missing in the OT1 encoding and sometimes even in the T1 encoding. When the missing characters can be constructed from the available glyphs, it is relatively easy to rectify this situation. Such is the case for the Old Icelandic language. It needs a number of characters that can be represented by adding the “ogonek” to available glyphs. To access these you should use the shorthands in the next example. Note that each of these shorthands is composed of " and an 8-bit character, so use of the inputenc package is required.

9-3-36
Image

Old Icelandic may not be a language in daily use, but the Polish language certainly is. For this language the OT1 encoding is missing a few characters (note that they are all included in T1). Again the missing characters can be constructed, and their entry is supported with shorthands. The support for entering the letters “pointed z” and “accented z” comes in two forms, as illustrated below. The reason for this duality is historical.

9-3-37
Image

All such shorthands were devised when 7-bit font encodings were the norm and producing a glyph such as “Image” required some internal macro processing (if it was possible at all). With today’s 8-bit fonts there is no requirement to use the shorthands. For example, with T1-encoded fonts, standard input methods may be used instead.

9-3-38
Image

Basic support for switching font encodings

In the situation where simply constructing a few extra characters to support the correct typesetting of a language does not offer a sufficient solution, switching from one font encoding to another becomes necessary. This section describes the commands provided by babel and its language support files for this task. Note that these commands are normally “hidden” by babel’s user interface.

Image

The babel package uses \latinencoding to record the Latin encoding (OT1 or T1) used in the document. To determine which encoding is used, babel tests whether the encoding current at \begin{document} is T1; if it is not, it (perhaps wrongly) assumes OT1.

The languages that are typeset using the Cyrillic alphabet define the command \cyrillicencoding to store the name for the Cyrillic encoding. The command \hebrewencoding serves the same purpose for the Hebrew font encoding. At the time of writing no \greekencoding command was available, because babel supported only a single encoding (LGR) for Greek.

Image

This command typesets its argument in a font with the Latin encoding, independent of the encoding of the surrounding text.

Image

This command is (only) defined when one of the options bulgarian, russian, or ukrainian is used. It typesets its argument using a font in the Cyrillic encoding stored in \cyrillicencoding.

Image

These commands are defined by the greek language option. Both typeset their arguments in a font with the Greek encoding; the command \textol uses an outline font.

Declarative forms for these \text... commands are also available; they are called \latintext, \greektext, \outlfamily, and \cyrillictext.
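The following sketch shows how such commands might be used directly, although in most documents babel's language-switching interface calls them for you. The command names \textlatin and \textcyrillic are assumptions derived from the declarative forms listed above. The Cyrillic letters are entered through LaTeX's encoding-independent names so that the example does not depend on an input encoding.

```latex
\documentclass{article}
\usepackage[T2A,T1]{fontenc}         % T1 is last, hence the document default
\usepackage[russian,english]{babel}  % english is the main language
\begin{document}
% a Cyrillic fragment inside Latin text
Latin text with \textcyrillic{\CYRT\cyre\cyrh} in the Cyrillic encoding.

\selectlanguage{russian}
% a Latin fragment inside Cyrillic text
\CYRT\cyre\cyrh{} \textlatin{TeX}
\end{document}
```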

Basic support for switching typesetting directions

To support the typesetting of Hebrew texts, the direction of typesetting also needs to be changed. Several commands with different names have been defined for this purpose.

Image

The command \sethebrew switches the typesetting direction to “right to left”, switches the font encoding to a Hebrew encoding, and shifts the “point of typesetting” to start from the right margin. The command \unsethebrew switches the typesetting direction to “left to right”, switches the font encoding to the one in use when \sethebrew was called, and shifts the “point of typesetting” to start from the left margin.

Image

The commands \R and \L should be used when a small piece of Hebrew text needs to appear in the same location relative to the surrounding text. The use of these commands is illustrated in the following example. Note the location of the second text typeset with Hebrew characters.

9-3-39
Image

9.4. Support for non-Latin alphabets

The babel distribution contains support for three non-Latin alphabets: the Cyrillic alphabet, the Greek alphabet, and the Hebrew alphabet. They are discussed in the following sections.

9.4.1. The Cyrillic alphabet

The Cyrillic alphabet is used by several of the Slavic languages in Eastern Europe, as well as for writing dozens of languages used in the territory encompassed by the former Soviet Union. Vladimir Volovich and Werner Lemberg, together with the LaTeX team, have integrated basic support for the Cyrillic alphabet into LaTeX. This section addresses the issues of Cyrillic fonts, the encoding interface, and their integration with babel.

Historically, support for Russian in TeX has been available from the American Mathematical Society [14]. The AMS system uses the wncyr fonts and is based on a transliteration table originally designed for Russian journal names and article titles in the journal Mathematical Reviews. In this journal the AMS prefers that the same character sequence in the electronic files produce either the Russian text with Russian characters or its transliteration with English characters, without any ambiguities.

However, with the spread of TeX in Russia, proper support for typesetting Russian (and later other languages written in the Cyrillic alphabet) became necessary. Over the years several 7- and 8-bit input encodings were developed, as well as many font encodings. The Cyrillic system is designed to work for any 8-bit input encoding and is able to map all of them onto a few Cyrillic font encodings, each supporting a number of languages.

Fonts and font encodings

For compatibility reasons, only the upper 128 characters in an 8-bit TeX font are available for new glyphs. As the number of glyphs in use in Cyrillic-based languages during the 20th century far exceeds 128, four “Cyrillic font encodings” have been defined [17]. Three of them—T2A, T2B, and T2C—satisfy the basic structural requirements of LaTeX’s T* encodings and, therefore, can be used in multilingual documents with other languages being based on standard font encodings.1

1 The fourth Cyrillic encoding, X2, contains Cyrillic glyphs spread over the 256 character positions, and is thus suitable only for specific, Cyrillic-only applications. It is not discussed here.

The work on the T2* encodings was performed by Alexander Berdnikov in collaboration with Mikhail Kolodin and Andrew Janishevsky. Vladimir Volovich provided the integration with LaTeX.

Two other LaTeX Cyrillic font encodings exist: the 7-bit OT2 encoding developed by the American Mathematical Society, which is useful for short texts in Cyrillic, and the 8-bit LCY encoding, which is incompatible with LaTeX’s T* encodings and, therefore, unsuitable for typesetting multilingual documents. The OT2 encoding was designed in such a way that the same source could be used to produce text either in the Cyrillic alphabet or in a transliteration.

Cyrillic Computer Modern fonts

The default font family with LaTeX is Knuth’s Computer Modern, in its 7-bit (OT1-encoded CM fonts) or 8-bit (T1-encoded EC fonts) incarnation. Olga Lapko and Andrey Khodulev developed the LH fonts, which provide glyph designs compatible with the Computer Modern font family and covering all Cyrillic font encodings. They provide the same font shapes and sizes as those available for its Latin equivalent, the EC family. These fonts are found on CTAN in the directory fonts/cyrillic/lh. Installation instructions appear in the file INSTALL in that distribution.1

1 Other fonts, including Type 1 fonts, can also be used, provided that their TeX font encoding is compatible with the T2* encodings. In particular, the CM-Super fonts cover the whole range of Cyrillic encodings; see Section 7.5.1 on page 353 for details.

A collection of hyphenation patterns for the Russian language that supports the T2* encodings, as well as other popular font encodings used for Russian typesetting (including the Omega internal encoding), is available in the ruhyphen distribution on CTAN (language/hyphenation/ruhyphen). The patterns for other Cyrillic languages should be adapted to work with the T2* encodings.

Using Cyrillic in your documents

Support for Cyrillic in LaTeX is based on the standard fontenc and inputenc packages, as well as on the babel package. For instance, one can write the following in the preamble of the document:

Image
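A preamble of this kind, embedded in a complete document, might look like the following sketch. The body text is an illustration and uses LaTeX's encoding-independent letter names so that it can be typed in plain ASCII.

```latex
\documentclass{article}
\usepackage[T2A]{fontenc}      % Cyrillic font encoding
\usepackage[koi8-r]{inputenc}  % input encoding of the source file
\usepackage[russian]{babel}    % language support
\begin{document}
% five Cyrillic letters entered by name: te, ie, ka, es, te
\CYRT\cyre\cyrk\cyrs\cyrt
\end{document}
```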

The input encoding koi8-r (KOI8 optimized for Russian) can be replaced by any of the following Cyrillic input encodings:

cp855 Standard MS-DOS Cyrillic code page.

cp866 Standard MS-DOS Russian code page. Several variants, distinguished by differences in the code positions 242–254, exist: cp866av (Cyrillic Alternative), cp866mav (Modified Alternative Variant), cp866nav (New Alternative Variant), and cp866tat (for Tatar).

cp1251 Standard MS Windows Cyrillic code page.

koi8-r Standard Cyrillic code page that is widely used on UN*X-like systems for Russian language support. Variants for Ukrainian are koi8-u and koi8-ru. An ECMA variant (ISO-IR 111 ECMA) is isoir111.

iso88595 ISO standard ISO 8859-5 (also called ISO-IR 144).

maccyr Apple Macintosh Cyrillic code page (also known as Microsoft cp10007) and macukr, the Apple Macintosh Ukrainian code page.

ctt, dbk, mnk, mos, ncc Mongolian code pages.

Not all of these code pages are part of the standard inputenc distribution, so some may have to be obtained separately.

When more than one input encoding is used within a document, you can use the \inputencoding command to switch between them. To change the case of text, use the two standard LaTeX commands \MakeUppercase and \MakeLowercase, which produce uppercase and lowercase, respectively. The low-level TeX primitives \uppercase and \lowercase should never be used in LaTeX and will not work for Cyrillic.
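Both points can be sketched as follows. The choice of cp1251 as the second input encoding is an assumption made for illustration; the letters are again entered by name.

```latex
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[cp1251,koi8-r]{inputenc}  % the last option is active at the start
\usepackage[russian]{babel}
\begin{document}
% case change through the LaTeX interface, not the TeX primitive
\MakeUppercase{\cyrt\cyre\cyrk\cyrs\cyrt}

\inputencoding{cp1251}  % input from here on is interpreted as cp1251
\end{document}
```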

In the previous example of a preamble, the font encoding to be used was explicitly declared. For multilingual documents all encodings needed should be enumerated via the \usepackage[...]{fontenc} command. Changing from one font encoding to another can be accomplished by using the \fontencoding command, but it is advisable that such changes be performed by a higher-level interface such as the \selectlanguage command. In particular, when using babel, you can write

Image

where babel will automatically choose the default font encoding for Russian, which is T2A, when it is available. Table 9.6 on the following page shows the layout of the T2A encoding.
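For a document that mixes Russian with a Latin-script language, the advice above might translate into a sketch like this one. The pairing with English and T1 is an assumption chosen for illustration.

```latex
\documentclass{article}
\usepackage[T2A,T1]{fontenc}         % enumerate every encoding that is needed
\usepackage[russian,english]{babel}  % english is the main language
\begin{document}
English text, typeset in T1.

\selectlanguage{russian}  % babel switches the font encoding as well
\CYRT\cyre\cyrk\cyrs\cyrt

\selectlanguage{english}
Back to English and T1.
\end{document}
```

Letting \selectlanguage perform the switch keeps hyphenation patterns, fixed texts, and font encoding in step, which a bare \fontencoding call would not do.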

Image
Image

Table 9.6. Glyph chart for a T2A-encoded font (larm1000)

Font encodings for Cyrillic languages

The Cyrillic font encodings support the languages listed below. Note that some languages, such as Bulgarian and Russian, can be properly typeset with more than one encoding.

T2A: Abaza, Avar, Agul, Adyghei, Azerbaijani, Altai, Balkar, Bashkir, Bulgarian, Buryat, Byelorussian, Gagauz, Dargin, Dungan, Ingush, Kabardino-Cherkess, Kazakh, Kalmyk, Karakalpak, Karachaevskii, Karelian, Kirghiz, Komi-Zyrian, Komi-Permyak, Kumyk, Lak, Lezghin, Macedonian, Mari-Mountain, Mari-Valley, Moldavian, Mongolian, Mordvin-Moksha, Mordvin-Erzya, Nogai, Oroch, Osetin, Russian, Rutul, Serbian, Tabasaran, Tadzhik, Tatar, Tati, Teleut, Tofalar, Tuva, Turkmen, Udmurt, Uzbek, Ukrainian, Hanty-Obskii, Hanty-Surgut, Gipsi, Chechen, Chuvash, Crimean-Tatar

T2B: Abaza, Avar, Agul, Adyghei, Aleut, Altai, Balkar, Byelorussian, Bulgarian, Buryat, Gagauz, Dargin, Dolgan, Dungan, Ingush, Itelmen, Kabardino-Cherkess, Kalmyk, Karakalpak, Karachaevskii, Karelian, Ketskii, Kirghiz, Komi-Zyrian, Komi-Permyak, Koryak, Kumyk, Kurdian, Lak, Lezghin, Mansi, Mari-Valley, Moldavian, Mongolian, Mordvin-Moksha, Mordvin-Erzya, Nanai, Nganasan, Negidal, Nenets, Nivh, Nogai, Oroch, Russian, Rutul, Selkup, Tabasaran, Tadzhik, Tatar, Tati, Teleut, Tofalar, Tuva, Turkmen, Udyghei, Uigur, Ulch, Khakass, Hanty-Vahovskii, Hanty-Kazymskii, Hanty-Obskii, Hanty-Surgut, Hanty-Shurysharskii, Gipsi, Chechen, Chukcha, Shor, Evenk, Even, Enets, Eskimo, Yukagir, Crimean-Tatar, Yakut

T2C: Abkhazian, Bulgarian, Gagauz, Karelian, Komi-Zyrian, Komi-Permyak, Kumyk, Mansi, Moldavian, Mordvin-Moksha, Mordvin-Erzya, Nanai, Orok (Uilta), Negidal, Nogai, Oroch, Russian, Saam, Old-Bulgarian, Old-Russian, Tati, Teleut, Hanty-Obskii, Hanty-Surgut, Evenk, Crimean-Tatar

The basic LaTeX distribution comes with all the encoding and font definition files for handling Cyrillic. The babel package includes support for Bulgarian, Russian, and Ukrainian. Together with the font files (to be installed separately), LaTeX can use this package to provide complete support for typesetting languages based on the Cyrillic alphabet.

Running MakeIndex and BibTeX

Recognizing that standard MakeIndex and BibTeX programs cannot handle 8-bit input encodings natively, the T2 bundle comes with utilities to allow Cyrillic 8-bit input to be handled correctly by those programs.

For indexes, rumakeindex is a wrapper for MakeIndex that creates a properly sorted index when Cyrillic letters are used in the entries. Use of the rumakeindex utility also requires the sed program.1 The utility should be run instead of standard MakeIndex when you are creating an index containing Cyrillic characters. Note that the rumakeindex script on UN*X uses the koi8-r encoding, whereas the corresponding batch file on MS-DOS, rumkidxd.bat, uses the cp866 encoding, and the batch file on MS Windows, rumkidxw.bat, uses the cp1251 encoding. If a different encoding is needed, changes have to be introduced in the relevant files. Alternatively, you might consider using xindy, a newer index preparation program, which is described in Section 11.3.

1 Available on any UN*X system; a GNU version for Microsoft operating systems on PCs is also distributed (e.g., at http://www.simtel.net).

For bibliographic references, rubibtex is a wrapper for BibTeX that produces Cyrillic letters in item names, which correspond to the reference keys when a BibTeX bibliographic database is used. You should also install the citehack package from the T2 bundle in that case. Moreover, the installed version of the BibTeX program should be able to handle 8-bit input (e.g., the BibTeX8 program described in Section 13.1.1). As in the case of MakeIndex described above, the rubibtex script and batch files also require the sed program.

Note that the rubibtex script on UN*X uses the koi8-r encoding, whereas the corresponding batch file on MS-DOS, rubibtex.bat, uses the cp866 encoding. When another encoding is needed, changes should be introduced in the relevant files.

9.4.2. The Greek alphabet

Greek support in babel comes in two variants: the one-accent monotoniko (the default), which is used in most cases in everyday communications in Greece today, and the multi-accent polutoniko, which has to be specified as an attribute, as explained in Section 9.2.3.

The first family of Greek fonts for TeX was created during the mid-1980s by Silvio Levy [114]. Other developers improved or extended these fonts, or developed their own Greek fonts.

In babel the Greek language support is based on the work of Claudio Beccari in collaboration with Apostolos Syropoulos, who developed the Greek cb font family [12]. In their paper these authors discuss in some detail previous efforts to support the Greek language with TeX. The sources of the cb fonts are available on CTAN in the directory languages/greek/cb or on the TeX Live CD in the directory texmf/fonts/source/public/cbgreek. Hyphenation patterns corresponding to this font family are found in the file grhyph.tex or grphyph.tex in the same directory on CTAN and in texmf/tex/generic/hyphen on TeX Live.

The cb fonts use the LGR font encoding. At the time of this book’s writing, work was under way to design a font encoding that is compatible with LaTeX’s standards. When it is ready, it will become the T7 encoding. Table 9.7 on the next page shows the layout of the complete LGR encoding.

Image
Image

Table 9.7. Glyph chart for an LGR-encoded font (grmn1000)

It is possible to use Latin alphabetic characters for inputting Greek according to the transliteration scheme shown in Table 9.8 on page 576. This table shows that the Latin “v” character has no direct equivalent in the Greek transcription. In fact, it is used to indicate that one does not want a final sigma. For example, “sv” generates a median form sigma although it occurs in a final position.

9-4-1
Image

Table 9.8. Greek transliteration with Latin letters for the LGR encoding
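
As a rough illustration of this input method, the following sketch assumes an installation in which the cb fonts and Greek hyphenation patterns are available. The sample word is our own choice and is not taken from the tables.

```latex
\documentclass{article}
\usepackage[greek,english]{babel} % the last option (english) is the main language
\begin{document}
% Greek entered with Latin letters; ' places an accent on the following vowel
\foreignlanguage{greek}{l'ogos}

% "sv" requests a non-final sigma even at the end of the word
\foreignlanguage{greek}{l'ogosv}
\end{document}
```

The first line should produce the Greek word for "word" with an accented omicron and an automatically selected final sigma; the second differs only in the shape of the last letter.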

By default, the greek option of babel will use monotoniko Greek. Multi-accented mode is requested by specifying the language attribute polutoniko for the greek option:

Image
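
A minimal sketch of a preamble that requests multi-accented Greek is shown below. It assumes that the attribute is set with babel's \languageattribute declaration (see Section 9.2.3) and that the cb fonts are installed; the sample text is our own.

```latex
\documentclass{article}
\usepackage[greek]{babel}
\languageattribute{greek}{polutoniko} % switch from monotoniko to polutoniko
\begin{document}
l'ogos
\end{document}
```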

For both modes, some seldom-used characters have been defined to behave like letters (catcode 11). For monotoniko Greek, this is the case for the characters ' and ". In the polutoniko variant, the characters <, >, ~, ', and | also behave like letters. The reason for this behavior is that the LGR encoding contains many ligatures with these characters to produce the right glyphs; see Table 9.9 on page 576. Table 9.10 shows the available composite accent and spiritus combinations.

9-4-2
Image

Table 9.9. LGR ligatures producing single-accented glyphs

9-4-3
Image

Table 9.10. Available composite spiritus and accent combinations

9.4.3. The Hebrew alphabet

The first support for Hebrew that became part of the babel distribution was developed by Boris Lavva and Alon Ziv, based on earlier work that offered support for typesetting Hebrew texts with LaTeX 2.09 and TeX--XeT. This support was developed further by these two authors and Rama Porrat. At the time of writing Tzafrir Cohen has started a sourceforge project called “ivritex” (http://ivritex.sf.net) to extend the work even more.

The current support for typesetting Hebrew is based on fonts from the Hebrew University of Jerusalem. These fonts have a particular 7-bit encoding for which the Local Hebrew encoding (LHE) has been developed. Figure 9.1 used the Jerusalem font; in Table 9.11 on the following page the encoding of these fonts is shown. The support in babel uses the Jerusalem font as the regular font, Old Jaffa for a font with an italic shape, and the Dead Sea font for typesetting bold letters. When a sans serif font is needed, the Tel Aviv font is used; it is also deployed as a replacement for a typewriter font.

9-4-4
Image

Figure 9.1. A Hebrew document

Image

Table 9.11. Glyph chart for an LHE-encoded font (shold10)

As an alternative to these fonts, two other (copyrighted, but freely available on CTAN) fonts are supported: Hclassic is a “modernized Classical Hebrew” font; Hcaption is a slanted version of it. Furthermore, three shalom fonts are available: ShalomScript10 contains handwritten Hebrew letters; ShalomStick10 contains sans serif letters; and ShalomOldStyle10 contains old-style letters. Yet another available family of fonts are the Frank Ruehl fonts, which come in regular, bold extended, and slanted shapes. The Carmel font family offers regular and slanted shapes and was designed for headers and emphasized text. The Redis family comes with regular, slanted, and bold extended shapes. For all supported font families, the package hebfont defines commands to select them. These commands are shown in Table 9.12 on the next page.

9-4-5
Image

Table 9.12. Hebrew font-changing commands

A few input encodings are available as part of the support for Hebrew. They are not automatically provided with the inputenc distribution.

si960 This 7-bit Hebrew encoding uses ASCII character positions 32–127. Also known as “oldcode”, it is defined by Israeli standard SI-960.

8859-8 This 8-bit mixed Hebrew and Latin encoding is also known as “newcode”. It is defined by the standard ISO 8859-8.

cp862 This IBM code page is commonly used by MS-DOS on IBM-compatible personal computers. It is also known as “pccode”.

cp1255 The MS Windows 1255 (Hebrew) code page resembles ISO 8859-8. In addition to Hebrew letters, this encoding contains vowels and dots (nikud).
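
As a sketch, a preamble that selects one of these encodings could look as follows. It assumes that the encoding definition files shipped with the Hebrew support are installed where inputenc can find them; the choice of 8859-8 is only an example.

```latex
\documentclass{article}
% Alternatives from the list above: si960, cp862, cp1255
\usepackage[8859-8]{inputenc}
\usepackage[english,hebrew]{babel} % hebrew, given last, is the main language
```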

9.5. Tailoring babel

This section explains some of the commands that are made available by the core babel package to construct language definition files (which are usually loaded when a language option is requested). Section 9.5.3 then looks in some detail at the template file language.skeleton, which can be used as a basis to provide support for additional languages.

Language definition files (file extension .ldf) have to conform to a number of conventions, since they complement the common shared code of babel provided in the file babel.def for producing language-dependent text strings. Similarly, to allow for language switching like the capability built into babel, certain rules apply. The basic working assumptions follow.

• Each language definition file <language>.ldf must define five macros, which are subsequently used to activate and deactivate the language-specific definitions. These macros are \<language>hyphenmins, \captions<language>, \date<language>, \extras<language>, and \noextras<language>, where <language> is either the name of the language definition file or the name of a babel package option. These macros and their functions are discussed below.

• When a language definition file is loaded, it can define \l@<language> to be a variant (dialect) of \language0 when \l@<language> is undefined.

• The language definition files must be written in a way that they can be read not just in the preamble of the document, but also in the middle of document processing.

9.5.1. Hyphenating in several languages

Since TeX version 3.0, hyphenation patterns for multiple languages can be used together. These patterns have to be administered somehow. In particular, the plain TeX user has to know for which languages patterns have been loaded, and to what values of the command sequence \language they correspond. The babel package abstracts from this low-level interface and manages this information by using an external file, language.dat, in which one records which languages have hyphenation patterns and in which files these patterns are stored. This configuration file is then processed1 when INITeX is run to generate a new LaTeX format. An example of this file is shown here:

1 Make sure that you do not have several such files in your TeX installation, because it is not always clear which of them will be examined during the format generation. The authors nearly got bitten during the book production when INITeX picked up the system configuration file and not the specially prepared one containing all the patterns for the examples.

Image

This configuration file language.dat can contain empty lines and comments, as well as lines that start with an equals (=) sign. Such a line will instruct LaTeX that the hyphenation patterns just processed will be known under an alternative name. The first element on each line specifies the name of the language; it is followed by the name of the file containing the hyphenation patterns. An optional third entry can specify a hyphenation exception file in case the exceptions are stored in a separate file (e.g., frhyphx.tex in the previous example).
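
The format just described can be sketched as follows. The language names follow the list discussed in this section and frhyphx.tex is the exception file mentioned above; all other pattern file names and the alias USenglish are assumptions made for illustration and may differ in your distribution.

```text
% language.dat (illustrative sketch; file names are assumptions)
% language   pattern file   [exception file]
english      hyphen.tex
=USenglish
russian      ruhyphen.tex
french       frhyph.tex     frhyphx.tex
UKenglish    ukhyphen.tex
german       dehypht.tex
dumylang     dumyhyph.tex
```

The line starting with = makes the patterns loaded on the preceding line available under a second name without loading anything new.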

For each language in language.dat, the command \l@<language> is defined in the LaTeX format (i.e., \l@english and so on). When the document is processed with such a format, babel checks for each language whether the command \l@<language> is defined and, if so, it loads the corresponding hyphenation patterns; otherwise, it loads the patterns for the default language 0 (the one loaded first by INITeX); for compatibility reasons this language should contain US-English hyphenation patterns.

Image

Seven “languages” are loaded into the format, as defined in the language.dat file: english (0), russian (1), french (2), UKenglish (3), german (4), dumylang (5), and nohyphenation (6; implicitly defined with no hyphenation tries). Babel uses these text strings (or their equivalents, specified on lines preceded by an = sign in language.dat) to identify a language.

If language.dat cannot be opened for reading during the INITeX run, babel will attempt to use the default hyphenation file hyphen.tex instead. It informs the user in this event.
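
To find out whether a format contains patterns for a given language, one can test for the corresponding \l@<language> command before babel is loaded. The following is a small diagnostic sketch for a classic TeX engine with preloaded patterns; the messages are our own wording.

```latex
\documentclass{article}
\makeatletter
% Test before loading babel, because a language definition file may
% define \l@french itself as a "dialect" of language 0.
\@ifundefined{l@french}
  {\typeout{No French patterns are preloaded in this format}}
  {\typeout{French patterns are stored as language \number\l@french}}
\makeatother
\usepackage[french,english]{babel}
\begin{document}
Test.
\end{document}
```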

9.5.2. The package file

To help make use of the features of LaTeX, the babel package contains a package file called babel.sty. This file is loaded by the \usepackage command and defines all the language options supported by babel (see Table 9.1 on page 543). It also takes care of a number of compatibility issues with other packages. Local customization for babel can be entered in the configuration file bblopts.cfg, which is read at the end of babel.sty.

Apart from the language options listed in Table 9.1 on page 543, babel pre-declares a few options that can influence the behavior of language definition files. For instance, activeacute and activegrave by default do nothing, but they are used with, for instance, Catalan (catalan.ldf) to activate the acute and grave accents when the relevant options are specified.

A third option, KeepShorthandsActive, instructs babel to keep shorthand characters active when processing of the package file ends. Note that this is not the default as it can cause problems with other packages. Nevertheless, in some cases, such as when you need to use shorthand characters in the preamble of a document, this option can be useful.
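
As a sketch, these options are passed to babel together with the language options. The combinations shown are only examples.

```latex
\documentclass{article}
% Let catalan.ldf make the acute accent character active:
\usepackage[activeacute,catalan]{babel}
% Alternatively, keep shorthand characters active in the rest of the preamble:
% \usepackage[KeepShorthandsActive,german]{babel}
\begin{document}
Text.
\end{document}
```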

9.5.3. The structure of the babel language definition file

The babel distribution comes with the file language.skeleton, which provides a convenient skeleton for developing one’s own language file to support a new language. It serves as a convenient model to understand how the babel core commands are used. The file is shown here, and the commands used in it are described as they occur.

Throughout language.skeleton, you will find the string “language”; it should be replaced by the name of the language for which you are providing support. If this language is known to have a dialect that needs a slightly different support, you can arrange for this support as well. In such a case, the strings “dialect” should be replaced by the name of the dialect. If your language does not need support for a dialect, you should remove the corresponding lines of code.

Copyright and introduction

The file starts with copyright and license information.

Image

Identification of the language

This is followed by information identifying the file and language.

Image
Image

The command \ProvidesLanguage (line 34) identifies the language definition file. It uses the same syntax as LaTeX’s \ProvidesPackage. For instance, the file welsh.ldf contains the following declaration:

Image

The release-information can be used to indicate that at least this version of babel is required.
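
For a hypothetical language called mylang, such a declaration might look as follows. The name, date, and version string are invented for illustration.

```latex
% First code line of a hypothetical mylang.ldf
\ProvidesLanguage{mylang}[2004/01/01 v0.1 Hypothetical mylang support for babel]
```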

A documentation driver

The next section then sets up a documentation driver to allow for typesetting the file itself using the doc package. See Chapter 14 for details.

Image

Documentation and initialization

The following part starts with the documentation of the features provided by the language definition file. Use the methods described in Chapter 14 for documenting code and providing a short user manual.

Image
Image

The macro \LdfInit (line 83) performs a couple of standard checks that have to be made at the beginning of a language definition file, such as checking the category code of the @ sign and preventing the .ldf file from being processed twice.
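
In a hypothetical mylang.ldf the call could look like this. We assume the two-argument form used in the skeleton file, where the second argument names a macro whose existence signals that the file has already been read.

```latex
% Start of the code part of a hypothetical mylang.ldf
\LdfInit{mylang}{captionsmylang}
```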

Defining language and dialects

Image
Image

The command \adddialect adds the name of a variant (dialect) language \l@<variant>, for which already defined hyphenation patterns can be used (the ones for language <lang>).1 If a language has more than one variant, you can repeat this section as often as necessary.

1 When loading hyphenation patterns with INITeX babel uses the \addlanguage command to declare the various languages specified in language.dat; see Section 9.5.1.

“Dialect” is somewhat of a historical misnomer, as lang and variant are at the same level as far as babel is concerned, without any connotation indicating whether one or the other is the main language. The “dialect” paradigm comes in handy if you want to share hyphenation patterns between various languages. Moreover, if no hyphenation patterns are preloaded in the format for the language lang, babel’s default behavior is to define this language as a “dialect” of the default language (\language0).

For instance, the first line below indicates that for Austrian one can use the hyphenation patterns for German (defined in german.ldf). The second line tells us that Nynorsk shares the hyphenation patterns of Norsk (in norsk.ldf).

Image
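
In code, the two declarations just described, together with the fallback to language 0, can be sketched as follows. The language mylang is hypothetical, and the fragment is meant to appear inside a language definition file, where @ can be used in command names.

```latex
% Austrian reuses the German patterns; Nynorsk reuses those of Norsk
\adddialect\l@austrian\l@german
\adddialect\l@nynorsk\l@norsk

% Fallback for a hypothetical language without preloaded patterns:
% warn the user and treat it as a "dialect" of language 0
\ifx\l@mylang\@undefined
  \@nopatterns{Mylang}
  \adddialect\l@mylang0
\fi
```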

The following example shows how language variants can be obtained using the dialect mechanism, where there can be differences in the names of sectioning elements or for the date.

9-5-1
Image

Defining language attributes

The next part deals with the set-up for language attributes, if necessary.

Image
Image

This command (used on line 109) declares that for the attribute attr in the language lang, the code exec should be executed. For instance, the file greek.ldf defines an attribute polutoniko for the Greek language:

Image

When you load the Greek language with the polutonikogreek option (which is equivalent to setting the attribute polutoniko), Greek will then be typeset with multiple accents (according to the code specified in the third argument).

If you want to define more than one attribute for the current language, repeat this section as often as necessary.
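
The following sketch shows what an attribute declaration might look like, assuming that the command in question is babel's \bbl@declare@ttribute. The language mylang, the attribute formal, and the macro \captionsformalmylang are hypothetical; the latter would have to be defined elsewhere in the same file.

```latex
% Fragment of a hypothetical mylang.ldf
\bbl@declare@ttribute{mylang}{formal}{%
  % Executed when the attribute is requested, for example with
  % \languageattribute{mylang}{formal} in the document preamble
  \let\captionsmylang\captionsformalmylang
}
```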

Adjusting hyphenation patterns

Now we deal with the minimum number of characters required to the left and right of hyphenation points.

Image
Image

The command \providehyphenmins (line 124) provides a default setting for the hyphenation parameters \lefthyphenmin (minimum number of characters on the left before the first hyphen point) and \righthyphenmin (minimum number on the right) for the language lang, by defining \<language>hyphenmins unless it is already defined for some reason. The babel package detects whether the hyphenation file explicitly sets \lefthyphenmin and \righthyphenmin and automatically defines \<language>hyphenmins, in which case the \providehyphenmins declaration has no effect.

The syntax inside babel is storage optimized, dating back to the days when every token counted. Thus, the argument hyphenmins contains the values for both parameters simply as two digits, making the assumption that you will never want a minimum larger than 9. If this assumption is wrong, you must surround the values with braces within hyphenmins. For example,

Image

would request that at least 10 characters be left before a hyphen and at least 5 characters after it (thus essentially suppressing hyphenation).

If you want to explicitly overwrite the settings regardless of any existing specification, you can do so by providing a value for \<language>hyphenmins yourself. For instance,

Image

never considers hyphenation points with fewer than four letters before and three letters after the hyphen. Thus, it will never hyphenate a word with fewer than seven characters.

Hyphenation patterns are built with a certain setting of these parameters in mind. Setting their values lower than the values used in the pattern generation will merely result in incorrect hyphenation. It is possible, however, to use higher values in which case the potential hyphenation points are simply reduced.
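
The variants discussed in this subsection can be summarized in the following sketch for a hypothetical language mylang. The numeric values are arbitrary examples.

```latex
% Fragment of a hypothetical mylang.ldf

% Default: 2 letters before and 3 after a hyphen, used only if
% \mylanghyphenmins has not been defined already
\providehyphenmins{mylang}{23}

% Values larger than 9 must be wrapped in braces:
% \providehyphenmins{mylang}{{10}{5}}

% Unconditional setting that replaces any earlier value:
% \def\mylanghyphenmins{43}
```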

Translations for language-dependent strings

The translations for language-dependent strings are set up next.

Image
Image

The macro \captions<language> (line 132) defines the macros that hold the translations for the language-dependent strings used in LaTeX for the language <language>. It must also be provided for each dialect being set up. If the dialect uses the same translation, \let can be used (as shown in line 138). Otherwise, you have to provide a full definition.

Image
Image

The macro \date<language> (line 146) defines the text string for the \today command for the language <language> being defined in a .ldf file.
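
A shortened sketch of both macros for a hypothetical language mylang with a dialect mydialect follows. The English strings merely stand in for real translations, and only three of the caption macros are shown.

```latex
% Fragment of a hypothetical mylang.ldf
\addto\captionsmylang{%
  \def\abstractname{Abstract}%
  \def\contentsname{Contents}%
  \def\chaptername{Chapter}%
}
% The dialect shares the same strings
\let\captionsmydialect\captionsmylang

\def\datemylang{%
  \def\today{\number\day~\ifcase\month\or
    January\or February\or March\or April\or May\or June\or
    July\or August\or September\or October\or November\or December\fi
    \space\number\year}%
}
```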

Providing extra features

For some languages (or dialects), extra definitions have to be provided. This is done in the next section.

Image
Image

The macro \extras<language> (line 170) contains all extra definitions needed for the language <language> being defined in a .ldf file. Such extras can be commands to turn shorthands on or off, to make certain characters active, to initiate French spacing, to position umlauts, and so on.

Image

To allow switching between any two languages, it is necessary to return to a known state for the TeX engine, in particular with respect to the definitions initiated by the command \extras<language>. The macro \noextras<language> (line 171) must contain code to revert all such definitions so as to bring TeX back to a known state.
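
As an illustration of such a pair, the following sketch switches French spacing on when the hypothetical language mylang is selected and off again when it is left. It assumes the helper macros \bbl@frenchspacing and \bbl@nonfrenchspacing from the babel core.

```latex
% Fragment of a hypothetical mylang.ldf
\addto\extrasmylang{\bbl@frenchspacing}      % applied when mylang is selected
\addto\noextrasmylang{\bbl@nonfrenchspacing} % undone when another language takes over
```

Every change made in \extrasmylang should have a counterpart of this kind in \noextrasmylang.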

Clean up and finish

The file finishes with the following lines of code.

Image
Image

The macro \ldf@finish (line 192) performs a couple of tasks that are necessary at the end of each .ldf file. The argument lang is the name of the language as it is defined in the language definition file. The macro starts by verifying whether the system contains a file lang.cfg, that is, a file with the same name as the language definition file, but with the extension .cfg. This file can be used to add site-specific actions to a language definition file, such as adding strings to \captions<language> to support local document classes, or activating or deactivating shorthands for acute or grave accents. In particular, the babel distribution for French written by Daniel Flipo comes with a file frenchb.cfg that contains a few (commented-out) supplementary definitions for typesetting French that can be activated (uncommented) by the user if they appear to be useful. Other tasks performed by the macro include resetting the category code of the @ sign, and preparing the language to be activated at the beginning of the document.
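
For a hypothetical language mylang, the end of the definition file and a matching configuration file might look as sketched below. The macro \examplename is an invented caption that a local document class is assumed to use.

```latex
% Last code line of a hypothetical mylang.ldf
\ldf@finish{mylang}

% Contents of an optional mylang.cfg, read by \ldf@finish if it exists:
% \addto\captionsmylang{\def\examplename{Example}}
```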

Adding definitions to babel’s data structures

On various lines (114, 170, 171), the command \addto was used to extend one of the babel data structures holding translations or code for a certain language.

Image

This command extends the definition of the control sequence csname with the TeX code specified in code. The control sequence csname does not have to have been defined previously. As an example, the following lines are taken from the file russianb.ldf, where code is added to the commands \captionsrussian, \extrasrussian, and \noextrasrussian.

Image
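
The same command is also handy at the document level. The following sketch changes one of the English strings from the preamble; the replacement text is our own example.

```latex
\documentclass{article}
\usepackage[english]{babel}
% Extend \captionsenglish so that the change is reapplied at every language switch
\addto\captionsenglish{\renewcommand\contentsname{Table of Contents}}
\begin{document}
\tableofcontents
\section{Introduction}
Text.
\end{document}
```

A plain \renewcommand\contentsname in the preamble would not be sufficient here, because babel redefines the string when the language is activated at the beginning of the document.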

Language-level commands for shorthands

Shorthands on the language or system level are set up in the language definition files. An incomplete example of this process was given in the previous section. In this section we describe all commands and declarations that can be used for this purpose.

Image

This macro can be used in language definition files to turn the character char into a “shorthand character”. When the character is already defined to be a shorthand character, this macro does nothing. Otherwise, it defines the control sequence \normal@char<char> to expand to the character char in its “normal state” and it defines the active character to expand to \normal@char<char> by default. Subsequently, its definition can be changed to expand to \active@char<char> by calling \bbl@activate{<char>}. When a character has been made active, it will remain active until deactivated or until the end of the document is reached. Its definition can be changed at any time during the typesetting stage of the document.

For example, several language definition files make the double quote character active with the following statement:

Image

For French the configuration file frenchb.cfg defines two-character shorthands:

Image
Image

The command \bbl@activate “switches on” the active behavior of the character char by changing its definition to expand to \active@char<char> (instead of \normal@char<char>). Conversely, the command \bbl@deactivate lets the active character char expand to \normal@char<char>. This command does not change the catcode of the character, which stays active.
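
Taken together, these commands are typically used as in the following sketch for a hypothetical language mylang. It assumes that the first macro described above is \initiate@active@char and that \languageshorthands selects the set of shorthands to use.

```latex
% Fragment of a hypothetical mylang.ldf
\initiate@active@char{"}          % make " a shorthand character
\addto\extrasmylang{%
  \languageshorthands{mylang}%    % use the shorthands declared for mylang
  \bbl@activate{"}}               % " now expands to its active definition
\addto\noextrasmylang{%
  \bbl@deactivate{"}}             % " expands to its normal definition again
```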

Image

Recognizing that some shorthands declared in the language definition files have to be usable in both text and math modes, this macro allows you to specify the code to execute when in text mode (text-code) or when in math mode (math-code). As explained on page 446, providing commands for use in text and math can have unwanted side effects, so this macro should be used with great care.

Image

When LaTeX cannot hyphenate a word properly by itself (for instance, because it is a compound word or because the word contains accented letters constructed using the \accent primitive) it needs a little help. This help involves making LaTeX think it is dealing with two words, which appear as one word on the page. For this purpose babel provides the command \allowhyphens, which inserts an invisible horizontal skip, unless the current font encoding is T1.1 In some cases one wants to insert this “help” unconditionally; for these cases \bbl@allowhyphens is available. This invisible skip has the effect of making LaTeX think it is dealing with two words that can be hyphenated separately.

1 In contrast to the OT1 encoding, the T1 encoding contains most accented characters as real glyphs so that the \accent primitive is almost never used.

Image

The macro \declare@shorthand defines shorthands to facilitate entering text in the given language. The first argument, name, specifies the name of the collection of shorthands to which the definition belongs. The second argument, charseq, consists of one or more characters that correspond to the shorthand being defined. The third argument, exec, contains the code to be executed when the shorthand is encountered in the document. A few examples from various language definition files follow.

Image

The latter two instructions are found in the file frenchb.ldf, where the first handles the case where the ; character is active and the third argument provides code for ensuring that a thin space is inserted before “high” punctuation (;, :, !, and ?). The last command deals with the case where these French punctuation rules are inactivated (note that these four punctuation characters are made active in frenchb.ldf).
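
The commands of this section can be combined as in the following sketch for a hypothetical language mylang, modeled loosely on the German support. We assume that the two-argument text/math macro described earlier is \textormath.

```latex
% Fragment of a hypothetical mylang.ldf

% "a gives an umlauted a: in text an accented letter that does not block
% hyphenation of the rest of the word, in math a double-dot accent
\declare@shorthand{mylang}{"a}{\textormath{\"{a}\allowhyphens}{\ddot a}}

% "- gives a discretionary hyphen after which hyphenation stays possible
\declare@shorthand{mylang}{"-}{\nobreak\-\bbl@allowhyphens}
```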

9.6. Other approaches

In general, the babel package does a good job of translating document element names and making text input somewhat more convenient. However, for several languages, individuals or local user groups have developed packages and versions of TeX that cope with a given language on a deeper level—in particular, by better integrating the typographic traditions of the target language.

An example of such a package is french [51, 66], which was developed by Bernard Gaulle. Special customized versions of (La)TeX exist (e.g., Polish and Czech, distributed by the TeX user groups GUST and CSTUG, respectively).

9.6.1. More complex languages

In the world of non-Latin alphabets, one more level of complexity is added when one wants to treat the Arabic or Hebrew [140] languages. Not only are they typeset from right to left, but, in the case of Arabic, the letter shapes change according to their positions in a word.

Several systems to handle Hebrew are available on CTAN (language/hebrew). In particular, babel offers an interface for Hebrew written by Boris Lavva. For Arabic there is the ArabTeX system [102], developed by Klaus Lagally. This package extends the capabilities of (La)TeX to generate Arabic writing using an ASCII transliteration (CTAN nonfree/language/arabtex).

Serguei Dachian, Arnak Dalalyan, and Vardan Hakobian provide Armenian support (CTAN language/armtex).

For the languages of the Indian subcontinent, most of the support is based on the work of Frans Velthuis. In particular, recently Anshuman Pandey developed packages for Bengali (bengali package and associated fonts on CTAN language/bengali/pandey), Sanskrit (Anshuman Pandey’s devnag package on CTAN language/devanagari/velthuis), and Gurmukhi (CTAN language/gurmukhi/pandey).

Oliver Corff and Dorjpalam Dorj’s manjutex package can be used for typesetting languages using the Manju (Mongolian) scripts (CTAN language/manju/manjutex).

Ethiopian language support, compatible with babel, is available through Berhanu Beyene, Manfred Kudlek, Olaf Kummer, and Jochen Metzinger’s ethiop package and fonts (CTAN language/ethiopia/ethiop).

For Chinese, Japanese, and Korean (the so-called CJK scripts), one can use Werner Lemberg’s cjk package [113], which contains fonts and utilities (CTAN language/chinese/CJK).

9.6.2. Omega

No discussion of multilingual typesetting would be complete without mentioning Omega [137], an extension of TeX developed by Yannis Haralambous and John Plaice. Omega’s declared aim is to improve on TeX’s multilingual typesetting abilities by making significant changes to the executable TeX, the Program. It potentially provides far simpler solutions in many of the areas addressed by babel by offering the following features:

• Omega can be used to read text files in any encoding (8-bit, 16-bit, or more).

• Omega handles shorthands internally by applying specified transformations to recognized sequences of input characters.

• Omega has an internal structure that is far more flexible than that of TeX for handling large sets of characters and large fonts.

• Omega supports many different types of script and all writing directions used for present-day scripts.

These enhancements to the TeX typesetting paradigm will make it easier to typeset a range of languages, such as Arabic, Bantu, Basque, Georgian, Hindi, Khmer, Chinese, Cree, or Mongolian, all within the same document! It is also hoped (as of the end of 2003) that enhancements to LaTeX will soon appear to support these new facilities, thus providing a fully multilingual LaTeX system.
