Program Encoding

At the lowest level, a Ruby program is simply a sequence of characters. Ruby’s lexical rules are defined using characters of the ASCII character set. Comments begin with the # character (ASCII code 35), for example, and allowed whitespace characters are horizontal tab (ASCII 9), newline (10), vertical tab (11), form feed (12), carriage return (13), and space (32). All Ruby keywords are written using ASCII characters, and all operators and other punctuation are drawn from the ASCII character set.

By default, the Ruby interpreter assumes that Ruby source code is encoded in ASCII. This is not required, however; the interpreter can also process files that use other encodings, as long as those encodings can represent the full set of ASCII characters. In order for the Ruby interpreter to be able to interpret the bytes of a source file as characters, it must know what encoding to use. Ruby files can identify their own encodings or you can tell the interpreter how they are encoded. Doing so is explained shortly.

The Ruby interpreter is actually quite flexible about the characters that appear in a Ruby program. Certain ASCII characters have specific meanings, and certain ASCII characters are not allowed in identifiers, but beyond that, a Ruby program may contain any characters allowed by the encoding. We explained earlier that identifiers may contain characters outside of the ASCII character set. The same is true for comments and string and regular expression literals: they may contain any characters other than the delimiter character that marks the end of the comment or literal. In ASCII-encoded files, strings may include arbitrary bytes, including those that represent nonprinting control characters. (Using raw bytes like this is not recommended, however; Ruby string literals support escape sequences so that arbitrary characters can be included by numeric code instead.) If the file is written using the UTF-8 encoding, then comments, strings, and regular expressions may include arbitrary Unicode characters. If the file is encoded using the Japanese SJIS or EUC encodings, then strings may include Kanji characters.
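As a minimal sketch of the escape-sequence approach recommended above, the following snippet builds the same characters by numeric code rather than by embedding raw bytes in the source. The constant names are hypothetical and chosen only for illustration; the \u escape assumes Ruby 1.9 or later.

```ruby
# Each constant holds a horizontal tab (ASCII 9), written three different ways.
TAB_NAMED = "\t"      # named escape
TAB_HEX   = "\x09"    # hexadecimal escape
TAB_OCTAL = "\011"    # octal escape

# A Unicode code point escape; the resulting string is encoded in UTF-8
# even when the source file itself contains only ASCII characters.
E_ACUTE = "\u00E9"

puts TAB_NAMED == TAB_HEX && TAB_HEX == TAB_OCTAL
puts E_ACUTE.encoding
```

Because every character in this file is plain ASCII, the file can be saved and run without any encoding declaration.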

Specifying Program Encoding

By default, the Ruby interpreter assumes that programs are encoded in ASCII. In Ruby 1.8, you can specify a different encoding with the -K command-line option. To run a Ruby program that includes Unicode characters encoded in UTF-8, invoke the interpreter with the -Ku option. Programs that include Japanese characters in EUC-JP or SJIS encodings can be run with the -Ke and -Ks options.

Ruby 1.9 also supports the -K option, but it is no longer the preferred way to specify the encoding of a program file. Rather than have the user of a script specify the encoding when they invoke Ruby, the author of the script can specify the encoding of the script by placing a special “coding comment” at the start of the file.[2] For example:

# coding: utf-8

The comment must be written entirely in ASCII, and must include the string coding followed by a colon or equals sign and the name of the desired encoding (which cannot include spaces or punctuation other than hyphen and underscore). Whitespace is allowed on either side of the colon or equals sign, and the string coding may have any prefix, such as en to spell encoding. The entire comment, including coding and the encoding name, is case-insensitive and can be written with upper- or lowercase letters.
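The rule just described can be approximated with a regular expression. This is only a sketch of the prose rule, not the interpreter's actual parser; the constant CODING_COMMENT and the method declared_encoding are hypothetical names introduced for illustration.

```ruby
# Approximation of the rule: the string "coding" (with any prefix), optional
# whitespace, a colon or equals sign, optional whitespace, then an encoding
# name made of letters, digits, hyphens, and underscores. Case-insensitive.
CODING_COMMENT = /coding\s*[:=]\s*([\w-]+)/i

# Returns the encoding name declared in a comment line, or nil if the line
# is not a comment or declares no encoding.
def declared_encoding(line)
  return nil unless line.start_with?("#")
  match = CODING_COMMENT.match(line)
  match && match[1]
end

puts declared_encoding("# coding: utf-8")
puts declared_encoding("# ENCODING = EUC-JP")
```

The same pattern also picks out the encoding name from editor-oriented comments, because it ignores whatever text surrounds the coding declaration.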

Encoding comments are usually written so that they also inform a text editor of the file encoding. Emacs users might write:

# -*- coding: utf-8 -*-

And vi users can write:

# vi: set fileencoding=utf-8 :

An encoding comment like this one is normally valid only on the first line of the file. It may appear on the second line, however, if the first line is a shebang comment (which makes a script executable on Unix-like operating systems):

#!/usr/bin/ruby -w
# coding: utf-8

Encoding names are not case-sensitive and may be written in uppercase, lowercase, or a mix. Ruby 1.9 supports at least the following source encodings: ASCII-8BIT (also known as BINARY), US-ASCII (7-bit ASCII), the European encodings ISO-8859-1 through ISO-8859-15, the Unicode encoding UTF-8, and the Japanese encodings SHIFT_JIS (also known as SJIS) and EUC-JP. Your build or distribution of Ruby may support additional encodings as well.
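One way to check how a name is spelled, or whether your build recognizes it, is to look it up with Encoding.find, which returns the matching Encoding object or raises ArgumentError for an unknown name. The helper same_encoding? below is a hypothetical name used only for this sketch.

```ruby
# True if both names resolve to the same Encoding object.
# Encoding.find ignores case and understands aliases such as BINARY.
def same_encoding?(first_name, second_name)
  Encoding.find(first_name) == Encoding.find(second_name)
end

puts same_encoding?("utf-8", "UTF-8")
puts same_encoding?("binary", "ASCII-8BIT")
puts Encoding.name_list.length   # number of names and aliases in this build
```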

As a special case, UTF-8-encoded files identify their encoding if the first three bytes of the file are 0xEF 0xBB 0xBF. These bytes are known as the BOM or “Byte Order Mark” and are optional in UTF-8-encoded files. (Certain Windows programs add these bytes when saving Unicode files.)
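A check for these three bytes might be sketched as follows. The constant UTF8_BOM and the method utf8_bom? are hypothetical names; the sketch inspects a string of file contents rather than reproducing what the interpreter does internally.

```ruby
# The three bytes that make up the UTF-8 Byte Order Mark.
UTF8_BOM = [0xEF, 0xBB, 0xBF].freeze

# True if the given string begins with the UTF-8 BOM.
def utf8_bom?(data)
  data.bytes.first(3) == UTF8_BOM
end

with_bom    = [0xEF, 0xBB, 0xBF].pack("C*") + "puts 1"
without_bom = "puts 1"

puts utf8_bom?(with_bom)
puts utf8_bom?(without_bom)
```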

In Ruby 1.9, the language keyword __ENCODING__ (there are two underscores at the beginning and at the end) evaluates to the source encoding of the currently executing code. The resulting value is an Encoding object. (See the section “The Encoding class” for details.)
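The following sketch reports the source encoding alongside the encoding of a plain string literal in the same file. The method name source_encoding_report is hypothetical, and the values printed depend on the file's coding comment (or on the interpreter's default when there is none).

```ruby
# Returns the source encoding of this file and the encoding of a simple
# string literal written in it. A literal without escapes takes the
# source encoding of the file in which it appears.
def source_encoding_report
  {
    :source  => __ENCODING__,
    :literal => "abc".encoding
  }
end

report = source_encoding_report
puts report[:source].name
puts report[:literal] == report[:source]
```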

Source, External, and Internal Encodings

In Ruby 1.9, it is important to understand the difference between the source encoding of a single Ruby file and the default external and default internal encodings of the entire Ruby process. The source encoding is what we described earlier: it tells the Ruby interpreter how to read characters in a script. Source encodings are typically set with coding comments. A Ruby program may consist of more than one file, and different files may have different source encodings. The source encoding of a file affects the encoding of the string literals in that file. For more about the encoding of strings, see String Encodings and Multibyte Characters.

The default external encoding is something different: this is the encoding that Ruby uses by default when reading from files and streams. The default external encoding is global to the Ruby process and does not change from file to file. Normally, the default external encoding is derived from your computer's locale settings. But you can also explicitly specify the default external encoding with command-line options, as we’ll describe shortly. The default external encoding does not affect the encoding of string literals, but it is quite important for I/O, as we’ll see in Streams and Encodings.

When a Ruby program reads text from a file or network socket, it normally leaves the text in its native encoding. If you prefer to have all text automatically transcoded to a single common encoding, you can specify a default internal encoding using the command-line options described below. See Streams and Encodings for more details.
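To make the distinction concrete, the sketch below reads the same bytes twice: once leaving the text in its native encoding, and once transcoding it on the way in. It specifies the encodings per stream in the file mode string rather than through the process-wide defaults, and it assumes no default internal encoding has been set. The helper read_with_mode is a hypothetical name.

```ruby
require "tempfile"

# Writes raw bytes to a temporary file, then reads them back using the
# given mode string ("r:external" or "r:external:internal").
def read_with_mode(bytes, mode)
  Tempfile.create("encoding-demo") do |file|
    file.binmode
    file.write(bytes)
    file.flush
    File.open(file.path, mode) { |io| io.read }
  end
end

# The word "cafe" with an accented e, encoded in ISO-8859-1 (4 bytes).
latin1_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")

native     = read_with_mode(latin1_bytes, "r:iso-8859-1")
transcoded = read_with_mode(latin1_bytes, "r:iso-8859-1:utf-8")

puts native.encoding       # text left in its native encoding
puts transcoded.encoding   # text transcoded to the internal encoding
```

A default internal encoding applies the same kind of transcoding to every stream that does not specify its own.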

We described the -K interpreter option earlier as a way to set the source encoding. In fact, what this option really does is set the default external encoding of the process and then use that encoding as the default source encoding.

In Ruby 1.9, the -K option exists for compatibility with Ruby 1.8 but is not the preferred way to set the default external encoding. Two new options, -E and --encoding, allow you to set both the default external and the default internal encoding and to specify an encoding by its full name rather than by a one-character abbreviation. For example:

ruby -E utf-8            # Default external encoding name follows -E
ruby -Eutf-8             # The space is optional
ruby -E utf-8:binary     # Specify external and internal encodings
ruby -E :sjis            # Specify default internal encoding only
ruby --encoding utf-8    # --encoding is just like -E
ruby --encoding=utf-8    # Or use an equals sign with --encoding

The -U (for Unicode) option specifies a default internal encoding of UTF-8. It is a shortcut for -E:utf-8. See Invoking the Ruby Interpreter for complete details on these interpreter command-line options.

You can query the default external and default internal encodings with Encoding.default_external and Encoding.default_internal. These class methods return an Encoding object, except that Encoding.default_internal returns nil when no default internal encoding has been set. Use Encoding.locale_charmap to obtain the name (as a string) of the character encoding derived from the locale. This method is always based on the locale setting and ignores command-line options that override the default external encoding.
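A short sketch of these queries follows. The method name encoding_defaults is hypothetical, and the values it reports depend on your locale and on any command-line options passed to the interpreter.

```ruby
# Collects the process-wide encoding settings described above.
def encoding_defaults
  {
    :external => Encoding.default_external,  # an Encoding object
    :internal => Encoding.default_internal,  # an Encoding object, or nil if unset
    :locale   => Encoding.locale_charmap     # name derived from the locale
  }
end

encoding_defaults.each do |setting, value|
  puts "#{setting}: #{value.inspect}"
end
```

Running the script once normally and once with an option such as -E or -U is a quick way to see which values those options change.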



[2] Ruby follows Python’s conventions in this; see http://www.python.org/dev/peps/pep-0263/.
