Text

Text is represented in Ruby by objects of the String class. Strings are mutable objects, and the String class defines a powerful set of operators and methods for extracting substrings, inserting and deleting text, searching, replacing, and so on. Ruby provides a number of ways to express string literals in your programs, and some of them support a powerful string interpolation syntax by which the values of arbitrary Ruby expressions can be substituted into string literals. The sections that follow explain string and character literals and string operators. The full string API is covered in Strings.

Textual patterns are represented in Ruby as Regexp objects, and Ruby defines a syntax for including regular expressions literally in your programs. The code /[a-z]\d+/, for example, represents a single lowercase letter followed by one or more digits. Regular expressions are a commonly used feature of Ruby, but regexps are not a fundamental datatype in the way that numbers, strings, and arrays are. See Regular Expressions for documentation of regular expression syntax and the Regexp API.

String Literals

Ruby provides quite a few ways to embed strings literally into your programs.

Single-quoted string literals

The simplest string literals are enclosed in single quotes (the apostrophe character). The text within the quote marks is the value of the string:

'This is a simple Ruby string literal'

If you need to place an apostrophe within a single-quoted string literal, precede it with a backslash so that the Ruby interpreter does not think that it terminates the string:

'Won\'t you read O\'Reilly\'s book?'

The backslash also works to escape another backslash, so that the second backslash is not itself interpreted as an escape character. Here are some situations in which you need to use a double backslash:

'This string literal ends with a single backslash: \\'
'This is a backslash-quote: \\\''
'Two backslashes: \\\\'

In single-quoted strings, a backslash is not special if the character that follows it is anything other than a quote or a backslash. Most of the time, therefore, backslashes need not be doubled (although they can be) in string literals. For example, the following two string literals are equal:

'a\b' == 'a\\b'
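As a quick check of these rules, the following sketch measures what a few single-quoted literals actually contain. The variable names and sample text are illustrative only and do not come from the surrounding discussion.

```ruby
# Single-quoted literals recognize only two escapes: \' and \\.
apostrophe = 'O\'Reilly'   # \' yields a lone apostrophe
backslash  = 'C:\\temp'    # \\ yields a single backslash
untouched  = 'a\b'         # \b is not an escape here: a, backslash, b

puts apostrophe            # prints O'Reilly
puts backslash.length      # prints 7
puts untouched.length      # prints 3
puts untouched == 'a\\b'   # prints true
```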

Single-quoted strings may extend over multiple lines, and the resulting string literal includes the newline characters. It is not possible to escape the newlines with a backslash:

'This is a long string literal \
that includes a backslash and a newline'

If you want to break a long single-quoted string literal across multiple lines without embedding newlines in it, simply break it into multiple adjacent string literals; the Ruby interpreter will concatenate them during the parsing process. Remember, though, that you must escape the newlines (see Chapter 2) between the literals so that Ruby does not interpret the newline as a statement terminator:

message = 
'These three literals are '\
'concatenated into one by the interpreter. '\
'The resulting string contains no newlines.'

Double-quoted string literals

String literals delimited by double quotation marks are much more flexible than single-quoted literals. Double-quoted literals support quite a few backslash escape sequences, such as \n for newline, \t for tab, and \" for a quotation mark that does not terminate the string:

"\t\"This quote begins with a tab and ends with a newline\"\n"
"\\"  # A single backslash

In Ruby 1.9, the \u escape embeds arbitrary Unicode characters, specified by their codepoint, into a double-quoted string. This escape sequence is complex enough that we’ll describe it in its own section (see Unicode escapes). Many of the other backslash escape sequences are obscure and are used for encoding binary data into strings. The complete list of escape sequences is shown in Table 3-1.

More powerfully, double-quoted string literals may also include arbitrary Ruby expressions. When the string is created, the expression is evaluated, converted to a string, and inserted into the string in place of the expression text itself. This substitution of an expression with its value is known in Ruby as “string interpolation.” Expressions within double-quoted strings begin with the # character and are enclosed within curly braces:

"360 degrees=#{2*Math::PI} radians" # "360 degrees=6.28318530717959 radians"

When the expression to be interpolated into the string literal is simply a reference to a global, instance, or class variable, then the curly braces may be omitted:

$salutation = 'hello'     # Define a global variable
"#$salutation world"      # Use it in a double-quoted string

Use a backslash to escape the # character if you do not want it to be treated specially. Note that this only needs to be done if the character after # is {, $, or @:

"My phone #: 555-1234"                # No escape needed
"Use \#{ to interpolate expressions"  # Escape #{ with backslash

Double-quoted string literals may span multiple lines, and line terminators become part of the string literal, unless escaped with a backslash:

"This string literal
has two lines \
but is written on three"
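The double-quoted rules described so far (interpolation, the brace-free form for globals, escaping #, and backslash-newline) can be exercised together in a short sketch. The global $unit and the other variable names are assumptions introduced only for this demonstration.

```ruby
$unit = "radians"                  # hypothetical global, defined only for this demo
angle = (2 * Math::PI).round(4)    # rounded so the printed value is stable

full_circle = "360 degrees = #{angle} #$unit"   # braces are optional for globals
escaped     = "Use \#{expr} to interpolate"     # \# suppresses interpolation
single      = 'No #{interpolation} here'        # single quotes never interpolate
joined      = "one line \
in the string"                                  # backslash swallows the newline

puts full_circle   # prints 360 degrees = 6.2832 radians
puts escaped       # prints Use #{expr} to interpolate
puts single        # prints No #{interpolation} here
puts joined        # prints one line in the string
```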

You may prefer to explicitly encode the line terminators in your strings, for example to enforce the network CRLF (Carriage Return Line Feed) line terminators used in the HTTP protocol. To do this, write all your string literals on a single line and explicitly include the line endings with the \r and \n escape sequences. Remember that adjacent string literals are automatically concatenated, but if they are written on separate lines, the newline between them must be escaped:

"This string has three lines.\r\n" \
"It is written as three adjacent literals\r\n" \
"separated by escaped newlines\r\n"

Table 3-1. Backslash escapes in double-quoted strings

Escape sequence    Meaning

\x

A backslash before any character x is equivalent to the character x by itself, unless x is a line terminator or one of the special characters abcefnrstuvxCM01234567. This syntax is useful to escape the special meaning of the \, #, and " characters.

\a

The BEL character (ASCII code 7). Rings the console bell. Equivalent to \C-g or \007.

\b

The Backspace character (ASCII code 8). Equivalent to \C-h or \010.

\e

The ESC character (ASCII code 27). Equivalent to \033.

\f

The Form Feed character (ASCII code 12). Equivalent to \C-l and \014.

\n

The Newline character (ASCII code 10). Equivalent to \C-j and \012.

\r

The Carriage Return character (ASCII code 13). Equivalent to \C-m and \015.

\s

The Space character (ASCII code 32).

\t

The TAB character (ASCII code 9). Equivalent to \C-i and \011.

\unnnn

The Unicode codepoint nnnn, where each n is one hexadecimal digit. Leading zeros may not be dropped; all four digits are required in this form of the \u escape. Supported in Ruby 1.9 and later.

\u{hexdigits}

The Unicode codepoint(s) specified by hexdigits. See the description of this escape in the main text. Ruby 1.9 and later.

\v

The vertical tab character (ASCII code 11). Equivalent to \C-k and \013.

\nnn

The byte nnn, where nnn is three octal digits between 000 and 377.

\nn

Same as \0nn, where nn is two octal digits between 00 and 77.

\n

Same as \00n, where n is an octal digit between 0 and 7.

\xnn

The byte nn, where nn is two hexadecimal digits between 00 and FF. (Both lowercase and uppercase letters are allowed as hexadecimal digits.)

\xn

Same as \x0n, where n is a hexadecimal digit between 0 and F (or f).

\cx

Shorthand for \C-x.

\C-x

The character whose character code is formed by zeroing the sixth and seventh bits of x, retaining the high-order bit and the five low bits. x can be any character, but this sequence is usually used to represent control characters Control-A through Control-Z (ASCII codes 1 through 26). Because of the layout of the ASCII table, you can use either lowercase or uppercase letters for x. Note that \cx is shorthand for \C-x. x can be any single character or an escape other than \C, \u, \x, or \nnn.

\M-x

The character whose character code is formed by setting the high bit of the code of x. This is used to represent “meta” characters, which are not technically part of the ASCII character set. x can be any single character or an escape other than \M, \u, \x, or \nnn. \M can be combined with \C as in \M-\C-A.

\eol

A backslash before a line terminator escapes the terminator. Neither the backslash nor the terminator appear in the string.
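Several rows of Table 3-1 describe different spellings of the same character. The sketch below groups a few of those spellings and checks that each group collapses to a single value. The hash name and its labels are illustrative, and only a handful of rows are covered.

```ruby
# Each array lists escapes that Table 3-1 describes as equivalent.
equivalents = {
  "BEL"     => ["\a", "\C-g", "\cg", "\007", "\x07"],
  "ESC"     => ["\e", "\033", "\x1B"],
  "TAB"     => ["\t", "\C-i", "\011", "\x9"],
  "newline" => ["\n", "\C-j", "\012", "\xA"],
  "space"   => ["\s", " ", "\040", "\x20"],
  "CRLF"    => ["\r\n", "\C-m\C-j", "\015\012", "\x0D\x0A"]
}

# A group is consistent when removing duplicates leaves exactly one string.
equivalents.each do |name, forms|
  puts "#{name}: #{forms.uniq.length == 1}"
end
```

Every line of output should end in true. A false would mean that two spellings in a group produce different strings.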

Unicode escapes

In Ruby 1.9, double-quoted strings can include arbitrary Unicode characters with \u escapes. In its simplest form, \u is followed by exactly four hexadecimal digits (letters can be upper- or lowercase), which represent a Unicode codepoint between 0000 and FFFF. For example:

"\u00D7"       # => "×": leading zeros cannot be dropped
"\u20ac"       # => "€": lowercase letters are okay

A second form of the \u escape is followed by an open curly brace, one to six hexadecimal digits, and a close curly brace. The digits between the braces can represent any Unicode codepoint between 0 and 10FFFF, and leading zeros can be dropped in this form:

"\u{A5}"      # => "¥": same as "\u00A5"
"\u{3C0}"     # Greek lowercase pi: same as "\u03C0"
"\u{10ffff}"  # The largest Unicode codepoint

Finally, the \u{} form of this escape allows multiple codepoints to be embedded within a single escape. Simply place multiple runs of one to six hexadecimal digits, separated by a single space or tab character, within the curly braces. Spaces are not allowed after the opening curly brace or before the closing brace:

money = "\u{20AC A3 A5}"  # => "€£¥"

Note that spaces within the curly braces do not encode spaces in the string itself. You can, however, encode the ASCII space character with Unicode codepoint 20:

money = "\u{20AC 20 A3 20 A5}"  # => "€ £ ¥"

Strings that use the \u escape are encoded using the Unicode UTF-8 encoding. (See String Encodings and Multibyte Characters for more on the encoding of strings.)

\u escapes are usually, but not always, legal in strings. If the source file uses an encoding other than UTF-8, and a string contains multibyte characters in that encoding (literal characters, not characters created with escapes), then it is not legal to use \u in that string: one string cannot encode characters in two different encodings. You can always use \u if the source encoding (see Specifying Program Encoding) is UTF-8. And you can always use \u in a string that only contains ASCII characters.

\u escapes may appear in double-quoted strings, and also in other forms of quoted text (described shortly) such as regular expressions, character literals, %- and %Q-delimited strings, %W-delimited arrays, here documents, and backquote-delimited command strings. Java programmers should note that Ruby’s \u escape can only appear in quoted text, not in program identifiers.
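The two \u forms and their effect on a string's encoding can be checked with a small sketch. It is written with escapes only, so it does not depend on the source file's encoding. The variable names are illustrative, and it assumes Ruby 1.9 or later.

```ruby
euro_fixed = "\u20AC"            # four-digit form
euro_brace = "\u{20ac}"          # brace form, lowercase hex digits
money      = "\u{20AC A3 A5}"    # three codepoints in one escape

puts euro_fixed == euro_brace    # prints true
puts money.length                # prints 3 (characters)
puts money.bytesize              # prints 7 (3 + 2 + 2 bytes in UTF-8)
puts money.encoding              # prints UTF-8
p money.codepoints               # prints [8364, 163, 165]
```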

Arbitrary delimiters for string literals

When working with text that contains apostrophes and quotation marks, it is awkward to write it as a single- or double-quoted string literal. Ruby supports a generalized quoting syntax for string literals (and, as we’ll see later, for regular expression and array literals as well). The sequence %q begins a string literal that follows single-quoted string rules, and the sequence %Q (or just %) introduces a literal that follows double-quoted string rules. The first character following q or Q is the delimiter character, and the string literal continues until a matching (unescaped) delimiter is found. If the opening delimiter is (, [, {, or <, then the matching delimiter is ), ], }, or >. (Note that the backtick ` and apostrophe ' are not a matched pair.) Otherwise, the closing delimiter is the same as the opening delimiter. Here are some examples:

%q(Don't worry about escaping ' characters!)
%Q|"How are you?", he said|
%-This string literal ends with a newline\n-  # Q omitted in this one

If you find that you need to escape the delimiter character, you can use a backslash (even in the stricter %q form) or just choose a different delimiter:

%q_This string literal contains \_underscores\__
%Q!Just use a _different_ delimiter\!!

If you use paired delimiters, you don’t need to escape those delimiters in your literals, as long as they appear in properly nested pairs:

# XML uses paired angle brackets:
%<<book><title>Ruby in a Nutshell</title></book>>  # This works
# Expressions use paired, nested parens:
%((1+(2*3)) = #{(1+(2*3))})                        # This works, too
%(A mismatched paren \( must be escaped)           # Escape needed here
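To see how the %q, %Q, and bare % forms differ in practice, the following sketch builds one literal of each kind and prints the result. The variable name and its value are placeholders chosen for the demonstration.

```ruby
name = "Ruby"   # hypothetical value for the demo

raw    = %q(Don't expand #{name} or \n here)   # single-quoted rules
cooked = %Q(He said "#{name}"\n)               # double-quoted rules
short  = %|#{name} has %-literals|             # bare % behaves like %Q
nested = %((1+(2*3)) = #{1 + (2 * 3)})         # balanced parens need no escapes

puts raw      # prints Don't expand #{name} or \n here
print cooked  # prints He said "Ruby" followed by a newline
puts short    # prints Ruby has %-literals
puts nested   # prints (1+(2*3)) = 7
```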

Here documents

For long string literals, there may be no single character delimiter that can be used without worrying about remembering to escape characters within the literal. Ruby’s solution to this problem is to allow you to specify an arbitrary sequence of characters to serve as the delimiter for the string. This kind of literal is borrowed from Unix shell syntax and is historically known as a here document. (Because the document is right here in the source code rather than in an external file.)

Here documents begin with << or <<-. These are followed immediately (no space is allowed, to prevent ambiguity with the left-shift operator) by an identifier or string that specifies the ending delimiter. The text of the string literal begins on the next line and continues until the text of the delimiter appears on a line by itself. For example:

document = <<HERE        # This is how we begin a here document
This is a string literal.
It has two lines and abruptly ends...
HERE

The Ruby interpreter gets the contents of a string literal by reading a line at a time from its input. This does not mean, however, that the << must be the last thing on its own line. In fact, after reading the content of a here document, the Ruby interpreter goes back to the line it was on and continues parsing it. The following Ruby code, for example, creates a string by concatenating two here documents (and the newlines that terminate them) and a regular double-quoted string:

greeting = <<HERE + <<THERE + "World"
Hello
HERE
There
THERE

The <<HERE on line 1 causes the interpreter to read lines 2 and 3. And the <<THERE causes the interpreter to read lines 4 and 5. After these lines have been read, the three string literals are concatenated into one.

The ending delimiter of a here document really must appear on a line by itself: no comment may follow the delimiter. If the here document begins with <<, then the delimiter must start at the beginning of the line. If the literal begins with <<- instead, then the delimiter may have whitespace in front of it. The newline at the beginning of a here document is not part of the literal, but the newline at the end of the document is. Therefore, every here document ends with a line terminator, except for an empty here document, which is the same as "":

empty = <<END
END

If you use an unquoted identifier as the terminator, as in the previous examples, then the here document behaves like a double-quoted string for the purposes of interpreting backslash escapes and the # character. If you want to be very, very literal, allowing no escape characters whatsoever, place the delimiter in single quotes. Doing this also allows you to use spaces in your delimiter:

document = <<'THIS IS THE END, MY ONLY FRIEND, THE END'
    .
    . lots and lots of text goes here
    . with no escaping at all.
    .
THIS IS THE END, MY ONLY FRIEND, THE END

The single quotes around the delimiter hint that this string literal is like a single-quoted string. In fact, this kind of here document is even stricter. Because the single quote is not a delimiter, there is never a need to escape a single quote with a backslash. And because the backslash is never needed as an escape character, there is never a need to escape the backslash itself. In this kind of here document, therefore, backslashes are simply part of the string literal.

You may also use a double-quoted string literal as the delimiter for a here document. This is the same as using a single identifier, except that it allows spaces within the delimiter:

document = <<-"# # #"    # This is the only place we can put a comment
<html><head><title>#{title}</title></head>
<body>
<h1>#{title}</h1>
#{body}
</body>
</html>
               # # #

Note that there is no way to include a comment within a here document except on the first line after the << token and before the start of the literal. Of all the # characters in this code, one introduces a comment, three interpolate expressions into the literal, and the rest are the delimiter.
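The differences among the three delimiter styles are easiest to see side by side. This sketch assumes a placeholder variable title and a hypothetical helper method indented_terminator. It compares an unquoted delimiter, a single-quoted delimiter, and the <<- form with an indented terminator.

```ruby
title = "Here documents"   # hypothetical value for the demo

# Unquoted delimiter: behaves like a double-quoted string.
interpolated = <<HERE
Title: #{title}
Tab:\t|
HERE

# Single-quoted delimiter: no interpolation, no escapes.
literal = <<'HERE'
Title: #{title}
Tab:\t|
HERE

# <<- lets the terminator be indented; the body keeps its own indentation.
def indented_terminator
  <<-EOS
  body keeps its leading spaces
  EOS
end

puts interpolated.include?("Here documents")   # prints true
puts literal.include?('#{title}')              # prints true
puts indented_terminator.start_with?("  ")     # prints true
```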

Backtick command execution

Ruby supports another syntax involving quotes and strings. When text is enclosed in backquotes (the ` character, also known as backticks), that text is treated as a double-quoted string literal. The value of that literal is passed to the specially named Kernel.` method. This method executes the text as an operating system shell command and returns the command’s output as a string.

Consider the following Ruby code:

`ls`

On a Unix system, these four characters yield a string that lists the names of the files in the current directory. This is highly platform-dependent, of course. A rough equivalent in Windows might be `dir`.

Ruby supports a generalized quote syntax you can use in place of backticks. This is like the %Q syntax introduced earlier, but uses %x (for execute) instead:

%x[ls]

Note that the text within the backticks (or following %x) is processed like a double-quoted literal, which means that arbitrary Ruby expressions can be interpolated into the string. For example:

if windows
  listcmd = 'dir'
else
  listcmd = 'ls'
end
listing = `#{listcmd}`

In a case like this, however, it is simpler just to invoke the backtick method directly:

listing = Kernel.`(listcmd)  # irb doesn't support this legal syntax
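A portable way to try the three forms is the echo command, which is assumed here to be available in the shell (true on typical Unix systems and in the Windows command interpreter). The variable word is a placeholder. The output is stripped because the trailing line terminator differs between platforms.

```ruby
word = "hello"   # hypothetical value for the demo

plain   = `echo hello`                # backticks run the command in a shell
general = %x(echo #{word})            # %x form, with interpolation
direct  = Kernel.`("echo #{word}")    # calling the backtick method by name

puts plain.strip     # prints hello
puts general.strip   # prints hello
puts direct.strip    # prints hello
```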

String literals and mutability

Strings are mutable in Ruby. Therefore, the Ruby interpreter cannot use the same object to represent two identical string literals. (If you are a Java programmer, you may find this surprising.) Each time Ruby encounters a string literal, it creates a new object. If you include a literal within the body of a loop, Ruby will create a new object for each iteration. You can demonstrate this for yourself as follows:

10.times { puts "test".object_id }

For efficiency, you should avoid using literals within loops.
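One way to follow that advice is to hoist the literal out of the loop and reuse a single object. The sketch below compares the two approaches by collecting object IDs. It assumes string literals are not frozen, which the magic comment on the first line requests explicitly. The variable names are illustrative.

```ruby
# frozen_string_literal: false

# The literal is evaluated on every iteration, producing a new object each time.
fresh = Array.new(3) { "test".object_id }

# The literal is evaluated once; the loop only refers to the existing object.
label  = "test"
reused = Array.new(3) { label.object_id }

puts fresh.uniq.length    # prints 3: three distinct objects
puts reused.uniq.length   # prints 1: one shared object
```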

The String.new method

In addition to all the string literal options described earlier, you can also create new strings with the String.new method. With no arguments, this method returns a newly created string with no characters. With a single string argument, it creates and returns a new String object that represents the same text as the argument object.
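A brief sketch of both forms of String.new follows. The variable names are illustrative. It shows that the copy is a separate object, so changing it leaves the argument untouched.

```ruby
empty  = String.new            # a new string with no characters
source = "hello"
copy   = String.new(source)    # same text, different object

copy << " world"               # modifies only the copy

puts empty.length              # prints 0
puts copy                      # prints hello world
puts source                    # prints hello
puts copy.equal?(source)       # prints false: two distinct objects
```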

Character Literals

Single characters can be included literally in a Ruby program by preceding the character with a question mark. No quotation marks of any kind are used:

?A   # Character literal for the ASCII character A
?"   # Character literal for the double-quote character
??   # Character literal for the question mark character

Although Ruby has a character literal syntax, it does not have a special class to represent single characters. Also, the interpretation of character literals has changed between Ruby 1.8 and Ruby 1.9. In Ruby 1.8, character literals evaluate to the integer encoding of the specified character. ?A, for example, is the same as 65 because the ASCII encoding for the capital letter A is the integer 65. In Ruby 1.8, the character literal syntax only works with ASCII and single-byte characters.

In Ruby 1.9 and later, characters are simply strings of length 1. That is, the literal ?A is the same as the literal 'A', and there is really no need for this character literal syntax in new code. In Ruby 1.9, the character literal syntax works with multibyte characters and can also be used with the \u Unicode escape (though not with the multicodepoint form \u{a b c}):

?\u20AC == ?€    # => true: Ruby 1.9 only
?€ == "\u20AC"   # => true

The character literal syntax can actually be used with any of the character escapes listed earlier in Table 3-1:

?\t      # Character literal for the TAB character
?\C-x    # Character literal for Ctrl-X
?\111    # Literal for character whose encoding is 0111 (octal)
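Under Ruby 1.9 and later, each of these literals should evaluate to a one-character string. The sketch below checks a few of them. It uses String#ord, which is not discussed above, to recover the integer code that Ruby 1.8 would have returned directly. The variable names are illustrative.

```ruby
letter = ?A
tab    = ?\t
ctrl_x = ?\C-x
octal  = ?\111
euro   = ?\u20AC

puts letter == "A"          # prints true
puts letter.ord             # prints 65
puts tab == "\t"            # prints true
puts ctrl_x.ord             # prints 24
puts octal                  # prints I (octal 111 is decimal 73)
puts euro == "\u{20AC}"     # prints true
```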

String Operators

The String class defines several useful operators for manipulating strings of text. The + operator concatenates two strings and returns the result as a new String object:

planet = "Earth"
"Hello" + " " + planet    # Produces "Hello Earth"

Java programmers should note that the + operator does not convert its righthand operand to a string; you must do that yourself:

"Hello planet #" + planet_number.to_s  # to_s converts to a string

Of course, in Ruby, string interpolation is usually simpler than string concatenation with +. With string interpolation, the call to to_s is done automatically:

"Hello planet ##{planet_number}"

The << operator appends its second operand to its first, and should be familiar to C++ programmers. This operator is very different from +; it alters the lefthand operand rather than creating and returning a new object:

greeting = "Hello"
greeting << " " << "World"
puts greeting   # Outputs "Hello World"

Like +, the << operator does no type conversion on the righthand operand. If the righthand operand is an integer, however, it is taken to be a character code, and the corresponding character is appended. In Ruby 1.8, only integers between 0 and 255 are allowed. In Ruby 1.9, any integer that represents a valid codepoint in the string’s encoding can be used:

alphabet = "A"
alphabet << ?B   # Alphabet is now "AB"
alphabet << 67   # And now it is "ABC"
alphabet << 256  # Error in Ruby 1.8: codes must be >=0 and < 256

The * operator expects an integer as its righthand operand. It returns a String that repeats the text specified on the lefthand side the number of times specified by the righthand side:

ellipsis = '.'*3    # Evaluates to '...'
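The contrast between +, <<, and * can be observed by tracking object identity. In this sketch the variable names and the value of planet_number are assumptions made for the demonstration. String.new is used so that the string being appended to is an ordinary mutable object.

```ruby
planet_number = 3                   # hypothetical value for the demo
greeting      = String.new("Hello")
original_id   = greeting.object_id

joined = greeting + " planet #" + planet_number.to_s   # + builds a new string
greeting << " " << "World"                             # << appends in place
greeting << 33                                         # integer operand: appends "!"
rule = "=-" * 3                                        # repetition

# + does not convert its righthand operand, so this raises TypeError.
conversion_error = begin
  "Hello planet #" + planet_number
rescue TypeError => e
  e.class
end

puts joined                                # prints Hello planet #3
puts greeting                              # prints Hello World!
puts greeting.object_id == original_id     # prints true: same object throughout
puts rule                                  # prints =-=-=-
puts conversion_error                      # prints TypeError
```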

If the lefthand side is a string literal, any interpolation is performed just once before the repetition is done. This means that the following too-clever code does not do what you might want it to:

a = 0;
"#{a=a+1} " * 3   # Returns "1 1 1 ", not "1 2 3 "

String defines all the standard comparison operators. == and != compare strings for equality and inequality. Two strings are equal if—and only if—they have the same length and all characters are equal. <, <=, >, and >= compare the relative order of strings by comparing the character codes of the characters that make up a string. If one string is a prefix of another, the shorter string is less than the longer string. Comparison is based strictly on character codes. No normalization is done, and natural language collation order (if it differs from the numeric sequence of character codes) is ignored.

String comparison is case-sensitive. Remember that in ASCII, the uppercase letters all have lower codes than the lowercase letters. This means, for example, that "Z" < "a". For case-insensitive comparison of ASCII characters, use the casecmp method (see Strings) or convert your strings to the same case with downcase or upcase methods before comparing them. (Keep in mind that Ruby’s knowledge of upper- and lowercase letters is limited to the ASCII character set.)
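These ordering rules show up most often when sorting. The following sketch uses an arbitrary word list to compare the default character-code ordering with a case-insensitive ordering obtained through downcase.

```ruby
puts "Z" < "a"                  # prints true: uppercase codes come first
puts "apple" < "applesauce"     # prints true: a prefix is less than the longer string
puts "Ruby" == "ruby"           # prints false: comparison is case-sensitive
puts "Ruby".casecmp("rUBY")     # prints 0: equal when case is ignored

words   = ["banana", "Cherry", "apple"]       # sample data for the demo
by_code = words.sort                          # character-code order
by_fold = words.sort_by { |w| w.downcase }    # case-insensitive order

p by_code   # prints ["Cherry", "apple", "banana"]
p by_fold   # prints ["apple", "banana", "Cherry"]
```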

Accessing Characters and Substrings

Perhaps the most important operator supported by String is the square-bracket array-index operator [], which is used for extracting or altering portions of a string. This operator is quite flexible and can be used with a number of different operand types. It can also be used on the lefthand side of an assignment, as a way of altering string content.

In Ruby 1.8, a string is like an array of bytes or 8-bit character codes. The length of this array is given by the length or size method, and you get or set elements of the array simply by specifying the character number within square brackets:

s = 'hello';   # Ruby 1.8
s[0]           # 104: the ASCII character code for the first character 'h'
s[s.length-1]  # 111: the character code of the last character 'o'
s[-1]          # 111: another way of accessing the last character
s[-2]          # 108: the second-to-last character
s[-s.length]   # 104: another way of accessing the first character
s[s.length]    # nil: there is no character at that index

Notice that negative array indexes specify a 1-based position from the end of the string. Also notice that Ruby does not throw an exception if you try to access a character beyond the end of the string; it simply returns nil instead.

Ruby 1.9 returns single-character strings rather than character codes when you index a single character. Keep in mind that when working with multibyte strings, with characters encoded using variable numbers of bytes, random access to characters is less efficient than access to the underlying bytes:

s = 'hello';   # Ruby 1.9
s[0]           # 'h': the first character of the string, as a string
s[s.length-1]  # 'o': the last character 'o'
s[-1]          # 'o': another way of accessing the last character
s[-2]          # 'l': the second-to-last character
s[-s.length]   # 'h': another way of accessing the first character
s[s.length]    # nil: there is no character at that index

To alter individual characters of a string, simply use brackets on the lefthand side of an assignment expression. In Ruby 1.8, the righthand side may be an ASCII character code or a string. In Ruby 1.9, the righthand side must be a string. You can use character literals in either version of the language:

s[0] = ?H        # Replace first character with a capital H
s[-1] = ?O       # Replace last character with a capital O
s[s.length] = ?! # ERROR! Can't assign beyond the end of the string

The righthand side of an assignment statement like this need not be a character code: it may be any string, including a multicharacter string or the empty string. Again, this works in both Ruby 1.8 and Ruby 1.9:

s = "hello"      # Begin with a greeting
s[-1] = ""       # Delete the last character; s is now "hell"
s[-1] = "p!"     # Change new last character and add one; s is now "help!"

More often than not, you want to retrieve substrings from a string rather than individual character codes. To do this, use two comma-separated operands between the square brackets. The first operand specifies an index (which may be negative), and the second specifies a length (which must be nonnegative). The result is the substring that begins at the specified index and continues for the specified number of characters:

s = "hello"
s[0,2]          # "he"
s[-1,1]         # "o": returns a string, not the character code ?o
s[0,0]          # "": a zero-length substring is always empty
s[0,10]         # "hello": returns all the characters that are available
s[s.length,1]   # "": there is an empty string immediately beyond the end
s[s.length+1,1] # nil: it is an error to read past that
s[0,-1]         # nil: negative lengths don't make any sense

If you assign a string to a string indexed like this, you replace the specified substring with the new string. If the righthand side is the empty string, this is a deletion, and if the lefthand side has zero-length, this is an insertion:

s = "hello"
s[0,1] = "H"              # Replace first letter with a capital letter
s[s.length,0] = " world"  # Append by assigning beyond the end of the string
s[5,0] = ","              # Insert a comma, without deleting anything
s[5,6] = ""               # Delete with no insertion; s == "Hellod"

Another way to extract, insert, delete, or replace a substring is by indexing a string with a Range object. We’ll explain ranges in detail in Ranges later. For our purposes here, a Range is two integers separated by dots. When a Range is used to index a string, the return value is the substring whose characters fall within the Range:

s = "hello"
s[2..3]           # "ll": characters 2 and 3
s[-3..-1]         # "llo": negative indexes work, too
s[0..0]           # "h": this Range includes one character index
s[0...0]          # "": this Range is empty
s[2..1]           # "": this Range is also empty
s[7..10]          # nil: this Range is outside the string bounds
s[-2..-1] = "p!"     # Replacement: s becomes "help!"
s[0...0] = "Please " # Insertion: s becomes "Please help!"
s[6..10] = ""        # Deletion: s becomes "Please!"

Don’t confuse string indexing with two comma-separated integers with this form that uses a single Range object. Although both involve two integers, there is an important difference: the form with the comma specifies an index and a length; the form that uses a Range object specifies two indexes.
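The difference is easy to see when the same two integers are used in both forms. This sketch uses illustrative variable names and an arbitrary sample string.

```ruby
s = "hello world"

by_length    = s[6, 5]      # start at index 6, take 5 characters
by_range     = s[6..10]     # indexes 6 through 10, inclusive
by_exclusive = s[6...11]    # indexes 6 up to, but not including, 11

same_numbers_length = s[2, 3]   # index 2, length 3
same_numbers_range  = s[2..3]   # indexes 2 and 3 only

puts by_length             # prints world
puts by_range              # prints world
puts by_exclusive          # prints world
puts same_numbers_length   # prints llo
puts same_numbers_range    # prints ll

t = String.new("hello")
t[1, 3] = "ipp"            # replace 3 characters starting at index 1
puts t                     # prints hippo
t[1..3] = "ell"            # replace indexes 1 through 3
puts t                     # prints hello
```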

It is also possible to index a string with a string. When you do this, the return value is the first substring of the target string that matches the index string, or nil, if no match is found. This form of string indexing is really only useful on the lefthand side of an assignment statement when you want to replace the matched string with some other string:

s = "hello"       # Start with the word "hello"
while(s["l"])     # While the string contains the substring "l"
  s["l"] = "L";   # Replace first occurrence of "l" with "L"
end               # Now we have "heLLo"

Finally, you can index a string using a regular expression. (Regular expression objects are covered in Regular Expressions.) The result is the first substring of the string that matches the pattern, and again, this form of string indexing is most useful when used on the lefthand side of an assignment:

s[/[aeiou]/] = '*'      # Replace first vowel with an asterisk
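Both forms can also be used for reading, which makes their matching behavior visible before any replacement is done. The sketch below uses a sample string and illustrative variable names.

```ruby
s = String.new("hello world")

first_l  = s["l"]      # the matched text itself, not its position
missing  = s["xyz"]    # nil when the substring does not occur
run_of_l = s[/l+/]     # first match of the pattern

s["world"] = "Ruby"    # replace the first occurrence of "world"
s[/[aeiou]/] = "*"     # replace the first vowel

puts first_l           # prints l
p missing              # prints nil
puts run_of_l          # prints ll
puts s                 # prints h*llo Ruby
```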

Iterating Strings

In Ruby 1.8, the String class defines an each method that iterates a string line-by-line. The String class includes the methods of the Enumerable module, and they can be used to process the lines of a string. You can use the each_byte iterator in Ruby 1.8 to iterate through the bytes of a string, but there is little advantage to using each_byte over the [] operator because random access to bytes is as quick as sequential access in 1.8.

The situation is quite different in Ruby 1.9, which removes the each method, and in which the String class is no longer Enumerable. In place of each, Ruby 1.9 defines three clearly named string iterators: each_byte iterates sequentially through the individual bytes that comprise a string; each_char iterates the characters; and each_line iterates the lines. If you want to process a string character-by-character, it may be more efficient to use each_char than to use the [] operator and character indexes:

s = "¥1000"
s.each_char {|x| print "#{x} " }         # Prints "¥ 1 0 0 0". Ruby 1.9 
0.upto(s.size-1) {|i| print "#{s[i]} "}  # Inefficient with multibyte chars
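The three Ruby 1.9 iterators can be compared on the same data. This sketch writes the yen sign as a \u escape so that it does not depend on the source encoding. The variable names are illustrative.

```ruby
price = "\u{A5}1000"    # yen sign followed by four ASCII digits

chars = []
price.each_char { |c| chars << c }    # one element per character

bytes = []
price.each_byte { |b| bytes << b }    # one element per byte

lines = []
"one\ntwo\n".each_line { |l| lines << l }   # each line keeps its terminator

puts chars.length     # prints 5
puts bytes.length     # prints 6: the yen sign occupies two bytes in UTF-8
p bytes.first(2)      # prints [194, 165]
p lines               # prints ["one\n", "two\n"]
```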

String Encodings and Multibyte Characters

Strings are fundamentally different in Ruby 1.8 and Ruby 1.9:

  • In Ruby 1.8, strings are a sequence of bytes. When strings are used to represent text (instead of binary data), each byte of the string is assumed to represent a single ASCII character. In 1.8, the individual elements of a string are not characters, but numbers—the actual byte value or character encoding.

  • In Ruby 1.9, on the other hand, strings are true sequences of characters, and those characters need not be confined to the ASCII character set. In 1.9, the individual elements of a string are characters—represented as strings of length 1—rather than integer character codes. Every string has an encoding that specifies the correspondence between the bytes in the string and the characters those bytes represent. Encodings such as the UTF-8 encoding of Unicode characters use variable numbers of bytes for each character, and there is no longer a 1-to-1 (nor even a 2-to-1) correspondence between bytes and characters.

The subsections that follow explain the encoding-related features of strings in Ruby 1.9, and also demonstrate rudimentary support for multibyte characters in Ruby 1.8 using the jcode library.

Multibyte characters in Ruby 1.9

The String class has been rewritten in Ruby 1.9 to be aware of and properly handle multibyte characters. Although multibyte support is the biggest change in Ruby 1.9, it is not a highly visible change: code that uses multibyte strings just works. It is worth understanding why it works, however, and this section explains the details.

If a string contains multibyte characters, then the number of bytes does not correspond to the number of characters. In Ruby 1.9, the length and size methods return the number of characters in a string, and the new bytesize method returns the number of bytes. The [] and []= operators allow you to query and set the characters of a string, and the new methods getbyte and setbyte allow you to query and set individual bytes (though you should not often need to do this):

# -*- coding: utf-8 -*-   # Specify Unicode UTF-8 characters

# This is a string literal containing a multibyte multiplication character
s = "2×2=4"

# The string contains 6 bytes which encode 5 characters
s.bytesize                                     # => 6
s.bytesize.times {|i| print s.getbyte(i), " "} # Prints "50 195 151 50 61 52"
s.length                                       # => 5
s.length.times { |i| print s[i], " "}          # Prints "2 × 2 = 4"
s.setbyte(5, s.getbyte(5)+1)                   # s is now "2×2=5"

Note that the first line in this code is a coding comment that sets the source encoding (see Specifying Program Encoding) to UTF-8. Without this comment, the Ruby interpreter would not know how to decode the sequence of bytes in the string literal into a sequence of characters.

When a string contains characters encoded with varying numbers of bytes, it is no longer possible to map directly from character index to byte offset in the string. In the string above, for example, the second character begins at the second byte. But the third character begins at the fourth byte. This means that you cannot assume that random access to arbitrary characters within a string is a fast operation. When you use the [] operator, as we did in the code above, to access a character or substring within a multibyte string, the Ruby implementation must internally iterate sequentially through the string to find the desired character index. In general, therefore, you should try to do your string processing using sequential algorithms when possible. That is: use the each_char iterator when possible instead of repeated calls to the [] operator. On the other hand, it is usually not necessary to worry too much about this. Ruby implementations optimize the cases that can be optimized, and if a string consists entirely of 1-byte characters, random access to those characters will be efficient. If you want to attempt your own optimizations, you can use the instance method ascii_only? to determine whether a string consists entirely of 7-bit ASCII characters.
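
Here is a small sketch of both ideas: testing with ascii_only?, and scanning a string sequentially rather than by index. The char_positions method is a hypothetical helper written for this example; it is not part of the String API:

```ruby
plain = "2x2=4"          # ASCII letter x
fancy = "2\u00D72=4"     # Multibyte multiplication sign

plain.ascii_only?        # => true
fancy.ascii_only?        # => false

# Hypothetical helper: find every index at which a character occurs,
# using one sequential pass instead of repeated calls to [].
def char_positions(s, target)
  positions = []
  s.each_char.with_index {|c, i| positions << i if c == target }
  positions
end

char_positions(fancy, "2")   # => [0, 2]
```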

The Ruby 1.9 String class defines an encoding method that returns the encoding of a string (the return value is an Encoding object, which is described below):

# -*- coding: utf-8 -*-
s = "2×2=4"     # Note multibyte multiplication character
s.encoding      # => #<Encoding:UTF-8>

The encoding of string literals is always the same as the source encoding of the file, except that literals that contain \u escapes are always encoded in UTF-8, regardless of the source encoding.

Certain string operations, such as concatenation and pattern matching, require that two strings (or a string and a regular expression) have compatible encodings. If you concatenate an ASCII string with a UTF-8 string, for example, you obtain a UTF-8 string. It is not possible, however, to concatenate a UTF-8 string and an SJIS string: the encodings are not compatible, and an exception will be raised. You can test whether two strings (or a string and a regular expression) have compatible encodings by using the class method Encoding.compatible?. If the encodings of the two arguments are compatible, it returns the one that is the superset of the other. If the encodings are incompatible, it returns nil.
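
The following sketch exercises these rules. It assumes that the Shift_JIS transcoder is available, as it is in a standard Ruby installation. Note that both the UTF-8 and the SJIS strings contain non-ASCII characters; strings that happen to contain only ASCII characters are more forgiving:

```ruby
ascii = "abc".encode("us-ascii")       # An ASCII string
utf8  = "2\u00D72"                     # UTF-8, with a multibyte character
sjis  = "\u3042".encode("shift_jis")   # Hiragana "a", transcoded to SJIS

ascii_utf8 = Encoding.compatible?(ascii, utf8)   # The UTF-8 Encoding object
utf8_sjis  = Encoding.compatible?(utf8, sjis)    # nil: not compatible
joined     = ascii + utf8                        # A UTF-8 string

begin
  utf8 + sjis                          # Incompatible encodings
  outcome = :concatenated
rescue Encoding::CompatibilityError
  outcome = :incompatible              # This branch is taken
end
```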

You can explicitly set the encoding of a string with force_encoding. This is useful if you have a string of bytes (read from an I/O stream, perhaps) and want to tell Ruby how they should be interpreted as characters. Or, if you have a string of multibyte characters, but you want to index individual bytes with []:

text = stream.readline.force_encoding("utf-8")
bytes = text.dup.force_encoding("binary")

force_encoding does not make a copy of its receiver; it modifies the encoding of the string and returns the string. This method does not do any character conversion—the underlying bytes of the string are not changed, only Ruby’s interpretation of them is changed. The argument to force_encoding can be the name of an encoding or an Encoding object.
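
A brief sketch makes this concrete. The variable names are illustrative, and dup is used only to obtain a string that is safe to modify:

```ruby
s = "2\u00D72=4".dup             # Five characters, six bytes, in UTF-8
before = s.bytes.to_a            # Remember the underlying bytes

t = s.force_encoding("binary")   # Reinterpret the same bytes
same_object = t.equal?(s)        # => true: the receiver itself is returned
same_bytes  = (s.bytes.to_a == before)  # => true: no bytes were changed
s.length                         # => 6: each byte now counts as a character
```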

force_encoding does no validation; it does not check that the underlying bytes of the string represent a valid sequence of characters in the specified encoding. Use valid_encoding? to perform validation. This instance method takes no arguments and checks whether the bytes of a string can be interpreted as a valid sequence of characters using the string’s encoding:

s = "\xa4".force_encoding("utf-8")  # This is not a valid UTF-8 string
s.valid_encoding?                   # => false

The encode method (and the mutating encode! variant) of a string is quite different from force_encoding. It returns a string that represents the same sequence of characters as its receiver, but using a different encoding. In order to change the encoding of—or transcode—a string like this, the encode method must alter the underlying bytes that make up the string. Here is an example:

# -*- coding: utf-8 -*-
euro1 = "\u20AC"                     # Start with the Unicode Euro character
puts euro1                           # Prints "€"
euro1.encoding                       # => #<Encoding:UTF-8>
euro1.bytesize                       # => 3

euro2 = euro1.encode("iso-8859-15")  # Transcode to ISO-8859-15 (Latin-9)
puts euro2.inspect                   # Prints "\xA4"
euro2.encoding                       # => #<Encoding:ISO-8859-15>
euro2.bytesize                       # => 1

euro3 = euro2.encode("utf-8")        # Transcode back to UTF-8
euro1 == euro3                       # => true

Note that you should not often need to use the encode method. The most common time to transcode strings is before writing them to a file or sending them across a network connection. And, as we’ll see in Streams and Encodings, Ruby’s I/O stream classes support the automatic transcoding of text when it is written out.

If the string that you are calling encode on consists of unencoded bytes, you need to specify the encoding by which to interpret those bytes before transcoding them to another encoding. Do this by passing two arguments to encode. The first argument is the desired encoding, and the second argument is the current encoding of the string. For example:

# Interpret a byte as an iso-8859-15 codepoint, and transcode to UTF-8
byte = "\xA4"
char = byte.encode("utf-8", "iso-8859-15")

That is, the following two lines of code have the same effect:

text = bytes.encode(to, from)
text = bytes.dup.force_encoding(from).encode(to)
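
To check this equivalence for yourself, you might try something like the following sketch. It builds the raw byte with Array#pack so that the example does not depend on the source encoding; the variable names are illustrative:

```ruby
bytes = [0xA4].pack("C")     # One raw byte, in a binary string
to, from = "utf-8", "iso-8859-15"

a = bytes.encode(to, from)                     # Two-argument form
b = bytes.dup.force_encoding(from).encode(to)  # The longer equivalent

a == b                       # => true
a == "\u20AC"                # => true: 0xA4 is the euro sign in ISO-8859-15
bytes.encoding               # Still binary: encode does not alter its receiver
```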

If you call encode with no arguments, it transcodes its receiver to the default internal encoding, if one has been set with the -E or -U interpreter options (see Encoding Options). This allows library modules (for example) to transcode their public string constants to a common encoding for interoperability.
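
As a minimal sketch of the no-argument form: the result depends on how the interpreter was started, so this example computes the encoding it expects rather than assuming one. When no default internal encoding is set, encode simply returns a copy of its receiver:

```ruby
s = "caf\u00E9"                       # A UTF-8 string
copy = s.encode                       # No arguments

# The default internal encoding is nil unless it has been set
expected = Encoding.default_internal || s.encoding
copy.encoding == expected             # => true
copy.equal?(s)                        # => false: always a new string
```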

Character encodings differ not only in their mapping from bytes to characters, but in the set of characters that they can represent. Unicode (also known as UCS—the Universal Character Set) tries to allow all characters, but character encodings not based on Unicode can only represent a subset of characters. It is not possible, therefore, to transcode all UTF-8 strings to EUC-JP (for example); Unicode characters that are neither Latin nor Japanese cannot be translated.

If the encode or encode! method encounters a character that it cannot transcode, it raises an exception:

"\u20AC".encode("iso-8859-1") # No euro sign in Latin-1, so this raises an exception

encode and encode! accept a hash of transcoding options as their final argument. At the time of this writing, the only defined option name is :invalid, and the only defined value for that key is :ignore. “ri String.encode” will give details when more options are implemented.
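
The sketch below shows one way to handle a failed transcoding. The first part rescues the exception that encode raises for a character with no equivalent in the target encoding. The second part is an assumption about Ruby releases later than the one described here: those releases accept :invalid and :undef keys with the value :replace, plus a :replace key naming the substitute text. Check the documentation for your own version before relying on these options:

```ruby
euro = "\u20AC"

# Part 1: rescue the exception for an untranslatable character
begin
  euro.encode("iso-8859-1")
  result = :converted
rescue Encoding::UndefinedConversionError => e
  result = :undefined            # This branch is taken
  bad_char = e.error_char        # The character that could not be converted
end

# Part 2 (assumes a later Ruby release): substitute instead of raising
fallback = euro.encode("iso-8859-1", :undef => :replace, :replace => "?")

# A byte sequence that is not valid UTF-8: "a" followed by a stray 0xA4
bad = [0x61, 0xA4].pack("C*").force_encoding("utf-8")
cleaned = bad.encode("iso-8859-1", :invalid => :replace, :replace => "?")
```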

The Encoding class

The Encoding class of Ruby 1.9 represents a character encoding. Encoding objects act as opaque identifiers for an encoding and do not have many methods of their own. The name method returns the name of an encoding. to_s is a synonym for name, and inspect converts an Encoding object to a string in a more verbose way than name does.
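
For example, using the UTF-8 encoding object (a minimal sketch):

```ruby
enc = Encoding::UTF_8

enc.name      # => "UTF-8"
enc.to_s      # => "UTF-8"
enc.inspect   # => "#<Encoding:UTF-8>"
```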

Ruby defines a constant for each of the built-in encodings it supports, and these are the easiest way to specify a hardcoded encoding in your program. The predefined constants include at least the following:

Encoding::ASCII_8BIT     # Also ::BINARY
Encoding::UTF_8          # UTF-8-encoded Unicode characters
Encoding::EUC_JP         # EUC-encoded Japanese
Encoding::SHIFT_JIS      # Japanese: also ::SJIS, ::WINDOWS_31J, ::CP932

Note that because these are constants, they must be written in uppercase, and hyphens in the encoding names must be converted to underscores. Ruby 1.9 also supports the US-ASCII encoding, the European encodings ISO-8859-1 through ISO-8859-15, and the Unicode UTF-16 and UTF-32 encodings in big-endian and little-endian variants.

If you have an encoding name as a string and want to obtain the corresponding Encoding object, use the Encoding.find factory method:

encoding = Encoding.find("utf-8")

Using Encoding.find causes the named encoding to be dynamically loaded, if necessary. Encoding.find accepts encoding names that are in either upper- or lowercase. Call the name method of an Encoding to obtain the name of the encoding as a string.
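
The following sketch shows the lookup in both cases, and what happens when the name is not recognized. The name "no-such-encoding" is made up for this example:

```ruby
lower = Encoding.find("utf-8")
upper = Encoding.find("UTF-8")

lower.equal?(Encoding::UTF_8)   # => true: the same object as the constant
lower == upper                  # => true: case does not matter
lower.name                      # => "UTF-8"

begin
  Encoding.find("no-such-encoding")
  lookup = :found
rescue ArgumentError
  lookup = :unknown             # Unknown names raise ArgumentError
end
```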

Encoding.list returns an array of all available encoding objects. Encoding.name_list returns an array of the names (as strings) of all available encodings. Many encodings have more than one name in common use, and Encoding.aliases returns a hash that maps encoding aliases to the official encoding names for which they are synonyms. The array returned by Encoding.name_list includes the aliases in the Encoding.aliases hash.
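
A short sketch of these three methods follows. It uses the alias "ASCII" for the US-ASCII encoding as its example; the full set of encodings and aliases varies between Ruby versions and builds:

```ruby
has_utf8  = Encoding.list.include?(Encoding::UTF_8)   # => true
has_name  = Encoding.name_list.include?("UTF-8")      # => true

official  = Encoding.aliases["ASCII"]                 # => "US-ASCII"
has_alias = Encoding.name_list.include?("ASCII")      # => true: aliases are listed too

# An alias and its official name find the same Encoding object
same = Encoding.find("ASCII") == Encoding.find(official)  # => true
```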

Use Encoding.default_external and Encoding.default_internal to obtain the Encoding objects that represent the default external and default internal encodings (see Source, External, and Internal Encodings). To obtain the encoding for the current locale, call Encoding.locale_charmap and pass the resulting string to Encoding.find.
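
For instance, a minimal sketch (the values you see depend on your platform, your locale, and the options passed to the interpreter):

```ruby
external = Encoding.default_external   # An Encoding object
internal = Encoding.default_internal   # An Encoding object, or nil if not set

charmap = Encoding.locale_charmap      # A string such as "UTF-8"
locale  = Encoding.find(charmap)       # The corresponding Encoding object
```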

Most methods that expect an Encoding object will also accept a case-insensitive encoding name (such as ascii, binary, utf-8, euc-jp, or sjis) in place of an Encoding object.
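
As a quick check of this interchangeability, the following sketch calls force_encoding and encode with a name and then with the equivalent constant. The variable names are illustrative:

```ruby
by_name   = "abc".dup.force_encoding("binary")
by_object = "abc".dup.force_encoding(Encoding::BINARY)
by_name.encoding == by_object.encoding     # => true

euro = "\u20AC"
a = euro.encode("utf-16be")                # Lowercase name
b = euro.encode(Encoding::UTF_16BE)        # Encoding object
a.encoding == b.encoding                   # => true
a.bytesize                                 # => 2: one 16-bit code unit
```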

Multibyte characters in Ruby 1.8

Normally, Ruby 1.8 treats all strings as sequences of 8-bit bytes. There is rudimentary support for multibyte characters (using the UTF-8, EUC, or SJIS encodings) in the jcode module of the standard library.

To use this library, require the jcode module, and set the global $KCODE variable to the encoding that your multibyte characters use. (Alternatively, use the -K command-line option when you start the Ruby interpreter.) The jcode library defines a new jlength method for String objects: it returns the length of the string in characters rather than in bytes. The existing 1.8 length and size methods are unchanged—they return the string length in bytes.

The jcode library does not modify the array indexing operator on strings, and does not allow random access to the characters that comprise a multibyte string. But it does define a new iterator named each_char, which works like the standard each_byte but passes each character of the string (as a string instead of as a character code) to the block of code you supply:

$KCODE = "u"        # Specify Unicode UTF-8, or start Ruby with -Ku option
require "jcode"     # Load multibyte character support

mb = "2\303\2272=4" # This is "2×2=4" with a Unicode multiplication sign
mb.length           # => 6: there are 6 bytes in this string
mb.jlength          # => 5: but only 5 characters
mb.mbchar?          # => 1: position of the first multibyte char, or nil
mb.each_byte do |c| # Iterate through the bytes of the string.
  print c, " "      # c is Fixnum
end                 # Outputs "50 195 151 50 61 52 "
mb.each_char do |c| # Iterate through the characters of the string
  print c, " "      # c is a String with jlength 1 and variable length
end                 # Outputs "2 × 2 = 4 "

The jcode library also modifies several existing String methods, such as chop, delete, and tr, to work with multibyte strings.



[3] Use ri to learn more: ri Kernel.sprintf

[*] In Ruby 1.8, setting the deprecated global variable $= to true makes the ==, <, and related comparison operators perform case-insensitive comparisons. You should not do this, however; setting this variable produces a warning message, even if the Ruby interpreter is invoked without the -w flag. And in Ruby 1.9, $= is no longer supported.
