Lexical Structure

The Ruby interpreter parses a program as a sequence of tokens. Tokens include comments, literals, punctuation, identifiers, and keywords. This section introduces these types of tokens and also covers the characters that make up the tokens and the whitespace that separates them.

Comments

Comments in Ruby begin with a # character and continue to the end of the line. The Ruby interpreter ignores the # character and any text that follows it (but does not ignore the newline character, which is meaningful whitespace and may serve as a statement terminator). If a # character appears within a string or regular expression literal (see Chapter 3), then it is simply part of the string or regular expression and does not introduce a comment:

# This entire line is a comment
x = "#This is a string"               # And this is a comment
y = /#This is a regular expression/   # Here's another comment

Multiline comments are usually written simply by beginning each line with a separate # character:

#
# This class represents a Complex number
# Despite its name, it is not complex at all.
#

Note that Ruby has no equivalent of the C-style /*...*/ comment. There is no way to embed a comment in the middle of a line of code.

Embedded documents

Ruby supports another style of multiline comment known as an embedded document. These start on a line that begins =begin and continue until (and include) a line that begins =end. Any text that appears after =begin or =end is part of the comment and is also ignored, but that extra text must be separated from the =begin and =end by at least one space.

Embedded documents are a convenient way to comment out long blocks of code without prefixing each line with a # character:

=begin Someone needs to fix the broken code below!
    Any code here is commented out
=end

Note that embedded documents only work if the = signs are the first characters of each line:

# =begin This used to begin a comment. Now it is itself commented out!
    The code that goes here is no longer commented out
# =end

As their name implies, embedded documents can be used to include long blocks of documentation within a program, or to embed source code of another language (such as HTML or SQL) within a Ruby program. Embedded documents are usually intended to be used by some kind of postprocessing tool that is run over the Ruby source code, and it is typical to follow =begin with an identifier that indicates which tool the comment is intended for.

Documentation comments

Ruby programs can include embedded API documentation as specially formatted comments that precede method, class, and module definitions. You can browse this documentation using the ri tool described earlier in Viewing Ruby Documentation with ri. The rdoc tool extracts documentation comments from Ruby source and formats them as HTML or prepares them for display by ri. Documentation of the rdoc tool is beyond the scope of this book; see the file lib/rdoc/README in the Ruby source code for details.

Documentation comments must come immediately before the module, class, or method whose API they document. They are usually written as multiline comments where each line begins with #, but they can also be written as embedded documents that start =begin rdoc. (The rdoc tool will not process these comments if you leave out the “rdoc”.)

The following example comment demonstrates the most important formatting elements of the markup grammar used in Ruby’s documentation comments; a detailed description of the grammar is available in the README file mentioned previously:

#
# Rdoc comments use a simple markup grammar like those used in wikis.
# 
# Separate paragraphs with a blank line.
# 
# = Headings
# 
# Headings begin with an equals sign
# 
# == Sub-Headings
# The line above produces a subheading.
# === Sub-Sub-Heading
# And so on.
# 
# = Examples
# 
#   Indented lines are displayed verbatim in code font.
#     Be careful not to indent your headings and lists, though.
# 
# = Lists and Fonts
# 
# List items begin with * or -. Indicate fonts with punctuation or HTML:
# * _italic_ or <i>multi-word italic</i>
# * *bold* or <b>multi-word bold</b>
# * +code+ or <tt>multi-word code</tt>
# 
# 1. Numbered lists begin with numbers.
# 99. Any number will do; they don't have to be sequential.
# 1. There is no way to do nested lists.
# 
# The terms of a description list are bracketed:
# [item 1]  This is a description of item 1
# [item 2]  This is a description of item 2
# 
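As noted earlier, a documentation comment sits immediately before the definition it describes. The following is a minimal sketch of that placement; the area method and its parameters are hypothetical and chosen only for illustration:

```ruby
# Returns the area of a rectangle.
#
# +width+ and +height+ are expected to be non-negative numbers.
#
# Typical use:
#
#   area(2, 3)
#
def area(width, height)   # hypothetical method, not part of any library
  width * height
end
```

No blank line separates the comment from the def line, so the comment is associated with the method that follows it.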

Literals

Literals are values that appear directly in Ruby source code. They include numbers, strings of text, and regular expressions. (Other literals, such as array and hash values, are not individual tokens but are more complex expressions.) Ruby number and string literal syntax is actually quite complicated, and is covered in detail in Chapter 3. For now, an example suffices to illustrate what Ruby literals look like:

1                      # An integer literal
1.0                    # A floating-point literal
'one'                  # A string literal
"two"                  # Another string literal
/three/                # A regular expression literal

Punctuation

Ruby uses punctuation characters for a number of purposes. Most Ruby operators are written using punctuation characters, such as + for addition, * for multiplication, and || for the Boolean OR operation. See Operators for a complete list of Ruby operators. Punctuation characters also serve to delimit string, regular expression, array, and hash literals, and to group and separate expressions, method arguments, and array indexes. We’ll see miscellaneous other uses of punctuation scattered throughout Ruby syntax.

Identifiers

An identifier is simply a name. Ruby uses identifiers to name variables, methods, classes, and so forth. Ruby identifiers consist of letters, numbers, and underscore characters, but they may not begin with a number. Identifiers may not include whitespace or nonprinting characters, and they may not include punctuation characters except as described here.

Identifiers that begin with a capital letter A–Z are constants, and the Ruby interpreter will issue a warning (but not an error) if you alter the value of such an identifier. Class and module names must begin with initial capital letters. The following are identifiers:

i
x2
old_value
_internal    # Identifiers may begin with underscores
PI           # Constant

By convention, multiword identifiers that are not constants are written with underscores like_this, whereas multiword constants are written LikeThis or LIKE_THIS.

Case sensitivity

Ruby is a case-sensitive language. Lowercase letters and uppercase letters are distinct. The keyword end, for example, is completely different from the keyword END.
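A short sketch of what case sensitivity means for ordinary names. The variable names are arbitrary and chosen only to differ in the case of some letters:

```ruby
total = 1     # a local variable
tOTAL = 2     # a different local variable: only the case of some letters differs

puts total    # prints 1
puts tOTAL    # prints 2

# Symbols are case-sensitive in the same way.
puts :end == :END   # prints false
```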

Unicode characters in identifiers

Ruby’s rules for forming identifiers are defined in terms of ASCII characters that are not allowed. In general, all characters outside of the ASCII character set are valid in identifiers, including characters that appear to be punctuation. In a UTF-8 encoded file, for example, the following Ruby code is valid:

def ×(x,y)  # The name of this method is the Unicode multiplication sign
  x*y       # The body of this method multiplies its arguments
end         

Similarly, a Japanese programmer writing a program encoded in SJIS or EUC can include Kanji characters in her identifiers. See Specifying Program Encoding for more about writing Ruby programs using encodings other than ASCII.

The special rules about forming identifiers are based on ASCII characters and are not enforced for characters outside of that set. An identifier may not begin with an ASCII digit, for example, but it may begin with a digit from a non-Latin alphabet. Similarly, an identifier must begin with an ASCII capital letter in order to be considered a constant. The identifier Å, for example, is not a constant.

Two identifiers are the same only if they are represented by the same sequence of bytes. Some character sets, such as Unicode, have more than one codepoint that represents the same character. No Unicode normalization is performed in Ruby, and two distinct codepoints are treated as distinct characters, even if they have the same meaning or are represented by the same font glyph.
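The effect can be sketched with strings, since an identifier is compared byte by byte in the same way. This example assumes an interpreter that supports \u escapes in double-quoted strings (Ruby 1.9 or later); the variable names are illustrative:

```ruby
composed   = "\u00E9"     # U+00E9: "e" with acute accent as a single codepoint
decomposed = "e\u0301"    # U+0065 followed by U+0301, a combining acute accent

# Both usually render as the same glyph, but the byte sequences differ.
puts composed.bytesize                      # prints 2
puts decomposed.bytesize                    # prints 3
puts composed == decomposed                 # prints false

# As names, they are therefore two different identifiers.
puts composed.to_sym == decomposed.to_sym   # prints false
```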

Punctuation in identifiers

Punctuation characters may appear at the start and end of Ruby identifiers. They have the following meanings:

$  Global variables are prefixed with a dollar sign. Following Perl’s example, Ruby defines a number of global variables that include other punctuation characters, such as $_ and $-K. See Chapter 10 for a list of these special globals.
@  Instance variables are prefixed with a single at sign, and class variables are prefixed with two at signs. Instance variables and class variables are explained in Chapter 7.
?  As a helpful convention, methods that return Boolean values often have names that end with a question mark.
!  Method names may end with an exclamation point to indicate that they should be used cautiously. This naming convention is often used to distinguish mutator methods that alter the object on which they are invoked from variants that return a modified copy of the original object.
=  Methods whose names end with an equals sign can be invoked by placing the method name, without the equals sign, on the left side of an assignment operator. (You can read more about this in Assigning to Attributes and Array Elements and Accessors and Attributes.)

Here are some example identifiers that contain leading or trailing punctuation characters:

$files          # A global variable
@data           # An instance variable
@@counter       # A class variable
empty?          # A Boolean-valued method or predicate
sort!           # An in-place alternative to the regular sort method
timeout=        # A method invoked by assignment
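The trailing-punctuation conventions can be sketched in one small class. Label and all of its methods are hypothetical names invented for this illustration:

```ruby
# Hypothetical class, for illustration only.
class Label
  attr_reader :text

  def initialize(text)
    @text = text            # instance variable, prefixed with @
  end

  # Predicate: the trailing ? signals a Boolean result.
  def blank?
    @text.strip.empty?
  end

  # Returns a modified copy; the receiver is unchanged.
  def shout
    Label.new(@text.upcase)
  end

  # Mutator: the trailing ! signals that the receiver itself is altered.
  def shout!
    @text = @text.upcase
    self
  end

  # Setter: the trailing = allows invocation through assignment syntax.
  def text=(value)
    @text = value.to_s
  end
end

label = Label.new("draft")
copy  = label.shout     # label.text is still "draft"
label.shout!            # label.text is now "DRAFT"
label.text = :final     # invokes the text= method; label.text is now "final"
```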

A number of Ruby’s operators are implemented as methods, so that classes can redefine them for their own purposes. It is therefore possible to use certain operators as method names as well. In this context, the punctuation character or characters of the operator are treated as identifiers rather than operators. See Operators for more about Ruby’s operators.
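As a rough sketch of an operator used as a method name, the hypothetical Money class below defines + and ==. The class and its cents attribute are assumptions made for this example only:

```ruby
# Hypothetical class, for illustration only.
class Money
  attr_reader :cents

  def initialize(cents)
    @cents = cents
  end

  # The method name here is the + operator itself.
  def +(other)
    Money.new(cents + other.cents)
  end

  # Likewise, == is an ordinary method that a class may define.
  def ==(other)
    other.is_a?(Money) && cents == other.cents
  end
end

total = Money.new(150) + Money.new(250)    # operator syntax
same  = Money.new(150).+(Money.new(250))   # the same call, written as a method invocation

puts total.cents     # prints 400
puts total == same   # prints true
```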

Keywords

The following keywords have special meaning in Ruby and are treated specially by the Ruby parser:

__LINE__      case         ensure       not          then
__ENCODING__  class        false        or           true
__FILE__      def          for          redo         undef
BEGIN         defined?     if           rescue       unless
END           do           in           retry        until
alias         else         module       return       when
and           elsif        next         self         while
begin         end          nil          super        yield
break

In addition to those keywords, there are three keyword-like tokens that are treated specially by the Ruby parser when they appear at the beginning of a line:

=begin    =end      __END__

As we’ve seen, =begin and =end at the beginning of a line delimit multiline comments. And the token __END__ marks the end of the program (and the beginning of a data section) if it appears on a line by itself with no leading or trailing whitespace.

In most languages, these words would be called “reserved words” and would never be allowed as identifiers. The Ruby parser is flexible and does not complain if you prefix these keywords with @, @@, or $ and use them as instance, class, or global variable names. Also, you can use these keywords as method names, with the caveat that the method must always be explicitly invoked through an object. Note, however, that using these keywords in identifiers will result in confusing code. The best practice is to treat these keywords as reserved.
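The flexibility described above can be sketched as follows. The Rule class is hypothetical, and the code is shown to illustrate what the parser accepts, not as a recommended style:

```ruby
# Hypothetical class, for illustration only.
class Rule
  def initialize(condition)
    @if = condition     # a keyword used as an instance variable name
  end

  # A keyword used as a method name.
  def if
    @if
  end
end

rule = Rule.new("x > 0")
puts rule.if            # prints x > 0; the method is invoked through an explicit receiver
```

Inside the class, the method would likewise have to be called as self.if, because a bare if is parsed as the keyword.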

Many important features of the Ruby language are actually implemented as methods of the Kernel module and of the Module, Class, and Object classes. It is good practice, therefore, to treat the following identifiers as reserved words as well:

# These are methods that appear to be statements or keywords
at_exit        catch          private        require
attr           include        proc           throw
attr_accessor  lambda         protected
attr_reader    load           public
attr_writer    loop           raise

# These are commonly used global functions
Array          chomp!         gsub!          select
Float          chop           iterator?      sleep
Integer        chop!          load           split
String         eval           open           sprintf
URI            exec           p              srand
abort          exit           print          sub
autoload       exit!          printf         sub!
autoload?      fail           putc           syscall
binding        fork           puts           system
block_given?   format         rand           test
callcc         getc           readline       trap
caller         gets           readlines      warn
chomp          gsub           scan

# These are commonly used object methods
allocate       freeze         kind_of?       superclass
clone          frozen?        method         taint
display        hash           methods        tainted?
dup            id             new            to_a
enum_for       inherited      nil?           to_enum
eql?           inspect        object_id      to_s
equal?         instance_of?   respond_to?    untaint
extend         is_a?          send           
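One way to confirm that some of these names are ordinary methods and not keywords is to ask Ruby where they are defined. This is a small sketch using reflection methods from the core library:

```ruby
# defined? reports how the parser sees a name.
puts defined?(puts)     # prints method
puts defined?(raise)    # prints method

# Object#method and Module#instance_method report where a method lives.
puts method(:puts).owner                             # prints Kernel
puts method(:raise).owner                            # prints Kernel
puts Module.instance_method(:attr_accessor).owner    # prints Module
```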

Whitespace

Spaces, tabs, and newlines are not tokens themselves but are used to separate tokens that would otherwise merge into a single token. Aside from this basic token-separating function, most whitespace is ignored by the Ruby interpreter and is simply used to format programs so that they are easy to read and understand. Not all whitespace is ignored, however. Some is required, and some whitespace is actually forbidden. Ruby’s grammar is expressive but complex, and there are a few cases in which inserting or removing whitespace can change the meaning of a program. Although these cases do not often arise, it is important to know about them.

Newlines as statement terminators

The most common form of whitespace dependency has to do with newlines as statement terminators. In languages like C and Java, every statement must be terminated with a semicolon. You can use semicolons to terminate statements in Ruby, too, but this is only required if you put more than one statement on the same line. Convention dictates that semicolons be omitted elsewhere.

Without explicit semicolons, the Ruby interpreter must figure out on its own where statements end. If the Ruby code on a line is a syntactically complete statement, Ruby uses the newline as the statement terminator. If the statement is not complete, then Ruby continues parsing the statement on the next line. (In Ruby 1.9, there is one exception, which is described later in this section.)

This is no problem if all your statements fit on a single line. When they don’t, however, you must take care that you break the line in such a way that the Ruby interpreter cannot interpret the first line as a statement of its own. This is where the whitespace dependency lies: your program may behave differently depending on where you insert a newline. For example, the following code adds x and y and assigns the sum to total:

total = x +     # Incomplete expression, parsing continues
  y

But this code assigns x to total, and then evaluates y, doing nothing with it:

total = x  # This is a complete expression
  + y      # A useless but complete expression
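The difference between the two layouts can be checked directly. In this sketch the two method names are hypothetical and exist only to wrap each layout so that its result can be compared:

```ruby
# Operator at the end of the line: the expression is incomplete, so parsing continues.
def sum_with_trailing_operator(x, y)
  total = x +
    y
  total
end

# Operator at the start of the next line: the first line is already complete.
def sum_with_leading_operator(x, y)
  total = x
    + y        # evaluated as unary plus applied to y, and the result is discarded
  total
end

puts sum_with_trailing_operator(1, 2)   # prints 3
puts sum_with_leading_operator(1, 2)    # prints 1
```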

As another example, consider the return and break statements. These statements may optionally be followed by an expression that provides a return value. A newline between the keyword and the expression will terminate the statement before the expression.
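A minimal sketch of this behavior with return; the method names are hypothetical:

```ruby
def answer_on_same_line
  return 42
end

def answer_on_next_line
  return      # the newline terminates the statement, so nil is returned
    42        # never evaluated
end

p answer_on_same_line   # prints 42
p answer_on_next_line   # prints nil
```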

You can safely insert a newline without fear of prematurely terminating your statement after an operator or after a period or comma in a method invocation, array literal, or hash literal.
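For instance, the following sketch breaks lines after commas in an array literal and after periods in a method chain. The doubled_sum method is a hypothetical name used only for illustration:

```ruby
def doubled_sum
  numbers = [1,      # a newline after a comma is safe inside an array literal
             2,
             3]

  numbers.           # a trailing period leaves the expression incomplete
    map { |n| n * 2 }.
    reduce(0) { |acc, n| acc + n }
end

puts doubled_sum     # prints 12
```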

You can also escape a line break with a backslash, which prevents Ruby from automatically terminating the statement:

total = first_long_variable_name + second_long_variable_name \
  + third_long_variable_name # Note no statement terminator above

In Ruby 1.9, the statement terminator rules change slightly. If the first nonspace character on a line is a period, then the line is considered a continuation line, and the newline before it is not a statement terminator. Lines that start with periods are useful for the long method chains sometimes used with “fluent APIs,” in which each method invocation returns an object on which additional invocations can be made. For example:

animals = Array.new
  .push("dog")   # Does not work in Ruby 1.8
  .push("cow")
  .push("cat")
  .sort

Spaces and method invocations

Ruby’s grammar allows the parentheses around method invocations to be omitted in certain circumstances. This allows Ruby methods to be used as if they were statements, which is an important part of Ruby’s elegance. Unfortunately, however, it opens up a pernicious whitespace dependency. Consider the following two lines, which differ only by a single space:

f(3+2)+1
f (3+2)+1

The first line passes the value 5 to the function f and then adds 1 to the result. Since the second line has a space after the function name, Ruby assumes that the parentheses around the method call have been omitted. The parentheses that appear after the space are used to group a subexpression, but the entire expression (3+2)+1 is used as the method argument. If warnings are enabled (with -w), Ruby issues a warning whenever it sees ambiguous code like this.
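To make the difference concrete, here is a sketch with a hypothetical definition of f that multiplies its argument by 10. The results shown assume Ruby 1.9 or later:

```ruby
def f(n)       # hypothetical method, for illustration only
  n * 10
end

without_space = f(3+2)+1     # parsed as f(5) + 1
with_space    = f (3+2)+1    # parsed as f((3+2)+1)

puts without_space   # prints 51
puts with_space      # prints 60
```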

The solution to this whitespace dependency is straightforward:

  • Never put a space between a method name and the opening parenthesis.

  • If the first argument to a method begins with an open parenthesis, always use parentheses in the method invocation. For example, write f((3+2)+1).

  • Always run the Ruby interpreter with the -w option so it will warn you if you forget either of the rules above!
