This chapter is fairly dense but should provide a comprehensive introduction to Java syntax. It is written primarily for readers who are new to the language but have some previous programming experience. Determined novices with no prior programming experience may also find it useful. If you already know Java, you should find it a useful language reference. The chapter includes some comparisons of Java to C and C++ for the benefit of programmers coming from those languages.
This chapter documents the syntax of Java programs by starting at the very lowest level of Java syntax and building from there, moving on to increasingly higher orders of structure. It covers:
The characters used to write Java programs and the encoding of those characters.
Literal values, identifiers, and other tokens that comprise a Java program.
The data types that Java can manipulate.
The operators used in Java to group individual tokens into larger expressions.
Statements, which group expressions and other statements to form logical chunks of Java code.
Methods, which are named collections of Java statements that can be invoked by other Java code.
Classes, which are collections of methods and fields. Classes are the central program element in Java and form the basis for object-oriented programming. Chapter 3 is devoted entirely to a discussion of classes and objects.
Packages, which are collections of related classes.
Java programs, which consist of one or more interacting classes that may be drawn from one or more packages.
The syntax of most programming languages is complex, and Java is no exception. In general, it is not possible to document all elements of a language without referring to other elements that have not yet been discussed. For example, it is not really possible to explain in a meaningful way the operators and statements supported by Java without referring to objects. But it is also not possible to document objects thoroughly without referring to the operators and statements of the language. The process of learning Java, or any language, is therefore an iterative one.
Before we begin our bottom-up exploration of Java syntax, let’s take a moment for a top-down overview of a Java program. Java programs consist of one or more files, or compilation units, of Java source code. Near the end of the chapter, we describe the structure of a Java file and explain how to compile and run a Java program. Each compilation unit begins with an optional package declaration followed by zero or more import declarations. These declarations specify the namespace within which the compilation unit will define names, and the namespaces from which the compilation unit imports names. We’ll see package and import again later in this chapter in “Packages and the Java Namespace”.
The optional package and import declarations are followed by zero or more reference type definitions. We will meet the full variety of possible reference types in Chapters 3 and 4, but for now, we should note that these are most often either class or interface definitions.
Within the definition of a reference type, we will encounter members such as fields, methods, and constructors. Methods are the most important kind of member. Methods are blocks of Java code composed of statements.
With these basic terms defined, let’s start by approaching a Java program from the bottom up by examining the basic units of syntax—often referred to as lexical tokens.
This section explains the lexical structure of a Java program. It starts with a discussion of the Unicode character set in which Java programs are written. It then covers the tokens that comprise a Java program, explaining comments, identifiers, reserved words, literals, and so on.
Java programs are written using Unicode. You can use Unicode characters anywhere in a Java program, including comments and identifiers such as variable names. Unlike the 7-bit ASCII character set, which is useful only for English, and the 8-bit ISO Latin-1 character set, which is useful only for major Western European languages, the Unicode character set can represent virtually every written language in common use on the planet.
If you do not use a Unicode-enabled text editor, or if you do not want to force other programmers who view or edit your code to use a Unicode-enabled editor, you can embed Unicode characters into your Java programs using the special Unicode escape sequence \uxxxx: that is, a backslash and a lowercase u, followed by four hexadecimal characters. For example, \u0020 is the space character, and \u03c0 is the character π.
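As a minimal sketch of the escape sequence described above, the following class uses Unicode escapes in a character literal, in a string literal, and in an identifier. The class and method names are hypothetical and chosen only for illustration.

```java
class UnicodeEscapeDemo {
    // \u03c0 is the Greek letter pi, written with a Unicode escape.
    static char pi() {
        return '\u03c0';
    }

    // \u0020 is the space character, so this string is "a b".
    static String spaced() {
        return "a\u0020b";
    }

    // Escapes are translated before the source is tokenized, so they
    // may also appear in identifiers: \u0061 is the letter a.
    static int identifier() {
        int \u0061 = 5;
        return a;
    }

    public static void main(String[] args) {
        System.out.println((int) pi());   // prints 960 (hexadecimal 3c0)
        System.out.println(spaced());     // prints "a b"
        System.out.println(identifier()); // prints 5
    }
}
```

Using escapes inside identifiers is legal but hard to read; the sketch includes it only to show that the translation happens everywhere in the source file.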
Java has invested a large amount of time and engineering effort in ensuring that its Unicode support is first class. If your business application needs to deal with global users, especially in non-Western markets, then the Java platform is a great choice. Java also has support for multiple encodings and character sets, in case applications need to interact with non-Java applications that do not speak Unicode.
Java is a case-sensitive language. Its keywords are written in lowercase and must always be used that way. That is, While and WHILE are not the same as the while keyword. Similarly, if you declare a variable named i in your program, you may not refer to it as I.
In general, relying on case sensitivity to distinguish identifiers is a terrible idea. Do not use it in your own code, and in particular never give an identifier the same name as a keyword but differently cased.
Java ignores spaces, tabs, newlines, and other whitespace, except when it appears within quoted characters and string literals. Programmers typically use whitespace to format and indent their code for easy readability, and you will see common indentation conventions in this book’s code examples.
Comments are natural-language text intended for human readers of a program. They are ignored by the Java compiler. Java supports three types of comments. The first type is a single-line comment, which begins with the characters // and continues until the end of the current line. For example:
    int i = 0;   // Initialize the loop variable
The second kind of comment is a multiline comment. It begins with the characters /* and continues, over any number of lines, until the characters */. Any text between the /* and the */ is ignored by javac. Although this style of comment is typically used for multiline comments, it can also be used for single-line comments. This type of comment cannot be nested (i.e., one /* */ comment cannot appear within another). When writing multiline comments, programmers often use extra * characters to make the comments stand out. Here is a typical multiline comment:
    /*
     * First, establish a connection to the server.
     * If the connection attempt fails, quit right away.
     */
The third type of comment is a special case of the second. If a comment begins with /**, it is regarded as a special doc comment. Like regular multiline comments, doc comments end with */ and cannot be nested. When you write a Java class you expect other programmers to use, provide doc comments to embed documentation about the class and each of its methods directly into the source code. A program named javadoc extracts these comments and processes them to create online documentation for your class. A doc comment can contain HTML tags and can use additional syntax understood by javadoc. For example:
    /**
     * Upload a file to a web server.
     *
     * @param file The file to upload.
     * @return <tt>true</tt> on success,
     *         <tt>false</tt> on failure.
     * @author David Flanagan
     */
See Chapter 7 for more information on the doc comment syntax and Chapter 13 for more information on the javadoc program.
Comments may appear between any tokens of a Java program, but may not appear within a token. In particular, comments may not appear within double-quoted string literals. A comment within a string literal simply becomes a literal part of that string.
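The rule above can be illustrated with a short sketch: a comment placed between two tokens is ignored, while comment-like text inside a string literal is kept as ordinary characters. The class and method names are hypothetical.

```java
class CommentDemo {
    // A comment may sit between any two tokens of an expression.
    static int sum() {
        return 1 + /* ignored by the compiler */ 2;
    }

    // Inside a string literal, comment delimiters are just characters.
    static String text() {
        String s = "/* not a comment */ // still text"; // this part is a real comment
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum());  // prints 3
        System.out.println(text()); // prints the string, delimiters included
    }
}
```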
The following words are reserved in Java (they are part of the syntax of the language and may not be used to name variables, classes, and so forth):
    abstract   const      final        int         public         throw
    assert     continue   finally      interface   return         throws
    boolean    default    float        long        short          transient
    break      do         for          native      static         true
    byte       double     goto         new         strictfp       try
    case       else       if           null        super          void
    catch      enum       implements   package     switch         volatile
    char       extends    import       private     synchronized   while
    class      false      instanceof   protected   this
Of these, true, false, and null are technically literals.
The sequence var is not a keyword, but instead indicates that the type of a local variable should be type-inferred.
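A small sketch of what this means in practice, assuming Java 10 or later (the first release with local variable type inference). Because var is not a reserved word, it can still be used as a variable name. The class and method names are hypothetical.

```java
import java.util.ArrayList;

class VarDemo {
    static int demo() {
        // The first var requests type inference (int here); the second
        // var is simply the variable's name, which remains legal.
        var var = 1;

        // The compiler infers ArrayList<String> from the initializer.
        var list = new ArrayList<String>();
        list.add("x");

        return var + list.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```

Naming a variable var is shown only to make the point about reserved words; it is not recommended style.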
The character sequence consisting of a single underscore, _, is also disallowed as an identifier.
There are also 10 restricted keywords which are only considered keywords within the context of declaring a Java platform module.
We’ll meet each of these reserved words again later in this book. Some of them are the names of primitive types and others are the names of Java statements, both of which are discussed later in this chapter. Still others are used to define classes and their members (see Chapter 3).
Note that const and goto are reserved but aren’t actually used in the language, and that interface has an additional variant form, @interface, which is used when defining types known as annotations. Some of the reserved words (notably final and default) have a variety of meanings depending on context.
An identifier is simply a name given to some part of a Java program, such as a class, a method within a class, or a variable declared within a method. Identifiers may be of any length and may contain letters and digits drawn from the entire Unicode character set. An identifier may not begin with a digit.
In general, identifiers may not contain punctuation characters. Exceptions include the dollar sign ($) as well as other Unicode currency symbols such as £ and ¥.
The ASCII underscore (_) also deserves special mention.
Originally, the underscore could be freely used as an identifier, or part of one.
However, in recent versions of Java (Java 9 and later, including Java 11), the underscore is no longer legal as a complete identifier by itself, although it can still appear as part of a longer identifier. This is to support an expected forthcoming language feature whereby the underscore will acquire a special new syntactic meaning.
Currency symbols are intended for use in automatically generated source code, such as code produced by javac. By avoiding the use of currency symbols in your own identifiers, you don’t have to worry about collisions with automatically generated identifiers.
The usual Java convention is to name variables using camel case. This means that the first letter of a variable name should be lowercase, but that the first letter of any other words in the identifier should be uppercase.
Formally, the characters allowed at the beginning of and within an identifier are defined by the methods isJavaIdentifierStart() and isJavaIdentifierPart() of the class java.lang.Character.
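The two methods named above can be combined into a rough check of whether a string has the shape of an identifier. This is only a sketch under stated limits: the helper name is hypothetical, it does not reject reserved words or the lone underscore, and it examines char values rather than full code points.

```java
class IdentifierCheck {
    // Returns true if the first character may start an identifier and
    // every remaining character may appear within one.
    static boolean hasIdentifierShape(String name) {
        if (name.isEmpty() || !Character.isJavaIdentifierStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!Character.isJavaIdentifierPart(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasIdentifierShape("theCurrentTime")); // true
        System.out.println(hasIdentifierShape("x1"));             // true
        System.out.println(hasIdentifierShape("1x"));             // false: starts with a digit
        System.out.println(hasIdentifierShape("a-b"));            // false: punctuation
        System.out.println(hasIdentifierShape("$price"));         // true: currency symbol
    }
}
```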
The following are examples of legal identifiers:
i
x1
theCurrentTime
current
獺
Note in particular the example of a non-ASCII identifier, 獺. This is the Kanji character for “otter” and is perfectly legal as a Java identifier. The usage of non-ASCII identifiers is unusual in programs predominantly written by Westerners, but is sometimes seen.
Literals are sequences of source characters that directly represent constant values that appear as-is in Java source code.
They include integer and floating-point numbers, single characters within single quotes, strings of characters within double quotes, and the reserved words true, false, and null. For example, the following are all literals:
1
1.0
'1'
1L
"one"
true
false
null
The syntax for expressing numeric, character, and string literals is detailed in “Primitive Data Types”.
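To make the kinds of literal listed above concrete, here is a minimal sketch that stores each one in an Object array (which boxes the primitive values) and reports the resulting wrapper type. The class and helper names are hypothetical.

```java
class LiteralTypes {
    // Reports the runtime class of a boxed literal, or "null" for null.
    static String typeOf(Object value) {
        return value == null ? "null" : value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        // Each primitive literal is boxed into its wrapper type here.
        Object[] literals = { 1, 1.0, '1', 1L, "one", true, false, null };
        for (Object literal : literals) {
            System.out.println(literal + " -> " + typeOf(literal));
        }
    }
}
```

The output shows, for instance, that 1 is an int literal (boxed as Integer) while 1L is a long and 1.0 is a double.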
Java also uses a number of punctuation characters as tokens. The Java Language Specification divides these characters (somewhat arbitrarily) into two categories, separators and operators. The 12 separators are:
    (   )   {   }   [   ]   ...   @   ::   ;   ,   .

The operators are:

    +    -    *    /    %    &    |    ^    <<    >>    >>>
    +=   -=   *=   /=   %=   &=   |=   ^=   <<=   >>=   >>>=
    =    ==   !=   <    <=   >    >=
    !    ~    &&   ||   ++   --   ?    :    ->
We’ll see separators throughout the book, and will cover each operator individually in “Expressions and Operators”.
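As a brief illustration of some of the less familiar tokens in these lists, the following sketch uses the ... and :: separators and the -> operator. The class and method names are hypothetical, and the features themselves are covered properly later in the book.

```java
import java.util.function.IntBinaryOperator;

class TokenDemo {
    // The ... separator declares a variable-arity (varargs) parameter.
    static int sum(int... values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // The :: separator forms a method reference.
    static int larger(int a, int b) {
        IntBinaryOperator max = Math::max;
        return max.applyAsInt(a, b);
    }

    // The -> operator separates a lambda's parameters from its body.
    static int difference(int a, int b) {
        IntBinaryOperator diff = (x, y) -> x - y;
        return diff.applyAsInt(a, b);
    }

    public static void main(String[] args) {
        System.out.println(sum(1, 2, 3));     // prints 6
        System.out.println(larger(3, 7));     // prints 7
        System.out.println(difference(7, 3)); // prints 4
    }
}
```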
Java supports eight basic data types known as primitive types as described in Table 2-1. The primitive types include a Boolean type, a character type, four integer types, and two floating-point types. The four integer types and the two floating-point types differ in the number of bits that represent them and therefore in the range of numbers they can represent.
Type | Contains | Default | Size | Range |
---|---|---|---|---|
boolean | true or false | false | 1 bit | NA |
char | Unicode character | \u0000 | 16 bits | \u0000 to \uFFFF |
byte | Signed integer | 0 | 8 bits | -128 to 127 |
short | Signed integer | 0 | 16 bits | -32768 to 32767 |
int | Signed integer | 0 | 32 bits | -2147483648 to 2147483647 |
long | Signed integer | 0 | 64 bits | -9223372036854775808 to 9223372036854775807 |
float | IEEE 754 floating point | 0.0 | 32 bits | 1.4E-45 to 3.4028235E+38 |
double | IEEE 754 floating point | 0.0 | 64 bits | 4.9E-324 to 1.7976931348623157E+308 |
The next section summarizes these primitive data types. In addition to these primitive types, Java supports nonprimitive data types known as reference types, which are introduced in “Reference Types”.
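The sizes and ranges in Table 2-1 can be checked against the constants that the wrapper classes in java.lang expose. The sketch below prints a few of them and also shows the default values that fields receive; the class, field, and helper names are hypothetical.

```java
class PrimitiveRanges {
    // Fields, unlike local variables, are initialized to the default
    // value for their type.
    static boolean defaultBoolean;
    static char defaultChar;
    static int defaultInt;
    static double defaultDouble;

    // Formats one summary line for a type.
    static String describe(String type, int bits, Object min, Object max) {
        return type + ": " + bits + " bits, " + min + " to " + max;
    }

    public static void main(String[] args) {
        System.out.println(describe("byte", Byte.SIZE, Byte.MIN_VALUE, Byte.MAX_VALUE));
        System.out.println(describe("short", Short.SIZE, Short.MIN_VALUE, Short.MAX_VALUE));
        System.out.println(describe("int", Integer.SIZE, Integer.MIN_VALUE, Integer.MAX_VALUE));
        System.out.println(describe("long", Long.SIZE, Long.MIN_VALUE, Long.MAX_VALUE));
        // For float and double, MIN_VALUE is the smallest positive value.
        System.out.println(describe("float", Float.SIZE, Float.MIN_VALUE, Float.MAX_VALUE));
        System.out.println(describe("double", Double.SIZE, Double.MIN_VALUE, Double.MAX_VALUE));
        System.out.println(defaultBoolean + " " + (int) defaultChar + " "
                + defaultInt + " " + defaultDouble); // prints "false 0 0 0.0"
    }
}
```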
The boolean type represents truth values. This type has only two possible values, representing the two Boolean states: on or off, yes or no, true or false. Java reserves the words true and false to represent these two Boolean values.
Programmers coming to Java from other languages (especially JavaScript and C) should note that Java is much stricter about its Boolean values than other languages; in particular, a boolean is neither an integral nor an object type, and incompatible values cannot be used in place of a boolean. In other words, you cannot take shortcuts such as the following in Java:
    Object o = new Object();
    int i = 1;

    if (o) {
      while (i) {
        //...
      }
    }
Instead, Java forces you to write cleaner code by explicitly stating the comparisons you want:
    if (o != null) {
      while (i != 0) {
        // ...
      }
    }
The char type represents Unicode characters. Java has a somewhat unusual approach to representing characters: javac accepts identifiers and literals as UTF-8 (a variable-width encoding) in input.
Internally, however, a char is a fixed-width 16-bit value. From Java 9 onward, strings whose characters all fit in ISO-8859-1 (an 8-bit encoding, used for Western European languages, also called Latin-1) are stored in that more compact form where possible.
This distinction between external and internal representation does not normally need to concern the developer. In most cases, all that is required is to remember the rule that to include a character literal in a Java program, simply place it between single quotes (apostrophes):
    char c = 'A';
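The difference between the variable-width external encoding and the fixed-width char can be observed directly. The following is a small sketch, with hypothetical class and method names, that compares the number of UTF-8 bytes a string needs with the number of char values it contains.

```java
import java.nio.charset.StandardCharsets;

class CharWidthDemo {
    // Number of bytes needed to encode the string as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);       // prints 16: a char is 16 bits wide
        System.out.println(utf8Length("A"));      // prints 1: ASCII takes one byte in UTF-8
        System.out.println(utf8Length("\u03c0")); // prints 2: pi takes two bytes in UTF-8
        System.out.println("\u03c0".length());    // prints 1: but it is a single char
    }
}
```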
You can, of course, use any Unicode character as a character literal, and you can use the \u Unicode escape sequence. In addition, Java supports a number of other escape sequences that make it easy both to represent commonly used nonprinting ASCII characters, such as newline, and to escape certain punctuation characters that have special meaning in Java. For example:
    char tab = '\t', nul = '\000', aleph = '\u05D0', slash = '\\';
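A runnable sketch of the same declarations, printing the numeric value behind each escaped character. The class and constant names are hypothetical.

```java
class EscapeDemo {
    static final char TAB = '\t';        // horizontal tab
    static final char NUL = '\000';      // octal escape for the null character
    static final char ALEPH = '\u05D0';  // Unicode escape for the Hebrew letter aleph
    static final char BACKSLASH = '\\';  // a single backslash
    static final char QUOTE = '\'';      // a single quote

    public static void main(String[] args) {
        System.out.println((int) TAB);       // prints 9
        System.out.println((int) NUL);       // prints 0
        System.out.println((int) ALEPH);     // prints 1488 (hexadecimal 5D0)
        System.out.println((int) BACKSLASH); // prints 92
        System.out.println((int) QUOTE);     // prints 39
        // Escapes count as one character each inside string literals too.
        System.out.println("a\tb".length()); // prints 3
    }
}
```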
Table 2-2 lists the escape characters that can be used in char literals. These characters can also be used in string literals, which are covered in the next section.
Escape sequence | Character value |
---|---|
\b | Backspace |
\t | Horizontal tab |
\n | Newline |
\f | Form feed |
\r | Carriage return |
\" | Double quote |
\' | Single quote |
\\ | Backslash |
\xxx | The Latin-1 character with the encoding xxx |