HTML is the primary format used for Web documents. As I said earlier, HTML is a simple standard for describing the semantic content of textual data. The idea of describing a text’s semantics rather than its appearance comes from an older standard called the Standard Generalized Markup Language (SGML). Standard HTML is an application of SGML. Charles Goldfarb began developing SGML at IBM in the mid-1970s. SGML is now an International Organization for Standardization (ISO) standard, specifically ISO 8879:1986.
SGML and, by inheritance, HTML are based on the notion of design by meaning rather than design by appearance. You don’t say that you want some text printed in 18-point type; you say that it is a top-level heading (<H1> in HTML). Likewise, you don’t say that a word should be placed in italics. Rather, you say it should be emphasized (<EM> in HTML). It is left to the browser to determine how best to display headings or emphasized text.
The tags used to mark up the text are case insensitive. Thus <STRONG> is the same as <strong> is the same as <Strong> is the same as <StrONg>. Some tags have a matching closing tag to define a region of text. A closing tag is the same as the opening tag except that the opening angle bracket is followed by a /. For example: <STRONG>this text is strong</STRONG>; <EM>this text is emphasized</EM>. The entire text from the beginning of the start tag to the end of the end tag is called an element. Thus <STRONG>this text is strong</STRONG> is a STRONG element.
HTML elements may nest but they should not overlap. The first line following is standard conforming. The second line is not, though many browsers accept it nonetheless:
<STRONG><EM>Jack and Jill went up the hill</EM></STRONG>
<STRONG><EM>to fetch a pail of water</STRONG></EM>
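The difference between nesting and overlapping can be checked mechanically with a stack of open tag names. The following is only a minimal sketch, not a real HTML parser: the class name NestingChecker and its method are hypothetical, and it assumes every element in the input has an explicit closing tag, which real-world HTML does not guarantee.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any standard API.
class NestingChecker {

    // Matches <NAME ...> or </NAME>. Group 1 is "/" for a closing tag,
    // group 2 is the tag name.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns true if every closing tag matches the most recently opened tag.
    // Assumption: every element has an explicit closing tag.
    static boolean isProperlyNested(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // Tag names are case insensitive, so normalize before comparing.
            String name = m.group(2).toUpperCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                open.push(name);              // opening tag
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false;                 // overlapping or unmatched closing tag
            }
        }
        return open.isEmpty();                // nothing may be left open
    }

    public static void main(String[] args) {
        System.out.println(isProperlyNested(
            "<STRONG><EM>Jack and Jill went up the hill</EM></STRONG>"));
        System.out.println(isProperlyNested(
            "<STRONG><EM>to fetch a pail of water</STRONG></EM>"));
    }
}
```

Because the names are uppercased before comparison, <strong> closed by </StrONg> counts as a match, in keeping with the case insensitivity of HTML tags.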
Some elements have additional attributes that are encoded as name-value pairs on the start tag. The <H1> tag and most other paragraph-level tags may have an ALIGN attribute that says whether the heading should be centered, left aligned, or right aligned.
For example:
<H1 ALIGN=CENTER> This is a centered H1 heading </H1>
The value of an attribute may be enclosed in double or single quotes like this:
<H1 ALIGN="CENTER"> This is a centered H1 heading </H1>
<H2 ALIGN='LEFT'> This is a left-aligned H2 heading </H2>
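Quotes may be omitted only if the value consists solely of letters, digits, hyphens, periods, underscores, and colons; a value that contains embedded spaces or other characters must be quoted. When processing HTML, you need to be prepared for attribute values that do and don’t have quotes.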
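One way to cope with all three forms is a regular expression with one alternative for each: double-quoted, single-quoted, and unquoted. The sketch below is illustrative only. The class name AttributeReader is hypothetical, and the pattern assumes a reasonably tidy start tag; it makes no attempt to recover from unclosed quotes.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any standard API.
class AttributeReader {

    // NAME = "value"  |  NAME = 'value'  |  NAME = value
    // Group 1 is the name; exactly one of groups 2, 3, and 4 holds the value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([A-Za-z][A-Za-z0-9_:.-]*)\\s*=\\s*"
        + "(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Extracts the attributes of a single start tag such as <H1 ALIGN=CENTER>.
    // Attribute names are uppercased because HTML treats them as case insensitive.
    static Map<String, String> parse(String startTag) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(startTag);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2)   // double quoted
                         : m.group(3) != null ? m.group(3)   // single quoted
                         : m.group(4);                       // unquoted
            attributes.put(m.group(1).toUpperCase(Locale.ROOT), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(parse("<H1 ALIGN=CENTER>"));
        System.out.println(parse("<H1 ALIGN=\"CENTER\">"));
        System.out.println(parse("<H2 ALIGN='LEFT'>"));
    }
}
```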
There have been several versions of HTML over the years. The current standard is HTML 4.0, most of which is supported by current web browsers with occasional exceptions. Furthermore, several companies, notably Netscape, Microsoft, and Sun, have added nonstandard extensions to HTML. These include blinking text, inline movies, frames, and, most importantly for this book, applets. Some of these extensions, such as the <APPLET> tag, are allowed but deprecated in HTML 4.0. Others, such as Netscape’s notorious <BLINK>, come out of left field and have no place in a semantically oriented language like HTML.
HTML 4.0 may be the end of the line, aside from minor fixes. The W3C has decreed that HTML is getting too bulky to layer more features on top of. Instead, new development will focus on XML, a semantic language that allows page authors to create the elements they need rather than relying on a few fixed elements such as P and LI. For example, if you’re writing a web page with a price list, you would likely have an SKU element, a PRICE element, a MANUFACTURER element, a PRODUCT element, and so forth. That might look something like this:
<PRODUCT MANUFACTURER="LOTUS">
  <NAME>1-2-3</NAME>
  <VERSION>5.0</VERSION>
  <PLATFORM>Windows</PLATFORM>
  <PRICE CURRENCY="US">299.95</PRICE>
  <SKU>D05WGML</SKU>
</PRODUCT>
This looks a lot like HTML, in much the same way that Java looks like C. There are elements and attributes. Tags are set off by < and >. Attributes are enclosed in quotation marks, and so forth. However, instead of being limited to a finite set of tags, you can create all the new and different tags you need. Since no browser can know in advance all the different elements that may appear, a stylesheet is used to describe how each of the items should be displayed.
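To give a feel for how a program might read such a document, here is a minimal sketch that uses the DOM parser in the javax.xml.parsers package, which ships with current Java releases. The class name ProductReader and the helper childText are hypothetical, and the price-list markup is simply the example from above embedded as a string.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
class ProductReader {

    static final String PRODUCT_XML =
          "<PRODUCT MANUFACTURER=\"LOTUS\">"
        + "<NAME>1-2-3</NAME>"
        + "<VERSION>5.0</VERSION>"
        + "<PLATFORM>Windows</PLATFORM>"
        + "<PRICE CURRENCY=\"US\">299.95</PRICE>"
        + "<SKU>D05WGML</SKU>"
        + "</PRODUCT>";

    // Parses the document and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    // Returns the text of the first descendant element with the given name.
    // Assumes at least one such element exists.
    static String childText(Element parent, String tagName) {
        return parent.getElementsByTagName(tagName).item(0).getTextContent();
    }

    public static void main(String[] args) {
        Element product = parseRoot(PRODUCT_XML);
        System.out.println(product.getAttribute("MANUFACTURER"));
        System.out.println(childText(product, "NAME"));
        System.out.println(childText(product, "PRICE"));
    }
}
```

The parser knows nothing about PRODUCT or PRICE in advance; the program asks for elements by whatever names the document's author chose.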
XML has another advantage over HTML that may not be obvious from this simple example. HTML can be quite sloppy. Elements are opened but not closed. Attribute values may or may not be enclosed in quotes. The quotes may or may not be closed. XML tightens all this up. It lays out very strict requirements for the syntax of a well-formed XML document, and it requires that browsers reject all malformed documents. Browsers may not attempt to fix the problem and make a good-faith effort to display what they think the author meant. They must simply report the error. Furthermore, an XML document may have a Document Type Definition (DTD), which can impose additional constraints on valid documents. For example, a DTD may require that every PRODUCT element contain exactly one NAME element. This has a number of advantages, but the key one here is that XML documents are far easier to parse than HTML documents. As a programmer, you will find it much easier to work with XML than HTML.
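The strictness is easy to observe from Java. In the sketch below, which again assumes the javax.xml.parsers DOM parser and uses a hypothetical class name, WellFormednessCheck, the parser reports a SAXException for any document that is not well formed instead of guessing at what the author meant. It checks well-formedness only; validation against a DTD is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
class WellFormednessCheck {

    // Returns true if the parser accepts the document, false if it reports
    // a well-formedness error. No DTD validation is performed.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // the parser rejected the document outright
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        // Properly nested, quoted attribute value: accepted.
        System.out.println(isWellFormed(
            "<PRICE CURRENCY=\"US\">299.95</PRICE>"));
        // Unquoted attribute value: tolerated in HTML, rejected in XML.
        System.out.println(isWellFormed(
            "<PRICE CURRENCY=US>299.95</PRICE>"));
        // Overlapping elements: tolerated by many browsers, rejected in XML.
        System.out.println(isWellFormed(
            "<STRONG><EM>to fetch a pail of water</STRONG></EM>"));
    }
}
```

The same overlapping markup and unquoted attribute that an HTML processor has to tolerate are simply errors here, which is why the XML case needs no recovery logic.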
XML can be used both for pure XML pages and for embedding new kinds of content in HTML. For example, the Mathematical Markup Language, MathML, is an XML application for including mathematical equations in web pages. SMIL, the Synchronized Multimedia Integration Language, is an XML application for including timed multimedia such as slide shows and subtitled videos on web pages. For much more information about XML, see my book The XML Bible (IDG Books, 1999).