Chapter 2. Parsing XML

"Short for eXtensible Markup Language, XML can convert data into an understandable format for disparate systems, in much the same way that Morse code is used by disparate telegraph operators to transmit their messages in an easily understood format."

Patrick T. Coleman, "The Morse Code of Data—XML," SunExpert Magazine, Feb. 1999

IN THIS CHAPTER

This chapter offers a detailed tour of, and tutorial on, Java XML parsers. We begin with the theory behind parsers and the parsing strategies implemented in the current crop of parsers. With these broad concepts and categories in hand, we examine the key APIs in XML parsing: SAX and SAX2. Once we know the key APIs, we survey the parser implementations currently available from the leading commercial vendors, along with popular open source alternatives.

Parsing is the process of dissecting a body of text into its individual component pieces. For example, if that body of text is a paragraph, parsing would break the paragraph into sentences. It would then break each sentence into subject and predicate. In turn, the subject and predicate would be broken down into their components, such as nouns, verbs, and adjectives. Parsing is a highly developed and formalized discipline in computer science, and it is needed in many places. Translating (or compiling) high-level languages (Java, C, C++, Smalltalk, Lisp, and so on) to low-level languages (assembly) requires parsing. Command lines are parsed by programs. Batched commands are parsed in scripting languages. Numerous other languages, such as the Structured Query Language (SQL) and the HyperText Markup Language (HTML), also require parsing.

In computer science, parsing is divided into two distinct activities: lexical analysis and grammatical analysis.

Lexical analysis breaks the body of text into tokens. Tokens are the smallest meaningful components of the stream of data. In the paragraph example, the tokens would be words. In a computer program, tokens are keywords (such as while and for), literals (such as 23 or "mike"), and identifiers (such as myInt and employeeName).
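To make lexical analysis concrete, the following is a minimal sketch of a scanner that sorts the characters of a program fragment into keywords, literals, and identifiers. The class name SimpleScanner, the Kind and Token types, the SYMBOL category for punctuation, and the small keyword set are all assumptions made for illustration; they do not come from any particular parser library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleScanner {
    // Hypothetical token categories; SYMBOL is an added catch-all for punctuation.
    public enum Kind { KEYWORD, LITERAL, IDENTIFIER, SYMBOL }

    public static final class Token {
        public final Kind kind;
        public final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // A deliberately tiny keyword set, assumed for this example only.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("while", "for", "if", "else"));

    public static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace only separates tokens
            } else if (Character.isDigit(c)) {
                int start = i;                         // integer literal such as 23
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token(Kind.LITERAL, input.substring(start, i)));
            } else if (c == '"') {
                int end = input.indexOf('"', i + 1);   // string literal such as "mike"
                if (end < 0) throw new IllegalArgumentException("Unterminated string at " + i);
                tokens.add(new Token(Kind.LITERAL, input.substring(i, end + 1)));
                i = end + 1;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;                         // keyword or identifier
                while (i < input.length() && Character.isJavaIdentifierPart(input.charAt(i))) i++;
                String word = input.substring(start, i);
                Kind kind = KEYWORDS.contains(word) ? Kind.KEYWORD : Kind.IDENTIFIER;
                tokens.add(new Token(kind, word));
            } else {
                tokens.add(new Token(Kind.SYMBOL, String.valueOf(c)));
                i++;                                   // any other single character
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("while (myInt < 23) employeeName = \"mike\";"));
    }
}
```

Note that the scanner only classifies tokens. It has no idea whether they appear in a sensible order; that judgment belongs to grammatical analysis.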

Grammatical analysis involves recognizing the syntactical structure of a language: how words are combined to form larger structures, and how those structures form even larger ones. For example, in parsing a computer program, grammatical analysis would determine how tokens form expressions, how expressions and keywords form statements, how statements form blocks, how blocks form modules and, lastly, how modules form programs. Although parsing is often used loosely as an umbrella term for both lexical analysis and grammatical analysis, it is usually implemented by two separate pieces of software: a scanner and a parser. The scanner extracts tokens from the stream (lexical analysis) and passes them to the parser for grammatical analysis. In the XML world, an XML processor parses the XML on behalf of the application that uses it. An XML processor is just another name for an XML parser.
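The division of labor between scanner and parser can be sketched with a small arithmetic language. In the example below, scan performs the lexical analysis and the remaining methods perform the grammatical analysis by recursive descent, one method per grammar rule. The class name ExpressionParser, the three-rule grammar, and the prefix-notation output are assumptions chosen to keep the example short; recursive descent is only one of several parsing strategies.

```java
import java.util.ArrayList;
import java.util.List;

public class ExpressionParser {
    // Scanner (lexical analysis): numbers and single-character operators.
    public static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(input.substring(start, i));
            } else if ("+-*/()".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    private final List<String> tokens;
    private int pos = 0;

    private ExpressionParser(List<String> tokens) { this.tokens = tokens; }

    // Parser (grammatical analysis): returns the structure in prefix notation.
    public static String parse(String input) {
        ExpressionParser p = new ExpressionParser(scan(input));
        String tree = p.expression();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.peek());
        }
        return tree;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    // expression := term (('+' | '-') term)*
    private String expression() {
        String left = term();
        while ("+".equals(peek()) || "-".equals(peek())) {
            String op = tokens.get(pos++);
            String right = term();
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    // term := factor (('*' | '/') factor)*
    private String term() {
        String left = factor();
        while ("*".equals(peek()) || "/".equals(peek())) {
            String op = tokens.get(pos++);
            String right = factor();
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    // factor := NUMBER | '(' expression ')'
    private String factor() {
        String t = peek();
        if (t == null) throw new IllegalArgumentException("Unexpected end of input");
        pos++;
        if (t.equals("(")) {
            String inner = expression();
            if (!")".equals(peek())) throw new IllegalArgumentException("Expected ')'");
            pos++;
            return inner;
        }
        if (Character.isDigit(t.charAt(0))) return t;
        throw new IllegalArgumentException("Unexpected token: " + t);
    }

    public static void main(String[] args) {
        System.out.println(parse("1 + 2 * 3"));   // prints (+ 1 (* 2 3))
    }
}
```

The scanner sees "1 + 2 * 3" as a flat list of five tokens. Only the parser, by applying the grammar rules, discovers that the multiplication binds more tightly than the addition.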
