13.7 Chapter Summary
The web can be viewed as a massive collection of documents that include text as well as links to other web documents and to multimedia content such as sound, images, and video. The data can be structured, unstructured, or semistructured. Semistructured data models have evolved as a way of describing documents with structures that differ from the well-defined structure of relational databases.
A semistructured data model contains a collection of nodes, each containing data, possibly with different schemas. Each document can be represented as a tree, having a single root node and a sequence of child nodes. Each child node can itself have child nodes, which in turn can have child nodes, and so on. The nodes can represent either complex objects or atomic values. An edge can represent either the relationship between an object and its subobject, or the relationship between an object and its value. Leaf nodes represent values. JavaScript Object Notation (JSON) and Extensible Markup Language (XML) are different types of models for describing semistructured data. Using JSON or XML, it is possible to define the structure of heterogeneous data and support interchange of data between applications or database systems.
JSON is a text-based format that leverages data structures and basic types found in most programming languages: objects, arrays, strings, and numbers. Objects and arrays can be nested in JSON objects (and arrays), and objects in the same document (or even the same array) can have different formats. Many web browsers (e.g., Chrome, Safari, Edge, and Firefox) contain JSON parsers; JSON can be parsed by many programming languages as well. JSON is commonly used in web applications because it can be used to exchange complex data in a single transfer. Many relational database systems support JSON as a type or can store JSON in text fields; however, processing this data can be inefficient because relational databases are typically not optimized for semistructured data. There is no standard query language associated with JSON; however, it is the data model used in document databases such as MongoDB (discussed in Chapter 14), which supports sophisticated querying.
An XML document contains one or more elements, each of which has a start tag showing its name, some character data, and an end tag. Elements can be subelements of other elements. Subelements must be properly nested. Elements can have attributes whose names and values are shown inside the element’s start tag. Attributes occur only once within each element, whereas elements can occur any number of times. A document can contain entity references to refer to external files, common text, Unicode characters, or some reserved symbols. An XML document is said to be well formed if it obeys the rules of XML.
Users can define their own markup language in XML by writing a Document Type Definition (DTD) or by writing an XML Schema Definition (XSD). An instance document that obeys the rules of its associated DTD is said to be type-valid. An XSD is a more powerful way to describe the structure of documents than a DTD. If an instance document conforms to an XSD, it is called schema-valid. An XSD lists elements and attributes. Elements may be complex (i.e., have subelements) or simple. Attributes or elements can be used to store data values. Attributes can be used for simple values that are not repeated, whereas elements can be complex or simple and can occur multiple times.
XQuery is the current World Wide Web Consortium (W3C) standard query language for XML data. It uses the abstract logical structure of a document as it is encoded in XML, possibly in a DTD or an XSD. Queries make use of the notion of a path expression, which comes from an earlier language, XPath. A path expression usually consists of the name of the document and a specification of the elements to be retrieved, using a path relationship. We can add conditions to any nodes in a path expression. The path expression is normally evaluated by reading forward in the document until a node of the specified type and condition is encountered. The user can specify an axis or direction for searching. XQuery uses a general form called a FLWOR expression, for FOR, LET, WHERE, ORDER BY, and RETURN clauses. The expressions allow for binding of variables to results and also for iterating through the nodes of a document. They also allow joins to be performed and data to be restructured. XQuery provides many predefined functions, including count, avg, max, min, and sum, which can be used in FLWOR expressions.
Many vendors of relational database management systems (RDBMS) have extended their native data types to allow storage of XML and JSON documents. It is also possible to use SQL with XPath expressions to retrieve values from the database. Existing heterogeneous databases can be queried using standard languages such as SQL, and the query results can be placed into an XML instance document. The query language has to have facilities that can tag and structure relational data into XML format.
Oracle databases can store JSON data using a column of VARCHAR2 type for small documents, or BLOB type for large documents. The system can validate input data if the user specifies a check constraint using IS_JSON. Oracle versions after 21c have a JSON datatype that can be used for a column. Data can be entered using standard INSERT statements, or from a file by using the SQL*Loader facility. Normal SQL queries using dot notation can be used to retrieve JSON data. There are also special JSON functions, JSON_VALUE and JSON_QUERY, that can retrieve data and format it. The JSON_TABLE function can be used to convert JSON data to relational data. Updates can be made to a document by using the standard SQL UPDATE function, replacing the entire document. In version 19, the JSON_MERGEPATCH operation can be used to update parts of a document, and in version 21, JSON_TRANSFORM can be used for updates.
Oracle includes an XMLTYPE data type, which can be used as an attribute type or an object table type. Using Oracle XML DB, a standard component of the Oracle database management system (DBMS), users can perform many operations on XML data and convert between XML and relational formats. XML data can be entered from SQL*Plus or from a file. The data can be queried using XQuery in SQL*Plus, PL/SQL or Java. The main functions for queries are XMLQuery, XMLCast, and XMLExists. XMLTable can be used to convert XML into relational data, but the user must identify columns for the mapping. Updates can be performed using the standard SQL UPDATE function, if the entire document is being replaced. Updates of parts of a document, and inserting or deleting nodes, require the use of XMLQuery. If a schema is registered for an XML document, Oracle can shred it and store the contents in a relational table. There are XML Developer Kits (XDKs) for PL/SQL, Java, C, and C++, which allow XML documents to be parsed and then processed in the host language.