My dom.d module is an HTML and XML parser that can understand much of the tag soup found on the Web. Once it parses a document, it provides a JavaScript-style DOM API for easy inspection and manipulation of the document tree.
Here, we'll use the library to extract some meta-information and text from an HTML page, then modify the page and save a local copy. Along the way, we'll explore the library's features and implementation, which uses several of the techniques we've learned in this book.
Download dom.d and characterencodings.d from my GitHub repository. The module has no other dependencies, so you do not need to download any additional files or libraries.
Let's execute the following steps to parse and modify an HTML page:
1. Import arsd.dom.
2. Create an instance of the Document class.
3. Parse the HTML with the parseGarbage method, or if you want strict checks on case and well-formedness, use parseStrict. It will throw exceptions when it encounters bad syntax.
4. Retrieve the meta-information with the title property and the getMeta("author") method.
5. Get a paragraph's text with the querySelector or requireSelector methods and the innerText property. For example, document.requireSelector("p").innerText; makes use of the requireSelector method and the innerText property.
6. Modify all links by adding a source parameter with document["a[href]"].setValue("source", "your-site");.
7. Change some HTML with requireElementById and innerHTML.
8. Show the new document with the toString method.
9. Compile the program with dmd yourfile.d dom.d characterencodings.d.
The code is as follows:
```d
import arsd.dom;

void main() {
    auto document = new Document();

    // The example document will be defined inline here
    // We could also load the string from a file with
    // std.file.readText or the web with std.net.curl.get
    document.parseGarbage(`<html><head>
<meta name="author" content="Adam D. Ruppe">
<title>Test Document</title>
</head>
<body>
<p>This is the first paragraph of our <a href="test.html">test document</a>.
<p>This second paragraph also has a <a href="test2.html">link</a>.
<p id="custom-paragraph">Old text</p>
</body>
</html>`);

    import std.stdio;

    // retrieve and print some meta information
    writeln(document.title);
    writeln(document.getMeta("author"));

    // show a paragraph's text
    writeln(document.requireSelector("p").innerText);

    // modify all links
    document["a[href]"].setValue("source", "your-site");

    // change some HTML
    document.requireElementById("custom-paragraph").innerHTML = "New <b>HTML</b>!";

    // show the new document
    writeln(document.toString());
}
```
Running the program will print the following output:
```
Test Document
Adam D. Ruppe
This is the first paragraph of our test document.
<!DOCTYPE html>
<html><head>
<meta content="Adam D. Ruppe" name="author" />
<title>Test Document</title>
</head>
<body>
<p>This is the first paragraph of our <a href="test.html?source=your-site">test document</a>.
</p><p>This second paragraph also has a <a href="test2.html?source=your-site">link</a>.
</p><p id="custom-paragraph">New <b>HTML</b>!</p></body>
</html>
```
The dom.d module is centered around two primary classes: Element and Document. The Document class includes the HTML parser and methods to set and get meta-information in the formats typically used on websites. The Element class represents one node in the document, including its child nodes and attributes.
The parser's implementation doesn't use any of D's special features except array slicing, but it did need some optimization work. The entity decoder's first draft naively built a new array on every call, which ended up being a performance problem. For the second draft, I rewrote it to scan the string for the & character ahead of time, before performing any copying or decoding. If the & character wasn't found, it simply returned the original slice. This slice is propagated everywhere it is used, never copied and never reallocated. If one was found, it performed a single copy and modification for that node.
Replacing the original conservative copying implementation with the new slicing and copy-on-write for both the entity encoder and decoder resulted in a major speed improvement in the parser and serialization methods.
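The scan-first idea can be sketched in a few lines. This is only an illustration of the technique, not dom.d's actual decoder: the decodeEntities name is hypothetical, it knows just four named entities, and for brevity its slow path chains several replace calls rather than doing the single copy described above.

```d
import std.array : replace;
import std.string : indexOf;

// Hypothetical helper, not dom.d's real decoder.
string decodeEntities(string s) {
    // Scan first. With no '&' present there is nothing to decode,
    // so hand back the caller's slice without allocating.
    if (s.indexOf('&') == -1)
        return s;

    // Slow path: build decoded copies. "&amp;" goes last so that
    // "&amp;lt;" decodes to "&lt;" rather than "<".
    return s.replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&");
}

void main() {
    string plain = "no entities here";
    string encoded = "1 &lt; 2 &amp;&amp; 3 &gt; 2";

    // `is` compares pointer and length: the same slice came back.
    assert(decodeEntities(plain) is plain);
    assert(decodeEntities(encoded) == "1 < 2 && 3 > 2");
}
```

Because D strings are immutable slices, returning the input unchanged is safe: nobody can modify the shared memory behind the caller's back.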
In the example, we used the parseGarbage method. The Document class' parser includes two branches in every error condition: one that simply throws an exception and one that attempts recovery. The parseGarbage method opts for recovery in all conditions and has been tested against hundreds of real-world websites. It can recover from unclosed tags, mismatched tags, improper paragraph nesting, malformed attributes, and mislabeled character sets.
To handle character sets, the characterencodings.d implementation uses the Phobos function std.utf.validate to check UTF-8 correctness. If this fails, dom.d attempts to find a charset meta tag or XML prologue and uses that string to ask characterencodings.d to translate the data from that character set to UTF-8 for further processing in D. It performs the translation with hardcoded translation tables. If it cannot determine the character set, parseGarbage will assume it is Windows-1252 because that is the most common encoding used on unlabeled websites in my experience.
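The validate-then-fall-back flow might look roughly like the following sketch. It is not the code in characterencodings.d: the toUtf8 name is hypothetical, the charset lookup step is skipped entirely, and the conversion is assumed to go through Phobos' std.encoding instead of the module's own translation tables.

```d
import std.encoding : transcode, Windows1252String;
import std.utf : UTFException, validate;

// Hypothetical helper: returns the input as UTF-8, treating anything
// that fails validation as Windows-1252. A fuller version would look
// for a charset meta tag or XML prologue before falling back.
string toUtf8(immutable(ubyte)[] data) {
    auto text = cast(string) data;
    try {
        validate(text);   // throws UTFException on malformed UTF-8
        return text;      // already valid: reuse the same memory
    } catch (UTFException e) {
        string converted;
        transcode(cast(Windows1252String) data, converted);
        return converted;
    }
}

void main() {
    // "café" with the é stored as the single Windows-1252 byte 0xE9,
    // which is not a complete UTF-8 sequence.
    immutable(ubyte)[] legacy = [0x63, 0x61, 0x66, 0xE9];
    assert(toUtf8(legacy) == "caf\u00E9");

    // Pure ASCII is valid UTF-8 and passes through unchanged.
    immutable(ubyte)[] ascii = [0x6F, 0x6B];
    assert(toUtf8(ascii) == "ok");
}
```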
The Element class has a number of searching and mutation methods, primarily based on the JavaScript DOM. Search methods include getElementById, getElementsByTagName, and querySelector, all inspired by and substantially similar to, but not exactly identical to, their JavaScript counterparts. The biggest difference between the getElementsByTagName function of dom.d and JavaScript's function with the same name is that dom.d returns a simple array of elements, whereas JavaScript returns a live node list that is updated as the tree is mutated.
In D, a live node list would be best represented by a range. However, JavaScript's node list violates one of the D range rules: it offers random access that runs in linear time instead of constant time, as required by D's random access ranges. Nevertheless, a live list could be represented by a forward range in D. The dom.d module doesn't do this simply because I didn't consider that at the time, and now I am stuck with it for backward compatibility.
However, the dom.d module uses a lazy range called ElementStream internally for all searching. This range can be retrieved with the tree property on Element. The ElementStream class uses an internal stack to implement an input range over the recursive DOM tree structure, just like the one we wrote in Chapter 3, Ranges.
As ElementStream implements the input range protocol, it can be passed to any std.algorithm functions just like any other range, including filter, map, and others.
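To make the stack-based traversal concrete, here is a minimal sketch of the same idea. The Node class and TreeRange struct are hypothetical stand-ins, not dom.d's Element and ElementStream; they only show how an explicit stack turns a recursive tree into an input range that std.algorithm can consume.

```d
import std.algorithm : equal, filter, map;
import std.range : walkLength;

// Hypothetical minimal node type standing in for an element.
class Node {
    string tag;
    Node[] children;

    this(string tag, Node[] children = null) {
        this.tag = tag;
        this.children = children;
    }
}

// Input range that walks a tree depth-first, keeping pending nodes
// on an explicit stack instead of recursing.
struct TreeRange {
    private Node[] stack;

    this(Node root) {
        if (root !is null)
            stack ~= root;
    }

    bool empty() const { return stack.length == 0; }
    Node front() { return stack[$ - 1]; }

    void popFront() {
        auto current = stack[$ - 1];
        stack = stack[0 .. $ - 1];
        // Push children in reverse so the first child is visited next.
        foreach_reverse (child; current.children)
            stack ~= child;
    }
}

void main() {
    auto tree = new Node("html", [
        new Node("head", [new Node("title")]),
        new Node("body", [new Node("p"), new Node("div", [new Node("p")])]),
    ]);

    // The range plugs straight into std.algorithm.
    auto tags = TreeRange(tree).map!(n => n.tag);
    assert(tags.equal(["html", "head", "title", "body", "p", "div", "p"]));

    auto paragraphs = TreeRange(tree).filter!(n => n.tag == "p");
    assert(paragraphs.walkLength == 2);
}
```

Nothing is visited until a consumer asks for the next element, so a search that stops at the first match never touches the rest of the tree.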
The querySelector method and its partner method, requireSelector, parse a CSS selector and retrieve all nodes that match the pattern. The selector syntax is based on that used in JavaScript's querySelector function, CSS stylesheets, and the popular jQuery JavaScript library.
The querySelector method returns null if no matching elements are found. The requireSelector method will instead throw an exception, ensuring that it never returns null.
The dom.d module also provides querySelectorAll, which returns an array of elements that match the selector, whereas querySelector only returns the first match. The Document class' opIndex, which we used in step 6 of the example, calls querySelectorAll to populate an ElementCollection object.
The ElementCollection object's implementation uses opDispatch and string mixins to forward subsequent method calls to each element in the collection at once with minimal boilerplate. This allows quick and easy manipulation of a group of elements through their methods. Each wrapped method returns the whole collection, allowing chained calls, as is often seen in jQuery code.
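The forwarding pattern can be sketched as follows. Item and ItemCollection are hypothetical types invented for this illustration, not dom.d's Element and ElementCollection; the point is only how opDispatch plus a string mixin fans one call out to every member and returns the collection for chaining.

```d
// Hypothetical element type with a couple of mutating methods.
class Item {
    string[string] attributes;
    string text;

    void setAttribute(string name, string value) { attributes[name] = value; }
    void setText(string value) { text = value; }
}

// Wraps several items and forwards any method call to each of them.
struct ItemCollection {
    Item[] items;

    // opDispatch is instantiated for each member name the struct does
    // not define itself; the string mixin turns that name into a call.
    ItemCollection opDispatch(string name, Args...)(Args args) {
        foreach (item; items)
            mixin("item." ~ name ~ "(args);");
        return this; // returning the collection enables chaining
    }
}

void main() {
    auto group = ItemCollection([new Item, new Item]);

    // One chained expression updates every item in the group.
    group.setAttribute("class", "highlight").setText("updated");

    foreach (item; group.items) {
        assert(item.attributes["class"] == "highlight");
        assert(item.text == "updated");
    }
}
```

The wrapper needs no per-method code, so any method later added to Item becomes available on the collection automatically.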
Lastly, the innerHTML property and the toString method are used to set and retrieve content as HTML strings. The innerHTML property always uses the nonstrict parsing options, similar to parseGarbage. Both implementations use an Appender argument to minimize intermediate allocations as a performance optimization.
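A rough sketch of serializing through a shared Appender is shown below. HtmlNode, writeHtml, and toHtml are hypothetical names for this illustration and do not mirror dom.d's internals; the sketch also skips attributes and entity encoding to stay short.

```d
import std.array : Appender, appender;

// Hypothetical node type: an empty tag marks a text node.
class HtmlNode {
    string tag;
    string text;
    HtmlNode[] children;

    this(string tag, HtmlNode[] children = null) {
        this.tag = tag;
        this.children = children;
    }

    static HtmlNode textNode(string text) {
        auto node = new HtmlNode("");
        node.text = text;
        return node;
    }
}

// Every node writes into the same caller-supplied buffer, so the
// recursion builds no temporary string per subtree.
void writeHtml(HtmlNode node, ref Appender!string sink) {
    if (node.tag.length == 0) {
        sink.put(node.text); // no escaping in this sketch
        return;
    }
    sink.put("<");
    sink.put(node.tag);
    sink.put(">");
    foreach (child; node.children)
        writeHtml(child, sink);
    sink.put("</");
    sink.put(node.tag);
    sink.put(">");
}

string toHtml(HtmlNode root) {
    auto sink = appender!string();
    writeHtml(root, sink);
    return sink.data;
}

void main() {
    auto paragraph = new HtmlNode("p", [
        HtmlNode.textNode("New "),
        new HtmlNode("b", [HtmlNode.textNode("HTML")]),
        HtmlNode.textNode("!"),
    ]);
    assert(toHtml(paragraph) == "<p>New <b>HTML</b>!</p>");
}
```

The alternative, having each node return its own string and concatenating the results, would allocate once per node; passing the buffer down keeps the allocation count tied to buffer growth instead.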
The dom.d module also makes heavy use of opDispatch to access attributes, similar to JavaScript. The expressions element.attrs.attribute_name, element.style.cssRule, and element.dataset.someValue all utilize helper objects with opDispatch to generate properties that give easy access to attributes. All three collections use the technique we learned in Chapter 9, Code Generation, with different string translation rules. The attrs collection provides direct access to the underlying associative array. The style collection translates from CamelCase to dash-separated names, parses the existing attribute to provide access to all existing rules, and recombines them to provide usable HTML.
The style collection also uses alias this and opAssign to enable implicit conversion to and from an attribute string, something JavaScript cannot do! The dataset collection performs CamelCase to dash-separated conversion and prefixes the attribute with data- for compatibility with HTML5.
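As a sketch of how such a helper object can work, the following code implements a dataset-like property. Tag, DataSet, and camelToDash are hypothetical names for this illustration, and the code is not taken from dom.d; it assumes a single opDispatch with an optional argument serves as both getter and setter.

```d
import std.ascii : isUpper, toLower;

// Hypothetical name translation: "someValue" becomes "some-value".
string camelToDash(string name) {
    string result;
    foreach (char c; name) {
        if (isUpper(c)) {
            result ~= '-';
            result ~= toLower(c);
        } else {
            result ~= c;
        }
    }
    return result;
}

// Hypothetical element type holding plain string attributes.
class Tag {
    string[string] attributes;

    DataSet dataset() { return DataSet(this); }
}

// Helper object: each property name maps to a data- attribute.
struct DataSet {
    private Tag owner;

    // With an argument this acts as a setter, without one as a getter.
    string opDispatch(string name)(string value = null) {
        // The attribute name is computed once, at compile time.
        enum key = "data-" ~ camelToDash(name);
        if (value !is null)
            owner.attributes[key] = value;
        if (auto existing = key in owner.attributes)
            return *existing;
        return null;
    }
}

void main() {
    auto tag = new Tag;
    tag.dataset.someValue = "42";

    assert(tag.attributes["data-some-value"] == "42");
    assert(tag.dataset.someValue == "42");
    assert(tag.dataset.missing is null);
}
```

Since the property name is a template argument, the translation to data-some-value costs nothing at run time.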