Parsing and modifying an HTML page with dom.d

My dom.d module is an HTML and XML parser that can understand much of the tag soup found on the Web. Once it parses a document, it provides a JavaScript-style DOM API for easy inspection and manipulation of the document tree.

Here, we'll use the library to extract some meta-information and text from an HTML page, then modify the page and print the resulting HTML. Along the way, we'll explore the library's features and its implementation, which uses several of the techniques we've learned in this book.

Getting ready

Download dom.d and characterencodings.d from my GitHub repository. The module has no other dependencies, so you do not need to download any additional files or libraries.

How to do it…

Let's execute the following steps to parse and modify an HTML page:

  1. Import arsd.dom.
  2. Create an instance of the Document class.
  3. Pass an unvalidated HTML string to the parseGarbage method. If you want strict checks on case and well-formedness, use parseStrict instead; it throws an exception when it encounters bad syntax.
  4. Get the document title and author meta-information with the title property and the getMeta("author") method.
  5. Extract the first paragraph's text with the querySelector or requireSelector methods and the innerText property, for example, document.requireSelector("p").innerText.
  6. Modify all links to add a source parameter with document["a[href]"].setValue("source", "your-site");.
  7. Replace the inner HTML text of a specific element with requireElementById and innerHTML.
  8. Get the new HTML as a string with the toString method.
  9. Compile the program with dmd yourfile.d dom.d characterencodings.d.

The code is as follows:

import arsd.dom;

void main() {
  auto document = new Document();

  // The example document will be defined inline here
  // We could also load the string from a file with
  // std.file.readText or the web with std.net.curl.get
  document.parseGarbage(`<html><head>
    <meta name="author" content="Adam D. Ruppe">
    <title>Test Document</title>
  </head>
  <body>
    <p>This is the first paragraph of our <a href="test.html">test document</a>.
    <p>This second paragraph also has a <a href="test2.html">link</a>.
    <p id="custom-paragraph">Old text</p>
  </body>
  </html>`);
  import std.stdio;
  // retrieve and print some meta information
  writeln(document.title);
  writeln(document.getMeta("author"));
  // show a paragraph's text
  writeln(document.requireSelector("p").innerText);
  // modify all links
  document["a[href]"].setValue("source", "your-site");
  // change some HTML
  document.requireElementById("custom-paragraph").innerHTML = "New <b>HTML</b>!";
  // show the new document
  writeln(document.toString());
}

Running the program will print the following output:

Test Document
Adam D. Ruppe
This is the first paragraph of our test document.
<!DOCTYPE html>
<html><head>
                <meta content="Adam D. Ruppe" name="author" />
                <title>Test Document</title>
        </head>
        <body>
          <p>This is the first paragraph of our <a href="test.html?source=your-site">test document</a>.
          </p><p>This second paragraph also has a <a href="test2.html?source=your-site">link</a>.

        </p><p id="custom-paragraph">New <b>HTML</b>!</p></body>
        </html>

Tip

The document's parse functions take an HTML string, and not a filename or URL. If you get exceptions about missing input, make sure that you are passing the document's contents rather than the name or address of the place it is stored.

How it works…

The dom.d module is centered around two primary classes: Element and Document. The Document class includes the HTML parser and methods to set and get meta-information in the formats typically used on websites. The Element class represents one node in the document including its child nodes and attributes.

The parser's implementation doesn't use any of D's special features except array slicing, but did need some optimization work. The entity decoder's first draft naively built a new array on every call, which ended up being a performance problem. For the second draft, I rewrote it to scan the string for the & character before performing any copying or decoding. If no & character was found, it simply returned the original slice. This slice is propagated everywhere it is used, never copied and never reallocated. If one was found, it performed a single copy and modification for that node.

Replacing the original conservative copying implementation with the new slicing and copy-on-write for both the entity encoder and decoder resulted in a major speed improvement in the parser and serialization methods.
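The scan-first, copy-only-when-needed idea can be sketched as follows. This is a minimal illustration and not dom.d's actual decoder: the function name decodeEntities is hypothetical, and only four named entities are handled.

```d
import std.array : appender;
import std.string : indexOf;

// Hypothetical, simplified decoder (not dom.d's code).
// It knows only &amp; &lt; &gt; and &quot;.
string decodeEntities(string s) {
    // Fast path: without an '&' there is nothing to decode,
    // so hand back the very same slice with no allocation.
    auto first = s.indexOf('&');
    if (first == -1)
        return s;

    // Slow path: one copy, decoding as we go.
    auto result = appender!string();
    result.reserve(s.length);
    result.put(s[0 .. first]);

    size_t i = first;
    while (i < s.length) {
        if (s[i] == '&') {
            auto end = s[i .. $].indexOf(';');
            if (end != -1) {
                string replacement = null;
                switch (s[i + 1 .. i + end]) {
                    case "amp":  replacement = "&";  break;
                    case "lt":   replacement = "<";  break;
                    case "gt":   replacement = ">";  break;
                    case "quot": replacement = "\""; break;
                    default: break;
                }
                if (replacement !is null) {
                    result.put(replacement);
                    i += end + 1;
                    continue;
                }
            }
        }
        // Not a recognized entity: copy the character unchanged.
        result.put(s[i]);
        i++;
    }
    return result.data;
}

unittest {
    string plain = "no entities here";
    // `is` compares pointer and length: the input slice itself came back.
    assert(decodeEntities(plain) is plain);
    assert(decodeEntities("a &lt; b &amp;&amp; c") == "a < b && c");
}
```

Because D strings are immutable slices, returning the input unchanged is safe: no caller can modify the shared data behind anyone's back.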

In the example, we used the parseGarbage method. The Document class' parser includes two branches in every error condition: one that simply throws an exception and one that attempts recovery. The parseGarbage method opts for recovery in all conditions and has been tested against hundreds of real-world websites. It can recover from unclosed tags, mismatched tags, improper paragraph nesting, malformed attributes, and mislabeled character sets.

To handle character sets, the characterencodings.d implementation uses the Phobos function std.utf.validate to check UTF-8 correctness. If this fails, dom.d attempts to find a charset meta tag or XML prologue and uses that string to ask characterencodings.d to translate the data from that character set to UTF-8 for further processing in D. It performs the translation with hardcoded translation tables. If it cannot determine the character set, parseGarbage will assume it is Windows-1252 because that is the most common encoding used on unlabeled websites in my experience.
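The validate-then-fall-back flow might look like the following sketch. It is an assumption-laden illustration rather than the code in characterencodings.d: the names toUtf8 and windows1252High are hypothetical, the lookup for a charset meta tag or XML prologue is skipped entirely, and unassigned Windows-1252 bytes are mapped to U+FFFD as one possible policy.

```d
import std.string : representation;
import std.utf : UTFException, validate;

// Code points for Windows-1252 bytes 0x80 .. 0x9F.
// The five unassigned bytes are mapped to U+FFFD here (an assumption).
immutable ushort[32] windows1252High = [
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
];

// Hypothetical helper: returns the data as a UTF-8 string.
string toUtf8(const(ubyte)[] data) {
    auto text = cast(const(char)[]) data;
    try {
        validate(text);       // throws UTFException on malformed UTF-8
        return text.idup;     // already valid, keep the bytes as they are
    } catch (UTFException) {
        // Not UTF-8: fall through and treat it as Windows-1252.
    }

    string result;
    foreach (b; data) {
        dchar c = b;          // 0x00-0x7F and 0xA0-0xFF map directly
        if (b >= 0x80 && b <= 0x9F)
            c = cast(dchar) windows1252High[b - 0x80];
        result ~= c;          // appending a dchar encodes it as UTF-8
    }
    return result;
}

unittest {
    // Valid UTF-8 passes through untouched.
    assert(toUtf8("héllo".representation) == "héllo");
    // 0x93 and 0x94 are curly quotes in Windows-1252 but invalid UTF-8.
    ubyte[] legacy = [0x93, 0x68, 0x69, 0x94, 0x20, 0xE9];
    assert(toUtf8(legacy) == "\u201Chi\u201D \u00E9");
}
```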

The Element class has a number of searching and mutation methods, primarily based on the JavaScript DOM. Search methods include getElementById, getElementsByTagName, and querySelector, all inspired by and substantially similar to—but not exactly identical to—JavaScript. The biggest difference between the getElementsByTagName function of dom.d and JavaScript's function with the same name is that dom.d returns a simple array of elements, whereas JavaScript returns a live node list that is updated as the tree is mutated.

In D, a live node list would be best represented by a range. However, JavaScript's node list violates one of the D range rules: it offers random access that runs in linear time instead of constant time, as required by D's random access ranges. Nevertheless, a live list could be represented by a forward range in D. The dom.d module doesn't do this simply because I didn't consider that at the time, and now I am stuck with it for backward compatibility.

However, the dom.d module uses a lazy range called ElementStream internally for all searching. This range can be retrieved with the tree property on Element. The ElementStream class uses an internal stack to implement an input range over the recursive DOM tree structure, just like we wrote in Chapter 3, Ranges.

As ElementStream implements the input range protocol, it can be passed to any std.algorithm functions just like any other range, including filter, map, and others.
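To illustrate the stack-based approach, here is a minimal sketch of an input range over a tree. The Node class and TreeStream struct are hypothetical stand-ins written for this example; they are not dom.d's Element or ElementStream.

```d
import std.algorithm : filter, map;
import std.array : array;

// Hypothetical stand-in for an element: a tag name and child nodes.
class Node {
    string tagName;
    Node[] children;

    this(string tagName, Node[] children = null) {
        this.tagName = tagName;
        this.children = children;
    }
}

// Input range that visits the tree in document order.
// An explicit stack replaces the recursion.
struct TreeStream {
    private Node[] stack;

    this(Node root) {
        if (root !is null)
            stack ~= root;
    }

    @property bool empty() const { return stack.length == 0; }
    @property Node front() { return stack[$ - 1]; }

    void popFront() {
        auto current = stack[$ - 1];
        stack = stack[0 .. $ - 1];
        // Push children in reverse so the first child is visited next.
        foreach_reverse (child; current.children)
            stack ~= child;
    }
}

unittest {
    auto tree = new Node("html", [
        new Node("head", [new Node("title")]),
        new Node("body", [new Node("p"), new Node("p")]),
    ]);

    // Any std.algorithm function accepts the range.
    auto names = TreeStream(tree).map!(n => n.tagName).array;
    assert(names == ["html", "head", "title", "body", "p", "p"]);

    auto paragraphs = TreeStream(tree).filter!(n => n.tagName == "p").array;
    assert(paragraphs.length == 2);
}
```

Nothing is visited until the consumer asks for the next element, so a search that stops at the first match never walks the rest of the tree.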

The querySelector method and its partner method, requireSelector, parse a CSS selector and retrieve the first node that matches the pattern. The selector syntax is based on that used in JavaScript's querySelector function, CSS stylesheets, and the popular jQuery JavaScript library.

The querySelector method returns null if no matching elements are found. The requireSelector method will instead throw an exception, ensuring that it never returns null.
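The difference between the two contracts can be shown with a small sketch. The names TreeNode, findFirst, and requireFirst are hypothetical, and the match is on a tag name only rather than a full CSS selector; dom.d's own implementation is not reproduced here.

```d
import std.exception : assertThrown, enforce;

// Hypothetical stand-in for an element.
class TreeNode {
    string tagName;
    TreeNode[] children;

    this(string tagName, TreeNode[] children = null) {
        this.tagName = tagName;
        this.children = children;
    }
}

// Returns the first match in document order, or null if there is none.
// The caller must check the result before using it.
TreeNode findFirst(TreeNode root, string tagName) {
    if (root is null)
        return null;
    if (root.tagName == tagName)
        return root;
    foreach (child; root.children)
        if (auto found = findFirst(child, tagName))
            return found;
    return null;
}

// Never returns null: a missing element becomes an exception instead.
TreeNode requireFirst(TreeNode root, string tagName) {
    auto found = findFirst(root, tagName);
    enforce(found !is null, "no element matches " ~ tagName);
    return found;
}

unittest {
    auto tree = new TreeNode("body", [new TreeNode("p")]);
    assert(findFirst(tree, "table") is null);
    assert(requireFirst(tree, "p").tagName == "p");
    assertThrown(requireFirst(tree, "table"));
}
```

The throwing variant suits code that cannot proceed without the element, since the failure surfaces at the lookup rather than as a null dereference later.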

The dom.d module also provides querySelectorAll, which returns an array of elements that match the selector, whereas querySelector only returns the first match. The Document class' opIndex, which we used in step 6 of the example, calls querySelectorAll to populate an ElementCollection object.

The ElementCollection object's implementation uses opDispatch and string mixins to forward subsequent method calls to each element in the collection at once with minimal boilerplate. This allows quick and easy manipulation of a group of elements through their methods. Each wrapped method returns the whole collection, allowing chained calls, as is often seen in jQuery code.
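A minimal sketch of this forwarding technique follows. Item and Collection are hypothetical types written for the example, not dom.d's Element and ElementCollection, and the real implementation may differ in detail.

```d
// Hypothetical element with two mutating methods.
class Item {
    string[string] attributes;

    void setAttribute(string name, string value) { attributes[name] = value; }
    void removeAttribute(string name) { attributes.remove(name); }
}

struct Collection {
    Item[] items;

    // Any method the struct itself lacks is forwarded to every item.
    // The string mixin builds the call from the requested name, and
    // returning the collection allows jQuery-style chaining.
    Collection opDispatch(string name, Args...)(Args args) {
        foreach (item; items)
            mixin("item." ~ name ~ "(args);");
        return this;
    }
}

unittest {
    auto c = Collection([new Item, new Item]);

    c.setAttribute("class", "note")
     .setAttribute("title", "hello")
     .removeAttribute("class");

    foreach (item; c.items) {
        assert("class" !in item.attributes);
        assert(item.attributes["title"] == "hello");
    }
}
```

One template covers every method of the wrapped class, so methods added to it later are forwarded without touching the collection.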

Lastly, the innerHTML property and the toString method are used to manipulate and retrieve the content HTML strings. The innerHTML property always uses the nonstrict parsing options, similar to parseGarbage. Both implementations use an Appender argument to minimize intermediate allocations as a performance optimization.

There's more…

The dom.d module also makes heavy use of opDispatch to access attributes, similar to JavaScript. The element.attrs.attribute_name, element.style.cssRule, and element.dataset.someValue properties all use helper objects with opDispatch to generate properties that give easy access to attributes. All three collections use the technique we learned in Chapter 9, Code Generation, with different string translation rules. The attrs collection provides direct access to the underlying associative array. The style collection translates from camelCase to dash-separated names, parses the existing attribute to provide access to all existing rules, and recombines them to produce usable HTML.

The style collection also uses alias this and opAssign to enable implicit conversion to and from an attribute string, which is something JavaScript cannot do! The dataset collection performs camelCase to dash-separated conversion and prefixes the attribute name with data- for compatibility with HTML5.
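The dataset translation can be sketched like this. Tag, DataSet, and camelToDash are hypothetical names for illustration; the single opDispatch with a defaulted argument, serving as both getter and setter, is one way to write it and is an assumption about style rather than a copy of dom.d.

```d
import std.ascii : isUpper, toLower;

// Hypothetical helper: "someValue" becomes "some-value".
string camelToDash(string name) {
    string result;
    foreach (char c; name) {
        if (isUpper(c)) {
            result ~= '-';
            result ~= toLower(c);
        } else {
            result ~= c;
        }
    }
    return result;
}

// Hypothetical stand-in for an element with an attribute table.
class Tag {
    string[string] attributes;

    @property DataSet dataset() { return DataSet(this); }
}

struct DataSet {
    private Tag tag;

    // Reading calls this with no argument; `x.dataset.name = "v"`
    // is rewritten by the compiler into a call with one argument.
    string opDispatch(string name)(string value = null) {
        // The attribute name is computed at compile time.
        enum key = "data-" ~ camelToDash(name);
        if (value !is null)
            tag.attributes[key] = value;
        auto found = key in tag.attributes;
        return found ? *found : null;
    }
}

unittest {
    auto tag = new Tag;
    tag.dataset.someValue = "42";
    assert(tag.attributes["data-some-value"] == "42");
    assert(tag.dataset.someValue == "42");
    assert(tag.dataset.missing is null);
}
```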

See also
