Functions may be recursive, that is, they may call themselves, either directly or indirectly. Recursion is a powerful technique for many problems, and of course it’s essential for processing recursive data structures. In Section 4.4, we used recursion over a tree to implement a simple insertion sort. In this section, we’ll use it again for processing HTML documents.
The example program below uses a non-standard package,
golang.org/x/net/html
, which provides an HTML parser.
The golang.org/x/...
repositories hold packages designed
and maintained by the Go team for applications such as networking,
internationalized text processing, mobile platforms, image manipulation,
cryptography, and developer tools.
These packages are not in the standard library
because they’re still under development or because they’re rarely
needed by the majority of Go programmers.
The parts of the golang.org/x/net/html
API that we’ll need
are shown below.
The function html.Parse
reads a sequence of bytes, parses them, and
returns the root of the HTML document tree, which is an html.Node
.
HTML has several kinds of nodes—text, comments, and so on—but here we are
concerned only with element nodes of the form <name key='value'>
.
package html type Node struct { Type NodeType Data string Attr []Attribute FirstChild, NextSibling *Node } type NodeType int32 const ( ErrorNode NodeType = iota TextNode DocumentNode ElementNode CommentNode DoctypeNode ) type Attribute struct { Key, Val string } func Parse(r io.Reader) (*Node, error)
The main
function parses the standard input as HTML,
extracts the links using a recursive visit
function, and prints each discovered link:
// Findlinks1 prints the links in an HTML document read from standard input. package main import ( "fmt" "os" "golang.org/x/net/html" ) func main() { doc, err := html.Parse(os.Stdin) if err != nil { fmt.Fprintf(os.Stderr, "findlinks1: %v ", err) os.Exit(1) } for _, link := range visit(nil, doc) { fmt.Println(link) } }
The visit
function traverses an HTML node tree,
extracts the link from the href
attribute of each anchor element
<a href='...'>
, appends the links to a slice of strings, and returns the resulting slice:
// visit appends to links each link found in n and returns the result. func visit(links []string, n *html.Node) []string { if n.Type == html.ElementNode && n.Data == "a" { for _, a := range n.Attr { if a.Key == "href" { links = append(links, a.Val) } } } for c := n.FirstChild; c != nil; c = c.NextSibling { links = visit(links, c) } return links }
To descend the tree for a node n
, visit
recursively calls itself for each of
n
’s children, which are held in the FirstChild
linked list.
Let’s run findlinks
on the Go home page, piping the output of
fetch
(§1.5) to the input of
findlinks
.
We’ve edited the output slightly for brevity.
$ go build gopl.io/ch1/fetch $ go build gopl.io/ch5/findlinks1 $ ./fetch https://golang.org | ./findlinks1 # /doc/ /pkg/ /help/ /blog/ http://play.golang.org/ //tour.golang.org/ https://golang.org/dl/ //blog.golang.org/ /LICENSE /doc/tos.html http://www.google.com/intl/en/policies/privacy/
Notice the variety of forms of links that appear in the page.
Later we’ll see how to resolve them relative to the base
URL, https://golang.org
, to make absolute URLs.
The next program uses recursion over the HTML node tree to print the structure of the tree in outline. As it encounters each element, it pushes the element’s tag onto a stack, then prints the stack.
func main() { doc, err := html.Parse(os.Stdin) if err != nil { fmt.Fprintf(os.Stderr, "outline: %v ", err) os.Exit(1) } outline(nil, doc) } func outline(stack []string, n *html.Node) { if n.Type == html.ElementNode { stack = append(stack, n.Data) // push tag fmt.Println(stack) } for c := n.FirstChild; c != nil; c = c.NextSibling { outline(stack, c) } }
Note one subtlety: although outline
“pushes” an element on
stack
, there is no corresponding pop.
When outline
calls itself recursively, the callee receives a
copy of stack
.
Although the callee may append elements to this slice, modifying its
underlying array and perhaps even allocating a new array, it doesn’t
modify the initial elements that are
visible to the caller, so when the function returns, the caller’s
stack
is as it was before the call.
Here’s the outline of https://golang.org
, again edited for brevity:
$ go build gopl.io/ch5/outline $ ./fetch https://golang.org | ./outline [html] [html head] [html head meta] [html head title] [html head link] [html body] [html body div] [html body div] [html body div div] [html body div div form] [html body div div form div] [html body div div form div a] ...
As you can see by experimenting with outline
, most HTML
documents can be processed with only a few levels of recursion, but
it’s not hard to construct pathological web pages that require
extremely deep recursion.
Many programming language implementations use a fixed-size function call stack; sizes from 64KB to 2MB are typical. Fixed-size stacks impose a limit on the depth of recursion, so one must be careful to avoid a stack overflow when traversing large data structures recursively; fixed-size stacks may even pose a security risk. In contrast, typical Go implementations use variable-size stacks that start small and grow as needed up to a limit on the order of a gigabyte. This lets us use recursion safely and without worrying about overflow.
Exercise 5.1:
Change the findlinks
program to traverse the
n.FirstChild
linked list using recursive calls to visit
instead of a loop.
Exercise 5.2:
Write a function to populate a mapping from element names—p
, div
,
span
, and so on—to the number of elements with that name
in an HTML document tree.
Exercise 5.3:
Write a function to print the contents of all text nodes in an HTML
document tree.
Do not descend into <script>
or <style>
elements,
since their contents are not visible in a web browser.
Exercise 5.4:
Extend the visit
function so that it extracts other kinds of
links from the document, such as images, scripts, and style sheets.
18.188.40.207