Parsing Some HTML

Problem

You want to pull the strings out of some HTML. For example, you’d like to get at the href="urlstringstuff" type strings from the <a> tags within a chunk of HTML.

Solution

For a quick and easy shell parse of HTML, provided it doesn’t have to be foolproof, you might want to try something like this:

cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read a b c ; do echo "$b"; done

Discussion

Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line oriented whereas HTML was designed to treat newlines like whitespace. So it’s not uncommon to see tags split across two or more lines as in:

<a href="blah...blah...blah
  other stuff >

There are also two ways to write <a> tags: one with a separate ending </a> tag, and one without, where the single <a> tag itself ends with />. So, with multiple tags on a line and the last tag split across lines, it’s a bit messy to parse, and our simple bash technique is not foolproof.
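One possible workaround for tags split across lines, which is not part of the recipe itself, is to turn every newline into a space before splitting on the > character. The sketch below assumes that approach; the function name join_then_split and the sample markup are made up for illustration. It does not help when the quoted string itself contains a line break, since the break simply becomes a space inside the string.

```shell
# join_then_split: hypothetical helper (not from the recipe).
# Flattens all newlines to spaces, then puts a newline after every '>'.
join_then_split() {
    # The trailing echo gives sed a properly terminated last line
    { tr '\n' ' '; echo; } | sed -e 's/>/>\
/g'
}

# Sample input (made up): one <a> tag spread over two lines
printf '%s\n' '<a href="one.html"' '  title="First" >' |
    join_then_split | grep '<a'
```

After this step the whole tag sits on one line, so the later grep and field-splitting steps can see both the <a and the quoted string together.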

Here are the steps involved in our solution. First, break the multiple tags on one line into at most one line per tag:

cat file | sed -e 's/>/>\
/g'

Yes, that’s a newline right after the backslash so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline. That will put tags on separate lines with maybe a few extra blank lines. The trailing g tells sed to do the search and replace globally, i.e., multiple times on a line if need be.
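As a quick check of this step, the sketch below feeds a made-up one-line HTML fragment through the same sed command. The function name split_tags and the sample markup are assumptions for illustration only.

```shell
# split_tags: hypothetical helper; puts a newline after every '>' read from stdin
split_tags() {
    sed -e 's/>/>\
/g'
}

# Sample input (made up): two tags and some text on a single line
printf '%s\n' '<p><a href="one.html">One</a>' | split_tags
```

Each tag now ends its own line, and the final > leaves one of those extra blank lines behind, which the later grep step discards.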

Then you can pipe that output into grep to grab just the <a tag lines or maybe just lines with double quotes:

cat file | sed -e 's/>/>\
/g' | grep '<a'

or:

cat file | sed -e 's/>/>\
/g' | grep '".*"'

(that’s g r e p ' " . * " '). The single quotes tell the shell to take the inner characters literally and not do any shell expansion on them; the rest is a regular expression to match a double quote followed by any character (.) any number of times (*) followed by another double quote. (This won’t work if the string itself is split across lines.)
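To see how the two grep patterns differ, here is a small sketch that runs both against the same lines. The function sample_lines and its contents are invented stand-ins for the output of the sed step.

```shell
# sample_lines: hypothetical stand-in for the output of the sed step
sample_lines() {
    printf '%s\n' '<p class="intro">' '<a href="one.html">' 'One</a>' '<img src="pic.png">'
}

# Only lines that contain an opening <a
sample_lines | grep '<a'

# Any line that contains a pair of double quotes
sample_lines | grep '".*"'
```

The first pattern picks out only the <a line here, while the second also picks up the quoted attributes of the <p and <img tags. Note that '<a' is a plain substring match, so it would also match other tags whose names begin with a, such as <abbr>.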

To parse out the contents of what’s inside the double quotes, one trick is to use the shell’s Internal Field Separator ($IFS) to tell it to use the double quote (") as the separator; or you can do a similar thing with awk and its -F option (F for field separator).

For example:

cat "$1" | sed -e 's/>/>\
/g' | grep '".*"' | awk -F'"' '{ print $2}'

(Or use the grep '<a' if you just want <a tags and not all quoted strings.)
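To make the field numbering concrete, the sketch below splits a single made-up tag on double quotes. The variable name line and the sample markup are assumptions for illustration.

```shell
# Sample tag (made up) with two quoted attribute values
line='<a href="one.html" title="First">'

# With '"' as the separator, the even-numbered fields are the quoted strings
printf '%s\n' "$line" | awk -F'"' '{ print $2 }'   # first quoted string
printf '%s\n' "$line" | awk -F'"' '{ print $4 }'   # second quoted string
```

Field 1 is everything before the first quote, so $2 is the first quoted value; the pipeline above therefore reports only the first quoted string on each line.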

If you want to use the $IFS shell trick, rather than awk, it would be:

cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read PRE URL POST ; do echo "$URL"; done

where the grep output is piped into a while loop and the while loop will read the input into three fields (PRE, URL, and POST). By preceding the read command with the IFS='"', we set that environment variable just for the read command, not for the entire script. Thus, for the line of input that it reads, it will parse with the quotes as its notion of what separates the words of the input line. It will set PRE to be everything up to the first quote, URL to be everything from there to the next quote, and POST to be everything thereafter. Then the script just echoes the second variable, URL. That’s all the characters between the quotes.
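The scoping of that IFS assignment can be checked with a short sketch. It reads one made-up line from a here-document rather than from a pipeline, so that read runs in the current shell; the variable old_ifs and the sample tag are assumptions, and the -r option is an addition here to keep any backslashes in the input literal.

```shell
old_ifs=$IFS        # remember the shell's current separator list

# Per-command assignment: IFS is a double quote only while read runs
IFS='"' read -r PRE URL POST <<EOF
<a href="one.html" title="First">
EOF

printf 'URL is: %s\n' "$URL"

# The shell's own IFS is the same as before the read
if [ "$IFS" = "$old_ifs" ]; then
    echo "IFS unchanged"
fi
```

Because the assignment is attached to the read command only, later word splitting in the script still uses the default whitespace separators.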

See Also

  • man sed

  • man grep
