You want to pull the strings out of some HTML. For example, you’d like to get at the href="urlstringstuff" type strings from the <a> tags within a chunk of HTML.
For a quick and easy shell parse of HTML, provided it doesn’t have to be foolproof, you might want to try something like this:
cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read a b c ; do echo "$b"; done
Parsing HTML from bash is pretty tricky, mostly because bash tends to be very line-oriented whereas HTML was designed to treat newlines like whitespace. So it’s not uncommon to see tags split across two or more lines, as in:
<a href="blah...blah...blah
    other stuff >
There are also two ways to write <a> tags, one with a separate ending </a> tag, and one without, where instead the singular <a> tag itself ends with a />. So, with multiple tags on a line and the last tag split across lines, it’s a bit messy to parse, and our simple bash technique for this is often not foolproof.
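To make the split-tag problem concrete, here is a minimal sketch of one possible workaround: flattening newlines into spaces before doing any line-oriented matching. The sample input and the flatten helper are hypothetical and are not part of the recipe's own solution; flattening also discards every original line break, which may not suit every document.

```shell
#!/usr/bin/env bash
# Hypothetical input: the <a> tag is split, with href on its second line.
split_html='<a
   href="two.html">Two</a>'

# flatten: turn every newline into a space so each tag sits on one line.
# This is an assumed pre-processing step, not part of the recipe itself.
flatten() {
    tr '\n' ' '
}

# After flattening, a line-oriented search can see the tag and its href together.
printf '%s\n' "$split_html" | flatten | grep '<a.*href'
```

Without the flatten step, the same grep pattern finds nothing, because <a and href sit on different lines.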
Here are the steps involved in our solution. First, break the multiple tags on one line into at most one line per tag:
cat file | sed -e 's/>/>\
/g'
Yes, that’s a newline right after the backslash, so that it substitutes each end-of-tag character (i.e., the >) with that same character and then a newline. That will put tags on separate lines with maybe a few extra blank lines. The trailing g tells sed to do the search and replace globally, i.e., multiple times on a line if need be.
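As a small illustration of this step, the sketch below runs the substitution on a hypothetical one-line sample. The variable sample and the function name split_tags are made up for the example. Note that the second line of the sed expression must start in column one, or the leading spaces would become part of the replacement text.

```shell
#!/usr/bin/env bash
# Hypothetical sample input: several tags on a single line.
sample='<p><a href="one.html">One</a><a href="two.html">Two</a></p>'

# split_tags: replace every > with > plus a newline, so that
# no output line contains more than one end-of-tag character.
split_tags() {
    sed -e 's/>/>\
/g'
}

# Each <a ...> tag now ends its own line.
printf '%s\n' "$sample" | split_tags
```

Because the sample line itself ends with a >, the output finishes with one extra blank line, which is the kind of harmless leftover the text mentions.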
Then you can pipe that output into grep to grab just the <a tag lines or maybe just lines with double quotes:
cat file | sed -e 's/>/>\
/g' | grep '<a'
or:
cat file | sed -e 's/>/>\
/g' | grep '".*"'
(that’s g r e p ' " . * " '). The single quotes tell the shell to take the inner characters literally and not do any shell expansion on them; the rest is a regular expression to match a double quote followed by any character (.) any number of times (*) followed by another double quote. (This won’t work if the string itself is split across lines.)
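The difference between the two filters is easier to see on a small input. The sketch below uses a hypothetical set of lines, already one tag per line, as the sed step would leave them; the variable name lines is invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical input, already split to at most one tag per line.
lines='<img src="logo.png">
<a href="one.html">
One</a>
<b>'

# Keep only the lines that contain the start of an <a tag.
printf '%s\n' "$lines" | grep '<a'

# Keep every line that contains a double-quoted string, whatever the tag.
printf '%s\n' "$lines" | grep '".*"'
```

The first filter passes only the anchor line, while the second also passes the <img> line, since it too contains a quoted string.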
To parse out the contents of what’s inside the double quotes, one trick is to use the shell’s Internal Field Separator ($IFS) to tell it to use the double quote (") as the separator; or you can do a similar thing with awk and its -F option (F for field separator). For example:
cat "$1" | sed -e 's/>/>\
/g' | grep '".*"' | awk -F'"' '{ print $2}'
(Or use the grep '<a' if you just want <a tags and not all quoted strings.)
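To see what awk does with the double quote as its field separator, here is a minimal sketch on a single hypothetical tag line. The function name quoted_field is made up for the example.

```shell
#!/usr/bin/env bash
# quoted_field: split each line on double quotes and print the second field,
# i.e., the text between the first pair of quotes on the line.
quoted_field() {
    awk -F'"' '{ print $2 }'
}

# Fields here are: <a href=  |  one.html  |   class=  |  nav  |  >
printf '%s\n' '<a href="one.html" class="nav">' | quoted_field
```

Only the first quoted string on each line is printed, so if another attribute comes before href in the tag, that attribute's value is what you get.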
If you want to use the $IFS shell trick, rather than awk, it would be:
cat "$1" | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read PRE URL POST ; do echo "$URL"; done
where the grep output is piped into a while loop and the while loop will read the input into three fields (PRE, URL, and POST). By preceding the read command with the IFS='"', we set that environment variable just for the read command, not for the entire script. Thus, for the line of input that it reads, it will parse with the quotes as its notion of what separates the words of the input line. It will set PRE to be everything up to the first quote, URL to be everything from there to the next quote, and POST to be everything thereafter. Then the script just echoes the second variable, URL. That’s all the characters between the quotes.
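The read loop can be tried on its own with a hypothetical tag line, as in the sketch below. The function name extract_urls is invented for the example, and the -r option to read is an addition here (it stops read from treating backslashes specially); the recipe's own command does not use it.

```shell
#!/usr/bin/env bash
# extract_urls: for each input line, split on double quotes and
# print the second field. IFS is changed only for the read command.
extract_urls() {
    while IFS='"' read -r PRE URL POST; do
        printf '%s\n' "$URL"
    done
}

# PRE gets <a href=, URL gets one.html, POST gets the rest of the line.
printf '%s\n' '<a href="one.html" class="nav">' | extract_urls

# IFS is untouched outside the read, so normal word splitting still applies.
set -- $(printf 'a b c')
printf 'words: %s\n' "$#"
```

Because POST is the last variable named, it receives everything after the second quote, including any further quotes on the line.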