Interpreting HTML stream

In this recipe, we will see how HTML code can be read and interpreted using regular expressions. We will create a program that reads an HTML stream held in a string and displays the tag names along with the contents of the tags. The FIND and REPLACE statements are used together with a DO loop. (This recipe focuses on reading tags that begin with <tag> and end with </tag>.)

How to do it...

To create a program for interpreting HTML code, proceed as follows:

  1. Declare three strings named htmlstream, tagcontents, and tagname.
  2. Assign suitable HTML code to the htmlstream variable.
  3. Within a DO loop, add a FIND REGEX statement that finds tag names and their contents. The regex used in this case for matching an HTML tag is '<(\u\w*)[^>]*>(.*)</\1>'.
  4. Once a tag is processed, use a REPLACE ALL OCCURRENCES statement to replace the tag name with '$$$'.
  5. Print the tag name and tag contents.
  6. Once all the tags are processed, the EXIT statement is executed.
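The steps above can be sketched roughly as follows. This is a minimal illustration rather than the book's original listing: the report name zhtml_interpret and the sample HTML are assumptions made for the example. The sample was chosen so that no tag name also occurs in the text between the tags, because replacing the bare tag name would alter such text as well.

```abap
REPORT zhtml_interpret. " hypothetical report name

DATA: htmlstream  TYPE string,
      tagcontents TYPE string,
      tagname     TYPE string.

" Sample input (assumed for illustration).
htmlstream = '<html><H1>Heading</H1><b>Bold text</b></html>'.

DO.
  " Find the next tag pair; the two subgroups go to tagname and tagcontents.
  FIND REGEX '<(\u\w*)[^>]*>(.*)</\1>' IN htmlstream
       IGNORING CASE
       SUBMATCHES tagname tagcontents.

  " No further tag pair found: leave the loop.
  IF sy-subrc <> 0.
    EXIT.
  ENDIF.

  WRITE: / tagname, tagcontents.

  " Mask the processed tag name so that the next FIND skips it.
  REPLACE ALL OCCURRENCES OF tagname IN htmlstream WITH '$$$'.
ENDDO.
```

Because the outer tag is matched first, its contents still include the inner tags; these are then reported individually on later loop passes.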

How it works...

We have used IGNORING CASE since tag names may be written in uppercase or lowercase, such as H1 or h1. The regular expression searches for tags starting with a <, followed by a single letter (denoted in the regex by \u), followed by zero or more alphanumeric characters (\w*). After this, an optional substring comprising any characters except > may be found ([^>]*), followed by a > character. This will match HTML tag names such as H1, H2, HTML, or html. The tag name, without the special characters < and >, is assigned to a subgroup that is then available in the submatch variable tagname. The closing tag is checked against the opening tag using the back-reference operator \1. Note that in this case, the forward slash / is part of the HTML code denoting the end of the tag. The content of a particular tag is read into the submatch variable tagcontents.
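To see the individual parts of the expression at work, a single FIND on one tag that carries an attribute may help. The following is only a sketch: the report name, the variable sample, and its contents are assumptions for illustration.

```abap
REPORT zhtml_single_tag. " hypothetical report name

DATA: sample      TYPE string,
      tagname     TYPE string,
      tagcontents TYPE string.

" One tag with an attribute; [^>]* is meant to skip over ' id="intro"'.
sample = '<h2 id="intro">Overview</h2>'.

FIND REGEX '<(\u\w*)[^>]*>(.*)</\1>' IN sample
     IGNORING CASE
     SUBMATCHES tagname tagcontents.

IF sy-subrc = 0.
  " First subgroup: tag name. Second subgroup: text between the tags.
  WRITE: / 'Tag:', tagname, 'Contents:', tagcontents.
ELSE.
  WRITE: / 'No matching tag pair found.'.
ENDIF.
```

Here the first subgroup should capture h2 and the second Overview, while the attribute is consumed by [^>]* and is not returned in either submatch.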

The FIND statement finds the tags one at a time. Once a tag is processed, we replace its tag name with $$$ so that the FIND statement does not find it again. On the next pass of the DO loop, the next tag is matched and its contents are read.

Using a WRITE statement, all the tag names and tag contents are printed on screen. The output is shown in the following screenshot:


Once all the tags are processed, the FIND statement no longer finds a match, so sy-subrc is set to a value other than zero and the loop is exited.
