Manipulating the awk record separator to report on XML data

So far, while working with awk, we have limited ourselves to individual lines, with each new line representing a new record. Although this is often what we want, tagged data such as XML may have individual records that span multiple lines. In this case, we may need to set the RS, or record separator, internal variable.

Apache Virtual Hosts

In Chapter 9, Automating Apache Virtual Hosts, we worked with Apache Virtual Hosts. These use tagged data that defines the start and end of each Virtual Host. Even though we prefer to store each Virtual Host in its own file, they can be combined into a single file. Consider the following file that stores the possible Virtual Host definitions; it can be stored as the virtualhost.conf file, as shown:

<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/theurbanpenguin
ServerName www.theurbanpenguin.com
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/packt
ServerName www.packtpub.com
# Other directives here
</VirtualHost>

We have the three Virtual Hosts within a single file. Each record is separated by an empty line, meaning that two newline characters logically separate each entry. We explain this to awk by setting the RS variable as follows: RS="\n\n". With this in place, we can then print the required Virtual Host record. The variable is set in the BEGIN code block of the control file.
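The effect of the separator can be checked on a small sample before touching the real file. The following is a minimal sketch, assuming gawk or mawk, both of which treat a multi-character RS as a regular expression (POSIX leaves that case unspecified). The sample text and the variable names by_line and by_block are made up for illustration.

```shell
# Sample input: two blocks of two lines each, separated by one empty line.
# Default RS: every line, including the empty one, is a record.
by_line=$(printf 'alpha\nbeta\n\ngamma\ndelta\n' | awk 'END { print NR }')

# RS set to two newlines: each block is a single record.
by_block=$(printf 'alpha\nbeta\n\ngamma\ndelta\n' |
  awk 'BEGIN { RS="\n\n" } END { print NR }')

printf 'default RS:    %s records\n' "$by_line"
printf 'blank-line RS: %s records\n' "$by_block"
```

With the default separator, awk counts five records (the empty line included); with the blank-line separator, it counts two.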

We also need to search dynamically for the desired host configuration, taking the search term from the command line. We build this into the control file, which should look similar to the following code:

BEGIN { RS="\n\n" ; }
$0 ~ search { print }

The BEGIN block sets the variable and then we move on to the range. The range is set so that the record ($0) matches (~) the search variable. We must set this variable when awk is executed. The following command demonstrates the command-line execution where the control file and configuration file are located within our working directory:

$ awk -f vh.awk search=packt virtualhost.conf
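To try this end to end, the whole exercise can be reproduced in a scratch directory. This is a sketch rather than a definitive setup: it assumes gawk or mawk (for the multi-character RS), and the scratch directory and the vhost variable are illustrative additions that are not part of the original example.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The combined Virtual Host file; records are separated by an empty line.
cat > virtualhost.conf <<'EOF'
<VirtualHost *:80>
DocumentRoot /www/example
ServerName www.example.org
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/theurbanpenguin
ServerName www.theurbanpenguin.com
# Other directives here
</VirtualHost>

<VirtualHost *:80>
DocumentRoot /www/packt
ServerName www.packtpub.com
# Other directives here
</VirtualHost>
EOF

# The control file: blank-line record separator, then match on the search variable.
cat > vh.awk <<'EOF'
BEGIN { RS="\n\n" }
$0 ~ search { print }
EOF

# search=packt is assigned after BEGIN runs and before the file is read.
vhost=$(awk -f vh.awk search=packt virtualhost.conf)
printf '%s\n' "$vhost"
```

Only the block whose text matches packt should be printed, from its opening tag to its closing tag.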

We can see this more clearly by looking at the command and the output that is produced in the following screenshot:

[Screenshot: Apache Virtual Hosts]

XML catalog

We can extend this further into XML files where we may not want to display the complete record, but just certain fields. Consider the following product catalog:

<product>
<name>drill</name>
<price>99</price>
<stock>5</stock>
</product>

<product>
<name>hammer</name>
<price>10</price>
<stock>50</stock>
</product>

<product>
<name>screwdriver</name>
<price>5</price>
<stock>51</stock>
</product>

<product>
<name>table saw</name>
<price>1099.99</price>
<stock>5</stock>
</product>

Logically, each record is delimited, as before, by the empty line. Each field, though, is a little more detailed, and we need to set the field delimiter as follows: FS="[><]". This defines either the opening or the closing angle bracket as the field delimiter.

To help analyze this, we can write a single record on one line as follows:

<product><name>top</name><price>9</price><stock>5</stock></product>

Each angle bracket is a field separator, which means that we will have some empty fields. We could rewrite this line in CSV form:

,product,,name,top,/name,,price,9,/price,,stock,5,/stock,,/product,

We just replace each angle bracket with a comma; in this way, it is more easily read by us. We can see that the content of field 5 is the top value.

Of course, we will not edit the XML file; we will leave it in the XML format. The conversion here is just to highlight how the field separators can be read.
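Rather than counting commas by eye, we can ask awk to number the fields for us. The following is a small sketch using the single-line record from above; the record and fields variable names are illustrative only.

```shell
record='<product><name>top</name><price>9</price><stock>5</stock></product>'

# Split on either angle bracket and print every field with its number.
fields=$(printf '%s\n' "$record" |
  awk 'BEGIN { FS="[><]" } { for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The listing shows the empty fields as [] and confirms that the tag names land in fields 4, 8, and 12, with their values in fields 5, 9, and 13.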

The control file that we use to extract data from the XML file is illustrated in the following code example:

BEGIN { FS="[><]"; RS="\n\n" ; OFS=" "; }
$0 ~ search { print $4 ": " $5, $8 ": " $9, $12 ": " $13 }

Within the BEGIN code block, we set the FS and RS variables as we have discussed. We also set the OFS, or Output Field Separator, to a space. In this way, when we print the fields, we separate the values with a space rather than leaving the angle brackets in. The range makes use of the same match that we used before when looking at the Virtual Hosts.

If we need to search for the product drill from within the catalog we can use the command laid out in the following example:

$ awk -f catalog.awk search=drill catalog.xml
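The catalog report can be reproduced in the same way. This sketch again assumes gawk or mawk for the multi-character RS, sets OFS to a single space as the text describes, and uses a scratch directory and a report variable that are illustrative additions.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > catalog.xml <<'EOF'
<product>
<name>drill</name>
<price>99</price>
<stock>5</stock>
</product>

<product>
<name>hammer</name>
<price>10</price>
<stock>50</stock>
</product>

<product>
<name>screwdriver</name>
<price>5</price>
<stock>51</stock>
</product>

<product>
<name>table saw</name>
<price>1099.99</price>
<stock>5</stock>
</product>
EOF

# Fields split on angle brackets, records on empty lines, output joined by a space.
cat > catalog.awk <<'EOF'
BEGIN { FS="[><]"; RS="\n\n"; OFS=" " }
$0 ~ search { print $4 ": " $5, $8 ": " $9, $12 ": " $13 }
EOF

report=$(awk -f catalog.awk search=drill catalog.xml)
printf '%s\n' "$report"
```

Under those assumptions, the command should print name: drill price: 99 stock: 5. Note that search is treated as a regular expression matched against the whole record, so a term such as 5 would match every product whose text contains that digit.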

The following screenshot shows the output in detail:

[Screenshot: XML catalog]

We have now been able to take a rather messy XML file and create readable reports from the catalog. The power of awk is highlighted again and, for us, for the last time within this book. By now, I hope you too can start to make use of it on a regular basis.
