Statistics with a Better Histogram

Here's yet another version of the statistics program we've been working with throughout the last week. Yesterday's version included a horizontal histogram that looked like this:

Frequency of Values:
1  | *****
2  | *************
3  | *******************
4  | ****************
5  | ***********
6  | ****
43 | *
62 | *

For this version, we'll print a vertical histogram that looks like this:

     *
     *
     *
     *
     *
     * *
     * * *
     * * *       *     *     *
     * * *   * * *     *     *
   * * * *   * * *     *     *
   * * * *   * * *     *     *
 * * * * * * * * *     *     *  *
 * * * * * * * * *  *  *  *  *  *  *  *
---------------------------------------
 1 2 3 4 5 6 7 8 9 12 23 25 34 37 39 42

This form of histogram is actually much harder to produce than a horizontal histogram; this version uses two nested loops (a for and a foreach) and some careful counting to make it come out the right way.

There's one other change to this version of the script, now called statsfinal.pl: it gets its data from a file, rather than making you enter all the data at a prompt. As with all the scripts that read data from a file, you need to specify the data file on the command line, as follows:

% statsfinal.pl data.txt

The data file, here called data.txt, holds one number per line. First, let's look closely at the two parts of this script that are different from the last version: the part that reads in the data file, and the part that prints the histogram.

Listing 7.1 shows the code for our final statistics script. Given how much we've been working with this script up to this point, it should look familiar to you. The parts to concentrate on are the input loop in lines 14 through 21, and the code to generate the new histogram in lines 36 through 49.

Listing 7.1. The statsfinal.pl Script
1:  #!/usr/local/bin/perl -w
2:
3:  $input = "";  # temporary input
4:  @nums = ();   # array of numbers;
5:  %freq = ();   # hash of number frequencies
6:  $maxfreq = 0; # maximum frequency
7:  $count = 0;   # count of numbers
8:  $sum = 0;     # sum of numbers
9:  $avg = 0;     # average
10: $med = 0;     # median
11: @keys = ();   # temp keys
12: $totalspace = 0; # total space across histogram
13:
14: while (defined ($input = <>)) {
15:     chomp ($input);
16:     $nums[$count] = $input;
17:     $freq{$input} ++;
18:     if ($maxfreq < $freq{$input} ) { $maxfreq = $freq{$input}  }
19:     $count++;
20:     $sum += $input;
21: }
22: @nums = sort { $a <=> $b }  @nums;
23:
24: $avg = $sum / $count;
25: $med = $nums[$count /2];
26:
27: print "\nTotal count of numbers: $count\n";
28: print "Total sum of numbers: $sum\n";
29: print "Minimum number: $nums[0]\n";
30: print "Maximum number: $nums[$#nums]\n";
31: printf("Average (mean): %.2f\n", $avg);
32: print "Median: $med\n\n";
33:
34: @keys = sort { $a <=> $b }  keys %freq;
35:
36: for ($i = $maxfreq; $i > 0; $i--) {
37:     foreach $num (@keys) {
38:         $space = (length $num);
39:         if ($freq{$num}  >= $i) {
40:             print( (" " x $space) . "*");
41:         } else {
42:             print " " x (($space) + 1);
43:         }
44:         if ($i == $maxfreq) { $totalspace += $space + 1; }
45:     }
46:     print "\n";
47: }
48: print "-" x $totalspace;
49: print "\n @keys\n";

Because you've seen the boilerplate code for reading data from files using <>, nothing in lines 14 through 21 should be too much of a surprise. Note that we read each line (that is, each number) into the $input variable, and then use that value throughout the block.

Why not use $_? We could have done that here, but a lot of the statements in this block need an actual variable reference (they don't default to $_). Using $_ for that reference would have made the code only slightly shorter, but would have decreased the readability of the example, and in this case, it was a better idea to err on the side of readability.
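For comparison, here is a rough sketch of what the input loop might look like with $_. It is not part of Listing 7.1: the tally subroutine, the lexical (my) variables, and the in-memory sample data are all assumptions made so that the sketch runs on its own without a data file.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the listing): tally numbers from an open
# filehandle, relying on $_ instead of a named $input variable.
sub tally {
    my ($fh) = @_;
    my (@nums, %freq);
    my ($maxfreq, $sum) = (0, 0);
    while (<$fh>) {        # each line lands in $_ implicitly
        chomp;             # chomp defaults to $_ ...
        push @nums, $_;    # ... but everything else must name $_ explicitly
        $freq{$_}++;
        $maxfreq = $freq{$_} if $freq{$_} > $maxfreq;
        $sum += $_;
    }
    return (\@nums, \%freq, $maxfreq, $sum);
}

# Read from an in-memory "file" of made-up numbers, one per line.
my $data = "3\n1\n3\n2\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my ($nums, $freq, $maxfreq, $sum) = tally($fh);
close $fh;

print "count: ", scalar(@$nums), " sum: $sum maxfreq: $maxfreq\n";
```

Only chomp gets shorter; every other statement still has to spell out $_, which is why the named variable costs so little.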

Note

A point to remember throughout this book as I explain more and more strange and obscure bits of Perl—just because Perl uses a particular feature doesn't mean you have to use it. Consider the tradeoffs between creating very small code that no one except a Perl wizard can decipher, versus longer, maybe less efficient, but more readable code. Consider it particularly well done if someone else can read your Perl code further down the line.


Anyhow, other than reading the input from a file instead of standard input, much of the while block is the same as it was in yesterday's version of this script. The one other difference is the addition in line 18 to calculate the $maxfreq value. This value is the maximum frequency of any number, that is, the number of times the most frequent number appears in the data set. We'll use this value later to determine the overall height of the histogram. Here, all we do is compare the current maximum frequency to the frequency of the number we just read, and change $maxfreq if the new one is larger.
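To see the running maximum in isolation, here is a small sketch using made-up values. The @sample array and the $check variable are assumptions for illustration; the second calculation, using max from the core List::Util module, is an alternative the listing does not use, shown only to confirm that both approaches agree.

```perl
use strict;
use warnings;
use List::Util qw(max);

my @sample = (4, 2, 4, 7, 4, 2);   # made-up values, not from data.txt
my %freq;
my $maxfreq = 0;

foreach my $n (@sample) {
    $freq{$n}++;
    # same comparison as line 18 of the listing
    if ($maxfreq < $freq{$n}) { $maxfreq = $freq{$n} }
}

# Alternative: one pass over the finished hash gives the same answer.
my $check = max(values %freq);

print "running maximum: $maxfreq, after the fact: $check\n";
```

Tracking the maximum inside the input loop saves a second pass over the hash, at the cost of one extra comparison per line of input.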

Farther down in the script, after we've sorted, summed, and printed, we get to the histogram part of the script, in the daunting set of loops in lines 36 through 49.

Building a horizontal histogram like we did yesterday is much easier than building one vertically. With the horizontal histogram, you can just loop through the keys in the %freq hash and print out the appropriate number of asterisks (plus some minor formatting). For the vertical histogram, we need to keep track of the overall layout much more closely, as each line we draw doesn't have any direct relationship to any specific key or value in the hash. Also, we still must keep track of the spaces for formatting.
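As a reminder of how little the horizontal version needs, here is a minimal sketch of that single loop. It is not yesterday's actual code: the sample frequencies are made up, and the fixed label width of three characters is an assumption chosen to match the output shown at the start of this section.

```perl
use strict;
use warnings;

my %freq = (1 => 5, 2 => 3, 43 => 1);   # made-up frequencies
my @lines;

foreach my $key (sort { $a <=> $b } keys %freq) {
    # pad the label to a fixed width, then one asterisk per occurrence
    push @lines, sprintf("%-3s| %s", $key, '*' x $freq{$key});
}

print "$_\n" foreach @lines;
```

Each output line depends on exactly one key and its value, which is what makes this version so easy.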

We'll keep track of the histogram using two loops. The outer loop, a for loop, controls the number of lines to print, that is, the overall height from top to bottom. The second loop is a foreach loop that moves from left to right within each line, printing either an asterisk or a space. With two nested loops (the for and the foreach), we can go from left to right and line to line, with both the height and the width of the histogram determined by the actual values in the data.

First, we extract a sorted list of keys out of the %freq hash in line 34. This is mostly for convenience and to make the for loops coming up at least a little less complex.

Line 36 starts our outer for loop. The overall height of the histogram is determined by the most frequent value in the data set. Here's where we make use of that $maxfreq variable we calculated when the data was read in. This outer for loop starts at the maximum frequency and works down to 1 (it stops as soon as $i reaches 0), printing one line per pass.

The inner loop prints each line, looping over the values in the data set (the keys from the %freq hash). For each value, we print either a space or a *, depending on whether that value's frequency is high enough to show up on the current line. We also keep track of formatting, to add more space for those values that have multiple digits (the spacing for a value of 333 will be different from that for 1).

Line by line, starting at line 38, here's what we're doing:

  • In line 38, we calculate the space this column will need, based on the number of digits in the current value.

  • Lines 39 and 40 print a * if the * is warranted. The test here is to see if the current value we're looking at is at least as frequent as our vertical position in the histogram (frequency greater than or equal to the current value of $i). This way, at the top of the histogram we'll get fewer asterisks, and as we progress downward and $i gets lower, more values will need asterisks. Note that the print statement in line 40 prints both the asterisk and enough spaces to space it out to the correct width.

  • If there's no * to be printed in this line (lines 41 and 42), we print the right amount of filler space: space for the column, plus one extra.

  • Line 44 is a puzzler. What's this here for? This line is here to calculate the total width of the histogram, based on the lengths of all the values in the data set plus one space for each. We'll need this in line 48 when we print a divider line, but because we're already in the midst of a loop here, I figured I'd get this calculation now instead of waiting until then. What this line does is check whether $i is equal to $maxfreq, that is, whether we're on the very first pass of the outer for loop. If so, it adds the width of the current column to the $totalspace variable, so that by the end of that first pass $totalspace holds the full width.

  • And, finally, in line 46, when we're done with a line of data, we print a newline to restart the next line at the appropriate spot.
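The steps above can be sketched as a small subroutine that builds the rows as strings instead of printing them, which makes the spacing easier to inspect. This is a restatement under assumptions, not the listing itself: the name histogram_rows, the lexical variables, and the sample frequencies are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build the rows of a vertical histogram
# from a hash of value => frequency pairs.
sub histogram_rows {
    my (%freq) = @_;
    my @keys = sort { $a <=> $b } keys %freq;
    my $maxfreq = 0;
    foreach my $f (values %freq) { $maxfreq = $f if $f > $maxfreq }

    my @rows;
    for (my $i = $maxfreq; $i > 0; $i--) {     # top row first
        my $row = '';
        foreach my $num (@keys) {               # left to right
            my $space = length $num;            # digits in this label
            if ($freq{$num} >= $i) {
                $row .= (' ' x $space) . '*';   # star in the last position
            } else {
                $row .= ' ' x ($space + 1);     # filler of equal width
            }
        }
        push @rows, $row;
    }
    return @rows;
}

my @rows = histogram_rows(1 => 2, 5 => 3, 12 => 1);
print "$_|\n" foreach @rows;   # trailing | makes the filler spaces visible
```

Every row comes out the same width (seven characters for this sample), because a column takes up length-of-label plus one whether it holds a star or filler.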

With the columns of the histogram printed, all we've got left are the labels on the bottom. In line 48 we print an appropriate number of hyphens to mimic a horizontal line (using the value we calculated for $totalspace). In line 49 we print the set of keys interpolated inside a string, which prints all the elements in @keys with spaces between them; the leading space in that string lines each label up under its column.
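If folding the width calculation into the histogram loop feels too clever, it can also be done on its own. The following is a sketch of that alternative, not what the listing does; the sample keys and the $divider and $labels variables are assumptions for illustration.

```perl
use strict;
use warnings;

my @keys = (1, 5, 12);   # made-up keys, already sorted

# One column per key: the digits in the label plus one space.
my $totalspace = 0;
$totalspace += length($_) + 1 foreach @keys;

my $divider = '-' x $totalspace;
my $labels  = " @keys";   # interpolating an array joins it with spaces

print "$divider\n$labels\n";
```

The label string ends up exactly as wide as the divider, since each label contributes its digits plus the single space in front of it.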

Complicated nested loops such as this are particularly hard to follow, and sometimes a description like the one I just gave you isn't enough. If you're still really bewildered about how this example works, consider working through it step by step, loop by loop, making sure you understand the current values of all the variables and how they relate to each other.
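One way to do that stepping through is to have the loops describe themselves. Here is a minimal sketch that records one line per pass of the inner loop; the tiny %freq hash, the hard-coded $maxfreq, and the @trace array are assumptions made to keep the trace short.

```perl
use strict;
use warnings;

my %freq = (1 => 2, 5 => 1);   # made-up frequencies
my @keys = sort { $a <=> $b } keys %freq;
my $maxfreq = 2;               # highest frequency in the sample above
my @trace;

for (my $i = $maxfreq; $i > 0; $i--) {
    foreach my $num (@keys) {
        # same test as line 39 of the listing
        my $mark = $freq{$num} >= $i ? '*' : 'space';
        push @trace, "i=$i num=$num freq=$freq{$num} prints $mark";
    }
}

print "$_\n" foreach @trace;
```

Reading the trace top to bottom shows the outer loop counting down while the inner loop visits every key on each pass.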
