An Example: Frequencies in the Statistics Program

Let's modify our statistics script again, this time to add a feature that keeps track of the number of times each number appears in the input data. We'll use this feature to print out a histogram of the frequencies of each bit of data. Here's an example of what that histogram will look like (other than the histogram, the output the script produces is the same as it was before, so I'm not going to duplicate that here):

Frequency of Values:
1  | *****
2  | *************
3  | *******************
4  | ****************
5  | ***********
6  | ****
43 | *
62 | *

To keep track of each number's frequency in our script, we use a hash, with the keys being the actual numbers in the data and the values being the number of times that number occurs in the data. The histogram part then loops over that hash and prints out a graphical representation of the number of times the data occurs. Easy!

Listing 5.1 shows the Perl code for our new script.

Listing 5.1. `stillmorestats.pl`

1:  #!/usr/local/bin/perl -w
2:
3:  $input = '';  # temporary input
4:  @nums = ();   # array of numbers
5:  %freq = ();   # hash of number frequencies
6:  $count = 0;   # count of numbers
7:  $sum = 0;     # sum of numbers
8:  $avg = 0;     # average
9:  $med = 0;     # median
10: $maxspace = 0;# max space for histogram
11:
12: while () {
13:   print 'Enter a number: ';
14:   chomp ($input = <STDIN>);
15:   if ($input eq '') { last; }
16:
17:   if ($input =~ /\D/) {
18:       print "Digits only, please.\n";
19:       next;
20:   }
21:
22:   push @nums, $input;
23:   $freq{$input} ++;
24:   $count++;
25:   $sum += $input;
26: }
27:
28: @nums = sort { $a <=> $b }  @nums;
29: $avg = $sum / $count;
30: $med = $nums[$count / 2];
31:
32: print "\nTotal count of numbers: $count\n";
33: print "Total sum of numbers: $sum\n";
34: print "Minimum number: $nums[0]\n";
35: print "Maximum number: $nums[$#nums]\n";
36: printf("Average (mean): %.2f\n", $avg);
37: print "Median: $med\n\n";
38: print "Frequency of Values:\n";
39:
40: $maxspace = (length $nums[$#nums]) + 1;
41:
42: foreach $key (sort { $a <=> $b }  keys %freq) {
43:   print $key;
44:   print ' ' x ($maxspace - length $key);
45:   print '| ', '*' x $freq{$key} , "\n";
46: }

This script hasn't changed much from the previous one; the only changes are in lines 5, 10, and 23, and in the section at the end, lines 38 to 46. You might look over those lines now to see how they fit into the rest of the script that we've already written.

Lines 5 and 10 are easy. These are just new variables that we'll use later on in the script: the %freq hash, which will store the frequency of the data; and $maxspace, which will hold a temporary space variable for formatting the histogram (more about this when we go over how the histogram is built).

Line 23 is much more interesting. This line is inside the loop where we're reading the input; line 22 is where we push the current input onto the array of values. In line 23, what we're doing is looking up the input number as a key in the frequencies hash, and then incrementing the value referred to by that key by 1 (using the ++ operator).

The key is the number itself, whereas the value is the number of times that number appears in the data. If the number that was input doesn't yet appear as a key in the hash, then this line will add it and increment the value to 1. Each time after that, it then just keeps incrementing the frequency as the same number appears in the data.

At the end of the input loop, then, you'll end up with a hash that contains, as keys, all the unique values in the data set, and as values, the number of times each value appears. All that's left now is to print the usual sum, average and median, and a histogram of that data.
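
The counting idiom from line 23 can be tried on its own. The following is a minimal sketch, not part of the listing: the `@input` data is made up, and it uses `use strict` with `my` variables, which the listing does not.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the numbers typed at the prompt.
my @input = (3, 1, 3, 2, 3, 1);

my %freq;                  # number => how many times it has been seen
foreach my $n (@input) {
    $freq{$n}++;           # a key seen for the first time is created and set to 1
}

# Print the counts in numeric key order.
foreach my $key (sort { $a <=> $b } keys %freq) {
    print "$key appears $freq{$key} time(s)\n";
}
```

No test for whether the key already exists is needed: incrementing a value that isn't there yet treats it as 0 and stores 1.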

Instead of going over lines 38 through 46 line by line as I've done in past examples, I'd like to show you how I built this loop when I wrote the script itself, so you can see my thinking in how this loop came out. This will actually give you a better idea of why I did what I did.

My first pass at this loop was just an attempt to get the values to print in the right order. I started with a foreach loop not unlike the one I described in “Processing All the Values in a Hash” earlier in this lesson:

foreach $key (sort { $a <=> $b }  keys %freq) {
  print "Key: $key Value: $freq{$key}\n";
}

In this loop, I use foreach to loop over each key in the hash. The order in which the elements are presented, however, is controlled by the list in parentheses on the first line. The keys %freq part extracts all the keys from the hash, and sort then sorts them (remember, sort by default sorts in ASCII order; adding { $a <=> $b } forces a numeric sort). This results in the hash being processed in order from lowest key to highest.
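
To see why the numeric comparison matters, here is a small sketch that sorts the same list both ways. The `@keys` values are hypothetical, chosen so that the two orders differ.

```perl
use strict;
use warnings;

my @keys = (10, 9, 2, 100);    # hypothetical hash keys

my @as_strings = sort @keys;                 # default: compares as strings
my @as_numbers = sort { $a <=> $b } @keys;   # compares as numbers

print "Default sort: @as_strings\n";   # prints "Default sort: 10 100 2 9"
print "Numeric sort: @as_numbers\n";   # prints "Numeric sort: 2 9 10 100"
```

With the default sort, 10 and 100 come before 2 because the character "1" sorts before "2", which is not the order a histogram of numbers should use.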

Inside the loop, then, all I have to do is print the keys and the values. Here's the output of the loop when I add some simple data to %freq:

Key: 2 Value: 4
Key: 3 Value: 5
Key: 4 Value: 3
Key: 5 Value: 1

That's a good printout of the values of the %freq hash, but it's not a histogram. My second pass changes the print statement to use the string repetition operator x (you learned about it on Day 3) to print out the appropriate number of asterisks for the frequency of numbers:

foreach $key (sort { $a <=> $b }  keys %freq) {
  print "$key |", '*' x $freq{$key} , "\n";
}

This is closer; it produces output like this:

2 | ****
3 | *****
4 | ***
5 | *

The problem comes when the input data is larger than 9. Depending on the number of characters in the key, the formatting of the histogram can get really screwed up. Here's what the histogram looked like when I input numbers of one, two, and four digits:

2 | ****
3 | *****
4 | ***
5 | *
13 | **
24 | *
45 | ***
2345 | *

So the thing to do here is to put the appropriate number of spaces before the pipe character (|) so that everything in the histogram lines up correctly. I did this with the length function, which returns the number of characters in a scalar value, and that x operator again.

We start by finding out the maximum amount of space we'll need to allow for. I got that number from the length of the largest value in the data set (because the data set is sorted, the largest value is the last value), and I added 1 to it to include a space at the end:

$maxspace = (length $nums[$#nums]) + 1;

Then, inside the loop, we can add some print statements: The first one prints just the key. The second one will pad out the smaller numbers to the largest number's width by adding an appropriate number of spaces. The third one prints the pipe and the stars for the histogram:

foreach $key (sort { $a <=> $b }  keys %freq) {
  print $key;                             # print the key
  print ' ' x ($maxspace - length $key);  # pad to largest width
  print '| ', '*' x $freq{$key} , "\n";   # print the stars
}

This last version of the histogram is the version I ended up with in Listing 5.1.

Note

The way I did the formatting here is kind of a hack, and I don't recommend this method for anything more substantial than the few characters we're dealing with in this example. Perl has a set of procedures specifically for formatting data on ASCII screens (remember, it's the Practical Extraction and Report Language). In this age of HTML and Web-based reports, Perl ASCII formatting isn't as commonly used, but you can get a taste for it from the perlform man page.
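
Short of the full report formats described in perlform, one lighter alternative to padding by hand is to let sprintf do it. This is a sketch under assumptions, not the approach used in Listing 5.1: the `%freq` contents are made up, and `$width` and `@lines` are names introduced here for illustration.

```perl
use strict;
use warnings;

my %freq = (2 => 4, 13 => 2, 2345 => 1);   # hypothetical frequencies

# Find the width of the widest key.
my $width = 0;
foreach my $key (keys %freq) {
    $width = length($key) if length($key) > $width;
}

my @lines;
foreach my $key (sort { $a <=> $b } keys %freq) {
    # %-*s left-justifies the key in a field $width characters wide
    push @lines, sprintf("%-*s | %s", $width, $key, '*' x $freq{$key});
}
print "$_\n" for @lines;
```

The `*` in the format takes the field width from the argument list, so the same format string works whatever the longest key turns out to be.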

Extracting Data into Arrays or Hashes Using the `split` Function

When you read input from the keyboard, often that data is in a convenient form so that you can just test it a little, assign it to a variable and then do whatever else you want to with it. But a lot of the input you'll deal with—particularly from files—is not often in a form that's so easy to process. What if the input you're getting has ten numbers per line? What if it came from Excel or a database and it's in comma-separated text? What if there's one part in the middle of the line you're interested in, but you don't care about the rest of it?

Often, the input you get to a Perl script will be in some sort of raw form, and then it's your job to extract and save the things you're interested in. Fortunately, Perl makes this very easy. One way to extract data out of a string is to split that string into multiple elements, and save those elements as an array or a hash. Then you can manipulate the elements in the array or hash individually. A built-in function, called split, does just this.

Let's take the simplest and most common example: your input data is a single string of elements, separated by spaces:

$stringofnums = '34 23 56 34 78 38 90';

To split this string into an array of elements, you would use split with two arguments: a string of one space, and the string you want to split. The split function will return a list, so usually you'll want to assign the list to something (like an array):

@nums = split(' ', $stringofnums);

The result of this statement is an array of seven elements, split from the original string:

(34, 23, 56, 34, 78, 38, 90).

Or you could assign it to a set of variables:

($x, $y, $z, undef, @remainder) = split(' ', $stringofnums);

In this case, the first three numbers in the string get assigned to the first three variables, the fourth (34) gets thrown away (undef), and the last three are stored in @remainder.

This form of split, with a single-space argument, is actually a special case. The single space tells split to split the string on any whitespace characters, including spaces or tabs, to skip over multiple whitespace characters, and to ignore any leading or trailing whitespace as well. It does it all for you, automatically. In fact, this form of split is so common that you can leave out both arguments altogether: called with no arguments, split splits the contents of $_ on whitespace:

$_ = $stringofnums;
@nums = split;

Note that you can't leave out only the space argument. In split $stringofnums, the string would be treated as the pattern, and $_ as the thing to split. I will use both arguments in all my examples to remind you what split is splitting on.
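
The special treatment of the single-space string is easiest to see next to a pattern that matches one literal space. The following sketch uses a made-up `$messy` string with leading, trailing, and repeated whitespace; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $messy = "  34 \t 23   56 ";   # hypothetical input with extra whitespace

my @special = split(' ', $messy);   # special case: runs of whitespace, leading ones skipped
my @literal = split(/ /, $messy);   # pattern: splits at every single space character

print scalar(@special), " fields using ' '\n";
print scalar(@literal), " fields using / /\n";
print "Special-case result: @special\n";   # prints "Special-case result: 34 23 56"
```

The pattern version returns more fields, including empty strings and one holding just the tab, because each individual space counts as a separator.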

If you want to split a string on anything other than whitespace, for example if your data is separated by commas, by pipe characters (|), or by anything else, you must use a pattern. These are the same regular expression patterns you have seen before. Here's an example that splits on commas:

$commasep = "45,32,56,123,hike!";
@stuff = split(/,/, $commasep);
