Writing Algorithms That Use the File-Reading Techniques

There are several common ways to organize information in files. The rest of this chapter will show how to apply the various file-reading techniques to these situations and how to develop some algorithms to help with this.

Skipping the Header

Many data files begin with a header. As described in The Readline Technique, TSDL files begin with a one-line description followed by comments in lines beginning with a #, and the Readline technique can be used to skip that header. The technique ends when we read the first real piece of data, which will be the first line after the description that doesn’t start with a #.

In English, we might try this algorithm to process this kind of file:

 Skip the first line in the file
 Skip over the comment lines in the file
 For each of the remaining lines in the file:
  Process the data on that line

The problem with this approach is that we can’t tell whether a line is a comment line until we’ve read it, but we can read a line from a file only once—there’s no simple way to “back up” in the file. An alternative approach is to read the line, skip it if it’s a comment, and process it if it’s not. Once we’ve processed the first line of data, we process the remaining lines:

 Skip the first line in the file
 Find and process the first line of data in the file
 For each of the remaining lines:
  Process the data on that line

The thing to notice about this algorithm is that it processes lines in two places: once when it finds the first “interesting” line in the file and once when it handles all of the following lines:

from typing import TextIO
from io import StringIO

def skip_header(reader: TextIO) -> str:
    """Skip the header in reader and return the first real piece of data.

    >>> infile = StringIO('Example\n# Comment\n# Comment\nData line\n')
    >>> skip_header(infile)
    'Data line\n'
    """

    # Read the description line
    line = reader.readline()

    # Find the first non-comment line
    line = reader.readline()
    while line.startswith('#'):
        line = reader.readline()

    # Now line contains the first real piece of data
    return line

def process_file(reader: TextIO) -> None:
    """Read and print the data from reader, which must start with a single
    description line, then a sequence of lines beginning with '#', then a
    sequence of data.

    >>> infile = StringIO('Example\n# Comment\nLine 1\nLine 2\n')
    >>> process_file(infile)
    Line 1
    Line 2
    """

    # Find and print the first piece of data
    line = skip_header(reader).strip()
    print(line)

    # Read the rest of the data
    for line in reader:
        line = line.strip()
        print(line)

if __name__ == '__main__':
    with open('hopedale.txt', 'r') as input_file:
        process_file(input_file)

In skip_header, we return the first line of read data, because once we’ve found it, we can’t read it again (we can go forward but not backward). We’ll want to use skip_header in all of the file-processing functions in this section. Rather than copying the code each time we want to use it, we can put the function in a file called time_series.py (for Time Series Data Library) and use it in other programs using import time_series, as shown in the next example. This allows us to reuse the skip_header code, and if it needs to be modified, then there is only one copy of the function to edit.

This program processes the Hopedale data set to find the smallest number of fox pelts produced in any year. As we progress through the file, we keep the smallest value seen so far in a variable called smallest. That variable is initially set to the value on the first line, since it’s the smallest (and only) value seen so far:

from typing import TextIO
from io import StringIO
import time_series

def smallest_value(reader: TextIO) -> int:
    """Read and process reader and return the smallest value after the
    time_series header.

    >>> infile = StringIO('Example\n1\n2\n3\n')
    >>> smallest_value(infile)
    1
    >>> infile = StringIO('Example\n3\n1\n2\n')
    >>> smallest_value(infile)
    1
    """

    line = time_series.skip_header(reader).strip()

    # Now line contains the first data value; this is also the smallest value
    # found so far, because it is the only one we have seen.
    smallest = int(line)

    for line in reader:
        value = int(line.strip())

        # If we find a smaller value, remember it.
        if value < smallest:
            smallest = value

    return smallest

if __name__ == '__main__':
    with open('hopedale.txt', 'r') as input_file:
        print(smallest_value(input_file))

As with any algorithm, there are other ways to write this; for example, we can replace the if statement with this single line:

 smallest = min(smallest, value)
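Applying that idea throughout, the whole scan can be expressed with min. A minimal sketch (the helper name smallest_of is ours, not part of the book's code; it works on a list of already-parsed values rather than a file, but the scan is the same):

```python
def smallest_of(values: list) -> int:
    """Return the smallest item in a non-empty list of integers, using
    the same scan as smallest_value but with min instead of an if."""
    # The first value is the smallest (and only) one seen so far.
    smallest = values[0]
    for value in values[1:]:
        # min replaces the if-statement comparison.
        smallest = min(smallest, value)
    return smallest
```

Either form is fine; min makes the intent, keep the smaller of the two, explicit in a single line.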

Dealing with Missing Values in Data

We also have data for colored fox fur production in Hebron, Labrador:

 Coloured fox fur production, Hebron, Labrador, 1834-1839
 #Source: C. Elton (1942) "Voles, Mice and Lemmings", Oxford Univ. Press
 #Table 17, p.265--266
 #remark: missing value for 1836
  55
  262
  -
  102
  178
  227

The hyphen indicates that data for the year 1836 is missing. Unfortunately, calling read_smallest.smallest_value on the Hebron data produces this error:

 >>> import read_smallest
 >>> read_smallest.smallest_value(open('hebron.txt', 'r'))
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "./read_smallest.py", line 16, in smallest_value
     value = int(line.strip())
 ValueError: invalid literal for int() with base 10: '-'

The problem is that '-' isn't an integer, so calling int('-') fails. This isn't an isolated problem. In general, we will often need to skip blank lines, comments, or lines containing other "nonvalues" in our data. Real data sets often contain omissions or contradictions; dealing with them is just a fact of scientific life.
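For illustration, one way to handle all of these kinds of nonvalues at once is a small predicate. This sketch (the name is_data is ours, not part of the book's code) treats blank lines, comments, and hyphens alike as things to skip:

```python
def is_data(line: str) -> bool:
    """Return True if and only if line holds a real data value: it is
    not blank, not a comment, and not a missing-value marker."""
    line = line.strip()
    return line != '' and not line.startswith('#') and line != '-'
```

A processing loop can then guard its work with `if is_data(line):` and stay oblivious to which kinds of nonvalues a particular file contains.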

To fix our code, we must add a check inside the loop so that a line is processed only if it contains a real value. We will assume that the first value is never a hyphen: in the TSDL data sets, missing entries are always marked with hyphens, and if the first entry were missing, the time series would simply start at the next real value. So we just need to check for a hyphen before trying to convert the string we have read to an integer:

from typing import TextIO
from io import StringIO
import time_series

def smallest_value_skip(reader: TextIO) -> int:
    """Read and process reader, which must start with a time_series header.
    Return the smallest value after the header. Skip missing values, which
    are indicated with a hyphen.

    >>> infile = StringIO('Example\n1\n-\n3\n')
    >>> smallest_value_skip(infile)
    1
    """

    line = time_series.skip_header(reader).strip()
    # Now line contains the first data value; this is also the smallest value
    # found so far, because it is the only one we have seen.
    smallest = int(line)

    for line in reader:
        line = line.strip()
        if line != '-':
            value = int(line)
            smallest = min(smallest, value)

    return smallest

if __name__ == '__main__':
    with open('hebron.txt', 'r') as input_file:
        print(smallest_value_skip(input_file))

Notice that the update to smallest is nested inside the check for hyphens.

Processing Whitespace-Delimited Data

The file at http://robjhyndman.com/tsdldata/ecology1/lynx.dat (Time Series Data Library [Hyn06]) contains information about lynx pelts in the years 1821–1934. All data values are integers, each line contains many values, the values are separated by whitespace, and for reasons best known to the file’s author, each value ends with a period. (Note that author M. J. Campbell’s name below is misspelled in the original file.)

 Annual Number of Lynx Trapped, MacKenzie River, 1821-1934
 #Original Source: Elton, C. and Nicholson, M. (1942)
 #"The ten year cycle in numbers of Canadian lynx",
 #J. Animal Ecology, Vol. 11, 215--244.
 #This is the famous data set which has been listed before in
 #various publications:
 #Cambell, M.J. and Walker, A.M. (1977) "A survey of statistical work on
 #the MacKenzie River series of annual Canadian lynx trappings for the years
 #1821-1934 with a new analysis", J.Roy.Statistical Soc. A 140, 432--436.
  269. 321. 585. 871. 1475. 2821. 3928. 5943. 4950. 2577. 523. 98.
  184. 279. 409. 2285. 2685. 3409. 1824. 409. 151. 45. 68. 213.
  546. 1033. 2129. 2536. 957. 361. 377. 225. 360. 731. 1638. 2725.
  2871. 2119. 684. 299. 236. 245. 552. 1623. 3311. 6721. 4245. 687.
  255. 473. 358. 784. 1594. 1676. 2251. 1426. 756. 299. 201. 229.
  469. 736. 2042. 2811. 4431. 2511. 389. 73. 39. 49. 59. 188.
  377. 1292. 4031. 3495. 587. 105. 153. 387. 758. 1307. 3465. 6991.
  6313. 3794. 1836. 345. 382. 808. 1388. 2713. 3800. 3091. 2985. 3790.
  674. 81. 80. 108. 229. 399. 1132. 2432. 3574. 2935. 1537. 529.
  485. 662. 1000. 1590. 2657. 3396.

Now we’ll develop a program to find the largest value. To process the file, we will break each line into pieces and strip off the periods. Our algorithm is the same as it was for the fox pelt data: find and process the first line of data in the file, and then process each of the subsequent lines. However, the notion of “processing a line” needs to be examined further because there are many values per line. Our refined algorithm, shown next, uses nested loops to handle the notion of “for each line and for each value on that line”:

 Find the first line containing real data after the header
 For each piece of data in the current line:
  Process that piece
 
 For each of the remaining lines of data:
  For each piece of data in the current line:
   Process that piece

Once again we are processing lines in two different places. That is a strong hint that we should write a helper function to avoid duplicate code. Rewriting our algorithm and making it specific to the problem of finding the largest value makes this clearer:

 Find the first line of real data after the header
 Find the largest value in that line
 
 For each of the remaining lines of data:
  Find the largest value in that line
  If that value is larger than the previous largest, remember it

The helper function required is one that finds the largest value in a line, and it must split up the line. String method split will split around the whitespace, but we still have to remove the periods at the ends of the values.
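To see exactly what split and slicing do to one of these lines, here is a short sketch using a line from the lynx data:

```python
line = '  269. 321. 585.\n'

# split with no argument splits on any run of whitespace and discards
# leading and trailing whitespace, including the newline.
pieces = line.split()   # ['269.', '321.', '585.']

# Slicing with [:-1] drops the last character, the trailing period.
values = [int(piece[:-1]) for piece in pieces]   # [269, 321, 585]
```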

We can also simplify our code by initializing largest to -1, since that value is guaranteed to be smaller than any of the (positive) values in the file. That way, no matter what the first real value is, it’ll be larger than the “previous” value (our -1) and replace it.

from typing import TextIO
from io import StringIO
import time_series

def find_largest(line: str) -> int:
    """Return the largest value in line, which is a whitespace-delimited string
    of integers that each end with a '.'.

    >>> find_largest('1. 3. 2. 5. 2.')
    5
    """
    # The largest value seen so far.
    largest = -1
    for value in line.split():
        # Remove the trailing period.
        v = int(value[:-1])
        # If we find a larger value, remember it.
        if v > largest:
            largest = v

    return largest
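Since -1 is only a placeholder, the same function can be written without it; a compact alternative (our variation, not the book's version) uses the built-in max over a generator expression:

```python
def find_largest_alt(line: str) -> int:
    """Return the largest value in line, a whitespace-delimited string of
    integers that each end with a '.'. Compact variant of find_largest."""
    return max(int(value[:-1]) for value in line.split())
```

One behavioral difference: on an empty line this version raises ValueError, whereas the -1 version returns -1.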

We now face the same choice as with skip_header: we can put find_largest in a module (possibly time_series), or we can include it in the same file as the rest of the code. We choose the latter this time because the code is specific to this particular data set and problem:

from typing import TextIO
from io import StringIO
import time_series

def find_largest(line: str) -> int:
    """Return the largest value in line, which is a whitespace-delimited string
    of integers that each end with a '.'.

    >>> find_largest('1. 3. 2. 5. 2.')
    5
    """
    # The largest value seen so far.
    largest = -1
    for value in line.split():
        # Remove the trailing period.
        v = int(value[:-1])
        # If we find a larger value, remember it.
        if v > largest:
            largest = v

    return largest

def process_file(reader: TextIO) -> int:
    """Read and process reader, which must start with a time_series header.
    Return the largest value after the header. There may be multiple pieces
    of data on each line.

    >>> infile = StringIO('Example\n 20. 3.\n 100. 17. 15.\n')
    >>> process_file(infile)
    100
    """

    line = time_series.skip_header(reader).strip()
    # The largest value so far is the largest on this first line of data.
    largest = find_largest(line)

    # Check the rest of the lines for larger values.
    for line in reader:
        large = find_largest(line)
        if large > largest:
            largest = large
    return largest

if __name__ == '__main__':
    with open('lynx.txt', 'r') as input_file:
        print(process_file(input_file))

Notice how simple the code in process_file looks! This happened only because we decided to write helper functions. To show you how much clearer this is, here is the same code without using time_series.skip_header and find_largest as helper functions:

from typing import TextIO
from io import StringIO

def process_file(reader: TextIO) -> int:
    """Read and process reader, which must start with a time_series header.
    Return the largest value after the header. There may be multiple pieces
    of data on each line.

    >>> infile = StringIO('Example\n 20. 3.\n')
    >>> process_file(infile)
    20
    >>> infile = StringIO('Example\n 20. 3.\n 100. 17. 15.\n')
    >>> process_file(infile)
    100
    """

    # Read the description line
    line = reader.readline()

    # Find the first non-comment line
    line = reader.readline()
    while line.startswith('#'):
        line = reader.readline()

    # Now line contains the first real piece of data

    # The largest value seen so far in the current line
    largest = -1

    for value in line.split():
        # Remove the trailing period
        v = int(value[:-1])
        # If we find a larger value, remember it
        if v > largest:
            largest = v

    # Check the rest of the lines for larger values
    for line in reader:
        # The largest value seen so far in the current line
        largest_in_line = -1

        for value in line.split():
            # Remove the trailing period
            v = int(value[:-1])
            # If we find a larger value, remember it
            if v > largest_in_line:
                largest_in_line = v

        if largest_in_line > largest:
            largest = largest_in_line
    return largest

if __name__ == '__main__':
    with open('lynx.txt', 'r') as input_file:
        print(process_file(input_file))