5.2 Using Files for Large Data Sets

All of the examples in the Introducing the Python Collections chapter used small data sets so that we could concentrate on developing the statistical methods needed to analyze the data. We assumed that our data items were stored in lists, and we constructed our functions to process those lists accordingly. Now that we have those functions in place, we can turn our attention to using statistical tools to describe larger data sets.

We have already seen that Python provides some powerful collections in which to store and manipulate data. However, as the amount of data gets larger, it becomes more difficult to store the data in the collections for later processing. It would certainly be possible to have the user enter the data interactively, but this would require a substantial amount of effort. Instead, large data sets are usually found in data files that are prepared ahead of time. We can then read such data from the file and fill our collections for later processing.

5.2.1 Text Files

One popular format is a text file—that is, a file filled with characters. For example, the Python programs that we write are stored as text files. We can create text files in a number of ways. For example, we can use a text editor to type in and save the data. We can also download the data from a website and then save it in a file. Regardless of how the file is created, to read the file, we must know the file’s format—that is, how the data is stored in the file. For example, the data items could be organized as one item per line, or as a series of data items separated by spaces, or in some other format. Once we know the format of the file, we can use Python methods to read and manipulate the data in the file.

In Python, we must open files before we can use them and close files when we are finished with them. As you might expect, once a file has been opened, it becomes a Python object just like all other data. TABLE 5.1 shows the functions that can be used to open and close files.

TABLE 5.1 Opening and Closing Files in Python

Method Name | Use | Explanation
open | open(filename, mode) | Built-in function. Opens a file called filename and returns a reference to a file object. The mode can be 'r' (open the file for reading; if the file does not exist, raises an OSError), 'w' (open the file for writing; if the file exists, it will be overwritten, and if it does not exist, it will be created), or 'a' (open the file for writing; if the file exists, the new data will be appended to the end, and if it does not exist, it will be created).
close | fileVariable.close() | Signals that file use is complete and releases any memory or other resources associated with the file.

As an example, suppose we have a text file named rainfall.txt that contains the data shown in FIGURE 5.1, which represents the total annual rainfall (in inches) for 25 towns in Iowa. The first item on each line is the rain gauge location, usually a town name, and the second is the rainfall amount. The two values are separated by a space.

A figure shows the contents of the file rainfall.txt.

FIGURE 5.1 The contents of the file rainfall.txt.

Although it would be possible to consider entering this data by hand each time it is used, you can imagine that such a task would be time-consuming and error-prone. In addition, it is likely that there could be data from many more towns.

To open this file, we call the open function. The variable fileRef then holds a reference to the file object returned by open. When we are finished with the file, we can close it by using the close method. After the file is closed, any further attempts to use fileRef will result in an error. These two operations are shown in SESSION 5.1 as option 1.
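Option 1 can be sketched as follows. This is a minimal illustration rather than the original session: it first writes a two-line stand-in for rainfall.txt into a temporary folder so that it can run on its own, and the town names and amounts are hypothetical, not the data of Figure 5.1. The name setupFile and the use of readline are also assumptions made for the example.

```python
import os
import tempfile

# Create a small stand-in for rainfall.txt in a temporary folder.
# The two data lines are hypothetical sample values.
folder = tempfile.mkdtemp()
fileName = os.path.join(folder, "rainfall.txt")
setupFile = open(fileName, "w")
setupFile.write("TownA 25.5\nTownB 30.25\n")
setupFile.close()

# Option 1: open the file, use it, and close it explicitly.
fileRef = open(fileName, "r")
firstLine = fileRef.readline()
fileRef.close()

print(fileRef.closed)    # True once close has been called

# Reading from the closed file raises a ValueError.
try:
    fileRef.readline()
    errorRaised = False
except ValueError:
    errorRaised = True
print(errorRaised)       # True
```

The variable fileRef still exists after the call to close, but the file object it refers to can no longer be read.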

Image

SESSION 5.1 Opening and closing a file

If we do not close the file after finishing processing its data, Python will eventually close it. The benefit of formally closing the file is that we release any resources, such as memory, that have been assigned to the file.

Another option for opening a file is to call the open function in a with statement, using the following syntax, which is also shown as option 2 in Session 5.1:

Image

The file object, fileRef, is returned from the open function and becomes our reference to the file. This syntax has two significant differences: The data processing is performed in a block, and the file is automatically closed when the block finishes. This approach is the preferred method for opening a file, so we will use it when reading and writing files.
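A minimal sketch of this second option is given here, assuming the same hypothetical two-line stand-in for rainfall.txt written to a temporary folder. The variable openInside is introduced only to record that the file was open while the block was running.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
fileName = os.path.join(folder, "rainfall.txt")

# Write a hypothetical two-line stand-in for rainfall.txt.
with open(fileName, "w") as outFile:
    outFile.write("TownA 25.5\nTownB 30.25\n")

# Option 2: the file is open only inside the with block.
with open(fileName, "r") as fileRef:
    contents = fileRef.read()
    openInside = not fileRef.closed

print(openInside)        # True: the file was open inside the block
print(fileRef.closed)    # True: the file was closed when the block ended
```

No call to close appears anywhere in the sketch; leaving the indented block is what closes the file.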

5.2.2 Iterating over Lines in a File

We now use this file as input in a program that will do some data processing. In the program, we will read each line of the file and print it with some additional text. Because text files consist of sequences of lines of text, we can use the for loop to iterate through each line in the file.

A line in a file is defined as a sequence of characters up to and including a special character called the newline character. If you evaluate a string that contains a newline character, you will see the character represented as \n. If you print a string that contains a newline character, you will not see the \n; instead, you will see only its effects. When you are typing a Python program and you press the enter or return key on your keyboard, the editor inserts a newline character into your text at that point.
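The difference between evaluating and printing a string that contains a newline can be seen in a short sketch. The sample line is hypothetical, and repr is used here to obtain the same representation that the interactive shell displays when a string is evaluated.

```python
aLine = "TownA 25.5\n"    # hypothetical line of a data file

print(repr(aLine))        # shows the newline as \n: 'TownA 25.5\n'
print(aLine, end="")      # shows only the effect: the output moves to a new line
print(len(aLine))         # 11: the newline counts as a single character
```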

As the for loop iterates through each line of the file, the loop variable will contain the current line of the file as a string of characters. The general pattern for processing each line of a text file is as follows:

Image

To process our rainfall data, we will use a for loop to iterate over the lines of the file. Because the city and the rainfall amount on each line are separated by a space, we can use the split method to break each line into a list containing the city name and the rainfall amount. We can then take these values and construct a simple sentence, as shown in SESSION 5.2.

Image

SESSION 5.2 Simple program to read rainfall data from a file
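Because the session itself is not reproduced here, the following sketch shows one way such a program might look. It is not the original code: the input file is a hypothetical two-line stand-in written to a temporary folder, and the list named sentences is added only so that the results can be inspected afterward.

```python
import os
import tempfile

# Hypothetical stand-in for rainfall.txt: one "town amount" pair per line.
folder = tempfile.mkdtemp()
fileName = os.path.join(folder, "rainfall.txt")
with open(fileName, "w") as setupFile:
    setupFile.write("TownA 25.5\nTownB 30.25\n")

sentences = []
with open(fileName, "r") as rainFile:
    for aLine in rainFile:
        # split breaks the line at whitespace, which also drops the newline.
        values = aLine.split()
        sentence = values[0] + " had " + values[1] + " inches of rain."
        sentences.append(sentence)
        print(sentence)
```

With the sample data, the loop body runs twice, once for each line of the file.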

5.2.3 Writing a File

Let’s think about another example of file processing: converting our file from data reporting rainfall in inches to data reporting rainfall in centimeters. To do this, we will read the file contents as before, but instead of printing a message, we will do some computation and then write the results to another file. The new file will need to be opened for writing. To write a line to a file, we use the write method, shown in TABLE 5.2.

TABLE 5.2 The write Method

Method Name | Use | Explanation
write | fileVar.write(string) | Writes a string to a file that is open for writing, using the file object returned from the open function. Returns the number of characters written.

LISTING 5.1 shows the Python program we will use to perform the inches-to-centimeters conversion. The new file will be called rainfallInCM.txt. A for loop is used to iterate through the input file. Each line of the file is split, and the rainfall value in inches is converted to the equivalent value in centimeters (lines 7–8).

Image

LISTING 5.1 Writing data into a new file

The write statement on lines 10–11 does all the work to create a new line in the output file. Note that it can add only a single string to the file each time it is used. For this reason, we need to use string concatenation to build up the line piece by piece. The first piece, values[0], is the city code. We then add a space to separate the city code from the rainfall amount. Next, the floating-point value called cm is converted to a string using the str function. Finally, the entire line is completed by adding a newline character. The write method returns the number of characters that were written in the line, but because we do not process that number, we simply assign the return value to a variable nChars, which we ignore. FIGURE 5.2 shows the contents of the newly created file.
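The conversion can be sketched as follows. This is an approximation of the approach described above, not the original listing, so the line numbers cited in the text do not apply to it. The input data is hypothetical, both files are placed in a temporary folder, and the final read of the output file is added only to check the result.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
inName = os.path.join(folder, "rainfall.txt")
outName = os.path.join(folder, "rainfallInCM.txt")

# Hypothetical input data: one "town inches" pair per line.
with open(inName, "w") as setupFile:
    setupFile.write("TownA 25.5\nTownB 30.25\n")

with open(inName, "r") as rainFile:
    with open(outName, "w") as outFile:
        for aLine in rainFile:
            values = aLine.split()
            inches = float(values[1])
            cm = 2.54 * inches    # 1 inch is 2.54 centimeters
            # write accepts a single string, so the line is concatenated.
            nChars = outFile.write(values[0] + " " + str(cm) + "\n")

# Read the new file back to inspect what was written.
with open(outName, "r") as checkFile:
    converted = checkFile.read()
print(converted)
```

Because str converts the floating-point result without any control over the number of digits, the values written to the file may show more decimal places than the input had.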

A figure shows contents of the new text file, rainfallInCM.txt.

FIGURE 5.2 The contents of the new text file, rainfallInCM.txt.

5.2.4 String Formatting

As you can see on lines 10–11 of Listing 5.1, converting values to strings and concatenating these strings together can be a tedious process. Fortunately, Python provides us with a better alternative: formatted strings. A formatted string is a template in which words or spaces that remain constant are combined with placeholders for the values to be inserted into the string. For example, the statement

Image

outputs the words "had" and "inches of rain." every time, but the city name and the amount are different for each line printed.

Using a formatted string, we can write the previous statement as

Image

A formatted string uses the string format method shown in TABLE 5.3. The curly braces ({}) serve as placeholders for values to be inserted into the string. The characters within curly braces specify the values and associated formats to substitute for the placeholders. The parameters of the format method specify the actual values to be substituted.
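The two statements being compared might look like the following sketch. The list named values stands for the result of splitting one line of the data file and holds hypothetical data; the placeholders {0} and {1:2.2f} are the ones discussed in the text, while the rest of the statement is an assumption.

```python
values = ["TownA", "25.5"]    # hypothetical result of aLine.split()

# Without a formatted string: the pieces are passed to print one by one.
print(values[0], "had", values[1], "inches of rain.")

# With a formatted string: a template with two placeholders.
template = "{0} had {1:2.2f} inches of rain."
sentence = template.format(values[0], float(values[1]))
print(sentence)    # TownA had 25.50 inches of rain.
```

The rainfall amount is converted with float before formatting, because the f type applies to numbers and not to strings.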

TABLE 5.3 The String format Method

Method Use Explanation
format(replacementField, …) Substitute replacement fields for placeholders in the string using the formats specified within the placeholder.

Note that the number of placeholders in the string corresponds to the number of replacement fields sent as parameters. By default, the replacement fields are substituted for the placeholders in the same order as the replacement field parameters are given to the format method. To avoid confusion, it is recommended that you insert the number of the replacement field in each placeholder, as shown by {0} and {1} in the previous example. The replacement fields can also be supplied as a tuple or a dictionary. A tuple name should be preceded by *, so that its items fill the placeholders in order. A dictionary name should be preceded by **, and its keys are then used in the placeholders to identify the values to be substituted.
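These ways of supplying replacement fields can be compared in a brief sketch. All of the names and values used here (pair, record, town, amount) are hypothetical and chosen only for illustration.

```python
# Automatic numbering: values fill the placeholders from left to right.
auto = "{} had {} inches of rain.".format("TownA", 25.5)

# Explicit numbering lets the placeholders appear in any order.
numbered = "{1} inches of rain fell on {0}.".format("TownA", 25.5)

# A tuple is unpacked with *, and a dictionary with **.
pair = ("TownA", 25.5)
fromTuple = "{0} had {1} inches of rain.".format(*pair)

record = {"town": "TownA", "amount": 25.5}
fromDict = "{town} had {amount} inches of rain.".format(**record)

for text in (auto, numbered, fromTuple, fromDict):
    print(text)
```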

Following the replacement field number (if any), the placeholder can contain a colon followed by a conversion specification. The conversion specification consists of an optional alignment, width, precision, and type. TABLE 5.4 summarizes some common type specifications. In the first placeholder in the preceding example, no conversion specification is given, because string is the default type. The second placeholder, {1:2.2f}, specifies that the second parameter should be formatted as a floating-point number with a minimum width of 2 characters and 2 places to the right of the decimal point.

TABLE 5.4 Common Type Specifications

Type of Replacement Field | Type Conversion Character | Output Format
String | s | String. This is the default format for strings.
Integer | c | Character. Converts an integer to its Unicode equivalent.
Integer | d | Decimal integer. This is the default for integers.
Integer | n | Number. Same as d, except that it uses the current locale setting for number separators.
Floating-point | e or E | Scientific notation with a default precision of 6, as m.ddddde+/-xx or m.dddddE+/-xx.
Floating-point | f or F | Fixed-point with a default precision of 6, as m.dddddd.
Floating-point | g | General format. Uses scientific or fixed-point depending on the magnitude of the number.
Floating-point | n | Same as g, except that it uses the current locale setting for number separators.
Floating-point | % | Percentage. Multiplies the number by 100, then displays as f format and appends a percent sign.

To these format characters, we can add other modifier characters to specify numeric precision and output width and alignment. TABLE 5.5 lists commonly used modifier characters.

TABLE 5.5 Common Formatting Modifiers

Modifier Type | Modifier Character | Output Format
Alignment | < | Left-align value in its space. This is the default for strings.
Alignment | > | Right-align value in its space. This is the default for numbers.
Alignment | ^ | Center the value within its space.
Width | w | Value will use the space of w characters. The default is to use the minimum space required to display the value.
Precision | .n | Display n digits after the decimal point.

SESSION 5.3 shows some examples of formatted strings in use.

Image

SESSION 5.3 Demonstrating string formatting
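The session is not reproduced here, so the following sketch offers a few examples of our own that combine the type characters and modifiers from the two preceding tables. The values being formatted are arbitrary, and the square brackets are included only to make the width of each field visible.

```python
examples = [
    "{0:d}".format(42),              # decimal integer: 42
    "{0:c}".format(65),              # character with code 65: A
    "{0:e}".format(1234.5),          # scientific notation: 1.234500e+03
    "{0:f}".format(3.14159),         # fixed-point, 6 places: 3.141590
    "{0:.2f}".format(3.14159),       # precision of 2: 3.14
    "{0:.1%}".format(0.256),         # percentage: 25.6%
    "[{0:<8}]".format("Iowa"),       # left-aligned:  [Iowa    ]
    "[{0:>8}]".format("Iowa"),       # right-aligned: [    Iowa]
    "[{0:^8}]".format("Iowa"),       # centered:      [  Iowa  ]
    "[{0:8.2f}]".format(3.14159),    # width 8, precision 2: [    3.14]
]

for text in examples:
    print(text)
```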

5.2.5 Alternative File-Reading Methods

In addition to the for loop, Python provides three methods to read data from the input file. The readline method reads a specified number of characters or up to one line from the file and returns it as a string. The string returned by readline will contain the newline character at the end. This method returns the empty string when it reaches the end of the file.

The readlines method returns the contents of the entire file as a list of strings, where each item in the list represents one line of the file; the newline character appears at the end of each item in the list.

It is also possible to read the entire file into a single string with read. TABLE 5.6 summarizes these methods, and SESSION 5.4 shows them in action.

TABLE 5.6 Methods for Reading Files in Python

Method Name | Use | Explanation
read(n) | filevar.read() | Read and return a string of n characters, or the entire file as a single string if n is not provided.
readline(n) | filevar.readline() | Return the next line of the file with all text up to and including the newline character. If n is provided as a parameter, then only n characters will be returned if the line is longer than n. Returns the empty string when the end of the file is reached.
readlines(n) | filevar.readlines() | Return a list of strings, each representing a single line of the file. If n is not provided, then all remaining lines of the file are returned. If n is provided, it is treated as a size hint: no more lines are read once the total size of the lines read so far exceeds n.

Image

SESSION 5.4 Using more read methods

Note that we need to reopen the file before each read so that we start from the beginning of the file. Each file has a marker that denotes the current read position in the file. Whenever one of the read methods is called, this marker moves to the character immediately following the last character returned. In the case of readline, the marker moves to the first character of the next line in the file. In the case of read or readlines, the marker moves to the end of the file or to the end of the data read.
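The movement of the read position can be traced with a short sketch. As before, the input is a hypothetical two-line stand-in for rainfall.txt written to a temporary folder, and the variable names are our own.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
fileName = os.path.join(folder, "rainfall.txt")
with open(fileName, "w") as setupFile:
    setupFile.write("TownA 25.5\nTownB 30.25\n")

with open(fileName, "r") as rainFile:
    firstLine = rainFile.readline()    # 'TownA 25.5\n'
    restOfFile = rainFile.read()       # continues after the first line
    atEnd = rainFile.readline()        # '' because the marker is at the end

# Reopening the file moves the marker back to the beginning.
with open(fileName, "r") as rainFile:
    allLines = rainFile.readlines()    # ['TownA 25.5\n', 'TownB 30.25\n']

with open(fileName, "r") as rainFile:
    firstFive = rainFile.read(5)       # 'TownA'
    remainder = rainFile.readline()    # ' 25.5\n', the rest of the same line

print(repr(firstLine), repr(restOfFile), repr(atEnd))
print(allLines)
print(repr(firstFive), repr(remainder))
```

In the first block, read returns only the second line, because readline has already moved the marker past the first one.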
