5 Files

Files are an indispensable part of the world of computers, and thus of programming. We read data from files, and write to files. Even when something isn’t really a file--such as a network connection--we try to use an interface similar to files because they’re so familiar.

To normal, everyday users, there are different types of files--Word, Excel, PowerPoint, and PDF, among others. To programmers, things are both simpler and more complicated. They’re simpler in that we see files as data structures to which we can write strings, and from which we can read strings. But files are also more complicated, in that when we read the string into memory, we might need to parse it into a data structure.

Working with files is one of the easiest and most straightforward things you can do in Python. It’s also one of the most common things that we need to do, since programs that don’t interact with the filesystem are rather boring.

In this chapter, we’ll practice working with files--reading from them, writing to them, and manipulating the data that they contain. Along the way, you’ll get used to some of the paradigms that are commonly used when working with Python files, such as iterating over a file’s contents and writing to files in a with block.

In some cases, we’ll work with data formatted as CSV (comma-separated values) or JSON (JavaScript object notation), two common formats that modules in Python’s standard library handle. If you’ve forgotten the basics of CSV or JSON, I have some short reminders in this chapter.

After this chapter, you’ll not only be more comfortable working with files, you’ll also better understand how you can translate from in-memory data structures (e.g., lists and dicts) to on-disk data formats (e.g., CSV and JSON) and back. In this way, files make it possible for you to keep data structures intact--even when the program isn’t running or when the computer is shut down--or even to transfer such data structures to other computers.

Table 5.1 What you need to know

| Concept | What is it? | Example | To learn more |
| --- | --- | --- | --- |
| Files | Overview of working with files in Python | f = open('/etc/passwd') | http://mng.bz/D22R |
| with | Puts an object in a context manager; in the case of a file, ensures it’s flushed and closed by the end of the block | with open('file.txt') as f: | http://mng.bz/6QJy |
| Context manager | Makes your own objects work in with statements | with MyObject() as m: | http://mng.bz/B221 |
| set.update | Adds elements to a set | s.update([10, 20, 30]) | http://mng.bz/MdOn |
| os.stat | Retrieves information (size, permissions, type) about a file | os.stat('file.txt') | http://mng.bz/dyyo |
| os.listdir | Returns a list of files in a directory | os.listdir('/etc/') | http://mng.bz/YreB |
| glob.glob | Returns a list of files matching a pattern | glob.glob('/etc/*.conf') | http://mng.bz/044N |
| Dict comprehension | Creates a dict based on an iterator | {word: len(word) for word in 'ab cde'.split()} | http://mng.bz/Vggy |
| str.split | Breaks strings apart, returning a list | 'ab cd ef'.split() returns ['ab', 'cd', 'ef'] | http://mng.bz/aR4z |
| hashlib | Module with cryptographic functions | import hashlib | http://mng.bz/NK2x |
| csv | Module for working with CSV files | x = csv.reader(f) | http://mng.bz/xWWd |
| json | Module for working with JSON | json.loads(json_string) | http://mng.bz/AAAo |

Exercise 18 Final line

It’s very common for new Python programmers to learn how they can iterate over the lines of a file, printing one line at a time. But what if I’m not interested in each line, or even in most of the lines? What if I’m only interested in a single line--the final line of the file?

Now, retrieving the final line of a file might not seem like a super useful action. But consider the Unix head and tail utilities, which show the first and last lines of a file, respectively--and which I use all the time to examine files, particularly log files and configuration files. Moreover, knowing how to read specific parts of a file, as opposed to the entire thing, is a useful and practical skill to have.

In this exercise, write a function (get_final_line) that takes a filename as an argument. The function should return that file’s final line.

Working it out

The solution code uses a number of common Python idioms that I’ll explain here. And along the way, you’ll see how using these idioms leads not just to more readable code, but also to more efficient execution.

Depending on which arguments you use when calling it, the built-in open function can return a number of different objects, such as TextIOWrapper or BufferedReader. These objects all implement the same API for working with files and are thus described in the Python world as “file-like objects.” Using such an object allows us to paper over the many different types of filesystems out there and just think in terms of “a file.” Such an object also allows us to take advantage of whatever optimizations, such as buffering, the operating system might be using.

Here’s how open is usually invoked:

f = open(filename)

In this case, filename is a string representing a valid file name. When we invoke open with just one argument, it should be a filename. The second, optional, argument is a string that can include multiple characters, indicating whether we want to read from, write to, or append to the file (using r, w, or a), and whether the file should be read by character (the default) or by bytes (the b option, in which case we’ll use rb, wb, or ab). (See the sidebar about the b option and reading the file in byte, or binary, mode.) I could thus more fully write the previous line of code as

f = open(filename, 'r')

Because we read from files more often than we write to them, r is the default value for the second argument. It’s thus quite common for Python programs not to specify r when reading from a file.

As you can see here, we’ve put the resulting object into the variable f. And because file-like objects are all iterable, returning one line per iteration, it’s typical to then say this:

for current_line in f:
    print(current_line)

But if you’re just planning to iterate over f once, then why create it as a variable at all? We can avoid the variable definition and simply iterate over the file object that open returned:

for current_line in open(filename):
    print(current_line)

With each iteration over a file-like object, we get the next line from the file, up to and including the newline character. Thus, in this code, current_line is always going to be a string with a single \n character at the end of it. A blank line in a file will contain just the newline character.

In theory, files should end with a \n, such that you’ll never finish the file in the middle of a line. In practice, I’ve seen many files that don’t end with a \n. Keep this in mind whenever you’re printing out a file; assuming that a file will always end with a newline character can cause trouble.

What about closing the file? This code will work, printing the length of each line in a file. However, this sort of code is frowned upon in the Python world because it doesn’t explicitly close the file. Now, when it comes to reading from files, it’s not that big of a deal, especially if you’re only opening a small number of them at a time. But if you’re writing to files, or if you’re opening many files at once, you’ll want to close them--both to conserve resources and to ensure that everything you’ve written is actually flushed to disk.

The way to do that is with the with construct. I could rewrite the previous code as follows:

with open(filename) as f:
    for one_line in f:
        print(len(one_line))

Instead of opening the file and assigning the file object to f directly, we’ve opened it within the context of with, assigned it to f as part of the with statement, and then opened a block.

There’s more detail about this in the sidebar about with and “context managers,” but you should know that this is the standard Pythonic way to open a file--in no small part because it guarantees that the file has been closed by the end of the block.

Binary mode using b

What happens if you open a nontext file, such as a PDF or a JPEG, with open and then try to iterate over it, one line at a time?

First, you’ll likely get an error right away. That’s because Python expects the contents of a file to be valid UTF-8 formatted Unicode strings. Binary files, by definition, don’t use Unicode. When Python tries to read non-Unicode data as text, it’ll raise an exception, complaining that it can’t decode the bytes into a string.

To avoid that problem, you can and should open the file in binary or bytes mode, adding a b to r, w, or a in the second argument to open; for example

for current_line in open(filename, 'rb'):   
    print(current_line)                     

Opens the file in “r” (read) and “b” (binary) mode

The type of current_line here is bytes, similar to a string but without Unicode characters.

Now the fact that the file doesn’t contain Unicode text won’t cause any problems.

But wait. Remember that with each iteration, Python will return everything up to and including the next \n character. In a binary file, such a character won’t appear at the end of every line, because there are no lines to speak of. Without such a character, what you get back from each iteration will probably be nonsense.

The bottom line is that if you’re reading from a binary file, you shouldn’t forget to use the b flag. But when you do that, you’ll find that you don’t want to read the file per line anyway. Instead, you should be using the read method to retrieve a fixed number of bytes. When read returns 0 bytes, you’ll know that you’re at the end of the file; for example

with open(filename, 'rb') as f:     
    while True:
        one_chunk = f.read(1000)    
        if not one_chunk:
            break
        print(f'This chunk contains {len(one_chunk)} bytes')

Uses with, as a context manager, to open the file

Reads up to 1,000 bytes and returns them as a bytes object

In this particular exercise, you were asked to print the final line of a file. One way to do so might look like the following code:

for current_line in open(filename):
    pass
 
print(current_line)

This trick works because we iterate over the lines of the file and assign current_line in each iteration--but we don’t actually do anything in the body of the for loop. Rather, we use pass, which is a way of telling Python to do nothing. (Python requires that we have at least one line in an indented block, such as the body of a for loop.) The reason that we execute this loop is for its side effect--namely, the fact that the final value assigned to current_line remains in place after the loop exits.

However, looping over the rows of a file just to get the final one strikes me as a bit strange, even if it works. My preferred solution, shown in figure 5.1, is to iterate over each line of the file, getting the current line but immediately assigning it to final_line.

Figure 5.1 Immediately before printing the final line

When we exit from the loop, final_line will contain whatever was in the most recent line. We can thus print it out afterwards.

Normally, print adds a newline after printing something to the screen. However, when we iterate over a file, each line already ends with a newline character. This can lead to doubled whitespace between printed output. The solution is to stop print from adding its own newline by overriding the default value of the end parameter. By passing end='', we tell print to add '' (the empty string) after printing final_line. For further information about the arguments you can pass to print, take a look here: http://mng.bz/RAAZ.
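Here’s a minimal sketch of the difference; the string here is made up:

final_line = 'last line\n'    # already ends with a newline
print(final_line)             # prints final_line, plus a blank line
print(final_line, end='')     # prints final_line, with no extra newline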

Solution

def get_final_line(filename):
    final_line = ''
    for current_line in open(filename):    
        final_line = current_line
    return final_line
 
print(get_final_line('/etc/passwd'))

Iterates over each line of the file. You don’t need to declare a variable; just iterate directly over the result of open.

You can work through a version of this code in the Python Tutor at http://mng.bz/D24g.

Simulating files in Python Tutor

Philip Guo’s Python Tutor site (http://mng.bz/2XJX), which I use for diagrams and also to allow you to experiment with the book’s solutions, doesn’t support files. This is understandable--a free server system that lets people run arbitrary code is hard enough to create and support. Allowing people to work with arbitrary files would add plenty of logistical and security problems.

However, there is a solution: StringIO (http://mng.bz/PAOP). StringIO objects are what Python calls “file-like objects.” They implement the same API as file objects, allowing us to read from them and write to them just like files. Unlike files, though, StringIO objects never actually touch the filesystem.

StringIO wasn’t designed for use with the Python Tutor, although it’s a great workaround for the limitations there. More typically, I see (and use) StringIO in automated tests. After all, you don’t really want to have a test touch the filesystem; that would make things run much more slowly. Instead, you can use StringIO to simulate a file.

If you’re doing any software testing, you should take a serious look at StringIO, part of the Python standard library. You can load it with

from io import StringIO
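Here’s a minimal sketch of a StringIO object standing in for a file; the contents are made up:

from io import StringIO

fake_file = StringIO('first line\nsecond line\nfinal line\n')

for one_line in fake_file:    # iterates line by line, just like a file
    print(one_line, end='')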

When we’re looking at files, the versions of code that you’ll see in Python Tutor thus will be slightly different from the ones in the book itself. However, they should work the same way, allowing you to explore the code visually. Unfortunately, exercises that involve directory listings can’t be papered over as easily, and thus lack any Python Tutor link.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

Iterating over files, and understanding how to work with the content as (and after) you iterate over them, is an important skill to have when working with Python. It is also important to understand how to turn the contents of a file into a Python data structure--something we’ll look at several more times in this chapter. Here are a few ideas for things you can do when iterating through files in this way:

  • Iterate over the lines of a text file. Find all of the words (i.e., non-whitespace surrounded by whitespace) that contain only integers, and sum them.

  • Create a text file (using an editor, not necessarily Python) containing two tab-separated columns, with each column containing a number. Then use Python to read through the file you’ve created. For each line, multiply the first number by the second, and then sum the results from all the lines. Ignore any line that doesn’t contain two numeric columns.

  • Read through a text file, line by line. Use a dict to keep track of how many times each vowel (a, e, i, o, and u) appears in the file. Print the resulting tabulation.

Exercise 19 /etc/passwd to dict

It’s both common and useful to think of files as sequences of strings. After all, when you iterate over a file object, you get each of the file’s lines as a string, one at a time. But it often makes more sense to turn a file into a more complex data structure, such as a dict.

In this exercise, write a function, passwd_to_dict, that reads from a Unix-style “password file,” commonly stored as /etc/passwd, and returns a dict based on it. If you don’t have access to such a file, you can download one that I’ve uploaded at http://mng.bz/2XXg.

Here’s a sample of what the file looks like:

nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false

Each line is one user record, divided into colon-separated fields. The first field (index 0) is the username, and the third field (index 2) is the user’s unique ID number. (In the system from which I took the /etc/passwd file, nobody has ID -2, root has ID 0, and daemon has ID 1.) For our purposes, you can ignore all but these two fields.

Sometimes, the file will contain lines that fail to adhere to this format. For example, we generally ignore lines containing nothing but whitespace. Some vendors (e.g., Apple) include comments in their /etc/passwd files, in which the line starts with a # character.

The function passwd_to_dict should return a dict based on /etc/passwd in which the dict’s keys are usernames and the values are the users’ IDs.

Some help from string methods

The string methods str.startswith, str.endswith, and str.strip are helpful when doing this kind of analysis and manipulation.

For example, str.startswith returns True or False, depending on whether the string starts with a given string:

s = 'abcd'
s.startswith('a')    # returns True
s.startswith('abc')  # returns True
s.startswith('b')    # returns False

Similarly, str.endswith tells us whether a string ends with a particular string:

s = 'abcd'
s.endswith('d')    # returns True
s.endswith('cd')   # returns True
s.endswith('b')    # returns False

str.strip removes the whitespace--the space character, as well as \t, \n, \r, and even \v--on either side of the string. The str.lstrip and str.rstrip methods only remove whitespace on the left and right, respectively. All of these methods return strings:

s = '   \t\t\ta  b  c  \t\t\n'
s.strip()    # returns 'a  b  c'
s.lstrip()   # returns 'a  b  c  \t\t\n'
s.rstrip()   # returns '   \t\t\ta  b  c'

Working it out

Once again, we’re opening a text file and iterating over its lines, one at a time. Here, we assume that we know the file’s format, and that we can extract fields from within each record.

In this case, we’re splitting each line across the : character, using the str.split method. str.split always returns a list of strings, although the length of that list depends on the number of times that : occurs in the string. In the case of /etc/passwd, we will assume that any line containing : is a legitimate user record and thus has all of the necessary fields.

However, the file might contain comment lines beginning with #. If we were to invoke str.split (http://mng.bz/aR4z) on those lines, we’d get back a list, but one containing only a single element--leading to an IndexError exception if we tried to retrieve user_info[2].

It’s thus important that we ignore those lines that begin with #. Fortunately, we can use the str.startswith (http://mng.bz/PAAw) method. Specifically, I identify and discard comment and blank lines using this code:

if not line.startswith(('#', '\n')):

The invocation of str.startswith passes it a tuple of two strings. str.startswith will return True if any of the strings in that tuple is found at the start of the line. Because every line contains a newline, including blank lines, we could say that a line that starts with \n is a blank line.

Assuming that it has found a user record, our program then adds a new key-value pair to users. The key is user_info[0], and the value is user_info[2]. Notice how we can use user_info[0] as the name of a key; as long as the value of that variable contains a string, we may use it as a dict key.

I use with (http://mng.bz/lGG2) here to open the file, thus ensuring that it’s closed when the block ends. (See the sidebar about with and context managers.)

Solution

def passwd_to_dict(filename):
    users = {}
    with open(filename) as passwd:
        for line in passwd:
            if not line.startswith(('#', '\n')):
                user_info = line.split(':')          
                users[user_info[0]] = int(user_info[2])
    return users
 
print(passwd_to_dict('/etc/passwd'))

Ignores comment and blank lines

Turns the line into a list of strings

You can work through a version of this code in the Python Tutor at http://mng.bz/lGWR.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

At a certain point in your Python career, you’ll stop seeing files as sequences of characters on a disk, and start seeing them as raw material you can transform into Python data structures. Our programs have more semantic power when working with structured data (e.g., dicts) than with plain strings. We can similarly do more, and think in deeper ways, if we read a file into a data structure rather than just into a string.

For example, imagine a CSV file in which each line contains the name of a country and its population. Reading this file as a string, it would be possible--but frustrating--to compare the populations of France and Thailand. But reading this file into a dict, it would be trivial to make such a comparison.

Indeed, I’m a particular fan of reading files into dicts, in no small part because many file formats lend themselves to this sort of translation--but you can also use more complex data structures. Here are some additional exercises you can try to help you see that connection and make the transformation in your code:

  • Read through /etc/passwd, creating a dict in which user login shells (the final field on each line) are the keys. Each value will be a list of the users for whom that shell is defined as their login shell.

  • Ask the user to enter integers, separated by spaces. From this input, create a dict whose keys are the factors for each number, and the values are lists containing those of the users’ integers that are multiples of those factors.

  • From /etc/passwd, create a dict in which the keys are the usernames (as in the main exercise) and the values are themselves dicts with keys (and appropriate values) for user ID, home directory, and shell.

with and context managers

As we’ve seen, it’s common to open a file as follows:

with open('myfile.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')

Most people believe, correctly, that using with ensures that the file, f, will be flushed and closed at the end of the block. (You thus don’t have to explicitly call f.close() to ensure the contents will be flushed.) But because with is overwhelmingly used with files, many developers believe that there’s some inherent connection between with and files. The truth is that with is a much more general Python construct, known as a context manager.

The basic idea is as follows:

  1. You use with, along with an object and a variable to which you want to assign the object.

  2. The object should know how to behave inside of the context manager.

  3. When the block starts, with turns to the object. If a __enter__ method is defined on the object, then it runs. In the case of files, the method is defined but does nothing other than return the file object itself. Whatever this method returns is assigned to the as variable at the end of the with line.

  4. When the block ends, with once again turns to the object, executing its __exit__ method. This method gives the object a chance to change or restore whatever state it was using.

It’s pretty obvious, then, how with works with files. Perhaps the __enter__ method isn’t important and doesn’t do much, but the __exit__ method certainly is important and does a lot--specifically in flushing and closing the file. If you pass two or more objects to with, the __enter__ and __exit__ methods are invoked on each of them, in turn.

Other objects can and do adhere to the context manager protocol. Indeed, if you want, you can write your own classes such that they’ll know how to behave inside of a with statement. (Details of how to do so are in the “What you need to know” table at the start of the chapter.)
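Here’s a minimal sketch of such a class; MyResource is a hypothetical name, not something from the standard library:

class MyResource:    # hypothetical example class
    def __enter__(self):
        print('Entering the block')
        return self    # assigned to the "as" variable

    def __exit__(self, exc_type, exc_value, traceback):
        print('Exiting the block')
        return False    # don't suppress exceptions

with MyResource() as m:
    print('Inside the block')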

Are context managers only used in the case of files? No, but that’s the most common case by far. Two other common cases are (1) when processing database transactions and (2) when locking certain sections in multi-threaded code. In both situations, you want to have a section of code that’s executed within a certain context--and thus, Python’s context management, via with, comes to the rescue.

If you want to learn more about context managers, here’s a good article on the subject: http://mng.bz/B221.

Exercise 20 Word count

Unix systems contain many utility programs. One of the most useful to me is wc (http://mng.bz/Jyyo), the word count program. If you run wc against a text file, it’ll count the characters, words, and lines that the file contains.

The challenge for this exercise is to write a wordcount function that mimics the wc Unix command. The function will take a filename as input and will print four lines of output:

  1. Number of characters (including whitespace)

  2. Number of words (separated by whitespace)

  3. Number of lines

  4. Number of unique words (case sensitive, so “NO” is different from “no”)

I’ve placed a test file (wcfile.txt) at http://mng.bz/B2ml. You may download and use that file to test your implementation of wc. Any file will do, but if you use this one, your results will match up with mine. That file’s contents look like this:

This is a test file.
It contains 28 words and 20 different words.
It also contains 165 characters.
It also contains 11 lines.
It is also self-referential.
Wow!

This exercise, like many others in this chapter, tries to help you see the connections between text files and Python’s built-in data structures. It’s very common to use Python to work with log files and configuration files, collecting and reporting that data in a human-readable format.

Working it out

This program demonstrates a number of Python’s capabilities that many programmers use on a daily basis. First and foremost, many people who are new to Python believe that if they have to measure four aspects of a file, then they should read through the file four times. That might mean opening the file once and reading through it four times, or even opening it four separate times. But it’s more common in Python to loop over the file once, iterating over each line and accumulating whatever data the program can find from that line.

How will we accumulate this data? We could use separate variables, and there’s nothing wrong with that. But I prefer to use a dict (figure 5.2), since the counts are closely related, and because it also reduces the code I need to produce a report.

So, once we’re iterating over the lines of the file, how can we count the various elements? Counting lines is the easiest part: each iteration goes over one line, so we can simply add 1 to counts['lines'] at the top of the loop.

Next, we want to count the number of characters in the file. Since we’re already iterating over the file, there’s not that much work to do. We get the number of characters in the current line by calculating len(one_line), and then adding that to counts['characters'].

Many people are surprised that this includes whitespace characters, such as spaces and tabs, as well as newlines. Yes, even an “empty” line contains a single newline character. But if we didn’t have newline characters, then it wouldn’t be obvious to the computer when it should start a new line. So such characters are necessary, and they take up some space.

Figure 5.2 Initialized counts in the dict

Next, we want to count the number of words. To get this count, we turn one_line into a list of words, invoking one_line.split. The solution invokes split without any arguments, which causes it to use all whitespace--spaces, tabs, and newlines--as delimiters. The length of the resulting list is then added to counts['words'].

The final item to count is unique words. We could, in theory, use a list to store new words. But it’s much easier to let Python do the hard work for us, using a set to guarantee the uniqueness. Thus, we create the unique_words set at the start of the program, and then use unique_words.update (http://mng.bz/MdOn) to add all of the words in the current line into the set (figure 5.3). For the report to work on our dict, we then add a new key-value pair to counts, using len(unique_words) to count the number of words in the set.

Figure 5.3 The data structures, including unique words, after several lines

Solution

def wordcount(filename):
    counts = {'characters': 0,
              'words': 0,
              'lines': 0}
    unique_words = set()                         

    for one_line in open(filename):
        counts['lines'] += 1
        counts['characters'] += len(one_line)
        counts['words'] += len(one_line.split())

        unique_words.update(one_line.split())    

    counts['unique words'] = len(unique_words)   
    for key, value in counts.items():
        print(f'{key}: {value}')
 
wordcount('wcfile.txt')

You can create sets with curly braces, but not if they’re empty! Use set() to create a new empty set.

set.update adds all of the elements of an iterable to a set.

Sticks the set’s length into counts for a combined report

You can work through a version of this code in the Python Tutor at http://mng.bz/MdZo.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

Creating reports based on files is a common use for Python, and using dicts to accumulate information from those files is also common. Here are some additional things you can try to do, similar to what we did here:

  • Ask the user to enter the name of a text file and then (on one line, separated by spaces) words whose frequencies should be counted in that file. Count how many times those words appear in the file, storing the results in a dict that uses the user-entered words as the keys and the counts as the values.

  • Create a dict in which the keys are the names of files on your system and the values are the sizes of those files. To calculate the size, you can use os.stat (http://mng.bz/dyyo).

  • Given a directory, read through each file and count the frequency of each letter. (Force letters to be lowercase, and ignore nonletter characters.) Use a dict to keep track of the letter frequencies. What are the five most common letters across all of these files?

Exercise 21 Longest word per file

So far, we’ve worked with individual files. Many tasks, however, require you to analyze data in multiple files--such as all of the files in a directory. This exercise will give you some practice working with multiple files, aggregating measurements across all of them.

In this exercise, write two functions. find_longest_word takes a filename as an argument and returns the longest word found in the file. The second function, find_all_longest_words, takes a directory name and returns a dict in which the keys are filenames and the values are the longest words from each file.

If you don’t have any text files that you can use for this exercise, you can download and use a zip file I’ve created from the five most popular books at Project Gutenberg (https://gutenberg.org/). You can download the zip file from http://mng.bz/rrWj.

Note There are several ways to solve this problem. If you already know how to use comprehensions, and particularly dict comprehensions, then that’s probably the most Pythonic approach. But if you aren’t yet comfortable with them, and would prefer not to jump to read about them in chapter 7, then no worries--you can use a traditional for loop, and you’ll be just fine.

Working it out

In this case, you’re being asked to take a directory name and then find the longest word in each plain-text file in that directory. As noted, your function should return a dict in which the dict’s keys are the filenames and the dict’s values are the longest words in each file.

Whenever you hear that you need to transform a collection of inputs into a collection of outputs, you should immediately think about comprehensions--most commonly list comprehensions, but set comprehensions and dict comprehensions are also useful. In this case, we’ll use a dict comprehension--which means that we’ll create a dict based on iterating over a source. The source, in our case, will be a list of filenames. The filenames will also provide the dict keys, while the values will be the result of passing the filenames to a function.

In other words, our dict comprehension will

  1. Iterate over the list of files in the named directory, putting each filename in the variable filename.

  2. For each file, run the function find_longest_word, passing filename as an argument. The return value will be a string, the longest word in the file.

  3. Make each filename-longest word combination a key-value pair in the dict we create.

How can we implement find_longest_word? We could read the file’s entire contents into a string, turn that string into a list, and then find the longest word in the list with sorted. Although this will work well for short files, it’ll use a lot of memory for even medium-sized files.

My solution is thus to iterate over every line of a file, and then over every word in the line. If we find a word that’s longer than the current longest_word, we replace the old word with the new one. When we’re done iterating over the file, we can return the longest word that we found.

Note my use of os.path.join (http://mng.bz/oPPM) to combine the directory name with a filename. You can think of os.path.join as a filename-specific version of str.join. It has additional advantages, as well, such as taking into account the current operating system. On Windows, os.path.join will use backslashes, whereas on Macs and Unix/Linux systems, it’ll use a forward slash.
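For example, this minimal sketch (the names are made up) prints logs/2024/app.log on Macs and Unix/Linux systems, and logs\2024\app.log on Windows:

import os

# joins path components using the current OS's separator
print(os.path.join('logs', '2024', 'app.log'))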

Solution

import os

def find_longest_word(filename):
    longest_word = ''
    for one_line in open(filename):
        for one_word in one_line.split():
            if len(one_word) > len(longest_word):
                longest_word = one_word
    return longest_word
 
def find_all_longest_words(dirname):
    return {filename:
            find_longest_word(os.path.join(dirname,
                                        filename))      
            for filename in os.listdir(dirname)         
            if os.path.isfile(os.path.join(dirname,
                                           filename))}  
 
print(find_all_longest_words('.'))

Gets the filename and its full path

Iterates over all of the files in dirname

We’re only interested in files, not directories or special files.

Because these functions work with directories, there is no Python Tutor link.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

You’ll commonly produce reports about files and file contents using dicts and other basic data structures in Python. Here are a few possible exercises to practice these ideas further:

  • Use the hashlib module in the Python standard library, and the md5 function within it, to calculate the MD5 hash for the contents of every file in a user-specified directory. Then print all of the filenames and their MD5 hashes.

  • Ask the user for a directory name. Show all of the files in the directory, as well as how long ago the directory was modified. You will probably want to use a combination of os.stat and the Arrow package on PyPI (http://mng.bz/nPPK) to do this easily.

  • Open an HTTP server’s log file. (If you lack one, then you can read one from me at http://mng.bz/vxxM.) Summarize how many requests resulted in numeric response codes--202, 304, and so on.

Directory listings

For a language that claims “there’s one way to do it,” Python has too many ways to list files in a directory. The two most common are os.listdir and glob.glob, both of which I’ve mentioned in this chapter. A third way is to use pathlib, which provides us with an object-oriented API to the filesystem.

The easiest and most standard of these is os.listdir, a function in the os module. It returns a list of strings, the names of files in the directory; for example

filenames = os.listdir('/etc/')

The good news is that it’s easy to understand and work with os.listdir. The bad news is that it returns a list of filenames without the directory name, which means that to open or work with the files, you’ll need to add the directory name at the beginning--ideally with os.path.join, which works cross-platform.

The other problem with os.listdir is that you can’t filter the filenames by a pattern. You get everything, including subdirectories and hidden files. So if you want just the .txt files in a directory, os.listdir won’t be enough.

That’s where the glob module comes in. It lets you use patterns, sometimes known as globbing, to describe the files that you want. Moreover, it returns a list of strings--with each string containing the complete path to the file. For example, I can get the full paths of the configuration files in /etc/ on my computer with

filenames = glob.glob('/etc/*.conf')

I don’t need to worry about other files or subdirectories in this case, which makes it much easier to work with. For a long time, glob.glob was thus my go-to function for finding files.

Then there’s pathlib, a module that comes with the Python standard library and makes things easier in many ways. You start by creating a pathlib.Path object, which represents a file or directory:

import pathlib
p = pathlib.Path('/etc/')

Once you have this Path object, you can do lots of things with it that previously required separate functions--including the ones I’ve just described. For example, you can get an iterator that returns files in the directory with iterdir:

for one_filename in p.iterdir():
    print(one_filename)

In each iteration, you don’t get a string, but rather a Path object (or more specifically, on my Mac I get a PosixPath object). Having a full-fledged Path object, rather than a string, allows you to do lots more than just print the filename; you can open and inspect the file as well.
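Here’s a minimal sketch of what that looks like; it prints each file’s name and size, skipping anything that isn’t a regular file:

import pathlib

p = pathlib.Path('/etc/')
for one_filename in p.iterdir():
    if one_filename.is_file():    # skip directories and special files
        print(one_filename.name, one_filename.stat().st_size)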

If you want to get a list of files matching a pattern, as I did with glob.glob, you can use the glob method:

for one_filename in p.glob('*.conf'):
    print(one_filename)

pathlib is a great addition to recent Python versions. If you have a chance to use it, you should do so; I’ve found that it clarifies and shortens quite a bit of my code. A good introduction to pathlib is here: http://mng.bz/4AAV.

Exercise 22 Reading and writing CSV

In a CSV file, each record is stored on one line, and fields are separated by commas. CSV is commonly used for exchanging information, especially (but not only) in the world of data science. For example, a CSV file might contain information about different vegetables:

lettuce,green,soft
carrot,orange,hard
pepper,green,hard
eggplant,purple,soft

Each line in this CSV file contains three fields, separated by commas. There aren’t any headers describing the fields, although many CSV files do have them.

Sometimes, the comma is replaced by another character, so as to avoid potential ambiguity. My personal favorite is to use a TAB character (\t in Python strings).

Python comes with a csv module (http://mng.bz/Qyyj) that handles writing to and reading from CSV files. For example, you can write to a CSV file with the following code:

import csv

with open('/tmp/stuff.csv', 'w') as f:
    o = csv.writer(f)                         
    o.writerow(range(5))                      
    o.writerow(['a', 'b', 'c', 'd', 'e'])     

Creates a csv.writer object, wrapping our file-like object “f”

Writes the integers from 0-4 to the file, separated by commas

Writes this list of strings as a record to the CSV file, separated by commas
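Reading works much the same way. Here’s a minimal sketch that reads back the file written above:

import csv

with open('/tmp/stuff.csv') as f:
    r = csv.reader(f)     # wraps the file-like object "f"
    for one_row in r:
        print(one_row)    # each row is a list of strings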

Not all CSV files necessarily look like CSV files. For example, the standard Unix /etc/passwd file, which contains information about users on a system (but no longer users’ passwords, despite its name), separates fields with : characters.

For this exercise, create a function, passwd_to_csv, that takes two filenames as arguments: the first is a passwd-style file to read from, and the second is the name of a file in which to write the output.

The new file’s contents are the username (index 0) and the user ID (index 2). Note that a record may contain a comment, in which case it will not have anything at index 2; you should take that into consideration when writing the file. The output file should use TAB characters to separate the elements.

Thus, the input will look like this

root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
# I am a comment line
_ftp:*:98:-2::0:0:FTP Daemon:/var/empty:/usr/bin/false

and the output will look like this:

root    0
daemon  1
_ftp    98

Notice that the comment line in the input file is not placed in the output file. You can assume that any line with at least two colon-separated fields is legitimate.

How Python handles end of lines and newlines on different OSs

Different operating systems have different ways of indicating that we’ve reached the end of the line. Unix systems, including the Mac, use ASCII 10 (line feed, or LF). Windows systems use two characters, namely ASCII 13 (carriage return, or CR) + ASCII 10. Old-style Macs used just ASCII 13.

Python tries to bridge these gaps by being flexible, and making some good guesses, when it reads files. I’ve thus rarely had problems using Python to read text files that were created using Windows. By the same token, my students (who typically use Windows) generally have no problem reading the files that I’ve created on the Mac. Python figures out what line ending is being used, so we don’t need to provide any more hints. And inside of the Python program, the line ending is symbolized by \n.

Writing to files, in contrast, is a bit trickier. Python will try to use the line ending appropriate for the operating system. So if you’re writing to a file on Windows, it’ll use CR+LF (sometimes shown as \r\n). If you’re writing to a file on a Unix machine, then it’ll just use LF.

This typically works just fine. But sometimes, you’ll find yourself seeing too many or too few newlines when you read from a file. This might mean that Python has guessed incorrectly, or that the file used a few different line endings, confusing Python’s guessing algorithm.

In such cases, you can pass a value to the newline parameter in the open function, used to open files. You can try to explicitly use newline='\n' to force Unix-style newlines, or newline='\r\n' to force Windows-style newlines. If this doesn’t fix the problem, you might need to examine the file further to see how it was defined.
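Here’s a minimal sketch that forces Unix-style line endings when writing, regardless of the operating system; the filename and contents are made up:

with open('unix_style.txt', 'w', newline='\n') as f:
    f.write('first line\n')     # written as LF, with no translation
    f.write('second line\n')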

For a complete introduction to working with CSV files in Python, check out http://mng.bz/XPP6/.

Working it out

The solution program uses a number of aspects of Python that are useful when working with files. We’ve already seen and discussed with earlier in this chapter. Here, you can see how you can use with to open two separate files, or generally to define any number of objects. As soon as our block exits, both of the files are automatically closed.

We define two variables in the with statement, for the two files with which we’ll be working. The passwd file is opened for reading from /etc/passwd. The output file is opened for writing, and writes to /tmp/output.csv. Our program will act as a go-between, translating from the input file and placing a reformatted subset into the output file.

We do this by creating one instance of csv.reader, which wraps passwd. However, because /etc/passwd uses colons (:) to delimit fields, we must tell this to csv.reader. Otherwise, it’ll try to use commas, which will likely lead to an error--or, worse yet, not lead to an error, despite parsing the file incorrectly. Similarly, we define an instance of csv.writer, wrapping our output file and indicating that we want to use \t as the delimiter.

Now that we have our objects in place for reading and writing CSV data, we can run through the input file, writing a row (line) to the output file for each of those inputs. We take the username (from index 0) and the user ID (from index 2), create a tuple, and pass that tuple to the writer’s writerow method. Our csv.writer object knows how to take our fields and print them to the file, separated by \t.

Perhaps the trickiest thing here is to ensure we don’t try to transform lines that contain comments--that is, those which begin with a hash (#) character. There are a number of ways to do this, but the method that I’ve employed here is simply to check the number of fields we got for the current input line. If there’s only one field, then it must be a comment line, or perhaps another type of malformed line. In such a case, we ignore the line altogether. Another good technique would be to check for # at the start of the line, perhaps using str.startswith.

Solution

import csv
 
def passwd_to_csv(passwd_filename, csv_filename):
    with open(passwd_filename) as passwd, open(csv_filename, 'w') as output:
        infile = csv.reader(passwd, delimiter=':')
        outfile = csv.writer(output, delimiter='\t')
        for record in infile:
            if len(record) > 1:
                outfile.writerow((record[0], record[2]))

Fields in the input file are separated by colons (“:”).

Fields in the output file are separated by tabs (\t).

Because we can’t write to files on the Python Tutor, there is no link for this exercise.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

CSV files are extremely useful and common, and the csv module that comes with Python works with them very well. If you need something more advanced, then you might want to look into pandas (http://mng.bz/yyyq), which handles a wide array of CSV variations, as well as many other formats.

Here are several additional exercises you can try to improve your facility with CSV files:

  • Extend this exercise by asking the user to enter a space-separated list of integers, indicating which fields should be written to the output CSV file. Also ask the user which character should be used as a delimiter in the output file. Then read from /etc/passwd, writing the user’s chosen fields, separated by the user’s chosen delimiter.

  • Write a function that writes a dict to a CSV file. Each line in the CSV file should contain three fields: (1) the key, which we’ll assume to be a string, (2) the value, and (3) the type of the value (e.g., str or int).

  • Create a CSV file, in which each line contains 10 random integers between 10 and 100. Now read the file back, and print the sum and mean of the numbers on each line.

Exercise 23 JSON

JSON (described at http://json.org/) is a popular format for data exchange. In particular, many web services and APIs send and receive data using JSON.

JSON-encoded data can be read into a very large number of programming languages, including Python. The Python standard library comes with the json module (http://mng.bz/Mddn), which can be used to turn JSON-encoded strings into Python objects, and vice versa. The json.load function reads JSON-encoded data from a file and returns a combination of Python objects.
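Here’s a minimal round trip between Python data and a JSON-formatted string; the data is made up:

import json

d = {'math': 90, 'literature': 98}
json_string = json.dumps(d)       # '{"math": 90, "literature": 98}'
print(json.loads(json_string))    # {'math': 90, 'literature': 98}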

In this exercise, you’re analyzing test data in a high school. There’s a scores directory on the filesystem containing a number of files in JSON format. Each file represents the scores for one class. Write a function, print_scores, that takes a directory name as an argument and prints a summary of the student scores it finds.

If you’re trying to analyze the scores from class 9a, they’d be in a file called 9a.json that looks like this:

[{"math" : 90, "literature" : 98, "science" : 97},
 {"math" : 65, "literature" : 79, "science" : 85},
 {"math" : 78, "literature" : 83, "science" : 75},
 {"math" : 92, "literature" : 78, "science" : 85},
 {"math" : 100, "literature" : 80, "science" : 90}
]

The directory may also contain files for 10th grade (10a.json, 10b.json, and 10c.json) and other grades and classes in the high school. Each file contains the JSON equivalent of a list of dicts, with each dict containing scores for several different school subjects.

Note Valid JSON uses double quotes ("), not single quotes ('). This can be surprising and frustrating for Python developers to discover.

Your function should print the highest, lowest, and average test scores for each subject in each class. Given two files (9a.json and 9b.json) in the scores directory, we would see the following output:

scores/9a.json
    science: min 75, max 97, average 86.4
    literature: min 78, max 98, average 83.6
    math: min 65, max 100, average 85.0
scores/9b.json
    science: min 35, max 95, average 82.0
    literature: min 38, max 98, average 72.0
    math: min 38, max 100, average 77.0

You can download a zipfile with these JSON files from http://mng.bz/Vg1x.

Working it out

In many languages, the first response to this kind of problem would be “Let’s create our own class!” But in Python, while we can (and often do) create our own classes, it’s often easier and faster to make use of built-in data structures--lists, tuples, and dicts.

In this particular case, we’re reading from a JSON file. JSON is a data representation, much like XML; it isn’t a data type per se. Thus, if we want to create JSON, we must use the json module to turn our Python data into JSON-formatted strings. And if we want to read from a JSON file, we must read the contents of the file, as strings, into our program, and then turn it into Python data structures.

In this exercise, though, you’re being asked to work on multiple files in one directory. We know that the directory is called scores and that the files all have a .json suffix. We could thus use os.listdir on the directory, filtering (perhaps with a list comprehension) through all of those filenames such that we only work on those ending with .json.

However, this seems like a more appropriate place to use glob (http://mng.bz/044N), which takes a Unix-style filename pattern with (among others) * and ? characters and returns a list of those filenames that match the pattern. Thus, by invoking glob.glob('scores/*.json'), we get all of the files ending in .json within the scores directory. We can then iterate over that list, assigning the current filename (a string) to filename.

Next, we create a new entry in our scores dict, which is where we’ll store the scores. This will actually be a dict of dicts, in which the first level will be the name of the file--and thus the class--from which we’ve read the data. The second-level keys will be the subjects; the dict’s values will be a list of scores, from which we can then calculate the statistics we need. Thus, once we’ve defined filename, we immediately add the filename as a key to scores, with a new empty dict as the value.

Sometimes, you’ll need to read each line of a file into Python and then invoke json.loads to turn that line into data. In our case, however, the file contains a single JSON array. We must thus use json.load to read from the file object infile, which turns the contents of the file into a Python list of dicts.

Because json.load returns a list of dicts, we can iterate over it. Each test result is placed in the result variable, which is a dict, in which the keys are the subjects and the values are the scores. Our goal is to reveal some statistics for each of the subjects in the class, which means that while the input file reports scores on a per-student basis, our report will ignore the students in favor of the subjects.

Given that result is a dict, we can iterate over its key-value pairs with result.items(), using parallel assignment to iterate over the key and value (here called subject and score). Now, we don’t know in advance what subjects will be in our file, nor do we know how many tests there will be. As a result, it’s easiest for us to store our scores in a list. This means that our scores dict will have one top-level key for each filename, and one second-level key for each subject. The second-level value will be a list, to which we’ll then append with each iteration through the JSON-parsed list.

We’ll want to add our score to the list:

scores[filename][subject]

Before we can do that, we need to make sure the list exists. One easy way to do this is with dict.setdefault, which assigns a key-value pair to a dict, but only if the key doesn’t already exist. In other words, d.setdefault(k, v) is the same as saying

if k not in d:
    d[k] = v

We use dict.setdefault (http://mng.bz/aRRB) to create the list if it doesn’t yet exist. In the next line, we add the score to the list for this subject, in this class.
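Here’s a minimal sketch of this idiom, with made-up data:

scores = {}
scores.setdefault('math', [])    # creates the list on first use
scores['math'].append(90)
scores.setdefault('math', [])    # key already exists; does nothing
scores['math'].append(100)
print(scores)                    # {'math': [90, 100]}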

When we’ve completed our initial for loop, we have all of the scores for each class. We can then iterate over each class, printing the name of the class.

Then, we iterate over each subject for the class. We once again use the method dict.items to return a key-value pair--in this case, calling them subject (for the name of the subject) and subject_scores (for the list of scores for that subject). We then use an f-string to produce some output, using the built-in min (http://mng.bz/gyyE) and max (http://mng.bz/Vgq5) functions, and then combining sum (http://mng.bz/eQQv) and len to get the average score.

While this program reads from a file containing JSON and then produces output on the user’s screen, it could just as easily read from a network connection containing JSON, and/or write to a file or socket in JSON format. As long as we use built-in and standard Python data structures, the json module will be able to take our data and turn it into JSON.

Solution

import json
import glob
 
 
def print_scores(dirname):
 
    scores = {}
 
    for filename in glob.glob(f'{dirname}/*.json'):
        scores[filename] = {}
 
        with open(filename) as infile:
            for result in json.load(infile):                
                for subject, score in result.items():
                    scores[filename].setdefault(subject,
                                                [])         
                    scores[filename][subject].append(score)
 
    for one_class in scores:                                
        print(one_class)
        for subject, subject_scores in scores[one_class].items():
            min_score = min(subject_scores)
            max_score = max(subject_scores)
            average_score = (sum(subject_scores) /
                             len(subject_scores))
 
            print(subject)
            print(f'\tmin {min_score}')
            print(f'\tmax {max_score}')
            print(f'\taverage {average_score}')

Reads from the file infile and turns it from JSON into Python objects

Makes sure that subject exists as a key in scores[filename]

Summarizes the scores

Because these functions work with directories, there is no Python Tutor link.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

Here are some more tasks you can try that use JSON:

  • Convert /etc/passwd from a CSV-style file into a JSON-formatted file. The JSON file will contain the equivalent of a list of Python tuples, with each tuple representing one line from the file.

  • For a slightly different challenge, turn each line in the file into a Python dict. This will require identifying each field with a unique column or key name. If you’re not sure what each field in /etc/passwd does, you can give it an arbitrary name.

  • Ask the user for the name of a directory. Iterate through each file in that directory (ignoring subdirectories), getting (via os.stat) the size of the file and when it was last modified. Create a JSON-formatted file on disk listing each filename, size, and modification timestamp. Then read the file back in, and identify which files were modified most and least recently, and which files are largest and smallest, in that directory.

Exercise 24 Reverse lines

In many cases, we want to take a file in one format and save it to another format. In this exercise, you’ll write a function that does a basic version of this idea. The function takes two arguments: the names of the input file (to be read from) and the output file (which will be created).

For example, if a file looks like

abc def
ghi jkl

then the output file will be

fed cba
lkj ihg

Notice that the newline remains at the end of the string, while the rest of the characters are all reversed.

Transforming files from one format into another and taking data from one file and creating another one based on it are common tasks. For example, you might need to translate dates to a different format, move timestamps from Eastern Daylight Time into Greenwich Mean Time, or transform prices from euros into dollars. You might also want to extract only some data from an input file, such as for a particular date or location.

Working it out

This solution depends not only on the fact that we can iterate over a file one line at a time, but also that we can work with more than one object in a with statement. Remember that with takes one or more objects and allows us to assign variables to them. I particularly like the fact that when I want to read from one file and write to another, I can just use with to open one for reading, open a second for writing, and then do what I’ve shown here.

I then read through each line of the input file, reversing each line using Python’s slice syntax--remember that s[::-1] means that we want all of the elements of s, from the start to the end, but with a step size of -1, which returns a reversed version of the string.

Before we can reverse the string, however, we first want to remove the newline character that’s the final character in the string. So we first run str.rstrip() on the current line, and then we reverse it. We then write it to the output file, adding a newline character so we’ll actually descend by one line.
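Here’s a minimal sketch of that combination, with a made-up string:

s = 'abc def\n'
print(s.rstrip()[::-1])    # prints 'fed cba'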

The use of with guarantees that both files will be closed when the block ends. When we close a file that we opened for writing, it’s automatically flushed, which means we don’t need to worry about whether the data has actually been saved to disk.

I should note that people often ask me how to read from and write to the same file. Python does support that, with the r+ mode. But I find that this opens the door to many potential problems because of the chance you’ll overwrite the wrong character, and thus mess up the format of the file you’re editing. I suggest that people use this sort of read-from-one, write-to-the-other code, which has roughly the same effect, without the potential danger of messing up the input file.

Solution

def reverse_lines(infilename, outfilename):
    with open(infilename) as infile, open(outfilename, 'w') as outfile:
        for one_line in infile:
            outfile.write(f'{one_line.rstrip()[::-1]}\n')

str.rstrip removes all whitespace from the right side of a string.

Because we can’t write to files in the Python Tutor, there is no link for this exercise.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

Here are some more exercise ideas for translating files from one format to another using with and this kind of technique:

  • “Encrypt” a text file by turning all of its characters into their numeric equivalents (with the built-in ord function) and writing that file to disk. Now “decrypt” the file (using the built-in chr function), turning the numbers back into their original characters.

  • Given an existing text file, create two new text files. The new files will each contain the same number of lines as the input file. In one output file, you’ll write all of the vowels (a, e, i, o, and u) from the input file. In the other, you’ll write all of the consonants. (You can ignore punctuation and whitespace.)

  • The final field in /etc/passwd is the shell, the Unix command interpreter that’s invoked when a user logs in. Create a file, containing one line per shell, in which the shell’s name is written, followed by all of the usernames that use the shell; for example

/bin/bash:root, jci, user, reuven, atara
/bin/sh:spamd, gitlab

Summary

It’s almost impossible to imagine writing programs without using files. And while there are many different types of files, Python is especially well suited for working with text files--especially, but not only, log files and configuration files, as well as those formatted in standard ways, such as JSON and CSV.

It’s important to remember a few things when working with files:

  • You will typically open files for either reading or writing.

  • You can (and should) iterate over files one line at a time, rather than reading the whole thing into memory at once.

  • Using with when opening a file for writing ensures that the file will be flushed and closed.

  • The csv module makes it easy to read from and write to CSV files.

  • The json module’s dump and load functions allow us to move between Python data structures and JSON-formatted strings.

  • Reading from files into built-in Python data types is a common and powerful technique.
