Files are an indispensable part of the world of computers, and thus of programming. We read data from files, and write to files. Even when something isn’t really a file--such as a network connection--we try to use an interface similar to files because they’re so familiar.
To normal, everyday users, there are different types of files--Word, Excel, PowerPoint, and PDF, among others. To programmers, things are both simpler and more complicated. They’re simpler in that we see files as data structures to which we can write strings, and from which we can read strings. But files are also more complicated, in that when we read the string into memory, we might need to parse it into a data structure.
Working with files is one of the easiest and most straightforward things you can do in Python. It’s also one of the most common things that we need to do, since programs that don’t interact with the filesystem are rather boring.
In this chapter, we’ll practice working with files--reading from them, writing to them, and manipulating the data that they contain. Along the way, you’ll get used to some of the paradigms that are commonly used when working with Python files, such as iterating over a file’s contents and writing to files in a with
block.
In some cases, we’ll work with data formatted as CSV (comma-separated values) or JSON (JavaScript object notation), two common formats that modules in Python’s standard library handle. If you’ve forgotten the basics of CSV or JSON, I have some short reminders in this chapter.
After this chapter, you’ll not only be more comfortable working with files, you’ll also better understand how you can translate from in-memory data structures (e.g., lists and dicts) to on-disk data formats (e.g., CSV and JSON) and back. In this way, files make it possible for you to keep data structures intact--even when the program isn’t running or when the computer is shut down--or even to transfer such data structures to other computers.
with: Puts an object in a context manager; in the case of a file, ensures it's flushed and closed by the end of the block
os.stat: Retrieves information (size, permissions, type) about a file
It’s very common for new Python programmers to learn how they can iterate over the lines of a file, printing one line at a time. But what if I’m not interested in each line, or even in most of the lines? What if I’m only interested in a single line--the final line of the file?
Now, retrieving the final line of a file might not seem like a super useful action. But consider the Unix head
and tail
utilities, which show the first and last lines of a file, respectively--and which I use all the time to examine files, particularly log files and configuration files. Moreover, knowing how to read specific parts of a file, as opposed to the entire thing, is a useful and practical skill to have.
In this exercise, write a function (get_final_line
) that takes a filename as an argument. The function should return that file's final line as a string.
The solution code uses a number of common Python idioms that I’ll explain here. And along the way, you’ll see how using these idioms leads not just to more readable code, but also to more efficient execution.
Depending on which arguments you use when calling it, the built-in open
function can return a number of different objects, such as TextIOWrapper
or BufferedReader
. These objects all implement the same API for working with files and are thus described in the Python world as “file-like objects.” Using such an object allows us to paper over the many different types of filesystems out there and just think in terms of “a file.” Such an object also allows us to take advantage of whatever optimizations, such as buffering, the operating system might be using.
Here’s how open
is usually invoked:
f = open(filename)
In this case, filename
is a string representing a valid file name. When we invoke open
with just one argument, it should be a filename. The second, optional, argument is a string that can include multiple characters, indicating whether we want to read from, write to, or append to the file (using r
, w
, or a
), and whether the file should be read by character (the default) or by bytes (the b
option, in which case we’ll use rb
, wb
, or ab
). (See the sidebar about the b
option and reading the file in byte, or binary, mode.) I could thus more fully write the previous line of code as
f = open(filename, 'r')
Because we read from files more often than we write to them, r
is the default value for the second argument. It’s quite usual for Python programs not to specify r
if reading from a file.
As you can see here, we’ve put the resulting object into the variable f
. And because file-like objects are all iterable, returning one line per iteration, it’s typical to then say this:
for current_line in f:
    print(current_line)
But if you’re just planning to iterate over f
once, then why create it as a variable at all? We can avoid the variable definition and simply iterate over the file object that open
returned:
for current_line in open(filename):
    print(current_line)
With each iteration over a file-like object, we get the next line from the file, up to and including the \n newline character. Thus, in this code, current_line is always a string containing a single \n character at the end. A blank line in a file contains just the \n newline character.
In theory, files should end with a \n, such that you'll never finish the file in the middle of a line. In practice, I've seen many files that don't end with a \n. Keep this in mind whenever you're printing out a file; assuming that a file will always end with a newline character can cause trouble.
What about closing the file? This code will work, printing the length of each line in a file. However, this sort of code is frowned upon in the Python world because it doesn't explicitly close the file. When it comes to reading from files, that's not a big deal, especially if you're only opening a small number of them at a time. But if you're writing to files, or if you're opening many files at once, you'll want to close them--both to conserve resources and to ensure that everything you've written has actually been flushed to disk.
The way to do that is with the with
construct. I could rewrite the previous code as follows:
with open(filename) as f:
    for one_line in f:
        print(len(one_line))
Instead of opening the file and assigning the file object to f
directly, we’ve opened it within the context of with
, assigned it to f
as part of the with
statement, and then opened a block.
There’s more detail about this in the sidebar about with
and “context managers,” but you should know that this is the standard Pythonic way to open a file--in no small part because it guarantees that the file has been closed by the end of the block.
In this particular exercise, you were asked to print the final line of a file. One way to do so might look like the following code:
for current_line in open(filename):
    pass

print(current_line)
This trick works because we iterate over the lines of the file and assign current_line
in each iteration--but we don’t actually do anything in the body of the for
loop. Rather, we use pass
, which is a way of telling Python to do nothing. (Python requires that we have at least one line in an indented block, such as the body of a for
loop.) The reason that we execute this loop is for its side effect--namely, the fact that the final value assigned to current_line
remains in place after the loop exits.
However, looping over the rows of a file just to get the final one strikes me as a bit strange, even if it works. My preferred solution, shown in figure 5.1, is to iterate over each line of the file, getting the current line but immediately assigning it to final_line
.
When we exit from the loop, final_line
will contain whatever was in the most recent line. We can thus print it out afterwards.
Normally, print adds a newline after printing something to the screen. However, when we iterate over a file, each line already ends with a newline character. This can lead to doubled newlines in the printed output. The solution is to stop print from adding its own newline by overriding the default value ('\n') of the end parameter. By passing end='', we tell print to add '', the empty string, after printing final_line. For further information about the arguments you can pass to print, take a look here: http://mng.bz/RAAZ.
def get_final_line(filename):
final_line = ''
for current_line in open(filename): ❶
final_line = current_line
return final_line
print(get_final_line('/etc/passwd'))
❶ Iterates over each line of the file. You don’t need to declare a variable; just iterate directly over the result of open.
You can work through a version of this code in the Python Tutor at http://mng.bz/D24g.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
Iterating over files, and understanding how to work with the content as (and after) you iterate over them, is an important skill to have when working with Python. It is also important to understand how to turn the contents of a file into a Python data structure--something we’ll look at several more times in this chapter. Here are a few ideas for things you can do when iterating through files in this way:
Iterate over the lines of a text file. Find all of the words (i.e., non-whitespace tokens surrounded by whitespace) that contain only digits, and sum them.
Create a text file (using an editor, not necessarily Python) containing two tab-separated columns, with each column containing a number. Then use Python to read through the file you've created. For each line, multiply the first number by the second, and then sum the results from all the lines. Ignore any line that doesn't contain two numeric columns.
Read through a text file, line by line. Use a dict to keep track of how many times each vowel (a, e, i, o, and u) appears in the file. Print the resulting tabulation.
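The first of these ideas might be sketched as follows. The function name is my own invention, and note that str.isdigit ignores negative numbers and decimals, which matches the "only digits" framing here:

```python
def sum_integer_words(filename):
    # Sum every whitespace-separated word that consists only of digits.
    total = 0
    with open(filename) as f:
        for line in f:
            for word in line.split():
                if word.isdigit():
                    total += int(word)
    return total
```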
It’s both common and useful to think of files as sequences of strings. After all, when you iterate over a file object, you get each of the file’s lines as a string, one at a time. But it often makes more sense to turn a file into a more complex data structure, such as a dict.
In this exercise, write a function, passwd_to_dict
, that reads from a Unix-style “password file,” commonly stored as /etc/passwd
, and returns a dict based on it. If you don’t have access to such a file, you can download one that I’ve uploaded at http://mng.bz/2XXg.
Here’s a sample of what the file looks like:
nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
Each line is one user record, divided into colon-separated fields. The first field (index 0) is the username, and the third field (index 2) is the user’s unique ID number. (In the system from which I took the /etc/passwd
file, nobody
has ID -2
, root
has ID 0
, and daemon
has ID 1
.) For our purposes, you can ignore all but these two fields.
Sometimes, the file will contain lines that fail to adhere to this format. For example, we generally ignore lines containing nothing but whitespace. Some vendors (e.g., Apple) include comments in their /etc/passwd
files, in which the line starts with a #
character.
The function passwd_to_dict
should return a dict based on /etc/passwd
in which the dict’s keys are usernames and the values are the users’ IDs.
Once again, we’re opening a text file and iterating over its lines, one at a time. Here, we assume that we know the file’s format, and that we can extract fields from within each record.
In this case, we’re splitting each line across the :
character, using the str.split
method. str.split
always returns a list of strings, although the length of that list depends on the number of times that :
occurs in the string. In the case of /etc/passwd
, we will assume that any line containing :
is a legitimate user record and thus has all of the necessary fields.
However, the file might contain comment lines beginning with #
. If we were to invoke str.split
(http://mng.bz/aR4z) on those lines, we’d get back a list, but one containing only a single element--leading to an IndexError
exception if we tried to retrieve user_info[2]
.
It’s thus important that we ignore those lines that begin with #
. Fortunately, we can use the str.startswith
(http://mng.bz/PAAw) method. Specifically, I identify and discard comment and blank lines using this code:
if not line.startswith(('#', '\n')):
The invocation of str.startswith
passes it a tuple of two strings. str.startswith
will return True
if either of the strings in that tuple is found at the start of the line. Because every line contains a newline, including blank lines, we could say that a line that starts with \n is a blank line.
Assuming that it has found a user record, our program then adds a new key-value pair to users
. The key is user_info[0]
, and the value is user_info[2]
. Notice how we can use user_info[0]
as the name of a key; as long as the value of that variable contains a string, we may use it as a dict key.
I use with
(http://mng.bz/lGG2) here to open the file, thus ensuring that it’s closed when the block ends. (See the sidebar about with
and context managers.)
def passwd_to_dict(filename):
    users = {}
    with open(filename) as passwd:
        for line in passwd:
            if not line.startswith(('#', '\n')):  ❶
                user_info = line.split(':')       ❷
                users[user_info[0]] = int(user_info[2])
    return users

print(passwd_to_dict('/etc/passwd'))
❶ Ignores comment and blank lines
❷ Turns the line into a list of strings
You can work through a version of this code in the Python Tutor at http://mng.bz/lGWR.
Watch this short video walkthrough of the solution: https://livebook.manning.com/ video/python-workout.
At a certain point in your Python career, you'll stop seeing files as sequences of characters on a disk, and start seeing them as raw material you can transform into Python data structures. Our programs have more semantic power when working with structured data (e.g., dicts) than with plain strings: we can do more, and think in deeper ways, if we read a file into a data structure rather than just into a string.
For example, imagine a CSV file in which each line contains the name of a country and its population. Reading this file as a string, it would be possible--but frustrating--to compare the populations of France and Thailand. But reading this file into a dict, it would be trivial to make such a comparison.
Indeed, I’m a particular fan of reading files into dicts, in no small part because many file formats lend themselves to this sort of translation--but you can also use more complex data structures. Here are some additional exercises you can try to help you see that connection and make the transformation in your code:
Read through /etc/passwd
, creating a dict in which user login shells (the final field on each line) are the keys. Each value will be a list of the users for whom that shell is defined as their login shell.
Ask the user to enter integers, separated by spaces. From this input, create a dict whose keys are the factors of the entered numbers, and whose values are lists of those entered integers that are multiples of each factor.
From /etc/passwd
, create a dict in which the keys are the usernames (as in the main exercise) and the values are themselves dicts with keys (and appropriate values) for user ID, home directory, and shell.
Unix systems contain many utility functions. One of the most useful to me is wc
(http://mng.bz/Jyyo), the word count program. If you run wc
against a text file, it’ll count the characters, words, and lines that the file contains.
The challenge for this exercise is to write a wordcount
function that mimics the wc
Unix command. The function will take a filename as input and will print four lines of output:
Number of characters (including whitespace)
Number of words (separated by whitespace)
Number of lines
Number of unique words (case sensitive, so “NO” is different from “no”)
I’ve placed a test file (wcfile.txt
) at http://mng.bz/B2ml. You may download and use that file to test your implementation of wc
. Any file will do, but if you use this one, your results will match up with mine. That file’s contents look like this:
This is a test file.

It contains 28 words and 20 different words.

It also contains 165 characters.

It also contains 11 lines.

It is also self-referential.

Wow!
This exercise, like many others in this chapter, tries to help you see the connections between text files and Python’s built-in data structures. It’s very common to use Python to work with log files and configuration files, collecting and reporting that data in a human-readable format.
This program demonstrates a number of Python’s capabilities that many programmers use on a daily basis. First and foremost, many people who are new to Python believe that if they have to measure four aspects of a file, then they should read through the file four times. That might mean opening the file once and reading through it four times, or even opening it four separate times. But it’s more common in Python to loop over the file once, iterating over each line and accumulating whatever data the program can find from that line.
How will we accumulate this data? We could use separate variables, and there’s nothing wrong with that. But I prefer to use a dict (figure 5.2), since the counts are closely related, and because it also reduces the code I need to produce a report.
So, once we’re iterating over the lines of the file, how can we count the various elements? Counting lines is the easiest part: each iteration goes over one line, so we can simply add 1 to counts['lines']
at the top of the loop.
Next, we want to count the number of characters in the file. Since we’re already iterating over the file, there’s not that much work to do. We get the number of characters in the current line by calculating len(one_line)
, and then adding that to counts['characters']
.
Many people are surprised that this includes whitespace characters, such as spaces and tabs, as well as newlines. Yes, even an “empty” line contains a single newline character. But if we didn’t have newline characters, then it wouldn’t be obvious to the computer when it should start a new line. So such characters are necessary, and they take up some space.
Next, we want to count the number of words. To get this count, we turn one_line
into a list of words, invoking one_line.split
. The solution invokes split
without any arguments, which causes it to use all whitespace--spaces, tabs, and newlines--as delimiters. The result is then put into counts['words']
.
The final item to count is unique words. We could, in theory, use a list to store new words. But it’s much easier to let Python do the hard work for us, using a set
to guarantee the uniqueness. Thus, we create the unique_words
set at the start of the program, and then use unique_words.update
(http://mng.bz/MdOn) to add all of the words in the current line into the set (figure 5.3). For the report to work on our dict, we then add a new key-value pair to counts
, using len(unique_words)
to count the number of words in the set.
def wordcount(filename):
    counts = {'characters': 0, 'words': 0, 'lines': 0}
    unique_words = set()  ❶

    for one_line in open(filename):
        counts['lines'] += 1
        counts['characters'] += len(one_line)
        counts['words'] += len(one_line.split())
        unique_words.update(one_line.split())  ❷

    counts['unique words'] = len(unique_words)  ❸

    for key, value in counts.items():
        print(f'{key}: {value}')

wordcount('wcfile.txt')
❶ You can create sets with curly braces, but not if they’re empty! Use set() to create a new empty set.
❷ set.update adds all of the elements of an iterable to a set.
❸ Sticks the set’s length into counts for a combined report
You can work through a version of this code in the Python Tutor at http://mng.bz/MdZo.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
Creating reports based on files is a common use for Python, and using dicts to accumulate information from those files is also common. Here are some additional things you can try to do, similar to what we did here:
Ask the user to enter the name of a text file and then (on one line, separated by spaces) words whose frequencies should be counted in that file. Count how many times those words appear in a dict, using the user-entered words as the keys and the counts as the values.
Create a dict in which the keys are the names of files on your system and the values are the sizes of those files. To calculate the size, you can use os.stat
(http://mng.bz/dyyo).
Given a directory, read through each file and count the frequency of each letter. (Force letters to be lowercase, and ignore nonletter characters.) Use a dict to keep track of the letter frequencies. What are the five most common letters across all of these files?
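The letter-frequency idea might be sketched with collections.Counter, which is a dict subclass built for exactly this kind of tally (the function name is my own; a full solution would also loop over the directory):

```python
from collections import Counter
from string import ascii_lowercase

def letter_frequencies(filename):
    # Tally each letter in the file, folding uppercase to lowercase
    # and skipping anything that isn't a letter.
    counts = Counter()
    with open(filename) as f:
        for line in f:
            counts.update(c for c in line.lower() if c in ascii_lowercase)
    return counts
```

Counter.most_common(5) would then give you the five most frequent letters.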
So far, we’ve worked with individual files. Many tasks, however, require you to analyze data in multiple files--such as all of the files in a dict. This exercise will give you some practice working with multiple files, aggregating measurements across all of them.
In this exercise, write two functions. find_longest_word
takes a filename as an argument and returns the longest word found in the file. The second function, find_all_longest_words
, takes a directory name and returns a dict in which the keys are filenames and the values are the longest words from each file.
If you don’t have any text files that you can use for this exercise, you can download and use a zip file I’ve created from the five most popular books at Project Gutenberg (https://gutenberg.org/). You can download the zip file from http://mng.bz/rrWj.
Note There are several ways to solve this problem. If you already know how to use comprehensions, and particularly dict comprehensions, then that’s probably the most Pythonic approach. But if you aren’t yet comfortable with them, and would prefer not to jump to read about them in chapter 7, then no worries--you can use a traditional for
loop, and you’ll be just fine.
In this case, you’re being asked to take a directory name and then find the longest word in each plain-text file in that directory. As noted, your function should return a dict in which the dict’s keys are the filenames and the dict’s values are the longest words in each file.
Whenever you hear that you need to transform a collection of inputs into a collection of outputs, you should immediately think about comprehensions--most commonly list comprehensions, but set comprehensions and dict comprehensions are also useful. In this case, we’ll use a dict comprehension--which means that we’ll create a dict based on iterating over a source. The source, in our case, will be a list of filenames. The filenames will also provide the dict keys, while the values will be the result of passing the filenames to a function.
In other words, our dict comprehension will
Iterate over the list of files in the named directory, putting the filename in the variable filename
.
For each file, run the function find_longest_word
, passing filename
as an argument. The return value will be a string: the longest word in the file.
Each filename-longest word combination will become a key-value pair in the dict we create.
How can we implement find_longest_word
? We could read the file’s entire contents into a string, turn that string into a list, and then find the longest word in the list with sorted
. Although this will work well for short files, it’ll use a lot of memory for even medium-sized files.
My solution is thus to iterate over every line of a file, and then over every word in the line. If we find a word that’s longer than the current longest_word
, we replace the old word with the new one. When we’re done iterating over the file, we can return the longest word that we found.
Note my use of os.path.join
(http://mng.bz/oPPM) to combine the directory name with a filename. You can think of os.path.join
as a filename-specific version of str.join
. It has additional advantages, as well, such as taking into account the current operating system. On Windows, os.path.join
will use backslashes, whereas on Macs and Unix/Linux systems, it’ll use a forward slash.
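A quick illustration of that behavior, along with pathlib, the standard library's object-oriented alternative, which spells the same operation with the / operator:

```python
import os.path
from pathlib import Path

# os.path.join inserts the separator for the current OS:
# '/' on macOS and Unix/Linux, '\\' on Windows.
print(os.path.join('mydir', 'myfile.txt'))

# pathlib.Path builds the same path with the / operator:
print(Path('mydir') / 'myfile.txt')
```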
import os

def find_longest_word(filename):
    longest_word = ''
    for one_line in open(filename):
        for one_word in one_line.split():
            if len(one_word) > len(longest_word):
                longest_word = one_word
    return longest_word

def find_all_longest_words(dirname):
    return {filename:
            find_longest_word(os.path.join(dirname, filename))   ❶
            for filename in os.listdir(dirname)                  ❷
            if os.path.isfile(os.path.join(dirname, filename))}  ❸

print(find_all_longest_words('.'))
❶ Gets the filename and its full path
❷ Iterates over all of the files in dirname
❸ We’re only interested in files, not directories or special files.
Because these functions work with directories, there is no Python Tutor link.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
You’ll commonly produce reports about files and file contents using dicts and other basic data structures in Python. Here are a few possible exercises to practice these ideas further:
Use the hashlib
module in the Python standard library, and the md5
function within it, to calculate the MD5 hash for the contents of every file in a user-specified directory. Then print all of the filenames and their MD5 hashes.
Ask the user for a directory name. Show all of the files in the directory, as well as how long ago the directory was modified. You will probably want to use a combination of os.stat
and the Arrow package on PyPI (http://mng.bz/nPPK) to do this easily.
Open an HTTP server’s log file. (If you lack one, then you can read one from me at http://mng.bz/vxxM.) Summarize how many requests resulted in numeric response codes--202, 304, and so on.
In a CSV file, each record is stored on one line, and fields are separated by commas. CSV is commonly used for exchanging information, especially (but not only) in the world of data science. For example, a CSV file might contain information about different vegetables:
lettuce,green,soft
carrot,orange,hard
pepper,green,hard
eggplant,purple,soft
Each line in this CSV file contains three fields, separated by commas. There aren’t any headers describing the fields, although many CSV files do have them.
Sometimes, the comma is replaced by another character, so as to avoid potential ambiguity. My personal favorite is to use a TAB character ('\t' in Python strings).
Python comes with a csv
module (http://mng.bz/Qyyj) that handles writing to and reading from CSV files. For example, you can write to a CSV file with the following code:
import csv

with open('/tmp/stuff.csv', 'w') as f:
    o = csv.writer(f)                      ❶
    o.writerow(range(5))                   ❷
    o.writerow(['a', 'b', 'c', 'd', 'e'])  ❸
❶ Creates a csv.writer object, wrapping our file-like object “f”
❷ Writes the integers from 0-4 to the file, separated by commas
❸ Writes this list of strings as a record to the CSV file, separated by commas
Not all CSV files necessarily look like CSV files. For example, the standard Unix /etc/passwd
file, which contains information about users on a system (but no longer users’ passwords, despite its name), separates fields with :
characters.
For this exercise, create a function, passwd_to_csv
, that takes two filenames as arguments: the first is a passwd
-style file to read from, and the second is the name of a file in which to write the output.
The new file’s contents are the username (index 0) and the user ID (index 2). Note that a record may contain a comment, in which case it will not have anything at index 2; you should take that into consideration when writing the file. The output file should use TAB
characters to separate the elements.
Thus, the input will look like this
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
# I am a comment line
_ftp:*:98:-2::0:0:FTP Daemon:/var/empty:/usr/bin/false
and the output will look like this:
root	0
daemon	1
_ftp	98
Notice that the comment line in the input file is not placed in the output file. You can assume that any line with at least two colon-separated fields is legitimate.
For a complete introduction to working with CSV files in Python, check out http://mng.bz/XPP6/.
The solution program uses a number of aspects of Python that are useful when working with files. We’ve already seen and discussed with
earlier in this chapter. Here, you can see how you can use with
to open two separate files, or generally to define any number of objects. As soon as our block exits, both of the files are automatically closed.
We define two variables in the with
statement, for the two files with which we’ll be working. The passwd
file is opened for reading from /etc/passwd
. The output
file is opened for writing, and writes to /tmp/output.csv
. Our program will act as a go-between, translating from the input file and placing a reformatted subset into the output file.
We do this by creating one instance of csv.reader
, which wraps passwd
. However, because /etc/passwd
uses colons (:
) to delimit fields, we must tell this to csv.reader
. Otherwise, it’ll try to use commas, which will likely lead to an error--or, worse yet, not lead to an error, despite parsing the file incorrectly. Similarly, we define an instance of csv.writer
, wrapping our output
file and indicating that we want to use '\t' as the delimiter.
Now that we have our objects in place for reading and writing CSV data, we can run through the input file, writing a row (line) to the output file for each of those inputs. We take the username (from index 0) and the user ID (from index 2), create a tuple, and pass that tuple to the writer's writerow method
. Our csv.writer
object knows how to take our fields and write them to the file, separated by '\t' characters.
Perhaps the trickiest thing here is to ensure we don’t try to transform lines that contain comments--that is, those which begin with a hash (#
) character. There are a number of ways to do this, but the method that I’ve employed here is simply to check the number of fields we got for the current input line. If there’s only one field, then it must be a comment line, or perhaps another type of malformed line. In such a case, we ignore the line altogether. Another good technique would be to check for #
at the start of the line, perhaps using str.startswith
.
import csv

def passwd_to_csv(passwd_filename, csv_filename):
    with open(passwd_filename) as passwd, \
         open(csv_filename, 'w') as output:
        infile = csv.reader(passwd, delimiter=':')    ❶
        outfile = csv.writer(output, delimiter='\t')  ❷
        for record in infile:
            if len(record) > 1:
                outfile.writerow((record[0], record[2]))
❶ Fields in the input file are separated by colons (“:”).
❷ Fields in the output file are separated by tabs (“\t”).
Because we can’t write to files on the Python Tutor, there is no link for this exercise.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
CSV files are extremely useful and common, and the csv
module that comes with Python works with them very well. If you need something more advanced, then you might want to look into pandas
(http://mng.bz/yyyq), which handles a wide array of CSV variations, as well as many other formats.
Here are several additional exercises you can try to improve your facility with CSV files:
Extend this exercise by asking the user to enter a space-separated list of integers, indicating which fields should be written to the output CSV file. Also ask the user which character should be used as a delimiter in the output file. Then read from /etc/passwd
, writing the user’s chosen fields, separated by the user’s chosen delimiter.
Write a function that writes a dict to a CSV file. Each line in the CSV file should contain three fields: (1) the key, which we’ll assume to be a string, (2) the value, and (3) the type of the value (e.g., str
or int
).
Create a CSV file, in which each line contains 10 random integers between 10 and 100. Now read the file back, and print the sum and mean of the numbers on each line.
JSON (described at http://json.org/) is a popular format for data exchange. In particular, many web services and APIs send and receive data using JSON.
JSON-encoded data can be read into a very large number of programming languages, including Python. The Python standard library comes with the json
module (http://mng.bz/Mddn), which can be used to turn JSON-encoded strings into Python objects, and vice versa. The json.load
function reads JSON-encoded data from a file and returns the corresponding Python objects.
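For example (the file here is just a throwaway temporary file, not part of the exercise), json.loads parses a JSON-formatted string, while json.load does the same for a file object:

```python
import json
import tempfile

# json.loads parses a JSON-formatted string into Python objects
data = json.loads('[{"math": 90, "literature": 98}]')
assert data[0]['math'] == 90

# json.load does the same, but reads from a file object
with tempfile.TemporaryFile(mode='w+') as f:
    json.dump(data, f)   # write the list of dicts back out as JSON
    f.seek(0)
    assert json.load(f) == data
```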
In this exercise, you’re analyzing test data in a high school. There’s a scores
directory on the filesystem containing a number of files in JSON format. Each file represents the scores for one class. Write a function, print_scores
, that takes a directory name as an argument and prints a summary of the student scores it finds.
If you’re trying to analyze the scores from class 9a, they’d be in a file called 9a.json
that looks like this:
[{"math" : 90, "literature" : 98, "science" : 97},
 {"math" : 65, "literature" : 79, "science" : 85},
 {"math" : 78, "literature" : 83, "science" : 75},
 {"math" : 92, "literature" : 78, "science" : 85},
 {"math" : 100, "literature" : 80, "science" : 90}
]
The directory may also contain files for 10th grade (10a.json
, 10b.json
, and 10c.json
) and other grades and classes in the high school. Each file contains the JSON equivalent of a list of dicts, with each dict containing scores for several different school subjects.
Note Valid JSON uses double quotes ("
), not single quotes ('
). This can be surprising and frustrating for Python developers to discover.
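A quick demonstration of the difference:

```python
import json

# double quotes: valid JSON
assert json.loads('{"a": 1}') == {'a': 1}

# single quotes: not valid JSON, and json raises a JSONDecodeError
quotes_rejected = False
try:
    json.loads("{'a': 1}")
except json.JSONDecodeError:
    quotes_rejected = True

assert quotes_rejected
```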
Your function should print the highest, lowest, and average test scores for each subject in each class. Given two files (9a.json
and 9b.json
) in the scores
directory, we would see the following output:
scores/9a.json
science: min 75, max 97, average 86.4
literature: min 78, max 98, average 83.6
math: min 65, max 100, average 85.0
scores/9b.json
science: min 35, max 95, average 82.0
literature: min 38, max 98, average 72.0
math: min 38, max 100, average 77.0
You can download a zipfile with these JSON files from http://mng.bz/Vg1x.
In many languages, the first response to this kind of problem would be “Let’s create our own class!” But in Python, while we can (and often do) create our own classes, it’s often easier and faster to make use of built-in data structures--lists, tuples, and dicts.
In this particular case, we’re reading from a JSON file. JSON is a data representation, much like XML; it isn’t a data type per se. Thus, if we want to create JSON, we must use the json
module to turn our Python data into JSON-formatted strings. And if we want to read from a JSON file, we must read the contents of the file, as strings, into our program, and then turn it into Python data structures.
In this exercise, though, you’re being asked to work on multiple files in one directory. We know that the directory is called scores
and that the files all have a .json
suffix. We could thus use os.listdir
on the directory, filtering (perhaps with a list comprehension) through all of those filenames such that we only work on those ending with .json
.
However, this seems like a more appropriate place to use glob
(http://mng.bz/044N), which takes a Unix-style filename pattern with (among others) *
and ?
characters and returns a list of those filenames that match the pattern. Thus, by invoking glob.glob('scores/*.json')
, we get all of the files ending in .json
within the scores
directory. We can then iterate over that list, assigning the current filename (a string) to filename
.
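As a small sketch of this approach (the directory and the filenames here are throwaway examples created just for illustration):

```python
import glob
import os
import tempfile

# create a scratch directory containing three files, only two of them JSON
tmpdir = tempfile.mkdtemp()
for name in ['9a.json', '9b.json', 'readme.txt']:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write('[]')

# glob matches only the .json files, ignoring everything else
matches = sorted(glob.glob(f'{tmpdir}/*.json'))
assert [os.path.basename(m) for m in matches] == ['9a.json', '9b.json']
```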
Next, we create a new entry in our scores
dict, which is where we’ll store the scores. This will actually be a dict of dicts, in which the first level will be the name of the file--and thus the class--from which we’ve read the data. The second-level keys will be the subjects; the dict’s values will be a list of scores, from which we can then calculate the statistics we need. Thus, once we’ve defined filename
, we immediately add the filename as a key to scores
, with a new empty dict as the value.
Sometimes, you’ll need to read each line of a file into Python and then invoke json.loads
to turn that line into data. In our case, however, the file contains a single JSON array. We must thus use json.load
to read from the file object infile
, which turns the contents of the file into a Python list of dicts.
Because json.load
returns a list of dicts, we can iterate over it. Each test result is placed in the result
variable, which is a dict, in which the keys are the subjects and the values are the scores. Our goal is to reveal some statistics for each of the subjects in the class, which means that while the input file reports scores on a per-student basis, our report will ignore the students in favor of the subjects.
Given that result
is a dict, we can iterate over its key-value pairs with result.items()
, using parallel assignment to iterate over the key and value (here called subject
and score
). Now, we don’t know in advance what subjects will be in our file, nor do we know how many tests there will be. As a result, it’s easiest for us to store our scores in a list. This means that our scores
dict will have one top-level key for each filename, and one second-level key for each subject. The second-level value will be a list, to which we’ll then append with each iteration through the JSON-parsed list.
We’ll want to add our score to the list:
scores[filename][subject]
Before we can do that, we need to make sure the list exists. One easy way to do this is with dict.setdefault
, which assigns a key-value pair to a dict, but only if the key doesn't already exist. In other words, d.setdefault(k, v)
is the same as saying

if k not in d:
    d[k] = v
We use dict.setdefault
(http://mng.bz/aRRB) to create the list if it doesn’t yet exist. In the next line, we add the score to the list for this subject, in this class.
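A minimal demonstration of dict.setdefault in this role, using a few made-up scores:

```python
scores = {}
for subject, score in [('math', 90), ('math', 65), ('science', 97)]:
    scores.setdefault(subject, [])   # create the list only if the key is missing
    scores[subject].append(score)

assert scores == {'math': [90, 65], 'science': [97]}
```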
When we’ve completed our initial for
loop, we have all of the scores for each class. We can then iterate over each class, printing the name of the class.
Then, we iterate over each subject for the class. We once again use the method dict.items
to return a key-value pair--in this case, calling them subject
(for the name of the subject) and subject_scores
(for the list of scores for that subject). We then use an f-string to produce some output, using the built-in min
(http://mng.bz/gyyE) and max
(http://mng.bz/Vgq5) functions, and then combining sum
(http://mng.bz/eQQv) and len
to get the average score.
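For instance, with one subject's list of scores (these numbers are made up to match the 9a science example):

```python
subject_scores = [75, 97, 85, 75, 100]

min_score = min(subject_scores)
max_score = max(subject_scores)
average_score = sum(subject_scores) / len(subject_scores)

assert (min_score, max_score, average_score) == (75, 100, 86.4)
```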
While this program reads from a file containing JSON and then produces output on the user’s screen, it could just as easily read from a network connection containing JSON, and/or write to a file or socket in JSON format. As long as we use built-in and standard Python data structures, the json
module will be able to take our data and turn it into JSON.
import json
import glob

def print_scores(dirname):
    scores = {}
    for filename in glob.glob(f'{dirname}/*.json'):
        scores[filename] = {}
        with open(filename) as infile:
            for result in json.load(infile):                  ❶
                for subject, score in result.items():
                    scores[filename].setdefault(subject, [])  ❷
                    scores[filename][subject].append(score)
    for one_class in scores:                                  ❸
        print(one_class)
        for subject, subject_scores in scores[one_class].items():
            min_score = min(subject_scores)
            max_score = max(subject_scores)
            average_score = (sum(subject_scores) /
                             len(subject_scores))
            print(subject)
            print(f'\tmin {min_score}')
            print(f'\tmax {max_score}')
            print(f'\taverage {average_score}')
❶ Reads from the file infile and turns it from JSON into Python objects
❷ Makes sure that subject exists as a key in scores[filename]
❸ Iterates over each class whose scores we’ve collected
Because these functions work with directories, there is no Python Tutor link.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
Here are some more tasks you can try that use JSON:
Convert /etc/passwd
from a CSV-style file into a JSON-formatted file. The JSON file will contain the equivalent of a list of Python tuples, with each tuple representing one line from the file.
For a slightly different challenge, turn each line in the file into a Python dict. This will require identifying each field with a unique column or key name. If you’re not sure what each field in /etc/passwd
does, you can give it an arbitrary name.
Ask the user for the name of a directory. Iterate through each file in that directory (ignoring subdirectories), getting (via os.stat
) the size of the file and when it was last modified. Create a JSON-formatted file on disk listing each filename, size, and modification timestamp. Then read the file back in, and identify which files were modified most and least recently, and which files are largest and smallest, in that directory.
In many cases, we want to take a file in one format and save it to another format. In this function, we do a basic version of this idea. The function takes two arguments: the names of the input file (to be read from) and the output file (which will be created).
For example, if a file looks like

abc def
ghi jkl

then the output file would contain

fed cba
lkj ihg
Notice that the newline remains at the end of the string, while the rest of the characters are all reversed.
Transforming files from one format into another and taking data from one file and creating another one based on it are common tasks. For example, you might need to translate dates to a different format, move timestamps from Eastern Daylight Time into Greenwich Mean Time, or transform prices from euros into dollars. You might also want to extract only some data from an input file, such as for a particular date or location.
This solution depends not only on the fact that we can iterate over a file one line at a time, but also that we can work with more than one object in a with
statement. Remember that with
takes one or more objects and allows us to assign variables to them. I particularly like the fact that when I want to read from one file and write to another, I can just use with
to open one for reading, open a second for writing, and then do what I’ve shown here.
I then read through each line of the input file. I then reverse the line using Python’s slice syntax--remember that s[::-1]
means that we want all of the elements of s
, from the start to the end, but I use a step size of -1, which returns a reversed version of the string.
Before we can reverse the string, however, we first want to remove the newline character that’s the final character in the string. So we first run str.rstrip()
on the current line, and then we reverse it. We then write it to the output file, adding a newline character so we’ll actually descend by one line.
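On a single line of input, the two steps look like this:

```python
line = 'abc def\n'

stripped = line.rstrip()        # remove the trailing newline (and other whitespace)
reversed_line = stripped[::-1]  # a step of -1 walks the string backward

assert reversed_line == 'fed cba'
# writing f'{reversed_line}\n' then puts the newline back at the end
```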
The use of with
guarantees that both files will be closed when the block ends. When we close a file that we opened for writing, it’s automatically flushed, which means we don’t need to worry about whether the data has actually been saved to disk.
I should note that people often ask me how to read from and write to the same file. Python does support that, with the r+
mode. But I find that this opens the door to many potential problems because of the chance you’ll overwrite the wrong character, and thus mess up the format of the file you’re editing. I suggest that people use this sort of read-from-one, write-to-the-other code, which has roughly the same effect, without the potential danger of messing up the input file.
def reverse_lines(infilename, outfilename):
    with open(infilename) as infile, open(outfilename, 'w') as outfile:
        for one_line in infile:
            outfile.write(f'{one_line.rstrip()[::-1]}\n')  ❶
❶ str.rstrip removes all whitespace from the right side of a string.
Because we can’t write to files on the Python Tutor, there is no link for this exercise.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
Here are some more exercise ideas for translating files from one format to another using with
and this kind of technique:
“Encrypt” a text file by turning all of its characters into their numeric equivalents (with the built-in ord
function) and writing that file to disk. Now “decrypt” the file (using the built-in chr
function), turning the numbers back into their original characters.
Given an existing text file, create two new text files. The new files will each contain the same number of lines as the input file. In one output file, you’ll write all of the vowels (a, e, i, o, and u) from the input file. In the other, you’ll write all of the consonants. (You can ignore punctuation and whitespace.)
The final field in /etc/passwd
is the shell, the Unix command interpreter that’s invoked when a user logs in. Create a file, containing one line per shell, in which the shell’s name is written, followed by all of the usernames that use the shell; for example
/bin/bash:root, jci, user, reuven, atara /bin/sh:spamd, gitlab
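As a possible sketch of the first idea, the “encrypt”/“decrypt” exercise (the space-separated, single-line format and the temporary filename are my own choices, not requirements of the exercise):

```python
import os
import tempfile

text = 'hello'
tmpname = os.path.join(tempfile.mkdtemp(), 'encrypted.txt')

# "encrypt": write one number per character, using ord
with open(tmpname, 'w') as outfile:
    outfile.write(' '.join(str(ord(c)) for c in text))

# "decrypt": read the numbers back and rebuild the string with chr
with open(tmpname) as infile:
    decrypted = ''.join(chr(int(n)) for n in infile.read().split())

assert decrypted == text
```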
It’s almost impossible to imagine writing programs without using files. And while there are many different types of files, Python is especially well suited for working with text files--including (but not only) log files and configuration files, as well as those formatted in standard ways such as JSON and CSV.
It’s important to remember a few things when working with files:
You will typically open files for either reading or writing.
You can (and should) iterate over files one line at a time, rather than reading the whole thing into memory at once.
Using with
when opening a file for writing ensures that the file will be flushed and closed.
The csv
module makes it easy to read from and write to CSV files.
The json
module’s dump
and load
functions allow us to move between Python data structures and JSON-formatted files (dumps and loads do the same with strings).
Reading from files into built-in Python data types is a common and powerful technique.
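As a final illustration of those last points, a quick round trip between a Python dict and a JSON-formatted string:

```python
import json

d = {'math': 90, 'science': 97}

s = json.dumps(d)            # Python dict -> JSON-formatted string
assert json.loads(s) == d    # JSON string -> Python dict, round-trips cleanly
```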