9. Input and Output

Input and output (I/O) is part of all programs. This chapter describes the essentials of Python I/O including data encoding, command-line options, environment variables, file I/O, and data serialization. Particular attention is given to programming techniques and abstractions that encourage proper I/O handling. The end of this chapter gives an overview of common standard library modules related to I/O.

9.1 Data Representation

The main problem of I/O is the outside world. To communicate with it, data must be properly represented, so that it can be manipulated. At the lowest level, Python works with two fundamental datatypes: bytes that represent raw uninterpreted data of any kind and text that represents Unicode characters.

To represent bytes, two built-in types are used, bytes and bytearray. bytes is an immutable sequence of integer byte values. bytearray is a mutable byte array that behaves as a combination of a byte string and a list. Its mutability makes it suitable for building up groups of bytes in a more incremental manner, as when assembling data from fragments. The following example illustrates a few features of bytes and bytearray:

# Specify a bytes literal (note: b' prefix)
a = b'hello'

# Specify bytes from a list of integers
b = bytes([0x68, 0x65, 0x6c, 0x6c, 0x6f])

# Create and populate a bytearray from parts
c = bytearray()
c.extend(b'world')   # c = bytearray(b'world')
c.append(0x21)       # c = bytearray(b'world!')

# Access byte values
print(a[0])     # --> prints 104

for x in b:     # Outputs 104 101 108 108 111
   print(x)

Accessing individual elements of bytes and bytearray objects produces integer byte values, not single-character byte strings. This is different from text strings, so it is a common usage error.

Text is represented by the str datatype and stored as an array of Unicode code points. For example:

d = 'hello'     # Text (Unicode)
len(d)          # --> 5
print(d[0])     # prints 'h'

Python maintains a strict separation between bytes and text. There is never automatic conversion between the two types, comparisons between these types evaluate as False, and any operation that mixes bytes and text together results in an error. For example:

a = b'hello'     # bytes
b = 'hello'      # text
c = 'world'      # text

print(a == b)    # -> False
d = a + c        # TypeError: can't concat str to bytes
e = b + c        # -> 'helloworld' (both are strings)

When performing I/O, make sure you’re working with the right kind of data representation. If you are manipulating text, use text strings. If you are manipulating binary data, use bytes.

9.2 Text Encoding and Decoding

If you work with text, all data read from input must be decoded and all data written to output must be encoded. For explicit conversion between text and bytes, there are encode(text [,errors]) and decode(bytes [,errors]) methods on text and bytes objects, respectively. For example:

a = 'hello'             # Text
b = a.encode('utf-8')   # Encode to bytes

c = b'world'            # Bytes
d = c.decode('utf-8')   # Decode to text

Both encode() and decode() require the name of an encoding such as 'utf-8' or 'latin-1'. The encodings in Table 9.1 are common.

Table 9.1 Common Encodings

Encoding Name

Description

'ascii'

Character values in the range [0x00, 0x7f].

'latin1'

Character values in the range [0x00, 0xff]. Also known as 'iso-8859-1'.

'utf-8'

Variable-length encoding that allows all Unicode characters to be represented.

'cp1252'

A common text encoding on Windows.

'macroman'

A common text encoding on Macintosh.

Additionally, the encoding methods accept an optional errors argument that specifies behavior in the presence of encoding errors. It is one of the values in Table 9.2.

Table 9.2 Error Handling Options

Value

Description

'strict'

Raises a UnicodeError exception for encoding and decoding errors (the default).

'ignore'

Ignores invalid characters.

'replace'

Replaces invalid characters with a replacement character (U+FFFD in Unicode, b'?' in bytes).

'backslashreplace'

Replaces each invalid character with a Python character escape sequence. For example, the character U+1234 is replaced by '\u1234' (encoding only).

'xmlcharrefreplace'

Replaces each invalid character with an XML character reference. For example, the character U+1234 is replaced by '&#4660;' (encoding only).

'surrogateescape'

Replaces any invalid byte '\xhh' with U+DChh on decoding; replaces U+DChh with byte '\xhh' on encoding.

The 'backslashreplace' and 'xmlcharrefreplace' error policies represent unrepresentable characters in a form that allows them to be viewed as simple ASCII text or as XML character references. This can be useful for debugging.
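To see these policies in action, here is a short sketch that decodes a malformed byte string and encodes a non-ASCII character under different policies:

```python
# Decoding bytes that are invalid UTF-8 under different error policies
data = b'Jalape\xf1o'                    # 0xf1 is not valid UTF-8 here

print(data.decode('utf-8', 'ignore'))    # 'Jalapeo' (bad byte dropped)
print(data.decode('utf-8', 'replace'))   # 'Jalape\ufffdo' (U+FFFD marker)

# Encoding a character that ASCII can't represent
text = 'Jalape\xf1o'
print(text.encode('ascii', 'backslashreplace'))    # b'Jalape\\xf1o'
print(text.encode('ascii', 'xmlcharrefreplace'))   # b'Jalape&#241;o'
```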

The 'surrogateescape' error handling policy allows degenerate byte data—data that does not follow the expected encoding rules—to survive a roundtrip decoding/encoding cycle intact regardless of the text encoding being used. Specifically, s.decode(enc, 'surrogateescape').encode(enc, 'surrogateescape') == s. This round-trip preservation of data is useful for certain kinds of system interfaces where a text encoding is expected but can’t be guaranteed due to issues outside of Python’s control. Instead of destroying data with a bad encoding, Python embeds it “as is” using surrogate encoding. Here’s an example of this behavior with an improperly encoded UTF-8 string:

>>> a = b'Spicy Jalape\xf1o'   # Invalid UTF-8
>>> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1
in position 12: invalid continuation byte
>>> a.decode('utf-8', 'surrogateescape')
'Spicy Jalape\udcf1o'
>>> # Encode the resulting string back into bytes
>>> _.encode('utf-8', 'surrogateescape')
b'Spicy Jalape\xf1o'
>>>

9.3 Text and Byte Formatting

A common problem when working with text and byte strings is string conversions and formatting—for example, converting a floating-point number to a string with a given width and precision. To format a single value, use the format() function:

x = 123.456
format(x, '0.2f')       # '123.46'
format(x, '10.4f')      # '  123.4560'
format(x, '*<10.2f')    # '123.46****'

The second argument to format() is a format specifier. The general format of the specifier is [[fill]align][sign][0][width][,][.precision][type] where each part enclosed in [] is optional. The width specifies the minimum field width to use, and the align specifier is one of <, >, or ^ for left, right, and centered alignment within the field. An optional fill character fill is used to pad the space. For example:

name = 'Elwood'
r = format(name, '<10')     # r = 'Elwood    '
r = format(name, '>10')     # r = '    Elwood'
r = format(name, '^10')     # r = '  Elwood  '
r = format(name, '*^10')    # r = '**Elwood**'

The type specifier indicates the type of data. Table 9.3 lists the supported format codes. If not supplied, the default format code is s for strings, d for integers, and f for floats.

Table 9.3 Format Codes

Character

Output Format

d

Decimal integer or long integer.

b

Binary integer or long integer.

o

Octal integer or long integer.

x

Hexadecimal integer or long integer.

X

Hexadecimal integer (uppercase letters).

f, F

Floating point as [-]m.dddddd.

e

Floating point as [-]m.dddddde±xx.

E

Floating point as [-]m.ddddddE±xx.

g, G

Use e or E for exponents less than -4 or greater than the precision; otherwise use f.

n

Same as g except that the current locale setting determines the decimal point character.

%

Multiplies a number by 100 and displays it using f format followed by a % sign.

s

String or any object. The formatting code uses str() to generate strings.

c

Single character.

The sign part of a format specifier is one of +, -, or a space. A + indicates that a leading sign should be used on all numbers. A - is the default and only adds a sign character for negative numbers. A space adds a leading space to positive numbers.
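A quick illustration of the sign options applied to integers:

```python
x = 42
print(format(x, '+d'))     # '+42'  (sign always shown)
print(format(-x, '+d'))    # '-42'
print(format(x, ' d'))     # ' 42'  (space reserved for a sign)
print(format(x, '-d'))     # '42'   (the default behavior)
```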

An optional comma (,) may appear between the width and the precision. This adds a thousands separator character. For example:

x = 123456.78
format(x, '16,.2f')   # '      123,456.78'

The precision part of the specifier supplies the number of digits of accuracy to use for decimals. If a leading 0 is added to the field width for numbers, numeric values are padded with leading 0s to fill the space. Here are some examples of formatting different kinds of numbers:

x = 42
r = format(x, '10d')        # r = '        42'
r = format(x, '10x')        # r = '        2a'
r = format(x, '10b')        # r = '    101010'
r = format(x, '010b')       # r = '0000101010'

y = 3.1415926
r = format(y, '10.2f')      # r = '      3.14'
r = format(y, '10.2e')      # r = '  3.14e+00'
r = format(y, '+10.2f')     # r = '     +3.14'
r = format(y, '+010.2f')    # r = '+000003.14'
r = format(y, '+10.2%')     # r = '  +314.16%'

For more complex string formatting, you can use f-strings:

x = 123.456
f'Value is {x:0.2f}'        # 'Value is 123.46'
f'Value is {x:10.4f}'       # 'Value is   123.4560'
f'Value is {2*x:*<10.2f}'   # 'Value is 246.91****'

Within an f-string, text of the form {expr:spec} is replaced by the value of format(expr, spec). expr can be an arbitrary expression as long as it doesn’t include {, }, or \ characters. Parts of the format specifier itself can optionally be supplied by other expressions. For example:

y = 3.1415926
width = 8
precision=3

r = f'{y:{width}.{precision}f}'   # r = '   3.142'

If you end expr with =, the literal text of expr is also included in the result. For example:

x = 123.456

f'{x=:0.2f}'        # 'x=123.46'
f'{2*x=:0.2f}'      # '2*x=246.91'

If you append !r to a value, formatting is applied to the output of repr(). If you use !s, formatting is applied to the output of str(). For example:

f'{x!r:spec}'      # Calls repr(x).__format__('spec')
f'{x!s:spec}'      # Calls str(x).__format__('spec')
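For example, applying !r changes the padded result because the quotes from repr() become part of the formatted text:

```python
x = 'hello'
print(f'{x!r:>10}')    # "   'hello'" (repr, right-aligned in 10)
print(f'{x!s:>10}')    # '     hello'
```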

As an alternative to f-strings, you can use the .format() method of strings:

x = 123.456

'Value is {:0.2f}'.format(x)             # 'Value is 123.46'
'Value is {0:10.2f}'.format(x)           # 'Value is     123.46'
'Value is {val:*<10.2f}'.format(val=x)   # 'Value is 123.46****'

With a string formatted by .format(), text of the form {arg:spec} is replaced by the value of format(arg, spec). In this case, arg refers to one of the arguments given to the format() method. If omitted entirely, the arguments are taken in order. For example:

name = 'IBM'
shares = 50
price = 490.1

r = '{:>10s} {:10d} {:10.2f}'.format(name, shares, price)
# r = '       IBM         50     490.10'

arg can also refer to a specific argument number or name. For example:

tag = 'p'
text = 'hello world'

r = '<{0}>{1}</{0}>'.format(tag, text)  # r = '<p>hello world</p>'
r = '<{tag}>{text}</{tag}>'.format(tag='p', text='hello world')

Unlike f-strings, the arg value of a specifier cannot be an arbitrary expression, so it’s not quite as expressive. However, the format() method can perform limited attribute lookup, indexing, and nested substitutions. For example:

y = 3.1415926
width = 8
precision=3

r = 'Value is {0:{1}.{2}f}'.format(y, width, precision)

d = {
   'name': 'IBM',
   'shares': 50,
   'price': 490.1
}
r = '{0[shares]:d} shares of {0[name]} at {0[price]:0.2f}'.format(d)
# r = '50 shares of IBM at 490.10'

bytes and bytearray instances can be formatted using the % operator. The semantics of this operator are modeled after the sprintf() function from C. Here are some examples:

name = b'ACME'
x = 123.456

b'Value is %0.2f' % x             # b'Value is 123.46'
bytearray(b'Value is %0.2f') % x  # bytearray(b'Value is 123.46')
b'%s = %0.2f' % (name, x)         # b'ACME = 123.46'

With this formatting, sequences of the form %spec are replaced in order with values from a tuple provided as the second operand to the % operator. The basic format codes (d, f, s, etc.) are the same as those used for the format() function. However, more advanced features are either missing or changed slightly. For example, to adjust alignment, you use a - character like this:

x = 123.456
b'%10.2f' % x     # b'    123.46'
b'%-10.2f' % x    # b'123.46    '

Using a format code of %r produces the output of ascii() which can be useful in debugging and logging.
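For instance, a small sketch (note that with bytes, %r is only retained as an alias for %a for Python 2/3 compatibility):

```python
name = 'Jalape\xf1o'          # Text with a non-ASCII character
print(b'%a' % name)           # b"'Jalape\\xf1o'" (output of ascii())
print(b'%r' % name)           # Same result; %r is an alias for %a
```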

When working with bytes, be aware that text strings are not supported. They need to be explicitly encoded.

name = 'Dave'

b'Hello %s' % name                    # TypeError!
b'Hello %s' % name.encode('utf-8')    # Ok

This form of formatting can also be used with text strings, but it’s considered to be an older programming style. However, it still arises in certain libraries. For example, messages produced by the logging module are formatted in this way:

import logging
log = logging.getLogger(__name__)

log.debug('%s got %d', name, value)  # '%s got %d' % (name, value)

The logging module is briefly described later in this chapter in Section 9.15.12.

9.4 Reading Command-Line Options

When Python starts, command-line options are placed in the list sys.argv as text strings. The first item is the name of the program. Subsequent items are the options added on the command line after the program name. The following program is a minimal prototype of manually processing command-line arguments:

def main(argv):
    if len(argv) != 3:
        raise SystemExit(
              f'Usage: python {argv[0]} inputfile outputfile\n')
    inputfile  = argv[1]
    outputfile = argv[2]
    ...

if __name__ == '__main__':
    import sys
    main(sys.argv)

For better code organization, testing, and similar reasons, it’s a good idea to write a dedicated main() function that accepts the command-line options (if any) as a list, as opposed to directly reading sys.argv. Include a small fragment of code at the end of your program to pass the command-line options to your main() function.

sys.argv[0] contains the name of the script being executed. Writing a descriptive help message and raising SystemExit is standard practice for command-line scripts that want to report an error.

Although in simple scripts, you can manually process command options, consider using the argparse module for more complicated command-line handling. Here is an example:

import argparse

def main(argv):
    p = argparse.ArgumentParser(description="This is some program")

    # A positional argument
    p.add_argument("infile")

    # An option taking an argument
    p.add_argument("-o","--output", action="store")

    # An option that sets a boolean flag
    p.add_argument("-d","--debug", action="store_true", default=False)

    # Parse the command line
    args = p.parse_args(args=argv)

    # Retrieve the option settings
    infile    = args.infile
    output    = args.output
    debugmode = args.debug

    print(infile, output, debugmode)

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])

This example only shows the most simple use of the argparse module. The standard library documentation provides more advanced usage. There are also third-party modules such as click and docopt that can simplify writing of more complex command-line parsers.
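As a quick illustration, parse_args() accepts an explicit argument list, which makes parsers easy to exercise without touching the real command line. The option names below mirror the earlier example:

```python
import argparse

p = argparse.ArgumentParser(description='This is some program')
p.add_argument('infile')
p.add_argument('-o', '--output', action='store')
p.add_argument('-d', '--debug', action='store_true', default=False)

# Parse an explicit list instead of sys.argv[1:]
args = p.parse_args(['data.csv', '-o', 'out.csv', '--debug'])
print(args.infile, args.output, args.debug)   # data.csv out.csv True
```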

Finally, command-line options might be provided to Python in an invalid text encoding. Such arguments are still accepted, but they will be decoded using the 'surrogateescape' error handling described in Section 9.2. You need to be aware of this if such arguments are later included in any kind of text output and it’s critical to avoid crashing. It might not be critical, though; don’t overcomplicate your code for edge cases that don’t matter.

9.5 Environment Variables

Sometimes data is passed to a program via environment variables set in the command shell. For example, a Python program might be launched using a shell command such as env:

bash $ env SOMEVAR=somevalue python3 somescript.py

Environment variables are accessed as text strings in the mapping os.environ. Here’s an example:

import os
path = os.environ['PATH']
user = os.environ['USER']
editor = os.environ['EDITOR']
val = os.environ['SOMEVAR']
... etc ...

To modify environment variables, assign to the os.environ mapping. For example:

os.environ['NAME'] = 'VALUE'

Modifications to os.environ affect both the running program and any subprocesses created later—for example, those created by the subprocess module.

As with command-line options, badly encoded environment variables may produce strings that use the 'surrogateescape' error handling policy.
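A minimal sketch of defensive environment access (DEBUG and TEMPVAR are placeholder names):

```python
import os

# A missing key raises KeyError, so use get() with a default
debug = os.environ.get('DEBUG', '0')      # '0' if DEBUG isn't set

# Setting and removing a variable
os.environ['TEMPVAR'] = 'VALUE'
print(os.environ['TEMPVAR'])              # VALUE
del os.environ['TEMPVAR']
print('TEMPVAR' in os.environ)            # False
```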

9.6 Files and File Objects

To open a file, use the built-in open() function. Usually, open() is given a filename and a file mode. It is also often used in combination with the with statement as a context manager. Here are some common usage patterns of working with files:

# Read a text file all at once as a string
with open('filename.txt', 'rt') as file:
    data = file.read()

# Read a file line-by-line
with open('filename.txt', 'rt') as file:
    for line in file:
        ...

# Write to a text file
with open('out.txt', 'wt') as file:
    file.write('Some output\n')
    print('More output', file=file)

In most cases, using open() is a straightforward affair. You give it the name of the file you want to open along with a file mode. For example:

open('name.txt')        # Opens "name.txt" for reading
open('name.txt', 'rt')  # Opens "name.txt" for reading (same)
open('name.txt', 'wt')  # Opens "name.txt" for writing
open('data.bin', 'rb')  # Binary mode read
open('data.bin', 'wb')  # Binary mode write

For most programs, you will never need to know more than these simple examples to work with files. However, there are a number of special cases and more esoteric features of open() worth knowing. The next few sections discuss open() and file I/O in more detail.

9.6.1 Filenames

To open a file, you need to give open() the name of the file. The name can either be a fully specified absolute pathname such as '/Users/guido/Desktop/files/old/data.csv' or a relative pathname such as 'data.csv' or '..\old\data.csv'. For relative filenames, the file location is determined relative to the current working directory as returned by os.getcwd(). The current working directory can be changed with os.chdir(newdir).

The name itself can be encoded in a number of forms. If it’s a text string, the name is interpreted according to the text encoding returned by sys.getfilesystemencoding() before being passed to the host operating system. If the filename is a byte string, it is left unencoded and is passed as is. This latter option may be useful if you’re writing programs that must handle the possibility of degenerate or miscoded filenames—instead of passing the filename as text, you can pass the raw binary representation of the name. This might seem like an obscure edge case, but Python is commonly used to write system-level scripts that manipulate the filesystem. Abuse of the filesystem is a common technique used by hackers to either hide their tracks or to break system tools.

In addition to text and bytes, any object that implements the special method __fspath__() can be used as a name. The __fspath__() method must return a text or bytes object corresponding to the actual name. This is the mechanism that makes standard library modules such as pathlib work. For example:

>>> from pathlib import Path
>>> p = Path('Data/portfolio.csv')
>>> p.__fspath__()
'Data/portfolio.csv'
>>>

Potentially, you could make your own custom Path object that worked with open() as long as it implements __fspath__() in a way that resolves to a proper filename on the system.
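For instance, here is a minimal sketch of such an object. LoggedPath is a hypothetical class, not part of any library:

```python
import os
import tempfile

class LoggedPath:
    # Hypothetical path-like class: open() accepts it via __fspath__()
    def __init__(self, name):
        self.name = name
    def __fspath__(self):
        return self.name

# Write a temporary file, then open it through the path-like object
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, 'hello.txt')
with open(filename, 'wt') as f:
    f.write('hello')

with open(LoggedPath(filename), 'rt') as f:
    print(f.read())          # hello
```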

Finally, filenames can be given as low-level integer file descriptors. This requires that the “file” is already open on the system in some way. Perhaps it corresponds to a network socket, a pipe, or some other system resource that exposes a file descriptor. Here is an example of opening a file directly with the os module and then turning it into a proper file object:

>>> import os
>>> fd = os.open('/etc/passwd', os.O_RDONLY)
>>> fd
3
>>> file = open(fd, 'rt')
>>> file
<_io.TextIOWrapper name=3 mode='rt' encoding='UTF-8'>
>>> data = file.read()
>>>

When opening an existing file descriptor like this, the close() method of the returned file will also close the underlying descriptor. This can be disabled by passing closefd=False to open(). For example:

file = open(fd, 'rt', closefd=False)

9.6.2 File Modes

When opening a file, you need to specify a file mode. The core file modes are 'r' for reading, 'w' for writing, and 'a' for appending. 'w' mode replaces any existing file with new content. 'a' opens a file for writing and positions the file pointer to the end of the file so that new data can be appended.

A special file mode of 'x' can be used to write to a file, but only if it doesn’t exist already. This is a useful way to prevent accidental overwriting of existing data. For this mode, a FileExistsError exception is raised if the file already exists.
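A short sketch of this mode, using a temporary directory to avoid touching real files:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'log.txt')

with open(path, 'xt') as f:       # Succeeds: the file doesn't exist yet
    f.write('first\n')

try:
    open(path, 'xt')              # Fails: the file now exists
except FileExistsError:
    print('already exists')
```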

Python makes a strict distinction between text and binary data. To specify the kind of data, you append a 't' or a 'b' to the file mode. For example, a file mode of 'rt' opens a file for reading in text mode and 'rb' opens a file for reading in binary mode. The mode determines the kind of data returned by file-related methods such as f.read(). In text mode, strings are returned. In binary mode, bytes are returned.

Binary files can be opened for in-place updates by supplying a plus (+) character, such as 'rb+' or 'wb+'. When a file is opened for update, you can perform both input and output, as long as all output operations flush their data before any subsequent input operations. If a file is opened using 'wb+' mode, its length is first truncated to zero. A common use of the update mode is to provide random read/write access to file contents in combination with seek operations.
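The following sketch illustrates update mode together with seek(), again using a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.bin')

with open(path, 'wb+') as f:     # Truncates, then allows read and write
    f.write(b'Hello World')
    f.seek(0)                    # Rewind before reading back
    print(f.read(5))             # b'Hello'
    f.seek(6)
    f.write(b'Pytho')            # Overwrite part of the file in place

with open(path, 'rb') as f:
    print(f.read())              # b'Hello Pytho'
```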

9.6.3 I/O Buffering

By default, files are opened with I/O buffering enabled. With I/O buffering, I/O operations are performed in larger chunks to avoid excessive system calls. For example, write operations would start filling an internal memory buffer and output would only actually occur when the buffer is filled up. This behavior can be changed by giving a buffering argument to open(). For example:

# Open a binary-mode file with no I/O buffering

with open('data.bin', 'wb', buffering=0) as file:
    file.write(data)
    ...

A value of 0 specifies unbuffered I/O and is only valid for binary mode files. A value of 1 specifies line-buffering and is usually only meaningful for text-mode files. Any other positive value indicates the buffer size to use (in bytes). If no buffering value is specified, the default behavior depends on the kind of file. If it’s a normal file on disk, buffering is managed in blocks and the buffer size is set to io.DEFAULT_BUFFER_SIZE. Typically this is some small multiple of 4096 bytes. It might vary by system. If the file represents an interactive terminal, line buffering is used.

For normal programs, I/O buffering is not typically a major concern. However, buffering can have an impact on applications involving active communication between processes. For example, a problem that sometimes arises is two communicating subprocesses that deadlock due to an internal buffering issue—for instance, one process writes into a buffer, but a receiver never sees that data because the buffer didn’t get flushed. Such problems can be fixed by either specifying unbuffered I/O or by an explicit flush() call on the associated file. For example:

file.write(data)
file.write(data)
...
file.flush()       # Make sure all data is written from buffers

9.6.4 Text Mode Encoding

For files opened in text mode, an optional encoding and an error-handling policy can be specified using the encoding and errors arguments. For example:

with open('file.txt', 'rt',
          encoding='utf-8', errors='replace') as file:
    data = file.read()

The values given to the encoding and errors argument have the same meaning as for the encode() and decode() methods of strings and bytes respectively.

The default text encoding is determined by locale.getpreferredencoding(False) and may vary by system. If you know the encoding in advance, it’s often better to explicitly provide it even if it happens to match the default encoding on your system.

9.6.5 Text-Mode Line Handling

With text files, one complication is the encoding of newline characters. Newlines are encoded as '\n', '\r\n', or '\r' depending on the host operating system—for example, '\n' on UNIX and '\r\n' on Windows. By default, Python translates all of these line endings to a standard '\n' character when reading. On writing, newline characters are translated back to the default line ending used on the system. This behavior is sometimes referred to as “universal newline mode” in Python documentation.

You can change the newline behavior by giving a newline argument to open(). For example:

# Exactly require '\n' and leave intact
file = open('somefile.txt', 'rt', newline='\n')

Specifying newline=None enables the default line handling behavior where all line endings are translated to a standard '\n' character. Giving newline='' makes Python recognize all line endings, but disables the translation step—if lines were terminated by '\r\n', the '\r\n' combination would be left in the input intact. Specifying a value of '\n', '\r', or '\r\n' makes that the expected line ending.
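The effect of the newline argument can be seen by writing raw '\r\n' endings and reading them back two different ways (a sketch using a temporary file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')

# Write raw '\r\n' line endings without translation
with open(path, 'wt', newline='') as f:
    f.write('one\r\ntwo\r\n')

# Default mode translates all endings to '\n'
with open(path, 'rt') as f:
    print(repr(f.read()))        # 'one\ntwo\n'

# newline='' disables translation on input too
with open(path, 'rt', newline='') as f:
    print(repr(f.read()))        # 'one\r\ntwo\r\n'
```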

9.7 I/O Abstraction Layers

The open() function serves as a kind of high-level factory function for creating instances of different I/O classes. These classes embody the different file modes, encodings, and buffering behaviors. They are also composed together in layers. The following classes are defined in the io module:

FileIO(filename, mode='r', closefd=True, opener=None)

Opens a file for raw unbuffered binary I/O. filename is any valid filename accepted by the open() function. Other arguments have the same meaning as for open().

BufferedReader(file [, buffer_size])
BufferedWriter(file [, buffer_size])
BufferedRandom(file [, buffer_size])

Implements a buffered binary I/O layer for a file. file is an instance of FileIO, and the optional buffer_size argument specifies the internal buffer size to use. The choice of class depends on whether the file is reading, writing, or updating data.

TextIOWrapper(buffered [, encoding [, errors [, newline [, line_buffering [, write_through]]]]])

Implements text mode I/O. buffered is a buffered binary mode file, such as BufferedReader or BufferedWriter. The encoding, errors, and newline arguments have the same meaning as for open(). line_buffering is a Boolean flag that forces I/O to be flushed on newline characters (False by default). write_through is a Boolean flag that forces all writes to be flushed (False by default).

Here is an example that shows how a text-mode file is constructed, layer-by-layer:

>>> import io
>>> raw = io.FileIO('filename.txt', 'r')        # Raw-binary mode
>>> buffer = io.BufferedReader(raw)      # Binary buffered reader
>>> file = io.TextIOWrapper(buffer, encoding='utf-8') # Text mode
>>>

Normally you don’t need to manually construct layers like this—the built-in open() function takes care of all of the work. However, if you already have an existing file object and want to change its handling in some way, you might manipulate the layers as shown.

To strip layers away, use the detach() method of a file. For example, here is how you can convert an already text-mode file into a binary-mode file:

f = open('something.txt', 'rt')  # Text-mode file
fb = f.detach()                  # Detach underlying binary mode file
data = fb.read()                 # Returns bytes

9.7.1 File Methods

The exact type of object returned by open() depends on the combination of file mode and buffering options provided. However, the resulting file object supports the methods in Table 9.4.

Table 9.4 File Methods

Method

Description

f.readable()

Returns True if file can be read.

f.read([n])

Reads at most n bytes.

f.readline([n])

Reads a single line of input up to n characters. If n is omitted, this method reads the entire line.

f.readlines([size])

Reads all the lines and returns a list. size optionally specifies the approximate number of characters to read from the file before stopping.

f.readinto(buffer)

Reads data into a memory buffer.

f.writable()

Returns True if file can be written.

f.write(s)

Writes string s.

f.writelines(lines)

Writes all strings in iterable lines.

f.close()

Closes the file.

f.seekable()

Returns True if file supports random-access seeking.

f.tell()

Returns the current file pointer.

f.seek(offset [, whence])

Seeks to a new file position.

f.isatty()

Returns True if f is an interactive terminal.

f.flush()

Flushes the output buffers.

f.truncate([size])

Truncates the file to at most size bytes.

f.fileno()

Returns an integer file descriptor.

The readable(), writable(), and seekable() methods test for supported file capabilities and modes. The read() method returns the entire file as a string unless an optional length parameter specifies the maximum number of characters. The readline() method returns the next line of input, including the terminating newline; the readlines() method returns the entire input file as a list of strings. The readline() method optionally accepts a maximum line length, n. If a line longer than n characters is read, the first n characters are returned. The remaining line data is not discarded and will be returned on subsequent read operations. The readlines() method accepts a size parameter that specifies the approximate number of characters to read before stopping. The actual number of characters read may be larger than this depending on how much data has been buffered already. The readinto() method is used to avoid memory copies and is discussed later.

read() and readline() indicate end-of-file (EOF) by returning an empty string. Thus, the following code shows how you can detect an EOF condition:

while True:
    line = file.readline()
    if not line:        # EOF
        break
    statements
    ...

You could also write this code as follows:

while (line:=file.readline()):
    statements
    ...

A convenient way to read all lines in a file is to use iteration with a for loop:

for line in file:    # Iterate over all lines in the file
    # Do something with line
    ...

The write() method writes data to the file, and the writelines() method writes an iterable of strings to the file. write() and writelines() do not add newline characters to the output, so all output that you produce should already include all necessary formatting.

Internally, each open file object keeps a file pointer that stores the byte offset at which the next read or write operation will occur. The tell() method returns the current value of the file pointer. The seek(offset [,whence]) method is used to randomly access parts of a file given an integer offset and a placement rule in whence. If whence is os.SEEK_SET (the default), seek() assumes that offset is relative to the start of the file; if whence is os.SEEK_CUR, the position is moved relative to the current position; and if whence is os.SEEK_END, the offset is taken from the end of the file.
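A short sketch of tell() and seek(), using a temporary binary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(path, 'wb') as f:
    f.write(b'abcdefgh')

with open(path, 'rb') as f:
    print(f.tell())              # 0
    f.seek(4)                    # Absolute offset (os.SEEK_SET default)
    print(f.read(2))             # b'ef'
    f.seek(-2, os.SEEK_END)      # Two bytes back from the end of file
    print(f.read())              # b'gh'
```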

The fileno() method returns the integer file descriptor for a file and is sometimes used in low-level I/O operations in certain library modules. For example, the fcntl module uses the file descriptor to provide low-level file control operations on UNIX systems.

The readinto() method is used to perform zero-copy I/O into contiguous memory buffers. It is most commonly used in combination with specialized libraries such as numpy—for example, to read data directly into the memory allocated for a numeric array.
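For example, here is a minimal sketch of readinto() with a preallocated bytearray (no numpy required):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as f:
    f.write(b'x' * 32)

buf = bytearray(16)              # Preallocated buffer
with open(path, 'rb') as f:
    n = f.readinto(buf)          # Fills buf in place; returns byte count
    print(n)                     # 16
```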

File objects also have the read-only data attributes shown in Table 9.5.

Table 9.5 File Attributes

Attribute         Description
f.closed          Boolean value indicating the file state: False if the file is open, True if closed.
f.mode            The I/O mode for the file.
f.name            Name of the file if created using open(). Otherwise, a string indicating the source of the file.
f.newlines        The newline representation actually found in the file. The value is None if no newlines have been encountered, a string containing '\n', '\r', or '\r\n', or a tuple containing all the different newline encodings seen.
f.encoding        A string that indicates the file encoding, if any (for example, 'latin-1' or 'utf-8'). The value is None if no encoding is being used.
f.errors          The error handling policy.
f.write_through   Boolean value indicating if writes on a text file pass data directly to the underlying binary-level file without buffering.

9.8 Standard Input, Output, and Error

The interpreter provides three standard file-like objects, known as standard input, standard output, and standard error, available as sys.stdin, sys.stdout, and sys.stderr, respectively. stdin is a file object corresponding to the stream of input characters supplied to the interpreter, stdout is the file object that receives output produced by print(), and stderr is a file that receives error messages. More often than not, stdin is mapped to the user’s keyboard, whereas stdout and stderr produce text on screen.

The methods described in the preceding section can be used to perform I/O with the user. For example, the following code writes to standard output and reads a line of input from standard input:

import sys
sys.stdout.write("Enter your name : ")
name = sys.stdin.readline()

Alternatively, the built-in function input(prompt) can read a line of text from stdin and optionally print a prompt:

name = input("Enter your name : ")

Lines read by input() do not include the trailing newline. This is different from reading directly from sys.stdin, where newlines are included in the input text.

If necessary, the values of sys.stdout, sys.stdin, and sys.stderr can be replaced with other file objects, in which case the print() and input() functions will use the new values. Should it ever be necessary to restore the original value of sys.stdout, it should be saved first. The original values of sys.stdout, sys.stdin, and sys.stderr at interpreter startup are also available in sys.__stdout__, sys.__stdin__, and sys.__stderr__, respectively.
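To illustrate, here is a minimal sketch of capturing print() output by temporarily swapping sys.stdout for an io.StringIO object, saving the original value first:

```python
import io
import sys

saved = sys.stdout              # Save the original value first
sys.stdout = io.StringIO()
try:
    print('hello')              # Output goes to the StringIO object
    captured = sys.stdout.getvalue()
finally:
    sys.stdout = saved          # Restore standard output

print(repr(captured))           # 'hello\n'
```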

9.9 Directories

To get a directory listing, use the os.listdir(pathname) function. For example, here is how to print out a list of filenames in a directory:

import os

names = os.listdir('dirname')
for name in names:
    print(name)

Names returned by listdir() are normally decoded according to the encoding returned by sys.getfilesystemencoding(). If you specify the initial path as bytes, the filenames are returned as undecoded byte strings. For example:

import os

# Return raw undecoded names
names = os.listdir(b'dirname')

A useful operation related to directory listing is matching filenames according to a pattern, known as globbing. The pathlib module can be used for this purpose. For example, here is how to match all *.txt files in a specific directory:

import pathlib

for filename in pathlib.Path(dirname).glob('*.txt'):
    print(filename)

If you use rglob() instead of glob(), it will recursively search all subdirectories for filenames that match the pattern. Both the glob() and rglob() functions return a generator that produces the filenames via iteration.

9.10 The print() function

To print a series of values separated by spaces, supply them all to print() like this:

print('The values are', x, y, z)

To suppress or change the line ending, use the end keyword argument:

# Suppress the newline
print('The values are', x, y, z, end='')

To redirect the output to a file, use the file keyword argument:

# Redirect to file object f
print('The values are', x, y, z, file=f)

To change the separator character between items, use the sep keyword argument:

# Put commas between the values
print('The values are', x, y, z, sep=',')

9.11 Generating Output

Working directly with files is most familiar to programmers. However, generator functions can also be used to emit an I/O stream as a sequence of data fragments. To do this, use the yield statement as you would use a write() or print() function. Here is an example:

def countdown(n):
    while n > 0:
        yield f'T-minus {n}\n'
        n -= 1
    yield 'Kaboom!\n'

Producing an output stream in this manner provides flexibility because it is decoupled from the code that actually directs the stream to its intended destination. For example, if you want to route the above output to a file f, you can do this:

lines = countdown(5)
f.writelines(lines)

If, instead, you want to redirect the output across a socket s, you can do this:

for chunk in lines:
    s.sendall(chunk)

Or, if you simply want to capture all of the output in a single string, you can do this:

out = ''.join(lines)

More advanced applications can use this approach to implement their own I/O buffering. For example, a generator could be emitting small text fragments, but another function would then collect the fragments into larger buffers to create a more efficient single I/O operation.

chunks = []
buffered_size = 0
for chunk in countdown(5):
    chunks.append(chunk)
    buffered_size += len(chunk)
    if buffered_size >= MAXBUFFERSIZE:
        outf.write(''.join(chunks))
        chunks.clear()
        buffered_size = 0
outf.write(''.join(chunks))

For programs that are routing output to files or network connections, a generator approach can also result in a significant reduction in memory use because the entire output stream can often be generated and processed in small fragments, as opposed to being first collected into one large output string or list of strings.

9.12 Consuming Input

For programs that consume fragmentary input, enhanced generators can be useful for decoding protocols and other facets of I/O. Here is an example of an enhanced generator that receives byte fragments and assembles them into lines:

def line_receiver():
    data = bytearray()
    line = None
    linecount = 0
    while True:
        part = yield line
        linecount += part.count(b'\n')
        data.extend(part)
        if linecount > 0:
            index = data.index(b'\n')
            line = bytes(data[:index+1])
            data = data[index+1:]
            linecount -= 1
        else:
            line = None

In this example, a generator has been programmed to receive byte fragments that are collected into a byte array. If the array contains a newline, a line is extracted and returned. Otherwise, None is returned. Here’s an example illustrating how it works:

>>> r = line_receiver()
>>> r.send(None)    # Necessary first step
>>> r.send(b'hello')
>>> r.send(b' world\nit ')
b'hello world\n'
>>> r.send(b'works!')
>>> r.send(b'\n')
b'it works!\n'
>>>

An interesting side effect of this approach is that it externalizes the actual I/O operations that must be performed to get the input data. Specifically, the implementation of line_receiver() contains no I/O operations at all! This means that it could be used in different contexts. For example, with sockets:

r = line_receiver()
data = None
while True:
    while not (line:=r.send(data)):
        data = sock.recv(8192)

    # Process the line
    ...

or with files:

r = line_receiver()
data = None
while True:
    while not (line:=r.send(data)):
        data = file.read(10000)

    # Process the line
    ...

or even in asynchronous code:

async def reader(ch):
    r = line_receiver()
    data = None
    while True:
        while not (line:=r.send(data)):
            data = await ch.receive(8192)

        # Process the line
        ...

9.13 Object Serialization

Sometimes it’s necessary to serialize the representation of an object so it can be transmitted over the network, saved to a file, or stored in a database. One way to do this is to convert data into a standard encoding such as JSON or XML. There is also a common Python-specific data serialization format called Pickle.

The pickle module serializes an object into a stream of bytes that can be used to reconstruct the object at a later point in time. The interface to pickle is simple, consisting of two operations, dump() and load(). For example, the following code writes an object to a file:

import pickle
obj = SomeObject()
with open(filename, 'wb') as file:
   pickle.dump(obj, file)      # Save object on f

To restore the object, use:

with open(filename, 'rb') as file:
    obj = pickle.load(file)   # Restore the object

The data format used by pickle has its own framing of records. Thus, a sequence of objects can be saved by issuing a series of dump() operations one after the other. To restore these objects, simply use a similar sequence of load() operations.
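For example, here is a small sketch (using an in-memory io.BytesIO in place of a real file) that writes several objects with repeated dump() calls and restores them with a matching series of load() calls:

```python
import io
import pickle

objs = [1, 'two', [3, 4]]

# Write a series of framed pickle records
file = io.BytesIO()
for obj in objs:
    pickle.dump(obj, file)

# Read them back, one load() per record
file.seek(0)
restored = [pickle.load(file) for _ in objs]
print(restored)                 # [1, 'two', [3, 4]]
```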

For network programming, it is common to use pickle to create byte-encoded messages. To do that, use dumps() and loads(). Instead of reading/writing data to a file, these functions work with byte strings.

obj = SomeObject()

# Turn an object into bytes
data = pickle.dumps(obj)
...

# Turn bytes back into an object
obj = pickle.loads(data)

It is not normally necessary for user-defined objects to do anything extra to work with pickle. However, certain kinds of objects can’t be pickled. These tend to be objects that incorporate runtime state—open files, threads, closures, generators, and so on. To handle these tricky cases, a class can define the special methods __getstate__() and __setstate__().

The __getstate__() method, if defined, will be called to create a value representing the state of an object. The value returned by __getstate__() is typically a string, tuple, list, or dictionary. The __setstate__() method receives this value during unpickling and should restore the state of an object from it.
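Here is a hedged sketch of these methods in use: a class that holds a threading.Lock (which cannot be pickled) drops the lock when pickling and recreates it when unpickling. The class itself is made up for illustration:

```python
import pickle
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()   # Not picklable

    def __getstate__(self):
        # Return only the picklable state
        return {'value': self.value}

    def __setstate__(self, state):
        # Restore state and recreate the lock
        self.value = state['value']
        self.lock = threading.Lock()

c = Counter()
c.value = 42
d = pickle.loads(pickle.dumps(c))
print(d.value)                         # 42
```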

When encoding an object, pickle does not include the underlying source code itself. Instead, it encodes a name reference to the defining class. When unpickling, this name is used to perform a source-code lookup on the system. For unpickling to work, the recipient of a pickle must have the proper source code already installed. It is also important to emphasize that pickle is inherently insecure—unpickling untrusted data is a known vector for remote code execution. Thus, pickle should only be used if you can completely secure the runtime environment.

9.14 Blocking Operations and Concurrency

A fundamental aspect of I/O is the concept of blocking. By its very nature, I/O is connected to the real world. It often involves waiting for input or devices to be ready. For example, code that reads data on the network might perform a receive operation on a socket like this:

data = sock.recv(8192)

When this statement executes, it might return immediately if data is available. However, if that’s not the case, it will stop—waiting for data to arrive. This is blocking. While the program is blocked, nothing else happens.

For a data analysis script or a simple program, blocking is not something that you worry about. However, if you want your program to do something else while an operation is blocked, you will need to take a different approach. This is the fundamental problem of concurrency—having a program work on more than one thing at a time. One common problem is having a program read on two or more different network sockets at the same time:

def reader1(sock):
    while (data := sock.recv(8192)):
        print('reader1 got:', data)

def reader2(sock):
    while (data := sock.recv(8192)):
        print('reader2 got:', data)

# Problem: How to make reader1() and reader2()
# run at the same time?

The rest of this section outlines a few different approaches to solving this problem. However, it is not meant to be a full tutorial on concurrency. For that, you will need to consult other resources.

9.14.1 Nonblocking I/O

One approach to avoiding blocking is to use so-called nonblocking I/O. This is a special mode that has to be enabled—for example, on a socket:

sock.setblocking(False)

Once enabled, an exception will now be raised if an operation would have blocked. For example:

try:
    data = sock.recv(8192)
except BlockingIOError as e:
    # No data is available
    ...

In response to a BlockingIOError, the program could elect to work on something else. It could retry the I/O operation later to see if any data has arrived. For example, here’s how you might read on two sockets at once:

def reader1(sock):
    try:
        data = sock.recv(8192)
        print('reader1 got:', data)
    except BlockingIOError:
        pass

def reader2(sock):
    try:
        data = sock.recv(8192)
        print('reader2 got:', data)
    except BlockingIOError:
        pass

def run(sock1, sock2):
    sock1.setblocking(False)
    sock2.setblocking(False)
    while True:
        reader1(sock1)
        reader2(sock2)

In practice, relying only on nonblocking I/O is clumsy and inefficient. For example, the core of this program is the run() function at the end. It will run in an inefficient busy loop as it constantly tries to read on the sockets. This works, but it is not a good design.

9.14.2 I/O Polling

Instead of relying upon exceptions and spinning, it is possible to poll I/O channels to see if data is available. The select or selectors module can be used for this purpose. For example, here’s a slightly modified version of the run() function:

from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE

def run(sock1, sock2):
    selector = DefaultSelector()
    selector.register(sock1, EVENT_READ, data=reader1)
    selector.register(sock2, EVENT_READ, data=reader2)
    # Wait for something to happen
    while True:
        for key, evt in selector.select():
            func = key.data
            func(key.fileobj)

In this code, the loop dispatches either reader1() or reader2() function as a callback whenever I/O is detected on the appropriate socket. The selector.select() operation itself blocks, waiting for I/O to occur. Thus, unlike the previous example, it won’t make the CPU furiously spin.

This approach to I/O is the foundation of many so-called “async” frameworks such as asyncio, although you usually don’t see the inner workings of the event loop.

9.14.3 Threads

In the last two examples, concurrency required the use of a special run() function to drive the calculation. As an alternative, you can use thread programming and the threading module. Think of a thread as an independent task that runs inside your program. Here is an example of code that reads data on two sockets at once:

import threading

def reader1(sock):
    while (data := sock.recv(8192)):
        print('reader1 got:', data)

def reader2(sock):
    while (data := sock.recv(8192)):
        print('reader2 got:', data)

t1 = threading.Thread(target=reader1, args=[sock1])
t2 = threading.Thread(target=reader2, args=[sock2])

# Start the threads
t1.start()
t2.start()

# Wait for the threads to finish
t1.join()
t2.join()

In this program, the reader1() and reader2() functions execute concurrently. This is managed by the host operating system, so you don’t need to know much about how it works. If a blocking operation occurs in one thread, it does not affect the other thread.

The subject of thread programming is, in its entirety, beyond the scope of this book. However, a few additional examples are provided in the threading module section later in this chapter.

9.14.4 Concurrent Execution with asyncio

The asyncio module provides a concurrency implementation alternative to threads. Internally, it’s based on an event loop that uses I/O polling. However, the high-level programming model looks very similar to threads through the use of special async functions. Here is an example:

import asyncio

async def reader1(sock):
    loop = asyncio.get_event_loop()
    while (data := await loop.sock_recv(sock, 8192)):
        print('reader1 got:', data)

async def reader2(sock):
    loop = asyncio.get_event_loop()
    while (data := await loop.sock_recv(sock, 8192)):
        print('reader2 got:', data)

async def main(sock1, sock2):
    loop = asyncio.get_event_loop()
    t1 = loop.create_task(reader1(sock1))
    t2 = loop.create_task(reader2(sock2))

    # Wait for the tasks to finish
    await t1
    await t2

...
# Run it
asyncio.run(main(sock1, sock2))

Full details of using asyncio would require its own dedicated book. What you should know is that many libraries and frameworks advertise support for asynchronous operation. Usually that means that concurrent execution is supported through asyncio or a similar module. Much of the code is likely to involve async functions and related features.

9.15 Standard Library Modules

A large number of standard library modules are used for various I/O related tasks. This section provides a brief overview of the commonly used modules, along with a few examples. Complete reference material can be found online or in an IDE and is not repeated here. The main purpose of this section is to point you in the right direction by giving you the names of the modules that you should be using along with a few examples of very common programming tasks involving each module.

Many of the examples are shown as interactive Python sessions. These are experiments that you are encouraged to try yourself.

9.15.1 asyncio Module

The asyncio module provides support for concurrent I/O operations using I/O polling and an underlying event loop. Its primary use is in code involving networks and distributed systems. Here is an example of a TCP echo server using low-level sockets:

import asyncio
from socket import *

async def echo_server(address):
    loop = asyncio.get_event_loop()
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    sock.setblocking(False)
    print('Server listening at', address)
    with sock:
        while True:
            client, addr = await loop.sock_accept(sock)
            print('Connection from', addr)
            loop.create_task(echo_client(loop, client))

async def echo_client(loop, client):
    with client:
        while True:
            data = await loop.sock_recv(client, 10000)
            if not data:
                break
            await loop.sock_sendall(client, b'Got:' + data)
    print('Connection closed')

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.create_task(echo_server(('', 25000)))
    loop.run_forever()

To test this code, use a program such as nc or telnet to connect to port 25000 on your machine. The code should echo back the text that you type. If you connect more than once using multiple terminal windows, you’ll find that the code can handle all of the connections concurrently.

Most applications using asyncio will probably operate at a higher level than sockets. However, in such applications, you will still have to make use of special async functions and interact with the underlying event loop in some manner.

9.15.2 binascii Module

The binascii module has functions for converting binary data into various text-based representations such as hexadecimal and base64. For example:

>>> import binascii
>>> binascii.b2a_hex(b'hello')
b'68656c6c6f'
>>> binascii.a2b_hex(_)
b'hello'
>>> binascii.b2a_base64(b'hello')
b'aGVsbG8=\n'
>>> binascii.a2b_base64(_)
b'hello'
>>>

Similar functionality can be found in the base64 module as well as with the hex() and fromhex() methods of bytes. For example:

>>> a = b'hello'
>>> a.hex()
'68656c6c6f'
>>> bytes.fromhex(_)
b'hello'
>>> import base64
>>> base64.b64encode(a)
b'aGVsbG8='
>>>

9.15.3 cgi Module

So, let’s say you just want to put a basic form on your website. Perhaps it’s a sign-up form for your weekly “Cats and Categories” newsletter. Sure, you could install the latest web framework and spend all of your time fiddling with it. Or, you could also just write a basic CGI script—old-school style. The cgi module is for doing just that.

Suppose you have the following form fragment on a webpage:

<form method="POST" action="cgi-bin/register.py">
   <p>
   To register, please provide a contact name and email address.
   </p>
   <div>
      <input name="name" type="text">Your name:</input>
   </div>
   <div>
      <input name="email" type="email">Your email:</input>
   </div>
   <div class="modal-footer justify-content-center">
      <input type="submit" name="submit" value="Register"></input>
   </div>
</form>

Here’s a CGI script that receives the form data on the other end:

#!/usr/bin/env python
import cgi
try:
    form = cgi.FieldStorage()
    name = form.getvalue('name')
    email = form.getvalue('email')
    # Validate the responses and do whatever
    ...
    # Produce an HTML result (or redirect)
    print("Status: 302 Moved\r\n")
    print("Location: https://www.mywebsite.com/thanks.html\r\n")
    print("\r\n")
except Exception as e:
    print("Status: 501 Error\r\n")
    print("Content-type: text/plain\r\n")
    print("\r\n")
    print("Some kind of error occurred.\r\n")

Will writing such a CGI script get you a job at an Internet startup? Probably not. Will it solve your actual problem? Likely.

9.15.4 configparser Module

INI files are a common format for encoding program configuration information in a human-readable form. Here is an example:

# config.ini

; A comment
[section1]
name1 = value1
name2 = value2

[section2]
; Alternative syntax
name1: value1
name2: value2

The configparser module is used to read .ini files and extract values. Here’s a basic example:

import configparser

# Create a config parser and read a file
cfg = configparser.ConfigParser()
cfg.read('config.ini')

# Extract values
a = cfg.get('section1', 'name1')
b = cfg.get('section2', 'name2')
...

More advanced functionality is also available, including string interpolation features, the ability to merge multiple .ini files, provide default values, and more. Consult the official documentation for more examples.
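As a hedged sketch of two of those features, defaults and basic string interpolation (the section and option names below are made up for illustration):

```python
import configparser

# 'root' is supplied as a default, available to all sections
cfg = configparser.ConfigParser(defaults={'root': '/tmp'})
cfg.read_string("""
[paths]
logdir = %(root)s/logs
""")

# %(root)s is interpolated from the defaults
logdir = cfg.get('paths', 'logdir')
print(logdir)                   # /tmp/logs
```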

9.15.5 csv Module

The csv module is used to read/write files of comma-separated values (CSV) produced by programs such as Microsoft Excel or exported from a database. To use it, open a file and then wrap an extra layer of CSV encoding/decoding around it. For example:

import csv

# Read a CSV file into a list of tuples
def read_csv_data(filename):
    with open(filename) as file:
        rows = csv.reader(file)
        # First line is often a header. This reads it
        headers = next(rows)
        # Now read the rest of the data
        for row in rows:
            # Do something with row
            ...

# Write Python data to a CSV file
def write_csv_data(filename, headers, rows):
    with open(filename, "w", newline='') as file:
        out = csv.writer(file)
        out.writerow(headers)
        out.writerows(rows)

An often used convenience is to use a DictReader() instead. This interprets the first line of a CSV file as headers and returns each row as a dictionary instead of a tuple.

import csv

def find_nearby(filename):
    with open(filename) as file:
        rows = csv.DictReader(file)
        for row in rows:
            lat = float(row['latitude'])
            lon = float(row['longitude'])
            if close_enough(lat, lon):
                print(row)

The csv module doesn’t do much with CSV data other than reading or writing it. The main benefit provided is that the module knows how to properly encode/decode the data and handles a lot of edge cases involving quotation, special characters, and other details. This is a module you might use to write simple scripts for cleaning or preparing data to be used with other programs. If you want to perform data analysis tasks with CSV data, consider using a third-party package such as the popular pandas library.

9.15.6 errno Module

Whenever a system-level error occurs, Python reports it with an exception that’s a subclass of OSError. Some of the more common kinds of system errors are represented by separate subclasses of OSError such as PermissionError or FileNotFoundError. However, there are hundreds of other errors that could occur in practice. For these, any OSError exception carries a numeric errno attribute that can be inspected. The errno module provides symbolic constants corresponding to these error codes. They are often used when writing specialized exception handlers. For example, here is an exception handler that checks for no space remaining on a device:

import errno

def write_data(file, data):
    try:
        file.write(data)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            print("You're out of disk space!")
        else:
            raise       # Some other error. Propagate

9.15.7 fcntl Module

The fcntl module is used to perform low-level I/O control operations on UNIX using the fcntl() and ioctl() system calls. This is also the module to use if you want to perform any kind of file locking—a problem that sometimes arises in the context of concurrency and distributed systems. Here is an example of opening a file in combination with mutual exclusion locking across all processes using fcntl.flock():

import fcntl

with open("somefile", "r") as file:
     try:
         fcntl.flock(file.fileno(), fcntl.LOCK_EX)
         # Use the file
         ...
     finally:
         fcntl.flock(file.fileno(), fcntl.LOCK_UN)

9.15.8 hashlib Module

The hashlib module provides functions for computing cryptographic hash values such as MD5 and SHA-1. The following example illustrates how to use the module:

>>> import hashlib
>>> h = hashlib.new('sha256')
>>> h.update(b'Hello')    # Feed data
>>> h.update(b'World')
>>> h.digest()
b'\xa5\x91\xa6\xd4\x0b\xf4 @J\x01\x173\xcf\xb7\xb1\x90\xd6,e\xbf\x0b\xcd\xa3+W\xb2w\xd9\xad\x9f\x14n'
>>> h.hexdigest()
'a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e'
>>> h.digest_size
32
>>>

9.15.9 http Package

The http package contains a large amount of code related to the low-level implementation of the HTTP internet protocol. It can be used to implement both servers and clients. However, most of this package is considered legacy and too low-level for day-to-day work. Serious programmers working with HTTP are more likely to use third-party libraries such as requests, httpx, Django, flask, and others.

Nevertheless, one useful easter egg of the http package is the ability for Python to run a standalone web server. Go to a directory with a collection of files and type the following:

$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Now, Python will serve the files to your browser if you point it at the right port. You wouldn’t use this to run a website, but it can be useful for testing and debugging programs related to the web. For example, the author has used this to locally test programs involving a mix of HTML, Javascript, and WebAssembly.

9.15.10 io Module

The io module primarily contains the definitions of classes used to implement the file objects as returned by the open() function. It is not so common to access those classes directly. However, the module also contains a pair of classes that are useful for “faking” a file in the form of strings and bytes. This can be useful for testing and other applications where you need to provide a “file” but have obtained data in a different way.

The StringIO() class provides a file-like interface on top of strings. For example, here is how you can write output to a string:

# Function that expects a file
def greeting(file):
    file.write('Hello\n')
    file.write('World\n')

# Call the function using a real file
with open('out.txt', 'w') as file:
    greeting(file)

# Call the function with a "fake" file
import io
file = io.StringIO()
greeting(file)

# Get the resulting output
output = file.getvalue()

Similarly, you can create a StringIO object and use it for reading:

file = io.StringIO('hello\nworld\n')
while (line := file.readline()):
    print(line, end='')

The BytesIO() class serves a similar purpose but is used for emulating binary I/O with bytes.
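For example, a minimal sketch of BytesIO used in both directions:

```python
import io

# Writing to an in-memory byte stream
file = io.BytesIO()
file.write(b'binary ')
file.write(b'data')
output = file.getvalue()
print(output)                   # b'binary data'

# Reading from an in-memory byte stream
file = io.BytesIO(b'line1\nline2\n')
lines = file.readlines()
print(lines)                    # [b'line1\n', b'line2\n']
```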

9.15.11 json Module

The json module can be used to encode and decode data in the JSON format, commonly used in the APIs of microservices and web applications. There are two basic functions for converting data, dumps() and loads(). dumps() takes a Python dictionary and encodes it as a JSON Unicode string:

>>> import json
>>> data = { 'name': 'Mary A. Python', 'email': '[email protected]' }
>>> s = json.dumps(data)
>>> s
'{"name": "Mary A. Python", "email": "[email protected]"}'
>>>

The loads() function goes in the other direction:

>>> d = json.loads(s)
>>> d == data
True
>>>

Both the dumps() and loads() functions have many options for controlling aspects of the conversion as well as interfacing with Python class instances. That’s beyond the scope of this section but copious amounts of information is available in the official documentation.
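For instance, the indent and sort_keys options of dumps() produce readable, deterministic output:

```python
import json

data = {'b': 1, 'a': 2}
s = json.dumps(data, indent=2, sort_keys=True)
print(s)
# {
#   "a": 2,
#   "b": 1
# }
```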

9.15.12 logging Module

The logging module is the de facto standard module used for reporting program diagnostics and for print-style debugging. It can be used to route output to a log file and provides a large number of configuration options. A common practice is to write code that creates a Logger instance and issues messages on it like this:

import logging
log = logging.getLogger(__name__)

# Function that uses logging
def func(args):
    log.debug("A debugging message")
    log.info("An informational message")
    log.warning("A warning message")
    log.error("An error message")
    log.critical("A critical message")

# Configuration of logging (occurs once at program startup)
if __name__ == '__main__':
    logging.basicConfig(
         level=logging.WARNING,
         filename="output.log"
    )

There are five built-in levels of logging ordered by increasing severity. When configuring the logging system, you specify a level that acts as a filter. Only messages at that level or greater severity are reported. Logging provides a large number of configuration options, mostly related to the back-end handling of the log messages. Usually you don’t need to know about that when writing application code—you use debug(), info(), warning(), and similar methods on some given Logger instance. Any special configuration takes place during program startup in a special location (such as a main() function or the main code block).
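A small sketch of the level filter in action, logging to an in-memory stream (the force=True option, which resets any prior configuration, requires Python 3.8 or later):

```python
import io
import logging

stream = io.StringIO()
logging.basicConfig(level=logging.WARNING, stream=stream, force=True)
log = logging.getLogger('example')

log.info('details')             # Below WARNING: filtered out
log.error('reported')           # At or above WARNING: passes
print(stream.getvalue())        # ERROR:example:reported
```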

9.15.13 os Module

The os module provides a portable interface to common operating-system functions, typically associated with the process environment, files, directories, permissions, and so forth. The programming interface closely follows C programming and standards such as POSIX.

Practically speaking, most of this module is probably too low-level to be directly used in a typical application. However, if you’re ever faced with the problem of executing some obscure low-level system operation (such as opening a TTY), there’s a good chance you’ll find the functionality for it here.

9.15.14 os.path Module

The os.path module is a legacy module for manipulating pathnames and performing common operations on the filesystem. Its functionality has been largely replaced by the newer pathlib module, but since its use is still so widespread, you’ll continue to see it in a lot of code.

One fundamental problem solved by this module is portable handling of path separators on UNIX (forward slash /) and Windows (backslash \). Functions such as os.path.join() and os.path.split() are often used to pull apart filepaths and put them back together:

>>> import os.path
>>> filename = '/Users/beazley/Desktop/old/data.csv'
>>> os.path.split(filename)
('/Users/beazley/Desktop/old', 'data.csv')
>>> os.path.join('/Users/beazley/Desktop', 'out.txt')
'/Users/beazley/Desktop/out.txt'
>>>

Here is an example of code that uses these functions:

import os.path

def clean_line(line):
    # Clean up a line (whatever that means)
    return line.strip().upper() + '\n'

def clean_data(filename):
    dirname, basename = os.path.split(filename)
    newname = os.path.join(dirname, basename+'.clean')
    with open(newname, 'w') as out_f:
        with open(filename, 'r') as in_f:
            for line in in_f:
                out_f.write(clean_line(line))

The os.path module also has a number of functions, such as isfile(), isdir(), and getsize(), for performing tests on the filesystem and getting file metadata. For example, this function returns the total size in bytes of a simple file or of all files in a directory:

import os.path

def compute_usage(filename):
    if os.path.isfile(filename):
        return os.path.getsize(filename)
    elif os.path.isdir(filename):
        return sum(compute_usage(os.path.join(filename, name))
                   for name in os.listdir(filename))
    else:
        raise RuntimeError('Unsupported file kind')

9.15.15 pathlib Module

The pathlib module is the modern way of manipulating pathnames in a portable and high-level manner. It combines a lot of the file-oriented functionality in one place and uses an object-oriented interface. The core object is the Path class. For example:

from pathlib import Path

filename = Path('/Users/beazley/old/data.csv')

Once you have an instance filename of Path, you can perform various operations on it to manipulate the filename. For example:

>>> filename.name
'data.csv'
>>> filename.parent
Path('/Users/beazley/old')
>>> filename.parent / 'newfile.csv'
Path('/Users/beazley/old/newfile.csv')
>>> filename.parts
('/', 'Users', 'beazley', 'old', 'data.csv')
>>> filename.with_suffix('.csv.clean')
Path('/Users/beazley/old/data.csv.clean')
>>>

Path instances also have functions for obtaining file metadata, getting directory listings, and other similar functions. Here is a reimplementation of the compute_usage() function from the previous section:

import pathlib

def compute_usage(filename):
    pathname = pathlib.Path(filename)
    if pathname.is_file():
        return pathname.stat().st_size
    elif pathname.is_dir():
        return sum(path.stat().st_size
                   for path in pathname.rglob('*')
                   if path.is_file())
    else:
        raise RuntimeError('Unsupported file kind')
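
Path objects can also read and write files directly, and they support globbing for pattern-based directory listings. Here is a small sketch, written against a temporary directory so it is self-contained:

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    p = Path(dirname) / 'data.txt'
    p.write_text('hello')            # Write file contents directly
    text = p.read_text()             # Read them back
    found = list(Path(dirname).glob('*.txt'))
```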

9.15.16 re Module

The re module is used to perform text matching, searching, and replacement operations using regular expressions. Here is a simple example:

>>> text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
>>> # Find all occurrences of a date
>>> import re
>>> re.findall(r'\d+/\d+/\d+', text)
['3/27/2018', '3/28/2018']
>>> # Replace all occurrences of a date with replacement text
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2018-3-27. Tomorrow is 2018-3-28.'
>>>

Regular expressions are notorious for their inscrutable syntax. In this example, \d+ is interpreted to mean “one or more digits.” More information about the pattern syntax can be found in the official documentation for the re module.
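
If a pattern is going to be used many times, it is common to compile it first with re.compile(). Compiled patterns also produce match objects, from which capture groups can be extracted:

```python
import re

# Precompile a pattern for repeated use
date_pat = re.compile(r'(\d+)/(\d+)/(\d+)')

m = date_pat.search('Today is 3/27/2018.')
if m:
    month, day, year = m.groups()
    print(year, month, day)      # --> 2018 3 27
```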

9.15.17 shutil Module

The shutil module is used to carry out some common tasks that you might otherwise perform in the shell. These include copying and removing files, working with archives, and so forth. For example, to copy a file:

import shutil

shutil.copy(srcfile, dstfile)

To move a file:

shutil.move(srcfile, dstfile)

To copy a directory tree:

shutil.copytree(srcdir, dstdir)

To remove a directory tree:

shutil.rmtree(pathname)

The shutil module is often used as a safer and more portable alternative to directly executing shell commands with the os.system() function.
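
For working with archives, shutil provides make_archive() and unpack_archive(). The following sketch creates a zip archive of a directory; the directory layout here is made up purely for illustration:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Make a directory with one file in it
    src = os.path.join(workdir, 'src')
    os.mkdir(src)
    with open(os.path.join(src, 'data.txt'), 'w') as f:
        f.write('hello')

    # Archive it.  Returns the name of the created archive file.
    archive = shutil.make_archive(os.path.join(workdir, 'backup'), 'zip', src)
    made = os.path.exists(archive)
```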

9.15.18 select Module

The select module is used for simple polling of multiple I/O streams. That is, it can be used to watch a collection of file descriptors for incoming data or for the ability to receive outgoing data. The following example shows typical usage:

import select

# Collections of objects representing file descriptors.  Must be
# integers or objects with a fileno() method.
want_to_read = [ ... ]
want_to_write = [ ... ]
check_exceptions = [ ... ]

# Timeout (or None)
timeout = None

# Poll for I/O
can_read, can_write, have_exceptions = \
    select.select(want_to_read, want_to_write, check_exceptions, timeout)

# Perform I/O operations
for file in can_read:
    do_read(file)
for file in can_write:
    do_write(file)

# Handle exceptions
for file in have_exceptions:
    handle_exception(file)

In this code, three sets of file descriptors are constructed, corresponding to reading, writing, and exceptions. These are passed to select() along with an optional timeout. select() returns three subsets of the passed arguments, representing the files on which the requested operation can be performed. For example, a file returned in can_read has incoming data pending.

The select() function is a standard low-level system call that’s commonly used to watch for system events and to implement asynchronous I/O frameworks such as the built-in asyncio module.

In addition to select(), the select module also exposes poll(), epoll(), kqueue(), and other variants that provide similar functionality. The availability of these functions varies by operating system.

The selectors module provides a higher-level interface to select that might be useful in certain contexts. An example was given earlier in Section 9.14.2.
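
As a sketch of that interface, here is the same kind of readiness polling written with selectors, using a connected socket pair so the example is self-contained:

```python
import selectors
import socket

sel = selectors.DefaultSelector()
r, w = socket.socketpair()
sel.register(r, selectors.EVENT_READ)

w.send(b'ping')
for key, events in sel.select(timeout=1):
    data = key.fileobj.recv(100)
    print(data)                  # --> b'ping'

sel.unregister(r)
r.close()
w.close()
```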

9.15.19 smtplib Module

The smtplib module implements the client side of SMTP, commonly used to send email messages. A common use of the module is in a script that does just that—sends an email to someone. Here is an example:

import smtplib

fromaddr = "[email protected]"
toaddrs = ["[email protected]" ]
amount = 123.45
msg = f"""From: {fromaddr}

Pay {amount} bitcoin or else.  We're watching.
"""

server = smtplib.SMTP('localhost')
server.sendmail(fromaddr, toaddrs, msg)
server.quit()

There are additional features to handle passwords, authentication, and other matters. However, if you’re running a script on a machine and that machine is configured to support email, the above example will usually do the job.

9.15.20 socket Module

The socket module provides low-level access to network programming functions. The interface is modeled after the standard BSD socket interface commonly associated with system programming in C.

The following example shows how to make an outgoing connection and receive a response:

from socket import socket, AF_INET, SOCK_STREAM

sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('python.org', 80))
sock.send(b'GET /index.html HTTP/1.0\r\n\r\n')
parts = []
while True:
    part = sock.recv(10000)
    if not part:
        break
    parts.append(part)
response = b''.join(parts)
print(response)

The following example shows a basic echo server that accepts client connections and echoes back any received data. To test this server, run it and then connect to it using a command such as telnet localhost 25000 or nc localhost 25000 in a separate terminal session.

from socket import socket, AF_INET, SOCK_STREAM

def echo_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.bind(address)
    sock.listen(1)
    while True:
        client, addr = sock.accept()
        echo_handler(client, addr)

def echo_handler(client, addr):
    print('Connection from:', addr)
    with client:
        while True:
            data = client.recv(10000)
            if not data:
                break
            client.sendall(data)
    print('Connection closed')

if __name__ == '__main__':
    echo_server(('', 25000))
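
A matching client can be written directly with sockets as well. Here's a sketch (echo_client() is a made-up name for illustration); it makes a connection, sends one message, and returns the echoed reply:

```python
from socket import socket, AF_INET, SOCK_STREAM

def echo_client(address, message):
    with socket(AF_INET, SOCK_STREAM) as sock:
        sock.connect(address)
        sock.sendall(message)
        return sock.recv(10000)

# Example: echo_client(('localhost', 25000), b'Hello')
```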

For UDP servers, there is no connection process. However, a server must still bind the socket to a known address. Here is a typical example of what a UDP server and client look like:

# udp.py

from socket import socket, AF_INET, SOCK_DGRAM

def run_server(address):
    sock = socket(AF_INET, SOCK_DGRAM)     # 1. Create a UDP socket
    sock.bind(address)                     # 2. Bind to address/port
    while True:
        msg, addr = sock.recvfrom(2000)    # 3. Get a message
        # ... do something
        response = b'world'
        sock.sendto(response, addr)        # 4. Send a response back

def run_client(address):
    sock = socket(AF_INET, SOCK_DGRAM)     # 1. Create a UDP socket
    sock.sendto(b'hello', address)         # 2. Send a message
    response, addr = sock.recvfrom(2000)   # 3. Get response
    print("Received:", response)
    sock.close()

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 4:
        raise SystemExit('Usage: udp.py [-client|-server] hostname port')
    address = (sys.argv[2], int(sys.argv[3]))
    if sys.argv[1] == '-server':
        run_server(address)
    elif sys.argv[1] == '-client':
        run_client(address)

9.15.21 struct Module

The struct module is used to convert data between Python and binary data structures, represented as Python byte strings. These data structures are often used when interacting with functions written in C, binary file formats, network protocols, or binary communication over serial ports.

As an example, suppose you need to construct a binary message with its format described by a C data structure:

# Message format: All values are "big endian"
struct Message {
    unsigned short msgid;      // 16 bit unsigned integer
    unsigned int sequence;     // 32 bit sequence number
    float x;                   // 32 bit float
    float y;                   // 32 bit float
}

Here’s how you do this using the struct module:

>>> import struct
>>> data = struct.pack('>HIff', 123, 456, 1.23, 4.56)
>>> data
b'\x00{\x00\x00\x01\xc8?\x9dp\xa4@\x91\xeb\x85'
>>>

To decode binary data, use struct.unpack:

>>> struct.unpack('>HIff', data)
(123, 456, 1.2300000190734863, 4.559999942779541)
>>>

The differences in the floating-point values are due to the loss of accuracy incurred by their conversion to 32-bit values. Python represents floating-point values as 64-bit double precision values.
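
The size of a packed structure can be computed in advance with struct.calcsize(), and a byte string holding a sequence of repeated records can be decoded with struct.iter_unpack(). A sketch, reusing the message format above (the values chosen are exactly representable as 32-bit floats):

```python
import struct

rec_fmt = '>HIff'
rec_size = struct.calcsize(rec_fmt)      # --> 14 bytes per record

# Two packed records back-to-back
data = (struct.pack(rec_fmt, 1, 100, 0.5, 0.25) +
        struct.pack(rec_fmt, 2, 200, 1.5, 2.25))

for msgid, sequence, x, y in struct.iter_unpack(rec_fmt, data):
    print(msgid, sequence, x, y)
```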

9.15.22 subprocess Module

The subprocess module is used to execute a separate program as a subprocess, but with control over the execution environment including I/O handling, termination, and so forth. There are two common uses of the module.

If you want to run a separate program and collect all of its output at once, use check_output(). For example:

import subprocess

# Run the 'netstat -a' command and collect its output
try:
    out = subprocess.check_output(['netstat', '-a'])
except subprocess.CalledProcessError as e:
    print("It failed:", e)

The data returned by check_output() is presented as bytes. If you want to convert it to text, make sure you apply a proper decoding:

text = out.decode('utf-8')
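
Alternatively, the run() function provides similar functionality with a bit more control; passing text=True decodes the output for you. A sketch, using the Python interpreter itself as the subprocess so the example is self-contained:

```python
import subprocess
import sys

result = subprocess.run([sys.executable, '-c', 'print("hello")'],
                        capture_output=True, text=True)
print(result.returncode)     # --> 0
print(result.stdout)         # --> hello
```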

It is also possible to set up a pipe and to interact with a subprocess in a more detailed manner. To do that, use the Popen class like this:

import subprocess

p = subprocess.Popen(['wc'],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)

# Send data to the subprocess
p.stdin.write(b'hello world\nthis is a test\n')
p.stdin.close()

# Read data back
out = p.stdout.read()
print(out)

An instance p of Popen has attributes stdin and stdout that can be used to communicate with the subprocess.

9.15.23 tempfile Module

The tempfile module provides support for creating temporary files and directories. Here is an example of creating a temporary file:

import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b'Hello World')
    f.seek(0)
    data = f.read()
    print('Got:', data)

By default, temporary files are opened in binary mode and allow both reading and writing. The with statement is also commonly used to define a scope in which the file will be used. The file is deleted at the end of the with block.

If you would like to create a temporary directory, use this:

with tempfile.TemporaryDirectory() as dirname:
    # Use the directory dirname
    ...

As with a file, the directory and all of its contents will be deleted at the end of the with block.
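
If the temporary file needs an actual name in the filesystem—for example, to pass to another program—use NamedTemporaryFile(). Giving delete=False keeps the file around after it's closed, in which case you're responsible for removing it:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'Hello World')
    name = f.name                # Actual filename on the filesystem

# The file persists after the with block and can be reopened by name
with open(name, 'rb') as f:
    data = f.read()

os.remove(name)
```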

9.15.24 textwrap Module

The textwrap module can be used to format text to fit a specific terminal width. Perhaps it’s a bit special-purpose but it can sometimes be useful in cleaning up text for output when making reports. There are two functions of interest.

wrap() takes text and wraps it to fit a specified column width. The function returns a list of strings. For example:

import textwrap

text = """look into my eyes
look into my eyes
the eyes the eyes the eyes
not around the eyes
don't look around the eyes
look into my eyes you're under
"""

wrapped = textwrap.wrap(text, width=40)
print('\n'.join(wrapped))
# Produces:
# look into my eyes look into my eyes the
# eyes the eyes the eyes not around the
# eyes don't look around the eyes look
# into my eyes you're under

The indent() function can be used to indent a block of text. For example:

print(textwrap.indent(text, '    '))
# Produces:
#    look into my eyes
#    look into my eyes
#    the eyes the eyes the eyes
#    not around the eyes
#    don't look around the eyes
#    look into my eyes you're under
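
Two related functions are also worth knowing: fill(), which is like wrap() except that it returns a single string with embedded newlines, and shorten(), which collapses whitespace and truncates text to fit a given width:

```python
import textwrap

text = 'look into my eyes look into my eyes the eyes the eyes the eyes'

filled = textwrap.fill(text, width=30)
short = textwrap.shorten(text, width=25, placeholder='...')
print(filled)
print(short)
```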

9.15.25 threading Module

The threading module is used to execute code concurrently. This problem commonly arises with I/O handling in network programs. Thread programming is a large topic, but the following examples illustrate solutions to common problems.

Here’s an example of launching a thread and waiting for it:

import threading
import time

def countdown(n):
    while n > 0:
        print('T-minus', n)
        n -= 1
        time.sleep(1)

t = threading.Thread(target=countdown, args=[10])
t.start()
t.join()      # Wait for the thread to finish

If you’re never going to wait for the thread to finish, make it daemonic by supplying an extra daemon flag like this:

t = threading.Thread(target=countdown, args=[10], daemon=True)

If you want to make a thread terminate, you’ll need to do so explicitly with a flag or some dedicated variable for that purpose. The thread will have to be programmed to check for it.

import threading
import time

must_stop = False

def countdown(n):
    while n > 0 and not must_stop:
        print('T-minus', n)
        n -= 1
        time.sleep(1)

If threads are going to mutate shared data, protect it with a Lock.

import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
             self.value += 1

    def decrement(self):
        with self.lock:
             self.value -= 1

If one thread must wait for another thread to do something, use an Event.

import threading
import time

def step1(evt):
    print('Step 1')
    time.sleep(5)
    evt.set()

def step2(evt):
    evt.wait()
    print('Step 2')

evt = threading.Event()
threading.Thread(target=step1, args=[evt]).start()
threading.Thread(target=step2, args=[evt]).start()

If threads are going to communicate, use a Queue:

import threading
import queue
import time

def producer(q):
    for i in range(10):
        print('Producing:', i)
        q.put(i)
    print('Done')
    q.put(None)

def consumer(q):
    while True:
        item = q.get()
        if item is None:
            break
        print('Consuming:', item)
    print('Goodbye')

q = queue.Queue()
threading.Thread(target=producer, args=[q]).start()
threading.Thread(target=consumer, args=[q]).start()

9.15.26 time Module

The time module is used to access system time-related functions. The following functions are among the most useful:

sleep(seconds)

Makes Python sleep for the given number of seconds, specified as a floating-point number.

time()

Return the current system time in UTC as a floating-point number. This is the number of seconds since the epoch (usually January 1, 1970 for UNIX systems). Use localtime() to convert it into a data structure suitable for extracting useful information.

localtime([secs])

Return a struct_time object representing the local time on the system or the time represented by the floating-point value secs passed as an argument. The resulting struct has attributes tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, and tm_isdst.

gmtime([secs])

The same as localtime() except that the resulting structure represents the time in UTC (or Greenwich Mean Time).

ctime([secs])

Convert a time represented as seconds to a text string suitable for printing. Useful for debugging and logging.

asctime(tm)

Convert a time structure as represented by localtime() into a text string suitable for printing.
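
Putting a few of these functions together:

```python
import time

now = time.time()            # Seconds since the epoch
tm = time.localtime(now)     # Broken-down local time
print(tm.tm_year, tm.tm_mon, tm.tm_mday)
print(time.ctime(now))       # e.g., 'Mon Mar  2 14:30:00 2020'
```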

The datetime module is more generally used for representing dates and times for the purpose of performing date-related computations and dealing with timezones.

9.15.27 urllib Package

The urllib package is used to make client-side HTTP requests. Perhaps the most useful function is urllib.request.urlopen(), which can be used to fetch simple webpages. For example:

>>> from urllib.request import urlopen
>>> u = urlopen('http://www.python.org')
>>> data = u.read()
>>>

If you want to encode form parameters, you can use urllib.parse.urlencode() as shown here:

from urllib.parse import urlencode
from urllib.request import urlopen

form = {
   'name': 'Mary A. Python',
   'email': '[email protected]'
}

data = urlencode(form)
u = urlopen('http://httpbin.org/post', data.encode('utf-8'))
response = u.read()

The urlopen() function works fine for basic webpages and APIs involving HTTP or HTTPS. However, it becomes quite awkward to use if access also involves cookies, advanced authentication schemes, and other layers. Frankly, most Python programmers would use a third-party library such as requests or httpx to handle these situations. You should too.

The urllib.parse subpackage has additional functions for manipulating URLs themselves. For example, the urlparse() function can be used to pull apart a URL:

>>> url = 'http://httpbin.org/get?name=Dave&n=42'
>>> from urllib.parse import urlparse
>>> urlparse(url)
ParseResult(scheme='http', netloc='httpbin.org', path='/get', params='',
query='name=Dave&n=42', fragment='')
>>>
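
The query string from such a result can be further decoded into a dictionary with urllib.parse.parse_qs(). For example:

```python
from urllib.parse import urlparse, parse_qs

url = 'http://httpbin.org/get?name=Dave&n=42'
parts = urlparse(url)
params = parse_qs(parts.query)
print(params)        # --> {'name': ['Dave'], 'n': ['42']}
```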

9.15.28 unicodedata Module

The unicodedata module is used for more advanced operations involving Unicode text strings. There are often multiple representations of the same Unicode text. For example, the character U+00F1 (ñ) might be fully composed as a single character U+00F1 or decomposed into a multicharacter sequence U+006e U+0303 (n, ~). This can cause strange problems in programs that are expecting text strings that visually render the same to actually be the same in representation. Consider the following example involving dictionary keys:

>>> d = {}
>>> d['Jalape\xf1o'] = 'spicy'
>>> d['Jalapen\u0303o'] = 'mild'
>>> d
{'Jalapeño': 'spicy', 'Jalapeño': 'mild'}
>>>

At first glance, this looks like it should be an operational error—how could a dictionary have two identical, yet separate, keys like that? The answer is found in the fact that the keys consist of different Unicode character sequences.

If consistent processing of identically rendered Unicode strings is an issue, they should be normalized. The unicodedata.normalize() function can be used to ensure a consistent character representation. For example, unicodedata.normalize('NFC', s) will make sure that all characters in s are fully composed and not represented as a sequence of combining characters. Using unicodedata.normalize('NFD', s) will make sure that all characters in s are fully decomposed.
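
For instance, normalizing both representations of "Jalapeño" to NFC makes them compare equal:

```python
import unicodedata

s1 = 'Jalape\xf1o'           # Fully composed (single character ñ)
s2 = 'Jalapen\u0303o'        # Decomposed (n + combining tilde)
print(s1 == s2)              # --> False

n1 = unicodedata.normalize('NFC', s1)
n2 = unicodedata.normalize('NFC', s2)
print(n1 == n2)              # --> True
```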

The unicodedata module also has functions for testing character properties such as capitalization, numbers, and whitespace. General character properties can be obtained with the unicodedata.category(c) function. For example, unicodedata.category('A') returns 'Lu', signifying that the character is an uppercase letter. More information about these values can be found in the official Unicode character database at https://www.unicode.org/ucd.

9.15.29 xml Package

The xml package is a large collection of modules for processing XML data in various ways. However, if your primary goal is to read an XML document and extract information from it, the easiest way to do it is to use the xml.etree subpackage. Suppose you had an XML document in a file recipe.xml like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<recipe>
   <title>Famous Guacamole</title>
   <description>A southwest favorite!</description>
   <ingredients>
        <item num="4"> Large avocados, chopped </item>
        <item num="1"> Tomato, chopped </item>
        <item num="1/2" units="C"> White onion, chopped </item>
        <item num="2" units="tbl"> Fresh squeezed lemon juice </item>
        <item num="1"> Jalapeno pepper, diced </item>
        <item num="1" units="tbl"> Fresh cilantro, minced </item>
        <item num="1" units="tbl"> Garlic, minced </item>
        <item num="3" units="tsp"> Salt </item>
        <item num="12" units="bottles"> Ice-cold beer </item>
   </ingredients>
   <directions>
   Combine all ingredients and hand whisk to desired consistency.
   Serve and enjoy with ice-cold beers.
   </directions>
</recipe>

Here’s how to extract specific elements from it:

from xml.etree.ElementTree import ElementTree

doc = ElementTree(file="recipe.xml")
title = doc.find('title')
print(title.text)

# Alternative (just get element text)
print(doc.findtext('description'))

# Iterate over multiple elements
for item in doc.findall('ingredients/item'):
    num = item.get('num')
    units = item.get('units', '')
    text = item.text.strip()
    print(f'{num} {units} {text}')
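
If the XML is already held in memory as a string, it can be parsed directly with the XML() function. Here is a sketch using a trimmed-down version of the recipe document:

```python
from xml.etree.ElementTree import XML

doc = XML('<recipe><title>Famous Guacamole</title>'
          '<ingredients><item num="4">Large avocados</item></ingredients>'
          '</recipe>')

print(doc.findtext('title'))          # --> Famous Guacamole
for item in doc.findall('ingredients/item'):
    print(item.get('num'), item.text)
```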

9.16 Final Words

I/O is a fundamental part of writing any useful program. Given its popularity, Python is able to work with literally any data format, encoding, or document structure that’s in use. Although the standard library might not support it, you will almost certainly find a third-party module to solve your problem.

In the big picture, it may be more useful to think about the edges of your application. At the outer boundary between your program and reality, it’s common to encounter issues related to data encoding. This is especially true for textual data and Unicode. Much of the complexity in Python’s I/O handling—supporting different encoding, error handling policies, and so on—is aimed at this specific problem. It’s also critical to keep in mind that textual data and binary data are strictly separated. Knowing what you’re working with helps in understanding the big picture.

A secondary consideration in I/O is the overall evaluation model. Python code is currently separated into two worlds—normal synchronous code and asynchronous code usually associated with the asyncio module (characterized by the use of async functions and the async/await syntax). Asynchronous code almost always requires using dedicated libraries that are capable of operating in that environment. This, in turn, forces your hand on writing your application code in the “async” style as well. Honestly, you should probably avoid asynchronous coding unless you absolutely know that you need it—and if you’re not really sure, then you almost certainly don’t. Most of the well-adjusted Python-speaking universe codes in a normal synchronous style that is far easier to reason about, debug, and test. You should choose that.
