Input and output (I/O) is part of all programs. This chapter describes the essentials of Python I/O including data encoding, command-line options, environment variables, file I/O, and data serialization. Particular attention is given to programming techniques and abstractions that encourage proper I/O handling. The end of this chapter gives an overview of common standard library modules related to I/O.
The main problem of I/O is the outside world. To communicate with it, data must be properly represented, so that it can be manipulated. At the lowest level, Python works with two fundamental datatypes: bytes that represent raw uninterpreted data of any kind and text that represents Unicode characters.
To represent bytes, two built-in types are used, bytes and bytearray. bytes is an immutable string of integer byte values. bytearray is a mutable byte array that behaves as a combination of a byte string and a list. Its mutability makes it suitable for building up groups of bytes in a more incremental manner, as when assembling data from fragments. The following example illustrates a few features of bytes and bytearray:
# Specify a bytes literal (note the b' prefix)
a = b'hello'

# Specify bytes from a list of integers
b = bytes([0x68, 0x65, 0x6c, 0x6c, 0x6f])

# Create and populate a bytearray from parts
c = bytearray()
c.extend(b'world')      # c = bytearray(b'world')
c.append(0x21)          # c = bytearray(b'world!')

# Access byte values
print(a[0])             # --> prints 104
for x in b:             # Outputs 104 101 108 108 111
    print(x)
Accessing individual elements of bytes and bytearray objects produces integer byte values, not single-character byte strings. This is different from text strings, so it is a common usage error.
Text is represented by the str datatype and stored as an array of Unicode code points. For example:
d = 'hello'       # Text (Unicode)
len(d)            # --> 5
print(d[0])       # prints 'h'
Python maintains a strict separation between bytes and text. There is never automatic conversion between the two types, comparisons between these types evaluate as False, and any operation that mixes bytes and text together results in an error. For example:
a = b'hello'     # bytes
b = 'hello'      # text
c = 'world'      # text

print(a == b)    # -> False
d = a + c        # TypeError: can't concat str to bytes
e = b + c        # -> 'helloworld' (both are strings)
When performing I/O, make sure you’re working with the right kind of data representation. If you are manipulating text, use text strings. If you are manipulating binary data, use bytes.
If you work with text, all data read from input must be decoded and all data written to output must be encoded. For explicit conversion between text and bytes, there are encode(text [,errors]) and decode(bytes [,errors]) methods on text and bytes objects, respectively. For example:
a = 'hello'              # Text
b = a.encode('utf-8')    # Encode to bytes
c = b'world'             # Bytes
d = c.decode('utf-8')    # Decode to text
Both encode() and decode() require the name of an encoding such as 'utf-8' or 'latin-1'. The encodings in Table 9.1 are common.
Encoding Name | Description
---|---
'ascii' | Character values in the range [0x00, 0x7f].
'latin-1' | Character values in the range [0x00, 0xff]. Also known as 'iso-8859-1'.
'utf-8' | Variable-length encoding that allows all Unicode characters to be represented.
'cp1252' | A common text encoding on Windows.
'macroman' | A common text encoding on Macintosh.
Additionally, the encoding methods accept an optional errors argument that specifies behavior in the presence of encoding errors. It is one of the values in Table 9.2.
Value | Description
---|---
'strict' | Raises a UnicodeError exception for encoding and decoding errors. This is the default.
'ignore' | Ignores invalid characters.
'replace' | Replaces invalid characters with a replacement character (U+FFFD in Unicode, b'?' in bytes).
'backslashreplace' | Replaces each invalid character with a Python character escape sequence. For example, the character U+1234 is replaced by '\u1234'.
'xmlcharrefreplace' | Replaces each invalid character with an XML character reference. For example, the character U+1234 is replaced by '&#4660;'.
'surrogateescape' | Replaces any invalid byte '\xhh' with the Unicode code point U+DChh on decoding, and turns U+DChh back into the byte '\xhh' on encoding.
The 'backslashreplace' and 'xmlcharrefreplace' error policies represent unrepresentable characters in a form that allows them to be viewed as simple ASCII text or as XML character references. This can be useful for debugging.
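As a quick sketch of these two policies, consider encoding a string containing U+00F1 (ñ) to ASCII:

```python
s = 'Jalape\u00f1o'   # contains the non-ASCII character U+00F1

print(s.encode('ascii', 'backslashreplace'))   # b'Jalape\\xf1o'
print(s.encode('ascii', 'xmlcharrefreplace'))  # b'Jalape&#241;o'
```

Both outputs are pure ASCII, so the offending character survives in a readable form.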
The 'surrogateescape' error handling policy allows degenerate byte data—data that does not follow the expected encoding rules—to survive a roundtrip decoding/encoding cycle intact, regardless of the text encoding being used. Specifically, s.decode(enc, 'surrogateescape').encode(enc, 'surrogateescape') == s. This round-trip preservation of data is useful for certain kinds of system interfaces where a text encoding is expected but can't be guaranteed due to issues outside of Python's control. Instead of destroying data with a bad encoding, Python embeds it "as is" using surrogate encoding. Here's an example of this behavior with an improperly encoded UTF-8 string:
>>> a = b'Spicy Jalape\xf1o'      # Invalid UTF-8
>>> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 12:
invalid continuation byte
>>> a.decode('utf-8', 'surrogateescape')
'Spicy Jalape\udcf1o'
>>> # Encode the resulting string back into bytes
>>> _.encode('utf-8', 'surrogateescape')
b'Spicy Jalape\xf1o'
>>>
A common problem when working with text and byte strings is string conversion and formatting—for example, converting a floating-point number to a string with a given width and precision. To format a single value, use the format() function:
x = 123.456
format(x, '0.2f')       # '123.46'
format(x, '10.4f')      # '  123.4560'
format(x, '*<10.2f')    # '123.46****'
The second argument to format() is a format specifier. The general format of the specifier is [[fill]align][sign][0][width][,][.precision][type] where each part enclosed in [] is optional. The width specifies the minimum field width to use, and the align specifier is one of <, >, or ^ for left, right, and centered alignment within the field. An optional fill character is used to pad the space. For example:
name = 'Elwood'
r = format(name, '<10')     # r = 'Elwood    '
r = format(name, '>10')     # r = '    Elwood'
r = format(name, '^10')     # r = '  Elwood  '
r = format(name, '*^10')    # r = '**Elwood**'
The type specifier indicates the type of data. Table 9.3 lists the supported format codes. If not supplied, the default format code is s for strings, d for integers, and f for floats.
Character | Output Format
---|---
d | Decimal integer.
b | Binary integer.
o | Octal integer.
x | Hexadecimal integer.
X | Hexadecimal integer (uppercase letters).
f, F | Floating point as [-]m.dddddd.
e | Floating point as [-]m.dddddde±xx.
E | Floating point as [-]m.ddddddE±xx.
g, G | Uses e or E for exponents less than -4 or greater than the precision; otherwise uses f.
n | Same as g except that the current locale setting is used to determine the decimal-point character.
% | Multiplies a number by 100 and displays it using f format, followed by a % sign.
s | String or any object. The formatting code uses str() to generate the string.
c | Single character.
The sign part of a format specifier is one of +, -, or a space. A + indicates that a leading sign should be used on all numbers. A - is the default and only adds a sign character for negative numbers. A space adds a leading space to positive numbers.
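The three sign options can be sketched side by side:

```python
print(format(42, '+d'))    # '+42'  (sign on all numbers)
print(format(-42, '+d'))   # '-42'
print(format(42, ' d'))    # ' 42'  (space reserved for the sign)
print(format(-42, ' d'))   # '-42'
```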
An optional comma (,) may appear between the width and the precision. This adds a thousands separator character. For example:
x = 123456.78
format(x, '16,.2f')     # '      123,456.78'
The precision part of the specifier supplies the number of digits of accuracy to use for decimals. If a leading 0 is added to the field width for numbers, numeric values are padded with leading 0s to fill the space. Here are some examples of formatting different kinds of numbers:
x = 42
r = format(x, '10d')        # r = '        42'
r = format(x, '10x')        # r = '        2a'
r = format(x, '10b')        # r = '    101010'
r = format(x, '010b')       # r = '0000101010'

y = 3.1415926
r = format(y, '10.2f')      # r = '      3.14'
r = format(y, '10.2e')      # r = '  3.14e+00'
r = format(y, '+10.2f')     # r = '     +3.14'
r = format(y, '+010.2f')    # r = '+000003.14'
r = format(y, '+10.2%')     # r = '  +314.16%'
For more complex string formatting, you can use f-strings:
x = 123.456
f'Value is {x:0.2f}'        # 'Value is 123.46'
f'Value is {x:10.4f}'       # 'Value is   123.4560'
f'Value is {2*x:*<10.2f}'   # 'Value is 246.91****'
Within an f-string, text of the form {expr:spec} is replaced by the value of format(expr, spec). expr can be an arbitrary expression as long as it doesn't include {, }, or \ characters. Parts of the format specifier itself can optionally be supplied by other expressions. For example:
y = 3.1415926
width = 8
precision = 3
r = f'{y:{width}.{precision}f}'    # r = '   3.142'
If you end expr with =, then the literal text of expr is also included in the result. For example:
x = 123.456
f'{x=:0.2f}'      # 'x=123.46'
f'{2*x=:0.2f}'    # '2*x=246.91'
If you append !r to a value, formatting is applied to the output of repr(). If you use !s, formatting is applied to the output of str(). For example:
f'{x!r:spec}'     # Calls repr(x).__format__('spec')
f'{x!s:spec}'     # Calls str(x).__format__('spec')
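A concrete sketch of the !r conversion with a string value:

```python
x = 'hello'
print(f'{x}')         # hello
print(f'{x!r}')       # 'hello'  (repr output, quotes included)
print(f'{x!r:>12}')   # the repr, right-aligned in a 12-character field
```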
As an alternative to f-strings, you can use the .format() method of strings:
x = 123.456
'Value is {:0.2f}'.format(x)             # 'Value is 123.46'
'Value is {0:10.2f}'.format(x)           # 'Value is     123.46'
'Value is {val:*<10.2f}'.format(val=x)   # 'Value is 123.46****'
With a string formatted by .format(), text of the form {arg:spec} is replaced by the value of format(arg, spec). In this case, arg refers to one of the arguments given to the format() method. If omitted entirely, the arguments are taken in order. For example:
name = 'IBM'
shares = 50
price = 490.1
r = '{:>10s} {:10d} {:10.2f}'.format(name, shares, price)
# r = '       IBM         50     490.10'
arg can also refer to a specific argument number or name. For example:
tag = 'p'
text = 'hello world'
r = '<{0}>{1}</{0}>'.format(tag, text)
# r = '<p>hello world</p>'
r = '<{tag}>{text}</{tag}>'.format(tag='p', text='hello world')
Unlike f-strings, the arg value of a specifier cannot be an arbitrary expression, so it's not quite as expressive. However, the format() method can perform limited attribute lookup, indexing, and nested substitutions. For example:
y = 3.1415926
width = 8
precision = 3
r = 'Value is {0:{1}.{2}f}'.format(y, width, precision)

d = { 'name': 'IBM', 'shares': 50, 'price': 490.1 }
r = '{0[shares]:d} shares of {0[name]} at {0[price]:0.2f}'.format(d)
# r = '50 shares of IBM at 490.10'
bytes and bytearray instances can be formatted using the % operator. The semantics of this operator are modeled after the sprintf() function from C. Here are some examples:
name = b'ACME'
x = 123.456
b'Value is %0.2f' % x               # b'Value is 123.46'
bytearray(b'Value is %0.2f') % x    # bytearray(b'Value is 123.46')
b'%s = %0.2f' % (name, x)           # b'ACME = 123.46'
With this formatting, sequences of the form %spec are replaced in order with values from a tuple provided as the second operand to the % operator. The basic format codes (d, f, s, and so on) are the same as those used with the format() function. However, more advanced features are either missing or changed slightly. For example, to adjust alignment, you use a - character like this:
x = 123.456
b'%10.2f' % x       # b'    123.46'
b'%-10.2f' % x      # b'123.46    '
Using a format code of %r produces the output of ascii(), which can be useful in debugging and logging.
When working with bytes, be aware that text strings are not supported; they need to be explicitly encoded:
name = 'Dave'
b'Hello %s' % name                    # TypeError!
b'Hello %s' % name.encode('utf-8')    # Ok
This form of formatting can also be used with text strings, but it's considered an older programming style. However, it still arises in certain libraries. For example, messages produced by the logging module are formatted in this way:
import logging
log = logging.getLogger(__name__)
log.debug('%s got %d', name, value)   # Same as '%s got %d' % (name, value)
The logging module is briefly described later in this chapter in Section 9.15.12.
When Python starts, command-line options are placed in the list sys.argv as text strings. The first item is the name of the program. Subsequent items are the options given on the command line after the program name. The following program is a minimal prototype of manually processing command-line arguments:
def main(argv):
    if len(argv) != 3:
        raise SystemExit(
            f'Usage: python {argv[0]} inputfile outputfile')
    inputfile = argv[1]
    outputfile = argv[2]
    ...

if __name__ == '__main__':
    import sys
    main(sys.argv)
For better code organization, testing, and similar reasons, it's a good idea to write a dedicated main() function that accepts the command-line options (if any) as a list, as opposed to reading sys.argv directly. Include a small fragment of code at the end of your program that passes the command-line options to your main() function.
sys.argv[0] contains the name of the script being executed. Writing a descriptive help message and raising SystemExit is standard practice for command-line scripts that want to report an error.
Although you can manually process command options in simple scripts, consider using the argparse module for more complicated command-line handling. Here is an example:
import argparse

def main(argv):
    p = argparse.ArgumentParser(description="This is some program")

    # A positional argument
    p.add_argument("infile")

    # An option taking an argument
    p.add_argument("-o", "--output", action="store")

    # An option that sets a boolean flag
    p.add_argument("-d", "--debug", action="store_true", default=False)

    # Parse the command line
    args = p.parse_args(args=argv)

    # Retrieve the option settings
    infile = args.infile
    output = args.output
    debugmode = args.debug

    print(infile, output, debugmode)

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])
This example shows only the simplest use of the argparse module. The standard library documentation describes more advanced usage. There are also third-party modules such as click and docopt that can simplify the writing of more complex command-line parsers.
Finally, command-line options might be provided to Python in an invalid text encoding. Such arguments are still accepted, but they will be decoded using the 'surrogateescape' error handling policy described in Section 9.2. You need to be aware of this if such arguments are later included in any kind of text output where it's critical to avoid crashing. It might not be critical, though; don't overcomplicate your code for edge cases that don't matter.
Sometimes data is passed to a program via environment variables set in the command shell. For example, a Python program might be launched using the env command:
$ env SOMEVAR=somevalue python3 somescript.py
Environment variables are accessed as text strings in the mapping os.environ. Here's an example:
import os

path = os.environ['PATH']
user = os.environ['USER']
editor = os.environ['EDITOR']
val = os.environ['SOMEVAR']
... etc ...
To modify environment variables, assign to entries of os.environ. For example:
os.environ['NAME'] = 'VALUE'
Modifications to os.environ affect both the running program and any subprocesses created later—for example, those created by the subprocess module.
As with command-line options, badly encoded environment variables may produce strings that use the 'surrogateescape' error handling policy.
To open a file, use the built-in open() function. Usually, open() is given a filename and a file mode. It is also often used in combination with the with statement as a context manager. Here are some common usage patterns of working with files:
# Read a text file all at once as a string
with open('filename.txt', 'rt') as file:
    data = file.read()

# Read a file line-by-line
with open('filename.txt', 'rt') as file:
    for line in file:
        ...

# Write to a text file
with open('out.txt', 'wt') as file:
    file.write('Some output\n')
    print('More output', file=file)
In most cases, using open() is a straightforward affair. You give it the name of the file you want to open along with a file mode. For example:
open('name.txt')          # Opens "name.txt" for reading
open('name.txt', 'rt')    # Opens "name.txt" for reading (same as above)
open('name.txt', 'wt')    # Opens "name.txt" for writing
open('data.bin', 'rb')    # Binary-mode read
open('data.bin', 'wb')    # Binary-mode write
For most programs, you will never need to know more than these simple examples to work with files. However, there are a number of special cases and more esoteric features of open() worth knowing. The next few sections discuss open() and file I/O in more detail.
To open a file, you need to give open() the name of the file. The name can either be a fully specified absolute pathname such as '/Users/guido/Desktop/files/old/data.csv' or a relative pathname such as 'data.csv' or '..\old\data.csv'. For relative filenames, the file location is determined relative to the current working directory, as returned by os.getcwd(). The current working directory can be changed with os.chdir(newdir).
The name itself can be encoded in a number of forms. If it's a text string, the name is interpreted according to the text encoding returned by sys.getfilesystemencoding() before being passed to the host operating system. If the filename is a byte string, it is left unencoded and is passed as is. This latter option may be useful if you're writing programs that must handle the possibility of degenerate or miscoded filenames—instead of passing the filename as text, you can pass the raw binary representation of the name. This might seem like an obscure edge case, but Python is commonly used to write system-level scripts that manipulate the filesystem. Abuse of the filesystem is a common technique used by hackers to either hide their tracks or to break system tools.
In addition to text and bytes, any object that implements the special method __fspath__() can be used as a name. The __fspath__() method must return a text or bytes object corresponding to the actual name. This is the mechanism that makes standard library modules such as pathlib work. For example:
>>> from pathlib import Path
>>> p = Path('Data/portfolio.csv')
>>> p.__fspath__()
'Data/portfolio.csv'
>>>
Potentially, you could make your own custom Path object that works with open(), as long as it implements __fspath__() in a way that resolves to a proper filename on the system.
Finally, filenames can be given as low-level integer file descriptors. This requires that the "file" is already open on the system in some way. Perhaps it corresponds to a network socket, a pipe, or some other system resource that exposes a file descriptor. Here is an example of opening a file directly with the os module and then turning it into a proper file object:
>>> import os
>>> fd = os.open('/etc/passwd', os.O_RDONLY)
>>> fd
3
>>> file = open(fd, 'rt')
>>> file
<_io.TextIOWrapper name=3 mode='rt' encoding='UTF-8'>
>>> data = file.read()
>>>
When opening an existing file descriptor like this, the close() method of the returned file will also close the underlying descriptor. This behavior can be disabled by passing closefd=False to open(). For example:
file = open(fd, 'rt', closefd=False)
When opening a file, you need to specify a file mode. The core file modes are 'r' for reading, 'w' for writing, and 'a' for appending. 'w' mode replaces any existing file with new content. 'a' opens a file for writing and positions the file pointer at the end of the file so that new data can be appended.
A special file mode of 'x' can be used to write to a file, but only if it doesn't already exist. This is a useful way to prevent accidental overwriting of existing data. For this mode, a FileExistsError exception is raised if the file already exists.
Python makes a strict distinction between text and binary data. To specify the kind of data, you append a 't' or a 'b' to the file mode. For example, a file mode of 'rt' opens a file for reading in text mode, and 'rb' opens a file for reading in binary mode. The mode determines the kind of data returned by file-related methods such as f.read(). In text mode, strings are returned. In binary mode, bytes are returned.
Binary files can be opened for in-place updates by supplying a plus (+) character, such as 'rb+' or 'wb+'. When a file is opened for update, you can perform both input and output, as long as all output operations flush their data before any subsequent input operations. If a file is opened using 'wb+' mode, its length is first truncated to zero. A common use of the update mode is to provide random read/write access to file contents in combination with seek operations.
By default, files are opened with I/O buffering enabled. With I/O buffering, I/O operations are performed in larger chunks to avoid excessive system calls. For example, write operations would start filling an internal memory buffer, and output would only actually occur when the buffer is filled up. This behavior can be changed by giving a buffering argument to open(). For example:
# Open a binary-mode file with no I/O buffering
with open('data.bin', 'wb', buffering=0) as file:
    file.write(data)
    ...
A value of 0 specifies unbuffered I/O and is only valid for binary-mode files. A value of 1 specifies line buffering and is usually only meaningful for text-mode files. Any other positive value indicates the buffer size to use, in bytes. If no buffering value is specified, the default behavior depends on the kind of file. If it's a normal file on disk, buffering is managed in blocks and the buffer size is set to io.DEFAULT_BUFFER_SIZE. Typically this is some small multiple of 4096 bytes and may vary by system. If the file represents an interactive terminal, line buffering is used.
For normal programs, I/O buffering is not typically a major concern. However, buffering can have an impact on applications involving active communication between processes. For example, a problem that sometimes arises is two communicating subprocesses that deadlock due to an internal buffering issue—for instance, one process writes into a buffer, but a receiver never sees that data because the buffer didn't get flushed. Such problems can be fixed by either specifying unbuffered I/O or by an explicit flush() call on the associated file. For example:
file.write(data)
file.write(data)
...
file.flush() # Make sure all data is written from buffers
For files opened in text mode, an optional encoding and an error-handling policy can be specified using the encoding and errors arguments. For example:
with open('file.txt', 'rt', encoding='utf-8', errors='replace') as file:
    data = file.read()
The values given to the encoding and errors arguments have the same meaning as for the encode() and decode() methods of strings and bytes, respectively.
The default text encoding for files is determined by locale.getpreferredencoding(False) and may vary by system. If you know the encoding in advance, it's often better to provide it explicitly, even if it happens to match the default encoding on your system.
With text files, one complication is the encoding of newline characters. Newlines are encoded as '\n', '\r\n', or '\r' depending on the host operating system (for example, '\n' on UNIX and '\r\n' on Windows). By default, Python translates all of these line endings to a standard '\n' character when reading. On writing, newline characters are translated back to the default line ending used on the system. This behavior is sometimes referred to as "universal newline mode" in Python documentation.
You can change the newline behavior by giving a newline argument to open(). For example:
# Exactly require '\r\n' and leave it intact
file = open('somefile.txt', 'rt', newline='\r\n')
Specifying newline=None enables the default line handling behavior, where all line endings are translated to a standard '\n' character. Giving newline='' makes Python recognize all line endings but disables the translation step: if lines were terminated by '\r\n', the '\r\n' combination would be left in the input intact. Specifying a value of '\n', '\r', or '\r\n' makes that the expected line ending.
The open() function serves as a kind of high-level factory function for creating instances of different I/O classes. These classes embody the different file modes, encodings, and buffering behaviors. They are also composed together in layers. The following classes are defined in the io module:
FileIO(filename, mode='r', closefd=True, opener=None)
Opens a file for raw unbuffered binary I/O. filename is any valid filename accepted by the open() function. Other arguments have the same meaning as for open().
BufferedReader(file [, buffer_size]) BufferedWriter(file [, buffer_size]) BufferedRandom(file [, buffer_size])
Implements a buffered binary I/O layer for a file. file is an instance of FileIO. The optional buffer_size argument specifies the internal buffer size to use. The choice of class depends on whether the file is reading, writing, or updating data.
TextIOWrapper(buffered [, encoding [, errors [, newline [, line_buffering [, write_through]]]]])
Implements text-mode I/O. buffered is a buffered binary-mode file, such as BufferedReader or BufferedWriter. The encoding, errors, and newline arguments have the same meaning as for open(). line_buffering is a Boolean flag that forces I/O to be flushed on newline characters (False by default). write_through is a Boolean flag that forces all writes to be flushed (False by default).
Here is an example that shows how a text-mode file is constructed, layer-by-layer:
>>> import io
>>> raw = io.FileIO('filename.txt', 'r')                # Raw binary mode
>>> buffer = io.BufferedReader(raw)                     # Binary buffered reader
>>> file = io.TextIOWrapper(buffer, encoding='utf-8')   # Text mode
>>>
Normally you don't need to manually construct layers like this—the built-in open() function takes care of all of the work. However, if you already have an existing file object and want to change its handling in some way, you might manipulate the layers as shown.
To strip layers away, use the detach() method of a file. For example, here is how you can convert an already text-mode file into a binary-mode file:
f = open('something.txt', 'rt')   # Text-mode file
fb = f.detach()                   # Detach underlying binary-mode file
data = fb.read()                  # Returns bytes
The exact type of object returned by open() depends on the combination of file mode and buffering options provided. However, the resulting file object supports the methods in Table 9.4.
Method | Description
---|---
f.readable() | Returns True if the file can be read.
f.read([n]) | Reads at most n bytes.
f.readline([n]) | Reads a single line of input up to n characters. If n is omitted, reads the entire line.
f.readlines([size]) | Reads all the lines and returns a list. size optionally specifies the approximate number of characters to read before stopping.
f.readinto(buffer) | Reads data into a memory buffer.
f.writable() | Returns True if the file can be written.
f.write(s) | Writes string s.
f.writelines(lines) | Writes all strings in iterable lines.
f.close() | Closes the file.
f.seekable() | Returns True if the file supports random-access seeking.
f.tell() | Returns the current file pointer.
f.seek(offset [, whence]) | Seeks to a new file position.
f.isatty() | Returns True if f is an interactive terminal.
f.flush() | Flushes the output buffers.
f.truncate([size]) | Truncates the file to at most size bytes.
f.fileno() | Returns an integer file descriptor.
The readable(), writable(), and seekable() methods test for supported file capabilities and modes. The read() method returns the entire file as a string unless an optional length parameter specifies the maximum number of characters. The readline() method returns the next line of input, including the terminating newline; the readlines() method returns all the input as a list of strings. The readline() method optionally accepts a maximum line length, n. If a line longer than n characters is read, the first n characters are returned. The remaining line data is not discarded and will be returned on subsequent read operations. The readlines() method accepts a size parameter that specifies the approximate number of characters to read before stopping. The actual number of characters read may be larger than this, depending on how much data has already been buffered. The readinto() method is used to avoid memory copies and is discussed later.
read() and readline() indicate end-of-file (EOF) by returning an empty string. Thus, the following code shows how you can detect an EOF condition:
while True:
    line = file.readline()
    if not line:    # EOF
        break
    statements
    ...
You could also write this code as follows:
while (line := file.readline()):
    statements
    ...
A convenient way to read all lines in a file is to use iteration with a for loop:
for line in file:   # Iterate over all lines in the file
    # Do something with line
    ...
The write() method writes data to the file, and the writelines() method writes an iterable of strings to the file. write() and writelines() do not add newline characters to the output, so all output that you produce should already include all necessary formatting.
Internally, each open file object keeps a file pointer that stores the byte offset at which the next read or write operation will occur. The tell() method returns the current value of the file pointer. The seek(offset [, whence]) method is used to randomly access parts of a file, given an integer offset and a placement rule in whence. If whence is os.SEEK_SET (the default), seek() assumes that offset is relative to the start of the file; if whence is os.SEEK_CUR, the position is moved relative to the current position; and if whence is os.SEEK_END, the offset is taken from the end of the file.
The fileno() method returns the integer file descriptor for a file and is sometimes used in low-level I/O operations in certain library modules. For example, the fcntl module uses the file descriptor to provide low-level file control operations on UNIX systems.
The readinto() method is used to perform zero-copy I/O into contiguous memory buffers. It is most commonly used in combination with specialized libraries such as numpy—for example, to read data directly into the memory allocated for a numeric array.
File objects also have the read-only data attributes shown in Table 9.5.
Attribute | Description
---|---
f.closed | Boolean value indicates the file state: False if the file is open, True if closed.
f.mode | The I/O mode for the file.
f.name | Name of the file if created using open(). Otherwise, a string indicating the source of the file.
f.newlines | The newline representation actually found in the file. The value is either None if no newlines have been encountered, a string containing '\n', '\r', or '\r\n', or a tuple containing all the different newline encodings seen.
f.encoding | A string that indicates file encoding, if any (for example, 'latin-1' or 'utf-8'). The value is None if no encoding is being used.
f.errors | The error handling policy.
f.write_through | Boolean value indicating if writes on a text file pass data directly to the underlying binary-level file without buffering.
The interpreter provides three standard file-like objects, known as standard input, standard output, and standard error, available as sys.stdin, sys.stdout, and sys.stderr, respectively. stdin is a file object corresponding to the stream of input characters supplied to the interpreter, stdout is the file object that receives output produced by print(), and stderr is a file that receives error messages. More often than not, stdin is mapped to the user's keyboard, whereas stdout and stderr produce text on screen.
The methods described in the preceding section can be used to perform I/O with the user. For example, the following code writes to standard output and reads a line of input from standard input:
import sys
sys.stdout.write("Enter your name : ")
name = sys.stdin.readline()
Alternatively, the built-in function input(prompt) can read a line of text from stdin and optionally print a prompt:
name = input("Enter your name : ")
Lines read by input() do not include the trailing newline. This is different than reading directly from sys.stdin, where newlines are included in the input text.
If necessary, the values of sys.stdout, sys.stdin, and sys.stderr can be replaced with other file objects, in which case the print() and input() functions will use the new values. Should it ever be necessary to restore the original value of sys.stdout, it should be saved first. The original values of sys.stdout, sys.stdin, and sys.stderr at interpreter startup are also available in sys.__stdout__, sys.__stdin__, and sys.__stderr__, respectively.
To get a directory listing, use the os.listdir(pathname) function. For example, here is how to print out a list of filenames in a directory:
import os
names = os.listdir('dirname')
for name in names:
    print(name)
Names returned by listdir() are normally decoded according to the encoding returned by sys.getfilesystemencoding(). If you specify the initial path as bytes, the filenames are returned as undecoded byte strings. For example:
import os

# Return raw undecoded names
names = os.listdir(b'dirname')
A useful operation related to directory listing is matching filenames according to a pattern, known as globbing. The pathlib
module can be used for this purpose. For example, here is how to match all *.txt
files in a specific directory:
import pathlib

for filename in pathlib.Path(dirname).glob('*.txt'):
    print(filename)
If you use rglob()
instead of glob()
, it will recursively search all subdirectories for filenames that match the pattern. Both the glob()
and rglob()
functions return a generator that produces the filenames via iteration.
The print() function
To print a series of values separated by spaces, supply them all to print()
like this:
print('The values are', x, y, z)
To suppress or change the line ending, use the end
keyword argument:
# Suppress the newline
print('The values are', x, y, z, end='')
To redirect the output to a file, use the file
keyword argument:
# Redirect to file object f
print('The values are', x, y, z, file=f)
To change the separator character between items, use the sep
keyword argument:
# Put commas between the values
print('The values are', x, y, z, sep=',')
Working directly with files is most familiar to programmers. However, generator functions can also be used to emit an I/O stream as a sequence of data fragments. To do this, use the yield
statement as you would use a write()
or print()
function. Here is an example:
def countdown(n):
    while n > 0:
        yield f'T-minus {n}\n'
        n -= 1
    yield 'Kaboom!\n'
Producing an output stream in this manner provides flexibility because it is decoupled from the code that actually directs the stream to its intended destination. For example, if you want to route the above output to a file f
, you can do this:
lines = countdown(5)
f.writelines(lines)
If, instead, you want to redirect the output across a socket s
, you can do this:
for chunk in lines:
    s.sendall(chunk)
Or, if you simply want to capture all of the output in a single string, you can do this:
out = ''.join(lines)
More advanced applications can use this approach to implement their own I/O buffering. For example, a generator could be emitting small text fragments, but another function would then collect the fragments into larger buffers to create a more efficient single I/O operation.
# count is a generator producing text fragments
chunks = []
buffered_size = 0
for chunk in count:
    chunks.append(chunk)
    buffered_size += len(chunk)
    if buffered_size >= MAXBUFFERSIZE:
        outf.write(''.join(chunks))
        chunks.clear()
        buffered_size = 0
outf.write(''.join(chunks))
For programs that are routing output to files or network connections, a generator approach can also result in a significant reduction in memory use because the entire output stream can often be generated and processed in small fragments, as opposed to being first collected into one large output string or list of strings.
For programs that consume fragmentary input, enhanced generators can be useful for decoding protocols and other facets of I/O. Here is an example of an enhanced generator that receives byte fragments and assembles them into lines:
def line_receiver():
    data = bytearray()
    line = None
    linecount = 0
    while True:
        part = yield line
        linecount += part.count(b'\n')
        data.extend(part)
        if linecount > 0:
            index = data.index(b'\n')
            line = bytes(data[:index+1])
            data = data[index+1:]
            linecount -= 1
        else:
            line = None
In this example, a generator has been programmed to receive byte fragments that are collected into a byte array. If the array contains a newline, a line is extracted and returned. Otherwise, None
is returned. Here’s an example illustrating how it works:
>>> r = line_receiver()
>>> r.send(None)   # Necessary first step
>>> r.send(b'hello')
>>> r.send(b'world\nit ')
b'hello world\n'
>>> r.send(b'works!')
>>> r.send(b'\n')
b'it works!\n'
>>>
An interesting side effect of this approach is that it externalizes the actual I/O operations that must be performed to get the input data. Specifically, the implementation of line_receiver()
contains no I/O operations at all! This means that it could be used in different contexts. For example, with sockets:
r = line_receiver()
data = None
while True:
    while not (line := r.send(data)):
        data = sock.recv(8192)
    # Process the line
    ...
or with files:
r = line_receiver()
data = None
while True:
    while not (line := r.send(data)):
        data = file.read(10000)
    # Process the line
    ...
or even in asynchronous code:
async def reader(ch):
    r = line_receiver()
    data = None
    while True:
        while not (line := r.send(data)):
            data = await ch.receive(8192)
        # Process the line
        ...
Sometimes it’s necessary to serialize the representation of an object so it can be transmitted over the network, saved to a file, or stored in a database. One way to do this is to convert data into a standard encoding such as JSON or XML. There is also a common Python-specific data serialization format called Pickle.
The pickle
module serializes an object into a stream of bytes that can be used to reconstruct the object at a later point in time. The interface to pickle
is simple, consisting of two operations, dump()
and load()
. For example, the following code writes an object to a file:
import pickle

obj = SomeObject()
with open(filename, 'wb') as file:
    pickle.dump(obj, file)   # Save object to file
To restore the object, use:
with open(filename, 'rb') as file:
    obj = pickle.load(file)   # Restore the object
The data format used by pickle
has its own framing of records. Thus, a sequence of objects can be saved by issuing a series of dump()
operations one after the other. To restore these objects, simply use a similar sequence of load()
operations.
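For example, here is a small sketch using an in-memory byte stream (io.BytesIO standing in for a real file):

```python
import io
import pickle

# Save a sequence of objects with repeated dump() calls
file = io.BytesIO()
for obj in [1, 'two', [3, 4]]:
    pickle.dump(obj, file)

# Restore them with a matching sequence of load() calls
file.seek(0)
restored = [pickle.load(file) for _ in range(3)]
print(restored)   # [1, 'two', [3, 4]]
```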
For network programming, it is common to use pickle to create byte-encoded messages. To do that, use dumps()
and loads()
. Instead of reading/writing data to a file, these functions work with byte strings.
obj = SomeObject()

# Turn an object into bytes
data = pickle.dumps(obj)
...

# Turn bytes back into an object
obj = pickle.loads(data)
It is not normally necessary for user-defined objects to do anything extra to work with pickle
. However, certain kinds of objects can’t be pickled. These tend to be objects that incorporate runtime state—open files, threads, closures, generators, and so on. To handle these tricky cases, a class can define the special methods __getstate__()
and __setstate__()
.
The __getstate__()
method, if defined, will be called to create a value representing the state of an object. The value returned by __getstate__()
is typically a string, tuple, list, or dictionary. The __setstate__()
method receives this value during unpickling and should restore the state of an object from it.
When encoding an object, pickle
does not include the underlying source code itself. Instead, it encodes a name reference to the defining class. When unpickling, this name is used to perform a source-code lookup on the system. For unpickling to work, the recipient of a pickle must have the proper source code already installed. It is also important to emphasize that pickle
is inherently insecure—unpickling untrusted data is a known vector for remote code execution. Thus, pickle
should only be used if you can completely secure the runtime environment.
A fundamental aspect of I/O is the concept of blocking. By its very nature, I/O is connected to the real world. It often involves waiting for input or devices to be ready. For example, code that reads data on the network might perform a receive operation on a socket like this:
data = sock.recv(8192)
When this statement executes, it might return immediately if data is available. However, if that’s not the case, it will stop—waiting for data to arrive. This is blocking. While the program is blocked, nothing else happens.
For a data analysis script or a simple program, blocking is not something that you worry about. However, if you want your program to do something else while an operation is blocked, you will need to take a different approach. This is the fundamental problem of concurrency—having a program work on more than one thing at a time. One common problem is having a program read on two or more different network sockets at the same time:
def reader1(sock):
    while (data := sock.recv(8192)):
        print('reader1 got:', data)

def reader2(sock):
    while (data := sock.recv(8192)):
        print('reader2 got:', data)

# Problem: How to make reader1() and reader2()
# run at the same time?
The rest of this section outlines a few different approaches to solving this problem. However, it is not meant to be a full tutorial on concurrency. For that, you will need to consult other resources.
One approach to avoiding blocking is to use so-called nonblocking I/O. This is a special mode that has to be enabled—for example, on a socket:
sock.setblocking(False)
Once enabled, an exception will now be raised if an operation would have blocked. For example:
try:
    data = sock.recv(8192)
except BlockingIOError as e:
    # No data is available
    ...
In response to a BlockingIOError
, the program could elect to work on something else. It could retry the I/O operation later to see if any data has arrived. For example, here’s how you might read on two sockets at once:
def reader1(sock):
    try:
        data = sock.recv(8192)
        print('reader1 got:', data)
    except BlockingIOError:
        pass

def reader2(sock):
    try:
        data = sock.recv(8192)
        print('reader2 got:', data)
    except BlockingIOError:
        pass

def run(sock1, sock2):
    sock1.setblocking(False)
    sock2.setblocking(False)
    while True:
        reader1(sock1)
        reader2(sock2)
In practice, relying only on nonblocking I/O is clumsy and inefficient. For example, the core of this program is the run()
function at the end. It will run in an inefficient busy loop as it constantly tries to read on the sockets. This works, but it is not a good design.
Instead of relying upon exceptions and spinning, it is possible to poll I/O channels to see if data is available. The select
or selectors
module can be used for this purpose. For example, here’s a slightly modified version of the run()
function:
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE

def run(sock1, sock2):
    selector = DefaultSelector()
    selector.register(sock1, EVENT_READ, data=reader1)
    selector.register(sock2, EVENT_READ, data=reader2)

    # Wait for something to happen
    while True:
        for key, evt in selector.select():
            func = key.data
            func(key.fileobj)
In this code, the loop dispatches either the reader1()
or reader2()
function as a callback whenever I/O is detected on the appropriate socket. The selector.select()
operation itself blocks, waiting for I/O to occur. Thus, unlike the previous example, it won’t make the CPU furiously spin.
This approach to I/O is the foundation of many so-called “async” frameworks such as asyncio
, although you usually don’t see the inner workings of the event loop.
In the last two examples, concurrency required the use of a special run()
function to drive the calculation. As an alternative, you can use thread programming and the threading
module. Think of a thread as an independent task that runs inside your program. Here is an example of code that reads data on two sockets at once:
import threading

def reader1(sock):
    while (data := sock.recv(8192)):
        print('reader1 got:', data)

def reader2(sock):
    while (data := sock.recv(8192)):
        print('reader2 got:', data)

# Create the threads
t1 = threading.Thread(target=reader1, args=[sock1])
t2 = threading.Thread(target=reader2, args=[sock2])

# Start the threads
t1.start()
t2.start()

# Wait for the threads to finish
t1.join()
t2.join()
In this program, the reader1()
and reader2()
functions execute concurrently. This is managed by the host operating system, so you don’t need to know much about how it works. If a blocking operation occurs in one thread, it does not affect the other thread.
The subject of thread programming is, in its entirety, beyond the scope of this book. However, a few additional examples are provided in the threading
module section later in this chapter.
asyncio
The asyncio
module provides a concurrency implementation alternative to threads. Internally, it’s based on an event loop that uses I/O polling. However, the high-level programming model looks very similar to threads through the use of special async
functions. Here is an example:
import asyncio

async def reader1(sock):
    loop = asyncio.get_event_loop()
    while (data := await loop.sock_recv(sock, 8192)):
        print('reader1 got:', data)

async def reader2(sock):
    loop = asyncio.get_event_loop()
    while (data := await loop.sock_recv(sock, 8192)):
        print('reader2 got:', data)

async def main(sock1, sock2):
    loop = asyncio.get_event_loop()
    t1 = loop.create_task(reader1(sock1))
    t2 = loop.create_task(reader2(sock2))

    # Wait for the tasks to finish
    await t1
    await t2
    ...

# Run it
asyncio.run(main(sock1, sock2))
Full details of using asyncio
would require its own dedicated book. What you should know is that many libraries and frameworks advertise support for asynchronous operation. Usually that means that concurrent execution is supported through asyncio
or a similar module. Much of the code is likely to involve async functions and related features.
A large number of standard library modules are used for various I/O-related tasks. This section provides a brief overview of the most commonly used modules, along with a few examples of common programming tasks involving each. Complete reference material can be found online or in an IDE and is not repeated here; the main purpose of this section is to point you in the right direction by naming the modules you should be using.
Many of the examples are shown as interactive Python sessions. These are experiments that you are encouraged to try yourself.
asyncio Module
The asyncio
module provides support for concurrent I/O operations using I/O polling and an underlying event loop. Its primary use is in code involving networks and distributed systems. Here is an example of a TCP echo server using low-level sockets:
import asyncio
from socket import *

async def echo_server(address):
    loop = asyncio.get_event_loop()
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    sock.setblocking(False)
    print('Server listening at', address)
    with sock:
        while True:
            client, addr = await loop.sock_accept(sock)
            print('Connection from', addr)
            loop.create_task(echo_client(loop, client))

async def echo_client(loop, client):
    with client:
        while True:
            data = await loop.sock_recv(client, 10000)
            if not data:
                break
            await loop.sock_sendall(client, b'Got:' + data)
    print('Connection closed')

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.create_task(echo_server(('', 25000)))
    loop.run_forever()
To test this code, use a program such as nc
or telnet
to connect to port 25000 on your machine. The code should echo back the text that you type. If you connect more than once using multiple terminal windows, you’ll find that the code can handle all of the connections concurrently.
Most applications using asyncio
will probably operate at a higher level than sockets. However, in such applications, you will still have to make use of special async
functions and interact with the underlying event loop in some manner.
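For instance, the standard asyncio streams API (asyncio.start_server() and asyncio.open_connection()) hides the socket details shown above. Here is a self-contained sketch that starts an echo server on an OS-assigned port and talks to it with a client in the same program:

```python
import asyncio

async def handle_echo(reader, writer):
    # Echo received data back to the client with a prefix
    while (data := await reader.read(1000)):
        writer.write(b'Got:' + data)
        await writer.drain()
    writer.close()

async def main():
    # Port 0 asks the OS for any free port
    server = await asyncio.start_server(handle_echo, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]

    # Client side: connect, send one message, read the reply
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    writer.write(b'hello')
    await writer.drain()
    reply = await reader.read(1000)
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    return reply

reply = asyncio.run(main())
print(reply)
```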
binascii Module
The binascii
module has functions for converting binary data into various text-based representations such as hexadecimal and base64. For example:
>>> import binascii
>>> binascii.b2a_hex(b'hello')
b'68656c6c6f'
>>> binascii.a2b_hex(_)
b'hello'
>>> binascii.b2a_base64(b'hello')
b'aGVsbG8=\n'
>>> binascii.a2b_base64(_)
b'hello'
>>>
Similar functionality can be found in the base64
module as well as with the hex()
and fromhex()
methods of bytes
. For example:
>>> a = b'hello'
>>> a.hex()
'68656c6c6f'
>>> bytes.fromhex(_)
b'hello'
>>> import base64
>>> base64.b64encode(a)
b'aGVsbG8='
>>>
cgi Module
So, let’s say you just want to put a basic form on your website. Perhaps it’s a sign-up form for your weekly “Cats and Categories” newsletter. Sure, you could install the latest web framework and spend all of your time fiddling with it. Or, you could just write a basic CGI script—old-school style. The cgi
module is for doing just that.
Suppose you have the following form fragment on a webpage:
<form method="POST" action="cgi-bin/register.py">
  <p>
    To register, please provide a contact name and email address.
  </p>
  <div>
    <input name="name" type="text">Your name:</input>
  </div>
  <div>
    <input name="email" type="email">Your email:</input>
  </div>
  <div class="modal-footer justify-content-center">
    <input type="submit" name="submit" value="Register"></input>
  </div>
</form>
Here’s a CGI script that receives the form data on the other end:
#!/usr/bin/env python
import cgi

try:
    form = cgi.FieldStorage()
    name = form.getvalue('name')
    email = form.getvalue('email')

    # Validate the responses and do whatever
    ...

    # Produce an HTML result (or redirect)
    print("Status: 302 Moved\r\n")
    print("Location: https://www.mywebsite.com/thanks.html\r\n")
    print("\r\n")
except Exception as e:
    print("Status: 501 Error\r\n")
    print("Content-type: text/plain\r\n")
    print("\r\n")
    print("Some kind of error occurred.\r\n")
Will writing such a CGI script get you a job at an Internet startup? Probably not. Will it solve your actual problem? Likely.
configparser Module
INI files are a common format for encoding program configuration information in a human-readable form. Here is an example:
# config.ini
; A comment
[section1]
name1 = value1
name2 = value2

[section2]
; Alternative syntax
name1: value1
name2: value2
The configparser
module is used to read .ini
files and extract values. Here’s a basic example:
import configparser

# Create a config parser and read a file
cfg = configparser.ConfigParser()
cfg.read('config.ini')

# Extract values
a = cfg.get('section1', 'name1')
b = cfg.get('section2', 'name2')
...
More advanced functionality is also available, including string interpolation features, the ability to merge multiple .ini
files, supply default values, and more. Consult the official documentation for more examples.
csv Module
The csv
module is used to read/write files of comma-separated values (CSV) produced by programs such as Microsoft Excel or exported from a database. To use it, open a file and then wrap an extra layer of CSV encoding/decoding around it. For example:
import csv

# Read a CSV file into a list of tuples
def read_csv_data(filename):
    with open(filename) as file:
        rows = csv.reader(file)
        # First line is often a header. This reads it
        headers = next(rows)
        # Now read the rest of the data
        for row in rows:
            # Do something with row
            ...

# Write Python data to a CSV file
def write_csv_data(filename, headers, rows):
    with open(filename, 'w', newline='') as file:
        out = csv.writer(file)
        out.writerow(headers)
        out.writerows(rows)
An often used convenience is to use a DictReader()
instead. This interprets the first line of a CSV file as headers and returns each row as a dictionary instead of a tuple.
import csv

def find_nearby(filename):
    with open(filename) as file:
        rows = csv.DictReader(file)
        for row in rows:
            lat = float(row['latitude'])
            lon = float(row['longitude'])
            if close_enough(lat, lon):
                print(row)
The csv
module doesn’t do much with CSV data other than reading or writing it. The main benefit provided is that the module knows how to properly encode/decode the data and handles a lot of edge cases involving quotation, special characters, and other details. This is a module you might use to write simple scripts for cleaning or preparing data to be used with other programs. If you want to perform data analysis tasks with CSV data, consider using a third-party package such as the popular pandas
library.
errno Module
Whenever a system-level error occurs, Python reports it with an exception that’s a subclass of OSError
. Some of the more common kinds of system errors are represented by separate subclasses of OSError
such as PermissionError
or FileNotFoundError
. However, there are hundreds of other errors that could occur in practice. For these, any OSError
exception carries a numeric errno
attribute that can be inspected. The errno
module provides symbolic constants corresponding to these error codes. They are often used when writing specialized exception handlers. For example, here is an exception handler that checks for no space remaining on a device:
import errno

def write_data(file, data):
    try:
        file.write(data)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            print("You're out of disk space!")
        else:
            raise   # Some other error. Propagate
fcntl Module
The fcntl
module is used to perform low-level I/O control operations on UNIX using the fcntl()
and ioctl()
system calls. This is also the module to use if you want to perform any kind of file locking—a problem that sometimes arises in the context of concurrency and distributed systems. Here is an example of opening a file in combination with mutual exclusion locking across all processes using fcntl.flock()
:
import fcntl

with open("somefile", "r") as file:
    try:
        fcntl.flock(file.fileno(), fcntl.LOCK_EX)
        # Use the file
        ...
    finally:
        fcntl.flock(file.fileno(), fcntl.LOCK_UN)
hashlib Module
The hashlib
module provides functions for computing cryptographic hash values such as MD5 and SHA-1. The following example illustrates how to use the module:
>>> import hashlib
>>> h = hashlib.new('sha256')
>>> h.update(b'Hello')   # Feed data
>>> h.update(b'World')
>>> h.digest()
b'\xa5\x91\xa6\xd4\x0b\xf4 @J\x01\x173\xcf\xb7\xb1\x90\xd6,e\xbf\x0b\xcd\xa3+W\xb2w\xd9\xad\x9f\x14n'
>>> h.hexdigest()
'a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e'
>>> h.digest_size
32
>>>
http Package
The http
package contains a large amount of code related to the low-level implementation of the HTTP internet protocol. It can be used to implement both servers and clients. However, most of this package is considered legacy and too low-level for day-to-day work. Serious programmers working with HTTP are more likely to use third-party libraries such as requests
, httpx
, Django
, flask
, and others.
Nevertheless, one useful easter egg of the http
package is the ability for Python to run a standalone web server. Go to a directory with a collection of files and type the following:
$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
Now, Python will serve the files to your browser if you point it at the right port. You wouldn’t use this to run a website, but it can be useful for testing and debugging programs related to the web. For example, the author has used this to locally test programs involving a mix of HTML, Javascript, and WebAssembly.
io Module
The io
module primarily contains the definitions of classes used to implement the file objects as returned by the open()
function. It is not so common to access those classes directly. However, the module also contains a pair of classes that are useful for “faking” a file in the form of strings and bytes. This can be useful for testing and other applications where you need to provide a “file” but have obtained data in a different way.
The StringIO()
class provides a file-like interface on top of strings. For example, here is how you can write output to a string:
# Function that expects a file
def greeting(file):
    file.write('Hello\n')
    file.write('World\n')

# Call the function using a real file
with open('out.txt', 'w') as file:
    greeting(file)

# Call the function with a "fake" file
import io
file = io.StringIO()
greeting(file)

# Get the resulting output
output = file.getvalue()
Similarly, you can create a StringIO
object and use it for reading:
file = io.StringIO('hello\nworld\n')
while (line := file.readline()):
    print(line, end='')
The BytesIO()
class serves a similar purpose but is used for emulating binary I/O with bytes.
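A minimal sketch of BytesIO in both directions:

```python
import io

# Writing: BytesIO behaves like a file opened in 'wb' mode
file = io.BytesIO()
file.write(b'hello ')
file.write(b'world')
data = file.getvalue()

# Reading: it also behaves like a file opened in 'rb' mode
file = io.BytesIO(b'line1\nline2\n')
lines = file.readlines()

print(data)    # b'hello world'
print(lines)   # [b'line1\n', b'line2\n']
```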
json Module
The json
module can be used to encode and decode data in the JSON format, commonly used in the APIs of microservices and web applications. There are two basic functions for converting data, dumps()
and loads()
. dumps()
takes a Python dictionary and encodes it as a JSON Unicode string:
>>> import json
>>> data = { 'name': 'Mary A. Python', 'email': '[email protected]' }
>>> s = json.dumps(data)
>>> s
'{"name": "Mary A. Python", "email": "[email protected]"}'
>>>
The loads()
function goes in the other direction:
>>> d = json.loads(s)
>>> d == data
True
>>>
Both the dumps()
and loads()
functions have many options for controlling aspects of the conversion, as well as for interfacing with Python class instances. That's beyond the scope of this section, but copious information is available in the official documentation.
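As a small taste, the indent and sort_keys options of dumps() produce pretty-printed output, which decodes back to the same dictionary:

```python
import json

data = {'name': 'ACME', 'shares': 50, 'price': 490.1}

# indent pretty-prints; sort_keys puts keys in alphabetical order
s = json.dumps(data, indent=2, sort_keys=True)
print(s)

# The formatted text decodes back to the same dictionary
restored = json.loads(s)
print(restored == data)   # True
```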
logging Module
The logging
module is the de facto standard module used for reporting program diagnostics and for print-style debugging. It can be used to route output to a log file and provides a large number of configuration options. A common practice is to write code that creates a Logger
instance and issues messages on it like this:
import logging
log = logging.getLogger(__name__)

# Function that uses logging
def func(args):
    log.debug("A debugging message")
    log.info("An informational message")
    log.warning("A warning message")
    log.error("An error message")
    log.critical("A critical message")

# Configuration of logging (occurs once at program startup)
if __name__ == '__main__':
    logging.basicConfig(
        level=logging.WARNING,
        filename="output.log"
    )
There are five built-in levels of logging ordered by increasing severity. When configuring the logging system, you specify a level that acts as a filter. Only messages at that level or greater severity are reported. Logging provides a large number of configuration options, mostly related to the back-end handling of the log messages. Usually you don’t need to know about that when writing application code—you use debug()
, info()
, warning()
, and similar methods on some given Logger
instance. Any special configuration takes place during program startup in a special location (such as a main()
function or the main code block).
os Module
The os
module provides a portable interface to common operating-system functions, typically associated with the process environment, files, directories, permissions, and so forth. The programming interface closely follows C programming and standards such as POSIX.
Practically speaking, most of this module is probably too low-level to be directly used in a typical application. However, if you’re ever faced with the problem of executing some obscure low-level system operation (such as opening a TTY), there’s a good chance you’ll find the functionality for it here.
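A few representative operations look like this (the exact values are system dependent):

```python
import os

cwd = os.getcwd()          # current working directory
pid = os.getpid()          # process id of this interpreter
info = os.stat(cwd)        # low-level stat() record for a path

print(cwd)
print(pid)
print(oct(info.st_mode))   # file mode/permission bits
```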
os.path Module
The os.path
module is a legacy module for manipulating pathnames and performing common operations on the filesystem. Its functionality has been largely replaced by the newer pathlib
module, but since its use is still so widespread, you’ll continue to see it in a lot of code.
One fundamental problem solved by this module is portable handling of path separators on UNIX (forward-slash /
) and Windows (backslash \). Functions such as
os.path.join()
and os.path.split()
are often used to pull apart filepaths and put them back together:
>>> import os.path
>>> filename = '/Users/beazley/Desktop/old/data.csv'
>>> os.path.split(filename)
('/Users/beazley/Desktop/old', 'data.csv')
>>> os.path.join('/Users/beazley/Desktop', 'out.txt')
'/Users/beazley/Desktop/out.txt'
>>>
Here is an example of code that uses these functions:
import os.path

def clean_line(line):
    # Clean up a line (whatever)
    return line.strip().upper() + '\n'

def clean_data(filename):
    dirname, basename = os.path.split(filename)
    newname = os.path.join(dirname, basename + '.clean')
    with open(newname, 'w') as out_f:
        with open(filename, 'r') as in_f:
            for line in in_f:
                out_f.write(clean_line(line))
The os.path
module also has a number of functions, such as isfile()
, isdir()
, and getsize()
, for performing tests on the filesystem and getting file metadata. For example, this function returns the total size in bytes of a simple file or of all of files in a directory:
import os
import os.path

def compute_usage(filename):
    if os.path.isfile(filename):
        return os.path.getsize(filename)
    elif os.path.isdir(filename):
        return sum(compute_usage(os.path.join(filename, name))
                   for name in os.listdir(filename))
    else:
        raise RuntimeError('Unsupported file kind')
pathlib Module
The pathlib
module is the modern way of manipulating pathnames in a portable and high-level manner. It combines a lot of the file-oriented functionality in one place and uses an object-oriented interface. The core object is the Path
class. For example:
from pathlib import Path

filename = Path('/Users/beazley/old/data.csv')
Once you have an instance filename
of Path
, you can perform various operations on it to manipulate the filename. For example:
>>> filename.name
'data.csv'
>>> filename.parent
PosixPath('/Users/beazley/old')
>>> filename.parent / 'newfile.csv'
PosixPath('/Users/beazley/old/newfile.csv')
>>> filename.parts
('/', 'Users', 'beazley', 'old', 'data.csv')
>>> filename.with_suffix('.csv.clean')
PosixPath('/Users/beazley/old/data.csv.clean')
>>>
Path
instances also have methods for obtaining file metadata, getting directory listings, and similar tasks. Here is a reimplementation of the compute_usage()
function from the previous section:
import pathlib

def compute_usage(filename):
    pathname = pathlib.Path(filename)
    if pathname.is_file():
        return pathname.stat().st_size
    elif pathname.is_dir():
        return sum(path.stat().st_size
                   for path in pathname.rglob('*')
                   if path.is_file())
    else:
        raise RuntimeError('Unsupported file kind')
re Module
The re
module is used to perform text matching, searching, and replacement operations using regular expressions. Here is a simple example:
>>> text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
>>> # Find all occurrences of a date
>>> import re
>>> re.findall(r'\d+/\d+/\d+', text)
['3/27/2018', '3/28/2018']
>>> # Replace all occurrences of a date with replacement text
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2018-3-27. Tomorrow is 2018-3-28.'
>>>
Regular expressions are notorious for their inscrutable syntax. In this example, \d+
is interpreted to mean “one or more digits.” More information about the pattern syntax can be found in the official documentation for the re
module.
shutil Module
The shutil
module is used to carry out some common tasks that you might otherwise perform in the shell. These include copying and removing files, working with archives, and so forth. For example, to copy a file:
import shutil
shutil.copy(srcfile, dstfile)
To move a file:
shutil.move(srcfile, dstfile)
To copy a directory tree:
shutil.copytree(srcdir, dstdir)
To remove a directory tree:
shutil.rmtree(pathname)
The shutil
module is often used as a safer and more portable alternative to directly executing shell commands with the os.system()
function.
select Module
The select
module is used for simple polling of multiple I/O streams. That is, it can be used to watch a collection of file descriptors for incoming data or for the ability to receive outgoing data. The following example shows typical usage:
import select

# Collections of objects representing file descriptors. Must be
# integers or objects with a fileno() method.
want_to_read = [ ... ]
want_to_write = [ ... ]
check_exceptions = [ ... ]

# Timeout (or None)
timeout = None

# Poll for I/O
can_read, can_write, have_exceptions = \
    select.select(want_to_read, want_to_write, check_exceptions, timeout)

# Perform I/O operations
for file in can_read:
    do_read(file)

for file in can_write:
    do_write(file)

# Handle exceptions
for file in have_exceptions:
    handle_exception(file)
In this code, three sets of file descriptors are constructed. These sets correspond to reading, writing, and exceptions. These are passed to select()
along with an optional timeout. select()
returns three subsets of the passed arguments. These subsets represent the files on which the requested operation can be performed. For example, a file returned in can_read
has incoming data pending.
The select()
function is a standard low-level system call that’s commonly used to watch for system events and to implement asynchronous I/O frameworks such as the built-in asyncio
module.
In addition to select()
, the select
module also exposes poll()
, epoll()
, kqueue()
, and similar variants that provide related functionality. The availability of these functions varies by operating system.
The selectors
module provides a higher-level interface to select
that might be useful in certain contexts. An example was given earlier in Section 9.14.2.
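To give a flavor of the higher-level interface, here is a small self-contained sketch using selectors; a connected socket pair stands in for real network clients:

```python
import selectors
import socket

sel = selectors.DefaultSelector()
left, right = socket.socketpair()       # Connected pair of sockets
sel.register(right, selectors.EVENT_READ)

left.send(b'ping')      # Make data available on the right-hand socket

# sel.select() returns a list of (key, events) pairs for ready objects
for key, events in sel.select(timeout=1):
    data = key.fileobj.recv(100)
    print('Got:', data)

sel.unregister(right)
left.close()
right.close()
```

DefaultSelector automatically picks the most efficient mechanism available on the host platform (epoll, kqueue, and so on).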
smtplib Module

The smtplib module implements the client side of SMTP, commonly used to send email messages. A common use of the module is in a script that does just that: it sends an email to someone. Here is an example:
import smtplib

fromaddr = "[email protected]"
toaddrs = ["[email protected]"]
amount = 123.45
msg = f"""From: {fromaddr}

Pay {amount} bitcoin or else. We're watching.
"""

server = smtplib.SMTP('localhost')
server.sendmail(fromaddr, toaddrs, msg)
server.quit()
There are additional features to handle passwords, authentication, and other matters. However, if you’re running a script on a machine and that machine is configured to support email, the above example will usually do the job.
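For servers that require encryption and a login, the usual pattern adds starttls() and login() calls. A sketch follows; the host, port, addresses, and credentials are all hypothetical, and the actual send is commented out so the example stands alone:

```python
import smtplib
from email.message import EmailMessage

# Build a well-formed message with proper headers
msg = EmailMessage()
msg['From'] = 'sender@example.com'       # Hypothetical addresses
msg['To'] = 'recipient@example.com'
msg['Subject'] = 'Greetings'
msg.set_content('Hello there.')

# Hypothetical server and credentials -- adjust for your environment
# server = smtplib.SMTP('smtp.example.com', 587)
# server.starttls()                  # Upgrade the connection to TLS
# server.login('user', 'password')   # Authenticate
# server.send_message(msg)
# server.quit()
```

Using EmailMessage instead of a hand-formatted string avoids subtle header-formatting mistakes.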
socket Module

The socket module provides low-level access to network programming functions. The interface is modeled after the standard BSD socket interface commonly associated with system programming in C.
The following example shows how to make an outgoing connection and receive a response:
from socket import socket, AF_INET, SOCK_STREAM

sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('python.org', 80))
sock.send(b'GET /index.html HTTP/1.0\r\n\r\n')

parts = []
while True:
    part = sock.recv(10000)
    if not part:
        break
    parts.append(part)
response = b''.join(parts)
print(response)
The following example shows a basic echo server that accepts client connections and echoes back any received data. To test this server, run it and then connect to it using a command such as telnet localhost 25000
or nc localhost 25000
in a separate terminal session.
from socket import socket, AF_INET, SOCK_STREAM

def echo_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.bind(address)
    sock.listen(1)
    while True:
        client, addr = sock.accept()
        echo_handler(client, addr)

def echo_handler(client, addr):
    print('Connection from:', addr)
    with client:
        while True:
            data = client.recv(10000)
            if not data:
                break
            client.sendall(data)
    print('Connection closed')

if __name__ == '__main__':
    echo_server(('', 25000))
For UDP servers, there is no connection process. However, a server must still bind the socket to a known address. Here is a typical example of what a UDP server and client look like:
# udp.py

from socket import socket, AF_INET, SOCK_DGRAM

def run_server(address):
    sock = socket(AF_INET, SOCK_DGRAM)    # 1. Create a UDP socket
    sock.bind(address)                    # 2. Bind to address/port
    while True:
        msg, addr = sock.recvfrom(2000)   # 3. Get a message
        # ... do something
        response = b'world'
        sock.sendto(response, addr)       # 4. Send a response back

def run_client(address):
    sock = socket(AF_INET, SOCK_DGRAM)    # 1. Create a UDP socket
    sock.sendto(b'hello', address)        # 2. Send a message
    response, addr = sock.recvfrom(2000)  # 3. Get response
    print("Received:", response)
    sock.close()

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 4:
        raise SystemExit('Usage: udp.py [-client|-server] hostname port')
    address = (sys.argv[2], int(sys.argv[3]))
    if sys.argv[1] == '-server':
        run_server(address)
    elif sys.argv[1] == '-client':
        run_client(address)
struct Module

The struct module is used to convert data between Python and binary data structures, represented as Python byte strings. These data structures are often used when interacting with functions written in C, binary file formats, network protocols, or binary communication over serial ports.
As an example, suppose you need to construct a binary message with its format described by a C data structure:
# Message format: All values are "big endian"
struct Message {
    unsigned short msgid;      // 16 bit unsigned integer
    unsigned int sequence;     // 32 bit sequence number
    float x;                   // 32 bit float
    float y;                   // 32 bit float
}
Here’s how you do this using the struct
module:
>>> import struct
>>> data = struct.pack('>HIff', 123, 456, 1.23, 4.56)
>>> data
b'\x00{\x00\x00\x01\xc8?\x9dp\xa4@\x91\xeb\x85'
>>>
To decode binary data, use struct.unpack
:
>>> struct.unpack('>HIff', data)
(123, 456, 1.2300000190734863, 4.559999942779541)
>>>
The differences in the floating-point values are due to the loss of accuracy incurred by their conversion to 32-bit values. Python represents floating-point values as 64-bit double precision values.
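Two related helpers are worth knowing: struct.calcsize() reports the size in bytes of a format string, and struct.unpack_from() decodes a record at a given offset inside a larger buffer:

```python
import struct

fmt = '>HIff'
print(struct.calcsize(fmt))     # 14 bytes: 2 + 4 + 4 + 4

# Decode a record embedded at offset 4 inside a larger buffer
# (the 4-byte prefix is a made-up header for illustration)
buffer = b'\x00' * 4 + struct.pack(fmt, 123, 456, 1.23, 4.56)
msgid, sequence, x, y = struct.unpack_from(fmt, buffer, 4)
print(msgid, sequence)          # -> 123 456
```

unpack_from() avoids slicing the buffer, which matters when picking many records out of one large message.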
subprocess Module

The subprocess module is used to execute a separate program as a subprocess, but with control over the execution environment including I/O handling, termination, and so forth. There are two common uses of the module.
If you want to run a separate program and collect all of its output at once, use check_output()
. For example:
import subprocess

# Run the 'netstat -a' command and collect its output
try:
    out = subprocess.check_output(['netstat', '-a'])
except subprocess.CalledProcessError as e:
    print("It failed:", e)
The data returned by check_output()
is presented as bytes. If you want to convert it to text, make sure you apply a proper decoding:
text = out.decode('utf-8')
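The higher-level subprocess.run() function combines launching, waiting, output capture, and decoding in one call. A sketch that runs a child Python interpreter, so it works anywhere Python does:

```python
import subprocess
import sys

# Run a child Python process and capture its output as text
result = subprocess.run(
    [sys.executable, '-c', 'print("hello")'],
    capture_output=True,   # Collect stdout/stderr
    text=True,             # Decode bytes to str automatically
    check=True             # Raise CalledProcessError on nonzero exit
)
print(result.stdout)
```

The returned CompletedProcess object carries returncode, stdout, and stderr attributes.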
It is also possible to set up a pipe and to interact with a subprocess in a more detailed manner. To do that, use the Popen
class like this:
import subprocess

p = subprocess.Popen(['wc'],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)

# Send data to the subprocess
p.stdin.write(b'hello world\nthis is a test\n')
p.stdin.close()

# Read data back
out = p.stdout.read()
print(out)
An instance p
of Popen
has attributes stdin
and stdout
that can be used to communicate with the subprocess.
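For a one-shot exchange like this, Popen.communicate() is often safer: it writes the input, closes the pipe, and reads all output, avoiding the deadlocks that can occur with manual reads and writes on both pipes. Here, a small Python child process stands in for wc so the sketch is portable:

```python
import subprocess
import sys

# A child process that echoes stdin back to stdout
p = subprocess.Popen(
    [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdin.read())'],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE
)
out, err = p.communicate(b'hello world\n')
print(out)     # The echoed input
```

communicate() also accepts a timeout argument, after which it raises subprocess.TimeoutExpired.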
tempfile Module

The tempfile module provides support for creating temporary files and directories. Here is an example of creating a temporary file:
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b'Hello World')
    f.seek(0)
    data = f.read()
    print('Got:', data)
By default, temporary files are open in binary mode and allow both reading and writing. The with
statement is also commonly used to define a scope for when the file will be used. The file is deleted at the end of the with
block.
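If the file must be reopened by name (for example, by another program), tempfile.NamedTemporaryFile provides a real filename via its name attribute. A sketch:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'Hello World')
    path = f.name              # Real filename on the filesystem

# delete=False keeps the file so it can be reopened by name
# (needed on Windows, where an open temporary file can't be reopened)
with open(path, 'rb') as f:
    data = f.read()
print('Got:', data)

os.remove(path)                # Clean up explicitly
```

With delete=False, removing the file becomes your responsibility.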
If you would like to create a temporary directory, use this:
with tempfile.TemporaryDirectory() as dirname:
    # Use the directory dirname
    ...
As with a file, the directory and all of its contents will be deleted at the end of the with
block.
textwrap Module

The textwrap module can be used to format text to fit a specific terminal width. Perhaps it's a bit special-purpose, but it can sometimes be useful in cleaning up text for output when making reports. There are two functions of interest.
wrap()
takes text and wraps it to fit a specified column width. The function returns a list of strings. For example:
import textwrap

text = """look into my eyes
look into my eyes
the eyes the eyes the eyes
not around the eyes
don't look around the eyes
look into my eyes you're under
"""

wrapped = textwrap.wrap(text, width=40)
print('\n'.join(wrapped))

# Produces:
# look into my eyes look into my eyes the
# eyes the eyes the eyes not around the
# eyes don't look around the eyes look
# into my eyes you're under
The indent()
function can be used to indent a block of text. For example:
print(textwrap.indent(text, '    '))

# Produces:
#     look into my eyes
#     look into my eyes
#     the eyes the eyes the eyes
#     not around the eyes
#     don't look around the eyes
#     look into my eyes you're under
threading Module

The threading module is used to execute code concurrently. This problem commonly arises with I/O handling in network programs. Thread programming is a large topic, but the following examples illustrate solutions to common problems.
Here’s an example of launching a thread and waiting for it:
import threading
import time

def countdown(n):
    while n > 0:
        print('T-minus', n)
        n -= 1
        time.sleep(1)

t = threading.Thread(target=countdown, args=[10])
t.start()
t.join()    # Wait for the thread to finish
If you’re never going to wait for the thread to finish, make it daemonic by supplying an extra daemon
flag like this:
t = threading.Thread(target=countdown, args=[10], daemon=True)
If you want to make a thread terminate, you’ll need to do so explicitly with a flag or some dedicated variable for that purpose. The thread will have to be programmed to check for it.
import threading
import time

must_stop = False

def countdown(n):
    while n > 0 and not must_stop:
        print('T-minus', n)
        n -= 1
        time.sleep(1)
If threads are going to mutate shared data, protect it with a Lock
.
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            self.value += 1

    def decrement(self):
        with self.lock:
            self.value -= 1
If one thread must wait for another thread to do something, use an Event
.
import threading
import time

def step1(evt):
    print('Step 1')
    time.sleep(5)
    evt.set()

def step2(evt):
    evt.wait()
    print('Step 2')

evt = threading.Event()
threading.Thread(target=step1, args=[evt]).start()
threading.Thread(target=step2, args=[evt]).start()
If threads are going to communicate, use a Queue
:
import threading
import queue

def producer(q):
    for i in range(10):
        print('Producing:', i)
        q.put(i)
    print('Done')
    q.put(None)

def consumer(q):
    while True:
        item = q.get()
        if item is None:
            break
        print('Consuming:', item)
    print('Goodbye')

q = queue.Queue()
threading.Thread(target=producer, args=[q]).start()
threading.Thread(target=consumer, args=[q]).start()
time Module

The time module is used to access system time-related functions. The following selected functions are the most useful:
sleep(seconds)
Make Python sleep for a given number of seconds, specified as a floating-point number.
time()
Return the current system time in UTC as a floating-point number. This is the number of seconds since the epoch (usually January 1, 1970 for UNIX systems). Use localtime()
to convert it into a data structure suitable for extracting useful information.
localtime([secs])
Return a struct_time
object representing the local time on the system or the time represented by the floating-point value secs
passed as an argument. The resulting struct has attributes tm_year
, tm_mon
, tm_mday
, tm_hour
, tm_min
, tm_sec
, tm_wday
, tm_yday
, and tm_isdst
.
gmtime([secs])
The same as localtime()
except that the resulting structure represents the time in UTC (or Greenwich Mean Time).
ctime([secs])
Convert a time represented as seconds to a text string suitable for printing. Useful for debugging and logging.
asctime(tm)
Convert a time structure as represented by localtime()
into a text string suitable for printing.
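For example, here is how these functions fit together to break the current time into fields and format it:

```python
import time

now = time.time()           # Seconds since the epoch (float)
tm = time.localtime(now)    # Break it into struct_time fields
print(tm.tm_year, tm.tm_mon, tm.tm_mday)
print(time.asctime(tm))     # Human-readable form
```

The struct_time fields can also be unpacked by position, since the object behaves like a tuple.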
The datetime
module is more generally used for representing dates and times for the purpose of performing date-related computations and dealing with timezones.
urllib Package

The urllib package is used to make client-side HTTP requests. Perhaps the most useful function is urllib.request.urlopen(), which can be used to fetch simple webpages. For example:
>>> from urllib.request import urlopen
>>> u = urlopen('http://www.python.org')
>>> data = u.read()
>>>
If you want to encode form parameters, you can use urllib.parse.urlencode()
as shown here:
from urllib.parse import urlencode
from urllib.request import urlopen

form = {
    'name': 'Mary A. Python',
    'email': '[email protected]'
}
data = urlencode(form)
u = urlopen('http://httpbin.org/post', data.encode('utf-8'))
response = u.read()
The urlopen()
function works fine for basic webpages and APIs involving HTTP or HTTPS. However, it becomes quite awkward to use if access also involves cookies, advanced authentication schemes, and other layers. Frankly, most Python programmers would use a third-party library such as requests
or httpx
to handle these situations. You should too.
The urllib.parse
subpackage has additional functions for manipulating URLs themselves. For example, the urlparse()
function can be used to pull apart a URL:
>>> url = 'http://httpbin.org/get?name=Dave&n=42'
>>> from urllib.parse import urlparse
>>> urlparse(url)
ParseResult(scheme='http', netloc='httpbin.org', path='/get', params='',
            query='name=Dave&n=42', fragment='')
>>>
unicodedata Module

The unicodedata module is used for more advanced operations involving Unicode text strings. There are often multiple representations of the same Unicode text. For example, the character U+00F1 (ñ) might be fully composed as the single character U+00F1 or decomposed into the multicharacter sequence U+006E U+0303 (n, combining tilde). This can cause strange problems in programs that expect visually identical text strings to have identical representations. Consider the following example involving dictionary keys:
>>> d = {}
>>> d['Jalape\xf1o'] = 'spicy'
>>> d['Jalapen\u0303o'] = 'mild'
>>> d
{'Jalapeño': 'spicy', 'Jalapeño': 'mild'}
>>>
At first glance, this looks like it should be an operational error—how could a dictionary have two identical, yet separate, keys like that? The answer is found in the fact that the keys consist of different Unicode character sequences.
If consistent processing of identically rendered Unicode strings is an issue, they should be normalized. The unicodedata.normalize()
function can be used to ensure a consistent character representation. For example, unicodedata.normalize('NFC', s)
will make sure that all characters in s
are fully composed and not represented as a sequence of combining characters. Using unicodedata.normalize('NFD', s)
will make sure that all characters in s
are fully decomposed.
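For example, normalization makes the two representations of ñ compare as equal:

```python
import unicodedata

a = 'Jalape\xf1o'       # Composed: single character U+00F1
b = 'Jalapen\u0303o'    # Decomposed: 'n' + combining tilde U+0303

print(a == b)           # False: different code point sequences
print(unicodedata.normalize('NFC', a) ==
      unicodedata.normalize('NFC', b))    # True: same after NFC
```

Normalizing keys before inserting them into the dictionary above would have prevented the duplicate-looking entries.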
The unicodedata
module also has functions for testing character properties such as capitalization, numbers, and whitespace. General character properties can be obtained with the unicodedata.category(c)
function. For example, unicodedata.category('A')
returns 'Lu'
, signifying that the character is an uppercase letter. More information about these values can be found in the official Unicode character database at https://www.unicode.org/ucd
.
xml Package

The xml package is a large collection of modules for processing XML data in various ways. However, if your primary goal is to read an XML document and extract information from it, the easiest way to do it is to use the xml.etree subpackage. Suppose you had an XML document in a file recipe.xml like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<recipe>
  <title>Famous Guacamole</title>
  <description>A southwest favorite!</description>
  <ingredients>
    <item num="4"> Large avocados, chopped </item>
    <item num="1"> Tomato, chopped </item>
    <item num="1/2" units="C"> White onion, chopped </item>
    <item num="2" units="tbl"> Fresh squeezed lemon juice </item>
    <item num="1"> Jalapeno pepper, diced </item>
    <item num="1" units="tbl"> Fresh cilantro, minced </item>
    <item num="1" units="tbl"> Garlic, minced </item>
    <item num="3" units="tsp"> Salt </item>
    <item num="12" units="bottles"> Ice-cold beer </item>
  </ingredients>
  <directions>
    Combine all ingredients and hand whisk to desired consistency.
    Serve and enjoy with ice-cold beers.
  </directions>
</recipe>
Here’s how to extract specific elements from it:
from xml.etree.ElementTree import ElementTree

doc = ElementTree(file="recipe.xml")
title = doc.find('title')
print(title.text)

# Alternative (just get element text)
print(doc.findtext('description'))

# Iterate over multiple elements
for item in doc.findall('ingredients/item'):
    num = item.get('num')
    units = item.get('units', '')
    text = item.text.strip()
    print(f'{num} {units} {text}')
I/O is a fundamental part of writing any useful program. Given its popularity, Python is able to work with literally any data format, encoding, or document structure that’s in use. Although the standard library might not support it, you will almost certainly find a third-party module to solve your problem.
In the big picture, it may be more useful to think about the edges of your application. At the outer boundary between your program and reality, it's common to encounter issues related to data encoding. This is especially true for textual data and Unicode. Much of the complexity in Python's I/O handling—supporting different encodings, error handling policies, and so on—is aimed at this specific problem. It's also critical to keep in mind that textual data and binary data are strictly separated. Knowing what you're working with helps in understanding the big picture.
A secondary consideration in I/O is the overall evaluation model. Python code is currently separated into two worlds—normal synchronous code and asynchronous code usually associated with the asyncio
module (characterized by the use of async
functions and the async/await
syntax). Asynchronous code almost always requires using dedicated libraries that are capable of operating in that environment. This, in turn, forces your hand on writing your application code in the “async” style as well. Honestly, you should probably avoid asynchronous coding unless you absolutely know that you need it—and if you’re not really sure, then you almost certainly don’t. Most of the well-adjusted Python-speaking universe codes in a normal synchronous style that is far easier to reason about, debug, and test. You should choose that.