File IO

So far through this book, when our examples touch files, we've operated entirely on text files. Operating systems, however, actually represent files as a sequence of bytes, not text.

Because reading bytes and converting the data to text is one of the more common operations on files, Python wraps the incoming (or outgoing) stream of bytes with appropriate decode (or encode) calls so we can deal directly with str objects. This saves us a lot of boilerplate code to be constantly encoding and decoding text.

The open() function is used to open a file. For reading text from a file, we only need to pass the filename into the function. The file will be opened for reading, and the bytes will be converted to text using the platform default encoding. As with decode and encode on bytes and str objects, the open function can accept encoding and errors arguments to open a text file in a specific character encoding or to choose a specific replacement strategy for invalid bytes in that encoding. These are normally supplied to open as keyword arguments. For example, we can use the following code to read the contents of a text file in ASCII format, converting any unknown bytes using the replace strategy:

	file = open('filename', encoding='ascii', errors='replace')
	print(file.read())
	file.close()

Of course, we don't always want to read files; often we want to write data to them! The encoding and errors arguments can also be passed when writing text files. In addition, to open a file for writing, we need to pass a mode argument as the second positional argument, with a value of "w":

	contents = "an oft-repeated cliché"
	file = open("filename", "w", encoding="ascii", errors="replace")
	file.write(contents)
	file.close()

We could also supply the value "a" as a mode argument, to append to the file, rather than completely overwriting existing file contents.
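
For example, a minimal sketch (reusing the hypothetical filename from the previous snippet) that adds a line to the end of an existing file might look like this:

	file = open("filename", "a", encoding="ascii", errors="replace")
	file.write("\nanother line of text")
	file.close()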

These files with their wrappers for converting bytes to text are great, but it'd be awfully inconvenient if the file we wanted to open was an image, executable, or other binary file, wouldn't it?

To open a binary file, we simply need to append a 'b' to the mode string. So 'wb' would open a file for writing bytes, while 'rb' allows us to read them. Such files behave like text files, but without the automatic conversion between bytes and text. When we read such a file, it will return bytes instead of str, and when we write to it, it will raise an error if we try to pass a str object.
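
As a brief sketch, the following writes a few raw bytes to a file and reads them back; the filename is illustrative only:

	file = open("binary_file", "wb")
	file.write(bytes([0x00, 0x7f, 0xff]))  # bytes pass through untouched; no encoding step
	file.close()

	file = open("binary_file", "rb")
	print(file.read())  # returns a bytes object, not a str
	file.close()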

Once a file is opened for reading, we can call the read, readline, or readlines methods to get the contents of the file. The read method returns the entire contents of the file as a str or bytes object, depending on whether there is a 'b' in the mode. Be careful not to use this method without arguments on huge files. You don't want to find out what happens if you try to load that much data into memory!

It is also possible to read a fixed number of bytes from a file; we simply pass an integer argument to the read method describing how many bytes we want to read. The next call to read will load the next sequence of bytes, and so on. We can do this inside a while loop to read the entire file in manageable chunks.
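
Here is a minimal sketch of that loop; the chunk size and the process function are placeholders for whatever your program actually needs to do with each block of data:

	file = open("binary_file", "rb")
	chunk = file.read(1024)  # read up to 1 KB at a time
	while chunk:  # read returns an empty bytes object at end of file
		process(chunk)  # hypothetical per-chunk processing
		chunk = file.read(1024)
	file.close()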

The readline method returns a single line from the file; we can call it repeatedly to get more lines. The plural readlines method returns a list of all the lines in the file. Like the read method, it's not safe to use on very large files. These two methods even work when the file is opened in bytes mode, but doing so only makes sense if we are parsing text-like data. An image or audio file, for example, will not have newlines in it (unless the newline byte happened to represent a certain pixel or sound), so applying readline wouldn't make sense.
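
As a short sketch, again assuming a text file named filename exists:

	file = open("filename")
	first_line = file.readline()  # one line, including its trailing newline
	other_lines = file.readlines()  # a list of the remaining lines
	file.close()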

For readability and to avoid reading a large file into memory at once, we can also use a for loop directly on a file object to read each line, one at a time, and process it.

Writing to a file is just as easy; the write method on file objects simply writes a string (or bytes, for binary data) object to the file; it can be called repeatedly to write multiple strings, one after the other. The writelines method accepts an iterable of strings and writes each of the values to the file. Despite its name, it does not turn the items into separate lines by appending a newline after each one; if each item is meant to be a separate line, it should already end with a newline character. The writelines method is basically a convenience to write the contents of an iterable without having to explicitly iterate over it using a for loop.
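
A minimal sketch of both methods, again writing to a hypothetical filename:

	lines = ["first line\n", "second line\n"]  # newlines supplied explicitly
	file = open("filename", "w")
	file.write("a header line\n")
	file.writelines(lines)
	file.close()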

A final important method on file objects is the close method. This method should be called when we are finished reading or writing the file to ensure any buffered writes are written to the file, that the file has been properly cleaned up, and that all resources associated with the file are released back to the operating system. Technically, this will happen automatically when the script exits, but it's better to be explicit and clean up after ourselves.

Placing it in context

This need to always close a file can make for some ugly code. Because an exception may occur during file IO, we ought to wrap all calls to a file in a try...finally clause, and close the file in finally, regardless of whether IO was successful. This isn't very Pythonic; there must be a more elegant way to do it.
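
A minimal sketch of that pattern looks like this; the with statement introduced next replaces it:

	file = open("filename")
	try:
		contents = file.read()
	finally:
		file.close()  # runs whether or not read raised an exception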

If we run dir on a file-like object, we see that it has two special methods named __enter__ and __exit__. These methods turn the file object into what is known as a context manager. Basically, if we use a special syntax called the with statement, these methods will be called before and after nested code is executed. On file objects, the __exit__ method ensures the file is closed, even if an exception is raised. We no longer have to explicitly manage the closing of the file. Here is what the with statement looks like in practice:


	with open('filename') as file:
		for line in file:
			print(line, end='')

The open call returns a file object, which has __enter__ and __exit__ methods. The returned object is assigned to the variable named file by the as clause. We know the file will be closed when the code returns to the outer indentation level, and that this will happen even if an exception is raised.

The with statement is used in several places in the standard library where startup or cleanup code needs to be executed. For example, the urlopen call returns an object that can be used in a with statement to clean up the socket when we're done, and locks in the threading module are automatically released when the with block they guard has finished executing.
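
For example, a Lock is itself a context manager, so a sketch like the following acquires and releases it automatically:

	import threading

	lock = threading.Lock()
	with lock:
		# the lock is held inside this block and released on exit,
		# even if an exception is raised
		shared_counter = 0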

Most interestingly, because the with statement can apply to any object that has the appropriate special methods, we can use it in our own frameworks. Keeping with our string examples, let's create a simple context manager that allows us to construct a sequence of characters and automatically convert it to a string upon exit:

	class StringJoiner(list):
		def __enter__(self):
			# no setup is needed; return the object bound by the "as" clause
			return self

		def __exit__(self, type, value, tb):
			# join the accumulated characters, even if an exception occurred
			self.result = "".join(self)

This code simply adds the two special methods required of a context manager to the list class it inherits from. The __enter__ method performs any required setup code (in this case, there isn't any) and then returns the object that will be assigned to the variable after as in the with statement. Often, as we've done here, this is just the context manager object itself.

The __exit__ method accepts three arguments. In a normal situation, these are all given a value of None. However, if an exception occurs inside the with block, they will be set to values related to the type, value, and traceback for the exception. This allows the __exit__ method to do any cleanup code that may be required even if an exception occurred. In our example, we simply create a result string by joining the characters in the list, regardless of whether an exception was thrown.

While this is one of the simplest context managers we could write, and its usefulness is dubious, it does work with a with statement. Have a look at it in action:

	import random, string
	with StringJoiner() as joiner:
		for i in range(15):
			joiner.append(random.choice(string.ascii_letters))

	print(joiner.result)

This code simply constructs a string of fifteen random characters. It appends these to a StringJoiner using the append method it inherited from list. When the with statement goes out of scope (back to the outer indentation level), the __exit__ method is called, and the result attribute becomes available on the joiner object. We print this value to see a random string.

Faking files

Sometimes we need code that provides a file-like interface but doesn't actually read from or write to any real files. For example, we might want to retrieve a string from a third-party library that only knows how to write to a file. This is an example of the adapter pattern in action; we need an adapter that converts the file-like interface into a string-like one.

Two such adapters already exist in the standard library, StringIO and BytesIO. They behave in much the same way, except that one deals with text characters and the other deals with bytes data. Both classes are available in the io module. To emulate a file open for reading, we can supply a string or bytes object to the constructor. Calls to read or readline will then parse that string as if it were a file. To emulate a file opened for writing, we simply construct a StringIO or BytesIO object and call the write or writelines methods. When writing is complete, we can discover the final contents of the written "file" using the getvalue method. It's really very simple:

	# coding=utf-8
	from io import StringIO, BytesIO
	source_file = StringIO("an oft-repeated cliché")
	dest_file = BytesIO()
	
	char = source_file.read(1)
	while char:
		dest_file.write(char.encode("ascii", "replace"))
		char = source_file.read(1)
	
	print(dest_file.getvalue())

This piece of code is, technically, doing nothing more than encoding a str to a bytes. But it is performing this task using a file-like interface. We first create a source "file" that contains a string, and a destination "file" to write it to. Then we read one character at a time from the source, encode it to ASCII using the "replace" error replacement strategy, and write the resulting byte to the destination file. This code doesn't know that the object it is calling write on is not a file, nor does it care.

The file interface is common for reading and writing data, even if it's not to a file or a string. Network IO often uses the same protocol (set of methods) for reading and writing data to the network, and compression libraries use it to store compressed data, for example. This is duck typing at work; we can write code that operates on a file-like object, and it will never need to know if the data actually came from a compressed file, a string, or the internet.
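
As a final sketch, the hypothetical function below assumes only that its argument has a readline method, so it works equally well on a real file or a StringIO:

	from io import StringIO

	def first_line(file_like):
		# any object with a readline method will do
		return file_like.readline()

	print(first_line(StringIO("line one\nline two\n")))
	print(first_line(open("filename")))  # assuming the text file from earlier examples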
