© Valentina Porcu 2018
Valentina Porcu, Python for Data Mining Quick Syntax Reference, https://doi.org/10.1007/978-1-4842-4113-4_7

7. Importing Files

Valentina Porcu (Nuoro, Italy)

Importing files into Python is fundamental to working with datasets. In this chapter we examine the basics. We can import many types of data into Python: from the most common formats (.csv and Excel), to text formats for text mining, to binary files such as images, video, and audio. First, let's look at some basic ways to import files; the process can seem a bit tricky at times. The pandas package, which we examine in Chapter 8, makes importing datasets for analysis much easier.

The basic structure for importing files is as follows:
file1 = open("file.name", mode)
“mode” represents the way we open a file. The most important ways are
  • Read only (“r”)

  • Write (“w”)

  • Append some text at the end of the document (“a”)

  • Read and write (“r+”)

We use the open() function to open a file. In write or append mode, if the file does not exist, it is created in the working directory; in read mode, a missing file raises an error.
# we open (or create) the file
>>> file1 = open("file1.txt","w")
# at the moment, the file was created in our work directory but it is empty
# we are in write mode; we then add text this way
>>> file1.write("Add line n.1 to the file1")
# we close the file
>>> file1.flush()
>>> file1.close()
# we open the file in read mode
>>> file1 = open("file1.txt", "r")
# we create an object that contains our file and display it with the print() function
>>> text1 = file1.read()
>>> print(text1)
Add line n.1 to the file1
# we close the file
>>> file1.close()
# we reopen the file in write mode
>>> file1 = open("file1.txt","w")
# if we write the new text now, the line we had written previously is replaced
>>> file1.write("Let's replace the first line with this new one")
# we test
>>> file1.flush()
>>> file1.close()
>>> file1 = open("file1.txt", "r")
>>> text2 = file1.read()
>>> print(text2)
Let's replace the first line with this new one
>>> file1.close()
# to add text without overwriting, we open the file in append mode
>>> file1 = open("file1.txt", "a")
>>> file1.write(" Add a second line to this new version of the file")
>>> file1.close()
>>> file1 = open("file1.txt", "r")
>>> text3 = file1.read()
>>> print(text3)
Let's replace the first line with this new one
 Add a second line to this new version of the file
# we can verify the length of the text with the len() function
>>> len(text3)
96
We can also loop over the file, reading it line by line:
>>> file1 = open("file1.txt", "r")
>>> for line in file1:
...        print(line, end = "")
Or proceed as follows:
>>> with open("file1.txt", "a") as file:
...        file.write("this is the third line")
Table 7-1 provides a summary of the modes used to open a file.
Table 7-1

Modes for Opening a File

Mode     Description
‘r’      Read only (default mode)
‘rb’     Read only in binary format
‘r+’     Read and write
‘rb+’    Read and write in binary format
‘w’      Write. Overwrites an existing file; if the file does not exist, a new one is created.
‘wb’     Write in binary format. Overwrites an existing file; if the file does not exist, a new one is created.
‘w+’     Read and write. Overwrites an existing file; if the file does not exist, a new one is created.
‘wb+’    Read and write in binary format. Overwrites an existing file; if the file does not exist, a new one is created.
‘a’      Append to an existing file without overwriting; if the file does not exist, a new one is created.
‘ab’     Append in binary format; if the file does not exist, a new one is created.
‘a+’     Read and append; if the file does not exist, a new one is created.
‘ab+’    Read and append in binary format; if the file does not exist, a new one is created.

.csv Format

Files in .csv or .tsv format are those used most frequently in data mining. Later, we look at some packages (such as pandas) that make it easier to import and manage files. For now, however, we study some basic procedures that do not require separate installation.
# we import csv
>>> import csv
# suppose we previously generated a random .csv file, saved it in the working directory, and called it 'df'; the second argument, 'r', means we are accessing the file in read mode
>>> csv1 = open('df', 'r')
# to import a file that is not in the working directory, we can include the full path, for example:
>>> csv2 = open('/Users/valentinaporcu/Desktop/df2', 'r')
# we go on reading the first file
>>> read = csv.reader(csv1)
>>> for row in read:
...     print(row)
...
['', '0', '1', '2', '3', '4']
['0', '15.982938813888007', '96.04182101708831', '74.68301929612825', '31.670249691004994', '50.37042800222742']
[...]

From the Web

The basic methods also allow us to read a file from the Web. For example:
# Python2
# we import csv and urllib2
import csv
import urllib2
# we create an object that contains the address
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# we create a connection
conn = urllib2.urlopen(url)
# we create an object containing the .csv file
file = csv.reader(conn)
# we print the file
for row in file:
...     print row
...
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
[...]
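In Python 3, urllib2 was merged into urllib.request, and urlopen() returns bytes, so we decode the response before passing it to csv.reader. A minimal sketch of the same download (assuming the UCI URL is still reachable):

```python
# Python 3
import csv
import io
import urllib.request

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# download the file and decode the bytes into text
with urllib.request.urlopen(url) as conn:
    text = conn.read().decode("utf-8")

# parse the text as .csv, row by row
for row in csv.reader(io.StringIO(text)):
    print(row)
```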

In JSON

Let's see how to import a test file in JSON format. Some JSON test files can be downloaded from https://www.jsonar.com/resources/.
>>> import json
>>> jfile = open('zips.json').read()
>>> print(jfile)
{ "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA", "_id" : "01001" }
{ "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA", "_id" : "01002" }
{ "city" : "BARRE", "loc" : [ -72.10835400000001, 42.409698 ], "pop" : 4546, "state" : "MA", "_id" : "01005" }
{ "city" : "BELCHERTOWN", "loc" : [ -72.41095300000001, 42.275103 ], "pop" : 10579, "state" : "MA", "_id" : "01007" }
{ "city" : "BLANDFORD", "loc" : [ -72.936114, 42.182949 ], "pop" : 1240, "state" : "MA", "_id" : "01008" }
{ "city" : "BRIMFIELD", "loc" : [ -72.188455, 42.116543 ], "pop" : 3706, "state" : "MA", "_id" : "01010" }
{ "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA", "_id" : "01011" }
{ "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA", "_id" : "01012" }
{ "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA", "_id" : "01013" }
{ "city" : "CHICOPEE", "loc" : [ -72.576142, 42.176443 ], "pop" : 31495, "state" : "MA", "_id" : "01020" }
{ "city" : "WESTOVER AFB", "loc" : [ -72.558657, 42.196672 ], "pop" : 1764, "state" : "MA", "_id" : "01022" }
{ "city" : "CUMMINGTON", "loc" : [ -72.905767, 42.435296 ], "pop" : 1484, "state" : "MA", "_id" : "01026" }

In Chapter 8, we learn how to use pandas to create and export data frames in JSON.

Other Formats

We’ve now seen how to import files and data using some of the most common formats in Python. Other formats include the following:
  • lxml, particularly its objectify module, allows you to import XML files.

  • sqlite3 allows you to work with SQLite databases.

  • PyMongo allows you to manage MongoDB databases.

  • feedparser allows you to process feeds in many formats, including RSS.

  • xlrd allows you to import Excel files (note, however, that pandas is much easier to use).
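For example, sqlite3 is included in the standard library, so no installation is needed. A minimal sketch of creating and querying a database (the in-memory database and the table contents are illustrative):

```python
import sqlite3

# create a temporary in-memory database with a small table
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")
cur.executemany("INSERT INTO iris VALUES (?, ?)",
                [(5.1, "Iris-setosa"), (4.9, "Iris-setosa")])
conn.commit()

# query the rows back; each row is returned as a tuple
for row in cur.execute("SELECT * FROM iris"):
    print(row)

conn.close()
```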

Summary

Importing files and datasets in multiple formats is one of the most important tasks in data analysis. The procedures described in this chapter are important because they do not require any external library; we can import files and data using Python's built-in functions alone.
