Automating the training data setup

Ideally, we will want the entire process automated. This way, we can easily run the process end to end on any computer we use without having to carry around ancillary assets. This will be important later, as we will often develop on one computer (our development machine) and deploy on a different machine (our production server).

I have already written the code for this chapter, as well as all the other chapters; it is available at https://github.com/mlwithtf/MLwithTF. Our approach will be to rewrite it together while understanding it. Some straightforward parts, such as this, may be skipped. I recommend forking the repository and cloning a local copy for your projects:

cd ~/workdir
git clone https://github.com/mlwithtf/MLwithTF
cd chapter_02

The code for this specific section is available at— https://github.com/mlwithtf/mlwithtf/blob/master/chapter_02/download.py.

Preparing the dataset is an important part of the training process. Before we go deeper into the code, we will run download.py to automatically download and prepare the dataset:

python download.py

The result will look like this:

Now, we will take a look at several functions that are used in download.py. You can find the code in this file:

https://github.com/mlwithtf/mlwithtf/blob/master/data_utils.py

The following downloadFile function will automatically download the file and validate it against an expected file size:

 from __future__ import print_function 
 import os 
 from six.moves.urllib.request import urlretrieve 
 import datetime 
 def downloadFile(fileURL, expected_size): 
    timeStampedDir=datetime.datetime.now()
     .strftime("%Y.%m.%d_%I.%M.%S") 
    os.makedirs(timeStampedDir) 
    fileNameLocal = timeStampedDir + "/" +     
    fileURL.split('/')[-1] 
    print ('Attempting to download ' + fileURL) 
    print ('File will be stored in ' + fileNameLocal) 
    filename, _ = urlretrieve(fileURL, fileNameLocal) 
    statinfo = os.stat(filename) 
    if statinfo.st_size == expected_size: 
        print('Found and verified', filename) 
    else: 
        raise Exception('Could not get ' + filename) 
    return filename

The function can be called as follows:

 tst_set = 
 downloadFile('http://yaroslavvb.com/upload/notMNIST/notMNIST_small
 .tar.gz', 8458043)

The code to extract the contents is as follows (note that the additional import is required):

 import os, sys, tarfile 
 from os.path import basename 
 
 def extractFile(filename): 
    timeStampedDir=datetime.datetime.now()
     .strftime("%Y.%m.%d_%I.%M.%S") 
    tar = tarfile.open(filename) 
    sys.stdout.flush() 
    tar.extractall(timeStampedDir) 
    tar.close() 
    return timeStampedDir + "/" + os.listdir(timeStampedDir)[0]

We call the download and extract methods in sequence as follows:

 tst_src='http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.
 gz' 
 tst_set = downloadFile(tst_src, 8458043) 
 print ('Test set stored in: ' + tst_set) 
 tst_files = extractFile(tst_set) 
 print ('Test file set stored in: ' + tst_files)

Table of Contents for Automating the training data setup

Create new playlist

Sign In

Sign Up

Table of Contents for
Automating the training data setup