Searching File Information

In this section, we create a file-indexing script that generates an XML document representing your entire filesystem or a specific portion of it. Indexing files with XML is a powerful way to keep track of information or to perform bulk operations on particular groups of files on a disk, and an XML-generating indexing routine is easy to write in Python. The index.py program in Example 3-4 (which appears a little later in the chapter) starts in any directory you specify and generates an element for each file or directory that exists beneath the starting point. Once we have the index of file information, we look at how to use SAX to search it, filtering the list of files by whatever criteria interest us at the time.

Creating the Index Generator

The main part of this routine works by checking each file in a starting directory, and then recursing into any directories it finds beneath the starting directory. Recursion allows it to index an entire filesystem if you choose. On Unix, the program performs a lot of work, because it checks content by making a popen call to the file command for each file. (While this could be made more efficient by calling file less often and requiring it to operate on more than one file at a time, that isn’t the topic of this book.) One of the key methods of the Index class is indexDirectoryFiles:

def indexDirectoryFiles(self, dir):
     """Index a directory structure and create an XML output file."""

     # prepare output XML file
     self.__fd = open(self.outputFile, "w")
     self.__fd.write('<?xml version="1.0" encoding="' +
                     XML_ENC + '"?>\n')
     self.__fd.write("<IndexedFiles>\n")

     # do actual indexing
     self.__indexDir(dir)

     # close out XML file
     self.__fd.write("</IndexedFiles>\n")
     self.__fd.close()

An XML file is created with the name given in outputFile, and an XML declaration and root element are written to it. The indexDirectoryFiles method then calls its internal __indexDir method, which is the real worker. It is a recursive method that descends the file hierarchy, indexing files along the way.

def __indexDir(self, dir):
     """Recursive function to do the actual indexing."""
     # Create an indexFile for each regular file,
     # and call the function again for each directory
     files = listdir(dir)
     for file in files:
       fullname = os.path.join(dir, file)
       st_values = stat(fullname)

       # check if it's a directory
       if S_ISDIR(st_values[0]):
         print file

         # create directory element
         self.__fd.write("<Directory ")
         self.__fd.write(' name="' + escape(fullname) + '">\n')
         self.__indexDir(fullname)
         self.__fd.write("</Directory>\n")

       else:
         # create regular file entry
         print dir + file
         lf = IndexFile(fullname, st_values)
         self.__fd.write(lf.getXML())

The actual work is just determining those files that are directories and those that are regular files. XML is created accordingly during this process, and written to the output file. When all of the __indexDir calls eventually return, the XML file is closed.

Now the program is essentially finished. A helper function named escape is imported from the xml.sax.saxutils module to perform entity substitution against some common characters within XML character data to ensure they do not appear to be markup in the resulting XML.
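To make the effect of escape concrete, here is a small sketch written for Python 3 (the chapter's listings use the older print statement). The sample filename is invented for illustration. It also shows quoteattr, a companion function in the same xml.sax.saxutils module. This is worth knowing about because escape leaves quotation marks untouched, so a filename containing a double quote would not be safe inside a name="..." attribute.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical filename containing characters that are special in XML.
name = 'R&D <draft> "v2".txt'

# escape() replaces &, < and > so the text is safe as character data.
text = escape(name)

# quoteattr() escapes the same characters, then wraps the value in
# whichever quote style is safe for an attribute value.
attr = quoteattr(name)

print("<contents>" + text + "</contents>")
print("<file name=" + attr + "/>")
```

Whether to switch the attribute in index.py to quoteattr is a design choice. It only matters when filenames can contain quotation marks.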

Creating the IndexFile class

The IndexFile class is used for an XML representation of file information. This information is derived primarily from the os.stat system call. The class copies information from the stat call into its member variables in its __init__ method, as shown here:

def __init__(self, filename, st_vals):
  """Extract properties from supplied stat object."""
  self.filename = filename
  self.uid = st_vals[4]
  self.gid = st_vals[5]
  self.size = st_vals[6]
  self.accessed = ctime(st_vals[7])
  self.modified = ctime(st_vals[8])
  self.created = ctime(st_vals[9])
  
  # try for filename extension
  self.extension = os.path.splitext(filename)[1]

In this method, important file information is extracted from the tuple st_vals, which contains the filesystem information returned by the stat call. The __init__ method also extracts the filename extension, if there is one, using os.path.splitext. If you are running Unix, the script uses the os.popen function to call the file command, which returns a human-readable description of the content of both text and binary files. This makes the index much slower to generate, but once the description is in the XML it is valuable and does not need to be regenerated every time we want it:

# check contents using file command on linux
if os.name == "posix":
  # Open a process to check file contents
  fd = popen("file \"" + filename + "\"")
  self.contents = fd.readline().rstrip(  )
  fd.close(  )
else:
  # No content information
  self.contents = self.extension

If you’re not using Unix, the file command is unavailable, so the contents information is set to the file extension. For example, for a Word file, the XML is <contents>.doc</contents>. On Unix, however, the call to popen returns a file object, and the output text of the command is read using that object's readline method. The result is stripped of trailing whitespace and used as a description of the file's contents. The class features a single method, getXML, which returns the file information as a single XML element in string format:

def getXML(self):
  """Returns XML version of all data members."""
  return ("<file name=\"" + escape(self.filename) + "\">" +
          "\n\t<userID>" + str(self.uid) + "</userID>" +
          "\n\t<groupID>" + str(self.gid) + "</groupID>" +
          "\n\t<size>" + str(self.size) + "</size>" +
          "\n\t<lastAccessed>" + self.accessed +
          "</lastAccessed>" +
          "\n\t<lastModified>" + self.modified +
          "</lastModified>" +
          "\n\t<created>" + self.created + "</created>" +
          "\n\t<extension>" + self.extension +
          "</extension>" +
          "\n\t<contents>" + escape(self.contents) +
          "</contents>" +
          "\n</file>")

In the preceding code, the XML is thrown together as a series of strings. Another way is to use a DOMImplementation object to create individual elements and insert them into the document’s structure (illustrated in Chapter 10).
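As a rough illustration of the DOM-based alternative, the following Python 3 sketch builds a cut-down file element with the DOMImplementation object from xml.dom.minidom. The function name file_element_xml and its three parameters are assumptions made for this example, and only two of the child elements are shown. It is not the approach index.py takes.

```python
from xml.dom.minidom import getDOMImplementation

def file_element_xml(filename, size, contents):
    """Build a <file> element through the DOM and return it as a string."""
    impl = getDOMImplementation()
    # createDocument(namespaceURI, qualifiedName, doctype)
    doc = impl.createDocument(None, "file", None)
    root = doc.documentElement
    root.setAttribute("name", filename)

    # Each child element holds one text node; the DOM handles the
    # escaping of special characters when the tree is serialized.
    for tag, value in (("size", str(size)), ("contents", contents)):
        child = doc.createElement(tag)
        child.appendChild(doc.createTextNode(value))
        root.appendChild(child)
    return root.toxml()

print(file_element_xml("/tmp/a&b.txt", 42, "ASCII text"))
```

The trade-off is more code and an in-memory tree per element in exchange for not having to call escape by hand.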

Together, these two classes produce a lengthy XML document representing files and metadata for any given section of your filesystem. The complete listing of index.py is shown in Example 3-4.

Example 3-4. index.py
#!/usr/bin/env python
"""
index.py 
usage: python index.py <starting-dir> <output-file>
"""

import os
import sys

from os   import stat
from os   import listdir
from os   import popen
from stat import S_ISDIR
from time import ctime

from xml.sax.saxutils import escape

XML_ENC = "ISO-8859-1"

"""""""""""""""""""""""""""""""""""""""""""""""""""
 Class: Index(startingDir, outputFile)
"""""""""""""""""""""""""""""""""""""""""""""""""""
class Index:
  """
  This class indexes files and builds
  a resultant XML document.
  """
  def __init__(self, startingDir, outputFile):
    """ init: sets output file """
    self.outputFile = outputFile
    self.startingDir = startingDir

  def indexDirectoryFiles(self, dir):
    """Index a directory structure and create an XML output file."""
    # prepare output XML file
    self.__fd = open(self.outputFile, "w")
    self.__fd.write('<?xml version="1.0" encoding="' +
                    XML_ENC + '"?>\n')
    self.__fd.write("<IndexedFiles>\n")

    # do actual indexing
    self.__indexDir(dir)

    # close out XML file
    self.__fd.write("</IndexedFiles>\n")
    self.__fd.close(  )

  def __indexDir(self, dir):
    """Recursive function to do the actual indexing."""
    # Create an indexFile for each regular file,
    # and call the function again for each directory
    files = listdir(dir)
    for file in files:
      fullname = os.path.join(dir, file)
      st_values = stat(fullname)

      # check if it's a directory
      if S_ISDIR(st_values[0]):
        print file

        # create directory element
        self.__fd.write("<Directory ")
        self.__fd.write(' name="' + escape(fullname) + '">\n')
        self.__indexDir(fullname)
        self.__fd.write("</Directory>\n")

      else:
        # create regular file entry
        print dir + file
        lf = IndexFile(fullname, st_values)
        self.__fd.write(lf.getXML(  ))

"""""""""""""""""""""""""""""""""""""""""""""""""""
 Class: IndexFile(filename, stat-tuple)
"""""""""""""""""""""""""""""""""""""""""""""""""""
class IndexFile:
  """
  Simple file representation object with XML
  """
  def __init__(self, filename, st_vals):
    """Extract properties from supplied stat object."""
    self.filename = filename
    self.uid = st_vals[4]
    self.gid = st_vals[5]
    self.size = st_vals[6]
    self.accessed = ctime(st_vals[7])
    self.modified = ctime(st_vals[8])
    self.created = ctime(st_vals[9])

    # try for filename extension
    self.extension = os.path.splitext(filename)[1]

    # check contents using file command on linux
    if os.name == "posix":
      # Open a process to check file
      # contents
      fd = popen("file \"" + filename + "\"")
      self.contents = fd.readline().rstrip(  )
      fd.close(  )
    else:
      # No content information
      self.contents = self.extension
      
  def getXML(self):
    """Returns XML version of all data members."""
    return ("<file name=\"" + escape(self.filename) + "\">" +
            "\n\t<userID>" + str(self.uid) + "</userID>" +
            "\n\t<groupID>" + str(self.gid) + "</groupID>" +
            "\n\t<size>" + str(self.size) + "</size>" +
            "\n\t<lastAccessed>" + self.accessed +
            "</lastAccessed>" +
            "\n\t<lastModified>" + self.modified +
            "</lastModified>" +
            "\n\t<created>" + self.created + "</created>" +
            "\n\t<extension>" + self.extension +
            "</extension>" +
            "\n\t<contents>" + escape(self.contents) +
            "</contents>" +
            "\n</file>")

"""""""""""""""""""""""""""""""""""""""""""""""""""
Main
"""""""""""""""""""""""""""""""""""""""""""""""""""
if __name__ == "__main__":
  index = Index(sys.argv[1], sys.argv[2])
  print "Starting Dir:", index.startingDir
  print "Output file:", index.outputFile

  index.indexDirectoryFiles(index.startingDir)
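Example 3-4 starts one file process for every file it indexes. The batching idea mentioned earlier could look something like the following Python 3 sketch, which is not part of index.py. The helper names chunked and describe_files and the batch size of 100 are assumptions. It relies on the -b (brief) option of the Unix file command and on the command printing exactly one line per path, which would not hold for filenames that contain newlines.

```python
import shutil
import subprocess

def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def describe_files(paths, batch_size=100):
    """Map each path to the description printed by the Unix file command."""
    descriptions = {}
    for batch in chunked(list(paths), batch_size):
        # An argument list avoids shell quoting problems with odd filenames.
        # -b prints the description only; -- ends option processing.
        result = subprocess.run(["file", "-b", "--"] + batch,
                                capture_output=True, text=True)
        for path, line in zip(batch, result.stdout.splitlines()):
            descriptions[path] = line
    return descriptions

# Only attempt the call where the file command is actually installed.
if shutil.which("file"):
    print(describe_files(["."]))
```

With this arrangement the number of child processes drops from one per file to one per batch, at the cost of collecting paths before describing them.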

Running index.py

Running index.py from the command line requires supplying both a starting directory and an XML filename to use as output:

$> python index.py /usr/bin/ usrbin.xml

The script prints directory and file names as it works, much as the find command does. After completion, the file usrbin.xml contains something similar to the following:

<?xml version="1.0" encoding="ISO-8859-1"?>
<IndexedFiles>
<Directory  name="/usr/bin/X11">
<file name="/usr/bin/X11/Magick-config">
        <userID>0</userID>
        <groupID>0</groupID>
        <size>1786</size>
        <lastAccessed>Fri Jan 19 22:29:34 2001</lastAccessed>
        <lastModified>Mon Aug 30 20:49:06 1999</lastModified>
        <created>Mon Sep 11 17:22:01 2000</created>
        <extension>None</extension>
        <contents>/usr/bin/X11/Magick-config: Bourne shell script text</contents>
</file><file name="/usr/bin/X11/animate">
        <userID>0</userID>
        <groupID>0</groupID>
        <size>16720</size>
        <lastAccessed>Fri Jan 19 22:29:34 2001</lastAccessed>
        <lastModified>Mon Aug 30 20:49:09 1999</lastModified>
        <created>Mon Sep 11 17:22:01 2000</created>
        <extension>None</extension>
        <contents>/usr/bin/X11/animate: ELF 32-bit LSB executable, 
         Intel 80386,version 1, dynamically linked (uses 
         shared libs), stripped</contents>
</file>

The XML file’s size depends on the directory you index. By default, the program follows symbolic links (on Unix, symbolic links allow one directory or filename to refer to another), which introduces the possibility of infinite recursion if a link points back to one of its own parent directories, so beware! Indexing your home directory, or a directory of open source software that you’ve downloaded, is probably the most effective way to try out the script.
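One way to sidestep the recursion problem is to examine each entry with os.lstat, which describes a symbolic link itself rather than its target, and to skip links entirely. The Python 3 sketch below is a standalone illustration under that assumption. The function name iter_regular_files is hypothetical, and the code is not taken from index.py.

```python
import os
import stat

def iter_regular_files(top):
    """Yield paths of regular files under top without following symlinks."""
    for name in sorted(os.listdir(top)):
        fullname = os.path.join(top, name)
        mode = os.lstat(fullname).st_mode  # lstat does not follow links
        if stat.S_ISLNK(mode):
            continue  # skipping links means a link cannot create a cycle
        if stat.S_ISDIR(mode):
            # Recurse into real directories only.
            yield from iter_regular_files(fullname)
        elif stat.S_ISREG(mode):
            yield fullname
```

Skipping links is the simplest policy. It does mean that files reachable only through a link are left out of the index.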

Searching the Index

Now that your file data has been abstracted to XML, you can write a SAX event handler to search for items within the file list. SAX is a good choice here, because this document could easily be several megabytes in size, and interpreting it as it is being read is the least resource-intensive approach.

The saxfinder.py script takes a single argument (the search text) and parses the XML document supplied on standard input, using its SAX handler methods to check whether any of the files are of interest to you.

The script expects to work on XML as created earlier with index.py.

If the contents element of your XML file contains the character data that you supplied on the command line, the file is considered a match and the script prints a message accordingly. If you are running Windows, your contents tags only have the file extension, so your searches are limited to file extensions, unless you alter the code to watch something besides just the contents element.

Use three methods of the SAX interface to implement your metadata finder. First, startElement is implemented to both capture the name of the current file element as well as mark when you’ve entered the character data portion following a contents tag:

def startElement(self, name, attrs):
  if name == "file":
    self.filename = attrs.get('name', "")

  elif name == "contents":
    self.getCont = 1

If you’re entering a contents element, a flag (self.getCont) is set so that the characters method knows when to gobble up character data and store it in another member variable:

def characters(self, ch):
  if self.getCont:
    self.contents += ch

When an endElement event rolls around, the script examines the contents that have been captured (if any) to see if they match the original command-line parameter. If so, the filename is printed; if not, SAX happily moves on to the next file element within the XML document:

def endElement(self, name):
  if name == "contents":
    self.getCont = 0 
    if self.contents.find(self.contentType) > -1:
      print self.filename, "has", self.contentType, "content."
    self.contents = ""

In addition, the self.getCont flag is disabled after leaving a contents element, to instruct the characters method not to capture data.

SAX helps you here by allowing you to process an XML index file that represents an entire filesystem and can easily take up 20 megabytes on your disk. Parsing such a gigantic document with the DOM can be difficult and unbearably slow.
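To see the three handler methods cooperate without first generating an index, here is a self-contained Python 3 sketch that parses a tiny in-memory document. The class name ContentsFinder, the SAMPLE data, and the matches list are invented for this illustration. Unlike saxfinder.py, it collects matching filenames in a list rather than printing them.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ContentsFinder(ContentHandler):
    """Collect names of files whose contents text includes search_text."""

    def __init__(self, search_text):
        super().__init__()
        self.search_text = search_text
        self.in_contents = False
        self.contents = ""
        self.filename = ""
        self.matches = []

    def startElement(self, name, attrs):
        if name == "file":
            self.filename = attrs.get("name", "")
        elif name == "contents":
            self.in_contents = True

    def characters(self, ch):
        # May be called several times for one element, so accumulate.
        if self.in_contents:
            self.contents += ch

    def endElement(self, name):
        if name == "contents":
            self.in_contents = False
            if self.search_text in self.contents:
                self.matches.append(self.filename)
            self.contents = ""

# Hypothetical two-file index in the shape that index.py writes.
SAMPLE = b"""<IndexedFiles>
<file name="/tmp/a.c"><contents>/tmp/a.c: C program text</contents></file>
<file name="/tmp/b.sh"><contents>/tmp/b.sh: Bourne shell script text</contents></file>
</IndexedFiles>"""

handler = ContentsFinder("C program")
xml.sax.parseString(SAMPLE, handler)
print(handler.matches)
```

Because the handler keeps only the current filename and contents text, memory use stays flat however large the index grows.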

Example 3-5 shows the complete listing of saxfinder.py.

Example 3-5. saxfinder.py
"""
saxfinder.py - searches an index.py XML file (read from standard input)
for files whose contents description includes the search text
"""
import sys

from xml.sax import make_parser
from xml.sax import ContentHandler

class FileHandler(ContentHandler):
  def __init__(self, contentType):
    self.getCont  = 0
    self.contents = ""
    self.filename = ""
    self.contentType = contentType

  def startElement(self, name, attrs):
    if name == "file":
      self.filename = attrs.get('name', "")

    elif name == "contents":
      self.getCont = 1

  def characters(self, ch):
    if self.getCont:
      self.contents += ch

  def endElement(self, name):
    if name == "contents":
      self.getCont = 0 
      if self.contents.find(self.contentType) > -1:
        print self.filename, "has", self.contentType, "content."
      self.contents = ""

# Main

fh = FileHandler(sys.argv[1])
parser = make_parser(  )
parser.setContentHandler(fh)
parser.parse(sys.stdin)

You can run saxfinder.py from the command line on both Unix and Windows. You need to supply a search string as the first parameter, and be sure to redirect or pipe an XML document (created with index.py) into standard input:

$> python saxfinder.py "C program" < nard.xml

The result should be something like this:

/home/shm00/nard/xd/server.cpp has C program content.
/home/shm00/nard/xd/shmoo.cpp has C program content.
/home/shm00/nard/gl-misc/array.cpp has C program content.
/home/shm00/nard/gl-misc/vertex.cpp has C program content.
/home/shm00/nard/gl-misc/mecogl.cpp has C program content.
/home/shm00/nard/gl-misc/drewgl/smugl.cpp has C program content.
/home/shm00/nard/gl-misc/drewgl/pal.cpp has C program content.
/home/shm00/nard/gl-misc/drewgl/pal.h has C program content.
/home/shm00/nard/gl-misc/drewgl/gl.cpp has C program content.