Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Implementing Sets

Another commonly used data structure is the set, a collection of objects that support operations such as:

Intersection: Make a new set with all items in common.
Union: Make a new set with all items in either operand.
Membership: Test whether an item exists in a set.

And there are others, depending on the intended use. Sets come in handy for dealing with more abstract group combinations. For instance, given a set of engineers and a set of writers, you can pick out individuals who do both activities by intersecting the two sets. A union of such sets would contain either type of individual, but would include any given individual only once.

Python lists, tuples, and strings come close to the notion of a set: the in operator tests membership, for iterates, and so on. Here, we add operations not directly supported by Python sequences. The idea is that we’re extending built-in types for unique requirements.

Set Functions

As before, let’s first start out with a function-based set manager. But this time, instead of managing a shared set object in a module, let’s define functions to implement set operations on passed-in Python sequences (see Example 20-8).

Example 20-8. PP3EDstructBasicinter.py

def intersect(seq1, seq2):
    res = []                          # start with an empty list
    for x in seq1:                    # scan the first sequence
        if x in seq2:
            res.append(x)             # add common items to the end
    return res

def union(seq1, seq2):
    res = list(seq1)                  # make a copy of seq1
    for x in seq2:                    # add new items in seq2
        if not x in res:
            res.append(x)
    return res

These functions work on any type of sequence—lists strings, tuples, and other objects that conform to the sequence protocols expected by these functions (for loops, in membership tests). In fact, we can even use them on mixed object types: the last two commands in the following code compute the intersection and union of a list and a tuple. As usual in Python, the object interface is what matters, not the specific types:

C:...PP3EDstructBasic>python
>>> from inter import *
>>> s1 = "SPAM"
>>> s2 = "SCAM"
>>> intersect(s1, s2), union(s1, s2)
(['S', 'A', 'M'], ['S', 'P', 'A', 'M', 'C'])
>>> intersect([1,2,3], (1,4))
[1]
>>> union([1,2,3], (1,4))
[1, 2, 3, 4]

Notice that the result is always a list here, regardless of the type of sequences passed in. We could work around this by converting types or by using a class to sidestep this issue (and we will in a moment). But type conversions aren’t clear-cut if the operands are mixed-type sequences. Which type do we convert to?

Supporting multiple operands

If we’re going to use the intersect and union functions as general tools, one useful extension is support for multiple arguments (i.e., more than two). The functions in Example 20-9 use Python’s variable-length argument lists feature to compute the intersection and union of arbitrarily many operands.

Example 20-9. PP3EDstructBasicinter2.py

def intersect(*args):
    res = []
    for x in args[0]:                  # scan the first list
        for other in args[1:]:         # for all other arguments
            if x not in other: break   # this item in each one?
        else:
            res.append(x)              # add common items to the end
    return res

def union(*args):
    res = []
    for seq in args:                   # for all sequence-arguments
        for x in seq:                  # for all nodes in argument
            if not x in res:
                res.append(x)          # add new items to result
    return res

The multioperand functions work on sequences in the same way as the original functions, but they support three or more operands. Notice that the last two examples in the following session work on lists with embedded compound objects: the in tests used by the intersect and union functions apply equality testing to sequence nodes recursively, as deep as necessary to determine collection comparison results:

C:...PP3EDstructBasic>python
>>> from inter2 import *
>>> s1, s2, s3 = 'SPAM', 'SLAM', 'SCAM'
>>> intersect(s1, s2)
['S', 'A', 'M']
>>> intersect(s1, s2, s3)
['S', 'A', 'M']
>>> intersect(s1, s2, s3, 'HAM')
['A', 'M']

>>> union(s1, s2), union(s1, s2, s3)
(['S', 'P', 'A', 'M', 'L'], ['S', 'P', 'A', 'M', 'L', 'C'])
>>> intersect([1,2,3], (1,4), range(5))
[1]
>>> s1 = (9, (3.14, 1), "bye", [1,2], "mello")
>>> s2 = [[1,2], "hello", (3.14, 0), 9]
>>> intersect(s1, s2)
[9, [1, 2]]
>>> union(s1, s2)
[9, (3.14, 1), 'bye', [1, 2], 'mello', 'hello', (3.14, 0)]

Set Classes

The set functions can operate on a variety of sequences, but they aren’t as friendly as true objects. Among other things, your scripts need to keep track of the sequences passed into these functions manually. Classes can do better: the class in Example 20-10 implements a set object that can hold any type of object. Like the stack classes, it’s essentially a wrapper around a Python list with extra set operations.

Example 20-10. PP3EDstructBasicset.py

class Set:
    def _ _init_ _(self, value = []):     # on object creation
        self.data = []                     # manages a local list
        self.concat(value)

    def intersect(self, other):         # other is any sequence type
        res = []                        # self is the instance subject
        for x in self.data:
            if x in other:
                res.append(x)
        return Set(res)                 # return a new Set

    def union(self, other):
        res = self.data[:]              # make a copy of my list
        for x in other:
            if not x in res:
                res.append(x)
        return Set(res)

    def concat(self, value):            # value: a list, string, Set...
        for x in value:                 # filters out duplicates
           if not x in self.data:
                self.data.append(x)

    def _ _len_ _(self):          return len(self.data)
    def _ _getitem_ _(self, key): return self.data[key]
    def _ _and_ _(self, other):   return self.intersect(other)
    def _ _or_ _(self, other):    return self.union(other)
    def _ _repr_ _(self):         return '<Set:' + repr(self.data) + '>'

The Set class is used like the Stack class we saw earlier in this chapter: we make instances and apply sequence operators plus unique set operations to them. Intersection and union can be called as methods, or by using the & and | operators normally used for built-in integer objects. Because we can string operators in expressions now (e.g., x & y & z), there is no obvious need to support multiple operands in intersect/union methods here. As with all objects, we can either use the Set class within a program or test it interactively as follows:

>>>from set import Set
>>> users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>> users2 = Set(['Jerry', 'Howard', 'Carol'])
>>> users3 = Set(['Emily', 'Carol'])
>>> users1 & users2
<Set:['Howard']>
>>> users1 | users2
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Jerry', 'Carol']>
>>> users1 | users2 & users3
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Carol']>
>>> (users1 | users2) & users3
<Set:['Emily', 'Carol']>
>>> users1.data
['Bob', 'Emily', 'Howard', 'Peeper']

Optimization: Moving Sets to Dictionaries

Once you start using the Set class, the first problem you might encounter is its performance: its nested for loops and in scans become exponentially slow. That slowness may or may not be significant in your applications, but library classes should generally be coded as efficiently as possible.

One way to optimize set performance is by changing the implementation to use dictionaries rather than lists, for storing sets internally—items may be stored as the keys of a dictionary whose values are all None. Because lookup time is constant and short for dictionaries, the in list scans of the original set may be replaced with direct dictionary fetches in this scheme. In traditional terms, moving sets to dictionaries replaces slow linear searches with fast hash table fetches. A computer scientist would explain this by saying that the repeated nested scanning of the list-based intersection is an exponential algorithm in terms of its complexity, but dictionaries can be linear.

The module in Example 20-11 implements this idea. Its class is a subclass of the original set, and it redefines the methods that deal with the internal representation but inherits others. The inherited & and | methods trigger the new intersect and union methods here, and the inherited len method works on dictionaries as is. As long as Set clients are not dependent on the order of items in a set, they can switch to this version directly by just changing the name of the module from which Set is imported; the class name is the same.

Example 20-11. PP3EDstructBasicfastset.py

import set
                                           # fastset.Set extends set.Set
class Set(set.Set):
    def _ _init_ _(self, value = []):
        self.data = {}                     # manages a local dictionary
        self.concat(value)                 # hashing: linear search times

    def intersect(self, other):
        res = {}
        for x in other:                    # other: a sequence or Set
            if self.data.has_key(x):       # use hash-table lookup (or "in")
                res[x] = None
        return Set(res.keys( ))             # a new dictionary-based Set

    def union(self, other):
        res = {}                           # other: a sequence or Set
        for x in other:                    # scan each set just once
            res[x] = None
        for x in self.data.keys( ):             # '&' and '|' come back here
            res[x] = None                  # so they make new fastset's
        return Set(res.keys( ))

    def concat(self, value):
        for x in value: self.data[x] = None

    # inherit and, or, len
    def _ _getitem_ _(self, key):  return self.data.keys( )[key]
    def _ _repr_ _(self):          return '<Set:' + repr(self.data.keys( )) + '>'

This works about the same as the previous version:

>>>from fastset import Set
>>> users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>> users2 = Set(['Jerry', 'Howard', 'Carol'])
>>> users3 = Set(['Emily', 'Carol'])
>>> users1 & users2
<Set:['Howard']>
>>> users1 | users2
<Set:['Emily', 'Howard', 'Jerry', 'Carol', 'Peeper', 'Bob']>
>>> users1 | users2 & users3
<Set:['Emily', 'Howard', 'Carol', 'Peeper', 'Bob']>
>>> (users1 | users2) & users3
<Set:['Emily', 'Carol']>
>>> users1.data
{'Emily': None, 'Bob': None, 'Peeper': None, 'Howard': None}

The main functional difference in this version is the order of items in the set: because dictionaries are randomly ordered, this set’s order will differ from the original. For instance, you can store compound objects in sets, but the order of items varies in this version:

>>>import set, fastset
>>> a = set.Set([(1,2), (3,4), (5,6)])
>>> b = set.Set([(3,4), (7,8)])
>>> a & b
<Set:[(3, 4)]>
>>> a | b
<Set:[(1, 2), (3, 4), (5, 6), (7, 8)]>
>>> a = fastset.Set([(1,2), (3,4), (5,6)])
>>> b = fastset.Set([(3,4), (7,8)])
>>> a & b
<Set:[(3, 4)]>
>>> a | b
<Set:[(3, 4), (1, 2), (7, 8), (5, 6)]>
>>> b | a
<Set:[(3, 4), (5, 6), (1, 2), (7, 8)]>

Sets aren’t supposed to be ordered anyhow, so this isn’t a showstopper. A deviation that might matter, though, is that this version cannot be used to store unhashable (that is, immutable) objects. This stems from the fact that dictionary keys must be immutable. Because values are stored in keys, dictionary sets can contain only things such as tuples, strings, numbers, and class objects with immutable signatures. Mutable objects such as lists and dictionaries won’t work directly. For example, the following call:

fastset.Set([[1,2],[3,4]])

raises an exception with this dictionary-based set, but it works with the original set class. Tuples do work here as set items because they are immutable; Python computes hash values and tests key equality as expected.

Timing the results under Python 2.4

So how did we do on the optimization front? Example 20-12 contains a script to compare set class performance. It reuses the timer module of Example 20-6 used earlier to test stacks.

Example 20-12. PP3EDstructBasicsettime.py

import timer, sys
import set, fastset

def setops(Class):
    a = Class(range(50))                 # a 50-integer set
    b = Class(range(20))                 # a 20-integer set
    c = Class(range(10))
    d = Class(range(5))
    for i in range(5):
        t = a & b & c & d                # 3 intersections
        t = a | b | c | d                # 3 unions

if _ _name_ _ == '_ _main_ _':
    rept = int(sys.argv[1])
    print 'set =>    ', timer.test(rept, setops, set.Set)
    print 'fastset =>', timer.test(rept, setops, fastset.Set)

The setops function makes four sets and combines them with intersection and union operators five times. A command-line argument controls the number of times this whole process is repeated. More accurately, each call to setops makes 34 Set instances (4 + [5 × (3 + 3)]) and runs the intersect and union methods 15 times each (5 × 3) in the for loop’s body. The performance improvement is equally dramatic this time around, on a 1.2 GHz machine:

C:...PP3EDstructBasic>python settime.py 50
set =>     0.605568584834
fastset => 0.10293794323

C:...PP3EDstructBasic>python settime.py 100
set =>     1.21189676342
fastset => 0.207752661302

C:...PP3EDstructBasic>python settime.py 200
set =>     2.47468966028
fastset => 0.415944763929

These results will vary per machine, and they may vary per Python release. But at least for this test case, the dictionary-based set implementation (fastest) is roughly six times faster than the simple list-based set (set). In fact, this sixfold speedup is probably sufficient. Python dictionaries are already optimized hash tables that you might be hard-pressed to improve on. Unless there is evidence that dictionary-based sets are still too slow, our work here is probably done.

Timing results under Python 1.5.2: version skew

For detail-minded readers, the prior section’s sixfold speedup results listed in this edition of the book were timed with Python 2.4 on a 1.2 GHz machine. Surprisingly, in the second edition, under an older Python (1.5.2) and slower machine (650 MHz), all list-based set results were roughly twice as slow as today, but the dictionary-based set was roughly four times slower (e.g., 1.54 and 0.44 seconds for dictionaries and lists at 50 iterations). In relative terms, the net effect is that dictionary sets went from being approximately three times faster than list sets to being six times faster today.

That is, machine speed doubled, but in addition, the dictionary code grew twice as quickly as before, relative to list-based sets. This larger jump reflects optimizations in Python itself. As you can see, version skew is an important consideration when analyzing performance; in this case, dictionaries are twice the performance boost they were a few years earlier.

Using the Python profiler

Timing code sections helps, but the ultimate way to isolate bottlenecks is profiling. The Python profiler provides another way to gather performance data besides timing sections of code as done in this chapter so far. Because the profiler tracks all function calls, it provides much more information in a single blow. As such, it’s a more powerful way to isolate bottlenecks in slow programs—after profiling, you should have a good idea of where to focus your optimization efforts.

The profiler ships with Python as the standard library module called profile, and it provides a variety of interfaces for measuring code performance. It is structured and used much like the pdb command-line debugger: import the profile module and call its functions with a code string to measure performance. The simplest profiling interface is its profile.run(statementstring) function. When invoked, the profiler runs the code string, collects statistics during the run, and issues a report on the screen when the statement completes. See it in action profiling a 100-iteration call to the set test function of the previous section’s Example 20-12, for the list-based sets of Example 20-10 (hint: profile an import statement to profile an entire file):

>>>import settime, timer, set
>>> import profile
>>> profile.run('timer.test(100, settime.setops, set.Set)')

         675906 function calls in 6.328 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   118500    0.434    0.000    0.434    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(clock)
      500    0.003    0.000    0.003    0.000 :0(range)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
        1    0.000    0.000    6.323    6.323 <string>:1(?)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    6.328    6.328 profile:0(timer.test(100,...more...))
     1500    0.133    0.000    0.951    0.001 set.py:13(union)
     3400    0.029    0.000    1.011    0.000 set.py:2(_ _init_ _)
     3400    0.621    0.000    0.982    0.000 set.py:20(concat)
   544000    2.032    0.000    2.032    0.000 set.py:26(_ _getitem_ _)
     1500    0.014    0.000    5.226    0.003 set.py:27(_ _and_ _)
     1500    0.013    0.000    0.964    0.001 set.py:28(_ _or_ _)
     1500    3.009    0.002    5.212    0.003 set.py:6(intersect)
      100    0.033    0.000    6.321    0.063 settime.py:4(setops)
        1    0.002    0.002    6.323    6.323 timer.py:1(test)

The report’s format is straightforward and well documented in the Python library manual. By default, it lists the number of calls and times spent in each function invoked during the run. When the profiler is enabled, each interpreter event is routed to a Python handler. This gives an accurate picture of performance, but it tends to make the program being profiled run much slower than normal. In fact, the call profiled runs five times slower in this case under Python 2.4 and the 1.2 GHz test machine (6 seconds versus 1.2 seconds when not profiled).

On the other hand, the profiler’s report helps you isolate which functions to recode and possibly migrate to the C language. In the preceding listing, for instance, we can see that the intersect function (and the corresponding _ _and_ _ operator method) is the main drag on performance; it takes roughly five-sixths of the total execution time. Indexing (_ _getitem_ _) is a close second, most likely because it occurs so often with the repeated scans used by intersection. Union, on the other hand, is fairly quick from a relative perspective.

Here is a profile of the dictionary-based set implementation of Example 20-11 for comparison; the code runs five times slower under the profiler again (1 second versus 0.2 seconds), though the relative speed of the list and dictionary-based set code is the same when both are profiled (6 seconds versus 1 second, the same sixfold difference we noticed before):

>>>import settime, timer, fastset 

>>> import profile
>>> profile.run('timer.test(100, settime.setops, fastset.Set)')

         111406 function calls in 1.049 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 :0(clock)
    17500    0.065    0.000    0.065    0.000 :0(has_key)
    42500    0.166    0.000    0.166    0.000 :0(keys)
      500    0.003    0.000    0.003    0.000 :0(range)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    1.049    1.049 <string>:1(?)
     1500    0.159    0.000    0.439    0.000 fastset.py:15(union)
     3400    0.052    0.000    0.052    0.000 fastset.py:23(concat)
    38000    0.299    0.000    0.445    0.000 fastset.py:27(_ _getitem_ _)
     3400    0.030    0.000    0.082    0.000 fastset.py:4(_ _init_ _)
     1500    0.208    0.000    0.533    0.000 fastset.py:8(intersect)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    1.049    1.049 profile:0(timer.test(100, ...more...))
     1500    0.014    0.000    0.548    0.000 set.py:27(_ _and_ _)
     1500    0.015    0.000    0.454    0.000 set.py:28(_ _or_ _)
      100    0.035    0.000    1.048    0.010 settime.py:4(setops)
        1    0.001    0.001    1.049    1.049 timer.py:1(test)

This time, there is no obvious culprit behind the total execution time: intersection and union take roughly the same amount of time, and indexing is not much of a factor. Ultimately, the real difference is the exponential algorithm of list-based set intersection versus the linear nature of the dictionary-base algorithms, and this factor outweighs the choice of programming language used to implement the objects. Recoding in Python first is the best bet, since an exponential algorithm would be just as slow if implemented in C.

Optimizing fastset by Coding Techniques (or Not)

As coded, there seems to be a bottleneck in the fastset class: each time we call a dictionary’s keys method, Python makes a new list to hold the result. This can happen repeatedly during intersections and unions. If you are interested in trying to optimize this further, see the following files in the book’s examples distribution:

PP3EDstructBasicfastset2.py
PP3EDstructBasicfastset3.py

I wrote these to try to speed up sets further, but failed miserably. It turns out that adding extra code to try to shave operations usually negates the speedup you obtain. There may be faster codings, but the biggest win here was likely in changing the underlying data structure to dictionaries, not in minor code tweaks.

As a rule of thumb, your intuition about performance is almost always wrong in a dynamic language such as Python: the algorithm is usually the real culprit behind performance problems, not the coding style or even the implementation language. By removing the combinatorial list scanning algorithm of the original set class, the Python implementation became dramatically faster.

In fact, moving the original set class to C without fixing the algorithm would not have addressed the real performance problem. Coding tricks don’t usually help as much either, and they make your programs difficult to understand. In Python, it’s almost always best to code for readability first and optimize later if needed. Despite its simplicity, fastset is fast indeed.

Adding Relational Algebra to Sets (External)

If you are interested in studying additional set-like operations coded in Python, see the following files in this book’s examples distribution:

PP3EDstructBasic set.py: RSet implementation
PP3EDstructBasic eltest.py: Test script for RSet

The RSet subclass defined in rset.py adds basic relational algebra operations for sets of dictionaries. It assumes the items in sets are mappings (rows), with one entry per column (field). RSet inherits all the original Set operations (iteration, intersection, union, & and | operators, uniqueness filtering, and so on), and adds new operations as methods:

Select: Return a set of nodes that have a field equal to a given value.
Bagof: Collect set nodes that satisfy an expression string.
Find: Select tuples according to a comparison, field, and value.
Match: Find nodes in two sets with the same values for common fields.
Product: Compute a Cartesian product: concatenate tuples from two sets.
Join: Combine tuples from two sets that have the same value for a field.
Project: Extract named fields from the tuples in a table.
Difference: Remove one set’s tuples from another.

Alternative implementations of set difference operations can also be found in the diff.py file in the same examples distribution directory.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 20.3. Implementing Sets

Create new playlist

Sign In

Sign Up