Another commonly used data structure is the set, a collection of objects that support operations such as:
Make a new set with all items in common.
Make a new set with all items in either operand.
Test whether an item exists in a set.
And there are others, depending on the intended use. Sets come in handy for dealing with more abstract group combinations. For instance, given a set of engineers and a set of writers, you can pick out individuals who do both activities by intersecting the two sets. A union of such sets would contain either type of individual, but would include any given individual only once.
Python lists, tuples, and strings come close to the notion of a
set: the in
operator tests
membership, for
iterates, and so
on. Here, we add operations not directly supported by Python
sequences. The idea is that we’re extending
built-in types for unique requirements.
As before, let’s first start out with a function-based set manager. But this time, instead of managing a shared set object in a module, let’s define functions to implement set operations on passed-in Python sequences (see Example 20-8).
Example 20-8. PP3EDstructBasicinter.py
def intersect(seq1, seq2): res = [] # start with an empty list for x in seq1: # scan the first sequence if x in seq2: res.append(x) # add common items to the end return res def union(seq1, seq2): res = list(seq1) # make a copy of seq1 for x in seq2: # add new items in seq2 if not x in res: res.append(x) return res
These functions work on any type of sequence—lists strings,
tuples, and other objects that conform to the sequence protocols
expected by these functions (for
loops, in
membership tests). In
fact, we can even use them on mixed object types: the last two
commands in the following code compute the intersection and union of
a list and a tuple. As usual in Python, the object
interface is what matters, not the specific
types:
C:...PP3EDstructBasic>python
>>>from inter import *
>>>s1 = "SPAM"
>>> s2 = "SCAM" >>>intersect(s1, s2), union(s1, s2)
(['S', 'A', 'M'], ['S', 'P', 'A', 'M', 'C']) >>>intersect([1,2,3], (1,4))
[1] >>>union([1,2,3], (1,4))
[1, 2, 3, 4]
Notice that the result is always a list here, regardless of the type of sequences passed in. We could work around this by converting types or by using a class to sidestep this issue (and we will in a moment). But type conversions aren’t clear-cut if the operands are mixed-type sequences. Which type do we convert to?
If we’re going to use the intersect
and union
functions as general tools, one
useful extension is support for multiple arguments (i.e., more
than two). The functions in Example 20-9 use Python’s
variable-length argument lists feature to compute the intersection
and union of arbitrarily many operands.
Example 20-9. PP3EDstructBasicinter2.py
def intersect(*args): res = [] for x in args[0]: # scan the first list for other in args[1:]: # for all other arguments if x not in other: break # this item in each one? else: res.append(x) # add common items to the end return res def union(*args): res = [] for seq in args: # for all sequence-arguments for x in seq: # for all nodes in argument if not x in res: res.append(x) # add new items to result return res
The multioperand functions work on sequences in the same way
as the original functions, but they support three or more
operands. Notice that the last two examples in the following
session work on lists with embedded compound objects: the in
tests used by the intersect
and union
functions apply equality testing
to sequence nodes recursively, as deep as necessary to determine
collection comparison results:
C:...PP3EDstructBasic>python
>>>from inter2 import *
>>>s1, s2, s3 = 'SPAM', 'SLAM', 'SCAM'
>>>intersect(s1, s2)
['S', 'A', 'M'] >>>intersect(s1, s2, s3)
['S', 'A', 'M'] >>>intersect(s1, s2, s3, 'HAM')
['A', 'M'] >>>union(s1, s2), union(s1, s2, s3)
(['S', 'P', 'A', 'M', 'L'], ['S', 'P', 'A', 'M', 'L', 'C']) >>>intersect([1,2,3], (1,4), range(5))
[1] >>>s1 = (9, (3.14, 1), "bye", [1,2], "mello")
>>>s2 = [[1,2], "hello", (3.14, 0), 9]
>>>intersect(s1, s2)
[9, [1, 2]] >>>union(s1, s2)
[9, (3.14, 1), 'bye', [1, 2], 'mello', 'hello', (3.14, 0)]
The set functions can operate on a variety of sequences, but they aren’t as friendly as true objects. Among other things, your scripts need to keep track of the sequences passed into these functions manually. Classes can do better: the class in Example 20-10 implements a set object that can hold any type of object. Like the stack classes, it’s essentially a wrapper around a Python list with extra set operations.
Example 20-10. PP3EDstructBasicset.py
class Set: def _ _init_ _(self, value = []): # on object creation self.data = [] # manages a local list self.concat(value) def intersect(self, other): # other is any sequence type res = [] # self is the instance subject for x in self.data: if x in other: res.append(x) return Set(res) # return a new Set def union(self, other): res = self.data[:] # make a copy of my list for x in other: if not x in res: res.append(x) return Set(res) def concat(self, value): # value: a list, string, Set... for x in value: # filters out duplicates if not x in self.data: self.data.append(x) def _ _len_ _(self): return len(self.data) def _ _getitem_ _(self, key): return self.data[key] def _ _and_ _(self, other): return self.intersect(other) def _ _or_ _(self, other): return self.union(other) def _ _repr_ _(self): return '<Set:' + repr(self.data) + '>'
The Set
class is used like
the Stack
class we saw earlier in
this chapter: we make instances and apply sequence operators plus
unique set operations to them. Intersection and union can be called
as methods, or by using the &
and |
operators normally used for
built-in integer objects. Because we can string operators in
expressions now (e.g., x & y &
z
), there is no obvious need to support multiple operands
in intersect
/union
methods here. As with all objects,
we can either use the Set
class
within a program or test it interactively as follows:
>>>from set import Set
>>>users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>>users2 = Set(['Jerry', 'Howard', 'Carol'])
>>>users3 = Set(['Emily', 'Carol'])
>>>users1 & users2
<Set:['Howard']> >>>users1 | users2
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Jerry', 'Carol']> >>>users1 | users2 & users3
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Carol']> >>> (users1 | users2) & users3
<Set:['Emily', 'Carol']> >>>users1.data
['Bob', 'Emily', 'Howard', 'Peeper']
Once you start using the Set
class, the first problem you might
encounter is its performance: its nested for
loops and in
scans become exponentially slow. That
slowness may or may not be significant in your applications, but
library classes should generally be coded as efficiently as
possible.
One way to optimize set performance is by changing the
implementation to use dictionaries rather than lists, for storing
sets internally—items may be stored as the keys of a dictionary
whose values are all None
.
Because lookup time is constant and short for dictionaries, the
in
list scans of the original set
may be replaced with direct dictionary fetches in this scheme. In
traditional terms, moving sets to dictionaries replaces slow linear
searches with fast hash table fetches. A computer scientist would
explain this by saying that the repeated nested scanning of the
list-based intersection is an exponential
algorithm in terms of its complexity, but dictionaries can be
linear.
The module in Example
20-11 implements this idea. Its class is a subclass of the
original set, and it redefines the methods that deal with the
internal representation but inherits others. The inherited &
and |
methods trigger the new intersect
and union
methods here, and the inherited
len
method works on dictionaries
as is. As long as Set
clients are
not dependent on the order of items in a set, they can switch to
this version directly by just changing the name of the module from
which Set
is imported; the class
name is the same.
Example 20-11. PP3EDstructBasicfastset.py
import set # fastset.Set extends set.Set class Set(set.Set): def _ _init_ _(self, value = []): self.data = {} # manages a local dictionary self.concat(value) # hashing: linear search times def intersect(self, other): res = {} for x in other: # other: a sequence or Set if self.data.has_key(x): # use hash-table lookup (or "in") res[x] = None return Set(res.keys( )) # a new dictionary-based Set def union(self, other): res = {} # other: a sequence or Set for x in other: # scan each set just once res[x] = None for x in self.data.keys( ): # '&' and '|' come back here res[x] = None # so they make new fastset's return Set(res.keys( )) def concat(self, value): for x in value: self.data[x] = None # inherit and, or, len def _ _getitem_ _(self, key): return self.data.keys( )[key] def _ _repr_ _(self): return '<Set:' + repr(self.data.keys( )) + '>'
This works about the same as the previous version:
>>>from fastset import Set
>>>users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>>users2 = Set(['Jerry', 'Howard', 'Carol'])
>>>users3 = Set(['Emily', 'Carol'])
>>>users1 & users2
<Set:['Howard']> >>>users1 | users2
<Set:['Emily', 'Howard', 'Jerry', 'Carol', 'Peeper', 'Bob']> >>>users1 | users2 & users3
<Set:['Emily', 'Howard', 'Carol', 'Peeper', 'Bob']> >>>(users1 | users2) & users3
<Set:['Emily', 'Carol']> >>>users1.data
{'Emily': None, 'Bob': None, 'Peeper': None, 'Howard': None}
The main functional difference in this version is the order of items in the set: because dictionaries are randomly ordered, this set’s order will differ from the original. For instance, you can store compound objects in sets, but the order of items varies in this version:
>>>import set, fastset
>>>a = set.Set([(1,2), (3,4), (5,6)])
>>>b = set.Set([(3,4), (7,8)])
>>>a & b
<Set:[(3, 4)]> >>>a | b
<Set:[(1, 2), (3, 4), (5, 6), (7, 8)]> >>>a = fastset.Set([(1,2), (3,4), (5,6)])
>>>b = fastset.Set([(3,4), (7,8)])
>>>a & b
<Set:[(3, 4)]> >>>a | b
<Set:[(3, 4), (1, 2), (7, 8), (5, 6)]> >>>b | a
<Set:[(3, 4), (5, 6), (1, 2), (7, 8)]>
Sets aren’t supposed to be ordered anyhow, so this isn’t a showstopper. A deviation that might matter, though, is that this version cannot be used to store unhashable (that is, immutable) objects. This stems from the fact that dictionary keys must be immutable. Because values are stored in keys, dictionary sets can contain only things such as tuples, strings, numbers, and class objects with immutable signatures. Mutable objects such as lists and dictionaries won’t work directly. For example, the following call:
fastset.Set([[1,2],[3,4]])
raises an exception with this dictionary-based set, but it works with the original set class. Tuples do work here as set items because they are immutable; Python computes hash values and tests key equality as expected.
So how did we do on the optimization front? Example 20-12 contains a
script to compare set class performance. It reuses the timer
module of Example 20-6 used earlier to
test stacks.
Example 20-12. PP3EDstructBasicsettime.py
import timer, sys import set, fastset def setops(Class): a = Class(range(50)) # a 50-integer set b = Class(range(20)) # a 20-integer set c = Class(range(10)) d = Class(range(5)) for i in range(5): t = a & b & c & d # 3 intersections t = a | b | c | d # 3 unions if _ _name_ _ == '_ _main_ _': rept = int(sys.argv[1]) print 'set => ', timer.test(rept, setops, set.Set) print 'fastset =>', timer.test(rept, setops, fastset.Set)
The setops
function makes
four sets and combines them with intersection and union operators
five times. A command-line argument controls the number of times
this whole process is repeated. More accurately, each call to
setops
makes 34 Set
instances (4 + [5 × (3 + 3)]) and
runs the intersect
and union
methods 15 times each (5 × 3) in
the for
loop’s body. The
performance improvement is equally dramatic this time around, on a
1.2 GHz machine:
C:...PP3EDstructBasic>python settime.py 50
set => 0.605568584834 fastset => 0.10293794323 C:...PP3EDstructBasic>python settime.py 100
set => 1.21189676342 fastset => 0.207752661302 C:...PP3EDstructBasic>python settime.py 200
set => 2.47468966028 fastset => 0.415944763929
These results will vary per machine, and they may vary per
Python release. But at least for this test case, the
dictionary-based set implementation (fastest
) is roughly six
times faster than the simple list-based set (set
). In fact, this sixfold speedup is
probably sufficient. Python dictionaries are already optimized
hash tables that you might be hard-pressed to improve on. Unless
there is evidence that dictionary-based sets are still too slow,
our work here is probably done.
For detail-minded readers, the prior section’s sixfold speedup results listed in this edition of the book were timed with Python 2.4 on a 1.2 GHz machine. Surprisingly, in the second edition, under an older Python (1.5.2) and slower machine (650 MHz), all list-based set results were roughly twice as slow as today, but the dictionary-based set was roughly four times slower (e.g., 1.54 and 0.44 seconds for dictionaries and lists at 50 iterations). In relative terms, the net effect is that dictionary sets went from being approximately three times faster than list sets to being six times faster today.
That is, machine speed doubled, but in addition, the dictionary code grew twice as quickly as before, relative to list-based sets. This larger jump reflects optimizations in Python itself. As you can see, version skew is an important consideration when analyzing performance; in this case, dictionaries are twice the performance boost they were a few years earlier.
Timing code sections helps, but the ultimate way to isolate bottlenecks is profiling. The Python profiler provides another way to gather performance data besides timing sections of code as done in this chapter so far. Because the profiler tracks all function calls, it provides much more information in a single blow. As such, it’s a more powerful way to isolate bottlenecks in slow programs—after profiling, you should have a good idea of where to focus your optimization efforts.
The profiler ships with Python as the standard library
module called profile
, and it
provides a variety of interfaces for measuring code performance.
It is structured and used much like the pdb
command-line debugger: import the
profile
module and call its
functions with a code string to measure performance. The simplest
profiling interface is its profile.run(statementstring)
function.
When invoked, the profiler runs the code string, collects
statistics during the run, and issues a report on the screen when
the statement completes. See it in action profiling a
100-iteration call to the set test function of the previous
section’s Example
20-12, for the list-based sets of Example 20-10 (hint: profile
an import
statement to profile
an entire file):
>>>import settime, timer, set
>>>import profile
>>>profile.run('timer.test(100, settime.setops, set.Set)')
675906 function calls in 6.328 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 118500 0.434 0.000 0.434 0.000 :0(append) 2 0.000 0.000 0.000 0.000 :0(clock) 500 0.003 0.000 0.003 0.000 :0(range) 1 0.004 0.004 0.004 0.004 :0(setprofile) 1 0.000 0.000 6.323 6.323 <string>:1(?) 0 0.000 0.000 profile:0(profiler) 1 0.000 0.000 6.328 6.328 profile:0(timer.test(100,...more...)) 1500 0.133 0.000 0.951 0.001 set.py:13(union) 3400 0.029 0.000 1.011 0.000 set.py:2(_ _init_ _) 3400 0.621 0.000 0.982 0.000 set.py:20(concat) 544000 2.032 0.000 2.032 0.000 set.py:26(_ _getitem_ _) 1500 0.014 0.000 5.226 0.003 set.py:27(_ _and_ _) 1500 0.013 0.000 0.964 0.001 set.py:28(_ _or_ _) 1500 3.009 0.002 5.212 0.003 set.py:6(intersect) 100 0.033 0.000 6.321 0.063 settime.py:4(setops) 1 0.002 0.002 6.323 6.323 timer.py:1(test)
The report’s format is straightforward and well documented in the Python library manual. By default, it lists the number of calls and times spent in each function invoked during the run. When the profiler is enabled, each interpreter event is routed to a Python handler. This gives an accurate picture of performance, but it tends to make the program being profiled run much slower than normal. In fact, the call profiled runs five times slower in this case under Python 2.4 and the 1.2 GHz test machine (6 seconds versus 1.2 seconds when not profiled).
On the other hand, the profiler’s report helps you isolate
which functions to recode and possibly migrate to the C language.
In the preceding listing, for instance, we can see that the
intersect
function (and the
corresponding _ _and_ _
operator method) is the main drag on performance; it takes roughly
five-sixths of the total execution time. Indexing (_ _getitem_ _
) is a close second, most
likely because it occurs so often with the repeated scans used by
intersection. Union, on the other hand, is fairly quick from a
relative perspective.
Here is a profile of the dictionary-based set implementation of Example 20-11 for comparison; the code runs five times slower under the profiler again (1 second versus 0.2 seconds), though the relative speed of the list and dictionary-based set code is the same when both are profiled (6 seconds versus 1 second, the same sixfold difference we noticed before):
>>>import settime, timer, fastset
>>>import profile
>>>profile.run('timer.test(100, settime.setops, fastset.Set)')
111406 function calls in 1.049 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 2 0.000 0.000 0.000 0.000 :0(clock) 17500 0.065 0.000 0.065 0.000 :0(has_key) 42500 0.166 0.000 0.166 0.000 :0(keys) 500 0.003 0.000 0.003 0.000 :0(range) 1 0.000 0.000 0.000 0.000 :0(setprofile) 1 0.000 0.000 1.049 1.049 <string>:1(?) 1500 0.159 0.000 0.439 0.000 fastset.py:15(union) 3400 0.052 0.000 0.052 0.000 fastset.py:23(concat) 38000 0.299 0.000 0.445 0.000 fastset.py:27(_ _getitem_ _) 3400 0.030 0.000 0.082 0.000 fastset.py:4(_ _init_ _) 1500 0.208 0.000 0.533 0.000 fastset.py:8(intersect) 0 0.000 0.000 profile:0(profiler) 1 0.000 0.000 1.049 1.049 profile:0(timer.test(100, ...more...)) 1500 0.014 0.000 0.548 0.000 set.py:27(_ _and_ _) 1500 0.015 0.000 0.454 0.000 set.py:28(_ _or_ _) 100 0.035 0.000 1.048 0.010 settime.py:4(setops) 1 0.001 0.001 1.049 1.049 timer.py:1(test)
This time, there is no obvious culprit behind the total execution time: intersection and union take roughly the same amount of time, and indexing is not much of a factor. Ultimately, the real difference is the exponential algorithm of list-based set intersection versus the linear nature of the dictionary-base algorithms, and this factor outweighs the choice of programming language used to implement the objects. Recoding in Python first is the best bet, since an exponential algorithm would be just as slow if implemented in C.
As coded, there seems to be a bottleneck in the
fastset
class: each time we call
a dictionary’s keys
method,
Python makes a new list to hold the result. This can happen
repeatedly during intersections and unions. If you are interested in
trying to optimize this further, see the following files in the
book’s examples distribution:
PP3EDstructBasicfastset2.py
PP3EDstructBasicfastset3.py
I wrote these to try to speed up sets further, but failed miserably. It turns out that adding extra code to try to shave operations usually negates the speedup you obtain. There may be faster codings, but the biggest win here was likely in changing the underlying data structure to dictionaries, not in minor code tweaks.
As a rule of thumb, your intuition about performance is almost always wrong in a dynamic language such as Python: the algorithm is usually the real culprit behind performance problems, not the coding style or even the implementation language. By removing the combinatorial list scanning algorithm of the original set class, the Python implementation became dramatically faster.
In fact, moving the original set class to C without fixing the
algorithm would not have addressed the real performance problem.
Coding tricks don’t usually help as much either, and they make your
programs difficult to understand. In Python, it’s almost always best
to code for readability first and optimize later if needed. Despite
its simplicity, fastset
is fast
indeed.
If you are interested in studying additional set-like operations coded in Python, see the following files in this book’s examples distribution:
RSet
implementation
Test script for RSet
The RSet
subclass defined
in rset.py adds basic relational algebra
operations for sets of dictionaries. It assumes the items in sets
are mappings (rows), with one entry per column (field). RSet
inherits all the original Set
operations (iteration, intersection,
union, &
and |
operators, uniqueness filtering, and so
on), and adds new operations as methods:
Select
Return a set of nodes that have a field equal to a given value.
Bagof
Collect set nodes that satisfy an expression string.
Find
Select tuples according to a comparison, field, and value.
Match
Find nodes in two sets with the same values for common fields.
Product
Compute a Cartesian product: concatenate tuples from two sets.
Join
Combine tuples from two sets that have the same value for a field.
Project
Extract named fields from the tuples in a table.
Difference
Remove one set’s tuples from another.
Alternative implementations of set difference operations can also be found in the diff.py file in the same examples distribution directory.
18.220.237.24