Sets

Sets are unordered collections of elements, with the additional restriction that the elements must be unique.  The main use-cases where sets are a good choice are membership tests (testing if an element is present in the collection) and, unsurprisingly, set operations such as union, difference, and intersection.

In Python, sets are implemented using a hash-based algorithm just like dictionaries; therefore, the time complexities for addition, deletion, and test for membership scale as O(1) with the size of the collection.

Sets contain only unique elements. An immediate use case of sets is the removal of duplicates from a collection, which can be accomplished by simply passing the collection through the set constructor, as follows:

    # create a list that contains duplicates
x = list(range(1000)) + list(range(500))
# the set *x_unique* will contain only
# the unique elements in x
x_unique = set(x)

The time complexity for removing duplicates is O(N), as it requires to read the input and put each element in the set.

Sets expose a number of operations like union, intersection, and difference. The union of two sets is a new set containing all the elements of both the sets; the intersection is a new set that contains only the elements in common between the two sets, and the difference is a new set containing the element of the first set that are not contained in the second set. The time complexities for these operations are shown in the following table. Note that since we have two different input sizes, we will use the letter S to indicate the size of the first set (called s), and T to indicate the size of the second set (called t):

Code Time
s.union(t) O(S + T)
s.intersection(t) O(min(S, T))
s.difference(t) O(S)

An application of set operations are, for example, Boolean queries. Going back to the inverted index example of the previous subsection, we may want to support queries that include multiple terms. For example, we may want to search for all the documents that contain the words cat and table. This kind of a query can be efficiently computed by taking the intersection between the set of documents containing cat and the set of documents containing table.

In order to efficiently support those operations, we can change our indexing code so that each term is associated to a set of documents (rather than a list). After applying this change, calculating more advanced queries is a matter of applying the right set operation. In the following code, we show the inverted index based on sets and the query using set operations:

    # Building an index using sets
index = {}
for i, doc in enumerate(docs):
# We iterate over each term in the document
for word in doc.split():
# We build a set containing the indices
# where the term appears
if word not in index:
index[word] = {i}
else:
index[word].add(i)

# Querying the documents containing both "cat" and "table"
index['cat'].intersection(index['table'])
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.218.212.102