
Data Structures in Web Information Retrieval*

Monika Henzinger

University of Vienna

51.1 Introduction

51.2 Inverted Indices

Index Compression • Index Granularity

51.3 Fingerprints

51.4 Finding Near-Duplicate Documents

51.5 Conclusions

References

51.1 Introduction

Current search engines process thousands of queries per second over a collection of billions of web pages with a sub-second average response time. There are two reasons for this astonishing performance: massive parallelism and a simple yet efficient data structure called the inverted index.

In this chapter we will describe inverted indices. The parallelism deployed by search engines is quite straightforward: Given a collection of documents and a user query, the goal of information retrieval is to find all documents that are relevant to the query and return them in decreasing order of relevance. Since on the Web there are often thousands of matches for a given user query, Web search engines usually return just the top 10 results and retrieve more results only upon request. This can easily be parallelized over m machines: Distribute the documents equally over m − 1 machines, let each of these machines find its best (up to) 10 documents, and return them to the remaining machine, which then merges the lists to determine the overall top 10.
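As an illustration, here is a minimal sketch of the final merge step, assuming each index machine returns its local top results as a list of (score, document id) pairs sorted by decreasing score (the function name and pair layout are illustrative, not taken from any particular engine):

    import heapq
    import itertools

    def merge_top_k(per_machine_results, k=10):
        """Merge the per-machine top-k lists into the overall top k.

        per_machine_results: one list of (score, doc_id) pairs per machine,
        each already sorted by decreasing score.
        """
        # heapq.merge keeps one cursor per machine, so only the results
        # that are actually consumed are ever compared.
        merged = heapq.merge(*per_machine_results, key=lambda pair: -pair[0])
        return list(itertools.islice(merged, k))

    # Example with two index machines:
    merge_top_k([[(0.9, "d12"), (0.4, "d7")], [(0.8, "d3"), (0.1, "d5")]], k=3)
    # -> [(0.9, 'd12'), (0.8, 'd3'), (0.4, 'd7')]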

Since the users are only presented with the top 10 results they are usually annoyed if these results contain duplicates or near-duplicates. Thus, it is crucial for a search engine to detect near-duplicate web pages. In Section 51.4 we will describe a technique for doing so based on fingerprints, which we will introduce in Section 51.3.

51.2 Inverted Indices

Given a user query consisting of terms t_1, …, t_n, a search engine has to find all documents relevant to the query. Usually Web search engines make the following simplifying assumption: A document is relevant only if it contains all query terms. To find all these documents the search engine uses the inverted (file) index data structure. It consists of two parts:

A data structure containing every word t that appears in at least one document, together with a pointer p_t for each word and potentially a count f_t of the number of documents containing the word; and

An inverted list for every term t that consists of all the documents containing t and is pointed to by p_t.

One popular implementation is to store the sorted list of words in a search tree whose leaves are connected by a linked list. This imposes an order on the words. The inverted lists are implemented as one large (virtual) array such that t's inverted list consists of all documents between the position in the array pointed to by p_t and the position pointed to by p_{t′}, where t′ is the term following t in the sorted list. In the Web scenario this array does not fit onto a single machine and thus needs to be distributed over multiple machines, either in RAM or on disk.

Another implementation is to store the list of words in any search tree or hash table together with the pointers p_t. A special terminator document id (like 0) is appended to each inverted list, and the lists can be stored either in one large array or in multiple arrays.
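A minimal in-memory sketch of this second variant, using a Python dict as the hash table over words; the sets of document ids and the f_t counts are illustrative simplifications of a real, compressed index:

    from collections import defaultdict

    def build_inverted_index(documents):
        """Build an in-memory inverted index.

        documents: dict mapping document id -> list of terms.
        Returns (index, counts), where index[t] is the sorted inverted list
        of term t and counts[t] is f_t, the number of documents containing t.
        """
        postings = defaultdict(set)
        for doc_id, terms in documents.items():
            for term in terms:
                postings[term].add(doc_id)
        index = {t: sorted(docs) for t, docs in postings.items()}
        counts = {t: len(docs) for t, docs in index.items()}
        return index, counts

    index, counts = build_inverted_index({1: ["web", "search"], 2: ["web", "index"]})
    # index == {"web": [1, 2], "search": [1], "index": [2]}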

To determine all documents containing the query terms t_1, …, t_n (called a Boolean AND) the intersection of their inverted lists can be computed by traversing the lists in parallel. If one of the query terms is negated (Boolean NOT), i.e., the user does not want this term to appear in the retrieved documents, the negation of the corresponding list is used. To speed up the computation in the case that a very frequent term is combined with a very infrequent term, the documents in the inverted list are stored in sorted order and a one-level or two-level search tree is stored for each inverted list. The one-level search tree consists of every f-th entry of the inverted list together with a pointer to the location of this entry in the inverted list, where f is a constant like 100. At query time entries in the inverted list of the frequent term can be skipped quickly to get to the next document that contains the infrequent term. The space requirement is only 1/f of the space requirement of the inverted lists.
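A sketch of such a skipped intersection, assuming sorted inverted lists and a one-level search tree built from every f-th entry; binary search over that sparse level stands in for following the skip pointers:

    from bisect import bisect_right

    def build_skip_level(postings, f=100):
        """One-level search tree: every f-th document id together with its position."""
        return [(postings[i], i) for i in range(0, len(postings), f)]

    def intersect(short_list, long_list, f=100):
        """Boolean AND of two sorted inverted lists, skipping through the longer one."""
        skips = build_skip_level(long_list, f)
        skip_ids = [doc for doc, _ in skips]
        result, pos = [], 0
        for doc in short_list:
            # Jump to the last skip entry whose document id is <= doc ...
            j = bisect_right(skip_ids, doc) - 1
            if j >= 0:
                pos = max(pos, skips[j][1])
            # ... then scan linearly inside a block of at most f entries.
            while pos < len(long_list) and long_list[pos] < doc:
                pos += 1
            if pos < len(long_list) and long_list[pos] == doc:
                result.append(doc)
        return result

    intersect([8, 42], list(range(0, 1000, 2)))   # -> [8, 42]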

51.2.1 Index Compression

Of course, data compression is crucial when indexing billions of documents. For that it is usually assumed that the documents are numbered consecutively from 1 to N and that the documents are stored in the inverted list by increasing document id. A simple but powerful technique is called delta-encoding: Instead of storing the actual document ids in the inverted list, the document id of the first document is stored together with, for each i, the difference between the ids of the i-th and the (i + 1)-st document containing the term. When traversing the inverted list, the first document id is added to the delta values seen so far to obtain the document id of the current document. Note that delta-encoding does not interfere with the one-level or two-level search tree stored on top of the inverted list. The advantage of delta-encoding is that it turns large integers (document ids) into mostly small integers (the delta values), depending of course on the value of f_t. A variable-length encoding scheme, like Golomb codes, of the delta values then leads to considerable space savings. We sketch Golomb codes in the next paragraph. See Reference 1 for more details on variable-length compression schemes.
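A small sketch of delta-encoding a sorted inverted list and of recovering the document ids while traversing it:

    def delta_encode(postings):
        """Replace a sorted list of document ids by the first id followed by the gaps."""
        return [postings[i] if i == 0 else postings[i] - postings[i - 1]
                for i in range(len(postings))]

    def delta_decode(deltas):
        """Recover the document ids by summing up the delta values seen so far."""
        doc_id, postings = 0, []
        for gap in deltas:
            doc_id += gap
            postings.append(doc_id)
        return postings

    assert delta_encode([3, 7, 8, 15]) == [3, 4, 1, 7]
    assert delta_decode([3, 4, 1, 7]) == [3, 7, 8, 15]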

There is a trade-off between the space used for the index and the time penalty incurred at query time for decompressing the index. If the index is read from disk, the time to read the appropriate part of the index dominates the decompression time and thus more compression can be used. If the index is in RAM, then less compression should be used, except if sophisticated compression is the only way to make the index fit in RAM.

Let n be the number of unique words in all the documents and let f be the number of words in all the documents, including repetitions. Given an integer x, it takes at least log x bits to represent x in binary. However, the algorithm reading the index needs to know how many bits to read. There are two possibilities: (1) represent x in binary by log N bits, with the top log N − log x bits set to 0; or (2) represent x in unary by x 1's followed by a 0. Usually the binary representation is chosen, but the unary representation is more space efficient when x < log N. Golomb codes combine both approaches as follows.

Choose b ≈ 0.69 · N·n/f. The Golomb code for an integer x ≥ 1 consists of two parts:

A unary encoding of q followed by 0, where q = ⌊(x − 1)/b⌋; and

A binary encoding of the remainder r = x − q·b − 1. If b is a power of 2, this requires log b bits.

Note that N·n/f is the average distance between entries in the inverted index, that is, the average value after the delta-encoding. The value of b was chosen to be roughly 69% of this average. Every entry of value b or less requires 1 bit for the unary part and log b bits for the binary part. This is a considerable space saving over the binary representation since log b + 1 < log N. As long as (x − 1)/b < log N − log b − 1, there is a space saving over using the binary encoding. So only entries that are roughly log N times larger than the average require more space than they would with the binary encoding.

If the frequency f_t of a term t is known, then the inverted list of t can be compressed even better using b_t ≈ 0.69 · N/f_t instead of b. This implies that for frequent terms b_t is smaller than b and thus fewer bits are used for small integers, which occur frequently in the delta-encoding of frequent terms.
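A sketch of the encoder and decoder for the common case where b is rounded to a power of 2, so that exactly log b bits are used for the remainder; code words are built as bit strings purely for readability:

    def golomb_encode(x, b):
        """Golomb code of an integer x >= 1 with parameter b (a power of 2 here)."""
        q, r = (x - 1) // b, (x - 1) % b       # q in unary, then r = x - q*b - 1 in binary
        log_b = b.bit_length() - 1
        binary = format(r, "b").zfill(log_b) if log_b else ""
        return "1" * q + "0" + binary

    def golomb_decode(bits, b):
        """Decode a single code word produced by golomb_encode."""
        q = bits.index("0")                    # length of the unary part
        log_b = b.bit_length() - 1
        r = int(bits[q + 1:q + 1 + log_b], 2) if log_b else 0
        return q * b + r + 1

    assert golomb_encode(9, 4) == "11000"
    assert golomb_decode("11000", 4) == 9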

51.2.2 Index Granularity

Another crucial issue is the level of granularity used in the inverted index. So far, we only discussed storing document ids. However, to handle some query operators, like quotation marks around a phrase, which require the query terms to be next to each other in the document, the exact position of each term within the document is needed. There are two possible ways to handle this:

1. One way is to consider all documents to be concatenated into one large document and then store the positions in this large document instead of the document ids. Additionally, one special inverted list is constructed that stores the position of the first word of each document. At query time an implicit Boolean AND operation is performed with this special inverted list to determine to which document a given position belongs. Note that the above compression techniques continue to work. This approach is very space efficient but incurs a run-time overhead at query time for traversing the special inverted list and performing the Boolean AND operation.

2. Another solution is to store (document id, position)-pairs in the inverted list and to delta-encode the document ids as well as the positions within the same document, as sketched below. This approach uses more space, but does not incur the run-time overhead.
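A minimal sketch of this second option, assuming the inverted list holds (document id, position) pairs sorted by document id and position; document ids are delta-encoded across the list and positions are delta-encoded within each document:

    def encode_positional(postings):
        """Delta-encode a sorted list of (document id, position) pairs."""
        encoded, prev_doc, prev_pos = [], 0, 0
        for doc_id, pos in postings:
            if doc_id != prev_doc:
                prev_pos = 0                   # position deltas restart in a new document
            encoded.append((doc_id - prev_doc, pos - prev_pos))
            prev_doc, prev_pos = doc_id, pos
        return encoded

    assert encode_positional([(2, 5), (2, 9), (7, 1)]) == [(2, 5), (0, 4), (5, 1)]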

Another data structure that a search engine could use is the suffix tree (see Chapter 30). Suffix trees require more space and are more complicated, but allow more powerful operations, like searching for syllables or letters.

51.3 Fingerprints

Fingerprints are short strings that represent larger strings and have the following properties:

If the fingerprints of two strings are different then the strings are guaranteed to be different.

If the fingerprints of two strings are identical then there is only a small probability that the strings are different.

If two different strings have the same fingerprint, this is called a collision. Thus, fingerprints are a type of hash function where the hash table is populated only very sparsely in exchange for a very low collision probability.

Fingerprints are very useful. For example, search engines can use them to quickly check whether a URL that is contained on a web page is already stored in the index. We will describe how they are useful for finding near-duplicate documents in the next section. One fingerprinting scheme is due to Rabin [2] and is equivalent to cyclic redundancy checks. We follow here the presentation by Broder [3]. Let s be a binary string and let s′ = 1s, i.e., s′ equals s prefixed with a 1. The string s = (s_1, s_2, …, s_m) induces a polynomial S(t) over Z_2 as follows:

S(t) = s_1 t^{m−1} + s_2 t^{m−2} + … + s_m.

Let P(t) be an irreducible polynomial of degree k over Z_2. The fingerprint of s is defined to be the polynomial

f(s) = S(t) mod P(t)

over Z_2, which can simply be represented as the string of its coefficients. As shown in Reference 3, the probability that two distinct strings of length m from a set of n strings have the same fingerprint is less than n·m^2/2^k. Thus, given n and m, the probability of collision can be reduced to the required value simply by increasing k, that is, the length of a fingerprint.

Rabin’s fingerprints have the property that, given two overlapping strings s = (s_1, s_2, …, s_m) and t = (t_1, t_2, …, t_m) such that t_{i+1} = s_i for all 1 ≤ i < m, the fingerprint of t can be computed from the fingerprint of s very efficiently: If

f(s) = r_1 t^{k−1} + r_2 t^{k−2} + … + r_k,

then

f(t) = r_2 t^{k−1} + r_3 t^{k−2} + … + r_k t + t_m + (r_1 t^k) mod P(t).

Note that Q(t) = t^k mod P(t) is equivalent to P(t) with the leading coefficient removed, which means it can be computed easily. Thus

f(t) = r_2 t^{k−1} + r_3 t^{k−2} + … + r_k t + t_m + r_1 Q(t).

Hence f(t) can be computed from f(s) as follows: Assume f(s) is stored in a shift register and Q(t) is stored in another register. Shift left with t_m as input and, if r_1 = 1, perform a bit-wise EX-OR operation with Q(t). See [3] for a software implementation for a typical 32-bit word computer.
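A minimal sketch of this bit-at-a-time computation, assuming P(t) is supplied as a (k + 1)-bit integer mask so that XOR-ing with it both clears the leading coefficient and adds Q(t); a production implementation would process whole machine words as described in [3]:

    def rabin_fingerprint(bits, p, k):
        """Rabin fingerprint of a bit sequence, computed one bit at a time.

        bits: iterable of 0/1 message bits (the string s prefixed with a 1).
        p:    P(t) as an integer mask with bit k set, e.g. t^3 + t + 1 -> 0b1011.
        k:    degree of P(t); the fingerprint fits in k bits.
        """
        f = 0
        for b in bits:
            f = (f << 1) | b       # shift left with the next bit as input
            if (f >> k) & 1:       # leading coefficient r_1 is 1 ...
                f ^= p             # ... so reduce modulo P(t)
        return f

    # Fingerprint of the bit string 10110 under P(t) = t^3 + t + 1:
    rabin_fingerprint([1, 0, 1, 1, 0], 0b1011, 3)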

Other ways of computing fingerprints are cryptographic checksums like SHA-1 or MD5.

51.4 Finding Near-Duplicate Documents

Users of search engines strongly dislike getting near-duplicate results on the same results page. Of course, what exactly constitutes a near-duplicate depends on the user. However, most users would agree that pages with the same content except for different ads, a different layout, or a different header or footer are near-duplicates. By “near-duplicate” we mean these kinds of “syntactic” duplicates; semantic duplication is even harder to detect.

Unfortunately, there are a variety of circumstances that create near-duplicate pages. The main reasons are (1) local copies of public-domain pages (for example, the Perl manual or the preamble of the US constitution) or of various databases and (2) multiple visits of the same page by the crawler (the part of the search engine that retrieves web pages to store in the index) without detection of the replication. The latter can happen because there were small changes to both the URL and the content of the page. Usually, crawlers fingerprint the URL as well as the content of the page and thus can easily detect exact duplicates.

There are various approaches to detect near-duplicate pages. One approach that works very well in the Web setting was given by Broder [4] and was validated on the Web [5]. The basic idea is to associate a set of fingerprints with each document such that the similarity of two pages is proportional to the size of the intersection of their respective sets.

We first describe how to compute the fingerprints and then show how they can be used to efficiently detect near-duplicates. The set of fingerprints for a document is computed using the shingling technique introduced by Brin et al. [6]. Let a token be either a letter, a word, a line, or a sentence in a document. Each document consists of a sequence of tokens. We call a contiguous sequence of q tokens a q-shingle of the document. We want to compute fingerprints for all q-shingles in a document. Using Rabin's fingerprints this can be done efficiently if we start from the first shingle and then use a sliding-window approach to compute the fingerprint of the shingle currently in the window.
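A sketch of the shingle-fingerprinting step; a 64-bit truncation of SHA-1 stands in for the rolling Rabin fingerprints of Section 51.3, and the choice of word tokens and q = 3 is purely for illustration:

    import hashlib

    def shingle_fingerprints(tokens, q=3):
        """Return the set of fingerprints of all q-shingles of a token sequence."""
        fingerprints = set()
        for i in range(len(tokens) - q + 1):
            shingle = " ".join(tokens[i:i + q])                    # q contiguous tokens
            digest = hashlib.sha1(shingle.encode("utf-8")).digest()
            fingerprints.add(int.from_bytes(digest[:8], "big"))    # keep 64 bits
        return fingerprints

    fps = shingle_fingerprints("a rose is a rose is a rose".split())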

Let S_D be the set of fingerprints generated for document D. The idea is simply to define the similarity or resemblance r(A, B) of documents A and B as

r(A, B) = |S_A ∩ S_B| / |S_A ∪ S_B|.

Experiments have indicated that a resemblance value close to 1 captures well the intuitive notion of “syntactic” near-duplication that we discussed above.

Storing the whole set S_D would take a lot of space. It suffices, however, to keep a fixed number of fingerprints from S_D for each document D. This subset is called a sketch. As we will show below, the sketches can be computed in time linear in the size of D and will be stored for each document. The resemblance of two documents can be approximated by using the sketches instead of the full sets of fingerprints. Thus, the time is only linear in the size of the sketches.

Sketches are determined as follows: Recall that each fingerprint requires k bits. Thus, S_D ⊆ {1, …, n}, where n = 2^k. Let π be a permutation chosen uniformly at random from the set of all permutations of [n]. Let X = min(min{π(S_A)}, min{π(S_B)}).

The crucial observation is that if min{π(S_A)} = min{π(S_B)}, then X = min{π(S_A)} must belong to π(S_A) ∩ π(S_B). If min{π(S_A)} ≠ min{π(S_B)}, then X belongs either to π(S_A) − π(S_B) or to π(S_B) − π(S_A), and, thus, X does not belong to π(S_A) ∩ π(S_B). It follows that min{π(S_A)} = min{π(S_B)} if and only if X belongs to π(S_A) ∩ π(S_B). Since π was chosen uniformly at random, the probability of the latter is

|S_A ∩ S_B| / |S_A ∪ S_B| = r(A, B).

It follows that

Pr(min{π(S_A)} = min{π(S_B)}) = r(A, B).

We choose p independent random permutations. For each document the sketch consists of the permuted fingerprint values min{π_1(S_D)}, min{π_2(S_D)}, …, min{π_p(S_D)}. The resemblance of two documents is then estimated using the size of the intersection of their sketches, whose expected value is proportional to the resemblance.
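A sketch of the sketch computation and of the resemblance estimate; because truly random permutations of {1, …, 2^k} are impractical (see the remark on min-wise independent permutations below), random linear hash functions modulo a large prime stand in for the permutations π_1, …, π_p:

    import random

    def build_sketch(fingerprints, p=84, seed=0):
        """Min-hash sketch of a fingerprint set: one minimum per 'permutation'."""
        prime = (1 << 89) - 1              # a prime larger than any 64-bit fingerprint
        rng = random.Random(seed)          # the same seed must be used for every document
        hashes = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(p)]
        return [min((a * f + b) % prime for f in fingerprints) for a, b in hashes]

    def estimate_resemblance(sketch_a, sketch_b):
        """Fraction of positions on which the sketches agree, an estimate of r(A, B)."""
        agree = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
        return agree / len(sketch_a)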

In practice, π cannot be chosen uniformly at random, which led to the study of min-wise independent permutations [7].

To detect the near-duplicates in a set of documents we build, for each shingle that occurs in some sketch, the list of documents to which this shingle belongs. Using these lists we generate, for each shingle and each pair of documents containing this shingle, a (document id, document id)-pair. Finally, we sort these pairs; the number of occurrences of a given pair is the size of the sketch intersection of the two documents and thus determines their resemblance. The running time is proportional to the number of (document id, document id)-pairs. If each shingle belongs to only a constant number of documents, it is linear in the number of unique shingles.
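A sketch of this clustering step, with an in-memory counter standing in for the external sort over the (document id, document id)-pairs; sketches are treated as sets of permuted shingle fingerprints, as in the description above:

    from collections import defaultdict
    from itertools import combinations

    def near_duplicate_pairs(sketches, threshold=0.9):
        """Return all document pairs whose estimated resemblance reaches the threshold.

        sketches: dict mapping document id -> sketch (a list of p fingerprint values).
        """
        docs_by_shingle = defaultdict(list)      # shingle value -> documents containing it
        for doc_id, sketch in sketches.items():
            for value in set(sketch):
                docs_by_shingle[value].append(doc_id)

        pair_counts = defaultdict(int)           # (doc, doc) -> number of shared sketch entries
        for docs in docs_by_shingle.values():
            for a, b in combinations(sorted(docs), 2):
                pair_counts[(a, b)] += 1

        p = len(next(iter(sketches.values())))   # sketch size
        return [pair for pair, c in pair_counts.items() if c / p >= threshold]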

In a Web search engine it often suffices to remove near-duplicates at query time. To do this, the sketch is stored with each document. The search engine determines the top 50 or so results and then performs a pair-wise near-duplicate detection with some resemblance threshold, removing the near-duplicate results, to determine the top 10 results.
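A small sketch of this query-time filter, reusing the estimate_resemblance helper from above; the threshold and the shape of the result list are illustrative:

    def dedupe_results(ranked_doc_ids, sketches, threshold=0.9, top=10):
        """Drop results that are near-duplicates of a higher-ranked result."""
        kept = []
        for doc in ranked_doc_ids:               # e.g. the top 50 candidates, best first
            if all(estimate_resemblance(sketches[doc], sketches[seen]) < threshold
                   for seen in kept):
                kept.append(doc)
            if len(kept) == top:
                break
        return kept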

The described near-duplicate detection algorithm assumes that the similarity between documents depends on the size of the overlap in their fingerprint sets. This corresponds to the Jaccard coefficient of similarity used in information retrieval. A more standard approach in information retrieval is to assign a weight vector of terms to each document and to define the similarity of two documents as the cosine of their (potentially normalized) term vectors. Charikar [8] gives an efficient near-duplicate detection algorithm based on this measure. He also presents a near-duplicate detection algorithm for another metric, the Earth Mover Distance.

Other similarity detection mechanisms are given in References 6 and 9–11. Many near-duplicate web pages are created because a whole web host is duplicated. Thus, a large percentage of the near-duplicate web pages can be detected if near-duplicate web hosts can be found. Techniques for this problem are presented in References 12 and 13.

51.5 Conclusions

In this chapter we presented the dominant data structure for web search engines, the inverted index. We also described fingerprints, which are useful at multiple places in a search engine, and document sketches for near-duplicate detection. Other useful data structures for web information retrieval are adjacency lists: All web search engines claim to perform some type of analysis of the hyperlink structure. For this, they need to store the list of incoming links for each document, a use of the classic adjacency list representation.

Web search engines also sometimes analyze the log of the user queries issued at the search engine. With hundreds of millions of queries per day these logs are huge and can only be processed with one-pass data stream algorithms.

References

1.Witten, I. H., Moffat, A., and Bell, T. C., 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.

2.Rabin, M. O., 1981. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University.

3.Broder, A. Z., 1993. Some applications of Rabin’s fingerprinting method. In Capocelli, R., De Santis, A., and Vaccaro, U., editors, Sequences II: Methods in Communications, Security, and Computer Science, Springer-Verlag, New York, 143–152.

4.Broder, A. Z., 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences 1997, IEEE Computer Society, Washington, DC, 21–29.

5.Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G., 1997. Syntactic clustering of the Web. In Proceedings Sixth International World Wide Web Conference, Elsevier Science, Santa Clara, 391–404.

6.Brin, S., Davis, J., and Garcia-Molina, H., 1995. Copy detection mechanisms for digital documents. In Proceedings 1995 ACM SIGMOD International conference on Management of Data, San Jose, 398–409.

7.Broder, A. Z., Charikar, M., Frieze, A., and Mitzenmacher, M., 1998. Min-wise independent permutations. In Proceedings 30th Annual ACM Symposium on Theory of Computing, ACM Press, Dallas, 327–336.

8.Charikar, M. S., 2002. Similarity estimation techniques from rounding algorithms. In Proceedings 34th Annual ACM Symposium on Theory of Computing, ACM Press, Montreal, 380–388.

9.Heintze, N., 1996. Scalable document fingerprinting. In Proceedings second USENIX Workshop on Electronic Commerce, Oakland, 191–200.

10.Manber, U., 1994. Finding similar files in a large file system. In Proceedings Winter 1994 USENIX Conference, San Francisco, 1–10.

11.Shivakumar, N. and Garcia-Molina, H., 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings Second Annual Conference on the Theory and Practice of Digital Libraries, Austin.

12.Bharat, K., Broder, A. Z., Dean, J., and Henzinger, M. R., 2000. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science, 51(12):1114–1122.

13.Cho, J., Shivakumar, N., and Garcia-Molina, H., 2000. Finding replicated web collections. In Proceedings 2000 ACM International Conference on Management of Data, ACM Press, Dallas, 355–366.

*This chapter has been reprinted from the first edition of this Handbook, without any content updates.
