Chapter 6. Information Retrieval

In the last chapter, we came across common words that made it difficult to characterize a corpus. This is a problem for many kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.

Earlier, we talked about how much text data already exists, and more is being generated every day. We need some way to manage and search through this data. If there is an ID or a title, we can of course build an index on those fields, but how do we search by content? With structured data, we can create logical expressions and retrieve all rows that satisfy them. This can also be done with text, though less exactly.

The foundations of information retrieval predate computers. Information retrieval focuses on how to find specific pieces of information in a larger body of information, especially in text data. The most common type of task in information retrieval is search, that is, document search.

When searching for a document, these are the components (a minimal sketch of how they fit together follows the list).

  • Query q - a logical statement describing the document or kind of document you are looking for
    • Query term q_t - a term in the query, generally a token
  • Corpus of documents D - a collection of documents
    • Document d - a document in D with terms t_d that describe the document
  • Ranking function r(q, D) - a function that ranks the documents in D according to relevance to the query q
  • Result R - the ranked list of documents
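
To make these definitions concrete, here is a minimal sketch in plain Python of how the components fit together. The names (Query, Corpus, rank, score) are ours, not part of any library; the score argument stands in for the scoring logic inside r(q, D).

from typing import Callable, Dict, List, Tuple

Query = List[str]              # q: a list of query terms q_t
Corpus = Dict[int, List[str]]  # D: document id -> the terms t_d of document d

def rank(q: Query, D: Corpus,
         score: Callable[[Query, List[str]], float]) -> List[Tuple[int, float]]:
    # r(q, D): score every document against the query and sort by relevance
    scored = [(doc_id, score(q, terms)) for doc_id, terms in D.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)  # the result R

A trivial score function, such as lambda q, terms: sum(terms.count(t) for t in q), would already give a naive ranking; most of this chapter is about doing better than that.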

Before we get into how to implement these components, we need to consider a technical problem. How can we quickly access documents based on the information within them? If we had to scan every document, we could not search large collections of documents. To solve this problem, we use an inverted index.

Inverted Indices

Originally, indexing was a means of organizing and labeling information, especially books, in a way that makes retrieving them easier. The Dewey Decimal Classification system is a way to index books based on their subject matter. We can also have indexes based on titles, authors, publication date, etc. Another kind of index is the kind that can often be found in the back of a book. This is a list of concepts in the book and pages on which to find them.

The index in inverted index is slightly different than the traditional index, instead taking inspiration from the mathematical concept of indexing, that is, assigning indexes to the elements of a set. Recall our set of documents D. We can assign a number to each document, creating a mapping from integers to documents, i -> d.

Let's create this index for our DataFrame. Normally, we would store an inverted index in a data store that allows quick lookups; Spark DataFrames are not built for quick lookups. We will introduce tools built specifically for search later on.

Let’s look at how we can build an inverted index in Spark. Here are the steps we will follow:

  1. Load the data
  2. Create the index: i -> d*
    • Since we are using Spark, we will generate this index on the rows
  3. Process the text
  4. Create the inverted index from terms to documents: t_d -> i*

Step 1

We will be creating an inverted index for the mini_newsgroups data set.

import os

from pyspark.sql.types import *
from pyspark.sql.functions import collect_set
from pyspark.sql import Row
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher

spark = sparknlp.start()
path = os.path.join('data', 'mini_newsgroups', '*')
texts = spark.sparkContext.wholeTextFiles(path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()

Now we need to create the index. Spark assumes the data is distributed, so in order to assign an index to each row, we need to use the lower-level RDD API. The zipWithIndex method numbers the rows in their current order across the partitions, assigning each row an index.

Step 2

rows_w_indexed = texts.rdd.zipWithIndex()
(path, text), i = rows_w_indexed.first()

print(i)
print(path)
print(text[:200])
0
file:/home/alext/projects/spark-nlp-book/data/mini_...
Xref: cantaloupe.srv.cs.cmu.edu sci.astro:35223 sci.space:61404
Newsgroups: sci.astro,sci.space
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...

Now that we have created the index, we need to create a DataFrame like we did above, except now we need to add our index to our Rows.

indexed = rows_w_indexed.map(
    lambda row_index: Row(
        index=row_index[1], 
        **row_index[0].asDict())
)
(i, path, text) = indexed.first()
indexed_schema = schema.add(StructField('index', IntegerType()))

indexed = spark.createDataFrame(indexed, schema=indexed_schema) \
    .persist()
indexed.limit(10).toPandas()
  path text index
0 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles From: [email protected]... 0
1 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... 1
2 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles Path: cantaloupe.... 2
3 file:/.../spark-nlp-book/data/m... Xref: cantaloupe.srv.cs.cmu.edu rec.motorcycle... 3
4 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... 4
5 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... 5
6 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles Path: cantaloupe.... 6
7 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles Path: cantaloupe.... 7
8 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!rochester!udel... 8
9 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.... 9

Each document d is a collection of terms, t_d. So our index is the mapping from integers to collections of terms.

An inverted index, on the other hand, is the mapping from terms t_d to document indexes, inv-index: t_d -> i, j, k, .... This allows us to quickly look up which documents contain a given term.
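
As a toy illustration (with made-up documents, not our newsgroup data), here is what the two mappings look like in plain Python:

# forward index: i -> t_d
forward_index = {
    0: ['the', 'cat', 'sat'],
    1: ['the', 'dog', 'ran'],
    2: ['a', 'cat', 'and', 'a', 'dog'],
}

# inverted index: t_d -> {i, j, ...}
inverted = {}
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc_id)

inverted['cat']  # {0, 2}
inverted['the']  # {0, 1}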

Now let’s process the text.

Step 3

from sparknlp.annotator import Tokenizer, Normalizer, LemmatizerModel

assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(['token']) \
    .setOutputCol('lemma')
normalizer = Normalizer() \
    .setInputCols(['lemma']) \
    .setOutputCol('normalized') \
    .setLowercase(True)
finisher = Finisher() \
    .setInputCols(['normalized']) \
    .setOutputCols(['normalized']) \
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, tokenizer, 
    lemmatizer, normalizer, finisher
]).fit(indexed)

indexed_w_tokens = pipeline.transform(indexed)
indexed_w_tokens.limit(10).toPandas()
  path text index normalized
0 file:/.../spark-nlp-book/data/m... ... 0 [newsgroups, recmotorcycles, from, lisaalexcom...
1 file:/.../spark-nlp-book/data/m... ... 1 [path, cantaloupesrvcscmuedudasnewsharvardedun...
2 file:/.../spark-nlp-book/data/m... ... 2 [newsgroups, recmotorcycles, path, cantaloupes...
3 file:/.../spark-nlp-book/data/m... ... 3 [xref, cantaloupesrvcscmuedu, recmotorcyclesha...
4 file:/.../spark-nlp-book/data/m... ... 4 [path, cantaloupesrvcscmuedudasnewsharvardeduo...
5 file:/.../spark-nlp-book/data/m... ... 5 [path, cantaloupesrvcscmuedumagnesiumclubcccmu...
6 file:/.../spark-nlp-book/data/m... ... 6 [newsgroups, recmotorcycles, path, cantaloupes...
7 file:/.../spark-nlp-book/data/m... ... 7 [newsgroups, recmotorcycles, path, cantaloupes...
8 file:/.../spark-nlp-book/data/m... ... 8 [path, cantaloupesrvcscmuedurochesterudelbogus...
9 file:/.../spark-nlp-book/data/m... ... 9 [path, cantaloupesrvcscmueducrabapplesrvcscmue...

Because we are using a small data set, we will move out of Spark for the purposes of this example. We will collect our data into pandas, and use our index field as our DataFrame index.

doc_index = indexed_w_tokens.select('index', 'path', 'text').toPandas()
doc_index = doc_index.set_index('index')

Now, let us create our inverted index. The Spark SQL query below expresses the logic; we then build the same thing with the DataFrame API.

Step 4

SELECT term, collect_set(index) AS documents
FROM (
    SELECT index, explode(normalized) AS term
    FROM indexed_w_tokens
) AS terms_by_doc
GROUP BY term
ORDER BY term
inverted_index = indexed_w_tokens \
    .selectExpr('index', 'explode(normalized) AS term') \
    .distinct() \
    .groupBy('term').agg(collect_set('index').alias('documents')) \
    .persist()
inverted_index.limit(10).toPandas()
  term documents
0 aaangel.qdeck.com [198]
1 accumulation [857]
2 adventists [314]
3 aecfb.student.cwru.edu [526]
4 again...hmmm [1657]
5 alt.binaries.pictures [815]
6 amplifier [630, 624, 654]
7 antennae [484, 482]
8 apr..gordian.com [1086]
9 apr..lokkur.dexter.mi.us [292]

This is our inverted index. We can see that the term “amplifier” occurs in documents 630, 624, and 654. With this information, we can quickly find all documents that contain particular terms.

Another benefit is that the size of the inverted index depends on the size of our vocabulary, not on the amount of text in our corpus, so it is not big data. The inverted index only grows with new terms and document indexes. For very large corpora, this can still be a large amount of data for a single machine, but in the case of the mini_newsgroups data set, it is easily manageable.

Let's see how big our inverted index is.

inverted_index.count()
42624

For us, since we have such a small number of documents, the inverted index has more entries than the index. Recall that word frequencies follow Zipf's law; that is, the frequency of a word is inversely proportional to its rank when the words are sorted by frequency. This means that the most commonly used English words are already in our inverted index, so the index grows slowly as new documents are added. We can further constrain its size by not tracking words that don't occur at least a certain number of times.
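
If we did want to prune rare terms, a minimal sketch would be to filter on the size of the documents array before collecting; min_df here is a hypothetical threshold, and we keep the full index for the rest of the chapter.

from pyspark.sql.functions import size

min_df = 2  # hypothetical threshold: keep terms occurring in at least 2 documents
pruned_inverted_index = inverted_index.filter(size('documents') >= min_df)
pruned_inverted_index.count()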

inverted_index = {
    term: set(docs) 
    for term, docs in inverted_index.collect()
}

Now we can implement the most basic form of search: simple Boolean search. In this case, let's look up all the documents that contain "language" or "information".

lang_docs = inverted_index['language']
print('docs', ('{}, ' * 10).format(*list(lang_docs)[:10]), '...')
print('number of docs', len(lang_docs))
docs 1926, 1937, 1171, 1173, 1182, 1188, 1830, 808, 1193, 433,  ...
number of docs 44
info_docs = inverted_index['information']
print('docs', ('{}, ' * 10).format(*list(info_docs)[:10]), '...')
print('number of docs', len(info_docs))
docs 516, 519, 520, 1547, 1035, 1550, 1551, 17, 1556, 22,  ...
number of docs 215
filter_set = list(lang_docs | info_docs)
print('number of docs in filter set', len(filter_set))
number of docs in filter set 246
intersection = list(lang_docs & info_docs)
print('number of docs in intersection set', len(intersection))
number of docs in intersection set 13

Let's print out some lines from our filter set. Here, the filter set serves as the result set, but generally, the filter set is ranked by r(q, D), and that ranking produces the result set.

Let's look at the lines where the query terms occur to get an idea about our result set.

k = 1
for i in filter_set:
    path, text = doc_index.loc[i]
    lines = text.split('\n')
    print(path.split('/')[-1], 'length:', len(text))
    for line_number, line in enumerate(lines):
        if 'information' in line or 'language' in line:
            print(line_number, line)
    print()
    k += 1
    if k > 5:
        break

178813 length: 1783
14 >>    Where did you get this information?  The FBI stated ...

104863 length: 2795
14 of information that I received, but as my own bad mouthing) ...

104390 length: 2223
51 ... appropriate disclaimer to outgoing public information,

178569 length: 11735
60  confidential information obtained illegally from law ...
64  ... to allegations of obtaining confidential information from
86  employee and have said they simply traded information with ...
91  than truthful" in providing information during an earlier ...
125  and Department of Motor Vehicles information such as ...
130  Bullock also provided information to the South African ...
142  information.
151  exchanged information with the FBI and worked with ...
160  information in Los Angeles, San Francisco, New York, ...
168  some confidential information in the Anti-Defamation ...
182  police information on citizens and groups.
190  ... spying operations, which collected information on more than
209  information to police, journalists, academics, government ...
211  information illegally, he said.
215  identity of any source of information," Foxman said.

104616 length: 1846
45 ... an appropriate disclaimer to outgoing public information,

Now that we have our result set, how should we rank the documents? We could just count the number of occurrences of our search terms, but that would be biased toward long documents. Also, what happens if our query contains a very common word like "the"? If we just use the counts, common words like "the" will dominate our results. In our result set, the document with the most occurrences of the query terms is also the one with the longest text. We could say that the more query terms found in a document, the more relevant it is, but this has problems too. What do we do with one-term queries? In our example, only 13 of the 246 documents in the filter set contain both terms. Again, if our query contains common words, for example "the cat in the hat", should "the" and "in" have the same importance as "cat" and "hat"? To solve this problem, we need a more flexible model for our documents and queries.

Vector Space Model

In the previous chapter, we introduced the concept of vectorizing documents. We talked about creating binary vectors, where 1 means that the word is present in the document. We can also use the counts.

When we convert a corpus to a collection of vectors, we are implicitly modeling our language as a vector space. In this vector space, each dimension represents one term. This approach has benefits and drawbacks. It is a simple way to represent our text so that machine learning algorithms can work with it, and it allows us to represent the vectors sparsely. On the other hand, we lose the information contained in the word order, and this process creates high-dimensional data sets, which can be problematic for some algorithms.

Let's calculate the vectors for our data set. In the last chapter, we used the CountVectorizer for this. Here, we will build them in Python ourselves, because the way we build them will help us understand how libraries implement vectorization.

class SparseVector(object):
    
    def __init__(self, indices, values, length):
        # if the indices are not in ascending order, we need 
        # to sort them
        is_ascending = True
        for i in range(len(indices) - 1):
            is_ascending = is_ascending and indices[i] < indices[i+1]
        if not is_ascending:
            pairs = zip(indices, values)
            sorted_pairs = sorted(pairs, key=lambda x: x[0])
            indices, values = zip(*sorted_pairs)
        self.indices = indices
        self.values = values
        self.length = length
        
    def dot(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res = 0
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res += self.values[i] * other.values[j]
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return res
    
    def hadamard(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res_indices = []
        res_values = []
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res_indices.append(self.indices[i])
                res_values.append(self.values[i] * other.values[j])
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return SparseVector(res_indices, res_values, self.length)
    
    def sum(self):
        return sum(self.values)

    def __getitem__(self, index):
        # look up the value stored at a vocabulary index, returning 0.0 if
        # the index is not present (we use this later, e.g., tfidf[42][3585])
        try:
            return self.values[self.indices.index(index)]
        except ValueError:
            return 0.0
    
    def __repr__(self):
        return 'SparseVector({}, {})'.format(
            dict(zip(self.indices, self.values)), self.length)
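
Before using the class on real data, we can sanity-check dot and hadamard on a couple of toy vectors (the values here are made up purely for illustration):

u = SparseVector([0, 2, 5], [1, 3, 2], 8)
v = SparseVector([2, 3, 5], [4, 1, 2], 8)

u.dot(v)       # 3*4 + 2*2 = 16
u.hadamard(v)  # SparseVector({2: 12, 5: 4}, 8)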

We need to make two passes over all the documents. In the first pass, we will get our vocabulary and the counts. In the second pass we will construct the vectors.

from collections import Counter

vocabulary = set()
vectors = {}

for row in indexed_w_tokens.toLocalIterator():
    counts = Counter(row['normalized'])
    vocabulary.update(counts.keys())
    vectors[row['index']] = counts
    
vocabulary = list(sorted(vocabulary))
inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)}
vocab_len = len(vocabulary)

Now that we have this information, we need to go back over our word counts and construct actual vectors.

for index in vectors:
    terms, values = zip(*vectors[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors[index] = SparseVector(indices, values, vocab_len)
vectors[42]
SparseVector({0: 1, 136: 1, 906: 1, 1365: 4, 1608: 2, 1766: 1, 
2556: 1, 2584: 2, 3245: 1, 3585: 4, 3958: 2, 4270: 3, 4547: 1, 
4714: 1, 5038: 3, 6303: 1, 7421: 1, 7519: 1, 7893: 1, 9208: 1, 
9213: 1, 9241: 1, 9314: 1, 11081: 1, 11699: 1, 12102: 1, 12295: 1, 
12429: 1, 13000: 1, 13668: 1, 14066: 1, 14476: 1, 14510: 1, 
15205: 2, 15235: 3, 15545: 1, 15659: 1, 16112: 1, 17226: 1, 
17311: 1, 17343: 3, 18034: 2, 18440: 2, 18573: 1, 18694: 1, 
18849: 2, 19085: 1, 19260: 1, 19746: 1, 19801: 1, 19871: 1, 
20547: 3, 20615: 1, 20841: 1, 20920: 2, 22016: 1, 22069: 1, 
22308: 2, 22745: 1, 23243: 1, 24049: 1, 24952: 1, 25048: 1, 
25192: 2, 25320: 1, 26556: 1, 26670: 1, 27395: 1, 27740: 2, 
27744: 1, 27915: 1, 28092: 2, 28120: 1, 28150: 1, 28174: 1, 
28263: 2, 28304: 1, 28596: 1, 28797: 1, 28872: 1, 29366: 1, 
30554: 1, 31117: 1, 31961: 1, 32149: 1, 32337: 1, 32663: 1, 
32706: 1, 33300: 1, 33614: 1, 33663: 1, 33854: 1, 33923: 1, 
34812: 1, 34895: 2, 35436: 1, 35836: 1, 36359: 1, 36360: 1, 
36370: 1, 36490: 2, 36785: 1, 37046: 1, 37679: 1, 38260: 1, 
38273: 2, 38281: 9, 38343: 1, 38721: 1, 38722: 2, 38804: 3, 
38808: 3, 38828: 1, 38915: 1, 39518: 1, 40267: 1, 41036: 1, 
42005: 1, 42082: 1, 42165: 1, 42295: 1}, 43082)

Let’s look at some of the words that occur the most.

vocabulary[3585]
'be'
vocabulary[38281]
'the'

As we discussed above, there are many drawbacks to using just the counts for search. The concern is that words that are generally common in English will have more impact than less common words. There are a couple of strategies for dealing with this. First, let's look at the simplest solution: removing the common words.

Stop Word Removal

These common words we are looking to remove are called stop words. This term was coined in the 1950s by Hans Peter Luhn, a pioneer in information retrieval. There are default stop word lists available, but it is often necessary to modify generic stop word lists for different tasks.

from pyspark.ml.feature import StopWordsRemover

sw_remover = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))

filtered = sw_remover.transform(indexed_w_tokens)
from collections import Counter

vocabulary = set()
vectors = {}

for row in filtered.toLocalIterator():
    counts = Counter(row['filtered'])
    vocabulary.update(counts.keys())
    vectors[row['index']] = counts
    
vocabulary = list(sorted(vocabulary))
inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)}
vocab_len = len(vocabulary)
for index in vectors:
    terms, values = zip(*vectors[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors[index] = SparseVector(indices, values, vocab_len)
vectors[42]
SparseVector({1755: 1, 2544: 1, 3231: 1, 3937: 2, 4249: 3, 4525: 1, 
4692: 1, 6277: 1, 7395: 1, 7493: 1, 7867: 1, 9182: 1, 9187: 1, 
9215: 1, 9288: 1, 11054: 1, 11672: 1, 12074: 1, 12267: 1, 12400: 1, 
12966: 1, 13633: 1, 14031: 1, 14441: 1, 14475: 1, 15169: 2, 15621: 
1, 16073: 1, 17186: 1, 17983: 2, 18640: 1, 19030: 1, 19205: 1, 
19741: 1, 19811: 1, 20486: 3, 20554: 1, 20780: 1, 20859: 2, 
21955: 1, 22008: 1, 22247: 2, 22684: 1, 23182: 1, 23987: 1, 
24890: 1, 24986: 1, 25129: 2, 25256: 1, 26490: 1, 26604: 1, 
28044: 1, 28074: 1, 28098: 1, 28512: 1, 28713: 1, 28788: 1, 
29282: 1, 30470: 1, 31033: 1, 31877: 1, 32065: 1, 32253: 1, 
32579: 1, 32622: 1, 33216: 1, 33530: 1, 33579: 1, 33838: 1, 
34726: 1, 34809: 2, 35348: 1, 35748: 1, 36269: 1, 36270: 1, 
36280: 1, 36400: 2, 36695: 1, 36956: 1, 37587: 1, 38615: 1, 
38697: 3, 38701: 3, 38720: 1, 38807: 1, 39410: 1, 40156: 1, 
40924: 1, 41880: 1, 41957: 1, 42039: 1, 42169: 1}, 42951)
vocabulary[4249]
'bmw'
vocabulary[20486]
'k'
vocabulary[38697]
'tony'

These words seem more informative, except for "k". You should explore your data when determining which words to include in the stop word list. In this situation, perhaps removing tokens of length one would be prudent, as sketched below.
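
Here is a sketch of one way to do that, building on the code above: we extend the default English stop words with our own additions and drop single-character tokens using Spark SQL's higher-order filter function (available in newer Spark versions). The variable names are our own.

from pyspark.sql.functions import expr

custom_stop_words = StopWordsRemover.loadDefaultStopWords('english') + ['k']

custom_sw_remover = StopWordsRemover() \
    .setInputCol('normalized') \
    .setOutputCol('filtered') \
    .setStopWords(custom_stop_words)

filtered_custom = custom_sw_remover.transform(indexed_w_tokens) \
    .withColumn('filtered', expr('filter(filtered, t -> length(t) > 1)'))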

It may seem like a daunting task to list all the words we don’t want. However, recalling what we discussed about morphology, we can narrow down what we want to remove. We want to remove unbound function morphemes.

A fluent speaker of a language who knows these basics of morphology is able to create a reasonably good list. This still leaves two concerns: What if we need to keep some common words? What if we want to remove some common lexical morphemes? You can modify the list, but that still leaves one last concern. How do we handle queries like "fictional cats"? The word "fictional" is less common than "cats", so it makes sense that the former should be more important in determining which documents are returned. Let's look at how we can implement this using our data.

Inverse Document Frequency

Instead of manually editing our vocabulary, we can try to weight the words. We need some way of weighting the words by their "commonness". One way to define "commonness" is the number of documents in our corpus that contain the word. This is generally called document frequency. We want words with high document frequency to be down-weighted, so we are interested in using the inverse document frequency (abbreviated as IDF).

We take these values and multiply them by the term frequencies, which are the frequencies of words in a given document.

\[
\begin{aligned}
\mathrm{tf}(t, d) &= \text{the number of times } t \text{ occurs in } d \\
\mathrm{df}(t)    &= \text{the number of documents } t \text{ occurs in} \\
\mathrm{idf}(t)   &= \frac{\text{the number of documents}}{\mathrm{df}(t)}
\end{aligned}
\]

There are many different flavors of tf.idf. The most common kind is smoothed logarithmic.

\[
\begin{aligned}
\mathrm{tf}(t, d) &= \log\left(1 + \text{the number of times } t \text{ occurs in } d\right) \\
\mathrm{df}(t)    &= \text{the number of documents } t \text{ occurs in} \\
\mathrm{idf}(t)   &= \log\left(\frac{\text{the number of documents}}{1 + \mathrm{df}(t)}\right)
\end{aligned}
\]
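
To make the formulas concrete, here is a small worked example with made-up numbers: a term that occurs 4 times in a document and appears in 100 of 2,000 documents.

from math import log

tf = log(1 + 4)              # ~1.609
idf = log(2000 / (1 + 100))  # ~2.986
tf * idf                     # ~4.81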

Let's calculate this with our vectors. We actually already have the term frequencies, so all we need to do is calculate the idf, transform the values with log, and multiply tf and idf.

idf = Counter()

for vector in vectors.values():
    idf.update(vector.indices)
for ix, count in idf.most_common(20):
    print('{:5d} {:20s} {:d}'.format(ix, vocabulary[ix], count))
11081 date                 2000
15545 from                 2000
24049 message-id           2000
26670 newsgroups           2000
28872 path                 2000
37046 subject              2000
22069 lines                1993
28150 organization         1925
38281 the                  1874
 1766 apr                  1861
 3585 be                   1837
38722 to                   1767
27740 of                   1756
    0 a                    1730
16352 gmt                  1718
18440 i                    1708
18849 in                   1695
 1365 and                  1674
15235 for                  1474
17343 have                 1459

We can now make idf a SparseVector. We know it contains all the words, so it actually won’t be sparse, but this will help us implement the next steps.

indices, values = zip(*idf.items())
idf = SparseVector(indices, values, vocab_len)
from math import log

for index, vector in vectors.items():
    vector.values = list(map(lambda v: log(1+v), vector.values))
    
idf.values = list(map(lambda v: log(vocab_len / (1+v)), idf.values))
tfidf = {index: tf.hadamard(idf) for index, tf in vectors.items()}
tfidf[42]
SparseVector({0: 2.2280564562633938, 136: 3.9963995256107365, 
906: 2.785958617686592, 1365: 5.226315036495551, 
..., 
42165: 2.507303848514976, 42295: 4.13835219533075}, 43082)

Let's look at the tf.idf values for "be" and "the". Let's also look at one of the terms with a higher tf.idf than these common words.

tfidf[42][3585] # be
5.076854812241001
tfidf[42][38281] # the
7.217445184444707
vocabulary[20920], tfidf[42][20920]
('kidson', 10.200138516041417)

Let’s look at the document to get an idea of why this word is so important.

print(doc_index.loc[42]['text'])
Newsgroups: rec.motorcycles
From: [email protected] (Tony Kidson)
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...
Subject: Re: Info on Sport-Cruisers 
Distribution: world
References: <[email protected]>
Organization: The Modem Palace
Reply-To: [email protected]
X-Newsreader: Simple NEWS 1.90 (ka9q DIS 1.21)
Lines: 25
Date: Mon, 19 Apr 1993 05:45:22 +0000
Message-ID: <[email protected]>
Sender: [email protected]

In article <[email protected]> [email protected] writes:

>I'm looking for a sport-cruiser - factory installed fairings (
...
Tony

+---------------+------------------------------+---------+
|Tony Kidson    | ** PGP 2.2 Key by request ** |Voice ...|
|...            |...                           |...      |
+---------------+------------------------------+---------+

We can see the author of the note is named “Tony Kidson”.

In Spark

Spark has stages for calculating tf.idf in MLlib. If you have a column that contains arrays of strings, you can use either CountVectorizer, which we are already familiar with, or HashingTF to get the tf values; the IDF stage then computes the inverse document frequencies. HashingTF uses the hashing trick: you decide on a vector space beforehand and hash the words into that vector space. If there is a collision, the colliding words are counted as the same. This lets you trade off between memory efficiency and accuracy. The larger you make the predetermined vector space, the larger the output vectors, but the lower the chance of collisions.
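
As a sketch of how this looks in MLlib (the variable and column names here are our own choices, and numFeatures is the size of the predetermined vector space), we could compute tf.idf for our normalized tokens like this:

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF

hashing_tf = HashingTF() \
    .setInputCol('normalized') \
    .setOutputCol('tf') \
    .setNumFeatures(1 << 18)
idf_stage = IDF() \
    .setInputCol('tf') \
    .setOutputCol('tfidf')

tfidf_pipeline = Pipeline().setStages([hashing_tf, idf_stage]) \
    .fit(indexed_w_tokens)
tfidf_vectors = tfidf_pipeline.transform(indexed_w_tokens)

CountVectorizer can be swapped in for HashingTF when you need an exact, inspectable vocabulary instead of hashed indices.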

Now that we know how to turn a document into a vector, we can explore how to use that vector in classic machine learning tasks in the next chapter.

Exercises

Now that we have calculated the tf.idf values, let's build a search function. First, we need a function to process the query.

def process_query(query, pipeline):
    data = spark.createDataFrame([(query,)], ['text'])
    return pipeline.transform(data).first()['normalized']

Then we need a function to get the filter set.

def get_filter_set(processed_query):
    filter_set = set()
    # find all the documents that contain any of the terms
    return filter_set

Next, we need a function that will compute the score for the document.

def get_score(index, terms):
    return # return a single score

We also want a function for displaying results.

def display(index, score, terms):
    hits = [term for term in terms
            if term in vocabulary
            and tfidf[index][inv_vocabulary[term]] > 0.]
    print('terms', terms, 'hits', hits)
    print('score', score)
    print('path', doc_index.loc[index]['path'])
    print('length', len(doc_index.loc[index]['text']))

Finally, we are ready for our search function.

def search(query, pipeline, k=5):
    processed_query = process_query(query, pipeline)
    filter_set = get_filter_set(processed_query)
    scored = {index: get_score(index, processed_query) for index in filter_set}
    display_list = list(sorted(filter_set, key=scored.get, reverse=True))[:k]
    for index in display_list:
        display(index, scored[index], processed_query)
search('search engine', pipeline)

You should be able to implement get_filter_set and get_score easily using examples from this chapter. Try a few queries out. You will likely notice that there are two big limitations here: there is no n-gram support, and the ranker is biased toward longer documents. What could you modify to fix these problems?

Resources

 


Books

  • Lucene in Action, Second Edition, by Michael McCandless, Erik Hatcher, and Otis Gospodnetić
    • A guide to implementing search using Lucene
  • Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong
    • A guide to implementing search using Elasticsearch
  • Learning to Rank for Information Retrieval by Tie-Yan Liu
    • Learning to rank, that is, building machine learning-based rankers, is an important part of modern search engines. Tie-Yan Liu is one of the most important contributors to the field of learning to rank.