In the last chapter we came across common words that made it difficult to characterize a corpus. This is a problem for different kinds NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.
Earlier, we talked about how text data exists, and more is being generated every day. We need some way to manage and search through this data. If there is an ID or title, we can of course have an index on this data, but how do we search by content. With structured data, we can create logical expressions, and retrieve all rows that satisfy the expressions. This can also be done with text, though less exactly.
The foundations of information retrieval predates computers. Information retrieval focuses on how to find specific pieces of information in a larger set of information, especially information in text data. The most common type of task in information retrieval is search, i.e. document search.
When searching for a document here are the components.
q
- a logical statement describing the document or kind of document you are looking for
q_t
- a term in the query, generally a tokenD
- a collection of documents
d
- a document in D
with terms t_d
that describe the documentr(q, D)
- a function that ranks the documents in D
according to relevance to the query q
R
- the ranked list of documentsBefore we get into how to implement these components, we need to consider a technical problem. How can we quickly access documents based on the information within them? If we have to scan every document, that would mean that we could not search large collections of documents. To solve this problem we use an inverted index.
Originally, indexing was a means of organizing and labeling information, especially books, in a way that makes retrieving them easier. The Dewey Decimal Classification system is a way to index books based on their subject matter. We can also have indexes based on titles, authors, publication date, etc. Another kind of index is the kind that can often be found in the back of a book. This is a list of concepts in the book and pages on which to find them.
The index in inverted index is slightly different than the traditional index, instead taking inspiration from mathematical concept of indexing - that is assigning indexes to an element of a set. Recall our set of documents D
. We can assign a number to each document creating an mapping from integers to documents, i -> d.
Let’s create this index for our DataFrame
. Normally, we would store an inverted index in a data store that allows for quick lookups. Spark DataFrames are not for quick lookups. We will introduce tools used for search.
Let’s look at how we can build an inverted index in Spark. Here are the steps we will follow:
i -> d*
t_d -> i*
Step 1
We will be creating in inverted index for the mini_newsgroups data set.
import os from pyspark.sql.types import * from pyspark.sql.functions import collect_set from pyspark.sql import Row from pyspark.ml import Pipeline import sparknlp from sparknlp import DocumentAssembler, Finisher spark = sparknlp.start()
path = os.path.join('data', 'mini_newsgroups', '*') texts = spark.sparkContext.wholeTextFiles(path) schema = StructType([ StructField('path', StringType()), StructField('text', StringType()), ]) texts = spark.createDataFrame(texts, schema=schema).persist()
Now we need to create the index. Spark assumes the data is distributed, so in order to assign an index, we need to use the lower-level RDD
API. The zipWithIndex
will sort the data on the workers, and assign the indexes.
Step 2
rows_w_indexed = texts.rdd.zipWithIndex() (path, text), i = rows_w_indexed.first() print(i) print(path) print(text[:200])
0 file:/home/alext/projects/spark-nlp-book/data/mini_... Xref: cantaloupe.srv.cs.cmu.edu sci.astro:35223 sci.space:61404 Newsgroups: sci.astro,sci.space Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...
Now that we have created the index, we need to create a DataFrame
like we did above except now we need to add our index into our Row
s.
indexed = rows_w_indexed.map( lambda row_index: Row( index=row_index[1], **row_index[0].asDict()) ) (i, path, text) = indexed.first()
indexed_schema = schema.add(StructField('index', IntegerType())) indexed = spark.createDataFrame(indexed, schema=indexed_schema) .persist()
indexed.limit(10).toPandas()
path | text | index | |
---|---|---|---|
0 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles From: [email protected]... | 0 |
1 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | 1 |
2 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles Path: cantaloupe.... | 2 |
3 | file:/.../spark-nlp-book/data/m... | Xref: cantaloupe.srv.cs.cmu.edu rec.motorcycle... | 3 |
4 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | 4 |
5 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | 5 |
6 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles Path: cantaloupe.... | 6 |
7 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles Path: cantaloupe.... | 7 |
8 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!rochester!udel... | 8 |
9 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.... | 9 |
Each document d
is a collection of terms, t_d
. So our index is the mapping from integers to collections of terms.
An inverted index, on the other hand, is the mapping from terms t_d
to integers, inv-index: t_d -> i, j, k, ...
. This allows us to look up what documents contain a given term quickly.
Now let’s process the text.
Step 3
from sparknlp.pretrained import PretrainedPipeline assembler = DocumentAssembler() .setInputCol('text') .setOutputCol('document') tokenizer = Tokenizer() .setInputCols(['document']) .setOutputCol('token') lemmatizer = LemmatizerModel.pretrained() .setInputCols(['token']) .setOutputCol('lemma') normalizer = Normalizer() .setInputCols(['lemma']) .setOutputCol('normalized') .setLowercase(True) finisher = Finisher() .setInputCols(['normalized']) .setOutputCols(['normalized']) .setOutputAsArray(True) pipeline = Pipeline().setStages([ assembler, tokenizer, lemmatizer, normalizer, finisher ]).fit(indexed) indexed_w_tokens = pipeline.transform(indexed)
indexed_w_tokens.limit(10).toPandas()
path | text | index | normalized | |
---|---|---|---|---|
0 | file:/.../spark-nlp-book/data/m... | ... | 0 | [newsgroups, recmotorcycles, from, lisaalexcom... |
1 | file:/.../spark-nlp-book/data/m... | ... | 1 | [path, cantaloupesrvcscmuedudasnewsharvardedun... |
2 | file:/.../spark-nlp-book/data/m... | ... | 2 | [newsgroups, recmotorcycles, path, cantaloupes... |
3 | file:/.../spark-nlp-book/data/m... | ... | 3 | [xref, cantaloupesrvcscmuedu, recmotorcyclesha... |
4 | file:/.../spark-nlp-book/data/m... | ... | 4 | [path, cantaloupesrvcscmuedudasnewsharvardeduo... |
5 | file:/.../spark-nlp-book/data/m... | ... | 5 | [path, cantaloupesrvcscmuedumagnesiumclubcccmu... |
6 | file:/.../spark-nlp-book/data/m... | ... | 6 | [newsgroups, recmotorcycles, path, cantaloupes... |
7 | file:/.../spark-nlp-book/data/m... | ... | 7 | [newsgroups, recmotorcycles, path, cantaloupes... |
8 | file:/.../spark-nlp-book/data/m... | ... | 8 | [path, cantaloupesrvcscmuedurochesterudelbogus... |
9 | file:/.../spark-nlp-book/data/m... | ... | 9 | [path, cantaloupesrvcscmueducrabapplesrvcscmue... |
Because we are using a small data set, we will move out of Spark for the purposes of this example. We will collect our data into pandas, and use our index field as our DataFrame
index.
index = indexed_w_tokens.select('index', 'path', 'text').toPandas() index = index.set_index('index')
Now, let us create our inverted index. We will use SparkSQL to do this.
Step 4
SELECT term, collect_set(index) AS documents
FROM (
SELECT index, explode(normalized) AS term
FROM indexed_w_tokens
)
GROUP BY term
ORDER BY term
inverted_index = indexed_w_tokens .selectExpr('index', 'explode(normalized) AS term') .distinct() .groupBy('term').agg(collect_set('index').alias('documents')) .persist()
inverted_index.show(10)
term | documents | |
---|---|---|
0 | aaangel.qdeck.com | [198] |
1 | accumulation | [857] |
2 | adventists | [314] |
3 | aecfb.student.cwru.edu | [526] |
4 | again...hmmm | [1657] |
5 | alt.binaries.pictures | [815] |
6 | amplifier | [630, 624, 654] |
7 | antennae | [484, 482] |
8 | apr..gordian.com | [1086] |
9 | apr..lokkur.dexter.mi.us | [292] |
This is our inverted index. We can see that the term “amplifier” occurs in documents 630, 624, and 654. With this information, we can quickly find all documents that contain particular terms.
Another benefit, is that this inverted index based on the size of our vocabulary not the amount of text in our corpus, so it is not big data. The inverted index only grows with new terms and document indexes. For very large corpora, this can still be a large amount of data for a single machine. In the case of the mini_newsgroups data set, it is easily manageable though.
Let’s see how big is our inverted index.
inverted_index.count()
42624
For us, since we have such a small number of documents, the inverted index has more entries than the index. If you recall, word frequencies follow Zipf’s law, that is the frequency of a word is inversely proportional to its rank when sorted. This means that the most used English words are already in our inverted index. This can further be constrained by not tracking words that don’t occur at least a certain number of times.
inverted_index = { term: set(docs) for term, docs in inverted_index.collect() }
Now we can begin our most basic ranking function - simple boolean search. In this case, let’s look up all the documents that contain “language” or “information”.
lang_docs = inverted_index['language'] print('docs', ('{}, ' * 10).format(*list(lang_docs)[:10]), '...') print('number of docs', len(lang_docs))
docs 1926, 1937, 1171, 1173, 1182, 1188, 1830, 808, 1193, 433, ... number of docs 44
info_docs = inverted_index['information'] print('docs', ('{}, ' * 10).format(*list(info_docs)[:10]), '...') print('number of docs', len(info_docs))
docs 516, 519, 520, 1547, 1035, 1550, 1551, 17, 1556, 22, ... number of docs 215
filter_set = list(lang_docs | info_docs) print('number of docs in filter set', len(filter_set))
number of docs in filter set 246
intersection = list(lang_docs & info_docs) print('number of docs in intersection set', len(intersection))
number of docs in intersection set 13
Let’s print out lines from our filter set. Here, the filter set is the result set, but generally, the filter set is ranked by r(q, D)
, which results in the result set.
Let’s look at the lines where we see the occurrences to get an idea about our result set.
k = 1 for i in filter_set: path, text = index.loc[i] lines = text.split(' ') print(path.split('/')[-1], 'length:', len(text)) for line_number, line in enumerate(lines): if 'information' in line or 'language' in line: print(line_number, line) print() k += 1 if k > 5: break
Now that we have our result set, how should we rank them? We could just count the number of occurrences of our search term, but that be biased towards long documents. Also, what happens if our query has a very common word like “the”? If we just use the counts, common words like “the” will dominate our results. In our result set, the one with the most occurrences of the query terms has the longest text. We could say that the more terms found in the document, the more relevant it is, but this has problems too. What do we do with one term queries? In our example, only one document has both. Again, if our query has a common word, for example “the cat in the hat”, should “the” and “in” have the same importance as “cat” and “hat”? To solve this problem, we need a more flexible model for our documents and queries.
In the previous chapter, we introduced the concept of vectorizing documents. We talked about creating binary vectors, where 1 means that the word is present in the document. We can also use the counts.
When we convert a corpus to a collection of vectors, we are implicitly modeling our language as a vector space. In this vector space, each dimension represents one term. This has many benefits and drawbacks. It is a simple way to represent our text in a way that allows machine learning algorithms to work with it. It also allows us to represent the vectors sparsely. On the otherhand, we lose the information contained in the word order. This process also creates high dimensional data sets, which can be problematic to some algorithms.
Let’s calculate the vectors for our data set. In the last chapter, we used the CountVectorizer
for this. We will build them in Python, but the way will build them help us understand libraries implement vectorization.
class SparseVector(object): def __init__(self, indices, values, length): # if the indices are not in ascending order, we need # to sort them is_ascending = True for i in range(len(indices) - 1): is_ascending = is_ascending and indices[i] < indices[i+1] if not is_ascending: pairs = zip(indices, values) sorted_pairs = sorted(pairs, key=lambda x: x[0]) indices, values = zip(*sorted_pairs) self.indices = indices self.values = values self.length = length def dot(self, other): assert isinstance(other, SparseVector) assert self.length == other.length res = 0 i = j = 0 while i < len(self.indices) and j < len(other.indices): if self.indices[i] == other.indices[j]: res += self.values[i] * other.values[j] i += 1 j += 1 elif self.indices[i] < other.indices[j]: i += 1 elif self.indices[i] > other.indices[j]: j += 1 return res def hadamard(self, other): assert isinstance(other, SparseVector) assert self.length == other.length res_indices = [] res_values = [] i = j = 0 while i < len(self.indices) and j < len(other.indices): if self.indices[i] == other.indices[j]: res_indices.append(i) res_values.append(self.values[i] * other.values[j]) i += 1 j += 1 elif self.indices[i] < other.indices[j]: i += 1 elif self.indices[i] > other.indices[j]: j += 1 return SparseVector(res_indices, res_values, self.length) def sum(self): return sum(self.values) def __repr__(self): return 'SparseVector({}, {})'.format( dict(zip(self.indices, self.values)), self.length)
We need to make two passes over all the documents. In the first pass, we will get our vocabulary and the counts. In the second pass we will construct the vectors.
from collections import Counter vocabulary = set() vectors = {} for row in indexed_w_tokens.toLocalIterator(): counts = Counter(row['normalized']) vocabulary.update(counts.keys()) vectors[row['index']] = counts vocabulary = list(sorted(vocabulary)) inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)} vocab_len = len(vocabulary)
Now that we have this information, we need to go back over our word counts and construct actual vectors.
for index in vectors: terms, values = zip(*vectors[index].items()) indices = [inv_vocabulary[term] for term in terms] vectors[index] = SparseVector(indices, values, vocab_len)
vectors[42]
SparseVector({0: 1, 136: 1, 906: 1, 1365: 4, 1608: 2, 1766: 1, 2556: 1, 2584: 2, 3245: 1, 3585: 4, 3958: 2, 4270: 3, 4547: 1, 4714: 1, 5038: 3, 6303: 1, 7421: 1, 7519: 1, 7893: 1, 9208: 1, 9213: 1, 9241: 1, 9314: 1, 11081: 1, 11699: 1, 12102: 1, 12295: 1, 12429: 1, 13000: 1, 13668: 1, 14066: 1, 14476: 1, 14510: 1, 15205: 2, 15235: 3, 15545: 1, 15659: 1, 16112: 1, 17226: 1, 17311: 1, 17343: 3, 18034: 2, 18440: 2, 18573: 1, 18694: 1, 18849: 2, 19085: 1, 19260: 1, 19746: 1, 19801: 1, 19871: 1, 20547: 3, 20615: 1, 20841: 1, 20920: 2, 22016: 1, 22069: 1, 22308: 2, 22745: 1, 23243: 1, 24049: 1, 24952: 1, 25048: 1, 25192: 2, 25320: 1, 26556: 1, 26670: 1, 27395: 1, 27740: 2, 27744: 1, 27915: 1, 28092: 2, 28120: 1, 28150: 1, 28174: 1, 28263: 2, 28304: 1, 28596: 1, 28797: 1, 28872: 1, 29366: 1, 30554: 1, 31117: 1, 31961: 1, 32149: 1, 32337: 1, 32663: 1, 32706: 1, 33300: 1, 33614: 1, 33663: 1, 33854: 1, 33923: 1, 34812: 1, 34895: 2, 35436: 1, 35836: 1, 36359: 1, 36360: 1, 36370: 1, 36490: 2, 36785: 1, 37046: 1, 37679: 1, 38260: 1, 38273: 2, 38281: 9, 38343: 1, 38721: 1, 38722: 2, 38804: 3, 38808: 3, 38828: 1, 38915: 1, 39518: 1, 40267: 1, 41036: 1, 42005: 1, 42082: 1, 42165: 1, 42295: 1}, 43082)
Let’s look at some of the words that occur the most.
vocabulary[3585]
'be'
vocabulary[38281]
'the'
As we discussed above, there are many drawbacks to just using just the counts for search. The concern we have is that words that are generally common in English will have more impact than the less common words. There are a couple of strategies for this. First, let’s look at the simplest solution - removing the common words
These common words we are looking to remove are called stop words. This term was coined in the 1950s by Hans Peter Luhn, a pioneer in information retrieval. There are default stop word lists available, but it is often necessary to modify generic stop word lists for different tasks.
from pyspark.ml.feature import StopWordsRemover sw_remover = StopWordsRemover() .setInputCol("normalized") .setOutputCol("filtered") .setStopWords(StopWordsRemover.loadDefaultStopWords("english")) filtered = sw_remover.transform(indexed_w_tokens)
from collections import Counter vocabulary = set() vectors = {} for row in filtered.toLocalIterator(): counts = Counter(row['filtered']) vocabulary.update(counts.keys()) vectors[row['index']] = counts vocabulary = list(sorted(vocabulary)) inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)} vocab_len = len(vocabulary)
for index in vectors: terms, values = zip(*vectors[index].items()) indices = [inv_vocabulary[term] for term in terms] vectors[index] = SparseVector(indices, values, vocab_len)
vectors[42]
SparseVector({1755: 1, 2544: 1, 3231: 1, 3937: 2, 4249: 3, 4525: 1, 4692: 1, 6277: 1, 7395: 1, 7493: 1, 7867: 1, 9182: 1, 9187: 1, 9215: 1, 9288: 1, 11054: 1, 11672: 1, 12074: 1, 12267: 1, 12400: 1, 12966: 1, 13633: 1, 14031: 1, 14441: 1, 14475: 1, 15169: 2, 15621: 1, 16073: 1, 17186: 1, 17983: 2, 18640: 1, 19030: 1, 19205: 1, 19741: 1, 19811: 1, 20486: 3, 20554: 1, 20780: 1, 20859: 2, 21955: 1, 22008: 1, 22247: 2, 22684: 1, 23182: 1, 23987: 1, 24890: 1, 24986: 1, 25129: 2, 25256: 1, 26490: 1, 26604: 1, 28044: 1, 28074: 1, 28098: 1, 28512: 1, 28713: 1, 28788: 1, 29282: 1, 30470: 1, 31033: 1, 31877: 1, 32065: 1, 32253: 1, 32579: 1, 32622: 1, 33216: 1, 33530: 1, 33579: 1, 33838: 1, 34726: 1, 34809: 2, 35348: 1, 35748: 1, 36269: 1, 36270: 1, 36280: 1, 36400: 2, 36695: 1, 36956: 1, 37587: 1, 38615: 1, 38697: 3, 38701: 3, 38720: 1, 38807: 1, 39410: 1, 40156: 1, 40924: 1, 41880: 1, 41957: 1, 42039: 1, 42169: 1}, 42951)
vocabulary[4249]
'bmw'
vocabulary[20486]
'k'
vocabulary[38697]
'tony'
These words seem more informative, except for “k”. You should explore your data when determining what words should be included in the stop word list. In this situation, perhaps removing tokens of length one would be prudent.
It may seem like a daunting task to list all the words we don’t want. However, recalling what we discussed about morphology, we can narrow down what we want to remove. We want to remove unbound function morphemes.
A fluent speaker of a language who knows these basics of morphology, is able to create reasonable good list. This still leaves two concerns. What if we need to keep some common words? What if we want to remove some common lexical morphemes? You can modify the list, but that still leaves one last concern. How do we handle queries like “fictional cats”? The word “fictional” is less common than “cats”, so it makes sense that the former should be more important in determining what documents are returned. Let’s look at how we can implement this using our data.
Instead of manually editing our vocabulary, we can try and weight the words. We need to find some way of weighting the words using there “commonness”. One way to define “commoness” is the number of document in our corpus which contain the word. This is generally called document frequency. We want words with high document frequency to be down weighted, so we are interested in using inverse document frequency (abbreviated to IDF).
We take these values and multiply them by the term frequencies, which are the frequencies of words in a given document.
There are many different flavors of tf.idf. The most common kind is smoothed logarithmic.
Let’s calculate this with our vectors. We actually already have the term frequency, so all we need to do is calcuate the idf
, transform the values with log, and multiply tf
and idf
.
idf = Counter() for vector in vectors.values(): idf.update(vector.indices)
for ix, count in idf.most_common(20): print('{:5d} {:20s} {:d}'.format(ix, vocabulary[ix], count))
11081 date 2000 15545 from 2000 24049 message-id 2000 26670 newsgroups 2000 28872 path 2000 37046 subject 2000 22069 lines 1993 28150 organization 1925 38281 the 1874 1766 apr 1861 3585 be 1837 38722 to 1767 27740 of 1756 0 a 1730 16352 gmt 1718 18440 i 1708 18849 in 1695 1365 and 1674 15235 for 1474 17343 have 1459
We can now make idf
a SparseVector
. We know it contains all the words, so it actually won’t be sparse, but this will help us implement the next steps.
indices, values = zip(*idf.items()) idf = SparseVector(indices, values, vocab_len)
from math import log for index, vector in vectors.items(): vector.values = list(map(lambda v: log(1+v), vector.values)) idf.values = list(map(lambda v: log(vocab_len / (1+v)), idf.values))
tfidf = {index: tf.hadamard(idf) for index, tf in vectors.items()}
tfidf[42]
SparseVector({0: 2.2280564562633938, 136: 3.9963995256107365, 906: 2.785958617686592, 1365: 5.226315036495551, ..., 42165: 2.507303848514976, 42295: 4.13835219533075}, 43082)
Let’s look at the tf.idf values for “be” and “the”. Let’s also look one of the terms with a higher tf.idif than these common words.
tfidf[42][3585] # be
5.076854812241001
tfidf[42][38281] # the
7.217445184444707
vocabulary[20920], tfidf[42][20920]
Let’s look at the document to get an idea of why this word is so important.
print(doc_index.loc[42]['text'])
Newsgroups: rec.motorcycles From: [email protected] (Tony Kidson) Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!... Subject: Re: Info on Sport-Cruisers Distribution: world References: <[email protected]> Organization: The Modem Palace Reply-To: [email protected] X-Newsreader: Simple NEWS 1.90 (ka9q DIS 1.21) Lines: 25 Date: Mon, 19 Apr 1993 05:45:22 +0000 Message-ID: <[email protected]> Sender: [email protected] In article <[email protected]> [email protected] writes: >I'm looking for a sport-cruiser - factory installed fairings ( ... Tony +---------------+------------------------------+---------+ |Tony Kidson | ** PGP 2.2 Key by request ** |Voice ...| |... |... |... | +---------------+------------------------------+---------+
We can see the author of the note is named “Tony Kidson”.
Spark has stages for calculating tf.idf in MLLib. If you have a column that contains arrays of strings, you can use either CountVectorizer
which we are already familiar with, or HashingTF
to get the tf values. HashingTF
uses the hashing trick. That is where you decide on a vector space before hand, and hash the words into that vector space. If there is a collision, then those words will be counted as the same. This lets you trade off between memory efficiency and accuracy. As you make your predetermined vector space larger, the larger the output vectors, but the chance of collisions decreases.
Now that we know how to turn a document into a vector, we can explore how we can use that vector in classic machine learning tasks in the next chapter
Now that we have calculated the tf.dif values. Let’s build a search function. First, we need a function to process the query.
def process_query(query, pipeline): data = spark.createDataFrame([(query,)], ['text']) return pipeline.transform(data).first()['normalized']
Then we need a function to get the filter set.
def get_filter_set(processed_query): filter_set = set() # find all the documents that contain any of the terms return filter_set
Next, we need a function that will compute the score for the document.
def get_score(index, terms): return # return a single score
We also want a function for displaying results.
def display(index, score, terms): hits = [term for term in terms if term in vocabulary and tfidf[index][inv_vocabulary[term]] > 0.] print('terms', terms, 'hits', hits) print('score', score) print('path', path) print('length', len(doc_index.loc[index]['text']))
Finally, we are ready for our search function.
def search(query, pipeline, k=5): processed_query = process_query(query, pipeline) filter_set = get_filter_set(processed_query) scored = {index: get_score(index, processed_query) for index in filter_set} display_list = list(sorted(filter_set, key=scored.get, reverse=True))[:k] for index in display_list: display(index, scored[index], processed_query)
search('search engine', pipeline)
You should be able to implement get_filter_set
and get_score
easily using examples in this chapter. Try a few queries out. You will likely notice that there are two big limitations here. There is no n-gram support, and ranker is biased towards longer documents. What could you modify to fix these problems.
3.140.245.196