Before You Go Off and Try to Build a Search Engine…

While this chapter has hopefully given you some good insight into how to extract useful information from unstructured text, it has barely scratched the surface of the most fundamental concepts, in terms of both theory and engineering considerations. Information retrieval is a multibillion-dollar industry, so you can only imagine the combined investment that goes into both the theory and the implementations that work at scale to power search engines such as Google and Yahoo!. This section is a modest attempt to make sure you’re aware of some of the inherent limitations of TF-IDF, cosine similarity, and other concepts introduced in this chapter, in the hope that it will help shape your overall view of this space.

While TF-IDF is a powerful tool that’s easy to use, our specific implementation of it has a few important limitations that we’ve conveniently overlooked but that you should consider. One of the most fundamental is that it treats a document as a bag of words, which means that the order of terms in both the document and the query itself does not matter. For example, querying for “Green Mr.” would return the same results as “Mr. Green” if we didn’t implement logic to take the query term order into account or interpret the query as a phrase as opposed to a pair of independent terms. But obviously, the order in which terms appear is very important.
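To make the bag-of-words limitation concrete, here is a minimal sketch. The helper names bag_of_words and bigrams are hypothetical and are not part of the chapter's implementation; the sketch only assumes whitespace tokenization with lowercasing.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms; positions are discarded."""
    return Counter(text.lower().split())

def bigrams(text):
    """Pair each token with its successor, which preserves local term order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

# The two queries are indistinguishable as bags of words...
print(bag_of_words("Mr. Green") == bag_of_words("Green Mr."))  # True

# ...but differ once adjacent term pairs are taken into account.
print(bigrams("Mr. Green") == bigrams("Green Mr."))  # False
```

Comparing bigrams is only one way to recover ordering; treating the whole query as a phrase is another.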

Even if you carry out an n-gram analysis to account for collocations and term ordering, there’s still the underlying issue that TF-IDF assumes that all tokens with the same text value mean the same thing. Clearly, however, this need not be the case. Any homonym of your choice is a counterexample, and there are plenty of them. Even words that do mean the same thing can carry slightly different connotations depending on the exact context in which they are used. A key difference between a traditional keyword search technology based on TF-IDF principles and a more advanced semantic search engine is that the semantic search engine allows you to ground your search terms in a particular meaning by defining context. For example, you might be able to specify that the term you are searching for should be interpreted as a person, location, organization, or other specific type of entity. Being able to ground search terms in specific contexts is a very active area of research at the moment.

Cosine similarity suffers from many of the same flaws as TF-IDF. It does not take into account the context of the document or the term order from the n-gram analysis, and it assumes that terms appearing close to one another in vector space are necessarily similar, which is certainly not always the case. The obvious counterexample is homonyms, which may mean quite different things but are interpreted as the same term since they have the same text values. Our particular implementation of cosine similarity also hinges on TF-IDF scoring as its means of computing the relative importance of words in documents, so the TF-IDF errors have a cascading effect.
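The homonym problem is easy to reproduce. The sketch below is a simplification under stated assumptions: it weights terms by raw counts rather than TF-IDF scores, the function name cosine_similarity is hypothetical, and the two sample sentences are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two raw term-count vectors (not TF-IDF)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "bank" means a riverside in one text and a financial institution in the
# other, yet the shared token still contributes to the score.
score = cosine_similarity("river bank flooded", "bank approved loan")
print(round(score, 2))  # 0.33
```

The nonzero score comes entirely from a token whose two senses are unrelated, which is exactly the kind of error that cascades from the term-weighting step.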

You’ve probably also realized that there can be a lot of pesky details that have to be managed in analyzing unstructured text, and these details turn out to be pretty important for state-of-the-art implementations. For example, string comparisons are case-sensitive, so it’s important to normalize terms so that frequencies can be calculated as accurately as possible. But blindly normalizing to lowercase, for example, can also potentially complicate the situation, since the case used in certain words and phrases can be important. “Mr. Green” and “Web 2.0” are two examples worth considering. In the case of “Mr. Green”, maintaining the title case in “Green” could potentially be advantageous since it could provide a useful clue to a query algorithm that it’s not referring to an adjective and is likely part of a noun phrase. We’ll briefly touch on this topic again in Chapter 8 when NLP is discussed, since it’s ultimately the context in which “Green” is being used that is lost with the bag-of-words approach, whereas more advanced parsing with NLP has the potential to preserve that context.
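A small sketch shows the trade-off. The sample sentence is invented for illustration, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

text = "Mr. Green painted the door green"

# Case-sensitive counting keeps the surname and the adjective apart.
cased = Counter(text.split())
print(cased["Green"], cased["green"])  # 1 1

# Lowercasing merges them into one term and inflates its frequency.
lowered = Counter(text.lower().split())
print(lowered["green"])  # 2
```

Neither choice is right in general: lowercasing helps a query for "green" match both spellings, while preserving case keeps the clue that "Green" may be part of a name.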

Another consideration that’s rooted more in our particular implementation than in a general characteristic of TF-IDF itself is that our use of split to tokenize the text may leave trailing punctuation on tokens, which can affect the tabulation of frequencies. For example, in the earlier working example, corpus['b'] ends with the token “study.”, which is not the same as the token “study” that appears in corpus['a'] (the token that someone would probably be more likely to query). In this instance, the trailing period on the token affects both the TF and the IDF calculations.
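One possible remedy is to strip punctuation from the edges of each token after splitting. The sketch below is an assumption rather than the chapter's code: the tokenize helper is hypothetical, and the sample sentence merely stands in for a document that ends with a period.

```python
import string

sample = "He keeps a green plant in his study."

# Plain split leaves the period attached to the final token.
print(sample.lower().split()[-1])  # study.

def tokenize(text):
    """Split on whitespace, then strip leading and trailing punctuation."""
    stripped = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in stripped if t]  # drop tokens that were only punctuation

print(tokenize(sample)[-1])  # study
```

Note that this also turns a token such as “mr.” into “mr”, which is usually harmless but is another place where normalization discards information.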

Note

You might consider stemming words so that common variations of the same word are essentially treated as the same term instead of different terms. Check out the nltk.stem package for several good stemming implementations.

Finally, there are plenty of engineering considerations to ponder should you decide to implement a solution that you plan to take into a production situation. Indexes and caching are critical for obtaining reasonable query times on even moderately large data sets. Analyzing truly massive amounts of textual data in batch-processing systems such as Hadoop,[50] even on reasonably priced cloud infrastructure such as Amazon’s Elastic Compute Cloud, can be quite expensive and may require the budget of a medium-sized corporation.
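To give a feel for what indexing and caching mean in practice, here is a toy sketch of an inverted index with a memoized lookup. Everything in it is an assumption for illustration: the DOCS dictionary, the build_index and search helpers, and the choice of functools.lru_cache as the cache. A production system would look very different.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical documents, keyed by ID.
DOCS = {
    "a": "green plant in the study",
    "b": "professor plum has a green plant",
    "c": "colonel mustard in the library",
}

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

INDEX = build_index(DOCS)

@lru_cache(maxsize=1024)
def search(query):
    """Return the IDs of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return frozenset()
    postings = [INDEX.get(term, set()) for term in terms]
    return frozenset(set.intersection(*postings))

print(sorted(search("green plant")))  # ['a', 'b']
```

The index replaces a scan over every document with a few set lookups, and the cache means a repeated query is answered without touching the index at all.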



[50] If Hadoop interests you, you might want to check out Dumbo, a project that allows you to write and run Hadoop programs in Python.
