Document classification

Document classification is closely related to sentiment analysis. In both cases, we're classifying documents into categories using their text. It's really only the why that changes: sentiment analysis asks how the author feels, while document classification asks what type of document we're looking at. The most obvious and common document classification system is a spam filter, but the technique has many other uses.
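To make the spam filter idea concrete, here is a minimal sketch of a text classifier, assuming a bag-of-words multinomial naive Bayes model with add-one smoothing. That model choice, the tiny training set, and the names tokenize, train, classify, and TRAINING_DOCS are all illustrative assumptions, not something this chapter prescribes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (text, label) pairs invented for illustration.
TRAINING_DOCS = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(labeled_docs):
    """Collect per-class word counts and document counts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: the fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen word/label pairs from zeroing out.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_DOCS)
print(classify("free prize money", model))
print(classify("review the meeting report", model))
```

A real spam filter would train on far more data and use richer features, but the overall shape stays the same: turn text into counts, then pick the most likely category.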

One of my favorite uses of document classification is in settling the debate around the original authors of The Federalist Papers. Alexander Hamilton, James Madison, and John Jay published 85 essays under the pseudonym Publius in 1787 and 1788 supporting the ratification of the United States Constitution. Before his fatal duel with Aaron Burr in 1804, Hamilton provided a list detailing the author of each paper. Madison provided his own list in 1818, creating an authorship dispute that scholars have been attempting to resolve ever since. While it's mostly agreed that the disputed works belong to Madison, some theories remain that they were a collaborative effort between the two. Classifying these 12 disputed documents as either Madison or Hamilton has been fodder for many a data science blog. More formally, the paper The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization, by Glenn Fung, covers the topic with quite a bit of rigor.

A final example of document classification involves understanding the content of a document and prescribing an action. Imagine a classifier that reads some information about a legal case, for example, the petition/complaint and summons, and then makes a recommendation to the defendant. Our imaginary system might then say, "Given my experience with other cases like this one, you probably want to settle."

Sentiment analysis and document classification are powerful techniques based on the computer's ability to understand natural language. But, of course, this raises the question: how do we teach computers to read?
