Issues with mining unstructured data

Humans can read, parse, and understand unstructured text and documents more easily than computer programs can. Some of the reasons why text mining is more complicated than general supervised or unsupervised learning are given here:

  • Ambiguity in terms and phrases. The word "bank" has multiple meanings, which a human reader can correctly resolve from context, whereas a program needs preprocessing steps such as POS tagging and word sense disambiguation, as we have seen. According to the Oxford English Dictionary, the word "run" has no fewer than 645 different uses in its verb form alone, so such words can present real problems in resolving the intended meaning (between them, the words "run", "put", "set", and "take" have more than a thousand meanings).
  • Context and background knowledge associated with the text. Consider a sentence that uses a neologism with the suffix "-gate" to signify a political scandal, as in: "With cries for impeachment and popularity ratings in a nosedive, Russiagate finally dealt a deathblow to his presidency." A human reader can surmise that the coinage "Russiagate" evokes, by association via the affix, the high-profile intrigue of another momentous scandal in US political history, Watergate. This is particularly difficult for a machine to make sense of.
  • Reasoning. Inferencing from documents is very difficult because mapping unstructured information to knowledge bases is itself a big hurdle.
  • Labeled data. Supervised learning needs labeled training documents and, depending on the domain, labeling documents can be time-consuming and costly.
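
To make the ambiguity problem concrete, the following is a minimal sketch of gloss-overlap disambiguation (a simplified form of the Lesk idea), not a production method. The two-sense inventory for "bank", the stopword list, and the helper names `tokens` and `disambiguate` are all hypothetical and chosen only for illustration; a real system would likely take its glosses from a lexical resource such as WordNet.

```python
import re

# Hypothetical sense inventory: each sense label maps to a short gloss.
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money to customers",
    "bank/river": "the sloping land alongside a river or stream of water",
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "to", "that", "on",
             "in", "at", "by", "we", "she", "he", "her", "his"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the sentence.

    Returns the winning label and the overlap count for every sense.
    Ties (including zero overlap everywhere) fall back to the first sense.
    """
    context = tokens(sentence)
    overlaps = {label: len(context & tokens(gloss)) for label, gloss in senses.items()}
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


if __name__ == "__main__":
    for s in ("She went to the bank to deposit money",
              "We sat on the bank of the river and watched the water",
              "He walked to the bank"):
        print(s, "->", disambiguate(s))
```

The third sentence shares no words with either gloss, so the sketch simply falls back to the first sense. A human reader would look to the surrounding text for the answer, which is exactly the context a word-overlap heuristic does not have.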