8

Text analytics

What is it?

Text analytics, also known as text mining, is a process of extracting value from large quantities of unstructured text data.

Most businesses have a huge amount of text-based data from memos, company documents, emails, reports, media releases, customer records and communication, websites, blogs and social media posts. Until recently, however, it wasn’t always that useful. While the text is structured to make sense to a human being, it is unstructured from an analytics perspective because it doesn’t fit neatly into a relational database or the rows and columns of a spreadsheet.

The only structured part of text traditionally was the name of the document, the date it was created and who created it – all of which could be searched for easier retrieval at a later date. Plus of course you can search a document to find a particular word or phrase, but this type of enquiry requires us to know already what we are looking for.

Text analytics is now capable of telling us things we didn’t already know and, perhaps more importantly, had no way of knowing before. Access to huge text data sets and improved technical capability means text can be analysed to extract additional high-quality information above and beyond what the document actually says. For example, text can be assessed for commercially relevant patterns such as an increase or decrease in positive feedback from customers, new insights that could lead to product tweaking, or other interesting anomalies. And these insights can be incredibly useful in business.

When do I use it?

There are a number of reasons why you might use text analytics. Essentially, there are five main text analytics tasks:

  • text categorisation;
  • text clustering;
  • concept extraction;
  • sentiment analysis;
  • document summarisation.

Text categorisation assigns a document to one or more classes or categories according to its subject or according to other attributes such as document type, author, creation date, etc. This applies some structure to the text, which can then be used for analysis or query. This can be helpful if you have a huge amount of text data that needs to be classified for easier access and usability.

Spam filters use text categorisation to assess the text within incoming emails and decide if the email is legitimate or not. Email routing also uses this technique to re-route an email arriving at a general address to a more appropriate recipient based on the topic discussed in the text of the email.
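To make the spam filter idea more concrete, here is a minimal sketch in Python of one common way to categorise text, a naive Bayes classifier. It is an illustration only, not how any particular spam filter works: the training messages, the category names and the function names are all made up for this example, and a real filter would learn from many thousands of emails.

```python
import math
import re
from collections import Counter

# Tiny hand-made training set (hypothetical examples, not real data).
TRAINING = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("cheap loans click now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("please review the attached report", "legitimate"),
    ("lunch on friday with the team", "legitimate"),
]


def tokenise(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count how often each word appears in each category."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenise(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest naive Bayes score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior, then add the log likelihood of each word.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenise(text):
            # Add-one smoothing so an unseen word does not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, doc_counts = train(TRAINING)
print(classify("free prize money", word_counts, doc_counts))
print(classify("agenda for the team meeting", word_counts, doc_counts))
```

The same approach could in principle route emails by topic: you would swap the two categories for department names and train on past emails that were handled by each department.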

Text clustering allows you to automatically cluster huge amounts of text into meaningful topics or categories for fast information retrieval or filtering.

Search engines use text clustering to deliver meaningful search results. For example, if you enter ‘cell’ into a search engine the results would be clustered around ‘biology’, ‘battery’ and ‘prison’ – all of which use a different definition of the word ‘cell’.
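The ‘cell’ example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of how search engines actually cluster results: the six snippets are invented, the similarity measure (the share of words two snippets have in common, known as Jaccard similarity) and the threshold of 0.25 are choices made for this example, and each snippet simply joins the first group whose opening snippet it resembles.

```python
import re

# A short, hypothetical stop word list for this example only.
STOP_WORDS = {"a", "the", "of", "in", "is", "and", "to", "his", "for"}

# Invented snippets that all relate to the word 'cell'.
DOCUMENTS = [
    "the cell membrane protects the nucleus of a plant cell",
    "a battery cell stores energy for the phone",
    "the prisoner returned to his prison cell",
    "the nucleus of a cell holds the plant dna",
    "the phone battery stores less energy in the cold",
    "the prison guard locked the prisoner in",
]


def word_set(text):
    """Return the set of distinct words in the text, minus stop words."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOP_WORDS


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.25):
    """Greedily group documents; returns lists of document indexes."""
    clusters = []
    for index, text in enumerate(documents):
        words = word_set(text)
        for members in clusters:
            # Compare against the first document that started the cluster.
            seed_words = word_set(documents[members[0]])
            if jaccard(words, seed_words) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([index])
    return clusters


for members in cluster(DOCUMENTS):
    print([DOCUMENTS[i] for i in members])
```

Notice that nobody told the program that ‘biology’, ‘battery’ and ‘prison’ were the topics. The groups emerge from the overlap in wording alone, which is the key difference between clustering and categorisation.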

Concept extraction is particularly useful if you have a great deal of data that you need to search quickly to deliver results. These techniques are used in law firms, for example, where there are literally millions of past case documents from their own and other legal cases. Concept extraction analytics can home in on the documents that are likely to be most relevant to the new case, thus saving expensive personnel a huge amount of time trying to locate documents to use in court.

Sentiment analysis is particularly useful if you want to discover trends, patterns and hidden consensus within text over and above what the text actually says. Sentiment analysis, also known as opinion mining, seeks to extract subjective opinion or sentiment from text so that you can determine whether the text is positive, negative or neutral.
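One simple way to picture sentiment analysis is as word counting against lists of positive and negative words. The Python sketch below assumes two very short, made-up word lists and a hypothetical `sentiment` function. Commercial tools are considerably more sophisticated (this version cannot cope with negation such as ‘not good’, or with sarcasm), so treat it as an illustration of the principle only.

```python
import re

# Hypothetical word lists; a real lexicon would hold thousands of entries.
POSITIVE = {"good", "great", "love", "excellent", "happy", "helpful"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "unhappy", "slow"}


def sentiment(text):
    """Label text as positive, negative or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


print(sentiment("I love the new product, it is excellent"))
print(sentiment("Delivery was slow and support was terrible"))
print(sentiment("The parcel arrived on Tuesday"))
```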

Finally, document summarisation allows you to automatically summarise documents using a computer program to retain the most important points from the original document. This can be really useful if you have a lot of reading to get through but not enough time. Search engines also use this technology to summarise websites on result listings.
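A basic form of automatic summarisation keeps the sentences that contain the document’s most frequently used words. The Python sketch below shows that idea under some simplifying assumptions: the `summarise` function, the stop word list and the sample text are all invented for illustration, sentences are split on full stops, question marks and exclamation marks, and each sentence is scored by the average frequency of its words. Real summarisation tools use far richer methods.

```python
import re
from collections import Counter

# A short, hypothetical stop word list for this example only.
STOP_WORDS = {"a", "the", "of", "and", "was", "were", "is", "to", "in", "one"}


def content_words(text):
    """Lower-case words in the text, with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarise(text, max_sentences=1):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's content words.
        words = content_words(sentence)
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


FEEDBACK = (
    "Customers praised the delivery service. "
    "The delivery service was fast and the delivery staff were friendly. "
    "One customer mentioned the weather."
)
print(summarise(FEEDBACK, max_sentences=2))
```

In this sample the sentence about the weather is dropped because none of its words recur elsewhere in the text.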

What business questions is it helping me to answer?

Text analytics is particularly useful for information retrieval, pattern recognition, tagging and annotation, information extraction, sentiment assessment and predictive analytics. In essence it’s about getting more information from text and helping text to be even more useful over and above the actual meaning of the text.

As such, it can help you to answer:

  • What do my customers/employees think of my product? (See sentiment analysis in Chapter 9)
  • What is the perception of our employment brand among Twitter users?
  • What are the most important issues customers complain to us about?
  • What are key trends based on the search terms people use on our website?

How do I use it?

First, the text that you want to analyse must be datafied not just digitised. This is an important distinction.

By some estimates more than 130 million books have been published since the invention of the Gutenberg printing press in 1450. By 2012 the Google Book Project had scanned over 20 million titles, or more than 15 per cent of the world’s entire written heritage! That’s a lot of text. If you copied a page from a book as a JPEG file or took a picture of a page in a book, you would technically have a digital copy of the text, but that would be of no value to you if you wanted to run text analytics.

What you need is datafied text like the text we see in many e-readers. E-readers such as the Kobo or the Amazon Kindle don’t just let you read a digital image of the page; you can interact with the text. You can, for example, change font size, add notes, highlight text or search for specific words and phrases in the book. For most businesses their text will already be datafied, but if you store old customer records in paper files or even microfiche then those records need to be datafied – and that doesn’t just mean taking an electronic copy of the document: it effectively means re-creating it in digital form.

It is also important to remove ‘stop words’ from the text being analysed. Stop words are words such as ‘a’, ‘the’ and ‘of’, which appear frequently in all text but don’t communicate any unique information about the content or meaning of the text.
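Removing stop words is straightforward to do in software. The short Python sketch below assumes a deliberately tiny stop word list and a hypothetical `remove_stop_words` function; text analytics tools typically ship with much longer lists that you can adapt to your own data.

```python
import re

# A deliberately small, hypothetical stop word list.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "it", "was"}


def remove_stop_words(text):
    """Return the words of the text in order, with stop words dropped."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


print(remove_stop_words("The quality of the product is excellent"))
# prints ['quality', 'product', 'excellent']
```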

Once the text is ready there are a number of text analytics options open to you and which one you use will depend on your objective.

If you want to know more about the various text analytic techniques and how to use them then explore the links at the end of the chapter. Alternatively there are many commercially available text analytic tools on the market that can help you.

Practical example

You may be concerned about the level of employee engagement and decide to conduct an employee engagement survey.

The easiest way to collect this type of data is to create some form of quantitative survey that asks employees to rate their employer on a scale across a number of different questions. But the real nuggets of wisdom usually come from open-ended questions that allow employees to elaborate on their opinions and provide examples. That type of qualitative data, however, is much harder to assess. You could read through hundreds of questionnaires and that might give you some good ideas, or a sense of who is happy and who is not, but it wouldn’t really give you any indication of trends or what the collective was really feeling. Text analytics would allow you to assess all that free-flowing unstructured text and establish trends or clusters of opinion across the business, its divisions and specific teams.

The surveys could, for example, be converted into a word cloud, which collates all the text data from the questionnaires and sizes each word according to how many people mentioned it. The biggest word in a word cloud is therefore the word that was used by the greatest number of people. If the largest word on an employee engagement survey word cloud was ‘resentment’ or ‘unhappy’ then clearly you’ve got problems.
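The numbers behind such a word cloud are easy to produce. The Python sketch below counts how many respondents mentioned each word, so a word repeated several times by one person still counts only once. The survey responses, the stop word list and the `respondent_counts` function are all invented for illustration; drawing the cloud itself would need a charting or word cloud tool, which is not shown here.

```python
import re
from collections import Counter

# A short, hypothetical stop word list for this example only.
STOP_WORDS = {"i", "the", "is", "and", "about", "my", "but", "too", "a", "of"}

# Invented free-text answers from three employees.
RESPONSES = [
    "I feel unhappy about the workload and unhappy about pay",
    "The workload is too high",
    "My manager is supportive but the workload is heavy",
]


def respondent_counts(responses):
    """Count, for each word, how many respondents used it at least once."""
    counts = Counter()
    for response in responses:
        # A set ensures each respondent contributes a word only once.
        unique_words = set(re.findall(r"[a-z']+", response.lower())) - STOP_WORDS
        counts.update(unique_words)
    return counts


counts = respondent_counts(RESPONSES)
print(counts.most_common(1))
# prints [('workload', 3)]
```

In a word cloud built from these counts, ‘workload’ would be the largest word because all three employees mentioned it, while ‘unhappy’ would be small because only one did.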

I know one organisation that uses text analytics to avoid having to do these types of surveys in the first place. Instead they simply scan and analyse the content of emails sent by their staff as well as the social media posts they make on Facebook or Twitter. This allows them to accurately understand the levels of staff engagement without the time and expense of a traditional survey.

Tips and traps

Just because you have text data doesn’t mean you need to apply text analytics to it. Make sure you know what you are trying to discover or be otherwise clear about your objective and the reason for the analysis.

Often business owners or senior executives can get really excited about text analytics, especially when they consider the vast amount of text-based data that exists in their archive room or basement. But converting paper-based text documents into something that can be used for text analysis can be a very time-consuming and expensive process, so make sure you have a really valid reason for doing it. Besides, most data has a shelf life so if it’s too old it won’t help you that much anyway. Focus on the new text data you already have access to.

Further reading and references

Text analytics is usually performed using commercial software, and many vendors, such as SAS and IBM SPSS, provide very good reading material.
