Understanding text analytics

We have explored the world of machine learning and Apache Spark's support for it over the last few chapters. As we discussed, machine learning follows a workflow, which can be summarized in the following steps (a code sketch follows the list):

  1. Loading or ingesting data.
  2. Cleansing the data.
  3. Extracting features from the data.
  4. Training a model on the data to generate desired outcomes based on features.
  5. Evaluating the model or predicting outcomes based on the data.
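To make these steps concrete, here is a minimal sketch of the workflow using Spark's DataFrame-based spark.ml API. The input file data/messages.csv, its text and label columns, and the choice of logistic regression are hypothetical illustrations, not fixed requirements:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("WorkflowSketch").master("local[*]").getOrCreate()

// 1. Load or ingest data (hypothetical path and schema)
val raw = spark.read.option("header", "true").csv("data/messages.csv")

// 2. Cleanse the data: drop rows with missing text or labels
val cleaned = raw.na.drop(Seq("text", "label"))
  .selectExpr("text", "cast(label as double) as label")

// 3. Extract features: tokenize, then hash tokens into term-frequency vectors
val words = new Tokenizer()
  .setInputCol("text").setOutputCol("words").transform(cleaned)
val featurized = new HashingTF()
  .setInputCol("words").setOutputCol("features").transform(words)

// 4. Train a model on the features
val Array(train, test) = featurized.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new LogisticRegression().setMaxIter(10).fit(train)

// 5. Evaluate or predict outcomes on held-out data
val predictions = model.transform(test)
val auc = new BinaryClassificationEvaluator().evaluate(predictions)
println(s"Area under ROC: $auc")
```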

A simplified view of a typical pipeline is shown in the following diagram:

Hence, several stages of data transformation are possible before the model is trained and subsequently deployed. Moreover, we should expect to refine the features and model attributes over time. We could even explore a completely different algorithm, repeating the entire sequence of tasks as part of a new workflow.

A pipeline can be built from several transformation steps; for this purpose, we use a domain-specific language (DSL) to define the nodes (data transformation steps), creating a Directed Acyclic Graph (DAG) of nodes. Hence, an ML pipeline is a sequence of Transformers and Estimators that fits a PipelineModel to an input dataset. Each stage in the pipeline is known as a pipeline stage; the kinds of stages, illustrated in the sketch after this list, are as follows:

  • Estimator
  • Model
  • Pipeline
  • Transformer
  • Predictor
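As an illustration of these stage types, the following sketch chains two Transformers (Tokenizer and HashingTF) and one Estimator (LogisticRegression) into a Pipeline; calling fit() runs the DAG of stages and produces a PipelineModel. The tiny inline dataset and the column names are made up for the example:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("PipelineStages").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical labeled text data: (id, text, label)
val training = Seq(
  (0L, "spark makes text analytics simple", 1.0),
  (1L, "legacy batch jobs", 0.0)
).toDF("id", "text", "label")

// Transformer: splits raw text into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Transformer: hashes words into term-frequency feature vectors
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
// Estimator (specifically, a Predictor): learns a classifier
val lr = new LogisticRegression().setMaxIter(10)

// The Pipeline is itself an Estimator over the whole stage DAG
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// The fitted PipelineModel is a Transformer (the Model stage)
val model = pipeline.fit(training)
```

Note how each stage type from the preceding list appears here: Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator (and a Predictor), the Pipeline ties the stages together, and the fitted PipelineModel is the resulting Model.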

When we look at a line of text, we see sentences, phrases, words, nouns, verbs, and punctuation, which, when put together, have a meaning and purpose. Humans are extremely good at understanding sentences, words, slang, annotations, and context. This ability comes from years of practice and from learning how to read and write, proper grammar, punctuation, exclamations, and so on. So, how can we write a computer program to try to replicate this kind of capability?
