© Tom Taulli 2020
T. Taulli, The Robotic Process Automation Handbook, https://doi.org/10.1007/978-1-4842-5729-6_9

9. Data Preparation

Tom Taulli1 
(1)
Monrovia, CA, USA
 

Being Ready for AI and Process Mining

Because of the power of data and algorithms, Google has created one of the most valuable companies in the world, with a market value of $900 billion. Not bad for a company that is only 20 years old.

But data can be fraught with issues. In November 2019, the Wall Street Journal published an article with the title of “Google’s ‘Project Nightingale’ Gathers Personal Health Data on Millions of Americans.”1 This stirred up lots of controversy and the US government quickly launched an investigation.

By partnering with the Ascension chain of 2,600 hospitals, Google was able to amass data – such as on diagnoses and hospitalization records – of patients across 21 states. The goal was to find ways to improve treatments but also to lessen administrative costs.

But this raised many questions: Why wasn’t this project disclosed? Were there potential privacy violations? And could Google really be trusted with this type of information?

This debate will certainly be contentious. It also is another indication that data is the “new oil.”

Granted, most companies will not run into such issues or be the target of public ire. Yet the use of data and algorithms will continue to gain adoption and pose challenges and risks.

Then how does this relate to RPA? Well, for a traditional implementation, there is often not a need for data analysis. But as companies expand on their efforts, this will definitely change. Keep in mind that more RPA platforms are adding AI capabilities and other sophisticated analytics. There is also the emergence of process mining, which involves crunching large amounts of event data (we will cover this in Chapter 12). Finally, RPA both consumes data and inherently creates data (in the form of detailed process recordings, system states, data access records, etc.) and requires retention of large amounts of data for compliance and audit purposes as well as performance management.

In other words, it’s a good idea to have an understanding of data as well as to have an overall strategy.

Types of Data

In Chapter 2, we discussed two important types of data: structured and unstructured. When it comes to analytics and AI, the latter is generally the more important. It is this kind of data that comes in higher volumes, which allows for more sophisticated models. Consider that about 80% of the data for an AI project will come from unstructured sources.

This presents a tough challenge, though: you will have to find ways to essentially structure it, which is often time-consuming and prone to errors. But as AI has become more important and attracted much more venture funding, new platforms have emerged to help with the process.

As for the other types of data, they include the following:
  • Metadata: This is data about data. Essentially, it is descriptive. For example, with a music file, there is metadata about the size, length, date of upload, comments, genre, and so on. All in all, this type of data can be extremely important in AI models.

  • Semi-Structured Data: As the name implies, this is a blend of structured and unstructured data. Often semi-structured data has some type of internal tags for categorization. This could include such things as XML (extensible markup language), which is based on numerous rules to identify elements of a document, and JSON (JavaScript Object Notation), which is a way to transfer information on the Web through APIs. However, semi-structured data represents a small portion, say, 5% to 10%.

  • Time-Series Data: This is data that is ordered by time. In terms of AI, this information can prove quite useful for predictions. But time-series data is also crucial for applications like autonomous driving and tracking the “customer journey”.
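The internal tags that make semi-structured data usable can be seen in a short JSON example (the record and its field names are invented for the illustration):

```python
import json

# A hypothetical semi-structured record: the field names act as internal tags
record = '{"title": "Blue in Green", "genre": "jazz", "duration_sec": 337}'

parsed = json.loads(record)
print(parsed["genre"])         # the tag lets us pull out a field directly
print(parsed["duration_sec"])  # numeric fields parse as numbers
```

There is no rigid table schema here, but the tags still give software enough structure to query the data, which is exactly what separates semi-structured data from free-form text.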

Big Data

Big Data is certainly a big business. According to IDC, the global spending on this category (including business analytics) is forecasted to hit a whopping $274.3 billion by 2022.

In the report, IDC group’s vice president of Analytics and Information Management, Dan Vesset, had this to say: “Digital transformation is a key driver of BDA (big data and business analytics) spending with executive-level initiatives resulting in deep assessments of current business practices and demands for better, faster, and more comprehensive access to data and related analytics and insights. Enterprises are rearchitecting to meet these demands and investing in modern technology that will enable them to innovate and remain competitive. BDA solutions are at the heart of many of these investments.”2

So what is Big Data? Actually, the definition is somewhat elusive. Part of this is due to the fact that Big Data involves constantly evolving innovation. What’s more, the category is massive, as it has multiple use cases.

Here is how SAS defines it: “Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.”3

As for the origins of the term Big Data, it goes back to the early 2000s. This is when we saw the emergence of cloud computing and the growth in large Internet apps. There was a need to find ways to manage the enormous data volumes, such as with open source projects like Hadoop.

Now another way to look at Big Data is through the prism of the three V’s (this was developed by Gartner analyst Doug Laney). They include the following:
  • Volume: Yes, there must be a massive amount. While there is no bright line for this, there is some consensus that Big Data should be in the tens of terabytes.

  • Variety: Big Data has much diversity, such as with structured, semi-structured, and unstructured data. But as we’ve noted earlier, unstructured data is the main source.

  • Velocity: This is the speed that data is received and processed. To handle this effectively, Big Data will often leverage in-memory approaches.

Given the size and diversity of the Big Data business, it should be no surprise that more V’s have been added! One is veracity, which goes to the overall accuracy of the data. This will be a key theme in this chapter, such as with data cleansing.

Then there is visualization. This involves the use of graphs to better understand the patterns in Big Data. This is a major part of BI (business intelligence) systems.

And then there is variability, which is concerned about how data changes over time. An example of this is the evolution of sentiment with social media content.

The Issues with Big Data

Now even with all the advancements in storage and software systems, Big Data is still challenging for many companies – even for those that are technologically savvy. Let’s face it, the growth is only accelerating.

Based on data from Cisco, annual global IP traffic will hit 4.8 zettabytes per year by 2022.4 To put this into perspective, it was 1.5 zettabytes in 2017.

And what is a zettabyte? It is 1,000,000,000,000,000,000,000 bytes or a billion terabytes or a trillion gigabytes! It’s enough to store 36,000 years of HD-TV video.
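The unit arithmetic behind these figures is easy to check:

```python
# Rough unit arithmetic behind the zettabyte figures above
ZETTABYTE = 10 ** 21  # bytes
TERABYTE = 10 ** 12
GIGABYTE = 10 ** 9

print(ZETTABYTE // TERABYTE)  # -> 1,000,000,000 (a billion terabytes)
print(ZETTABYTE // GIGABYTE)  # -> 1,000,000,000,000 (a trillion gigabytes)

# Cisco's forecast: 4.8 ZB per year, up from 1.5 ZB in 2017
growth = 4.8 / 1.5
print(round(growth, 1))       # -> 3.2 (more than a tripling of traffic)
```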

Thus, when it comes to parsing through the huge amounts of data, there are major complexities. As a result, it’s not uncommon for the projects to have notable failures.

Another issue is talent. The fact is that it is expensive to hire data scientists, who have a solid understanding of advanced statistics and machine learning. Because of this, many companies simply are unable to capitalize on their datasets effectively. However, there are companies such as Alteryx that are trying to solve this problem, such as by building tools that nontechnical people can use.

What’s more, the underlying core technologies for Big Data have seen significant changes over the years. There are not just the mega tech providers like Amazon, Google, and Teradata but also a myriad of start-ups, such as Snowflake. There has also been tremendous evolution in open source platforms, with Apache Spark getting lots of traction.

Finally, it’s important to keep in mind that you do not necessarily need huge amounts of data for good results. Even smaller sets can yield strong outcomes. The key is having an understanding of the business use case.

Note

According to research from Gartner, about 85% of Big Data projects are abandoned before they get to the pilot stages. Some of the reasons include lack of buy-in from stakeholders, dirty data, investment in the wrong IT tools, issues with data collection, and lack of a clear focus.5

The Data Process

There are various approaches for the data process. But there is one that has much support and has been refined over the past 20 years. It is called the CRISP-DM (cross-industry standard process for data mining) Process, which is the result of a consortium of experts, software developers, consultants, and academics. Here are the main steps, as seen in Figure 9-1:
Figure 9-1

These are the steps in the CRISP-DM Process

Consider that these steps are not set in stone. When going through this process, you might iterate on one step several times – or go back to an earlier step.

But the key is that there is an overall plan.

OK then, before you start on the CRISP-DM Process, you need to assemble the right team. Ideally, you should have one or two data scientists – or those who have a solid technical background. The good news is that there are many courses, such as on Udemy, Udacity, and Coursera, that can train your team on data science. Then you will have several people from the business side of your organization who can bring real-life experience to the project.

In terms of the data science team members, you will be competing against companies like Facebook or Google who can offer generous salaries and equity packages. Sounds kind of intimidating? Unfortunately, it can be. But then keep in mind that you do not need PhDs with experience in areas like ML and AI. Instead, you want those who have a good understanding of the fundamentals of putting together data projects, such as with using tools like TensorFlow.

Now then let’s take a deeper look at the CRISP-DM Process:
  • Step #1 – Business Understanding

There must be a clear statement of the problem to be solved. This could be how a change in the price can lead to an improvement in sales or how better engagement can mean reduced churn. Boiling things down, a data project should have a hypothesis to test (you do not want to complicate things with multiple factors to measure).
  • Step #2 – Data Understanding

Here you will identify the relevant sources of data for the project. There are three categories to consider:
  • In-House Data: This can come from many sources, such as the web site, mobile apps, and IoT sensors. In-house data is not only free but has the advantage of being customized to your business. Yet there are still some nagging issues, such as the challenge of formatting the data (especially if it is unstructured) and not having enough to perform useful analytics.

  • Open Source Data: This is publicly available data that is often free or has a low cost. Common examples include datasets from the government (which, by the way, can be extremely useful), nonprofits, and universities. Open source data is usually formatted and comprehensive. However, there could be problems with bias (we’ll discuss this topic later in the chapter).

  • Third-Party Data: This is from a commercial vendor. It’s true that the fees can be far from cheap. The irony is that the data may not even be particularly useful or accurate! So before purchasing third-party data, it is important to do some due diligence.

Note

Based on the AI engagements of Teradata, about 70% of data is from in-house sources, 20% from open sources, and the remainder from commercial vendors.6

When evaluating data, here are some questions to ask:

  • Is the data complete?

  • What might be missing?

  • Where did the data come from?

  • What were the collection points?

  • Who touched the data and processed it?

  • What have been the changes in the data?

  • What are the quality issues?

  • Step #3 – Data Preparation

After you have selected your data sources, you will then need to take steps to cleanse it. This is where having expertise with data science will be critical. Even small errors can lead to terrible results in a model.

According to Experian, bad data has had a negative impact on 88% of US companies. The research indicates that the average loss of revenues is about 12%!7

Then what are some of the actions you can take to improve the quality of the data? Let’s take a look:
  • De-duplication: Duplication is a common problem with datasets. For example, a customer who changes addresses or even his or her name may end up with multiple records. Across a large dataset, this extraneous data can be misleading.

  • Consistency: Some categories of data may not be clearly defined. For example, the word “profit” may have different meanings. So when it comes to the dataset, make sure the definitions for each parameter are understandable.

  • Outliers: This is when some data is way beyond the range of the overall dataset. True, this may ultimately be fine. After all, there are exceptional cases (say, a person with a high IQ). But outliers could point to problems as well. Might the data be incorrect – such as from a bad input?

  • Validation Rules: In some cases, the data will be clearly false. This would be, for instance, if a person’s height is a negative number. To deal with this, you can establish certain rules and then the data can be flagged for the deviations.

  • Binning: When analyzing data, does it really matter if a person is, say, 25 or 27 years old? Probably not. Instead, you will probably want to look at broader categories (ages 30 to 40 and so on).

  • Global Data: This can certainly present some tough problems, such as with the differences in cultures and standards. For example, while the United States writes dates in a month–day–year order, much of Europe uses day–month–year. To manage such things, you can set up conversion tables.

  • Merging: A column of data may be similar to another one. This could be if you have one that expresses height in inches and another in feet. In these situations, you might select just one or merge the columns.

  • Staleness: Is the data really applicable and relevant? Is it too old?

  • One-Hot Encoding: With this approach, you can replace categorical data with numbers. Here’s an illustration: Suppose you have a dataset with a column that has three possible values: phone, desktop computer, and laptop. You can certainly represent each with a number. But this could pose a problem – that is, an algorithm may think that a laptop is greater than a phone. So by using one-hot encoding, you can avoid this situation. You will create three new columns: is_Phone, is_Desktop_Computer, and is_Laptop. That is, if it is a laptop, you will put 1 in the is_Laptop column and 0 in the rest.
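Several of the cleansing steps above can be sketched in plain Python. The records and field names below are hypothetical, and real projects would use dedicated tooling, but the sketch shows the basic mechanics of de-duplication, validation rules, binning, and one-hot encoding:

```python
# Hypothetical raw records, including a duplicate and an invalid row
records = [
    {"name": "Ann",  "age": 25, "device": "Phone"},
    {"name": "Ann",  "age": 25, "device": "Phone"},             # duplicate
    {"name": "Bob",  "age": -3, "device": "Laptop"},            # fails validation
    {"name": "Carl", "age": 37, "device": "Desktop Computer"},
]

# De-duplication: keep only the first copy of each identical record
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Validation rule: age must be a positive number
valid = [r for r in deduped if r["age"] > 0]

# Binning: replace exact ages with broader ranges (25 -> "20s", 37 -> "30s")
for r in valid:
    r["age_bin"] = f"{(r['age'] // 10) * 10}s"

# One-hot encoding: one 0/1 column per device category
devices = ["Phone", "Desktop Computer", "Laptop"]
for r in valid:
    for d in devices:
        r["is_" + d.replace(" ", "_")] = 1 if r["device"] == d else 0

print(len(valid))            # -> 2 (one duplicate and one invalid row removed)
print(valid[0]["is_Phone"])  # -> 1
```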

All this can seem overwhelming. It is also a process that can take much time. But to help with this, a good approach is to analyze a sample of the data first and search for some of the potential issues.

Next, there are a myriad of data software tools that can help out, such as from Oracle, SAS, and IBM. There are also open source projects like OpenRefine, plyr, and reshape2.
  • Steps #4 and #5 – Modelling and Evaluation

It’s now time to look at how to create models with the data. This is done by applying algorithms to it, which are mathematical systems that can involve many steps. The temptation is actually to use the more sophisticated ones! Yet this could prove to be a mistake. It’s actually not uncommon for simple models – say, using traditional statistical techniques like regressions – to get strong results.

Keep in mind that there are hundreds of different types of algorithms. But a data scientist can use tools to test them out – such as TensorFlow, Keras, and PyTorch – and also engage in trial and error.

Once a model is selected, it must be trained with data. While much will have already been done with the data, there are still some things to consider. For instance, you do not want it to be sorted. Why? To an algorithm, the sort order may look like a pattern. It is better to randomize the data instead.

When training the model, there will be two datasets:
  • Training Data: This will be about 70% of the complete dataset. The training data is the information from which the algorithms find the underlying patterns. Example: Suppose you are creating a model for predicting the price of a home. Your dataset will have variables like the number of rooms, amenities (like a pool), and the crime rate. By processing the training data, the algorithm will calculate a weight for each of the factors in an equation. If it’s a linear regression, then it will look something like y = m ∗ x + b.

  • Evaluation of the Model: Creating models can certainly be tricky. Consider that there may be issues with overfitting (the model captures noise in the training data rather than the true pattern) or underfitting (the model is too simple to capture the pattern at all). This is why it is essential to evaluate the model with test data, which accounts for the remaining 30% of the total dataset. For it to be effective, the information needs to be representative. So with the evaluation, you will look for discrepancies. How accurately is the model reflecting reality? Unfortunately, in some cases, there may not be enough good data to get satisfactory results.
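The shuffle, the 70/30 split, and the y = m ∗ x + b regression described above can be sketched in a few lines of plain Python. The home-price dataset here is synthetic and invented purely for the illustration:

```python
import random

random.seed(42)

# Hypothetical dataset: (rooms, price) pairs with some random noise
data = [(rooms, 50_000 + 30_000 * rooms + random.randint(-5_000, 5_000))
        for rooms in range(1, 101)]

# Randomize first so the algorithm doesn't mistake sort order for a pattern
random.shuffle(data)
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]   # 70% training, 30% test

# Fit y = m*x + b by least squares on the training data only
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - m * sx) / n

# Evaluate on the held-out 30%: mean absolute error against reality
mae = sum(abs(y - (m * x + b)) for x, y in test) / len(test)
print(len(train), len(test))   # -> 70 30
```

Since the synthetic prices were generated with a slope of 30,000 per room, the fitted weight m should land close to that value, which is exactly the kind of sanity check the evaluation step is for.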

The training of the model requires much fine-tuning and tweaking. This is just one of the reasons an experienced data scientist is so important. While there are more automation tools for modelling – say, from DataRobot – there is still a need for human expertise.
  • Step #6 – Deployment

The deployment of a model is either within an IT infrastructure or as part of a consumer-facing app or web site. Because of the complexities and risks, it is probably best to deploy the system on a limited basis (perhaps with beta users). Rushing a project is usually a recipe for failure.

Note that even the world’s best technology companies have failed miserably when it comes to deployment. Here are some notable examples:
  • Microsoft’s Tay: This was a chatbot on Twitter that was launched in March 2016. But unfortunately, it quickly spewed racist and sexist messages! The problem was that Twitter trolls were able to manipulate the core AI. As a result, Microsoft took down Tay within about 24 hours.

  • In March 2019, a shooter live-streamed on Facebook his horrific killing of 50 people in two mosques in New Zealand. The video was not flagged until 29 minutes after the stream began. Simply put, Facebook’s AI was unable to detect it. The company would write a blog, saying: “AI systems are based on ‘training data,’ which means you need many thousands of examples of content in order to train a system that can detect certain types of text, imagery or video. This approach has worked very well for areas such as nudity, terrorist propaganda and also graphic violence where there is a large number of examples we can use to train our systems. However, this particular video did not trigger our automatic detection systems. To achieve that we will need to provide our systems with large volumes of data of this specific kind of content, something which is difficult as these events are thankfully rare. Another challenge is to automatically discern this content from visually similar, innocuous content—for example if thousands of videos from live-streamed video games are flagged by our systems, our reviewers could miss the important real-world videos where we could alert first responders to get help on the ground.”8

When it comes to deployment, there are often tough issues with existing systems. Usually they were not built for deploying algorithmic applications and may use different programming languages. There will also need to be a change in mindset, as IT people may not be familiar with using technologies for making predictions and insights based on large amounts of data. Because of all this, some rewriting and customization – as well as some training – is often required.

Another issue is that there may not be enough attention paid to the end user, who usually does not have much of a technical background. This is why a model should have an easy-to-use interface and workflow.

You should also avoid any jargon or complex configuration. You want as little friction as possible. To this end, try to limit the options for the end user, which should help with adoption. There should also be documentation and other educational resources, such as videos. For the most part, you will need to engage in change management as well (we covered this in Chapter 6).

Finally, the deployment of models is an iterative process. You will need to have periodic monitoring to see if the results continue to be accurate.

Types of Algorithms

As noted earlier, there are many types of algorithms. But you can narrow them down into broad categories, which should provide an easier framework for using them effectively.

Here’s a look:
  • #1 – Supervised Learning

This is where algorithms use data that is labeled. They are the most common types and also generally require large amounts of data to be effective. But unfortunately, it can be tough to come up with this type of information. In fact, one of the biggest stumbling blocks for AI has been the time-consuming nature of creating labeled datasets.

But there are some strategies for dealing with this. One comes from Fei-Fei Li, a Caltech PhD who went on to specialize in creating AI models. At first, she had her students come up with the datasets, but this did not work out. Then one of them suggested she try crowdsourcing, such as with Amazon.com’s Mechanical Turk. She tried it out and was able to amass a database of 3.2 million labeled images (for over 5,200 categories), which became known as ImageNet. It turned out to be a game changer in the AI world and would be the source of contests to test new models. In 2012, researchers from the University of Toronto – Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky – applied their deep learning algorithms against ImageNet and the results were startling, as there was a material improvement in image recognition. If anything, this would become a pivotal moment in the acceleration of the AI movement.

Next, another way to create labeled data is to actually use sophisticated algorithms! That is, they should be able to discern complex patterns and fill in the gaps.
  • #2 – Unsupervised Learning

This is where algorithms are applied against data that is unlabeled. While there is much of this data available – and it is quite useful for models – it is difficult to use. Unsupervised learning is still in the nascent stages.

However, with the advances in deep learning, things are definitely improving. Algorithms can essentially find clustering in the data, which may indicate important patterns. This approach has been effective in areas like sentiment analysis – where there are patterns across social media – and recommendation engines (say, to find movies you might be interested in).
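The clustering idea above can be sketched with a toy example: a minimal k-means-style loop that groups 1-D points (invented for the illustration) into two clusters without ever being told which group each point belongs to:

```python
# Unlabeled 1-D data points that happen to form two natural groups
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
c1, c2 = 0.0, 10.0  # initial guesses for the two cluster centers

for _ in range(10):
    # Assign each point to its nearest center, then move each center
    # to the mean of its assigned points
    g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
    g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
    c1 = sum(g1) / len(g1)
    c2 = sum(g2) / len(g2)

print(round(c1, 1), round(c2, 1))  # -> 1.0 8.1
```

No labels were needed: the algorithm discovered the two groupings on its own, which is the essence of the clustering techniques used in areas like sentiment analysis and recommendation engines.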

When it comes to the next generation of AI, unsupervised learning will certainly be key. In a paper in Nature by Yann LeCun, Geoffrey Hinton, and Yoshua Bengio, the authors note: “We expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.”9
  • #3 – Reinforcement Learning

Reinforcement learning is kind of like the concept of “learning by osmosis.” When you are new to something, you will experiment, trying different approaches. Over time, you will get a sense for what works and what to avoid.

Of course, regarding reinforcement learning, the underlying mathematics is extremely complex. But the overall idea is straightforward: it is about learning by experiencing rewards and punishments for actions.
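As a toy illustration of the reward-and-punishment idea (a sketch with invented payoff probabilities, not DeepMind’s methods), here is an epsilon-greedy agent that learns by trial and error which of three actions pays off most often:

```python
import random

random.seed(0)

payoff_prob = {"A": 0.2, "B": 0.5, "C": 0.8}  # hidden from the agent
value = {a: 0.0 for a in payoff_prob}          # the agent's running estimates
counts = {a: 0 for a in payoff_prob}

for step in range(5_000):
    # Explore 10% of the time; otherwise exploit the best estimate so far
    if random.random() < 0.1:
        action = random.choice(list(payoff_prob))
    else:
        action = max(value, key=value.get)

    # Reward of +1 on a payoff, punishment of -1 otherwise
    reward = 1 if random.random() < payoff_prob[action] else -1
    counts[action] += 1
    # Nudge the running estimate toward the observed reward
    value[action] += (reward - value[action]) / counts[action]

best = max(value, key=value.get)
print(best)  # the agent should settle on the action with the highest payoff
```

The agent is never told the payoff probabilities; it simply experiences rewards and punishments, which is the same basic loop that systems like AlphaGo scale up to vastly larger decision spaces.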

Interestingly enough, some of the major advances in reinforcement learning have come in gaming. Just look at DeepMind, which is owned by Google. The company has several hundred AI researchers who have tested their theories on well-known games, such as Go. In 2015, DeepMind created AlphaGo, which beat the European champion, Fan Hui, five games to zero. This was the first time this had happened – and it was a shocker. The conventional wisdom was that a computer simply would not have enough processing power to defeat a champion-level player. A couple of years later, AlphaGo would win a three-game match against Ke Jie, who was ranked No. 1 in the world.

So why are games good for testing? They are constrained, such as with a board and a set of rules. In a way, it’s a small universe.

The presumption is that a game could be the basis for understanding broader ideas of intelligence. Actually, companies like DeepMind have extended their models to real-life applications, such as in reducing energy usage and the detection of diseases.
  • #4 – Semi-Supervised Learning

This is a blend of supervised and unsupervised learning. This is where you have a small amount of labeled data and a much larger amount of unlabeled data. You then use algorithms to label the rest (known as pseudo-labeling).
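A minimal sketch of pseudo-labeling, using a hypothetical 1-D classifier invented for the illustration: fit a threshold on the small labeled set, then label only the unlabeled points the model is confident about:

```python
# A small labeled set and a larger pool of unlabeled points (both invented)
labeled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
unlabeled = [1.5, 5.1, 8.7]

# "Train" on the labeled data: threshold at the midpoint of the class means
low_mean = sum(x for x, y in labeled if y == "low") / 2
high_mean = sum(x for x, y in labeled if y == "high") / 2
threshold = (low_mean + high_mean) / 2  # -> 5.0

# Pseudo-label only points far from the decision boundary (high confidence)
pseudo = [(x, "high" if x > threshold else "low")
          for x in unlabeled if abs(x - threshold) > 1.0]

print(pseudo)  # 5.1 is too close to the boundary, so it is skipped
```

The confidently pseudo-labeled points can then be added to the training set, growing the labeled data without any additional human effort.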

A case study of this comes from Facebook. At its F8 developers conference in 2018, the company demonstrated semi-supervised learning by leveraging its massive Instagram hashtag data.10 Basically, the hashtags served as a form of labeling, such as a description of a photo. But then again, there were some hashtags that really were not very helpful, like #tbt (which stands for “throwback Thursday”). Despite this, Facebook’s researchers were able to use a database of 3.5 billion photos to train a model that reached an accuracy rate of 85.4%.

Looking to the future, the company believes that this innovative approach could be helpful in improving the rankings of the newsfeed, the detection of objectionable content, and the creation of captions for those who are visually impaired.

The Perils of the Moonshot

IBM’s Watson is perhaps the most well-known AI system. In 2011, it was able to beat two of the all-time Jeopardy! champions.

While this generated tons of buzz, IBM wanted to extend the capabilities of Watson, such as to the healthcare industry. In 2013, the company announced a strategic alliance with the University of Texas MD Anderson Cancer Center to leverage the power of AI to conquer diseases like leukemia.

In the press release for the deal, the general manager of IBM Watson Solutions, Manoj Saxena, said: “IBM Watson represents a new era of computing, in which data no longer needs to be a challenge, but rather, a catalyst to more efficiently deploy new advances into patient care. By helping researchers and physicians understand the meaning behind each other’s data, we can empower researchers with evidence to advance novel discoveries, while helping enable physicians to make the best treatment choices or place patients in the right clinical trials.”11

While this was certainly a bold endeavor, the results came up very short. Part of the problem was the paucity of data with rare cancers. There were also issues with the updating of the system. The bottom line: Watson would sometimes produce inaccurate information (this was according to a report in the Wall Street Journal).12

The University of Texas would spend about $62 million on the program but would abandon it in 2017.13

This is a cautionary tale of the very real limits of AI. The fact is that the technology remains fairly narrow in terms of the use cases – and far from intelligent. Yet this is not to say that AI is not worth the effort. It definitely is. But the key is to be realistic about its capabilities.

Bias

Bias in data is when the dataset does not accurately reflect the population. This could easily lead to unfair outcomes, such as discrimination based on gender, race, or sexual orientation.

Bias is actually often called “the silent killer of data.” And why so? The reason is that bias is often unintentional. Data scientists do not set out to create faulty models. However, their work will certainly reflect some of their backgrounds. So if the researchers are mostly white males who come from the middle class or wealthy families, then some of their tendencies will show up in their work.

Because of this, more companies are taking action to deal with bias. “For our AI teams, we make sure we have a diverse group in terms of geographies, race, and gender,” said Seth Siegel, partner of artificial intelligence and automation at Infosys Consulting. “We believe that this makes our projects much more robust and effective.”14

Or look at Disney. The company has formed an alliance with the Geena Davis Institute on Gender in Media. The institute has developed an AI tool called GD-IQ: Spellcheck for Bias, which is based on research from the University of Southern California Viterbi School of Engineering. The app analyzes content and detects instances where there are, say, too many lead male roles or a lack of diversity.15

Such efforts are encouraging. Having a tool can also help systemize the process and also provide some objectivity.

Yes, the topic of bias can be very divisive and controversial. After all, few software developers take courses on this topic while they are getting their engineering degrees!

This is why it’s important for companies to take a leadership role. Some will even set up ethics boards. But even if this is not practical, it is still a good idea to set forth some core principles, provide training, and have periodic discussions, especially when developing new models.

Conclusion

Again, RPA does not necessarily need sophisticated data approaches, at least for the basic functions. But most companies want to go beyond this. RPA can be a pathway to AI and digital transformation. So it is essential to put a data strategy together.

As for the next chapter, we will look at the landscape for the RPA vendors.

Key Takeaways

  • Having a data strategy is not a prerequisite for RPA (at least for when using the traditional capabilities). But if you want to move into AI or process mining, then you will need to get serious about data. Interestingly enough, RPA is often referred to as the “gateway drug” for AI.

  • Besides structured and unstructured data, there are other flavors like metadata (data about data), semi-structured data (this is a mash-up of structured and unstructured data, which often includes internal tags for categorization), and time-series data.

  • Big Data emerged in the early 2000s, as new technologies like cloud computing started to take off. The definition can be fuzzy. But Big Data is often explained by using the 3 V’s: volume (there needs to be large amounts of data, such as tens of terabytes), variety (there is diversity, say, with structured, unstructured, and semi-structured data), and velocity (this is the speed of the data).

  • Big Data projects can be extremely complex and failure is fairly common. This is why there needs to be much planning and a focus on the business case.

  • One approach for planning is to use the CRISP-DM Process, which has six steps: business understanding (what is the problem to be solved?), data understanding (this involves the selection of data sources, such as in-house, open source, or third-party), data preparation (this is where you take steps to clean up the data), modelling (here you determine the right algorithms), evaluation (you will use test data to check the model), and deployment.

  • Training data is the dataset you use to create a model. You will then evaluate it by using a separate dataset, which is called test data.

  • Some of the general types of algorithms include supervised learning (where the data is labeled), unsupervised learning (sophisticated techniques to find patterns in unlabeled data), reinforcement learning (this involves a reward–punishment system), and semi-supervised learning (this is where there is some labeled data).
