© Ronald Ashri 2020
R. Ashri, The AI-Powered Workplace, https://doi.org/10.1007/978-1-4842-5476-9_5

5. Core AI Capabilities

Ronald Ashri
(1) Ragusa, Italy

In the previous chapter we saw that there are lots of different techniques we can use and combine to model aspects of intelligent behavior. On their own, however, they will not get us far. These techniques only have value in as much as they allow us to do something specific and clearly identifiable: transcribing speech to text, classifying a document, or recognizing objects in an image. To achieve these tasks, we typically need to combine techniques into capabilities.

AI capabilities represent a concrete thing we can do to better understand the world and effect change within it. They are analogous to the human senses. Humans can see, hear, smell, touch, and taste. Each one of these senses involves a number of subsystems (techniques) that combine to provide the final result.

Take the ability to see, as an example. Light passes through the cornea and lens of our eyes to form an image on the photoreceptors. From there, the signal travels along the optic nerve to the brain, reaching the primary visual cortex. The information is processed further there and eventually mapped to specific concepts. We employ different techniques for collecting light, transforming it, and processing the results in support of a single capability: sight.

In this chapter we will focus on three broad classes of capabilities. They represent the types we encounter most frequently in a work environment and are also the most likely to provide immediate benefits:
  • The ability to understand and manipulate language (both voice and text) and generate language

  • The ability to manipulate images, classify them, and identify specific objects in images

  • The ability to combine organizational-specific knowledge and data to create organizational-specific capabilities—our very own superpowers that can be incredibly hard for others to replicate

The aim of the chapter is to give you a high-level understanding of how these capabilities work and examples of their application, so as to demystify the processes and allow you to more clearly consider how you could exploit them in your own work environment.

Language

Language is a critical capability that organizations should be looking to exploit as much as possible. As knowledge workers, our currency, in many ways, is words. Whatever the end result of the activity of any office, the way we collaborate with colleagues and share ideas is through language.

Language has some fascinating idiosyncrasies and calls from the outset for a rich and interdisciplinary approach. It would be impossible to cover all the challenges here, but I think it is useful to consider a few so as to better comprehend the scale of the task and realize what an incredible amount of progress has taken place.

To start with, there are obviously multiple languages to deal with. Luckily, different languages share several characteristics, which means that techniques developed to handle one language can often be applied to others, with the main caveat being the availability of sufficiently large datasets in the language we are looking to analyze.1 Language, however, is not static. The English spoken in the UK today is very different from that of past centuries, and the English spoken in the United States or Australia is sufficiently different from that of the UK that different language models and datasets may be required. Language also morphs as it moves from one domain to another. If two experts in civil engineering listen in on the conversation of two experts in aerospace engineering, they may understand most of the individual words, but the overall meaning will be lost to them. Words take on new meanings, acronyms are introduced, and quite often, especially in spoken language, slang is used that only makes sense in very specific contexts and time periods. I am sure that if I asked my dad to “Slack me” he would have a very puzzled look, but if I said, “Skype me” he would understand and likely reply with “Why don’t we just use FaceTime, shall we?”

Then there is the issue of understanding what we say when we speak and transcribing that to text. Our accent, the acoustics of the space, whether we have a cold, background noise, and other people talking at the same time all influence what sounds reach the machine, which then needs to isolate the specific data it cares about and transform it into words. Once more, it is not just about a faithful transcription of the sounds into words. We structure things differently when we speak. We add “ums” and “ahs” and stop and start in strange ways that somehow all make sense to us but are not the same way we write.

As you can see, the challenges are considerable, and it is amazing that we now have readily available AI tools that allow us to recognize speech, transcribe that to text, understand its meaning, and even generate language. We haven’t solved all the problems, but we’ve solved enough of them to make these tools viable for use in the development of AI-powered applications.

We briefly consider the implications of all this in the next section across speech recognition, natural language processing (NLP), translation, and natural language generation.

Speech Recognition

Speech recognition deals with our ability to transform the sounds that we produce when we speak to text. It is often also referred to as ASR, which stands for automatic speech recognition. Quite easily an entire field of study on its own, it combines a breathtaking set of technologies.

An ASR system starts by picking up the sound of our voice through a microphone. That signal gets cleaned and processed in the hope of isolating only those frequencies that represent a human voice. The analogue, continuous sound waves are then sampled and translated into what are referred to as speech frames (a couple of dozen milliseconds of sampled waveform information). Speech frames are then used to work out which phonemes the speaker has uttered. Phonemes are the basic units of sound that combine to form words and distinguish one word from another.2 Linguists define the specific phonemes of each language and how they combine into words; that knowledge is then used by ASR systems. This information is then further combined with a pronunciation model and a language model, nowadays largely based on deep learning, to produce the final text.
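To make this more concrete, here is a minimal sketch of what using an off-the-shelf ASR engine looks like from code. It uses the open source SpeechRecognition library for Python purely as an illustration; the file name is a placeholder, and in practice you would choose the engine and configuration that suits your environment.

```python
# A minimal sketch of transcribing an audio file with an off-the-shelf
# ASR engine, using the Python SpeechRecognition library as an example.
# "meeting_clip.wav" is an illustrative placeholder file name.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("meeting_clip.wav") as source:
    # Sample a little ambient noise so the recognizer can adjust to it
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # Send the audio to Google's web speech API for transcription
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("The audio could not be understood.")
except sr.RequestError as err:
    print("The recognition service was unreachable:", err)
```

All of the hard stages described above (signal cleanup, framing, phoneme and language modeling) happen behind that single recognize_google call, which is precisely why the capability has become so accessible.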

Speech recognition systems, especially after the huge improvements introduced by better neural network algorithms, are impressively accurate (all major technology companies report human-level or better accuracy, with error rates close to or below 5%). That does not mean, however, that we can assume they will be able to tackle any situation with ease. The specific context needs to be taken into account, and a realistic investigation into the viability of using speech recognition to solve a given problem needs to happen. You have probably already noticed how voice assistants are not that effective in crowded rooms with lots of other people speaking, whereas they perform much more reliably in a car where outside sounds are cut out.

The domain of discourse is also very important. Here is a very simple experiment you can run on your own to understand how it can affect speech recognition. Call up whatever voice assistant you have on your smartphone, be it Siri, Cortana, or the Google Assistant. First try telling it something that might be said in your work setting, using domain-specific terminology, and then try an everyday phrase about dealing with more general life tasks. Look at the transcription of the text to see how accurately each one was captured.

I used the following work-related sentence:

The high-level objective for Task 1 is to produce a chatbot that is able to assist a user to search through multiple document repositories that are accessed through a federated search service.

This is a relatively friendly test. There are some domain-specific keywords, but they are not too arcane. I am sure you, the reader, will have no difficulty with the individual words, although you may have some questions about the overall meaning; for example, what exactly is a federated search service?

Google Assistant came back with:

The high-level objectives for task wants to produce a chat but the table to sister user to search through multiple document repositories access with federated search service.

Siri gave me:

The high-level objective for task one is to produce a chalkboard that is able to sister user to search through multiple document repository other access through federated search service.

Those are admirable efforts, but not very usable.

However, if I try the following sentence:

Remind me to drop off the kids at school then go collect groceries, pass by the pharmacy, and then meet Julia for late breakfast.

Siri gets it word for word correct and so does Google Assistant. I didn’t even have to enunciate too carefully, something that I did do in the previous example.

Clearly, they work well for exactly what they were designed for: helping us handle everyday life, rather than transcribing domain-specific information. It is no surprise that one of the leading transcription software companies, Nuance, provides different software solutions for different industries such as legal, professional, and law enforcement. Each solution advertises the fact that it has been trained for that industry’s specific vocabulary, precisely because that is a necessary precondition for effective operation in that industry.

In summary, although speech recognition has come a long way, it is important to keep a realistic view of where it can currently help, especially in an office setting. It can be extremely effective, and less onerous to train, if we want to use voice to issue straightforward commands or directions to a machine. In these cases, we are only uttering short phrases with a specific intent, such as “Open Microsoft Word” or “Call my HR contact.” It becomes more challenging if we are trying to use it to transcribe complex phrases with domain-specific (and especially acronym-heavy) content.

Natural Language Processing

With speech recognition we go from sound to text. Once we do have text, though, how do we understand what we can do with it? This is where NLP comes into play. Let’s look at some of the key stages, both to understand what is possible and to inspire ideas of how you can use it in your own work environment.

Analysis and Entity Extraction

The first stage is, typically, the syntactic analysis of the text we want to understand and something called entity extraction. Consider just a simple phrase such as:

“This is a book on the use of artificial intelligence in the office. It’s published by Apress, part of Springer Nature.”

To start with, we need to break up the text into its individual components, understand what constitutes punctuation and what does not, and work out how that affects the sentence structure.

Using Google’s NLP demo3 we get an analysis such as the one in Figure 5-1.
Figure 5-1. Syntax analysis of a phrase using Google’s NLP API

You can see there is quite a bit going on. The NLP system has been able to successfully identify all the different words, including where we’ve used apostrophes such as “it’s.” It is also identifying nouns, verbs, punctuation, adjectives, and more.

Entity extraction is able to tell us that book, Apress, Springer Nature, and artificial intelligence are all salient entities in this piece of text and for some, such as “Springer Nature” and “Apress,” it is able to say that they are organizations and provide links to their websites.
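You do not need Google’s API to experiment with this. The sketch below runs the same kind of syntactic analysis and entity extraction on the example sentence using the open source spaCy library; the choice of the small English model is an assumption for the example, and a commercial NLP service will typically return richer metadata.

```python
# A small sketch of syntactic analysis and entity extraction with spaCy.
# Assumes spaCy is installed and the small English model has been
# downloaded with: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a book on the use of artificial intelligence in the "
          "office. It's published by Apress, part of Springer Nature.")

# Part-of-speech tags and dependency labels for each token
for token in doc:
    print(f"{token.text:12} {token.pos_:6} {token.dep_}")

# Named entities the model managed to identify
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. "Apress -> ORG"
```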

With just this information we can start thinking of a search powered by NLP that can be so much more effective than a “normal” search that only compares strings without any contextual information—a search that will, for example, be able to distinguish between a mention of a specific organization, such as Apple, and a mention of the fruit apple. Imagine being able to search through your document store and then filter against mentions of a specific company, product, or coworker, just the same way you filter against different brands on Amazon.com, without having had to painstakingly annotate those documents up front. The NLP system can do the heavy lifting for us.

Classification

What is the document about? Is it a sales report, meeting notes, or a pitch to win a new contract? Are the sentiments expressed within a document positive, negative, or neutral? Is the content sensitive or potentially offensive?

Classification, and in particular classification that is relevant to your specific needs, is one of the most frequent applications of NLP.

NLP tools have become particularly adept at this, and the good news is that it is already possible to train your own organization-specific classifiers with minimal specialized expertise. This is possible because you can base your classifier on existing language models (which have been trained on much larger datasets) and specialize them, either with rule-based classification or with data-driven approaches that work as a layer on top of the existing models.
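As an illustration of the data-driven route, the sketch below trains a tiny organization-specific classifier with scikit-learn. The example documents and labels are invented for the purpose of the illustration; a real classifier would be trained on hundreds or thousands of your own documents, ideally on top of a pretrained language model rather than simple word statistics.

```python
# A minimal sketch of an organization-specific document classifier
# using scikit-learn. The documents and labels are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "Q3 revenue grew 12% driven by strong enterprise sales",
    "Meeting notes: attendees Anna and Luis; action item: update the roadmap",
    "We propose a two-phase delivery plan to win the new contract",
    "Monthly sales figures by region are attached below",
]
labels = ["sales report", "meeting notes", "pitch", "sales report"]

# TF-IDF features feeding a simple linear classifier
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(documents, labels)

# Classify a new, unseen document
print(classifier.predict(["Notes from Tuesday's planning meeting"]))
```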

Intent Extraction

When we are using language to communicate, especially when we are asking someone to do something for us, our words can be mapped to a specific intent. For example, if I say:

“Could you please open the window?”

The intent is quite clear. I am asking someone to open the window for me.

However, I could also say:

“It’s hot; can you let some air come through?”

Although I didn’t explicitly say “open the window,” the intent is the same.

The job of intent extraction is to help us understand what action is conveyed in the words. As you can imagine, it is particularly important in conversational engines that power chatbots. They need to be able to map all the myriad ways we, as humans, can say something to a specific response or action. In addition, they need to do that while taking contextual information into consideration. Consider the following dialog.
  • Human: “I’d like two pizzas, a Coke, and some garlic bread.”

  • Bot: “Thanks, what type of pizzas would you like?”

  • Human: “A pepperoni pizza and a margherita. Oh, make that Coke a Sprite.”

What to us is a very simple dialog is quite a challenge for a bot. It asked the user for the types of pizzas, but it also received some information about a change to the drinks order. It needs to understand that the phrase the user uttered carried two intents, that the second intent referred back to the previous phrase about drinks, and that it is about changing the existing Coke to a Sprite!

Nowadays there is a wide range of tooling to help organizations develop applications that can handle such conversations, and problems like the preceding one can be solved in well-defined domains. The key is to clearly weigh where intent extraction and conversations would be most effective. It is a balancing act between the complexity of the NLP problem to be solved and the value the solution is going to generate.
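To give a feel for what intent extraction looks like in code, here is a deliberately simplified, rule-based sketch that maps a few phrasings onto intents. The keyword patterns and intent names are invented for the example; production conversational engines rely on trained language models rather than hand-written rules, and they also track context across turns.

```python
# A deliberately simple, rule-based sketch of intent extraction.
# Real conversational engines use trained models, not keyword rules.
import re

INTENT_PATTERNS = {
    "open_window": [r"open the window", r"let some air"],
    "order_pizza": [r"\bpizza", r"\bmargherita\b", r"\bpepperoni\b"],
    "change_drink": [r"make that \w+ a \w+", r"change .* (coke|sprite)"],
}

def extract_intents(utterance):
    """Return every intent whose patterns match the utterance."""
    text = utterance.lower()
    return [
        intent
        for intent, patterns in INTENT_PATTERNS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    ]

print(extract_intents("It's hot; can you let some air come through?"))
# ['open_window']
print(extract_intents("A pepperoni pizza and a margherita. Oh, make that Coke a Sprite."))
# ['order_pizza', 'change_drink']
```

Even this toy example shows the second utterance carrying two intents at once. Working out that “that Coke” refers back to the drink in the previous turn is the genuinely hard part, and it is where trained dialog models earn their keep.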

Translation

We’ve probably all seen AI-powered translation at work. It is what makes possible those links on Twitter, LinkedIn, and Facebook that say “See Translation.” It’s what powers the translation feature of Google Chrome that translates an entire web page.

According to Google Research,4 automated translation systems, under some circumstances, are approaching or even surpassing the translation quality that you would expect from human translators. It is important to take such claims with a healthy pinch of salt though. Those “circumstances” are important. If we are dealing with single words, short phrases, or web pages with small sections and not too complex concepts, automated translation can do an impressively effective job. The more sophisticated the concepts and the more layered the text, however, the less effective the translation.

A recent contest in South Korea pitting automated systems against professionals translating text from Korean to English and vice versa concluded that about 90% of the automatically translated text was “grammatically awkward” and not comparable with what a skilled translator would produce.5

As such, the same limitations that we have discussed so far apply here. Generic automated translation capabilities are impressive, but the more specific the domain, the less effective the translation model will be. If we are dealing with single words, simple commands, or short texts, automated translation offers a viable avenue. For more complex scenarios, organizations need to evaluate the tools available and consider whether to invest in their own tooling if commercially available translators are not enough.
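If you want to experiment with the generic capability, the major cloud providers expose translation as a simple API call. The sketch below uses Google’s Cloud Translation client as one example; it assumes the google-cloud-translate package is installed and that valid credentials are configured, and the sentence and target language are just placeholders.

```python
# A minimal sketch of calling a cloud translation API, using Google's
# Cloud Translation (v2) client as an example. Assumes the
# google-cloud-translate package is installed and credentials are set,
# e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable.
from google.cloud import translate_v2 as translate

client = translate.Client()

result = client.translate(
    "The meeting has been moved to Thursday afternoon.",
    target_language="it",
)

print(result["translatedText"])          # the translated sentence
print(result["detectedSourceLanguage"])  # e.g. "en"
```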

Natural Language Generation

The mirror image of natural language processing is the automated generation of new text.

We are far more likely to digest information and understand its implications if it is set in an appropriate narrative for us. We have all gone through that feeling of blanking out when presented with walls and walls of tabular data. Even with more pleasing graphs and charts, after a while it can feel like one blurs into the other. What we care about is the story that those tables and charts tell. Natural language generation (NLG) allows us to input structured, labeled data and get as a result a natural language document that provides an appropriate narrative around that data.

A particular strength of NLG is that it can produce multiple narratives from a single set of data, adapted or personalized to a specific situation. Take, for example, financial data. Analysts need to provide reports for all their different clients following a particular company’s results announcement or the release of data about a specific sector. The inputs, in this example, would be something like the annual company report and the portfolio situation of a specific client. An NLG system can then produce a narrative that describes what happened and how it affects that specific portfolio.

There are several levels of analysis that the NLG system performs to get to a final structured document. It needs to determine the relevant input data points that should be mentioned in the generated document. For example, did the company make a profit or a loss? What are the biggest expenditures? Where did sales mostly come from? The NLG system then manipulates what can be imagined as a very complex template that provides rules around the structure of the overall document and the structure of individual phrases. The end result is a document that is not only grammatically correct but structured in a way that is comfortable and natural for us to read.
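At its simplest, this kind of data-to-text generation can be imagined as rules that select facts and phrasings based on the data. The sketch below is a hand-written toy version of that idea with invented figures and field names; commercial NLG platforms work with far richer document plans, grammars, and data bindings.

```python
# A toy sketch of template-based natural language generation:
# structured financial data in, a short narrative out.
# The figures and field names are invented for illustration.
def describe_results(report):
    profit = report["revenue"] - report["costs"]
    direction = "a profit" if profit >= 0 else "a loss"
    top_region = max(report["sales_by_region"], key=report["sales_by_region"].get)
    return (
        f"{report['company']} reported {direction} of "
        f"${abs(profit):,.0f} for {report['period']}. "
        f"Sales were strongest in {top_region}, while the largest "
        f"expenditure was {report['biggest_cost']}."
    )

data = {
    "company": "Acme Ltd",
    "period": "FY2019",
    "revenue": 12_400_000,
    "costs": 11_100_000,
    "sales_by_region": {"Europe": 6_200_000, "North America": 4_100_000},
    "biggest_cost": "staffing",
}

print(describe_results(data))
# Acme Ltd reported a profit of $1,300,000 for FY2019. Sales were
# strongest in Europe, while the largest expenditure was staffing.
```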

Another example is weather reporting. From a single weather data set, a news organization can produce localized weather reports for all its affiliates without requiring a writer to go through the data and come up with appropriate narratives.

Within organizations, NLG is increasingly being used to provide a narrative around the company’s reported results in a way that is more efficient and impactful than charts and purely numerical reports. This can be particularly empowering for users who do not have the skills to do the necessary data analysis on their own.

Vision

Vision refers to a machine’s ability to process visual data and interpret it appropriately. It can range from something as “simple” as scanning a bar code to identifying objects within a photograph.

The advancement in the interpretation of images, as we discussed in Chapter 3, is what opened the floodgates for more general applications of AI. Unlocking the ability to correctly interpret an image enables so many applications, from autonomous driving to the ability to better monitor and manage the growth of crops across large areas.

The toolset for training data-driven models (the overwhelming majority being deep learning models) is potentially the most evolved across all AI capabilities. This is the result of the incredible amount of work that has gone into machine vision6 coupled with the suitability of deep learning architectures for handling raw image data.

There are powerful tools to label or annotate images that can then be used to train models and, unsurprisingly, there are several possibilities within a work environment.
  • Authentication and authorization: Face recognition can be used to identify people and provide access to office spaces. It is not without its challenges, though. It can offer a more seamless experience and, under the right conditions, a more reliable security environment. However, it comes with risks, as companies will need to store biometric data.7

  • Fraud detection: In industries such as hospitality and retail, machine vision can be used to detect when items are not properly processed at point-of-sale systems, for example by monitoring employees or customers as they pass objects over barcode readers.8

  • Asset monitoring and management: The analysis of images of physical assets can reveal where faults are close to occurring and optimize the maintenance of workspaces.

  • Digitization and categorization of analog documents: We are a long way away from becoming entirely digital, and we have a swath of historical documentation that we still need to deal with. Machine vision can be applied both to categorize documents (e.g., identify receipts, sales reports, pay slips) and also digitize them so that the information within them is immediately accessible.9

Just as with NLP, there are powerful tools readily available to test out ideas with machine vision. A great example I’ve encountered, which shows how accessible the tooling has become, is that of an intern who built a fully digitized system for monitoring parking availability for the entire workforce over a single summer. They used the data coming from security cameras to figure out which parking spaces were available based on the movement of cars, exposed that information on the company intranet, and set up parking boards for everyone to see. This improved everyone’s experience of coming to work and required, all told, minimal effort.
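To illustrate just how low that barrier to a first experiment has become, the sketch below classifies a single image with a pretrained network from the Keras library. The file name is a placeholder, and an off-the-shelf ImageNet model like this one would of course need fine-tuning on your own labeled images (parking spaces, receipts, pay slips) before it could power anything like the examples above.

```python
# A minimal sketch of off-the-shelf image classification using a
# pretrained Keras model. Assumes TensorFlow is installed and that
# "office_photo.jpg" (a placeholder name) is a local image file.
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNetV2(weights="imagenet")  # pretrained on ImageNet

img = image.load_img("office_photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Print the top three predicted labels with their confidence scores
for _, label, score in decode_predictions(model.predict(x), top=3)[0]:
    print(f"{label}: {score:.2f}")
```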

As with everything else, care needs to be taken to ensure that the capability you think you have developed can translate to wider deployment. Machine vision is notorious for providing false positives or completely missing the target. When dealing with life and death situations, such as autonomous driving, that is simply not acceptable. However, when the goal is to make an existing process more efficient (such as letting people know whether there is a parking space), some errors can be tolerated. Similarly, if we are using images to detect people in pictures in order to better classify a catalogue of media images, we are saving ourselves time and don’t mind some errors. If we are using face detection technologies coupled with emotion recognition to determine the emotional state of a workforce,10 we are definitely overstepping what the technology can usefully achieve and risk alienating users.

Custom Capabilities

Language and Vision are generic capabilities with wide applicability across different domains. Exploiting them appropriately across your organization can give you a significant advantage. There is space to innovate in how you use them and where you apply them, but it will become increasingly hard to compete with others on building better NLP or vision systems. The effort required will likely not justify the potential benefits for most companies. Ultimately, we can expect powerful NLP and vision capabilities to become the minimum standard necessary, rather than a competitive differentiator.

Instead, an area where there is a possibility for your organization to differentiate and create more of a moat around your competitive advantage is in creating your own “custom” capabilities. These are ways to represent, reason about, and act in the world that are specific to your organization, because they are the result of models that you have devised and data that only you own. I like to think of these as your organizational superpowers. Just like superheroes’ powers, they are the things that set you apart from the other superheroes. Some heroes can see better, while some can jump higher or pack a mightier punch. Those are their super capabilities. The question is: what is your organization’s superpower when it comes to AI capabilities?

To develop a custom capability, you need to create the right circumstances. Just like the Hulk, Iron Man, or Spider-Man, you need to walk into a lab and mess around with the different ingredients to see what can come out of them.

For example, there may be something specific in the way you collect customer data that allows you to model and reason about the behavior of your clients in a way that others simply can’t. You may have developed a culture and put in place a process that means your team provides structured feedback in a consistent manner. This enables you to get a better overall understanding of team well-being and what needs to change, leading to a happier and better performing workforce.

Perhaps, just like Superman coming from a different planet, you are entering a new market and can bring a perspective to it and new capabilities, in terms of how a process can be automated, that incumbents have simply not considered or are too comfortable to care about. The way fintech start-ups are disrupting traditional banks is a good example of this. They don’t carry any baggage and are approaching the problem from a technology-first perspective in a way that the incumbents find hard to achieve.

The crucial element is to recognize that what you are looking to develop is a capability: a way to understand and reason about a specific aspect of the world. Starting from there you can then explore the techniques available and start combining them to get to a specific solution.

From Capabilities to Applications

AI capabilities are the ways in which you can understand and manipulate your environment. Core capabilities such as Language and Vision offer organizations a wide array of opportunities. There are easy-to-access tools, making the barrier to entry low. The challenge lies in identifying the most fruitful ways of using them, cognizant of their limitations. Ultimately, these core capabilities will become part of everyone’s toolkit. What is important is to grow the skills and experience to use them effectively now, in order to gain some first-mover advantage.

In addition, you can start thinking of what custom capabilities you can develop. These organizational “superpowers” can be exclusively yours because they depend solely on how you exploit the innovation capability of your people and your understanding of the world (your knowledge and your data). The more mature AI-powered applications become, the more important these custom capabilities become, as they are the ones that will provide true differentiation.
