Chapter 5. AI for natural language

This chapter covers

  • Understanding the main challenges of natural language processing
  • Measuring opinions from text with sentiment analysis
  • Searching for textual content with natural queries
  • Building conversational interfaces (chatbots)
  • Case study: ML-assisted language translation

Making computers understand language has been a pipe dream of computer scientists since the 1950s. This is probably because most human knowledge and culture is encoded in written words, and the use of language is one of the most powerful abilities that sets us apart from animals. The umbrella term for these techniques, which includes both understanding and producing language, is natural language processing (NLP).

Unfortunately, giving machines the ability to understand and produce language is as hard as it is desirable: the technology to sustain such a seamless computer-human interaction simply isn’t here yet. This didn’t stop companies, agencies, and media from overselling the potential of end-to-end solutions. Chatbots are a prime example of this phenomenon, hyped as autonomous agents capable of messaging with customers, seamlessly augmenting or replacing human workers for complex matters like sales or after-market service. Results were underwhelming: in 2017, the Silicon Valley blog The Information published its research about the performance of Facebook’s Messenger personal assistant, M. The blog reported that M was failing to handle 70% of user requests. As you can imagine, broken promises yield nothing but wasted money, wasted time, and frustration.

If this introduction doesn’t seem encouraging, there are two pieces of good news. The first is that technology is evolving rapidly, and the horizons of possibility are expanding accordingly. The second is that you’re about to learn how to find and develop the hidden projects that can benefit from simpler, more mature NLP tools. These opportunities are too often undervalued in the midst of the hype bubble around NLP, but they can be a powerful success story for your AI efforts.

This chapter has two broad goals:

  • Give you the knowledge to judge whether an idea is feasible with today’s technology (and your resources)
  • Teach you how to plan a natural language effort with an eye to maximizing return on investment. This will enable you to be the one lifting the fog and finding hidden opportunities in your organization.

We’ll start by explaining why language is really hard for computers, and how machine learning, once again, came to the rescue over the past few years, allowing us to build amazing products and solutions.

5.1 The allure of natural language understanding

Let’s face it. The way we consume information on the internet has been stuck in an unnatural compromise for a while: we’re still clicking buttons and selecting drop-downs, while in the real world we simply use plain language. Natural language processing is the technology that can remove this barrier, letting people use the digital world in the same way they experience the physical one.

Let’s try to imagine how NLP could change the game in the by-now-familiar world of a real estate brokerage website. We’ll put ourselves in the shoes of one of its users: a wealthy startup investor from San Francisco who wants to purchase a home. She wants a big villa with at least four rooms, close to Highway 101, with a mandatory ocean view.

Right now, when she visits the website, she’s faced with the classical internet portal view we all are used to: a menu, some boxes for prices, location, bedrooms, and a bunch of information for each listing (images, description, and so forth).

Playing with all these buttons, selectors, and drop-downs can feel like being in the cockpit of a jet when all you wanted was a simple answer to this question:

What are the villas in San Francisco that are close to the highway, have an amazing view of the ocean, and at least four bedrooms? I want to visit the best ones this weekend.

Every flesh-and-blood broker can answer this question, but we can’t have a dedicated human for every person searching for homes on the internet. But what if we could, thanks to AI? This is the promise of machine language understanding: a superior experience for everyone interacting with a computer, increasing their satisfaction and efficiency, at scale.

You’re already aware that completely matching human capabilities in using language is out of reach for today’s technology. The way out of this challenge is designing solutions that reduce the complexity to fit the technology available. The first step is understanding and measuring complexity. Let’s get started.

5.2 Breaking down NLP: Measuring complexity

Language is one of the most important abilities that distinguishes humans from animals. While the family dog can use sounds to communicate, it’s limited to a small set of moods (for example, pain, sadness, and anger). On the other hand, humans can use words to communicate (almost) anything: from expressing our feelings to the love of our life, to ordering a double cheeseburger.

Obviously, a whole spectrum of complexity ranges from the simple capabilities of a dog to the nuances that a poet can express with language. In technology, complexity is the enemy: the more complex a task, the more chances we have to fail and ship underwhelming products, especially with immature technologies like NLP. How do we measure complexity in the realm of NLP tasks? Based on our experience, there are two key qualitative metrics:

  • Width --The diversity of topics or language domains required to perform a task. How large is the vocabulary that our algorithm has to master?
  • Depth --How sophisticated is the result of our algorithm? How many things does it have to understand or do?

Table 5.1 lists examples of NLP tasks and indicates their width and depth.

Table 5.1 NLP tasks and an evaluation of their width and depth

Task: Understand whether an Amazon product review is positive or negative.
Width: High --The review can be about a wide range of products and attributes.
Depth: Low --We have just two possible outputs (positive or negative sentiments).

Task: Help travelers book and manage their plane tickets.
Width: Low --The text we’ll be dealing with is relative to a restricted subset of topics (flights, bookings).
Depth: High --The agent must understand a vast range of options, from checking prices to changing tickets, addressing complaints, providing airport assistance, and so on.

Task: Help a manager plan their time and appointments (a secretary bot).
Width: High --A secretary needs to understand queries about many domains--scheduling, appointments, invoicing, travel, and so on.
Depth: High --In each of these topics, a secretary bot should have a solid understanding and perform complex actions, such as setting up a meeting at a comfortable time while making arrangements for transportation.

Notice that because both width and depth contribute to the complexity of a task, we can say that the overall complexity of the task is proportional to their product:

Complexity = Width × Depth

If we draw a graph with these two metrics on the axes, we get an effective way of visualizing the complexity of NLP tasks. Tasks that form a rectangle with a large area are more complex than tasks with a smaller area, as shown in figure 5.1.


Figure 5.1 Width and depth of various NLP tasks that a real estate website could deliver through a chatbot
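
To make this concrete, here is a toy sketch. The ordinal scores are an invented simplification (1 for low, 3 for high; width and depth are really qualitative judgments, not precise numbers), and the task names come from table 5.1:

```python
# Hypothetical ordinal scores: 1 = low, 3 = high. These numbers are
# illustrative only; width and depth are qualitative assessments.
tasks = {
    "Amazon review sentiment": {"width": 3, "depth": 1},
    "Flight booking assistant": {"width": 1, "depth": 3},
    "Secretary bot": {"width": 3, "depth": 3},
}

# Complexity = Width x Depth: rank tasks by the area of their rectangle.
for name, t in sorted(tasks.items(), key=lambda kv: kv[1]["width"] * kv[1]["depth"]):
    print(f"{name}: complexity = {t['width'] * t['depth']}")
```

The secretary bot, with the largest area, comes out as the most complex of the three, matching the intuition in figure 5.1.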

Let’s go back to the example of a chatbot for our real-estate website. A simple action that this chatbot could perform is understanding basic user queries, basically replacing the conventional search interface. This is an example of a query:

Show me homes with more than four bedrooms in downtown San Francisco that cost less than $2 million.

In this case, the chatbot needs to perform a single action--querying a database--related to a single domain: house properties. We could ask more from our chatbot, and extend its functionality to extract meta information out of this database or answer more-complex queries, for instance:

What is the average price for a home in the Presidio neighborhood, with an ocean view, four bedrooms, and that’s less than 10 minutes away from Highway 101?

This wouldn’t increase the width, because the language domain is still limited to houses, but depth is way higher, as the number of possible actions that the bot must be able to handle has increased. The last step could be to manage the booking of home tours as well. This would increase the depth again as the number of actions increases. Now, width increases too, as the bot needs to understand new language expressions relative to bookings.

Keep in mind that two very different tasks can have similar complexity. Let’s switch gears for a second and look at two typical examples of NLP that you’re likely already familiar with:

  • Sentiment analysis on tweets --For this task, the model reads a short text and must decide whether it expresses a positive or negative sentiment. In other words, sentiment analysis is about finding out what the author thinks about the main subject of the text. Examples: “I loved Spielberg’s new movie” is positive. “The camera of this new phone is awful” is negative.
  • Classification of customer support tickets --For this task, the model reads complaints from customers and assigns them to one of several classes of potential causes. Examples: “The password is wrong” is an authentication issue, and “The site doesn’t load” is a connectivity problem.

The first task has high width, as it can deal with a lot of subjects (people use Twitter to talk about everything), but low depth as it has only two possible outcomes: a positive or negative review flag.

The second task, on the other hand, has low width because it needs to deal with tickets that will all be somehow related to the same domain (IT stuff not working), but it has higher depth because it needs to put the tickets into several buckets. On our width/depth graph, the two tasks would look something like figure 5.2.

Figure 5.2 Two tasks on the width/depth diagram: categorizing customer support tickets (high depth, low width), and sentiment analysis from tweets (high width, low depth)

How do we translate these insights into actionable information? Looking at an NLP task in terms of width and depth helps assess its feasibility. In general, you want to keep both as low as possible by starting to design your NLP application from the lowest width/depth combination. You’ll be building on top of it according to your needs, and depending on the performance you can achieve. In figure 5.3, you can see a representation of the feasibility of an application depending on its depth and width.

Figure 5.3 NLP applications based on their width and depth. The closer an application is to the top-right corner, the less likely it is to be feasible with today’s technology.

In the next part of this chapter, we’ll use this mental model as our guide on the path to add NLP capabilities to our home listings site.

5.3 Adding NLP capabilities to your organization

Let’s bring back our real estate brokerage site from where we left off in the previous chapter. Our DL-based picture classifier has greatly improved the user experience, and the number of monthly visitors is growing fast. Customers visit our website, quickly find the details they’re interested in through the images of each listing, and move along the process by asking for more information about their favorite home. Right now, someone who’s interested in a specific listing has two options to find out more:

  1. Spend a long time reading through the description left by the homeowner and all the reviews by other people who visited it (let’s just assume that our real estate platform has reviews on houses).
  2. Call customer service, which will connect the customer to the broker who is handling that home.

The first option is not a great user experience, and the second is really expensive for your business. We already pictured a human-like computer program (often referred to as a bot, chatbot, or agent) that can completely take over the role of a house broker, but we also understood that such an application of NLP is unlikely to work with today’s technology. We can call this advanced application the brokerbot: it has both high width and high depth.

Let’s take a step back and look at our business objectives. The business goal is to allow customers to quickly find the information they need about each listing. If you look at it this way, the brokerbot is a bit of overkill: we can still inch toward a better user experience and faster search while using simpler technology.

We know that users spend a lot of time going through reviews written by other prospective home buyers. These reviews are packed with information that often isn’t included in the house description, such as firsthand opinions of the lighting fixtures, the neighborhood, and so on. We would like to use this information to help other users make the best decisions possible, but few would take the time to read all this material.

What we’ll do in the next chapters is progressively come up with more-complex NLP features for our website, by taking advantage of the primary natural language content that we have: user reviews. You’ll find that even basic NLP functionality can be useful for people, without having to design super-complex human-like features that will probably end up being underwhelming and frustrate users. As you can see in figure 5.4, in a complex and immature field like natural language processing, increasing the complexity of your product exponentially increases the technological risk, while business value plateaus after a while.

Figure 5.4 Different features with increasing complexity bring exponentially higher technological risk, while the business value plateaus quickly.

The first step toward adding NLP capabilities to our website should be the easiest thing we can build that still delivers value to people. In our case, it can be developing a system that classifies each home review as positive or negative, making it easier for new users to understand how good the listed home actually is.

Basically, we want to go from an unorganized review section like the one in table 5.2 to a much more digestible and informative layout, as in table 5.3.

Table 5.2 A list of positive and negative reviews all mixed together

Reviews

Great location!

The house is very old and run-down.

The view from the rooftop is breathtaking.

I’d never live on the ground floor in such a dangerous neighborhood.

Table 5.3 An organized table of positive and negative reviews

What people liked

What people didn’t like

Great location!

The house is very old and run-down.

The view from the rooftop is breathtaking.

I’d never live on the ground floor in such a dangerous neighborhood.

Let’s see how sentiment analysis fares in terms of depth and width. Remember that the depth of an application is related to how complex the capabilities of the model need to be. In this case, the only thing we’re asking our chatbot to classify is a review into one of two classes (positive or negative), and the depth is therefore low. Width measures the size of the vocabulary that the model has to “know” (the number of words it has to understand to classify a review correctly). In this case, the kinds of words required are limited to positive words like “beautiful,” “safe,” “breathtaking,” and negative words like “worn,” “awful,” “dangerous.” These words are the union of general positive and negative words like “beautiful” and “ugly,” together with some domain-specific terms like “spacious” and “worn” that are specific to the houses domain. The width is therefore a bit higher than that of a standard sentiment analysis task, but not dramatically so. Figure 5.5 compares sentiment analysis with the brokerbot on the width/depth diagram introduced in figure 5.1.

Figure 5.5 The brokerbot versus sentiment analysis on the width/depth diagram. Sentiment analysis has less depth and width because its final outcome is narrow and it deals only with words that express opinions.

As you can see, it’s a much smaller area compared to the brokerbot, and therefore it’s much easier to build while still bringing significant improvements to the user experience.

Sentiment analysis is such a classic and important application of NLP that it’s worth going into a bit more detail. We’ll also use it as a tool to shed some light on the inner workings of NLP algorithms in the next section.

5.3.1 Sentiment analysis

A surprising amount of textual content produced by users on the internet deals with opinions and feelings. For example, think about user reviews on Amazon, Twitter replies, or even blog articles. Sentiment analysis algorithms digest all this free-form text, and figure out whether the author had a positive, negative, or neutral opinion about the subject.

Sentiment analysis is an important task, and it can be used for much more than sorting reviews of houses. You could monitor the results of your marketing campaign in real time by following feeds from social media, or even gauge the reaction to individual presentations during a conference. Wall Street trading firms routinely use sentiment analysis to measure what the public thinks about a company and to inform their trading decisions.

But let’s go back to our real estate listings website. These are excerpts from some of the reviews posted by users:

  • The kitchen appliances are old and smelly.
  • Views from the living room are amazing at sunset.

What we want to do is build an algorithm that can label each sentence as positive or negative, as in figure 5.6.

Figure 5.6 The basic concept of sentiment analysis

Let’s suppose that we don’t know machine learning, and we need to explain to a computer how to rate these sentences by developing an algorithm. A good rule of thumb would be to look at certain telltale words, such as “terrible” and “amazing.” In fact, this is exactly how the earliest algorithms for sentiment analysis worked: researchers painstakingly built a dictionary of important words, and labeled each as positive, negative, or neutral. For example, such a word-sentiment dictionary might look like this:

  • Delighted --Positive
  • Killed --Negative
  • Shout --Negative
  • Desk --Neutral

Once you have an emotional glossary like that, you can classify each sentence by counting the number of positive and negative words and get a final score for the sentence.
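
A minimal sketch of this dictionary-counting approach follows. The word lists are illustrative stand-ins (a real lexicon would have thousands of entries):

```python
# Tiny illustrative word-sentiment dictionary; a real one would be far larger.
POSITIVE = {"delighted", "amazing", "breathtaking", "beautiful", "safe"}
NEGATIVE = {"killed", "shout", "awful", "smelly", "old", "dangerous"}

def naive_sentiment(sentence: str) -> str:
    # Count positive words minus negative words; the sign gives the label.
    words = sentence.lower().replace(".", "").replace("!", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(naive_sentiment("The kitchen appliances are old and smelly."))      # negative
print(naive_sentiment("Views from the living room are amazing at sunset."))  # positive
```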

This simplistic approach has a bunch of problems. Language is extremely complex, and we use it in very different ways from person to person; the same word can be used in different ways to communicate completely opposite messages. Let’s say you listed “nice” as a positive-sentiment word, and one of the reviewers writes this:

It would be nice if they completely removed that tapestry.

Even if the sentence has an overall negative opinion of the house, our naive system would consider it positive, because of the positive connotation of “nice.” Maybe you could try improving this system by adding more-complex rules, something like this:

‘It would be [POSITIVE WORD] if ...’ => negative

Although this rule would work on the preceding snippet, it’s still easy to fool. For example, the following sentence is actually extremely positive, but would be ranked as a negative opinion:

It would be nice if I could move into this place right away!
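
Implemented as a pattern (a sketch using Python’s re module; the positive word list is again illustrative), the rule and its failure mode look like this:

```python
import re

POSITIVE = {"nice", "great", "good"}

def rule_sentiment(sentence: str) -> str:
    # Hardcoded rule: "It would be [POSITIVE WORD] if ..." flips to negative.
    pattern = r"it would be (%s) if" % "|".join(POSITIVE)
    if re.search(pattern, sentence.lower()):
        return "negative"
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return "positive" if words & POSITIVE else "neutral"

print(rule_sentiment("It would be nice if they completely removed that tapestry."))
# negative -- correct
print(rule_sentiment("It would be nice if I could move into this place right away!"))
# negative -- wrong: the rule fires, but the sentence is extremely positive
```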

Should we keep adding hardcoded rules? The game is already becoming complicated (and boring), yet it’s still easy to fool our system. Notice also that we’re still playing with just a few words; we haven’t even started mapping the vast ocean of the English vocabulary. Even if we did get to the bottom of this, we would almost need to start all over again for other languages.

By now, you’re probably starting to think that this approach is doomed to failure. If you are familiar with the main theme of this book, you might think that we have something much better coming up for you, and you would be right. Maybe a machine can find out how to perform the task by itself, without us explaining it.

We can try looking at sentiment analysis in the same way we looked at other ML problems in the previous chapters. Given a piece of text, we want to decide if it belongs to the “positive” or “negative” class: a classification problem. The classes (labels) will simply be “positive” or “negative,” and the features will be the words of the sentence. Instead of manually creating rules to classify each word as positive or negative, we can train an ML model to do that job for us. It turns out that the presence or absence of specific words in the text is enough for the model to figure out whether the opinion is negative or positive.
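
As a hedged sketch of this idea, assuming scikit-learn is available: the four training reviews below are the ones from table 5.2 (a real system would need thousands of labeled examples):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set of labeled home reviews.
reviews = [
    "Great location!",
    "The view from the rooftop is breathtaking.",
    "The house is very old and run-down.",
    "The kitchen appliances are old and smelly.",
]
labels = ["positive", "positive", "negative", "negative"]

# Turn each review into word counts, then learn which words signal each class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["The rooftop view is great."]))  # ['positive']
```

No rules were written by hand: the model inferred from the examples that words like “great” and “rooftop” signal a positive review.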

This approach is simple yet effective. One of the first to sketch out this idea was Paul Graham, the legendary founder of Silicon Valley’s most successful startup accelerator, Y Combinator. He thought of this approach while trying to fix what he thought was one of the biggest threats to the early internet: spam email. Back then, your email inbox would fill up with advertisements for all sorts of frauds, from fake lottery prizes to counterfeit Viagra. In 2002, Graham came up with a new method that he described in his essay “A Plan for Spam” (www.paulgraham.com/spam.html).

Graham’s plan begins with the realization that most developers will be drawn to programming explicit rules to detect spam email. Back in those early days of the internet, by using explicit rules, it wasn’t hard to correctly identify almost 80% of spam. However, catching the remaining 20% would turn out to be extremely difficult. On the other hand, a simple ML approach was more effective right off the bat. The concept was pretty simple: let an ML algorithm see a bunch of spammy and not-spammy emails, and let it figure out by itself which words are most likely to be indicative of spam.

With limited effort, Graham’s algorithm was capable of beating the most complex hand-coded rules that expert programmers painstakingly encoded. It also found surprising correlations between weird words and spammy email that no one had thought of: it turns out that ff0000, the color code for “red” used on the web, is as good an indicator of spam as pornographic words.

Graham’s experience provides two interesting takeaways. First of all, in 2002, Graham was considering autonomous spam filtering to be AI. Today, we take this feature for granted and think it’s less sophisticated than a coffee maker. Also, we won’t get into the details of the specific classifier that Graham used (for your information, it’s called naive Bayes), but it’s probably one of the simplest algorithms a mathematician can think of. Yet, it was still able to outsmart him and some of the smartest engineers in the world. Learning by experience beats hand-coded rules, once again.
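
We won’t reproduce Graham’s exact formulas, but the heart of the idea (per-word spam likelihoods estimated from counts) can be sketched like this. The toy emails are invented for illustration:

```python
from collections import Counter

# Invented, tiny training sets; real filters learn from thousands of emails.
spam = ["win a free lottery prize now", "free viagra click now"]
ham = ["meeting notes attached", "see you at the conference"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)

def spam_probability(word, smoothing=1.0):
    # Fraction of the word's (smoothed) occurrences that come from spam.
    s = spam_counts[word] + smoothing
    h = ham_counts[word] + smoothing
    return s / (s + h)

print(round(spam_probability("free"), 2))     # 0.75: strong spam indicator
print(round(spam_probability("meeting"), 2))  # 0.33: leans toward legitimate mail
```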

5.3.2 From sentiment analysis to text classification

Sentiment analysis has already improved the experience of our users on the site, but we want to take some more steps toward the brokerbot. We decide to tackle another problem: reviews are wordy and talk about all sorts of aspects about the house, from the lighting in the living room to the neighborhood. A potential buyer who is particularly interested in a specific topic would have to go through all the reviews to find the ones that talk about that topic. Tagging each review by its topic would be an awesome feature that allows potential buyers to quickly find the information they want.

Instead of having just two classes (positive and negative) as in sentiment analysis, we now have many more, as reviews can be about the furniture, the view, the prestige of the neighborhood, and so on. This task is usually called text or document classification.

Is document classification more or less complex than sentiment analysis? Again, let’s use the width/depth diagram to visualize the difference. The width of the task is higher, because document classification models need to understand a bigger vocabulary of words. Terms like “shelves,” “dishwasher,” or “orchard” can be ignored in sentiment analysis, but are clearly important for classification. The depth of an NLP application measures how complex the task is. In the sentiment analysis case, we simply wanted to put our review in one of two buckets. For text classification, we could have as many buckets as there are categories we want to divide our reviews into; for example, reviews about the living room, the wear and tear of the house, the neighborhood, and so on. The more classes we decide to use, the deeper the model becomes, as shown in figure 5.7.

Figure 5.7 Comparing the brokerbot to the other, simpler NLP applications we’ve developed in this chapter

Basic ML models are usually good enough for sentiment analysis, but start having problems with more-complex tasks like classification, mostly because they ignore the meaning of words, and simply count how often they appear in the labeled dataset. This means that the classifier has no insight about the fact that “awesome” and “breathtaking” have similar meanings. Why is this a problem? Suppose you have 20 training examples with the word “awesome” but none with the word “breathtaking.” A model that doesn’t have any knowledge about the meaning of words will learn to classify as a positive example every review with the word “awesome” but won’t know what to do with the word “breathtaking.” If a model can grasp the meaning of words, it’s enough for it to see a few examples with a few positive words to extend its classification power to all the words that are positive (that is, it will correctly interpret “stunning,” “fabulous,” and “gorgeous” even if the training example contains just “awesome”).

The high variability of words is creating a problem similar to the one we had in chapter 4 with images. When we started talking about computer vision, our main issue was that images are made of millions of pixels, and it’s hard for us to find ways to break the complexity of this data into simpler items, like the presence of basic shapes. Our solution to the problem was using deep learning, a special class of algorithms that are capable of autonomously extracting complex features (in the case of images, from basic information like corners and edges, to complex ones like facial traits for face recognition). You also saw how to use these algorithms to build embeddings , a computer-friendly summary of the contents of an image.

Luckily for us, deep learning also works on text and can help us build more powerful features than the simple presence of certain words in a sentence. In the same way we used embeddings to transform raw image pixels into meaningful information (for example, pointy nose or big ears), we can use them to transform words into a set of numbers that represent abstract word features, like “positiveness” and “strength.” Just as two images with similar objects have similar embeddings, similar words will have similar embeddings as well (that is, similar words like “awesome” and “fabulous” will be transformed into similar sets of numbers).

If we interpret the numbers of a word embedding as coordinates in space, we can visually understand the power of this representation. For instance, in figure 5.8, you can see words represented on a 3-D plane. In the left plane, you can notice that positive words are close to each other, and they’re far away from negative words (which, on the other hand, are close to each other). The same applies to the diagram on the right: objects that are related to the furniture in a living room tend to cluster together, and lie far away from adjectives related to the neighborhood. In this setting, closeness simply means that the word embeddings of two near words are similar, which allows our text document classifier to greatly simplify its work: it can now ignore the differences between similar words like “amazing,” “fabulous,” or “beautiful,” and consider all of them as similar numbers that represent “positive” feelings.

Figure 5.8 In these word embeddings represented on 3-D planes, positive and negative words are on the left, while words related to neighborhoods and living rooms are on the right.
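
Here is a toy illustration of that closeness with made-up 3-D vectors (real embeddings have hundreds of dimensions, and these numbers are invented for the sketch):

```python
import math

# Invented 3-D "embeddings": similar words get similar coordinates.
embeddings = {
    "amazing":  (0.90, 0.80, 0.10),
    "fabulous": (0.85, 0.75, 0.15),
    "awful":    (-0.80, -0.90, 0.20),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; negative means opposite ways.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(embeddings["amazing"], embeddings["fabulous"]))  # close to 1
print(cosine_similarity(embeddings["amazing"], embeddings["awful"]))     # negative
```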

Just like computer vision, NLP applications can also take advantage of transfer learning to reduce the amount of training data needed. As you saw in chapter 4, transfer learning is a way of “warming up” a model with large amounts of easily available generic data before training it by using domain-specific training data, which is often harder to get. Just like computer vision applications, in which models can learn simple shapes and patterns, NLP models will pick up basic vocabulary and grammar, which are useful for pretty much everything (“beautiful” has the same meaning whether you’re talking about houses or anything else).

The NLP research community has released several DL models that were trained on huge text databases and can thus understand and create embeddings of almost a complete vocabulary. Figure 5.9 sketches out the basic idea behind adopting freely available embeddings for your specific classification task.

Figure 5.9 A typical NLP pipeline: For the first three steps (in the gray box), you can take advantage of open source models and embeddings. The last three steps involve domain-specific datasets.

The plan is to use the freely available pretrained models to compute embeddings of your own text (for example, the collection of home reviews on the website). These models do all the hard work of understanding grammar and vocabulary because they’ve been trained on a large corpus of text that allowed them to learn all those complex rules. When you have the embedding vectors for the text you care about (for example, the home reviews), it’s much easier to build a classifier for sentiment analysis or other tasks, because embeddings express much more meaning than the raw text would.
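
As a hedged sketch of this pipeline: the tiny word-vector table below is a stand-in for a real open source pretrained model, scikit-learn and NumPy are assumed to be installed, and the reviews are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for a pretrained embedding model: word -> vector.
# A real model would cover almost the whole vocabulary, in many dimensions.
PRETRAINED = {
    "amazing": np.array([0.9, 0.8]), "breathtaking": np.array([0.85, 0.75]),
    "smelly": np.array([-0.8, -0.9]), "old": np.array([-0.6, -0.7]),
}

def embed(sentence):
    # Average the embeddings of known words: a crude but common baseline.
    vecs = [PRETRAINED[w] for w in sentence.lower().split() if w in PRETRAINED]
    return np.mean(vecs, axis=0)

# Step 2 of the pipeline: train a small classifier on top of the embeddings.
reviews = ["amazing breathtaking view", "old smelly kitchen"]
X = np.array([embed(r) for r in reviews])
y = ["positive", "negative"]
clf = LogisticRegression().fit(X, y)

print(clf.predict([embed("breathtaking rooftop")]))  # ['positive']
```

The classifier itself stays simple because the embeddings already carry the meaning; that is exactly the division of labor shown in figure 5.9.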

5.3.3 Scoping an NLP classification project

Text classification is a generic and useful application that can be helpful in different contexts across an organization. Table 5.4 gives you some inspiration.

Table 5.4 Three examples of NLP-based classification for business applications

Application

Input data

Possible classes

Automatic ticket dispatcher

Tweets and email sent to customer service

Root cause of the issue (lack of internet connection, wrong billing, and so on)

Automatic car insurance claims categorizer

Insurance claim report

Kind of damage (broken engine, scratch, and so on)

Medical claims classifier

Doctor’s report

Disease

When applying document classification to business problems, it’s important to keep in mind that risk and performance depend on the number of classes we want to identify. On the width/depth graph in figure 5.10, you can see how complexity grows: the more classes we ask our AI algorithm to identify, the greater the depth of the application, and the more challenging the task.

Figure 5.10 The depth of a classification NLP task increases as the number of classes grows, while its width remains the same.

Something else to keep in mind is that if you have a limited number of examples for a class in your training data, it will be complicated for the algorithm to learn to recognize elements from that class, as it won’t have seen enough cases. Going back to our home review classifier, say you have a situation like the one in table 5.5. In this case, we have a lot of training data for the first three classes, but we’re short on examples for the last. Class 4 has only three examples, compared to hundreds of reviews in the other classes.

The solution for this kind of situation is either to get more training data, or simply to merge the unlucky class with another similar one. For instance, classes 3 and 4 are both related to the neighborhood, so we could just merge them into a broader class of “reviews about the neighborhood,” which would have 988 + 3 = 991 reviews in total. The final dataset has one fewer class, but is much more balanced in terms of examples, as represented in table 5.6.

Table 5.5 Initial example dataset of home reviews

| Class ID | Class name | Number of examples in class |
|---|---|---|
| 1 | Wear and tear reviews | 1,541 |
| 2 | Furniture reviews | 769 |
| 3 | Public services in neighborhood reviews | 988 |
| 4 | Neighborhood safety reviews | 3 |

Table 5.6 The same dataset after merging classes 3 and 4

| Class ID | Class name | Number of examples |
|---|---|---|
| 1 | Wear and tear reviews | 1,541 |
| 2 | Furniture reviews | 769 |
| 3 | Neighborhood reviews | 991 |

Reducing the number of classes also helps keep down the depth of the task, increasing overall accuracy and therefore reducing the risk that the system will make mistakes when used in the real world.
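The merging step described above can be sketched in a few lines of Python. The review texts below are hypothetical stand-ins for the real dataset, with one example per class; in practice each class would have the counts shown in table 5.5.

```python
from collections import Counter

# Hypothetical mini-dataset of (review text, class ID) pairs.
reviews = [
    ("The faucet leaks and the paint is peeling", 1),      # wear and tear
    ("The sofa is comfortable but the table wobbles", 2),  # furniture
    ("The bus stop is right around the corner", 3),        # neighborhood services
    ("The street feels safe even at night", 4),            # neighborhood safety
]

# Classes 3 and 4 both describe the neighborhood, so fold class 4 into class 3.
MERGE_MAP = {4: 3}
merged = [(text, MERGE_MAP.get(label, label)) for text, label in reviews]

print(Counter(label for _, label in merged))  # class 3 now absorbs class 4
```

After the merge, the label set shrinks by one class while every remaining class keeps all of its examples, which is exactly the 988 + 3 = 991 arithmetic from table 5.6.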

5.3.4 Document search

At the beginning of this chapter, we pitched the brokerbot: an AI application capable of interpreting text as a human would. It’s too complex for modern AI technology, so we started from the simplest application we could build: a sentiment analysis classifier. The next step toward deeper language understanding was text classification. Unfortunately, it still suffers from two important limitations:

  • It’s a classification task, which means we need to collect labels.
  • The classes are defined up front: if a user wants to search for something that is not included in the classes, we can’t help them.

It’s possible to overcome the first problem either with a clever data collection hack or simply by paying humans to add labels by hand. The second problem is more challenging, and it’s exactly what Google is great at: its little search bar lets you type in anything you’re interested in and immediately get a list of results sorted by relevance. This task belongs to a branch of NLP often called document search and retrieval.

What we’re trying to do is the following: we want to have a search bar at the top of our website, and when a user types something they’re interested in (let’s say “house insulation”), the algorithm will search through all the available reviews and pick the ones that are the most relevant to their search, as in figure 5.11.

Figure 5.11 In document search, the algorithm is trained on a set of documents (top left). When it receives a query by the user (top right), it produces a list of the most relevant content.

For document search, the concept of word embeddings can help us big time. Recall that a word embedding is a set of a few hundred numbers that encode the meaning of a word, with the property that similar words have similar embeddings. Embeddings allow us to transfer the notion of “similarity” from the nebulous world of literature into the cold world of numbers. We can take advantage of this so that when the user asks about “house insulation,” we can transform these two words into their embeddings, and if we’ve done the same with all the reviews, we can then look for the reviews whose numerical representation is closer to the one we got for “house insulation.”

We want to help you visualize this concept in your head, so imagine a two-element embedding: each word is represented by only two numbers, as drawn in figure 5.12. Of course, such an embedding won’t be very useful in practice, because two numbers are not nearly enough to capture the richness of vocabulary (the embeddings used in real-world applications are made up of hundreds or thousands of numbers).

Figure 5.12 Document search algorithms answer user queries by looking up documents that have similar embeddings to the query sentence (shown with a star in the plot).

Imagine drawing a point for each of the documents we have in our database, and then drawing a new point corresponding to the embedding of the user query. Thanks to the way the embeddings are built, the documents most similar to the user query will be physically close to the query point. This means that it’s easy to answer the user query: just pick a few of the points that are closest to it, and those will correspond to the most similar documents. This solution also works when embedding vectors have more than two numbers; it’s just harder to visualize them.
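To make the idea concrete, here is a minimal sketch of nearest-embedding search using made-up two-number embeddings. The review texts and all the embedding values are invented for illustration; a real system would obtain embeddings from a trained model.

```python
import math

# Toy two-number embeddings (hypothetical values). Real embeddings have
# hundreds or thousands of numbers, but the search logic is the same.
review_embeddings = {
    "the walls keep the house warm in winter": (0.9, 0.2),
    "the kitchen appliances are brand new":    (0.1, 0.9),
    "the park nearby is lovely":               (0.5, 0.5),
}
query_embedding = (0.8, 0.3)  # pretend this encodes the query "house insulation"

def cosine_similarity(a, b):
    # Similar embeddings point in similar directions, so their cosine is near 1.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Rank reviews by how close their embedding is to the query's embedding.
ranked = sorted(review_embeddings,
                key=lambda doc: cosine_similarity(review_embeddings[doc], query_embedding),
                reverse=True)
print(ranked[0])
```

The review about keeping the house warm ranks first because its point sits closest to the query point, even though it never contains the word "insulation."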

Document search is a fairly mature application of NLP that often plays the role of a building block in more-complex products. It’s obviously not as fancy as other techniques, but it’s often one of the first steps you can take in getting value out of your text-based data assets. In particular, the flavor of document search that we have explained here can’t really answer any query that depends on logic or interpretation; the only thing it can do is return documents that sound similar to your query, hoping that you’ll find them somehow useful.

5.3.5 Natural conversation

In all our efforts to get closer to the brokerbot, we always relied on human-generated text, and were happy to create algorithms that made it easier for people to access that information. A natural improvement would be a model that can generate language on its own, instead of cobbling together premade sentences.

We want to break the bad news right away: this field is still very much at the cutting edge of research, and today’s technology is not ready for prime time (yet). Our goal with this section is to show you where the limits are, and how to work around them given what we have today.

The most advanced piece of technology on the market is a model called GPT-2, developed by the nonprofit research organization OpenAI. This model is capable of a variety of tasks, and one of the most impressive ones is generating text based on a human prompt. We downloaded an open source implementation of this model and tried it ourselves:

  • Our prompt -- Achieving human-like performance in language generation would be an amazing milestone for AI. Can the new OpenAI’s GPT-2 model be a turning point?
  • GPT-2 completion -- And what impact can this model have on other artificial intelligence projects like AIQ and HCI?

    OpenAI was born to tackle this kind of problem. We were working in the field of speech recognition for the last five years. While working in this space we developed our speech recognition in the open source code. After the GPT-2 announcement some weeks ago, it is time to take the next step in speech recognition and we are proud to present the version 2.

    The biggest challenges facing AI, the most important problems for it are

    Information in general

    Knowledge is often not accessible

    Information can be obtained from many sources that can conflict
As you can see, the results are fairly convincing. The content itself doesn’t necessarily make sense, but it stays on topic with almost flawless grammar. This was made possible by the gigantic amount of data that the model was trained on: more than 8 million web pages. Going back to real estate, here’s one last try of GPT-2, with another sentence related to a house:

  • Our prompt -- This house is an amazing deal: the view from the living room is outstanding, and the neighborhood is very up-and-coming.
  • GPT-2 completion -- I’m hoping to upgrade to a two bed as I really love this house and this neighborhood. If you live in central Florida you should probably go check it out.

My husband loves my house on the north side of town and we are close to Miami and many beach places. Our neighborhood is just a block away. We are both super happy and our home is super cozy. We did get moved but never planned to leave. We have also had other places to take a break and have never had such a convenient place for vacationing. I highly recommend it! We love it!!!

The model “noticed” that the text it was fed resembled a review of something like an Airbnb house listing, and it kept adding text as if it were a home review.

You can also use GPT-2 to answer questions about a body of text, which would probably be a more interesting application than generating random sentences. In fact, this would be extremely close to the brokerbot, as we could use GPT-2 to answer users who have inquiries about houses in plain language. Here are a few examples of the Q&A capabilities of GPT-2:

  • Our prompt -- The GPT-2 model was developed by OpenAI, training an advanced deep learning model on over 40GB of internet text from 8 million web pages. The model has amazing performance, but it’s far from perfect.

Q: Who developed GPT-2?

  • GPT-2 answer -- GPT-2 by OpenAI was developed to develop and train a deep learning model, which can then be used in machine learning applications. The team has made the goal to build a “machine learning engine for the internet” for the next generation of applications.
  • Our prompt -- From Zero to AI is a book about artificial intelligence written for nontechnical people. The book will help you understand the core principles of AI and how to apply them to your business. It’s written by Nicolò Valigi and Gianluca Mauro, and published by Manning Publications.

Q: Who wrote From Zero to AI?

  • GPT-2 answer -- From Zero to AI has always been written by Nicolò Valigi. I always have been an ardent fan of Valigi, and his previous books are quite influential in my writing today. I know him from his work for Apple in their Mac, iPod, and iPhone. He also worked on the screenplay for “The Lego Movie”, “Hobbit” the movie, and “Serendity”. I had the opportunity to write the book for my daughter’s first year of college this past summer, and it was quite interesting to read how he uses his years of experience and successes at the highest level of the industry to help write a book that will appeal to a wider audience. That being said I want to thank Nicolò Valigi, and hope that he will continue to write a great book to help all of us become more intelligent.

Even though these results seem amazing, severe limitations still affect this technology, and researchers are working hard to fix them. Examples are as follows:

  • It’s hard to control the output of the model. For instance, we had to try three to four times before getting the results presented here. Even with highly popular topics that are widely present in the training data, OpenAI researchers reported that the model can produce credible results only about 50% of the time.
  • When the model doesn’t know a topic, it invents stuff based on what may be plausible, even if it’s completely false. A good example is the last one: the model invented that one of us authors had experience at Apple and wrote screenplays, just because the prompt was about technology and books.
  • The model doesn’t really understand the meaning of what it’s saying: it sometimes mentions impossible scenarios like fires burning under water.
  • It can get stuck and start repeating words on a loop.
  • It may switch topics unnaturally.

Arguably, the biggest limitation of these models is that they may seem to understand the concepts in the prompt, but are actually tricking us: all they do is guess which word is more likely to follow in a sentence, based on the billions of sentences they’ve seen. In other words, they’re parroting their huge training datasets, not understanding.
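This “guess the next likely word” idea can be illustrated, at a vastly smaller scale, with a toy bigram model. GPT-2 uses a deep neural network rather than simple counts, but the training objective is the same kind of next-word prediction; the tiny corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

# A tiny toy corpus; real models train on millions of web pages.
corpus = "the view is amazing . the view is great . the neighborhood is great".split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def most_likely_next(word):
    # "Parrot" the training data: pick the word that most often followed `word`.
    return following[word].most_common(1)[0][0]

print(most_likely_next("view"))  # "is", because "is" always followed "view"
```

The model has no idea what a view is; it only knows which word tended to come next in its training text. Scaled up by many orders of magnitude, that is the same trick behind the fluent-sounding completions above.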

You can visualize the difference between understanding and probabilistic reasoning by comparing a doctor and an ML algorithm. A doctor can diagnose a patient by reading a textbook, understanding its content, and perhaps practicing on X-rays. On the other hand, no ML algorithm today can do the same. To make a diagnosis on a scan, an ML algorithm needs a large number of examples, and having “read” just a single book before won’t help it improve its diagnosis in any way.

How do we get closer to the brokerbot, then? Do we just have to sit around waiting for text generation to mature? Thankfully, no. You can adopt other strategies to create value today, so bear with us.

5.3.6 Designing products that overcome technology limitations

Let’s imagine we really want to build the brokerbot with whatever technology offers us today. We could go back to our sentiment analysis algorithm and our topic classifier, having a nice list of ready-made sentences for each topic. Say the user writes something like this:

What are the apartments with the best view?

Our topic classifier may understand that the topic of the question is the panorama that you can enjoy from the apartment, and our sentiment analysis knows that we’re interested in something positive (we also know that this is a question because of the question mark at the end, no fancy tech needed). Now, you may have a series of questions that match the requirements “topic = view” and “sentiment = good,” like so:

Would you rather be by the ocean or close to a park?

Notice that this question isn’t generated by any clever AI algorithm; a human came up with it and stored it in a database with a bunch of other questions. The user may answer something like this:

I love trees! Golden Gate Park is my favorite

Again, our sentiment analysis will understand that the sentence has a positive feeling, our topic classifier will understand that we’re talking about trees (which have an embedding similar to the parks mentioned previously--you see how powerful embeddings are?), and therefore the answer to the question is “parks.” If you built the classifier to do so, you may also catch the reference to Golden Gate Park and look for houses close to it. From there, you may have scripted something that understands it’s the right time to show properties, and suggest the best homes for that specific user.
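The happy path just described can be sketched as a simple lookup table. The topic labels, sentiment labels, and question text below are all hypothetical, and the topic and sentiment classifiers themselves are assumed to exist elsewhere.

```python
# Hypothetical database of human-written follow-up questions, keyed by the
# (topic, sentiment) pair produced by the topic and sentiment classifiers.
FOLLOW_UPS = {
    ("view", "positive"): "Would you rather be by the ocean or close to a park?",
    ("view", "negative"): "Got it -- is a nice view not a priority for you?",
    ("price", "positive"): "Great! What's your budget range?",
}

def next_question(topic, sentiment):
    # Fall back to a generic reply when no scripted question matches.
    return FOLLOW_UPS.get((topic, sentiment),
                          "I don't get what you mean, can you repeat?")

print(next_question("view", "positive"))
```

Note that every clever-sounding reply was written by a human up front; the AI only routes the conversation to the right canned question, and anything outside the table falls through to the generic fallback.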

It looks amazing and smooth, right? But what if the user answers the previous question with something like this?

I’d say that I love the ocean, but when it’s winter, there are many storms here, so I’d go with something else.

This is probably the unluckiest case we can get. Clearly, the user would prefer the park view, but the sentence presents several challenges:

  • It talks about the ocean view.
  • It never really mentions anything related to parks.
  • It says “love,” which conveys a positive meaning.
  • No words hint to a negative sentiment about the ocean view.

An algorithm will most likely be confused by a sentence like this, and move along the predetermined script with the wrong follow-up questions. Even worse, the user may answer as follows:

You know what? I really like looking at tall buildings with nice architecture.

What if you haven’t thought of this answer? The user has followed a script path to a dead end, and you don’t have a response. Your bot will probably reply with something generic like “I don’t get what you mean, can you repeat?” The user will never be able to answer the question in a way you can handle (you simply didn’t precode anything that matches those interests!) and will become frustrated and abandon the product.

Does it sound familiar? It certainly does to us, and this is the frustrating experience that consumers get from companies that decided to use chatbots without making sure that their functionality matches what today’s technology can deliver.

The solution lies in the way you scope your application. If you’re aware of the limitations of today’s technology, do your best to stay within them, building your application around them and keeping complexity low, both in terms of width and depth. In the case of the brokerbot, we might explicitly limit its functionality to a small subset of actions, like querying houses based on basic parameters (location, size, price). Also, try to avoid natural language as much as possible, using either closed-ended questions (yes/no) or buttons that let the user choose from different options. For instance, the bot might start the conversation so that it can lead the user toward specific topics, thereby limiting the width of the answers. The bot might kick off the conversation by asking this:

Hello! I’m the brokerbot, I’m here to help you find your new house. First of all, what’s your budget?

Here, you may decide to let the user give an answer in natural language or provide a slider for choosing the price. The latter is the safest, but because we know that the user’s answer will be about pricing, we are also confident that we’ll be able to understand an answer in natural language. From there, the bot could move forward with questions about location, size, and so on, until it can finally show the user a selection of homes, and eventually even ask for feedback.
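Such a scripted, bot-led conversation can be sketched as a simple slot-filling loop. The slots, questions, and simulated answers here are all hypothetical.

```python
# A minimal scripted flow: the bot leads the conversation, asking one
# question per piece of information it needs, instead of parsing free text.
SCRIPT = [
    ("budget",   "First of all, what's your budget?"),
    ("location", "Which neighborhood are you interested in?"),
    ("size",     "How many bedrooms do you need?"),
]

def run_script(get_answer):
    """Ask each scripted question in order and collect the answers."""
    collected = {}
    for slot, question in SCRIPT:
        collected[slot] = get_answer(question)  # a real bot would await user input
    return collected

# Simulate a user session with canned answers (one per question, in order).
answers = iter(["$500k", "Mission", "2"])
profile = run_script(lambda question: next(answers))
print(profile)
```

Because the bot asks the questions, each answer arrives in a known context, so even a free-text reply only needs to be interpreted as a budget, a location, or a size, keeping the width of each turn small.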

Conversation scripting and design is still the only way to build technology that brings value to people. We encourage you to use the knowledge you gained in this chapter to think critically about what you want your chatbot to do and how you want it to relate to users.

5.4 Case study: Translated

In 1999, Marco Trombetti and Isabelle Andrieu decided to start a business together. Trombetti was studying physics, and Andrieu was a linguist. To use their skills, the young couple came up with the idea of building an internet-based translation company. They called it Translated , and launched it with an initial investment of $100 for the internet domain Translated.net. In the following 20 years, Translated would become the largest online professional translation company in the world, growing organically without any external investment.

Translated focuses on professional translations, like medical journals and engineering user manuals. Such translations are traditionally performed by professional translators contracted by companies that act as brokers. Matching a body of text with a translator is not trivial: many documents need both deep language skills and subject matter expertise. For instance, translating a paper about transcatheter aortic valve implantations from English to German would need a cardiologist skilled in both languages. Translation brokers that own a database of translators can take advantage of this and charge large premiums to provide clients with the right translators. The larger the translation company, the larger the challenges and inefficiencies in matching translators with jobs. This was an opportunity that Trombetti and Andrieu set out to exploit.

The first challenge that Trombetti and Andrieu had to face was building a large database of translators. From the outset, the duo invested in search engine optimization (SEO) so they could land at the top of Google searches for professional translation services. These being the early days of the internet, their investment quickly paid off, and Translated started receiving hundreds of resumes per month. The early success led to a new challenge: many of the translators sending their resumes didn’t meet the quality standards of the company, and sorting professional translators from nonprofessional ones required a great deal of human effort. Translated risked falling into the same inefficiencies as other translation brokers.

The inspiration for how to address this challenge came to Trombetti in 2002, when he read the very same Paul Graham essay, “A Plan for Spam,” that we talked about earlier in the chapter. Trombetti thought that the same method used to discriminate between spam and non-spam email could be used to make a first assessment of whether a resume was from a professional or nonprofessional translator, as depicted in figure 5.13.

Figure 5.13 Translated’s ML algorithm helped the company automatically identify professional and nonprofessional translators.

The algorithm was first introduced in 2002, when the company started receiving hundreds of translation requests per day, and it has been improved ever since. The next evolution of the algorithm was to turn it into an automatic translator-document matching AI, training it on the matches made by Translated’s project managers (PMs) over the years. Once deployed, for each new translation request, the algorithm would look at the document and at the company’s pool of translators and calculate a match probability for each translator. Translated’s PMs would then validate the matches and assign the task. This process allowed the young startup to handle many clients and translators without compromising delivery speed (a factor often crucial for clients) and to keep its overhead low with a smaller team of PMs.

Today, Translated acts as a double-sided platform: from one side, translators apply to offer their services; and from the other, customers submit the documents they need to have translated. In the backend, Translated matches the right translator for each job and delivers the final result to the client, without them needing to interact with each other. The current version of their algorithm is branded as T-Rank.

The algorithm takes into account up to 30 factors, including the source document, the translator’s resume (area of expertise), and previous projects (quality and timeliness of previous jobs). Using this data, T-Rank ranks the translators in the company’s portfolio according to their fit with the job, providing the best professional for each task, as in figure 5.14. As in the first versions of the algorithm, the final recommendations are vetted by expert PMs. If a PM rejects a recommendation, the feedback is recorded and used to fine-tune the algorithm and continuously improve its performance.

Figure 5.14 Translated is a double-sided platform: customers send documents, translators send their data,
and the company finds the best translator in its portfolio for each document to translate.

Success in the market brought Translated both cash and data. A consistent cash flow enabled Trombetti and Andrieu to run the company without seeking external investments and to keep complete ownership. Meanwhile, Translated started collecting the translations made by professionals through its computer-aided translation system, Matecat, improving its matching algorithms and pushing the company toward a new mission: becoming the leader in machine-aided translation.

To evaluate the performance of its ML-generated translations, Translated has always tracked the percentage of suggested words that are changed by the professional translator. In the early days, translators had to change 45% of suggested words.

Together with AI-assisted matching of translators, this was enough to deliver translations more quickly than the competition. In 2018, the company could count on more than 1.2 million professional translations, and the percentage of words changed by the professional translator had dropped to 24% for the eight most-spoken languages, 30% lower than other translation engines.
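A rough sketch of how such a changed-words metric might be computed, using word-level diffing, is shown below. The example sentences are invented, and Translated’s actual metric is certainly more refined than this.

```python
import difflib

def percent_words_changed(suggested, final):
    """Rough share of machine-suggested words the translator did not keep."""
    suggested_words = suggested.split()
    matcher = difflib.SequenceMatcher(None, suggested_words, final.split())
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 100 * (1 - kept / len(suggested_words))

suggested = "the house is close of the beach"   # hypothetical machine suggestion
final     = "the house is close to the beach"   # hypothetical translator's fix
print(round(percent_words_changed(suggested, final), 1))  # one word in seven changed
```

Averaged over millions of sentences, a number like this gives a single trackable figure for how much work the machine suggestions save the human translator.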

To further improve the experience and productivity of translators, the company has invested in a technology branded as ModernMT, the first adaptive neural machine translation system. Traditionally, ML algorithms are trained on a large amount of data (the training phase) and remain static while they’re used (the inference phase). If a translator is presented with a poor suggestion and fixes it, the fix won’t be taken into account by the machine translation (MT) algorithm until training is repeated with the new data. This means that while a translator is working on a document, the same fixes might have to be applied multiple times. The result is a frustrating and inefficient experience for the translator.

As illustrated in figure 5.15, ModernMT learns and adjusts in real time while the translator is working through the translation. If the same phrase occurs more than once in a document and the translator fixes it once, ModernMT will learn the correction the first time and avoid repeating the mistake.

Figure 5.15 Translated’s ModernMT translation engine proposes a translation that can eventually be corrected by the translator. The translator’s correction is then learned by ModernMT.

As of 2019, Translated receives 2,000 resumes a month from translators. The company has brokered 1.2 million professional translations for more than 130,000 customers, and the translations are performed in 151 languages by more than 180,000 professional translators. The algorithms built on this dataset allow state-of-the-art machine translation performance on the 27 most common languages. Translated’s algorithms facilitate, but don’t replace, the work of professionals: they provide a first automatic translation that just needs to be refined and fine-tuned. Today, translators need to change just 24% of the words suggested by the company’s platform, 30% better than any other solution currently on the market. Translated reported revenues of €21 million in 2018, with 51 employees (12 of them developers and 35 in operations) and expected growth of 40% for 2019.

Translated is still investing in improving its translation engine. The goal is not to get AI to produce perfect translations. In fact, if a translation performed by a professional translator is reviewed by another professional, that person would still change about 11% of the words. This is because of the extreme expressiveness of language: the same concept can be expressed with different words to convey slightly different meanings and nuances. According to Trombetti, this is the most important and fulfilling task of a translator: making sure that the richness of the document is preserved in the translation. His goal is therefore to improve the company’s AI algorithms so that translators can focus on the 11% of the words they need to change to convey the original message of the document in the most expressive way possible.

Trombetti and Andrieu want to allow translators to perform a more fulfilling and higher-quality job, and democratize translation along the way. Here’s what they said during our interview with them:

Our technology is creating a lot of opportunities not just for us but also for the translators, who are finally not working on correcting again and again the same tedious stuff but are spending their time in being more human, communicating what the text actually means and being more persuasive and creative.

5.4.1 Case questions

  1. How can you develop a strategy for AI in NLP tasks?
  2. Is AI a tool to replace humans?

5.4.2 Case discussion

Translated is a great example of a successful incremental strategy for AI. In its beginnings, the company used AI technology to address a simple use case: classifying translators’ resumes as professional or nonprofessional. Notice that this first use case was one of the simplest applications the company could have built, and its technology risk was very low because Trombetti and Andrieu knew that the same approach had worked in another domain (email spam classification). Even though it was simple, this use case had several characteristics that made it a good fit for AI at the time, such as the following:

  • It played a highly strategic role in outperforming competition.
  • It had a short time to ROI (its results could be seen immediately).
  • It carried low implementation risk.

The success of this first application brought more customers and more data. The enriched data asset of Translated allowed a positive loop, making the translator-matching technology more powerful as data volume increased, and allowing new applications that weren’t possible in its early days, such as AI-assisted translations.

Even though AI-assisted translation would have been a killer feature 20 years ago, the company didn’t start investing in it immediately. Lack of data was the first reason, but the biggest reason Trombetti waited was that the technology wasn’t ripe enough. It’s a prime example of riding the technology wave: letting research take the first hard steps toward a technology’s maturity, and then combining those findings with strategic data assets to reach state-of-the-art performance. Great timing can be the key success factor for a company like Translated, working at the edge of technology.

Notice also that Translated knows machine translation is an extremely complex task (it would sit up and to the right on the width/depth diagram), but it framed the technology in a way that makes it extremely useful anyway. The company does this by giving professional translators the last word: the technology doesn’t produce the final document, but a first approximation that an expert human can work on.

This focus on user experience was a key element in the company’s strategy. Translated focused its ML research efforts on solving problems that their translators were experiencing the most, and this led to the development of ModernMT: the first neural network that adapts to the typical translation workflow. This innovation allows translators to avoid fixing their translations multiple times along the same document, saving them time and frustration. Achieving the same performance using standard static neural networks would have required massive effort in terms of research and data.

Let’s talk now about the relationship between AI and replacing humans. As you now know well, modern AI is based on machine learning: a technology capable of learning how to perform tasks from data. This means that AI applications today can handle single well-defined tasks, with a specific output and a clearly defined evaluation metric.

Most human jobs are more complex than this. A job may require performing a series of tasks that, taken individually, may be a fit for ML algorithms--but as a whole require additional skills to be performed properly. Examples of such skills are as follows:

  • Exception handling
  • Empathy
  • Creativity
  • Making decisions based on incomplete data

On the other hand, AI algorithms are usually better than humans at the following:

  • Making decisions based on large amounts of data
  • Speed
  • Objectivity (provided that they’re trained on unbiased data)

If we look at the tasks that humans do to fulfill their jobs, we find a subset that doesn’t require human-specific skills and is a good fit for AI agents. For Translated, this was scanning through resumes. If the company had employed people to scan thousands of resumes and find the best match for a translation, it would have needed a large team making straightforward decisions (for example, discarding the resumes of doctors for translating an airplane manual). This high-level selection mostly requires speed and the ability to process large amounts of information, two things that ML algorithms excel at.

When identifying high-potential ML projects, Trombetti has a simple rule:

Look at the tasks that a company could perform using 1,000 untrained interns. These are tasks that can be automated by AI.

The company knows that selecting a translator for a delicate job requires sensitivity and experience. For this reason, the final call is not left to the algorithm: its output is still vetted by a project manager, who uses soft skills to validate or reject the algorithmic pick.

What about Translated’s AI-aided translation services? Performing a high-quality translation of a document requires empathy and creativity. These two characteristics are the domain of human beings, and ML algorithms are notoriously not a viable solution for such problems. Yet, Translated has successfully used AI for this application as well.

The main reason behind Translated’s success in offering AI-aided translations is that it has never marketed the AI as a replacement for humans. That would have meant overstating AI’s capabilities and delivering a poor product. Instead, the company uses AI while clearly stating that its results aren’t perfect, just good enough to make translators’ jobs easier. This way, translators can direct their attention to the parts of the document that require more empathy and creativity, while leaving to Translated’s AI the task of translating the easiest sentences.

Summary

  • Natural language processing models can understand and classify the meaning of text, and even write something of their own.
  • Today’s NLP technology doesn’t always live up to the hype. Framing complexity in terms of width and depth helps your team understand what’s possible and what’s not.
  • By breaking down and scoping an NLP project, you can work around limitations in state-of-the-art algorithms. Sentiment analysis, text classification, and document search are all examples of these more mature building blocks.
  • A staggered rollout of AI features is a great way to build confidence in NLP models and build an AI-based powerhouse, as you saw in the Translated case study.