Introduction

In the previous chapter, we saw how to apply ConvNets to images. In this chapter, we will apply similar ideas to texts.

What do a text and an image have in common? At first glance, very little. However, if we represent sentences or documents as a matrix, then this matrix is no different from an image matrix where each cell is a pixel. So, the next question is, how can we represent a text as a matrix? Well, it is pretty simple: each row of the matrix is a vector that represents a basic unit of the text. Of course, now we need to define what a basic unit is. A simple choice is to say that the basic unit is a character. Another choice is to say that the basic unit is a word. Yet another choice is to aggregate similar words together and then denote each aggregation (sometimes called a cluster or an embedding) with a representative symbol.
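As a rough illustration of the first two choices, the sketch below splits one sentence into character units and into word units. The function names char_units and word_units are hypothetical, and splitting on whitespace is a simplifying assumption rather than a full tokenizer.

```python
def char_units(text):
    """Treat every character (including spaces) as a basic unit."""
    return list(text)


def word_units(text):
    """Treat every whitespace-separated, lowercased token as a basic unit."""
    return text.lower().split()


sentence = "ConvNets read text"
print(char_units(sentence)[:8])  # the first eight character units
print(word_units(sentence))      # the word units
```

The same sentence produces many more character units than word units, which is one of the trade-offs to keep in mind when picking a basic unit.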

Note that regardless of the specific choice of basic unit, we need a one-to-one mapping from basic units to integer IDs so that a text can be seen as a matrix. For instance, if we have a document with 10 lines of text and each line is represented by a 100-dimensional embedding, then we will represent our text with a 10 x 100 matrix. In this very particular image, the pixel at (x, y) is turned on if sentence x contains the embedding represented by position y. You might also notice that a text is not really a matrix but more of a vector, because two words located in adjacent rows of text have very little in common. This is a major difference from images, where two pixels located in adjacent columns most likely have some correlation.
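A minimal sketch of this idea, assuming words as the basic unit and a tiny two-sentence document: build_vocabulary and to_binary_matrix are hypothetical helper names, and the IDs are simply assigned in order of first appearance.

```python
def build_vocabulary(sentences):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def to_binary_matrix(sentences, vocab):
    """Row x is a sentence; column y is 1 if the sentence contains word ID y."""
    matrix = [[0] * len(vocab) for _ in sentences]
    for x, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            y = vocab.get(word)
            if y is not None:  # skip words outside the vocabulary
                matrix[x][y] = 1
    return matrix


sentences = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(sentences)
matrix = to_binary_matrix(sentences, vocab)
for row in matrix:
    print(row)
```

With 10 sentences and a vocabulary of 100 units, the same code would yield the 10 x 100 matrix described in the text.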

Now you might wonder: I understand that we represent the text as a vector but, in doing so, we lose the position of the words, and this position should be important, shouldn't it?

Well, it turns out that in many real applications, knowing whether or not a sentence contains a particular basic unit (a character, a word, or an aggregate) is quite informative, even if we don't record exactly where in the sentence this basic unit is located.
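To make concrete what is kept and what is lost, the sketch below encodes two sentences that use the same words in a different order. The helper presence_vector and the three-word vocabulary are hypothetical and chosen only for illustration.

```python
def presence_vector(sentence, vocab):
    """Mark which vocabulary words occur in the sentence, ignoring position."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector


vocab = {"dog": 0, "bites": 1, "man": 2}
first = presence_vector("dog bites man", vocab)
second = presence_vector("man bites dog", vocab)
print(first, second, first == second)
```

Both sentences map to the same vector: the representation records which units are present but not where they occur.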
