Data massage

When we tested that the data structure made sense, we printed the FullText field. Because we wish to cluster tweets by their content, the FullText field of the struct is what matters to us. Later in the chapter, we will see how the metadata of the tweets, such as location, can help cluster the tweets better.

As mentioned in the previous sections, each individual tweet needs to be represented as a coordinate in some higher-dimensional space. Thus, our goal is to take all the tweets in a timeline and preprocess them in such a way that we can get this output table:

| Tweet ID | twitter | test | right | wrong |
|:--------:|:------:|:----:|:----:|:---:|
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 |

Each row in the table represents a tweet, indexed by the tweet ID. Each of the remaining columns corresponds to a word, named in the column header: a 1 means the word appears in that tweet, and a 0 means it does not. So, in the first row, test appears in the tweet, while twitter, right, and wrong do not. The slice of numbers [0 1 0 0] in the first row is the input we require for the clustering algorithms.
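
To make the table concrete, here is a minimal sketch of how such rows could be built. The `presence` function and the hard-coded vocabulary are illustrative assumptions, not the code used later in the chapter, and the tweets are assumed to be already split into words.

```go
package main

import "fmt"

// presence returns one entry per vocabulary word:
// 1 if the word occurs among the tweet's tokens, 0 otherwise.
// (Hypothetical helper, written only for this illustration.)
func presence(vocab, tokens []string) []float64 {
	seen := make(map[string]bool, len(tokens))
	for _, t := range tokens {
		seen[t] = true
	}
	vec := make([]float64, len(vocab))
	for i, w := range vocab {
		if seen[w] {
			vec[i] = 1
		}
	}
	return vec
}

func main() {
	// The column headers of the table, in order.
	vocab := []string{"twitter", "test", "right", "wrong"}
	// Three already-tokenized tweets matching the three rows of the table.
	tweets := [][]string{
		{"test"},
		{"twitter"},
		{"right", "wrong"},
	}
	for i, tokens := range tweets {
		fmt.Println(i+1, presence(vocab, tokens))
	}
	// Prints:
	// 1 [0 1 0 0]
	// 2 [1 0 0 0]
	// 3 [0 0 1 1]
}
```

Each printed slice is one point in a four-dimensional space, which is the form the clustering algorithms expect.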

Of course, binary values indicating the presence of a word in a tweet aren't the best representation. It would be more informative to use the relative importance of each word instead. For this, we again turn to the familiar TF-IDF, first introduced in Chapter 2, Linear Regression – House Price Prediction. More advanced techniques, such as word embeddings, exist, but you'd be surprised how well something as simple as TF-IDF can perform.
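
As a reminder of what TF-IDF computes, the following is a hand-rolled sketch. It assumes one common variant (term frequency as count divided by tweet length, inverse document frequency as the natural log of the number of tweets over the number of tweets containing the word); other variants exist, and this is not necessarily the implementation the chapter relies on. The `tfidf` function and the sample tweets are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf converts tokenized documents into rows of TF-IDF weights,
// with one column per vocabulary word.
// Assumed variant: tf = count/len(doc), idf = ln(N/df).
func tfidf(vocab []string, docs [][]string) [][]float64 {
	// df[w] is the number of documents that contain w at least once.
	df := make(map[string]int)
	for _, doc := range docs {
		seen := make(map[string]bool)
		for _, w := range doc {
			if !seen[w] {
				seen[w] = true
				df[w]++
			}
		}
	}

	n := float64(len(docs))
	rows := make([][]float64, len(docs))
	for i, doc := range docs {
		counts := make(map[string]int)
		for _, w := range doc {
			counts[w]++
		}
		row := make([]float64, len(vocab))
		for j, w := range vocab {
			if counts[w] == 0 {
				continue // word absent from this document: weight stays 0
			}
			tf := float64(counts[w]) / float64(len(doc))
			idf := math.Log(n / float64(df[w]))
			row[j] = tf * idf
		}
		rows[i] = row
	}
	return rows
}

func main() {
	vocab := []string{"twitter", "test", "right", "wrong"}
	docs := [][]string{
		{"test", "twitter"},
		{"twitter"},
		{"right", "wrong"},
	}
	for i, row := range tfidf(vocab, docs) {
		fmt.Println(i+1, row)
	}
}
```

In the first tweet, test receives a higher weight than twitter because twitter also appears in the second tweet, so it is less distinctive. A word that appears in every tweet would receive a weight of 0 under this variant.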

By now, the process should be familiar: we want to represent the text as a slice of numbers, not as a slice of bytes. To do so, we need some sort of dictionary that converts the words in the text into IDs. From there, we can build the table.
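
One simple way such a dictionary could look is sketched below. The `dictionary` type and its `id` method are hypothetical names chosen for this illustration; the idea is only that each new word gets the next free integer, and a word seen before gets the same ID back.

```go
package main

import "fmt"

// dictionary maps words to integer IDs and back.
// (Hypothetical type, written only for this illustration.)
type dictionary struct {
	ids   map[string]int // word -> ID
	words []string       // ID -> word
}

func newDictionary() *dictionary {
	return &dictionary{ids: make(map[string]int)}
}

// id returns the ID of word, assigning the next free ID if the word is new.
func (d *dictionary) id(word string) int {
	if id, ok := d.ids[word]; ok {
		return id
	}
	id := len(d.words)
	d.ids[word] = id
	d.words = append(d.words, word)
	return id
}

func main() {
	d := newDictionary()
	tokens := []string{"test", "twitter", "test", "right"}
	ids := make([]int, len(tokens))
	for i, t := range tokens {
		ids[i] = d.id(t)
	}
	fmt.Println(ids)     // [0 1 0 2]
	fmt.Println(d.words) // [test twitter right]
}
```

The IDs double as column indices: once every tweet has been passed through the dictionary, `len(d.words)` is the number of columns in the table.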

Again, as in Chapter 2, Linear Regression – House Price Prediction, we shall approach this with a simple tokenization strategy. More advanced tokenizers are nice, but not necessary for our purpose. Instead, we'll rely on good old strings.Fields.
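
As a quick illustration, `strings.Fields` splits a string around runs of whitespace and drops empty fields. The `tokenize` wrapper below is a hypothetical helper, and lowercasing the text first is an assumption made here so that Twitter and twitter map to the same word.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases the text and splits it on whitespace.
// (Hypothetical helper; the lowercasing step is an assumption.)
func tokenize(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

func main() {
	fmt.Printf("%q\n", tokenize("Testing the  Twitter API,\tright?"))
	// Prints: ["testing" "the" "twitter" "api," "right?"]
}
```

Note that punctuation stays attached to the words ("api," and "right?"), which is one of the trade-offs of such a simple strategy.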
