Normalizing a string

First, the word needs to be normalized into NFKC form. Normalization was introduced in Chapter 2, Linear Regression–House Price Prediction, where I noted that LingSpam already provides mostly normalized datasets. Real-world data, such as Twitter data, is often dirty, so we need to normalize it ourselves in order to compare strings on an apples-to-apples basis.

To show this, let's write a side program:

 package main

 import (
 	"fmt"
 	"unicode"

 	"golang.org/x/text/transform"
 	"golang.org/x/text/unicode/norm"
 )

 func isMn(r rune) bool { return unicode.Is(unicode.Mn, r) }

 func main() {
 	str1 := "cafe"
 	str2 := "café"
 	str3 := "cafe\u0301" // "cafe" followed by a combining acute accent
 	fmt.Println(str1 == str2) // false
 	fmt.Println(str2 == str3) // false

 	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFKC)
 	str1a, _, _ := transform.String(t, str1)
 	str2a, _, _ := transform.String(t, str2)
 	str3a, _, _ := transform.String(t, str3)
 	fmt.Println(str1a == str2a) // true
 	fmt.Println(str2a == str3a) // true
 }

The first thing to note is that there are at least three ways of writing the word café, which for the purposes of this demonstration means coffee shop. It's clear from the first two comparisons that the words are not the same. But since they mean the same thing, a comparison should return true.
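To see why the raw comparisons fail, it helps to inspect the underlying bytes. This small stdlib-only sketch prints the UTF-8 encoding of the precomposed form (é as the single code point U+00E9) and the combining form (e followed by U+0301):

```go
package main

import "fmt"

func main() {
	precomposed := "café"     // é is U+00E9, a single code point
	combining := "cafe\u0301" // 'e' followed by U+0301, COMBINING ACUTE ACCENT

	fmt.Printf("% x\n", precomposed) // 63 61 66 c3 a9
	fmt.Printf("% x\n", combining)   // 63 61 66 65 cc 81

	// == compares byte sequences, so visually identical strings differ.
	fmt.Println(precomposed == combining) // false
}
```

The two strings render identically on screen, but their byte sequences differ, which is why Go's `==` operator reports them as unequal.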

To do that, we need to transform all the text into one form, and then compare it. We do so by defining a transformer:

t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFKC)

This transformer is a chain of text transformers, applied one after another.

First, we convert all the text to its decomposed form, NFD. This turns café into cafe\u0301.

Then, we remove any non-spacing marks (Unicode category Mn). This turns cafe\u0301 into cafe. The removal is driven by the isMn function, defined as follows:

func isMn(r rune) bool { return unicode.Is(unicode.Mn, r) }

Lastly, we convert everything to NFKC form for maximum compatibility and space savings. After this chain, all three strings are equal.

Note that this type of comparison rests on one underlying assumption: that we're doing our comparisons in a single language, English. Café in French means coffee as well as coffee shop. This kind of normalization, where we remove diacritical marks, works only so long as removing a diacritic does not change the meaning of the word. We'd have to be more careful with normalization when dealing with multiple languages, but for this project it's a good enough assumption.

With this new knowledge, we will need to update our processor type:

 type processor struct {
 	tfidf       *tfidf.TFIDF
 	corpus      *corpus.Corpus
 	transformer transform.Transformer
 	locations   map[string]int
 	locCount    int
 }

 func newProcessor() *processor {
 	c, err := corpus.Construct(corpus.WithWords([]string{mention, hashtag, retweet, url}))
 	dieIfErr(err)
 	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFKC)
 	return &processor{
 		tfidf:       tfidf.New(),
 		corpus:      c,
 		transformer: t,
 		locations:   make(map[string]int),
 	}
 }

The first line of our p.single function would have to change too, from this:

 func (p *processor) single(a string) (wordID int, ok bool) {
 	word := strings.ToLower(a)

It will change to this:

 func (p *processor) single(a string) (wordID int, ok bool) {
 	word, _, err := transform.String(p.transformer, a)
 	dieIfErr(err)
 	word = strings.ToLower(word)
If you're feeling extra industrious, try making strings.ToLower into a transform.Transformer. It's harder than it looks, but not unreasonably so.
