In the earlier parts of the chapter, we sketched out a dummy Classifier type that does nothing. Let's make it do something now:
type Classifier struct {
	corpus *corpus.Corpus
	tfidfs [MAXCLASS]*tfidf.TFIDF
	totals [MAXCLASS]float64
	ready  bool
	sync.Mutex
}
This struct introduces a few new things. Let's walk through them one by one:
- We'll start with the corpus.Corpus type.
- This type is imported from the corpus package, a subpackage of lingo, an NLP library for Go.
- To install lingo, run go get -u github.com/chewxy/lingo/... (the trailing ... is part of the command; it tells go get to fetch every subpackage).
- To use the corpus package, import it like so: import "github.com/chewxy/lingo/corpus".
Bear in mind that in the near future, the package will change to github.com/go-nlp/lingo. If you are reading this after January 2019, use the new address.
A corpus.Corpus object maps each word to an integer ID. There are two reasons for doing this:
- It saves on memory: A []int uses considerably less memory than a []string. Once a corpus has been converted to IDs, the memory for the strings can be freed. The purpose of this is to provide an alternative to string interning.
- String interning is fickle: String interning is a technique in which only one copy of each distinct string exists in the entire program's memory. Doing this reliably turns out to be harder than expected for most tasks. Integer IDs provide a more stable alternative.
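To make the word-to-integer idea concrete, here is a minimal sketch of such a mapping. The wordIDs type and its add, word, and encode methods are hypothetical names invented for this illustration; this is not lingo's implementation or API, only an assumption about how a mapping of this kind could be built with a map and a slice.

```go
package main

import "fmt"

// wordIDs is a hypothetical, minimal stand-in for a corpus: it assigns each
// distinct word a small integer ID. It is not lingo's implementation.
type wordIDs struct {
	ids   map[string]int // word -> ID
	words []string       // ID -> word; each distinct word is stored once
}

func newWordIDs() *wordIDs {
	return &wordIDs{ids: make(map[string]int)}
}

// add returns the ID for word, assigning the next free ID the first time
// the word is seen.
func (w *wordIDs) add(word string) int {
	if id, ok := w.ids[word]; ok {
		return id
	}
	id := len(w.words)
	w.ids[word] = id
	w.words = append(w.words, word)
	return id
}

// word returns the word for id, and false if that ID was never assigned.
func (w *wordIDs) word(id int) (string, bool) {
	if id < 0 || id >= len(w.words) {
		return "", false
	}
	return w.words[id], true
}

// encode converts a tokenized document into a slice of IDs.
func (w *wordIDs) encode(tokens []string) []int {
	out := make([]int, 0, len(tokens))
	for _, t := range tokens {
		out = append(out, w.add(t))
	}
	return out
}

func main() {
	c := newWordIDs()
	doc := c.encode([]string{"buy", "cheap", "pills", "buy", "now"})
	fmt.Println(doc) // [0 1 2 0 3]

	w, _ := c.word(2)
	fmt.Println(w) // pills
}
```

Note that the repeated word "buy" maps to the same ID both times, so a document becomes a compact []int while each distinct string is kept only once, in the mapping itself.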
Next, we are faced with two fields that are arrays: tfidfs [MAXCLASS]*tfidf.TFIDF and totals [MAXCLASS]float64. At this point, it might be a good idea to talk about the Class type.