MALLET

The Machine Learning for Language Toolkit (MALLET) is a large library of natural language processing algorithms and utilities. It can be used for a variety of tasks, such as document classification, document clustering, information extraction, and topic modelling. It provides both a command-line interface and a Java API for several algorithms, such as Naive Bayes, hidden Markov models (HMM), latent Dirichlet allocation topic models, logistic regression, and conditional random fields.

MALLET is available under the Common Public License 1.0, which means that you can even use it in commercial applications. It can be downloaded from http://mallet.cs.umass.edu. A MALLET instance is represented by four fields: name, label, data, and source. There are two methods to import data into the MALLET format, as shown in the following list:

  • Instance per file: Each file, that is, each document, corresponds to one instance, and MALLET accepts the name of the directory containing the files as the input.
  • Instance per line: Each line corresponds to one instance and is assumed to have the format instance_name label token, that is, the instance name, then its label, then the tokens that make up the data. The data is converted into a feature vector consisting of the distinct words that appear as tokens and their occurrence counts.
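
To make the instance-per-line layout concrete, the following is a minimal sketch in plain Java that splits one line into a name, a label, and a map of token counts. It does not use the MALLET API; the class InstancePerLineSketch, the holder ParsedInstance, and the method parse are hypothetical names introduced only for illustration, and the whitespace splitting and lowercasing are assumptions rather than a description of how MALLET itself tokenizes input.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of MALLET) that mimics the instance-per-line layout. */
class InstancePerLineSketch {

    /** Plain holder for the parts recovered from one input line. */
    static final class ParsedInstance {
        final String name;
        final String label;
        final Map<String, Integer> counts; // distinct token -> occurrence count

        ParsedInstance(String name, String label, Map<String, Integer> counts) {
            this.name = name;
            this.label = label;
            this.counts = counts;
        }
    }

    /** Splits "name label token token ..." on whitespace and counts the tokens. */
    static ParsedInstance parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "Expected at least a name and a label: " + line);
        }
        // LinkedHashMap keeps tokens in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 2; i < parts.length; i++) {
            // Assumption: tokens are lowercased so "Buy" and "buy" share one feature.
            counts.merge(parts[i].toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return new ParsedInstance(parts[0], parts[1], counts);
    }

    public static void main(String[] args) {
        ParsedInstance inst = parse("doc1 spam buy cheap pills buy now");
        System.out.println(inst.name + " " + inst.label + " " + inst.counts);
    }
}
```

For the sample line, the sketch yields the name doc1, the label spam, and the counts buy=2, cheap=1, pills=1, now=1. A real feature vector would additionally map each distinct word to an integer index in a shared vocabulary, which this sketch leaves out.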

The library comprises the following packages:

  • cc.mallet.classify: These are algorithms for training and classifying instances, including AdaBoost, bagging, C4.5, as well as other decision tree models, multivariate logistic regression, Naive Bayes, and Winnow2.
  • cc.mallet.cluster: These are unsupervised clustering algorithms, including greedy agglomerative, hill climbing, k-best, and k-means clustering.
  • cc.mallet.extract: This implements tokenizers, document extractors, document viewers, cleaners, and so on.
  • cc.mallet.fst: This implements sequence models, including conditional random fields, HMM, maximum entropy Markov models, and corresponding algorithms and evaluators.
  • cc.mallet.grmm: This implements graphical models and factor graphs, together with algorithms for inference, learning, and testing, for example, loopy belief propagation and Gibbs sampling.
  • cc.mallet.optimize: These are optimization algorithms for finding the maximum of a function, such as gradient ascent, limited-memory BFGS, stochastic meta ascent, and so on.
  • cc.mallet.pipe: These are pipes, that is, processing steps that can be chained into a pipeline to convert raw data into MALLET instances.
  • cc.mallet.topics: These are topic modelling algorithms, such as latent Dirichlet allocation, four-level pachinko allocation (PAM), hierarchical PAM, Dirichlet-multinomial regression (DMR) topic models, and so on.
  • cc.mallet.types: This implements fundamental data types such as dataset, feature vector, instance, and label.
  • cc.mallet.util: These are miscellaneous utility functions such as command-line processing, search, math, test, and so on.