Creating Vectorizer

Now let's initialize the TF-IDF vectorizer and set a few of its parameters:

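  • min_df: When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold.
  • ngram_range: The lower and upper bounds on the size of the word n-grams to extract; (2, 4) captures sequences of two to four words.
  • norm: The norm used to normalize the term vectors, either 'l1' or 'l2'.
  • encoding: The encoding used to decode the input when it is given as bytes or files, so that non-ASCII characters are read correctly.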

TfidfVectorizer accepts many other parameters that are worth exploring and experimenting with.
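
To make the ngram_range setting concrete, here is a minimal sketch that lists the n-grams a vectorizer configured with ngram_range=(2, 4) extracts from a single sentence. The sample sentence is made up for illustration and is not part of the original data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample sentence, used only for illustration.
sample = "Where is my order"

# build_analyzer() returns the callable that lowercases, tokenizes,
# and assembles the word n-grams; no fitting is needed to use it.
analyzer = TfidfVectorizer(ngram_range=(2, 4)).build_analyzer()
ngrams = analyzer(sample)

for gram in ngrams:
    print(gram)
```

Because the lower bound is 2, no single words appear in the result, so a one-word question would produce no features at all with this setting. The default tokenizer also drops one-character tokens such as "I" before the n-grams are built.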

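import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0.
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(2, 4), strip_accents='unicode', norm='l2', encoding='ISO-8859-1')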

Now we fit the vectorizer on the questions, which builds the vocabulary and the TF-IDF weights.

# Fit the vectorizer on the training data (the questions) and build their TF-IDF matrix
X_train = vectorizer.fit_transform(np.array([''.join(que) for que in question_list]))
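
As a quick check of what fit_transform returns, the sketch below fits a vectorizer on a tiny stand-in corpus and inspects the result. The three questions are hypothetical placeholders for question_list, and only the ngram_range and norm settings are carried over from the text above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for question_list.
question_list = [
    "where is my order",
    "how can I track my order",
    "what is your return policy",
]

vectorizer = TfidfVectorizer(ngram_range=(2, 4), norm="l2")
X_train = vectorizer.fit_transform(question_list)

# One row per question, one column per n-gram in the learned vocabulary.
print(X_train.shape)
print(len(vectorizer.vocabulary_))

# With norm='l2', every non-empty row has unit Euclidean length.
# toarray() is fine for a toy corpus but not for a large one.
row_norms = np.linalg.norm(X_train.toarray(), axis=1)
print(row_norms)
```

The result is a sparse matrix, so it stays compact even when the n-gram vocabulary grows large.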

# The next step is to transform the query sent by the user to the bot (test data)
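
The original code for this step is not available here, so the following is only a sketch of one common way to do it, not the author's implementation. It assumes the query is a plain string, reuses the fitted vectorizer through transform() so the query is mapped onto the training vocabulary, and then ranks the questions by cosine similarity. The corpus, the query, and the names X_query, scores, and best_idx are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for question_list.
question_list = [
    "where is my order",
    "how can I track my order",
    "what is your return policy",
]

vectorizer = TfidfVectorizer(ngram_range=(2, 4), norm="l2")
X_train = vectorizer.fit_transform(question_list)

# Hypothetical user query. Use transform(), not fit_transform(),
# so the vocabulary learned from the questions is left unchanged.
query = "track my order"
X_query = vectorizer.transform([query])

# Assumption: pick the closest stored question by cosine similarity.
scores = cosine_similarity(X_query, X_train).ravel()
best_idx = int(scores.argmax())
print(question_list[best_idx])
```

N-grams in the query that never occurred in the training questions are ignored by transform(), so a query with no known n-grams yields an all-zero vector and a similarity of zero to every question.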