Creating Vectorizer

Now let's initialize the TF-IDF vectorizer and define a few parameters:

  • min_df:  When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
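  • ngram_range: The lower and upper bounds on n-gram size; (2, 4) makes the vectorizer capture phrases of two to four words at a time.
  • norm: The norm used to normalize term vectors, either L1 or L2.
  • encoding: The encoding used to decode the input when it is given as bytes or files, so that non-ASCII characters are handled correctly.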

There are many more parameters that you can explore and configure.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

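# min_df=1 keeps every term, like min_df=0; newer scikit-learn releases may reject an integer min_df of 0
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(2, 4), strip_accents='unicode', norm='l2', encoding='ISO-8859-1')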

Now we fit the vectorizer on the questions.

# We create an array for our train data set (questions)
X_train = vectorizer.fit_transform(np.array([''.join(que) for que in question_list]))

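# The next step is to transform the query sent by the user to the bot (test data).
# transform() expects an iterable of strings, so a single string should be passed as [query].
X_query = vectorizer.transform(query)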
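
The fit and transform steps above can be sketched end to end on toy data. The variables sample_questions and sample_query are hypothetical stand-ins for question_list and query, whose real contents are not shown here; the sketch only assumes that the questions are plain strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for question_list and query
sample_questions = [
    "what are your opening hours",
    "where is your office located",
    "how can I contact support",
]
sample_query = "what are your opening hours today"

sketch_vectorizer = TfidfVectorizer(min_df=1, ngram_range=(2, 4), norm='l2')

# fit_transform learns the vocabulary and idf weights from the questions
X_sample_train = sketch_vectorizer.fit_transform(sample_questions)

# transform reuses the learned vocabulary; a single string is wrapped in a list
X_sample_query = sketch_vectorizer.transform([sample_query])

# Both matrices share the same number of columns (one per learned n-gram)
print(X_sample_train.shape)
print(X_sample_query.shape)
```

Because the query is projected onto the vocabulary learned from the questions, n-grams that appear only in the query (here, anything containing "today") are ignored.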