Now let's initialize the TF-IDF vectorizer and define a few parameters, such as:
- min_df: when building the vocabulary, ignore terms whose document frequency is strictly lower than this threshold.
- ngram_range: the range of n-gram sizes to extract; for example, (2, 4) captures word sequences of two to four tokens.
- norm: the norm ('l1' or 'l2') used to normalize term vectors.
- encoding: the character encoding used to decode the input text.
There are many more parameters that you can explore and tune.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=0, ngram_range=(2, 4),
                             strip_accents='unicode', norm='l2',
                             encoding='ISO-8859-1')
```
Now we fit the vectorizer on the question corpus.
```python
# Build the TF-IDF matrix for our train data set (the stored questions)
X_train = vectorizer.fit_transform(question_list)

# Next, transform the query sent by the user to the bot (test data);
# transform() expects an iterable of documents, so wrap a single
# query string in a list
X_query = vectorizer.transform([query])
```
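With both the questions and the query in the same vector space, the bot can pick the stored question most similar to the query. A minimal sketch using cosine similarity (the `question_list` and `query` values below are hypothetical stand-ins, and the vectorizer settings are simplified for the toy data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stored questions
question_list = ["how do I reset my password",
                 "what are your opening hours",
                 "how can I contact support"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), strip_accents='unicode',
                             norm='l2')
X_train = vectorizer.fit_transform(question_list)

# A sample user query
query = "reset my password"
X_query = vectorizer.transform([query])

# Cosine similarity between the query and every stored question
scores = cosine_similarity(X_query, X_train).ravel()
best = int(np.argmax(scores))
print(question_list[best])  # how do I reset my password
```

Because the vectors are L2-normalized, the dot product of two rows equals their cosine similarity, so the highest-scoring row is the closest stored question.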