Text preprocessing

We start by keeping only words made up entirely of letters, which removes numbers such as 00 and 000 as well as letter-number combinations such as b8f. The filter function is defined as follows:

>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
...

>>> data_cleaned = []
>>> for doc in groups.data:
...     doc_cleaned = ' '.join(word for word in doc.split()
...                            if is_letter_only(word))
...     data_cleaned.append(doc_cleaned)

This produces data_cleaned, a list holding one cleaned string for each document in the newsgroups data.
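
To see what the filter keeps and drops, here is a minimal sketch run on a single made-up string. The sample_doc text is hypothetical and only stands in for one entry of groups.data, and the one-line is_letter_only shown here is an assumed shorthand based on str.isalpha(), which should behave the same as the loop version for the non-empty tokens that split() produces.

```python
def is_letter_only(word):
    # str.isalpha() is True only if the string is non-empty and
    # every character in it is a letter.
    return word.isalpha()

# Hypothetical sample text, standing in for one document in groups.data.
sample_doc = "Call 000 now, ref b8f about the meeting"

# Same join-and-filter pattern as the loop over groups.data.
sample_cleaned = ' '.join(word for word in sample_doc.split()
                          if is_letter_only(word))

print(sample_cleaned)  # Call ref about the meeting
```

Note that the token "now," is dropped along with 000 and b8f: because the text is split on whitespace only, a word with punctuation attached is not letter-only and is filtered out too.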
