Building a data matrix using pandas

Now it's time to introduce another essential data science tool that comes preinstalled with the Anaconda Python distribution: pandas. pandas is built on NumPy and provides several useful tools and methods for working with data structures in Python. Just as we generally import NumPy under the alias np, it is common to import pandas under the alias pd:

In [6]: import pandas as pd

pandas provides a useful data structure called a DataFrame, which can be understood as a generalization of a 2D NumPy array with labeled rows and columns, as shown here:

In [7]: pd.DataFrame({
...         'model': [
...             'Normal Bayes',
...             'Multinomial Bayes',
...             'Bernoulli Bayes'
...         ],
...         'class': [
...             'cv2.ml.NormalBayesClassifier_create()',
...             'sklearn.naive_bayes.MultinomialNB()',
...             'sklearn.naive_bayes.BernoulliNB()'
...         ]
...     })

The output of the cell is a table with three rows, indexed 0 to 2, and two named columns, model and class.
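
To make the comparison with a 2D NumPy array concrete, here is a minimal sketch that rebuilds the same DataFrame and inspects it in plain Python rather than through notebook rendering. The variable names df and values are chosen for illustration only and do not appear in the original cells:

```python
import numpy as np
import pandas as pd

# Same two columns as in the cell above.
df = pd.DataFrame({
    'model': ['Normal Bayes', 'Multinomial Bayes', 'Bernoulli Bayes'],
    'class': [
        'cv2.ml.NormalBayesClassifier_create()',
        'sklearn.naive_bayes.MultinomialNB()',
        'sklearn.naive_bayes.BernoulliNB()',
    ],
})

print(df.shape)            # (rows, columns), just like a 2D NumPy array
print(list(df.columns))    # columns carry names instead of integer positions
print(df.loc[1, 'model'])  # select a cell by row label and column name

# The underlying values can be viewed as a plain 2D NumPy array.
values = df.to_numpy()
print(type(values), values.shape)
```

Unlike a bare array, the DataFrame keeps the row index and column names attached to the data, which is what lets us address a cell as df.loc[1, 'model'].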

We can combine the preceding functions to build a pandas DataFrame from the extracted data:

In [8]: def build_data_frame(extractdir, classification):
...         rows = []
...         index = []
...         for file_name, text in read_files(extractdir):
...             rows.append({'text': text, 'class': classification})
...             index.append(file_name)
...
...         data_frame = pd.DataFrame(rows, index=index)
...         return data_frame

We then call it with the following command:

In [9]: data = pd.DataFrame({'text': [], 'class': []})
...     for source, classification in sources:
...         extractdir = '%s/%s' % (datadir, source[:-7])
...         # DataFrame.append was removed in pandas 2.0; use pd.concat
...         data = pd.concat([data, build_data_frame(extractdir,
...                                                  classification)])
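
The pattern can also be tried end to end without the real dataset. The following is only a sketch under stated assumptions: read_files here is a hypothetical stand-in that yields made-up (file_name, text) pairs, and the directory names and labels in sources are invented for illustration, so neither reflects the actual helper or data used in this chapter:

```python
import pandas as pd


def read_files(extractdir):
    """Hypothetical stand-in: yields (file_name, text) pairs."""
    yield extractdir + '/msg1.txt', 'first message in ' + extractdir
    yield extractdir + '/msg2.txt', 'second message in ' + extractdir


def build_data_frame(extractdir, classification):
    # One row per file, indexed by file name, as in the cell above.
    rows = []
    index = []
    for file_name, text in read_files(extractdir):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)
    return pd.DataFrame(rows, index=index)


# Invented (directory, label) pairs, for illustration only.
sources = [('ham_dir', 0), ('spam_dir', 1)]

# Build one DataFrame per source, then concatenate them in a single call.
frames = [build_data_frame(extractdir, label)
          for extractdir, label in sources]
data = pd.concat(frames)

print(data.shape)
print(data['class'].value_counts())
```

Collecting the per-source frames in a list and calling pd.concat once avoids copying the accumulated data on every loop iteration, which can matter when there are many sources.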