In Chapter 7, we learned how to import a generic file using basic functions. Here we explore pandas—one of the most important libraries for dataset management.
Libraries for Data Mining
pandas: imports, manages, and manipulates data frames in various formats, extracts part of the data, combines two data frames, and also contains some basic statistical functions
NumPy: a package for scientific computing that provides high-level mathematical and algebraic functions, random number generation, and the creation of arrays
Matplotlib: a library that allows the creation of charts from datasets
SciPy: contains more than 60 statistical functions related to mathematical and statistical analysis
scikit-learn: the most important tool for machine learning and data analysis.
pandas
In this chapter, we move away from discussions of Python structures and start to look at the most important data mining packages, beginning with the pandas library. With pandas, we can:
Read and import structured data
Organize and manipulate them
Calculate some basic statistics
As we shall see, pandas, NumPy, SciPy, and Matplotlib typically work together. For the sake of clarity, I present them separately at first.
pandas: Series
In this case, when we call up or create one of these two structures, we need not specify “pd” at the beginning.
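A minimal sketch of what this looks like in practice, assuming the two structures in question are Series and DataFrame and that they are imported by name. The labels and values are made up for illustration.

```python
import pandas as pd
from pandas import Series, DataFrame  # import the two structures by name

# With the direct import, no "pd." prefix is needed.
ages = Series([25, 31, 47], index=["anna", "bruno", "carla"])  # hypothetical data

# The prefixed form builds an equivalent object.
ages_prefixed = pd.Series([25, 31, 47], index=["anna", "bruno", "carla"])

# A Series element can be reached by label or by position.
first_by_label = ages["anna"]
first_by_position = ages.iloc[0]

# DataFrame is likewise available without the prefix.
people = DataFrame({"age": ages})
```

Both spellings create the same object; the direct import only saves typing.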
pandas: Data Frames
.loc: selects by index label
.iloc: selects by integer position
.ix: took both into account; it was deprecated and then removed in pandas 1.0, so current code uses .loc or .iloc instead
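The difference between label-based and position-based selection can be sketched on a small, made-up data frame. The sketch uses only .loc and .iloc, so it runs on current pandas versions; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with string labels as the index.
df = pd.DataFrame(
    {"item": ["pen", "book", "lamp"], "price": [1.5, 12.0, 30.0]},
    index=["a", "b", "c"],
)

# .loc uses the index label and the column name.
by_label = df.loc["b", "item"]

# .iloc uses integer positions (row 1, column 0).
by_position = df.iloc[1, 0]

# Label slices include both endpoints; position slices exclude the end.
rows_by_label = df.loc["a":"b"]
rows_by_position = df.iloc[0:2]
```

Here both single-cell lookups reach the same cell, and both slices return the first two rows.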
pandas: Importing and Exporting Data
The pd.read_csv() function can also be used to import files from the Web. The UCI machine learning repository ( https://archive.ics.uci.edu/ml/index.php ) features several data sets used for machine learning.
filepath the address of the file (on the computer or externally)
sep = ',' the separator dividing data, such as a comma or semicolon
dtype= a means of specifying column format
header= the row that holds the names of the variables, if any (header=None when the file has no such row)
skiprows= a means of importing only part of the file; for example, skiprows = 50 skips the first 50 lines and starts reading at the 51st
index_col= a means of setting a column as a data index
skip_blank_lines= a means of removing any blank lines in the dataset
na_filter= a means of detecting missing-value markers in the dataset; if set to False, no detection takes place and the values are kept exactly as read (nothing is removed)
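The arguments listed above can be tried out without a file on disk. The following is a minimal sketch that reads a small in-memory text through io.StringIO; the data, the column names, and the semicolon separator are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with a blank line and a missing score.
raw = (
    "id;name;score\n"
    "1;anna;7.5\n"
    "\n"
    "2;bruno;NA\n"
    "3;carla;9.0\n"
)

df = pd.read_csv(
    io.StringIO(raw),       # stands in for a file path or URL
    sep=";",                # separator used in this data
    header=0,               # variable names are on the first line
    index_col="id",         # use the id column as the index
    dtype={"name": str},    # force a column format
    skip_blank_lines=True,  # ignore the empty line
)

# With na_filter=False the "NA" marker is kept as plain text.
unfiltered = pd.read_csv(io.StringIO(raw), sep=";", index_col="id", na_filter=False)

# skiprows=2 drops the first two lines (the header and the first case).
tail = pd.read_csv(io.StringIO(raw), sep=";", skiprows=2, header=None)
```

In the first call the "NA" entry becomes a missing value; in the second it stays a string, which also prevents the score column from being read as numeric.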
The counterpart function is df.to_csv("filename"), which writes a .csv file to our working directory. We can specify the index=False argument to avoid writing the index together with the data.
As an alternative to this file import function, we can use the pd.read_table(filepath, sep=',') function, which takes both the file path (filepath) and the separator (a tab by default).
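Export and re-import can be sketched as a round trip. The example below writes to a temporary directory rather than the working directory so that it leaves nothing behind; the file name and data are hypothetical.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"name": ["anna", "bruno"], "score": [7.5, 8.0]})  # made-up data

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "scores.csv")  # hypothetical file name

    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)

    # Inspect the first line of the file: only the column names appear.
    with open(path) as handle:
        header_line = handle.readline().strip()

    # read_table reads the same file once the separator is given.
    restored = pd.read_table(path, sep=",")
```

Without index=False, the file would start with an extra unnamed column holding the index values.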
To import an Excel file with pd.read_excel(), we must specify not only the address of the file but also, if the Excel file features more than one data sheet, the name of the sheet from which we want to read data. As with .csv, for Excel formats we also have a method that allows us to write an Excel file in the work directory of our computer: df.to_excel(). The pandas package also contains a function for reading files in JSON, pd.read_json(), and allows us to access Web data via the pd.read_html(url) function and to convert a data frame to an HTML table via df.to_html().
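A minimal sketch of the JSON and HTML conversions, using in-memory text so that no file or network access is needed. The data is made up. The Excel calls are shown only as comments because they depend on an additional Excel engine (for example openpyxl) being installed; the file and sheet names in them are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["anna", "bruno"], "score": [10, 20]})  # made-up data

# Data frame -> JSON text -> data frame.
json_text = df.to_json(orient="records")
restored = pd.read_json(io.StringIO(json_text))

# Data frame -> HTML table (a method of the data frame itself).
html_table = df.to_html(index=False)

# Excel follows the same pattern but needs an Excel engine installed:
# df.to_excel("scores.xlsx", sheet_name="results", index=False)
# pd.read_excel("scores.xlsx", sheet_name="results")
```

pd.read_html(url) works in the opposite direction of to_html and returns a list of data frames, one per table found on the page; it also relies on an extra HTML parser package.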
pandas: Data Manipulation
In parentheses (n=2), we inserted the number of cases to be extracted. How can we always extract the same cases randomly, so that extraction can be repeated? We use the np.random.seed() function. The number included in parentheses in this function does not really matter, but if two people use the same dataset and use the same number, they will extract the same cases.
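Repeatable extraction can be sketched as follows, assuming the cases are drawn with the data frame's sample() method. The data frame and the seed value 123 are arbitrary choices for illustration; the random_state argument is shown as an alternative that does not touch NumPy's global seed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"case": range(1, 11)})  # ten hypothetical cases

# Setting the same seed before each draw gives the same two cases.
np.random.seed(123)
first_draw = df.sample(n=2)

np.random.seed(123)
second_draw = df.sample(n=2)

# Alternative: pass the seed directly to sample().
third_draw = df.sample(n=2, random_state=123)
fourth_draw = df.sample(n=2, random_state=123)
```

Passing random_state keeps the seed local to a single call, which is convenient when other parts of a script also draw random numbers.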
pandas: Missing Values
pandas: Merging Two Datasets
When we use 'outer', we merge the cases of the two data frames, keeping a single copy of the cases that appear in both, without losing data from either of the two initial datasets.
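The behavior of an outer merge can be sketched with two small, made-up data frames that share an id column. The column names and values are hypothetical; an inner merge is included only for comparison.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["anna", "bruno", "carla"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [8.0, 9.0, 6.5]})

# 'outer' keeps every id found in either data frame, once each.
merged = pd.merge(left, right, on="id", how="outer")

# 'inner' keeps only the ids found in both.
common = pd.merge(left, right, on="id", how="inner")
```

In the outer result, ids 2 and 3 appear once with both name and score filled in, while id 1 has a missing score and id 4 has a missing name.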
pandas: Basic Statistics
Statistical Methods
Method | Description |
---|---|
.describe | Provides some descriptive statistics |
.count | Returns the number of values per variable |
.mean | Returns the average |
.median | Returns the median |
.mode | Returns the mode |
.min | Returns the lowest value |
.max | Returns the highest value |
.std | Returns the standard deviation |
.var | Returns the variance |
.skew | Returns skewness |
.kurt | Returns kurtosis |
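The methods in the table can be tried on a short numeric Series. This is a minimal sketch with made-up values; the same methods are also available on a whole data frame, where they work column by column.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 6, 9])  # hypothetical measurements

summary = values.describe()   # count, mean, std, min, quartiles, max
n = values.count()            # number of non-missing values
average = values.mean()
middle = values.median()
most_common = values.mode()   # a Series, since several modes are possible
lowest = values.min()
highest = values.max()
spread = values.std()         # sample standard deviation (n - 1 denominator)
variance = values.var()       # sample variance (n - 1 denominator)
skewness = values.skew()
kurtosis = values.kurt()
```

Note that .mode returns a Series rather than a single number, because a variable can have more than one most frequent value.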
Summary
In this chapter, we learned some easier ways to import and manage our data using pandas—one of the most important libraries for data manipulation and data science. In Chapter 9, we examine another package that is important for data manipulation: NumPy.