As mentioned earlier, we are going to use Python as the main tool for data analysis. Why Python? It has been consistently ranked among the top 10 programming languages and is widely adopted by data science experts for data analysis and data mining. In this book, we assume you have a working knowledge of Python. If you are not yet familiar with the language, it is probably too early to get started with data analysis. In particular, we assume you are familiar with the following Python tools and packages:
Python programming |
Fundamental concepts of variables, strings, and data types; conditionals and functions; sequences, collections, and iterations; working with files; object-oriented programming |
NumPy |
Create arrays with NumPy, copy arrays, and divide arrays; perform different operations on NumPy arrays; understand array selections, advanced indexing, and expanding; work with multi-dimensional arrays; use linear algebraic functions and built-in NumPy functions |
pandas |
Understand and create DataFrame objects; subsetting and indexing data; arithmetic functions and mapping with pandas; managing the index; building style for visual analysis |
Matplotlib |
Loading linear datasets; adjusting axes, grids, labels, titles, and legends; saving plots |
SciPy |
Importing the package; using statistical packages from SciPy; performing descriptive statistics; inference and data analysis |
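A quick way to confirm that the packages listed above are available in your environment is sketched below. This is only a minimal check, not part of the book's own setup steps: the helper name `check_packages` and the `REQUIRED_PACKAGES` list are assumptions made for illustration, and the check only confirms that each package can be located, not that its version is suitable.

```python
from importlib.util import find_spec

# Import names (not necessarily the pip names) of the packages listed above.
# This list is an assumption made for this sketch.
REQUIRED_PACKAGES = ["numpy", "pandas", "matplotlib", "scipy"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be located."""
    # find_spec returns None when a top-level package is not installed,
    # and it does not actually import the package.
    return {name: find_spec(name) is not None for name in names}


for name, available in check_packages(REQUIRED_PACKAGES).items():
    status = "found" if available else "MISSING"
    print(f"{name:<12}{status}")
```

If any package is reported as missing, install it into your environment before continuing.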
Before diving into details about analysis, we need to make sure we are on the same page. Let's go through the checklist and verify that you meet all of the prerequisites to get the best out of this book:
Setting up a virtual environment |
> pip install virtualenv |
Reading/writing to files |
filename = "datamining.txt" |
Error handling |
try: |
Object-oriented concepts |
class Disease: |
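To make the checklist concrete, here is a small sketch that combines three of the items above: writing to and reading from a file, handling an error, and defining a class. It is an illustration under stated assumptions rather than the book's own listing: the attributes `name` and `cases`, the helpers `save_diseases` and `load_diseases`, and the sample records are all hypothetical. Only the file name `datamining.txt` and the class name `Disease` come from the checklist.

```python
import tempfile
from pathlib import Path


class Disease:
    """Hypothetical record type; the attributes are illustrative only."""

    def __init__(self, name, cases):
        self.name = name
        self.cases = cases

    def to_line(self):
        # Serialize as "name,cases" so each record fits on one text line.
        return f"{self.name},{self.cases}"

    @classmethod
    def from_line(cls, line):
        # int() raises ValueError if the count is not a whole number.
        name, cases = line.strip().split(",")
        return cls(name, int(cases))


def save_diseases(path, diseases):
    # The with-statement closes the file even if a write fails.
    with open(path, "w", encoding="utf-8") as handle:
        for disease in diseases:
            handle.write(disease.to_line() + "\n")


def load_diseases(path):
    # Return an empty list instead of crashing when the file is missing.
    try:
        with open(path, encoding="utf-8") as handle:
            return [Disease.from_line(line) for line in handle if line.strip()]
    except FileNotFoundError:
        print(f"{path} not found; returning an empty list")
        return []


# Work in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as workdir:
    filename = Path(workdir) / "datamining.txt"
    # Made-up sample records, used only to exercise the code.
    save_diseases(filename, [Disease("influenza", 120), Disease("measles", 8)])
    loaded = load_diseases(filename)
    missing = load_diseases(Path(workdir) / "no_such_file.txt")

print([(d.name, d.cases) for d in loaded])
print(missing)
```

If each part of this sketch reads naturally to you (the `with` blocks, the `try`/`except` clause, and the class with its `classmethod`), you should be comfortable with the Python used in the rest of the book.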
Next, let's look at the basic operations of EDA using the NumPy library.