A pandas programmer will spend most of their time using two primary objects provided by the pandas framework: Series and DataFrame. The DataFrame object will be the overall workhorse of pandas and the most frequently used, as it provides the means to manipulate tabular and heterogeneous data.
The base data structure of pandas is the Series object, which is designed to operate similarly to a NumPy array but also adds index capabilities. A simple way to create a Series object is to initialize it with a Python list or a NumPy array.
In [2]:
   # create a four item Series
   s = Series([1, 2, 3, 4])
   s

Out [2]:
   0    1
   1    2
   2    3
   3    4
   dtype: int64
This has created a pandas Series from the list. Notice that printing the Series resulted in what appears to be two columns of data. The first column in the output is not a column of the Series object, but the index labels. The second column holds the values of the Series object. Each row represents the index label and the value for that label. This Series was created without specifying an index, so pandas automatically created an integer index starting at zero and increasing by one.
Elements of a Series object can be accessed through the index using the [] operator. This informs the Series which value to return given one or more index values (referred to in pandas as labels). The following code retrieves the items in the Series with labels 1 and 3.
In [3]:
   # return a Series with the rows with labels 1 and 3
   s[[1, 3]]

Out [3]:
   1    2
   3    4
   dtype: int64
A Series object can be created with a user-defined index by specifying the labels for the index using the index parameter.

In [4]:
   # create a series using an explicit index
   s = Series([1, 2, 3, 4],
              index=['a', 'b', 'c', 'd'])
   s

Out [4]:
   a    1
   b    2
   c    3
   d    4
   dtype: int64
Data in the Series object can now be accessed by alphanumeric index labels by passing a list of the desired labels, as the following demonstrates:

In [5]:
   # look up the items in the series with index labels 'a' and 'd'
   s[['a', 'd']]

Out [5]:
   a    1
   d    4
   dtype: int64
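It is still possible to refer to the elements of the Series object by their numerical position. The .iloc property makes the positional intent explicit; recent pandas versions deprecate passing integer positions to [] on a Series whose index labels are not integers.

In [6]:
   # passing a list of integers to .iloc on a Series that has
   # non-integer index labels will look up based upon
   # 0-based position like an array
   s.iloc[[1, 2]]

Out [6]:
   b    2
   c    3
   dtype: int64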
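The distinction between label-based and position-based lookup can be sketched with the .loc and .iloc properties of a Series. This is a minimal illustration that rebuilds the Series from the example above; the variable names by_label and by_position are chosen here for illustration only.

```python
import pandas as pd

# rebuild the Series from the example above, with string labels
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# .loc always looks up by index label
by_label = s.loc[['a', 'd']]

# .iloc always looks up by zero-based position
by_position = s.iloc[[1, 2]]

print(by_label.tolist())           # [1, 4]
print(by_position.index.tolist())  # ['b', 'c']
```

Spelling out .loc or .iloc avoids any ambiguity about whether an integer is meant as a label or as a position.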
The s.index property allows direct access to the index of the Series object.

In [7]:
   # get only the index of the Series
   s.index

Out [7]:
   Index([u'a', u'b', u'c', u'd'], dtype='object')
The index is itself a pandas object. The output shows the values of the index and that the data type of the labels in the index is object.
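Because the index is an object in its own right, it can be inspected and queried directly. The following is a small sketch of a few operations on a pandas Index; the variable names are illustrative, and only a handful of the available operations are shown.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
idx = s.index

# the index is an instance of the pandas Index class
is_index = isinstance(idx, pd.Index)

# the labels can be converted to a plain Python list
labels = idx.tolist()

# membership tests work on the labels
has_b = 'b' in idx

# get_loc returns the zero-based position of a label
position_of_c = idx.get_loc('c')

print(is_index, labels, has_b, position_of_c)
```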
A common usage of a Series in pandas is to represent a time series that associates date/time index labels with values. A date range can be created using the pandas function pd.date_range().
In [8]:
   # create a range of dates (a DatetimeIndex)
   # between the two specified dates (inclusive)
   dates = pd.date_range('2014-07-01', '2014-07-06')
   dates

Out [8]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2014-07-01, ..., 2014-07-06]
   Length: 6, Freq: D, Timezone: None
At this point, the index is not particularly useful without having a value for each label. We can use this index to create a new Series object with values for each of the dates.

In [9]:
   # create a Series with values (representing temperatures)
   # for each date in the index
   temps1 = Series([80, 82, 85, 90, 83, 87],
                   index=dates)
   temps1

Out [9]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, dtype: int64
A pandas Series provides statistical methods similar to those available for NumPy arrays. The following returns the mean of the values in the Series.
In [10]:
   # calculate the mean of the values in the Series
   temps1.mean()

Out [10]:
   84.5
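Other summary statistics follow the same pattern as the mean. The sketch below rebuilds temps1 from the values shown above and applies a few further Series methods, along with NumPy's np.mean() function applied directly to the Series; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

# rebuild the temperature Series from the example above
dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)

lowest = temps1.min()         # smallest value
highest = temps1.max()        # largest value
median = temps1.median()      # middle of the sorted values

# a NumPy function can also be applied to the Series directly
numpy_mean = np.mean(temps1)

print(lowest, highest, median, numpy_mean)
```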
Two Series objects can be combined with an arithmetic operation. The following code calculates the difference in temperature between two Series objects.
In [11]:
   # create a second series of values using the same index
   temps2 = Series([70, 75, 69, 83, 79, 77],
                   index=dates)
   # the following aligns the two by their index values
   # and calculates the difference at those matching labels
   temp_diffs = temps1 - temps2
   temp_diffs

Out [11]:
   2014-07-01    10
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   2014-07-05     4
   2014-07-06    10
   Freq: D, dtype: int64
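The alignment by index label is easier to see when the two Series do not share exactly the same labels. The following sketch uses two small hypothetical Series (a and b, with made-up labels and values) to show that values are matched by label rather than by position, and that labels present in only one Series produce a missing value (NaN).

```python
import pandas as pd

# two hypothetical Series whose labels overlap only partially
# and appear in a different order
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# subtraction matches values by label, not by position
diff = b - a

print(diff['y'])          # 20 - 2
print(diff['z'])          # 10 - 3
print(diff.isna().sum())  # 'w' and 'x' have no partner
```

Because missing values are represented as NaN, the result of such an operation is stored as floating-point data even when both inputs hold integers.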
Time series data such as that shown here can also be accessed via the index or by an offset into the Series object.
In [12]:
   # look up a value by date using the index
   temp_diffs['2014-07-03']

Out [12]:
   16

In [13]:
   # and also possible by integer position as if the
   # series was an array; .iloc makes the positional lookup explicit
   temp_diffs.iloc[2]

Out [13]:
   16
A pandas Series represents a single array of values, with an index label for each value. If you want to have more than one Series of data aligned by a common index, then a pandas DataFrame is used.
The following code creates a DataFrame object with two columns representing the temperatures from the Series objects used earlier.
In [14]:
   # create a DataFrame from the two series objects temps1 and temps2
   # and give them column names
   temps_df = DataFrame(
               {'Missoula': temps1,
                'Philadelphia': temps2})
   temps_df

Out [14]:
               Missoula  Philadelphia
   2014-07-01        80            70
   2014-07-02        82            75
   2014-07-03        85            69
   2014-07-04        90            83
   2014-07-05        83            79
   2014-07-06        87            77
Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column of the DataFrame object:
In [15]:
   # get the column with the name Missoula
   temps_df['Missoula']

Out [15]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, Name: Missoula, dtype: int64
The following code retrieves the Philadelphia column:

In [16]:
   # likewise we can get just the Philadelphia column
   temps_df['Philadelphia']

Out [16]:
   2014-07-01    70
   2014-07-02    75
   2014-07-03    69
   2014-07-04    83
   2014-07-05    79
   2014-07-06    77
   Freq: D, Name: Philadelphia, dtype: int64
The following code returns both columns, but in reverse order.
In [17]:
   # return both columns in a different order
   temps_df[['Philadelphia', 'Missoula']]

Out [17]:
               Philadelphia  Missoula
   2014-07-01            70        80
   2014-07-02            75        82
   2014-07-03            69        85
   2014-07-04            83        90
   2014-07-05            79        83
   2014-07-06            77        87
Very conveniently, if the name of a column does not contain spaces, you can use property-style syntax to access the column in a DataFrame.
In [18]:
   # retrieve the Missoula column through property syntax
   temps_df.Missoula

Out [18]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, Name: Missoula, dtype: int64
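The limits of property-style access can be sketched with a small hypothetical DataFrame. The column names 'New York' and 'count' and all of the values below are invented for illustration: a name containing a space cannot be written as an attribute at all, and a name that matches an existing DataFrame method resolves to the method rather than the column.

```python
import pandas as pd

# hypothetical data; 'New York' contains a space and 'count'
# clashes with the DataFrame.count method
df = pd.DataFrame({'Missoula': [80, 82],
                   'New York': [75, 78],
                   'count': [1, 2]})

# property-style access works for a simple column name
missoula = df.Missoula

# a name with a space must use the [] indexer
new_york = df['New York']

# df.count is the method, not the column; [] reaches the column
count_is_method = callable(df.count)
count_column = df['count']

print(missoula.tolist(), new_york.tolist(), count_is_method)
```

The [] indexer works for every column name, so it is the safer choice whenever column names are not known in advance.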
Arithmetic operations between columns within a DataFrame are identical in operation to those on multiple Series, as each column in a DataFrame is a Series. To demonstrate, the following code calculates the difference between temperatures using property notation.

In [19]:
   # calculate the temperature difference between the two cities
   temps_df.Missoula - temps_df.Philadelphia

Out [19]:
   2014-07-01    10
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   2014-07-05     4
   2014-07-06    10
   Freq: D, dtype: int64
A new column can be added to a DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following code adds a new column to the DataFrame, which contains the difference in temperature on the respective dates.
In [20]:
   # add a column to temps_df that contains the difference in temps
   temps_df['Difference'] = temp_diffs
   temps_df

Out [20]:
               Missoula  Philadelphia  Difference
   2014-07-01        80            70          10
   2014-07-02        82            75           7
   2014-07-03        85            69          16
   2014-07-04        90            83           7
   2014-07-05        83            79           4
   2014-07-06        87            77          10
The names of the columns in a DataFrame are accessible via the DataFrame object's .columns property, which itself is a pandas Index object.

In [21]:
   # get the columns, which is also an Index object
   temps_df.columns

Out [21]:
   Index([u'Missoula', u'Philadelphia', u'Difference'], dtype='object')
The DataFrame (and Series) objects can be sliced to retrieve specific rows. A simple example here shows how to select the second through fourth rows of temperature difference values.
In [22]:
   # slice the temp differences column for the rows at
   # positions 1 through 3 (as though it is an array;
   # the end of the slice is exclusive)
   temps_df.Difference[1:4]

Out [22]:
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   Freq: D, Name: Difference, dtype: int64
Entire rows from a DataFrame can be retrieved using its .loc and .iloc properties. The following code returns a Series object representing the second row of the temps_df DataFrame, selected by the zero-based position of the row using the .iloc property:
In [23]:
   # get the row at array position 1
   temps_df.iloc[1]

Out [23]:
   Missoula        82
   Philadelphia    75
   Difference       7
   Name: 2014-07-02 00:00:00, dtype: int64
This has converted the row into a Series, with the column names of the DataFrame pivoted into the index labels of the resulting Series.

In [24]:
   # the names of the columns have become the index
   # they have been 'pivoted'
   temps_df.iloc[1].index

Out [24]:
   Index([u'Missoula', u'Philadelphia', u'Difference'], dtype='object')
Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by the index label:

In [25]:
   # retrieve row by index label using .loc
   temps_df.loc['2014-07-03']

Out [25]:
   Missoula        85
   Philadelphia    69
   Difference      16
   Name: 2014-07-03 00:00:00, dtype: int64
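The two properties also differ in how they treat slices, which the following sketch illustrates. It rebuilds a two-column version of the temperature DataFrame from the values shown earlier; the names df, by_label, and by_position are illustrative.

```python
import pandas as pd

# rebuild a two-column version of the temperature data
dates = pd.date_range('2014-07-01', '2014-07-06')
df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                   'Philadelphia': [70, 75, 69, 83, 79, 77]},
                  index=dates)

# a label slice with .loc includes both end labels
by_label = df.loc['2014-07-02':'2014-07-04']

# a position slice with .iloc excludes the end position
by_position = df.iloc[1:4]

# both selections cover the rows for July 2 through July 4
print(len(by_label), len(by_position))
print(by_label.equals(by_position))
```

Label slices are inclusive at both ends, whereas position slices follow the usual Python convention of excluding the end position.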
Specific rows in a DataFrame object can be selected using a list of integer positions. The following code selects the values from the Difference column in rows at locations 1, 3, and 5.
In [26]:
   # get the values in the Difference column in rows 1, 3, and 5
   # using 0-based location
   temps_df.iloc[[1, 3, 5]].Difference

Out [26]:
   2014-07-02     7
   2014-07-04     7
   2014-07-06    10
   Name: Difference, dtype: int64
Rows of a DataFrame can be selected based upon a logical expression applied to the data in each row. The following code evaluates whether each value in the Missoula temperature column is greater than 82 degrees:

In [27]:
   # which values in the Missoula column are > 82?
   temps_df.Missoula > 82

Out [27]:
   2014-07-01    False
   2014-07-02    False
   2014-07-03     True
   2014-07-04     True
   2014-07-05     True
   2014-07-06     True
   Freq: D, Name: Missoula, dtype: bool
When using the result of an expression as the parameter to the [] operator of a DataFrame, the rows where the expression evaluated to True will be returned.

In [28]:
   # return the rows where the temps for Missoula > 82
   temps_df[temps_df.Missoula > 82]

Out [28]:
               Missoula  Philadelphia  Difference
   2014-07-03        85            69          16
   2014-07-04        90            83           7
   2014-07-05        83            79           4
   2014-07-06        87            77          10
This technique is referred to in pandas terminology as Boolean selection, and it forms the basis of selecting data based upon its values.
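Boolean selection extends naturally to more than one condition. As a minimal sketch, the following rebuilds the two temperature columns from the values shown earlier and combines two conditions with the & operator; the threshold of 80 for Philadelphia and the variable names are chosen here for illustration.

```python
import pandas as pd

# rebuild the two temperature columns from the earlier examples
dates = pd.date_range('2014-07-01', '2014-07-06')
df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                   'Philadelphia': [70, 75, 69, 83, 79, 77]},
                  index=dates)

# each condition is wrapped in parentheses because & binds
# more tightly than the comparison operators
mask = (df.Missoula > 82) & (df.Philadelphia < 80)

# keep only the rows where both conditions are True
selected = df[mask]

print(selected['Missoula'].tolist())  # [85, 83, 87]
```

Use | in place of & to keep rows where either condition holds, and ~ to negate a condition.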