Data types and data conversion

As you may have noticed, when we print a Series object, its data type is shown on the last line. Alternatively, you can read the .dtype attribute of a Series, or .dtypes for an entire dataframe (which returns a Series object). These data types are defined in C (they come from NumPy), not in Python. Most of them closely match their Python counterparts, for example, integers, floats, and Booleans. There are, however, a few caveats to be aware of:
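As a minimal sketch of the two ways to inspect data types, consider the snippet below. The dataframe and its column names (count, price, in_stock) are hypothetical toy data, not the dataset used elsewhere in this chapter.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [0.5, 1.25, 2.0],
    "in_stock": [True, False, True],
})

single_dtype = df["count"].dtype  # data type of one Series
all_dtypes = df.dtypes            # a Series mapping column name to dtype

print(single_dtype)
print(all_dtypes)
```

Note that .dtypes is itself a Series indexed by column name, so a single entry can be looked up with all_dtypes["price"].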

  • First, there is no dedicated data type for strings. As you may have noticed in the last code block, all strings are stored as object, that is, an arbitrary Python object. This type is the last resort: it accommodates any Python value but offers no computational benefits.
  • Next, None. A missing value is stored as NaN (Not a Number, numpy.nan), which is a float value rather than None. Most of the time this does not matter, but there are two cases where it does. First, since NaN isn't None (and NaN does not even equal itself), neither an equality check (df['col'] == None) nor an is statement will find missing values. The only reliable way is to use pd.isnull() and pd.notnull() (or their NumPy analogs); both work on scalars and on collections of values. Second, because NaN is a float and all values in a Series or NumPy array must share one type, any column of integers is converted to floats as soon as a NaN is added.
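The two NaN caveats above can be sketched as follows. The Series names (ints, with_gap) are made up for this illustration, and the snippet assumes the default NumPy-backed data types.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])         # stays an integer column
with_gap = pd.Series([1, 2, None])  # None is stored as NaN, forcing floats

# Comparing with None is element-wise and never matches the missing value
eq_none = with_gap == None  # noqa: E711

# NaN is not even equal to itself, so == cannot be used to find it
nan_equals_itself = np.nan == np.nan

# pd.isnull / pd.notnull work on collections and on scalars
mask = pd.isnull(with_gap)
present = pd.notnull(with_gap)
scalar_check = pd.isnull(np.nan)

print(ints.dtype, with_gap.dtype)
print(mask.tolist())
```

The resulting mask can then be used for filtering, for example with_gap[present] to keep only the non-missing values.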

Finally, some data types, specifically strings and datetimes, have dedicated sets of methods in pandas, accessible via column.str and column.dt, respectively. These include split, replace, slicing, and case changes for strings, and retrieving specific components (minutes, hours, days, months, years, and weekdays) for datetimes (we will get to that later in this chapter). Strings can also be concatenated with +, just as in vanilla Python. Datetimes can be subtracted, resulting in timedelta objects.
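Here is a short, hedged illustration of the .str and .dt accessors. The names and timestamps are invented sample values; only the accessor methods themselves come from pandas.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
names = pd.Series(["Ada Lovelace", "Alan Turing"])

first = names.str.split(" ").str[0]                      # split, then take the first piece
upper = names.str.upper()                                # case change
replaced = names.str.replace("Alan", "A.", regex=False)  # literal replacement
greeting = "Dr. " + names                                # concatenation, as in plain Python

stamps = pd.Series(pd.to_datetime(["2021-03-01 10:30", "2021-03-05 18:45"]))

hours = stamps.dt.hour        # hour component of each timestamp
weekdays = stamps.dt.weekday  # Monday is 0, Sunday is 6
delta = stamps[1] - stamps[0] # subtracting datetimes gives a Timedelta

print(first.tolist(), hours.tolist(), weekdays.tolist(), delta)
```

Each accessor call is vectorized: it applies to every value in the column and returns a new Series of the same length.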

The situation with data types may change in the near future for two reasons. First, NumPy has defined a standard API for arrays and vectorized functions, allowing other parties to add data types to the ecosystem. Second, pandas 2.0 will be published soon, based on the Arrow dataframe representation instead of NumPy. Arrow promises to support missing values in integer columns.