How it works...

It is typical to call read_html multiple times before arriving at the table (or tables) that you desire. There are two primary parameters at your disposal to narrow the search, match and attrs. The string (or regular expression) provided to match is searched for in the text of each table, and only tables containing that text are returned. This is text that shows up on the web page itself. The attrs parameter, on the other hand, searches for HTML table attributes found directly inside the opening <table> tag. To see more of the table attributes, visit this page from W3 Schools (http://bit.ly/2hzUzdD).
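As a rough sketch of how these two parameters narrow the search, the snippet below uses a hypothetical URL and illustrative match and attrs values; the real values come from inspecting the page you are actually scraping.

    import pandas as pd

    # Hypothetical URL standing in for the page being scraped
    url = 'https://www.example.com/approval-ratings'

    # match: keep only tables whose text contains this string (or regex)
    dfs = pd.read_html(url, match='Approving')

    # attrs: keep only tables whose <table> tag carries these HTML attributes
    dfs = pd.read_html(url, attrs={'align': 'center'})

    print(f'{len(dfs)} matching table(s) found')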

Once we find our table in step 8, we can still take advantage of some other parameters to simplify things. HTML tables don't typically translate directly to nice DataFrames. There are often missing column names, extra rows, and misaligned data. In this recipe, skiprows is passed a list of row numbers to skip over when reading the file. They correspond to the rows of missing values in the DataFrame output from step 8. The header parameter is also used to specify the location of the column names. Notice that header is equal to zero, which may seem wrong at first. Whenever the header parameter is used in conjunction with skiprows, the rows are skipped first, which gives each remaining row a new integer label. The correct column names are in row 4, but as we skipped rows 0 through 3, their new integer label is 0.
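A minimal sketch of combining the two parameters, continuing from the hypothetical URL above; the exact row numbers to skip depend on the table you are reading.

    # Re-read the page, skipping the junk rows and naming the header row.
    # Rows 0 through 3 are skipped here to mirror the recipe's pattern.
    dfs = pd.read_html(url, match='Approving',
                       skiprows=[0, 1, 2, 3], header=0)
    trump = dfs[0]
    trump.head()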

In step 11, the ffill method fills missing values vertically, propagating the last non-missing value downward. This method is just a shortcut for fillna(method='ffill').
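A tiny illustration of the equivalence, using made-up data:

    import numpy as np
    import pandas as pd

    # ffill propagates the last non-missing value downward
    df = pd.DataFrame({'President': ['Donald J. Trump', np.nan, np.nan]})
    df.ffill()          # same result as df.fillna(method='ffill')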

Step 13 builds a function composed of all the previous steps to automatically get approval ratings for any President, provided you have the order number. There are a few differences in the function. Instead of applying the ffill method to the entire DataFrame, we apply it only to the President column. In Trump's DataFrame, the other columns had no missing data, but this does not guarantee that all the scraped tables will be free of missing data in their other columns. The last line of the function sorts the dates from oldest to newest, a more natural order for data analysis. This changes the order of the index too, so we discard it with reset_index to have it begin from zero again.
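The following is a sketch of such a function under the same assumptions as the earlier snippets; the URL template, match string, skipped rows, and column names are placeholders to be adapted to the real page.

    def get_pres_appr(pres_num):
        """Scrape one President's approval table by order number (sketch)."""
        # Hypothetical URL template keyed on the President's order number
        url = f'https://www.example.com/approval.php?pres={pres_num}'
        df_list = pd.read_html(url, match='Start Date', header=0,
                               skiprows=[0, 1, 2, 3],
                               parse_dates=['Start Date', 'End Date'])
        pres = df_list[0].copy()
        # Forward-fill only the President column; the others may be complete
        pres['President'] = pres['President'].ffill()
        # Sort oldest to newest and rebuild the index from zero
        return pres.sort_values('End Date').reset_index(drop=True)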

Step 16 shows a common pandas idiom for collecting multiple, similarly indexed DataFrames into a list before combining them with the concat function. After concatenation into a single DataFrame, we should visually inspect it to ensure its accuracy. One way to do this is to glance at the first few rows of each President's section by grouping the data and then using the head method on each group.
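Sketched with the hypothetical function above, and an illustrative range of order numbers for the last five Presidents:

    # Collect one DataFrame per President, then concatenate them
    pres_list = [get_pres_appr(num) for num in range(41, 46)]
    pres_41_45 = pd.concat(pres_list, ignore_index=True)

    # Spot-check: the first few rows of each President's section
    pres_41_45.groupby('President').head(3)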

The summary statistics in step 18 are interesting, as each successive President has had a lower median approval rating than the last. Naively extrapolating that trend would predict a negative approval rating within the next several Presidents.

The plotting code in step 19 is fairly complex. You might be wondering why we need to iterate through a groupby object in the first place. In the DataFrame's current structure, pandas has no way to plot a separate line for each group based on the values of a single column. However, step 23 shows how to set up the DataFrame so that pandas can plot each President's data directly, without a loop like this.

To understand the plotting code in step 19, you must first be aware that a groupby object is iterable and, when iterated through, yields a tuple containing the name of the current group (here, just the President's name) and the sub-DataFrame for that group. This groupby object is zipped together with values controlling the color and linestyle of the plot. We import the colormap module, cm, from matplotlib, which contains dozens of different colormaps. Passing a float between 0 and 1 chooses a specific color from that colormap, which we pass to the plot method with the color parameter. It is also important to note that we had to create the figure, fig, along with a plotting surface, ax, to ensure that each approval line was placed on the same plot. At each iteration of the loop, we use the same plotting surface with the identically named parameter, ax.
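A sketch of that loop, assuming the pres_41_45 DataFrame from the earlier snippets has 'End Date', 'Approving', and 'President' columns; the styles and colormap values are illustrative.

    import matplotlib.pyplot as plt
    from matplotlib import cm

    fig, ax = plt.subplots(figsize=(16, 6))
    styles = ['-.', '-', ':', '-', ':']
    colors = [.9, .3, .7, .3, .9]
    groups = pres_41_45.groupby('President', sort=False)

    for style, color, (pres, df) in zip(styles, colors, groups):
        # Every President's line is drawn on the same Axes, ax
        df.plot('End Date', 'Approving', ax=ax, label=pres,
                style=style, color=cm.Greys(color),
                title='Presidential Approval Rating')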

To make a better comparison between Presidents, we create a new column equal to the number of days in office. We subtract the first date from every other date within each President's group. When two datetime64 columns are subtracted, the result is a timedelta64 object, which represents a length of time, in this case days. If we left the column at nanosecond precision, the x-axis would similarly display far too much precision, so we use the special dt accessor to return the number of days.
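One way to write that calculation, keeping the column names assumed in the earlier sketches:

    # Subtract each group's first date from every date in that group
    first_day = pres_41_45.groupby('President')['End Date'].transform('min')
    delta = pres_41_45['End Date'] - first_day        # timedelta64[ns]
    pres_41_45['Days in Office'] = delta.dt.days      # plain integer days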

Step 23 is a crucial one. We structure the data so that each President has a unique column for their approval rating, and pandas then draws a separate line for each column. Finally, in step 24, we use the .loc indexer to simultaneously select the first 250 days (rows) and only the columns for Trump and Obama. The ffill method handles the rare instances where one of the Presidents has a missing value for a particular day. In Python, it is possible to pass a dictionary containing parameter names and their values to a function by preceding it with ** in a process called dictionary unpacking.
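A sketch of both steps under the same assumptions as before; the President names and the plot keyword values are illustrative placeholders.

    # Reshape so each President gets their own approval column
    pres_pivot = pres_41_45.pivot(index='Days in Office',
                                  columns='President',
                                  values='Approving')

    # Compare the first 250 days of two Presidents, unpacking the keyword
    # arguments from a dictionary with **
    plot_kwargs = dict(figsize=(16, 6), color=['black', 'gray'],
                       style=['-', '--'], title='Approval Rating')
    (pres_pivot.loc[:250, ['Donald J. Trump', 'Barack Obama']]
               .ffill()
               .plot(**plot_kwargs))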
